Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

2021-12-08 Thread Peter Geoghegan
On Tue, Nov 30, 2021 at 5:09 PM Peter Geoghegan  wrote:
> Attached draft patch attempts to explain things in this area within
> the nbtree README. There is a much shorter comment about it within
> vacuumlazy.c. I am concerned about GiST index-only scans themselves,
> of course, but I discovered this issue when thinking carefully about
> the concurrency rules for VACUUM -- I think it's valuable to formalize
> and justify the general rules that index access methods must follow.

I pushed a commit that described how this works for nbtree, in the README file.

I think that there might be an even more subtle race condition in
nbtree itself, though, during recovery. We no longer do a "pin scan"
during recovery these days (see commits 9f83468b, 3e4b7d87, and
687f2cd7 for full information). I think that it might be necessary to
do that, just for the benefit of index-only scans -- if it's necessary
during original execution, then why not during recovery?

The work to remove "pin scans" was justified by pointing out that we
no longer use various kinds of snapshots during recovery, but it said
nothing about index-only scans, which need the TID recycling interlock
(i.e. need to hold onto a leaf page while accessing the heap in sync)
even with an MVCC snapshot. It's easy to imagine how it might have
been missed: nobody ever documented the general issue with index-only
scans, until now. Commit 2ed5b87f recognized that index-only scans were
unsafe for the optimization that it added (to avoid blocking VACUUM),
but never explained why they were unsafe.

Going back to doing pin scans during recovery seems deeply
unappealing, especially to fix a totally narrow race condition.

-- 
Peter Geoghegan




Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

2021-11-30 Thread Peter Geoghegan
On Tue, Nov 30, 2021 at 5:09 PM Peter Geoghegan  wrote:
> I believe that there have been several historic reasons why we need a
> cleanup lock during nbtree VACUUM, and that there is only one
> remaining reason for it today. So the history is unusually complicated.

Minor correction: we actually also have to worry about plain index
scans that don't use an MVCC snapshot, which are possible within
nbtree. That's quite likely when using logical replication.
See the patch for more.

As with the index-only scan case, a plain nbtree index scan that uses a
non-MVCC snapshot cannot rely on heap access within the index scan node
to protect it -- it won't reliably notice that newer heap tuples (which
are really the result of concurrent TID recycling) are invisible to it,
precisely because there is no MVCC snapshot to filter them out. The
only difference in the index-only scan scenario is that we use the
visibility map (not the heap) -- which is racy in a way that makes our
MVCC snapshot (index-only scans always have an MVCC snapshot) an
ineffective protection.

In summary, to be safe against confusion from concurrent TID recycling
during index/index-only scans, we can do either of the following
things:

1. Hold a pin on our leaf page while accessing the heap -- that'll
definitely conflict with the cleanup lock that nbtree VACUUM will
inevitably try to acquire on our leaf page.

OR:

2. Hold an MVCC snapshot, AND do an actual heap page access during the
plain index scan -- do both together.

With approach 2, our plain index scan must determine visibility using
real XIDs from the heap tuple headers (tested against something like a
dirty snapshot), rather than trusting a visibility map bit. That is
also necessary because the VM bit might be stale or ambiguous in a way
that's clearly not possible when looking at full heap tuple headers
with XIDs -- concurrent recycling becomes safe once we know that we'll
reliably notice it and so won't give wrong answers.
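
To make approach 2 a bit more concrete, here is a rough sketch of the
pattern in C. It is not the actual executor or nbtree code:
plain_index_getnext_tid() and heap_fetch_visible() are invented names
standing in for the real index AM and table AM entry points, while
scan->heapRelation is real. The point is simply that every TID copied
out of the leaf page gets re-checked against the snapshot by visiting
the heap, so a concurrently recycled TID is filtered out rather than
returned.

static bool
getnext_with_mvcc_snapshot(IndexScanDesc scan, Snapshot snapshot,
                           TupleTableSlot *slot)
{
    ItemPointerData tid;

    /* TIDs were already copied out of the leaf page; the pin may be gone */
    while (plain_index_getnext_tid(scan, &tid))
    {
        /*
         * Always visit the heap.  If VACUUM recycled this TID after we
         * copied it, whatever tuple now lives at that TID has an xmin
         * newer than our MVCC snapshot, so the visibility test rejects
         * it and we just move on to the next TID.
         */
        if (heap_fetch_visible(scan->heapRelation, &tid, snapshot, slot))
            return true;
    }
    return false;
}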

Does that make sense?

-- 
Peter Geoghegan




Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

2021-11-30 Thread Peter Geoghegan
On Fri, Nov 5, 2021 at 3:26 AM Andrey Borodin  wrote:
> > On Nov 4, 2021, at 20:58, Peter Geoghegan  wrote:
> > That's a pretty unlikely scenario. And even
> > if it happened it would only happen once (until the next time we get
> > unlucky). What are the chances of somebody noticing a more or less
> > once-off, slightly wrong answer?
>
> I'd say next to impossible, yet not impossible. Or, perhaps, I do not see 
> protection from this.

I think that that's probably all correct -- I would certainly make the
same guess. It's very unlikely to happen, and when it does happen it
happens only once.

> Moreover there's a "microvacuum". It kills tuples with BUFFER_LOCK_SHARE. 
> AFAIU it should take cleanup lock on buffer too?

No, because there is no heap vacuuming involved (that only happens
inside vacuumlazy.c). The work that VACUUM does inside
lazy_vacuum_heap_rel() is part of the problem here -- we need an
interlock between that work and index-only scans. Making LP_DEAD
items in heap pages LP_UNUSED is only ever possible during a VACUUM
operation (I'm sure you know why). AFAICT there would be no bug at all
without that detail.

I believe that there have been several historical reasons why we need a
cleanup lock during nbtree VACUUM, and that only one of them still
applies today -- so the history is unusually complicated. But AFAICT it
has always been some kind of "interlock with heapam VACUUM" issue:
TID recycling that our MVCC snapshot gives us no protection against. I
would say that that's the "real problem" here, when I go back to first
principles.

Attached draft patch attempts to explain things in this area within
the nbtree README. There is a much shorter comment about it within
vacuumlazy.c. I am concerned about GiST index-only scans themselves,
of course, but I discovered this issue when thinking carefully about
the concurrency rules for VACUUM -- I think it's valuable to formalize
and justify the general rules that index access methods must follow.

We can talk about this some more in NYC. See you soon!
--
Peter Geoghegan
From ea6612300e010f1f2b02119b5a0de95e46d1157d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Wed, 3 Nov 2021 14:38:15 -0700
Subject: [PATCH v1] nbtree README: Improve VACUUM interlock section.

Also document a related issue for index-only scans in vacuumlazy.c.

Author: Peter Geoghegan 
Discussion: https://postgr.es/m/CAH2-Wz=PqOziyRSrnN5jAtfXWXY7-BJcHz9S355LH8Dt=5qxWQ@mail.gmail.com
---
 src/backend/access/heap/vacuumlazy.c |  10 ++
 src/backend/access/nbtree/README | 145 ---
 2 files changed, 75 insertions(+), 80 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 282b44f87..8bfe196bf 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2384,6 +2384,16 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
  * LP_DEAD items on the page that were determined to be LP_DEAD items back
  * when the same page was visited by lazy_scan_prune() (i.e. those whose TID
  * was recorded in the dead_items array at the time).
+ *
+ * We can opportunistically set the visibility map bit for the page here,
+ * which is valuable when lazy_scan_prune couldn't earlier on, owing only to
+ * the fact that there were LP_DEAD items that we'll now mark as unused.  This
+ * is why index AMs that support index-only scans have to hold a pin on an
+ * index page as an interlock against VACUUM while accessing the visibility
+ * map (which is really just a dense summary of visibility information in the
+ * heap).  If they didn't do this then there would be rare race conditions
+ * where a heap TID that is actually dead appears alive to an unlucky
+ * index-only scan.
  */
 static int
 lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 2a7332d07..c6f04d856 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -89,25 +89,28 @@ Page read locks are held only for as long as a scan is examining a page.
 To minimize lock/unlock traffic, an index scan always searches a leaf page
 to identify all the matching items at once, copying their heap tuple IDs
 into backend-local storage.  The heap tuple IDs are then processed while
-not holding any page lock within the index.  We do continue to hold a pin
-on the leaf page in some circumstances, to protect against concurrent
-deletions (see below).  In this state the scan is effectively stopped
-"between" pages, either before or after the page it has pinned.  This is
-safe in the presence of concurrent insertions and even page splits, because
-items are never moved across pre-existing page boundaries --- so the scan
-cannot miss any items it should have seen, nor accidentally return the same
-item twice.  The scan must remember the page's right-link at the time it
-was scanned, since that is the page to 

Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

2021-11-05 Thread Andrey Borodin



> On Nov 4, 2021, at 20:58, Peter Geoghegan  wrote:
> That's a pretty unlikely scenario. And even
> if it happened it would only happen once (until the next time we get
> unlucky). What are the chances of somebody noticing a more or less
> once-off, slightly wrong answer?

I'd say next to impossible, yet not impossible. Or, perhaps, I do not see 
protection from this.

Moreover there's a "microvacuum". It kills tuples with BUFFER_LOCK_SHARE. AFAIU 
it should take cleanup lock on buffer too?

Best regards, Andrey Borodin.



Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

2021-11-04 Thread Peter Geoghegan
On Thu, Nov 4, 2021 at 8:52 AM Andrey Borodin  wrote:
> Let's enumerate steps how things can go wrong.
>
> Backend1: Index-Only scan returns tid and xs_hitup with index_tuple1 on 
> index_page1 pointing to heap_tuple1 on page1
>
> Backend2: Remove index_tuple1 and heap_tuple1
>
> Backend3: Mark page1 all-visible
> Backend1: Thinks that page1 is all-visible and shows index_tuple1 as visible
>
> To avoid this Backend1 must hold pin on index_page1 until it's done with 
> checking visibility, and Backend2 must do LockBufferForCleanup(index_page1). 
> Do I get things right?

Almost. Backend3 is actually Backend2 here (there is no 3) -- it runs
VACUUM throughout.

Note that it's not particularly likely that Backend2/VACUUM will "win"
this race, because it typically has to do much more work than
Backend1. It has to actually remove the index tuples from the leaf
page, then any other index work (for this and other indexes). Then it
has to arrive back in vacuumlazy.c to set the VM bit in
lazy_vacuum_heap_page(). That's a pretty unlikely scenario. And even
if it happened it would only happen once (until the next time we get
unlucky). What are the chances of somebody noticing a more or less
once-off, slightly wrong answer?

-- 
Peter Geoghegan




Re: Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

2021-11-04 Thread Andrey Borodin
> 04.11.2021, 04:33, "Peter Geoghegan" :
> But what about index-only scans, which GiST also supports? I think
> that the rules are different there, even though index-only scans use
> an MVCC snapshot.

Let's enumerate steps how things can go wrong.

Backend1: Index-Only scan returns tid and xs_hitup with index_tuple1 on
index_page1 pointing to heap_tuple1 on page1

Backend2: Remove index_tuple1 and heap_tuple1

Backend3: Mark page1 all-visible
Backend1: Thinks that page1 is all-visible and shows index_tuple1 as visible

To avoid this Backend1 must hold pin on index_page1 until it's done with
checking visibility, and Backend2 must do LockBufferForCleanup(index_page1).
Do I get things right?

Best regards, Andrey Borodin.




Why doesn't GiST VACUUM require a super-exclusive lock, like nbtree VACUUM?

2021-11-03 Thread Peter Geoghegan
The code in gistvacuum.c is closely based on similar code in nbtree.c,
except that it only acquires an exclusive lock -- not a
super-exclusive lock. I suspect that that's because it seemed
unnecessary; nbtree plain index scans have their own special reasons
for this, that don't apply to GiST. Namely: nbtree plain index scans
that don't use an MVCC snapshot clearly need some other interlock to
protect against concurrent recycling of pointed-to-by-leaf-page TIDs.
And so as a general rule nbtree VACUUM needs a full
super-exclusive/cleanup lock, just in case there is a plain index scan
that uses some other kind of snapshot (logical replication, say).

To say the same thing another way: nbtree follows "the third rule"
described by "62.4. Index Locking Considerations" in the docs [1], but
GiST does not. The idea that GiST's behavior is okay here does seem
consistent with what the same docs go on to say about it: "When using
an MVCC-compliant snapshot, there is no problem because the new
occupant of the slot is certain to be too new to pass the snapshot
test".

But what about index-only scans, which GiST also supports? I think
that the rules are different there, even though index-only scans use
an MVCC snapshot.

The (admittedly undocumented) reason why we can never drop the leaf
page pin for an index-only scan in nbtree (but can do so for plain
index scans) also relates to heap interlocking -- but with a twist.
Here's the twist: the second heap pass by VACUUM can set visibility
map bits independently of the first (once LP_DEAD items from the first
pass over the heap are set to LP_UNUSED, which renders the page
all-visible) -- this all happens at the end of
lazy_vacuum_heap_page(). That's why index-only scans can't just assume
that VACUUM won't have deleted the TID from the leaf page they're
scanning immediately after they're done reading it. VACUUM could even
manage to set the visibility map bit for a relevant heap page inside
lazy_vacuum_heap_page(), before the index-only scan can read the
visibility map. If that is allowed to happen, the index-only scan would
give wrong answers whenever one of the TID references it holds in local
memory happens to be marked LP_UNUSED inside
lazy_vacuum_heap_page(). IOW, it looks like we run the risk of a
concurrently recycled dead-to-everybody TID becoming visible during
GiST index-only scans, just because we have no interlock.
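
To make the nbtree side of this concrete, here is a rough sketch of the
ordering that keeps index-only scans safe there. It is not the real
executor code: index_only_getnext_tid() is an invented name, though
VM_ALL_VISIBLE() and the hold-the-leaf-pin behavior are real.

ItemPointer tid;
Buffer      vmbuffer = InvalidBuffer;

/*
 * Each TID is consumed while the index AM still holds a pin on the leaf
 * page it came from, so VACUUM cannot have gotten its cleanup lock yet,
 * and so cannot have reached lazy_vacuum_heap_page() for these TIDs.
 */
while ((tid = index_only_getnext_tid(scan)) != NULL)
{
    if (VM_ALL_VISIBLE(heapRel, ItemPointerGetBlockNumber(tid), &vmbuffer))
    {
        /* all-visible: use the index tuple without visiting the heap */
    }
    else
    {
        /* fetch the heap tuple and test it against our MVCC snapshot,
         * which will reject a concurrently recycled TID */
    }
}
/* Only when the scan moves off this leaf page is the pin dropped,
 * unblocking any VACUUM waiting for a cleanup lock. */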

In summary:

IIUC this is only safe for nbtree because 1.) It acquires a
super-exclusive lock when vacuuming leaf pages, and 2.) Index-only
scans never drop their pin on the leaf page when accessing the
visibility map "in-sync" with the scan (of course we hope not to
access the heap proper at all for index-only scans). These precautions
are both necessary to make the race condition I describe impossible,
because they ensure that VACUUM cannot reach lazy_vacuum_heap_page()
until after our index-only scan reads the visibility map (and then has
to read the heap for at least that one dead-to-all TID, discovering
that the TID is dead to its snapshot). Why wouldn't GiST need to take
the same precautions, though?

[1] https://www.postgresql.org/docs/devel/index-locking.html
--
Peter Geoghegan




Re: Questions/Observations related to Gist vacuum

2020-01-12 Thread Dilip Kumar
On Thu, Jan 9, 2020 at 4:41 PM Mahendra Singh Thalor  wrote:
>
> On Mon, 9 Dec 2019 at 14:37, Amit Kapila  wrote:
> >
> > On Mon, Dec 9, 2019 at 2:27 PM Amit Kapila  wrote:
> > >
> > > I have modified the patch for the above points and additionally ran
> > > pgindent.  Let me know what you think about the attached patch?
> > >
> >
> > A new version with a slightly modified commit message.
>
> I reviewed v4 patch and below is the one review comment:
>
> + * These are used to memorize all internal and empty leaf pages. They are
> + * used for deleting all the empty pages.
>   */
> After a dot, there should be 2 spaces. Earlier, there were 2 spaces.
>
> Other than that patch looks fine to me.
>
Thanks for the comment. Amit has already taken care of this before pushing it.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Questions/Observations related to Gist vacuum

2020-01-09 Thread Mahendra Singh Thalor
On Mon, 9 Dec 2019 at 14:37, Amit Kapila  wrote:
>
> On Mon, Dec 9, 2019 at 2:27 PM Amit Kapila  wrote:
> >
> > I have modified the patch for the above points and additionally ran
> > pgindent.  Let me know what you think about the attached patch?
> >
>
> A new version with a slightly modified commit message.

I reviewed v4 patch and below is the one review comment:

+ * These are used to memorize all internal and empty leaf pages. They are
+ * used for deleting all the empty pages.
  */
After a dot, there should be 2 spaces. Earlier, there were 2 spaces.

Other than that patch looks fine to me.

-- 
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com


Re: Questions/Observations related to Gist vacuum

2019-12-09 Thread Dilip Kumar
On Mon, Dec 9, 2019 at 2:37 PM Amit Kapila  wrote:
>
> On Mon, Dec 9, 2019 at 2:27 PM Amit Kapila  wrote:
> >
> > I have modified the patch for the above points and additionally ran
> > pgindent.  Let me know what you think about the attached patch?
> >
>
> A new version with a slightly modified commit message.

Your changes look fine to me.  Thanks!

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Questions/Observations related to Gist vacuum

2019-12-09 Thread Amit Kapila
On Mon, Dec 9, 2019 at 2:27 PM Amit Kapila  wrote:
>
> I have modified the patch for the above points and additionally ran
> pgindent.  Let me know what you think about the attached patch?
>

A new version with a slightly modified commit message.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


v4-0001-Delete-empty-pages-in-each-pass-during-GIST-VACUUM.patch
Description: Binary data


Re: Questions/Observations related to Gist vacuum

2019-12-09 Thread Amit Kapila
On Fri, Oct 25, 2019 at 9:22 PM Masahiko Sawada  wrote:
>
> On Wed, Oct 23, 2019 at 8:14 PM Amit Kapila  wrote:
> >
> > On Tue, Oct 22, 2019 at 2:17 PM Dilip Kumar  wrote:
> > >
> > > On Tue, Oct 22, 2019 at 10:53 AM Amit Kapila  
> > > wrote:
> > >
> > > I have modified as we discussed.  Please take a look.
> > >
> >
> > Thanks, I haven't reviewed this yet, but it seems to be on the right
> > lines.  Sawada-San, can you please prepare the next version of the
> > parallel vacuum patch on top of this patch and enable parallel vacuum
> > for Gist indexes?
>
> Yeah I've sent the latest patch set that is built on top of this
> patch[1]. BTW I looked at this patch briefly but it looks good to me.
>

Today, I have looked at this patch and found a few things that need to
be changed:

1.
 static void gistvacuum_delete_empty_pages(IndexVacuumInfo *info,
-   GistBulkDeleteResult *stats);
-static bool gistdeletepage(IndexVacuumInfo *info, GistBulkDeleteResult *stats,
+   GistVacState *stats);

I think stats is not a good name for a GistVacState parameter.  How about vstate?

2.
+ /* we don't need the internal and empty page sets anymore */
+ MemoryContextDelete(vstate.page_set_context);

After deleting the memory context, we can reset this and the other
related variables, as the code did before the patch.
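
Something like this, perhaps (a sketch based on the v3 patch's naming,
with vstate being the GistVacState variable from the patch):

/* we don't need the internal and empty page sets anymore */
MemoryContextDelete(vstate.page_set_context);
vstate.page_set_context = NULL;
vstate.internal_page_set = NULL;
vstate.empty_leaf_set = NULL;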

3.  There are a couple of places in code (like comments, README) that
mentions the deletion of empty pages in the second stage of the
vacuum.  We should change all such places.

I have modified the patch for the above points and additionally ran
pgindent.  Let me know what you think about the attached patch?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


v3-0001-Delete-empty-pages-in-each-pass-during-GIST-VACUUM.patch
Description: Binary data


Re: Questions/Observations related to Gist vacuum

2019-10-25 Thread Masahiko Sawada
On Wed, Oct 23, 2019 at 8:14 PM Amit Kapila  wrote:
>
> On Tue, Oct 22, 2019 at 2:17 PM Dilip Kumar  wrote:
> >
> > On Tue, Oct 22, 2019 at 10:53 AM Amit Kapila  
> > wrote:
> > >
> > > > Basically, only IndexBulkDeleteResult is now shared across the stage
> > > > so we can move all members to GistVacState and completely get rid of
> > > > GistBulkDeleteResult?
> > > >
> > >
> > > Yes, something like that would be better.  Let's try and see how it comes 
> > > out.
> > I have modified as we discussed.  Please take a look.
> >
>
> Thanks, I haven't reviewed this yet, but it seems to be on the right
> lines.  Sawada-San, can you please prepare the next version of the
> parallel vacuum patch on top of this patch and enable parallel vacuum
> for Gist indexes?

Yeah I've sent the latest patch set that is built on top of this
patch[1]. BTW I looked at this patch briefly but it looks good to me.

[1] 
https://www.postgresql.org/message-id/CAD21AoBMo9dr_QmhT%3DdKh7fmiq7tpx%2ByLHR8nw9i5NZ-SgtaVg%40mail.gmail.com

Regards,

--
Masahiko Sawada




Re: Questions/Observations related to Gist vacuum

2019-10-22 Thread Dilip Kumar
On Tue, Oct 22, 2019 at 10:53 AM Amit Kapila  wrote:
>
> On Tue, Oct 22, 2019 at 10:50 AM Dilip Kumar  wrote:
> >
> > On Tue, Oct 22, 2019 at 9:10 AM Amit Kapila  wrote:
> > >
> > > On Fri, Oct 18, 2019 at 4:51 PM Dilip Kumar  wrote:
> > > >
> > > > I have prepared a first version of the patch.  Currently, I am
> > > > performing an empty page deletion for all the cases.
> > > >
> > >
> > > Few comments:
> > > --
> > > 1.
> > > -/*
> > > - * State kept across vacuum stages.
> > > - */
> > >  typedef struct
> > >  {
> > > - IndexBulkDeleteResult stats; /* must be first */
> > > + IndexBulkDeleteResult *stats; /* kept across vacuum stages. */
> > >
> > >   /*
> > > - * These are used to memorize all internal and empty leaf pages in the 
> > > 1st
> > > - * vacuum stage.  They are used in the 2nd stage, to delete all the empty
> > > - * pages.
> > > + * These are used to memorize all internal and empty leaf pages. They are
> > > + * used for deleting all the empty pages.
> > >   */
> > >   IntegerSet *internal_page_set;
> > >   IntegerSet *empty_leaf_set;
> > >
> > > Now, if we don't want to share the remaining stats across
> > > gistbulkdelete and gistvacuumcleanup, isn't it better to keep the
> > > information of internal and empty leaf pages as part of GistVacState?
> >
> > Basically, only IndexBulkDeleteResult is now shared across the stage
> > so we can move all members to GistVacState and completely get rid of
> > GistBulkDeleteResult?
> >
>
> Yes, something like that would be better.  Let's try and see how it comes out.
I have modified as we discussed.  Please take a look.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


v2-0001-delete-empty-page-in-gistbulkdelete.patch
Description: Binary data


Re: Questions/Observations related to Gist vacuum

2019-10-21 Thread Amit Kapila
On Tue, Oct 22, 2019 at 10:50 AM Dilip Kumar  wrote:
>
> On Tue, Oct 22, 2019 at 9:10 AM Amit Kapila  wrote:
> >
> > On Fri, Oct 18, 2019 at 4:51 PM Dilip Kumar  wrote:
> > >
> > > I have prepared a first version of the patch.  Currently, I am
> > > performing an empty page deletion for all the cases.
> > >
> >
> > Few comments:
> > --
> > 1.
> > -/*
> > - * State kept across vacuum stages.
> > - */
> >  typedef struct
> >  {
> > - IndexBulkDeleteResult stats; /* must be first */
> > + IndexBulkDeleteResult *stats; /* kept across vacuum stages. */
> >
> >   /*
> > - * These are used to memorize all internal and empty leaf pages in the 1st
> > - * vacuum stage.  They are used in the 2nd stage, to delete all the empty
> > - * pages.
> > + * These are used to memorize all internal and empty leaf pages. They are
> > + * used for deleting all the empty pages.
> >   */
> >   IntegerSet *internal_page_set;
> >   IntegerSet *empty_leaf_set;
> >
> > Now, if we don't want to share the remaining stats across
> > gistbulkdelete and gistvacuumcleanup, isn't it better to keep the
> > information of internal and empty leaf pages as part of GistVacState?
>
> Basically, only IndexBulkDeleteResult is now shared across the stage
> so we can move all members to GistVacState and completely get rid of
> GistBulkDeleteResult?
>

Yes, something like that would be better.  Let's try and see how it comes out.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Questions/Observations related to Gist vacuum

2019-10-21 Thread Dilip Kumar
On Tue, Oct 22, 2019 at 9:10 AM Amit Kapila  wrote:
>
> On Fri, Oct 18, 2019 at 4:51 PM Dilip Kumar  wrote:
> >
> > I have prepared a first version of the patch.  Currently, I am
> > performing an empty page deletion for all the cases.
> >
>
> Few comments:
> --
> 1.
> -/*
> - * State kept across vacuum stages.
> - */
>  typedef struct
>  {
> - IndexBulkDeleteResult stats; /* must be first */
> + IndexBulkDeleteResult *stats; /* kept across vacuum stages. */
>
>   /*
> - * These are used to memorize all internal and empty leaf pages in the 1st
> - * vacuum stage.  They are used in the 2nd stage, to delete all the empty
> - * pages.
> + * These are used to memorize all internal and empty leaf pages. They are
> + * used for deleting all the empty pages.
>   */
>   IntegerSet *internal_page_set;
>   IntegerSet *empty_leaf_set;
>
> Now, if we don't want to share the remaining stats across
> gistbulkdelete and gistvacuumcleanup, isn't it better to keep the
> information of internal and empty leaf pages as part of GistVacState?

Basically, only IndexBulkDeleteResult is now shared across the stages,
so we can move all the other members to GistVacState and completely get
rid of GistBulkDeleteResult?

IndexBulkDeleteResult *stats
IntegerSet *internal_page_set;
IntegerSet *empty_leaf_set;
MemoryContext page_set_context;
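
Roughly like this, perhaps -- a sketch of the consolidated struct, not a
final definition, with the other existing per-scan fields elided:

typedef struct GistVacState
{
    /* ... existing per-scan fields (info, callback, and so on) ... */

    IndexBulkDeleteResult *stats;   /* the only piece shared across stages */

    /* empty-page bookkeeping, now local to a single pass */
    IntegerSet *internal_page_set;
    IntegerSet *empty_leaf_set;
    MemoryContext page_set_context;
} GistVacState;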


> Also, I think it is better to call gistvacuum_delete_empty_pages from
> function gistvacuumscan as that will avoid it calling from multiple
> places.
Yeah we can do that.
>
> 2.
> - gist_stats->page_set_context = NULL;
> - gist_stats->internal_page_set = NULL;
> - gist_stats->empty_leaf_set = NULL;
>
> Why have you removed this initialization?
This was a post-cleanup reset: since we were returning gist_stats, it
was better to clean it up, but now we are not returning it, so I
thought we don't need to reset these fields.  But I think now we can
free gist_stats itself.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Questions/Observations related to Gist vacuum

2019-10-21 Thread Amit Kapila
On Fri, Oct 18, 2019 at 4:51 PM Dilip Kumar  wrote:
>
> I have prepared a first version of the patch.  Currently, I am
> performing an empty page deletion for all the cases.
>

Few comments:
--
1.
-/*
- * State kept across vacuum stages.
- */
 typedef struct
 {
- IndexBulkDeleteResult stats; /* must be first */
+ IndexBulkDeleteResult *stats; /* kept across vacuum stages. */

  /*
- * These are used to memorize all internal and empty leaf pages in the 1st
- * vacuum stage.  They are used in the 2nd stage, to delete all the empty
- * pages.
+ * These are used to memorize all internal and empty leaf pages. They are
+ * used for deleting all the empty pages.
  */
  IntegerSet *internal_page_set;
  IntegerSet *empty_leaf_set;

Now, if we don't want to share the remaining stats across
gistbulkdelete and gistvacuumcleanup, isn't it better to keep the
information of internal and empty leaf pages as part of GistVacState?
Also, I think it is better to call gistvacuum_delete_empty_pages from
gistvacuumscan, as that will avoid calling it from multiple places.

2.
- gist_stats->page_set_context = NULL;
- gist_stats->internal_page_set = NULL;
- gist_stats->empty_leaf_set = NULL;

Why have you removed this initialization?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Questions/Observations related to Gist vacuum

2019-10-21 Thread Dilip Kumar
On Mon, Oct 21, 2019 at 2:58 PM Andrey Borodin  wrote:
>
>
>
> > On Oct 21, 2019, at 11:12, Dilip Kumar  wrote:
> >
> > On Mon, Oct 21, 2019 at 2:30 PM Andrey Borodin  wrote:
> >>
> >> I've took a look into the patch, and cannot understand one simple thing...
> >> We should not call gistvacuum_delete_empty_pages() for same gist_stats 
> >> twice.
> >> Another way once the function is called we should somehow update or zero 
> >> empty_leaf_set.
> >> Does this invariant hold in your patch?
> >>
> > Thanks for looking into the patch.   With this patch now
> > GistBulkDeleteResult is local to single gistbulkdelete call or
> > gistvacuumcleanup.  So now we are not sharing GistBulkDeleteResult,
> > across the calls so I am not sure how it will be called twice for the
> > same gist_stats?  I might be missing something here?
>
> Yes, you are right, sorry for the noise.
> Currently we are doing both gistvacuumscan() and 
> gistvacuum_delete_empty_pages() in both gistbulkdelete() and 
> gistvacuumcleanup(). Is it supposed to be so?

There was an issue discussed in the parallel vacuum thread [1], and to
solve it, it was discussed in this thread [2] that we can delete empty
pages in the bulkdelete phase itself.  But that does not mean we can
remove that work from the gistvacuumcleanup phase: if gistbulkdelete is
not called at all in a given vacuum pass, then gistvacuumcleanup will
perform both gistvacuumscan and gistvacuum_delete_empty_pages.  In
short, whichever pass detects an empty page is also the pass that
deletes it.

> Functions gistbulkdelete() and gistvacuumcleanup() look very similar
> and share some comments. This is what triggered my attention.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1JEQ2y3uNucNopDjK8pse6xSe5%3D_oknoWfRQvAF%3DVqsBA%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/69EF7B88-F3E7-4E09-824D-694CF39E5683%40iki.fi

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Questions/Observations related to Gist vacuum

2019-10-21 Thread Andrey Borodin



> On Oct 21, 2019, at 11:12, Dilip Kumar  wrote:
> 
> On Mon, Oct 21, 2019 at 2:30 PM Andrey Borodin  wrote:
>> 
>> I've took a look into the patch, and cannot understand one simple thing...
>> We should not call gistvacuum_delete_empty_pages() for same gist_stats twice.
>> Another way once the function is called we should somehow update or zero 
>> empty_leaf_set.
>> Does this invariant hold in your patch?
>> 
> Thanks for looking into the patch.   With this patch now
> GistBulkDeleteResult is local to single gistbulkdelete call or
> gistvacuumcleanup.  So now we are not sharing GistBulkDeleteResult,
> across the calls so I am not sure how it will be called twice for the
> same gist_stats?  I might be missing something here?

Yes, you are right, sorry for the noise.
Currently we are doing both gistvacuumscan() and 
gistvacuum_delete_empty_pages() in both gistbulkdelete() and 
gistvacuumcleanup(). Is it supposed to be so? Functions gistbulkdelete() and 
gistvacuumcleanup() look very similar and share some comments. This is what 
triggered my attention.

Thanks!

--
Andrey Borodin
Open source RDBMS development team leader
Yandex.Cloud





Re: Questions/Observations related to Gist vacuum

2019-10-21 Thread Dilip Kumar
On Mon, Oct 21, 2019 at 2:30 PM Andrey Borodin  wrote:
>
> Hi!
>
> > On Oct 18, 2019, at 13:21, Dilip Kumar  wrote:
> >
> > On Fri, Oct 18, 2019 at 10:55 AM Amit Kapila  
> > wrote:
> >>
> >>
> >> I think we can do it in general as adding some check for parallel
> >> vacuum there will look bit hackish.
> > I agree with that point.
> > It is not clear if we get enough
> >> benefit by keeping it for cleanup phase of the index as discussed in
> >> emails above.  Heikki, others, let us know if you don't agree here.
> >
> > I have prepared a first version of the patch.  Currently, I am
> > performing an empty page deletion for all the cases.
>
> I've took a look into the patch, and cannot understand one simple thing...
> We should not call gistvacuum_delete_empty_pages() for same gist_stats twice.
> Another way once the function is called we should somehow update or zero 
> empty_leaf_set.
> Does this invariant hold in your patch?
>
Thanks for looking into the patch.  With this patch, GistBulkDeleteResult
is now local to a single gistbulkdelete or gistvacuumcleanup call.  Since
we are no longer sharing GistBulkDeleteResult across the calls, I am not
sure how it could be called twice for the same gist_stats -- am I
missing something here?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Questions/Observations related to Gist vacuum

2019-10-21 Thread Andrey Borodin
Hi!

> On Oct 18, 2019, at 13:21, Dilip Kumar  wrote:
> 
> On Fri, Oct 18, 2019 at 10:55 AM Amit Kapila  wrote:
>> 
>> 
>> I think we can do it in general as adding some check for parallel
>> vacuum there will look bit hackish.
> I agree with that point.
> It is not clear if we get enough
>> benefit by keeping it for cleanup phase of the index as discussed in
>> emails above.  Heikki, others, let us know if you don't agree here.
> 
> I have prepared a first version of the patch.  Currently, I am
> performing an empty page deletion for all the cases.

I've taken a look at the patch, and cannot understand one simple thing...
We should not call gistvacuum_delete_empty_pages() for the same gist_stats
twice.  Put another way, once the function has been called we should
somehow update or zero empty_leaf_set.
Does this invariant hold in your patch?

Best regards, Andrey Borodin.



Re: Questions/Observations related to Gist vacuum

2019-10-21 Thread Dilip Kumar
On Mon, Oct 21, 2019 at 11:23 AM Amit Kapila  wrote:
>
> On Fri, Oct 18, 2019 at 10:48 AM Amit Kapila  wrote:
> >
> > Thanks for the test.  It shows that prior to patch the memory was
> > getting leaked in TopTransactionContext during multi-pass vacuum and
> > after patch, there is no leak.  I will commit the patch early next
> > week unless Heikki or someone wants some more tests.
> >
>
> Pushed.
Thanks.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Questions/Observations related to Gist vacuum

2019-10-20 Thread Amit Kapila
On Fri, Oct 18, 2019 at 10:48 AM Amit Kapila  wrote:
>
> Thanks for the test.  It shows that prior to patch the memory was
> getting leaked in TopTransactionContext during multi-pass vacuum and
> after patch, there is no leak.  I will commit the patch early next
> week unless Heikki or someone wants some more tests.
>

Pushed.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Questions/Observations related to Gist vacuum

2019-10-18 Thread Dilip Kumar
On Fri, Oct 18, 2019 at 10:55 AM Amit Kapila  wrote:
>
> On Fri, Oct 18, 2019 at 9:41 AM Dilip Kumar  wrote:
> >
> > On Wed, Oct 16, 2019 at 7:22 PM Heikki Linnakangas  wrote:
> > >
> > > On 16 October 2019 12:57:03 CEST, Amit Kapila  
> > > wrote:
> > > >On Tue, Oct 15, 2019 at 7:13 PM Heikki Linnakangas 
> > > >wrote:
> > > >> All things
> > > >> considered, I'm not sure which is better.
> > > >
> > > >Yeah, this is a tough call to make, but if we can allow it to delete
> > > >the pages in bulkdelete conditionally for parallel vacuum workers,
> > > >then it would be better.
> > >
> > > Yeah, if it's needed for parallel vacuum, maybe that tips the scale.
> >
> > Are we planning to do this only if it is called from parallel vacuum
> > workers or in general?
> >
>
> I think we can do it in general as adding some check for parallel
> vacuum there will look bit hackish.
I agree with that point.
> It is not clear if we get enough
> benefit by keeping it for cleanup phase of the index as discussed in
> emails above.  Heikki, others, let us know if you don't agree here.

I have prepared a first version of the patch.  Currently, I am
performing an empty page deletion for all the cases.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


delete_emptypages_in_gistbulkdelete_v1.patch
Description: Binary data


Re: Questions/Observations related to Gist vacuum

2019-10-17 Thread Amit Kapila
On Fri, Oct 18, 2019 at 9:41 AM Dilip Kumar  wrote:
>
> On Wed, Oct 16, 2019 at 7:22 PM Heikki Linnakangas  wrote:
> >
> > On 16 October 2019 12:57:03 CEST, Amit Kapila  
> > wrote:
> > >On Tue, Oct 15, 2019 at 7:13 PM Heikki Linnakangas 
> > >wrote:
> > >> All things
> > >> considered, I'm not sure which is better.
> > >
> > >Yeah, this is a tough call to make, but if we can allow it to delete
> > >the pages in bulkdelete conditionally for parallel vacuum workers,
> > >then it would be better.
> >
> > Yeah, if it's needed for parallel vacuum, maybe that tips the scale.
>
> Are we planning to do this only if it is called from parallel vacuum
> workers or in general?
>

I think we can do it in general, as adding some check for parallel
vacuum there will look a bit hackish.  It is not clear whether we get
enough benefit by keeping it for the cleanup phase of the index, as
discussed in the emails above.  Heikki, others, let us know if you
don't agree here.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Questions/Observations related to Gist vacuum

2019-10-17 Thread Amit Kapila
On Fri, Oct 18, 2019 at 9:34 AM Dilip Kumar  wrote:
>
> On Thu, Oct 17, 2019 at 6:32 PM Dilip Kumar  wrote:
> >
> > On Thu, 17 Oct 2019, 14:59 Amit Kapila,  wrote:
> >>
> >> On Thu, Oct 17, 2019 at 1:47 PM Dilip Kumar  wrote:
> >> >
> >> > On Thu, Oct 17, 2019 at 12:27 PM Heikki Linnakangas  
> >> > wrote:
> >> > >
> >> > > Thanks! Looks good to me. Did either of you test it, though, with a
> >> > > multi-pass vacuum?
> >> >
> >> > From my side, I have tested it with the multi-pass vacuum using the
> >> > gist index and after the fix, it's using expected memory context.
> >> >
> >>
> >> I have also verified that, but I think what additionally we can test
> >> here is that without the patch it will leak the memory in
> >> TopTransactionContext (CurrentMemoryContext), but after patch it
> >> shouldn't leak it during multi-pass vacuum.
> >>
> >> Make sense to me, I will test that by tomorrow.
>
> I have performed the test to observe the memory usage in the
> TopTransactionContext using the MemoryContextStats function from gdb.
>
> For testing this, in gistvacuumscan every time, after it resets the
> page_set_context, I have collected the sample.  I have collected 3
> samples for both the head and the patch.  We can clearly see that on
> the head the memory is getting accumulated in the
> TopTransactionContext whereas with the patch there is no memory
> getting accumulated.
>

Thanks for the test.  It shows that prior to patch the memory was
getting leaked in TopTransactionContext during multi-pass vacuum and
after patch, there is no leak.  I will commit the patch early next
week unless Heikki or someone wants some more tests.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Questions/Observations related to Gist vacuum

2019-10-17 Thread Dilip Kumar
On Wed, Oct 16, 2019 at 7:22 PM Heikki Linnakangas  wrote:
>
> On 16 October 2019 12:57:03 CEST, Amit Kapila  wrote:
> >On Tue, Oct 15, 2019 at 7:13 PM Heikki Linnakangas 
> >wrote:
> >> All things
> >> considered, I'm not sure which is better.
> >
> >Yeah, this is a tough call to make, but if we can allow it to delete
> >the pages in bulkdelete conditionally for parallel vacuum workers,
> >then it would be better.
>
> Yeah, if it's needed for parallel vacuum, maybe that tips the scale.

Are we planning to do this only if it is called from parallel vacuum
workers or in general?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Questions/Observations related to Gist vacuum

2019-10-17 Thread Dilip Kumar
On Thu, Oct 17, 2019 at 6:32 PM Dilip Kumar  wrote:
>
> On Thu, 17 Oct 2019, 14:59 Amit Kapila,  wrote:
>>
>> On Thu, Oct 17, 2019 at 1:47 PM Dilip Kumar  wrote:
>> >
>> > On Thu, Oct 17, 2019 at 12:27 PM Heikki Linnakangas  
>> > wrote:
>> > >
>> > > On 17/10/2019 05:31, Amit Kapila wrote:
>> > > >
>> > > > The patch looks good to me.  I have slightly modified the comments and
>> > > > removed unnecessary initialization.
>> > > >
>> > > > Heikki, are you fine me committing and backpatching this to 12?  Let
>> > > > me know if you have a different idea to fix.
>> > >
>> > > Thanks! Looks good to me. Did either of you test it, though, with a
>> > > multi-pass vacuum?
>> >
>> > From my side, I have tested it with the multi-pass vacuum using the
>> > gist index and after the fix, it's using expected memory context.
>> >
>>
>> I have also verified that, but I think what additionally we can test
>> here is that without the patch it will leak the memory in
>> TopTransactionContext (CurrentMemoryContext), but after patch it
>> shouldn't leak it during multi-pass vacuum.
>>
>> Make sense to me, I will test that by tomorrow.

I have performed the test to observe the memory usage in
TopTransactionContext, using the MemoryContextStats function from gdb.

For this test, I collected a sample in gistvacuumscan each time, right
after it resets the page_set_context.  I collected 3 samples each for
head and for the patch.  We can clearly see that on head the memory
keeps accumulating in TopTransactionContext, whereas with the patch no
memory accumulates.

head:
TopTransactionContext: 1056832 total in 2 blocks; 3296 free (5
chunks); 1053536 used
  GiST VACUUM page set context: 112 total in 0 blocks (0 chunks); 0
free (0 chunks); 112 used
Grand total: 1056944 bytes in 2 blocks; 3296 free (5 chunks); 1053648 used

TopTransactionContext: 1089600 total in 4 blocks; 19552 free (14
chunks); 1070048 used
  GiST VACUUM page set context: 112 total in 0 blocks (0 chunks); 0
free (0 chunks); 112 used
Grand total: 1089712 bytes in 4 blocks; 19552 free (14 chunks); 1070160 used

TopTransactionContext: 1122368 total in 5 blocks; 35848 free (20
chunks); 1086520 used
  GiST VACUUM page set context: 112 total in 0 blocks (0 chunks); 0
free (0 chunks); 112 used
Grand total: 1122480 bytes in 5 blocks; 35848 free (20 chunks); 1086632 used


With Patch:
TopTransactionContext: 1056832 total in 2 blocks; 3296 free (1
chunks); 1053536 used
  GiST VACUUM page set context: 112 total in 0 blocks (0 chunks); 0
free (0 chunks); 112 used
Grand total: 1056944 bytes in 2 blocks; 3296 free (1 chunks); 1053648 used

TopTransactionContext: 1056832 total in 2 blocks; 3296 free (1
chunks); 1053536 used
  GiST VACUUM page set context: 112 total in 0 blocks (0 chunks); 0
free (0 chunks); 112 used
Grand total: 1056944 bytes in 2 blocks; 3296 free (1 chunks); 1053648 used

TopTransactionContext: 1056832 total in 2 blocks; 3296 free (1
chunks); 1053536 used
  GiST VACUUM page set context: 112 total in 0 blocks (0 chunks); 0
free (0 chunks); 112 used
Grand total: 1056944 bytes in 2 blocks; 3296 free (1 chunks); 1053648 used

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Questions/Observations related to Gist vacuum

2019-10-17 Thread Dilip Kumar
On Thu, 17 Oct 2019, 14:59 Amit Kapila,  wrote:

> On Thu, Oct 17, 2019 at 1:47 PM Dilip Kumar  wrote:
> >
> > On Thu, Oct 17, 2019 at 12:27 PM Heikki Linnakangas 
> wrote:
> > >
> > > On 17/10/2019 05:31, Amit Kapila wrote:
> > > >
> > > > The patch looks good to me.  I have slightly modified the comments
> and
> > > > removed unnecessary initialization.
> > > >
> > > > Heikki, are you fine me committing and backpatching this to 12?  Let
> > > > me know if you have a different idea to fix.
> > >
> > > Thanks! Looks good to me. Did either of you test it, though, with a
> > > multi-pass vacuum?
> >
> > From my side, I have tested it with the multi-pass vacuum using the
> > gist index and after the fix, it's using expected memory context.
> >
>
> I have also verified that, but I think what additionally we can test
> here is that without the patch it will leak the memory in
> TopTransactionContext (CurrentMemoryContext), but after patch it
> shouldn't leak it during multi-pass vacuum.
>
Makes sense to me, I will test that by tomorrow.


Re: Questions/Observations related to Gist vacuum

2019-10-17 Thread Amit Kapila
On Thu, Oct 17, 2019 at 1:47 PM Dilip Kumar  wrote:
>
> On Thu, Oct 17, 2019 at 12:27 PM Heikki Linnakangas  wrote:
> >
> > On 17/10/2019 05:31, Amit Kapila wrote:
> > >
> > > The patch looks good to me.  I have slightly modified the comments and
> > > removed unnecessary initialization.
> > >
> > > Heikki, are you fine me committing and backpatching this to 12?  Let
> > > me know if you have a different idea to fix.
> >
> > Thanks! Looks good to me. Did either of you test it, though, with a
> > multi-pass vacuum?
>
> From my side, I have tested it with the multi-pass vacuum using the
> gist index and after the fix, it's using expected memory context.
>

I have also verified that, but I think what additionally we can test
here is that without the patch it will leak the memory in
TopTransactionContext (CurrentMemoryContext), but after patch it
shouldn't leak it during multi-pass vacuum.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Questions/Observations related to Gist vacuum

2019-10-17 Thread Dilip Kumar
On Thu, Oct 17, 2019 at 12:27 PM Heikki Linnakangas  wrote:
>
> On 17/10/2019 05:31, Amit Kapila wrote:
> > On Wed, Oct 16, 2019 at 11:20 AM Dilip Kumar  wrote:
> >>
> >> On Tue, Oct 15, 2019 at 7:13 PM Heikki Linnakangas  wrote:
> >>>
> >>> On 15/10/2019 09:37, Amit Kapila wrote:
> >>> While reviewing a parallel vacuum patch [1], we noticed a few things
> >>> about $SUBJECT implemented in commit -
> >>> 7df159a620b760e289f1795b13542ed1b3e13b87.
> 
>  1. A new memory context GistBulkDeleteResult->page_set_context has
>  been introduced, but it doesn't seem to be used.
> >>>
> >>> Oops. internal_page_set and empty_leaf_set were supposed to be allocated
> >>> in that memory context. As things stand, we leak them until end of
> >>> vacuum, in a multi-pass vacuum.
> >>
> >> Here is a patch to fix this issue.
> >
> > The patch looks good to me.  I have slightly modified the comments and
> > removed unnecessary initialization.
> >
> > Heikki, are you fine me committing and backpatching this to 12?  Let
> > me know if you have a different idea to fix.
>
> Thanks! Looks good to me. Did either of you test it, though, with a
> multi-pass vacuum?

From my side, I have tested it with the multi-pass vacuum using the
gist index and after the fix, it's using expected memory context.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Questions/Observations related to Gist vacuum

2019-10-17 Thread Heikki Linnakangas

On 17/10/2019 05:31, Amit Kapila wrote:

> On Wed, Oct 16, 2019 at 11:20 AM Dilip Kumar  wrote:
>>
>> On Tue, Oct 15, 2019 at 7:13 PM Heikki Linnakangas  wrote:
>>>
>>> On 15/10/2019 09:37, Amit Kapila wrote:
>>>> While reviewing a parallel vacuum patch [1], we noticed a few things
>>>> about $SUBJECT implemented in commit -
>>>> 7df159a620b760e289f1795b13542ed1b3e13b87.
>>>>
>>>> 1. A new memory context GistBulkDeleteResult->page_set_context has
>>>> been introduced, but it doesn't seem to be used.
>>>
>>> Oops. internal_page_set and empty_leaf_set were supposed to be allocated
>>> in that memory context. As things stand, we leak them until end of
>>> vacuum, in a multi-pass vacuum.
>>
>> Here is a patch to fix this issue.
>
> The patch looks good to me.  I have slightly modified the comments and
> removed unnecessary initialization.
>
> Heikki, are you fine me committing and backpatching this to 12?  Let
> me know if you have a different idea to fix.


Thanks! Looks good to me. Did either of you test it, though, with a 
multi-pass vacuum?


- Heikki




Re: Questions/Observations related to Gist vacuum

2019-10-16 Thread Dilip Kumar
On Thu, Oct 17, 2019 at 9:15 AM Amit Kapila  wrote:
>
> On Wed, Oct 16, 2019 at 7:21 PM Heikki Linnakangas  wrote:
> >
> > On 16 October 2019 12:57:03 CEST, Amit Kapila  
> > wrote:
> > >On Tue, Oct 15, 2019 at 7:13 PM Heikki Linnakangas 
> > >wrote:
> > >> All things
> > >> considered, I'm not sure which is better.
> > >
> > >Yeah, this is a tough call to make, but if we can allow it to delete
> > >the pages in bulkdelete conditionally for parallel vacuum workers,
> > >then it would be better.
> >
> > Yeah, if it's needed for parallel vacuum, maybe that tips the scale.
> >
>
> makes sense.  I think we can write a patch for it and prepare the
> parallel vacuum patch on top of it.  Once the parallel vacuum is in a
> committable shape, we can commit the gist-index related patch first
> followed by parallel vacuum patch.

+1
I can write a patch for the same.

> > Hopefully, multi-pass vacuums are rare in practice. And we should lift the 
> > current 1 GB limit on the dead TID array, replacing it with something more 
> > compact and expandable, to make multi-pass vacuums even more rare. So I 
> > don't think we need to jump through many hoops to optimize the multi-pass 
> > case.
> >
>
> Yeah, that will be a good improvement.
+1

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Questions/Observations related to Gist vacuum

2019-10-16 Thread Amit Kapila
On Wed, Oct 16, 2019 at 7:21 PM Heikki Linnakangas  wrote:
>
> On 16 October 2019 12:57:03 CEST, Amit Kapila  wrote:
> >On Tue, Oct 15, 2019 at 7:13 PM Heikki Linnakangas 
> >wrote:
> >> All things
> >> considered, I'm not sure which is better.
> >
> >Yeah, this is a tough call to make, but if we can allow it to delete
> >the pages in bulkdelete conditionally for parallel vacuum workers,
> >then it would be better.
>
> Yeah, if it's needed for parallel vacuum, maybe that tips the scale.
>

makes sense.  I think we can write a patch for it and prepare the
parallel vacuum patch on top of it.  Once the parallel vacuum is in a
committable shape, we can commit the gist-index related patch first
followed by parallel vacuum patch.

> Hopefully, multi-pass vacuums are rare in practice. And we should lift the 
> current 1 GB limit on the dead TID array, replacing it with something more 
> compact and expandable, to make multi-pass vacuums even more rare. So I don't 
> think we need to jump through many hoops to optimize the multi-pass case.
>

Yeah, that will be a good improvement.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Questions/Observations related to Gist vacuum

2019-10-16 Thread Amit Kapila
On Wed, Oct 16, 2019 at 11:20 AM Dilip Kumar  wrote:
>
> On Tue, Oct 15, 2019 at 7:13 PM Heikki Linnakangas  wrote:
> >
> > On 15/10/2019 09:37, Amit Kapila wrote:
> > > While reviewing a parallel vacuum patch [1], we noticed a few things
> > > about $SUBJECT implemented in commit -
> > > 7df159a620b760e289f1795b13542ed1b3e13b87.
> > >
> > > 1. A new memory context GistBulkDeleteResult->page_set_context has
> > > been introduced, but it doesn't seem to be used.
> >
> > Oops. internal_page_set and empty_leaf_set were supposed to be allocated
> > in that memory context. As things stand, we leak them until end of
> > vacuum, in a multi-pass vacuum.
>
> Here is a patch to fix this issue.
>

The patch looks good to me.  I have slightly modified the comments and
removed unnecessary initialization.

Heikki, are you fine with me committing and backpatching this to 12?  Let
me know if you have a different idea for the fix.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


0001-Fix-memory-leak-introduced-in-commit-7df159a620.patch
Description: Binary data


Re: Questions/Observations related to Gist vacuum

2019-10-16 Thread Heikki Linnakangas
On 16 October 2019 12:57:03 CEST, Amit Kapila  wrote:
>On Tue, Oct 15, 2019 at 7:13 PM Heikki Linnakangas 
>wrote:
>> All things
>> considered, I'm not sure which is better.
>
>Yeah, this is a tough call to make, but if we can allow it to delete
>the pages in bulkdelete conditionally for parallel vacuum workers,
>then it would be better.

Yeah, if it's needed for parallel vacuum, maybe that tips the scale.

Hopefully, multi-pass vacuums are rare in practice. And we should lift the 
current 1 GB limit on the dead TID array, replacing it with something more 
compact and expandable, to make multi-pass vacuums even more rare. So I don't 
think we need to jump through many hoops to optimize the multi-pass case.

- Heikki




Re: Questions/Observations related to Gist vacuum

2019-10-16 Thread Amit Kapila
On Tue, Oct 15, 2019 at 7:13 PM Heikki Linnakangas  wrote:
>
> On 15/10/2019 09:37, Amit Kapila wrote:
> > 2. Right now, in gistbulkdelete we make a note of empty leaf pages and
> > internals pages and then in the second pass during gistvacuumcleanup,
> > we delete all the empty leaf pages.  I was thinking why unlike nbtree,
> > we have delayed the deletion of empty pages till gistvacuumcleanup.  I
> > don't see any problem if we do this during gistbulkdelete itself
> > similar to nbtree, also I think there is some advantage in marking the
> > pages as deleted as early as possible.  Basically, if the vacuum
> > operation is canceled or errored out between gistbulkdelete and
> > gistvacuumcleanup, then I think the deleted pages could be marked as
> > recyclable very early in next vacuum operation.  The other advantage
> > of doing this during gistbulkdelete is we can avoid sharing
> > information between gistbulkdelete and gistvacuumcleanup which is
> > quite helpful for a parallel vacuum as the information is not trivial
> > (it is internally stored as in-memory Btree).   OTOH, there might be
> > some advantage for delaying the deletion of pages especially in the
> > case of multiple scans during a single VACUUM command.  We can
> > probably delete all empty leaf pages in one go which could in some
> > cases lead to fewer internal page reads.  However, I am not sure if it
> > is really advantageous to postpone the deletion as there seem to be
> > some downsides to it as well. I don't see it documented why unlike
> > nbtree we consider delaying deletion of empty pages till
> > gistvacuumcleanup, but I might be missing something.
>
> Hmm. The thinking is/was that removing the empty pages is somewhat
> expensive, because it has to scan all the internal nodes to find the
> downlinks to the to-be-deleted pages. Furthermore, it needs to scan all
> the internal pages (or at least until it has found all the downlinks),
> regardless of how many empty pages are being deleted. So it makes sense
> to do it only once, for all the empty pages. You're right though, that
> there would be advantages, too, in doing it after each pass.
>

I was thinking more about this, and it seems there could be more cases
where delaying the delete-marking of pages further delays their
recycling.  It is quite possible that immediately after bulk delete the
value of nextFullXid (used as deleteXid) is X, whereas during vacuum
cleanup it is X + N, and the chances of N being large are greater when
there are multiple passes of the vacuum scan.  Now, if we had set
deleteXid to X, there would be a better chance for the next vacuum to
recycle the page.  I am not sure, but it might be that in the future we
could come up with something (say, if we can recompute RecentGlobalXmin
again) that lets us recycle pages found by the first index scan within
a later scan of the same vacuum operation.

This is just to emphasize that doing the delete-marking of pages in the
same pass has advantages; otherwise, I understand that there are
advantages in delaying it as well.

> All things
> considered, I'm not sure which is better.
>

Yeah, this is a tough call to make, but if we can allow it to delete
the pages in bulkdelete conditionally for parallel vacuum workers,
then it would be better.

I think we have below option w.r.t Gist indexes for parallel vacuum
a. don't allow Gist Index to participate in parallel vacuum
b. allow delete of leaf pages in bulkdelete for parallel worker
c. always allow deleting leaf pages in bulkdelete
d. Invent some mechanism to share all the Gist stats via shared memory

(a) is not a very good option, but it is a safe one, as we can extend
it in the future; we might decide to go with it, especially if we can't
decide among the other options. (b) would serve the need but would add
some additional checks in gistbulkdelete and will look more like a
hack.  (c) would be best, but I think it will be difficult to be sure
that it is a good decision for all types of cases. (d) could require a
lot of effort, and I am not sure if it is worth adding that complexity
to the proposed patch.

Do you have any thoughts on this?

Just to give you an idea of the current parallel vacuum patch: the
master backend scans the heap and forms the dead tuple array in DSM;
then we launch one worker per index, based on the availability of
workers, and share the dead tuple array with each worker.  Each worker
performs bulkdelete for its index.  In the end, we perform cleanup of
all the indexes, either via a worker or the master backend, based on
some conditions.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Questions/Observations related to Gist vacuum

2019-10-15 Thread Dilip Kumar
On Tue, Oct 15, 2019 at 7:13 PM Heikki Linnakangas  wrote:
>
> On 15/10/2019 09:37, Amit Kapila wrote:
> > While reviewing a parallel vacuum patch [1], we noticed a few things
> > about $SUBJECT implemented in commit -
> > 7df159a620b760e289f1795b13542ed1b3e13b87.
> >
> > 1. A new memory context GistBulkDeleteResult->page_set_context has
> > been introduced, but it doesn't seem to be used.
>
> Oops. internal_page_set and empty_leaf_set were supposed to be allocated
> in that memory context. As things stand, we leak them until end of
> vacuum, in a multi-pass vacuum.

Here is a patch to fix this issue.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


user_correct_memorycontext_gist_stat.patch
Description: Binary data


Re: Questions/Observations related to Gist vacuum

2019-10-15 Thread Heikki Linnakangas

On 15/10/2019 09:37, Amit Kapila wrote:

While reviewing a parallel vacuum patch [1], we noticed a few things
about $SUBJECT implemented in commit -
7df159a620b760e289f1795b13542ed1b3e13b87.

1. A new memory context GistBulkDeleteResult->page_set_context has
been introduced, but it doesn't seem to be used.


Oops. internal_page_set and empty_leaf_set were supposed to be allocated 
in that memory context. As things stand, we leak them until end of 
vacuum, in a multi-pass vacuum.



2. Right now, in gistbulkdelete we make a note of empty leaf pages and
internal pages and then in the second pass during gistvacuumcleanup,
we delete all the empty leaf pages.  I was thinking why unlike nbtree,
we have delayed the deletion of empty pages till gistvacuumcleanup.  I
don't see any problem if we do this during gistbulkdelete itself
similar to nbtree, also I think there is some advantage in marking the
pages as deleted as early as possible.  Basically, if the vacuum
operation is canceled or errored out between gistbulkdelete and
gistvacuumcleanup, then I think the deleted pages could be marked as
recyclable very early in next vacuum operation.  The other advantage
of doing this during gistbulkdelete is we can avoid sharing
information between gistbulkdelete and gistvacuumcleanup which is
quite helpful for a parallel vacuum as the information is not trivial
(it is internally stored as in-memory Btree).   OTOH, there might be
some advantage for delaying the deletion of pages especially in the
case of multiple scans during a single VACUUM command.  We can
probably delete all empty leaf pages in one go which could in some
cases lead to fewer internal page reads.  However, I am not sure if it
is really advantageous to postpone the deletion as there seem to be
some downsides to it as well. I don't see it documented why unlike
nbtree we consider delaying deletion of empty pages till
gistvacuumcleanup, but I might be missing something.


Hmm. The thinking is/was that removing the empty pages is somewhat 
expensive, because it has to scan all the internal nodes to find the 
downlinks to the to-be-deleted pages. Furthermore, it needs to scan all 
the internal pages (or at least until it has found all the downlinks), 
regardless of how many empty pages are being deleted. So it makes sense 
to do it only once, for all the empty pages. You're right though, that 
there would be advantages, too, in doing it after each pass. All things 
considered, I'm not sure which is better.


- Heikki




Questions/Observations related to Gist vacuum

2019-10-15 Thread Amit Kapila
While reviewing a parallel vacuum patch [1], we noticed a few things
about $SUBJECT implemented in commit -
7df159a620b760e289f1795b13542ed1b3e13b87.

1. A new memory context GistBulkDeleteResult->page_set_context has
been introduced, but it doesn't seem to be used.
2. Right now, in gistbulkdelete we make a note of empty leaf pages and
internal pages and then in the second pass during gistvacuumcleanup,
we delete all the empty leaf pages.  I was thinking why unlike nbtree,
we have delayed the deletion of empty pages till gistvacuumcleanup.  I
don't see any problem if we do this during gistbulkdelete itself
similar to nbtree, also I think there is some advantage in marking the
pages as deleted as early as possible.  Basically, if the vacuum
operation is canceled or errored out between gistbulkdelete and
gistvacuumcleanup, then I think the deleted pages could be marked as
recyclable very early in next vacuum operation.  The other advantage
of doing this during gistbulkdelete is we can avoid sharing
information between gistbulkdelete and gistvacuumcleanup which is
quite helpful for a parallel vacuum as the information is not trivial
(it is internally stored as in-memory Btree).   OTOH, there might be
some advantage for delaying the deletion of pages especially in the
case of multiple scans during a single VACUUM command.  We can
probably delete all empty leaf pages in one go which could in some
cases lead to fewer internal page reads.  However, I am not sure if it
is really advantageous to postpone the deletion as there seem to be
some downsides to it as well. I don't see it documented why unlike
nbtree we consider delaying deletion of empty pages till
gistvacuumcleanup, but I might be missing something.

Thoughts?

[1] - 
https://www.postgresql.org/message-id/CAA4eK1JEQ2y3uNucNopDjK8pse6xSe5%3D_oknoWfRQvAF%3DVqsBA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: GiST VACUUM

2019-07-24 Thread Peter Geoghegan
On Wed, Jul 24, 2019 at 11:33 AM Heikki Linnakangas  wrote:
> That's probably how it's going to go, but hey, doesn't hurt to ask :-).

I think that it would be fine to be conservative with nbtree, and only
target the master branch. The problem is annoying, certainly, but it's
not likely to make a huge difference for most real world workloads.
OTOH, perhaps the risk is so low that we might as well target
backbranches.

How do you feel about it?

-- 
Peter Geoghegan




Re: GiST VACUUM

2019-07-24 Thread Heikki Linnakangas

On 24/07/2019 21:02, Peter Geoghegan wrote:

On Wed, Jul 24, 2019 at 10:30 AM Heikki Linnakangas  wrote:

Pushed this now, to master and REL_12_STABLE.

Now, B-tree indexes still have the same problem, in all versions. Any
volunteers to write a similar fix for B-trees?


I was hoping that you'd work on it.   :-)


That's probably how it's going to go, but hey, doesn't hurt to ask :-).


Any reason to think that it's much different to what you've done with GiST?


No, it should be very similar.

- Heikki




Re: GiST VACUUM

2019-07-24 Thread Peter Geoghegan
On Wed, Jul 24, 2019 at 10:30 AM Heikki Linnakangas  wrote:
> Pushed this now, to master and REL_12_STABLE.
>
> Now, B-tree indexes still have the same problem, in all versions. Any
> volunteers to write a similar fix for B-trees?

I was hoping that you'd work on it.   :-)

Any reason to think that it's much different to what you've done with GiST?

-- 
Peter Geoghegan




Re: GiST VACUUM

2019-07-24 Thread Heikki Linnakangas

On 22/07/2019 16:09, Heikki Linnakangas wrote:

Unless something comes up, I'll commit this tomorrow.


Pushed this now, to master and REL_12_STABLE.

Now, B-tree indexes still have the same problem, in all versions. Any 
volunteers to write a similar fix for B-trees?


- Heikki




Re: GiST VACUUM

2019-07-22 Thread Heikki Linnakangas

On 28/06/2019 01:02, Thomas Munro wrote:

On Thu, Jun 27, 2019 at 11:38 PM Heikki Linnakangas  wrote:

* I moved the logic to extend a 32-bit XID to 64-bits to a new helper
function in varsup.c.


I'm a bit uneasy about making this into a generally available function
for reuse.  I think we should instead come up with a very small number
of sources of fxids that are known to be free of races because of some
well-explained interlocking.

For example, I believe it is safe to convert an xid obtained from a
WAL record during recovery, because (for now) recovery is a single
thread of execution and the next fxid can't advance underneath you.
So I think XLogRecGetFullXid(XLogReaderState *record)[1] as I'm about
to propose in another thread (though I've forgotten who wrote it,
maybe Andres, Amit or me, but if it wasn't me then it's exactly what I
would have written) is a safe blessed source of fxids.  The rationale
for writing this function instead of an earlier code that looked much
like your proposed helper function, during EDB-internal review of
zheap stuff, was this: if we provide a general purpose xid->fxid
facility, it's virtually guaranteed that someone will eventually use
it to shoot footwards, by acquiring an xid from somewhere, and then
some arbitrary time later extending it to a fxid when no interlocking
guarantees that the later conversion has the correct epoch.


Fair point.


I'd like to figure out how to maintain full versions of the
procarray.c-managed xid horizons, without widening the shared memory
representation.  I was planning to think harder about that for 13.
Obviously that doesn't help you now.  So I'm wondering if you should
just open-code this for now, and put in a comment about why it's safe
and a note that there'll hopefully be a fxid horizon available in a
later release?


I came up with the attached. Instead of having a general purpose "widen" 
function, it adds GetFullRecentGlobalXmin(), to return a 64-bit version 
of RecentGlobalXmin. That's enough for this Gist vacuum patch.

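For context, the 32-bit-to-64-bit widening trick under discussion works
roughly like this.  This is only a sketch, assuming the caller holds an
interlock guaranteeing the xid is not newer than the current next full XID;
the function name is made up and this is not the committed API.

static FullTransactionId
widen_xid_sketch(TransactionId xid, FullTransactionId nextFullXid)
{
    uint32      epoch = EpochFromFullTransactionId(nextFullXid);

    /* a numerically larger xid must belong to the previous epoch */
    if (xid > XidFromFullTransactionId(nextFullXid))
        epoch--;

    return FullTransactionIdFromEpochAndXid(epoch, xid);
}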

In addition to that change, I modified the GistPageSetDeleted(), 
GistPageSetDeleteXid(), GistPageGetDeleteXid() inline functions a bit. I 
merged GistPageSetDeleted() and GistPageSetDeleteXid() into a single 
function, to make sure that the delete-XID is always set when a page is 
marked as deleted. And I modified GistPageGetDeleteXid() to return 
NormalTransactionId (or a FullTransactionId version of it, to be 
precise), for Gist pages that were deleted with older PostgreSQL v12 
beta versions. The previous patch tripped an assertion in that case, 
which is not nice for people binary-upgrading from earlier beta versions.

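The backwards-compatibility behaviour described above might look roughly
like this.  This is a sketch reconstructed from the description in this
mail, not copied from the patch; the exact pd_lower test in the committed
code may differ.

static inline FullTransactionId
GistPageGetDeleteXid(Page page)
{
    Assert(GistPageIsDeleted(page));

    /* Is a GISTDeletedPageContents struct present on the page? */
    if (((PageHeader) page)->pd_lower ==
        MAXALIGN(SizeOfPageHeaderData) + sizeof(GISTDeletedPageContents))
        return ((GISTDeletedPageContents *) PageGetContents(page))->deleteXid;

    /*
     * Page deleted by an older v12 beta version: no 64-bit XID was stored,
     * so return a normal, very old FullTransactionId, as described above.
     */
    return FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
}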

I did some testing of this with XID wraparound, and it seems to work. I 
used the attached bash script for the testing, with the a helper contrib 
module to consume XIDs faster. It's not very well commented and probably 
not too useful for anyone, but I'm attaching it here mainly as a note to 
future-self, if we need to revisit this.


Unless something comes up, I'll commit this tomorrow.

- Heikki
>From bdeb2467211a1ab9e733e070d54dce97c05cf18c Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 22 Jul 2019 15:57:01 +0300
Subject: [PATCH 1/2] Refactor checks for deleted GiST pages.

The explicit check in gistScanPage() isn't currently really necessary, as
a deleted page is always empty, so the loop would fall through without
doing anything, anyway. But it's a marginal optimization, and it gives a
nice place to attach a comment to explain how it works.
---
 src/backend/access/gist/gist.c| 40 ---
 src/backend/access/gist/gistget.c | 14 +++
 2 files changed, 29 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 169bf6fcfed..e9ca4b82527 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -709,14 +709,15 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			continue;
 		}
 
-		if (stack->blkno != GIST_ROOT_BLKNO &&
-			stack->parent->lsn < GistPageGetNSN(stack->page))
+		if ((stack->blkno != GIST_ROOT_BLKNO &&
+			 stack->parent->lsn < GistPageGetNSN(stack->page)) ||
+			GistPageIsDeleted(stack->page))
 		{
 			/*
-			 * Concurrent split detected. There's no guarantee that the
-			 * downlink for this page is consistent with the tuple we're
-			 * inserting anymore, so go back to parent and rechoose the best
-			 * child.
+			 * Concurrent split or page deletion detected. There's no
+			 * guarantee that the downlink for this page is consistent with
+			 * the tuple we're inserting anymore, so go back to parent and
+			 * rechoose the best child.
 			 */
 			UnlockReleaseBuffer(stack->buffer);
 			xlocked = false;
@@ -735,9 +736,6 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
-			/* currently, internal 

Re: GiST VACUUM

2019-07-16 Thread Amit Kapila
On Fri, Jun 28, 2019 at 3:32 AM Thomas Munro  wrote:
>
> On Thu, Jun 27, 2019 at 11:38 PM Heikki Linnakangas  wrote:
> > * I moved the logic to extend a 32-bit XID to 64-bits to a new helper
> > function in varsup.c.
>
> I'm a bit uneasy about making this into a generally available function
> for reuse.  I think we should instead come up with a very small number
> of sources of fxids that known to be free of races because of some
> well explained interlocking.
>

I have two more cases in the undo patch series where the same function
is needed and it is safe to use it there.  The first place is
twophase.c, for rolling back prepared transactions, where we know that
we don't support aborted xacts that are older than 2 billion, so we can
rely on such a function.  We also need it in undodiscard.c to compute
the value of oldestFullXidHavingUnappliedUndo.  See the usage of
GetEpochForXid in the undo-processing patches.  Now, we might find a way
to avoid using it in one of these places by doing some more work, but I
am not sure we can avoid it in all three places (one proposed by this
patch and two by the undo patchset).

> For example, I believe it is safe to convert an xid obtained from a
> WAL record during recovery, because (for now) recovery is a single
> thread of execution and the next fxid can't advance underneath you.
> So I think XLogRecGetFullXid(XLogReaderState *record)[1] as I'm about
> to propose in another thread (though I've forgotten who wrote it,
> maybe Andres, Amit or me, but if it wasn't me then it's exactly what I
> would have written) is a safe blessed source of fxids.  The rationale
> for writing this function instead of an earlier code that looked much
> like your proposed helper function, during EDB-internal review of
> zheap stuff, was this: if we provide a general purpose xid->fxid
> facility, it's virtually guaranteed that someone will eventually use
> it to shoot footwards, by acquiring an xid from somewhere, and then
> some arbitrary time later extending it to a fxid when no interlocking
> guarantees that the later conversion has the correct epoch.
>
> I'd like to figure out how to maintain full versions of the
> procarray.c-managed xid horizons, without widening the shared memory
> representation.  I was planning to think harder about for 13.
> Obviously that doesn't help you now.  So I'm wondering if you should
> just open-code this for now, and put in a comment about why it's safe
> and a note that there'll hopefully be a fxid horizon available in a
> later release?
>

Do you suggest open-coding it in all three places for now?  I am not
against open-coding the logic for now, but I am not sure we can
eliminate the need for it, because even if we can do what you are saying
with the procarray.c-managed xid horizons, I think we still need to do
something about clog.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: GiST VACUUM

2019-07-03 Thread Peter Geoghegan
On Thu, Apr 4, 2019 at 8:15 AM Heikki Linnakangas  wrote:
> I think we should fix this in a similar manner in B-tree, too, but that
> can be done separately. For B-tree, we need to worry about
> backwards-compatibility, but that seems simple enough; we just need to
> continue to understand old deleted pages, where the deletion XID is
> stored in the page opaque field.

What Postgres versions will the B-Tree fix end up targeting? Sounds
like you plan to backpatch all the way?

-- 
Peter Geoghegan




Re: GiST VACUUM

2019-06-27 Thread Thomas Munro
On Thu, Jun 27, 2019 at 11:38 PM Heikki Linnakangas  wrote:
> * I moved the logic to extend a 32-bit XID to 64-bits to a new helper
> function in varsup.c.

I'm a bit uneasy about making this into a generally available function
for reuse.  I think we should instead come up with a very small number
of sources of fxids that are known to be free of races because of some
well-explained interlocking.

For example, I believe it is safe to convert an xid obtained from a
WAL record during recovery, because (for now) recovery is a single
thread of execution and the next fxid can't advance underneath you.
So I think XLogRecGetFullXid(XLogReaderState *record)[1] as I'm about
to propose in another thread (though I've forgotten who wrote it,
maybe Andres, Amit or me, but if it wasn't me then it's exactly what I
would have written) is a safe blessed source of fxids.  The rationale
for writing this function instead of an earlier code that looked much
like your proposed helper function, during EDB-internal review of
zheap stuff, was this: if we provide a general purpose xid->fxid
facility, it's virtually guaranteed that someone will eventually use
it to shoot footwards, by acquiring an xid from somewhere, and then
some arbitrary time later extending it to a fxid when no interlocking
guarantees that the later conversion has the correct epoch.

I'd like to figure out how to maintain full versions of the
procarray.c-managed xid horizons, without widening the shared memory
representation.  I was planning to think harder about that for 13.
Obviously that doesn't help you now.  So I'm wondering if you should
just open-code this for now, and put in a comment about why it's safe
and a note that there'll hopefully be a fxid horizon available in a
later release?

[1] 
https://github.com/EnterpriseDB/zheap/commit/1203c2fa49f5f872f42ea4a02ccba2b191c45f32

-- 
Thomas Munro
https://enterprisedb.com




Re: GiST VACUUM

2019-06-27 Thread Heikki Linnakangas

On 27/06/2019 20:15, Andrey Borodin wrote:

But I have a stupid question again, about this code:

https://github.com/x4m/postgres_g/commit/096d5586537d29ff316ca8ce074bbe1b325879ee#diff-754126824470cb8e68fd5e32af6d3bcaR417

nextFullXid = ReadNextFullTransactionId();
diff = U64FromFullTransactionId(nextFullXid) -
U64FromFullTransactionId(latestRemovedFullXid);
if (diff < MaxTransactionId / 2)
{
TransactionId latestRemovedXid;

// sleep(100500 hours); latestRemovedXid 
becomes xid from future

latestRemovedXid = 
XidFromFullTransactionId(latestRemovedFullXid);

ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
   
 xlrec->node);
}

Do we have a race condition here? Can latestRemovedXid wrap around and start to be 
an xid from the future?
I understand that it is purely hypothetical, but still, latestRemovedXid is already 
from the ancient past.


Good question. No, that can't happen, because this code is in the WAL 
redo function. In a standby, the next XID counter only moves forward 
when a WAL record is replayed that advances it, and all WAL records are 
replayed serially, so that can't happen when we're in the middle of 
replaying this record. A comment on that would be good, though.

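Such a comment, attached to the snippet quoted above, could read roughly like
this (the wording is only a sketch, based on the explanation just given):

        /*
         * We are in WAL redo.  WAL records are replayed serially, so the
         * next-XID counter cannot advance while we are in the middle of
         * replaying this record; hence latestRemovedFullXid cannot turn
         * into an xid "from the future" between the check above and here.
         */
        latestRemovedXid = XidFromFullTransactionId(latestRemovedFullXid);
        ResolveRecoveryConflictWithSnapshot(latestRemovedXid, xlrec->node);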

When I originally had a check like the above in the code that created the 
WAL record, it had exactly that problem, because on the master the next 
XID counter can advance concurrently.


- Heikki




Re: GiST VACUUM

2019-06-27 Thread Andrey Borodin



> On 27 June 2019, at 16:38, Heikki Linnakangas wrote:
> 
> I haven't done any testing on this yet. Andrey, would you happen to have an 
> environment ready to test this?

The patches do not deadlock and they do delete pages on the "rescan test" - the 
setup that we used to detect "left jumps" during development of physical vacuum. 
check-world is happy on my machine.
I really like that now there is GISTDeletedPageContents and we do not just cast 
*(FullTransactionId *) PageGetContents(page).

But I have a stupid question again, about this code:

https://github.com/x4m/postgres_g/commit/096d5586537d29ff316ca8ce074bbe1b325879ee#diff-754126824470cb8e68fd5e32af6d3bcaR417

nextFullXid = ReadNextFullTransactionId();
diff = U64FromFullTransactionId(nextFullXid) -
U64FromFullTransactionId(latestRemovedFullXid);
if (diff < MaxTransactionId / 2)
{
TransactionId latestRemovedXid;

// sleep(100500 hours); latestRemovedXid 
becomes xid from future

latestRemovedXid = 
XidFromFullTransactionId(latestRemovedFullXid);

ResolveRecoveryConflictWithSnapshot(latestRemovedXid,

xlrec->node);
}

Do we have a race condition here? Can latestRemovedXid wrap around and start to be 
an xid from the future?
I understand that it is purely hypothetical, but still, latestRemovedXid is already 
from the ancient past. 

Best regards, Andrey Borodin.



Re: GiST VACUUM

2019-06-27 Thread Andrey Borodin



> On 27 June 2019, at 16:38, Heikki Linnakangas wrote:
> 
> I haven't done any testing on this yet. Andrey, would you happen to have an 
> environment ready to test this?

Thanks!

I will do some testing this evening (UTC+5). But I'm not sure I can reliably 
test wraparound of xids...
I will try to break the code with the usual setup which we used to stress vacuum 
with deletes and inserts.

Best regards, Andrey Borodin.



Re: GiST VACUUM

2019-06-27 Thread Heikki Linnakangas

On 26/06/2019 06:07, Thomas Munro wrote:

On Wed, Jun 26, 2019 at 2:33 PM Michael Paquier  wrote:

On Tue, Jun 25, 2019 at 02:38:43PM +0500, Andrey Borodin wrote:

I feel a little uncomfortable with number 0x7fffffff right in code.


PG_INT32_MAX...


MaxTransactionId / 2?


Yeah, that makes sense.

Here's a new version of the patches. Changes:

* I changed the reuse-logging so that we always write a reuse WAL 
record, even if the deleteXid is very old. I tried to avoid that with 
the check for MaxTransactionId / 2 or 0x7fffffff, but it had some 
problems. In the previous patch version, it wasn't just an optimization. 
Without the check, we would write 32-bit XIDs to the log that had 
already wrapped around, and that caused the standby to unnecessarily 
wait for or kill backends. But checking for MaxTransactionId / 2 isn't 
quite enough: there was a small chance that the next XID counter 
advanced just after checking for that, so that we still logged an XID 
that had just wrapped around. A more robust way to deal with this is to 
log a full 64-bit XID, and check for wraparound at redo in the standby. 
And if we do that, trying to optimize this in the master doesn't seem 
that important anymore. So in this patch version, we always log the 
64-bit XID, and check against MaxTransactionId / 2 when replaying the WAL 
record instead.


* I moved the logic to extend a 32-bit XID to 64-bits to a new helper 
function in varsup.c.


* Instead of storing just a naked FullTransactionId in the "page 
contents" of a deleted page, I created a new struct for that. The effect 
is the same, but I think the struct clarifies what's happening, and it's 
a good location to attach a comment explaining it.

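For reference, that struct is presumably along these lines; this is only a
sketch based on the description above, not copied from the attached patch.

/* Contents of a deleted GiST page: just the XID horizon for recycling. */
typedef struct GISTDeletedPageContents
{
    /* last xid that could still need to see this page in a scan */
    FullTransactionId deleteXid;
} GISTDeletedPageContents;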

* Fixed the mixup between < and >

I haven't done any testing on this yet. Andrey, would you happen to have 
an environment ready to test this?


- Heikki
>From 7fd5901e15ac9e0f1928eeecbb9ae1212bacf3f3 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Thu, 4 Apr 2019 18:06:48 +0300
Subject: [PATCH 1/2] Refactor checks for deleted GiST pages.

The explicit check in gistScanPage() isn't currently really necessary, as
a deleted page is always empty, so the loop would fall through without
doing anything, anyway. But it's a marginal optimization, and it gives a
nice place to attach a comment to explain how it works.
---
 src/backend/access/gist/gist.c| 40 ---
 src/backend/access/gist/gistget.c | 14 +++
 2 files changed, 29 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 470b121e7da..79030e77153 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -709,14 +709,15 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			continue;
 		}
 
-		if (stack->blkno != GIST_ROOT_BLKNO &&
-			stack->parent->lsn < GistPageGetNSN(stack->page))
+		if ((stack->blkno != GIST_ROOT_BLKNO &&
+			 stack->parent->lsn < GistPageGetNSN(stack->page)) ||
+			GistPageIsDeleted(stack->page))
 		{
 			/*
-			 * Concurrent split detected. There's no guarantee that the
-			 * downlink for this page is consistent with the tuple we're
-			 * inserting anymore, so go back to parent and rechoose the best
-			 * child.
+			 * Concurrent split or page deletion detected. There's no
+			 * guarantee that the downlink for this page is consistent with
+			 * the tuple we're inserting anymore, so go back to parent and
+			 * rechoose the best child.
 			 */
 			UnlockReleaseBuffer(stack->buffer);
 			xlocked = false;
@@ -735,9 +736,6 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
-			/* currently, internal pages are never deleted */
-			Assert(!GistPageIsDeleted(stack->page));
-
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -858,12 +856,13 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 	 * leaf/inner is enough to recognize split for root
 	 */
 }
-else if (GistFollowRight(stack->page) ||
-		 stack->parent->lsn < GistPageGetNSN(stack->page))
+else if ((GistFollowRight(stack->page) ||
+		  stack->parent->lsn < GistPageGetNSN(stack->page)) &&
+		 GistPageIsDeleted(stack->page))
 {
 	/*
-	 * The page was split while we momentarily unlocked the
-	 * page. Go back to parent.
+	 * The page was split or deleted while we momentarily
+	 * unlocked the page. Go back to parent.
 	 */
 	UnlockReleaseBuffer(stack->buffer);
 	xlocked = false;
@@ -872,18 +871,6 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 }
 			}
 
-			/*
-			 * The page might have been deleted after we scanned the parent
-			 * and saw the downlink.
-			 */
-			if (GistPageIsDeleted(stack->page))
-			{
-UnlockReleaseBuffer(stack->buffer);
-xlocked = 

Re: GiST VACUUM

2019-06-25 Thread Thomas Munro
On Wed, Jun 26, 2019 at 2:33 PM Michael Paquier  wrote:
> On Tue, Jun 25, 2019 at 02:38:43PM +0500, Andrey Borodin wrote:
> > I feel a little uncomfortable with number 0x7fffffff right in code.
>
> PG_INT32_MAX...

MaxTransactionId / 2?

-- 
Thomas Munro
https://enterprisedb.com




Re: GiST VACUUM

2019-06-25 Thread Michael Paquier
On Tue, Jun 25, 2019 at 02:38:43PM +0500, Andrey Borodin wrote:
> I feel a little uncomfortable with number 0x7fffffff right in code.

PG_INT32_MAX...
--
Michael




Re: GiST VACUUM

2019-06-25 Thread Andrey Borodin
Hi!

Thanks for clarification, now I understand these patches better.

> On 25 June 2019, at 13:10, Heikki Linnakangas wrote:
> 
>> Also, I did not understand this optimization:
>> +/*
>> + * We can skip this if the page was deleted so long ago, that no scan 
>> can possibly
>> + * still see it, even in a standby. One measure might be anything older 
>> than the
>> + * table's frozen-xid, but we don't have that at hand here. But 
>> anything older than
>> + * 2 billion, from the next XID, is surely old enough, because you 
>> would hit XID
>> + * wraparound at that point.
>> + */
>> +nextxid = ReadNextFullTransactionId();
>> +diff = U64FromFullTransactionId(nextxid) - 
>> U64FromFullTransactionId(latestRemovedXid);
>> +if (diff < 0x7fffffff)
>> +return;
>> Standby can be lagging months from primary, and, theoretically, close
>> the gap in one sudden WAL leap...
> It would still process the WAL one WAL record at a time, even if it's lagging 
> months behind. It can't just jump over 2 billion XIDs.
I feel a little uncomfortable with the number 0x7fffffff right in the code.

Thanks!

Best regards, Andrey Borodin.



Re: GiST VACUUM

2019-06-25 Thread Heikki Linnakangas

(Thanks for the reminder on this, Michael!)

On 05/04/2019 08:39, Andrey Borodin wrote:

On 4 Apr 2019, at 20:15, Heikki Linnakangas wrote:
I suggest that we do the attached. It fixes this for GiST. The
patch expands the "deletion XID" to 64 bits, and changes
where it's stored. Instead of storing it in pd_prune_xid, it's stored
in the page contents. Luckily, a deleted page has no real content.


So, we store the full xid right after the page header?


Yep.


+static inline void
+GistPageSetDeleteXid(Page page, FullTransactionId deletexid)
+{
+   Assert(PageIsEmpty(page));
+   ((PageHeader) page)->pd_lower = MAXALIGN(SizeOfPageHeaderData) + 
sizeof(FullTransactionId);
+
+   *((FullTransactionId *) PageGetContents(page)) = deletexid;
+}

Usually we leave one ItemId (located at the invalid offset number)
untouched. I do not know whether it is done for a reason or not
No. Take a look at PageGetItemId() macro, it subtracts one from the 
offset number. But in any case, that's not really relevant here, because 
this patch stores the transaction id directly as the page content. There 
are no itemids at all on a deleted page.



Also, I did not understand this optimization:
+   /*
+* We can skip this if the page was deleted so long ago, that no scan 
can possibly
+* still see it, even in a standby. One measure might be anything older 
than the
+* table's frozen-xid, but we don't have that at hand here. But 
anything older than
+* 2 billion, from the next XID, is surely old enough, because you 
would hit XID
+* wraparound at that point.
+*/
+   nextxid = ReadNextFullTransactionId();
+   diff = U64FromFullTransactionId(nextxid) - 
U64FromFullTransactionId(latestRemovedXid);
+   if (diff < 0x7fffffff)
+   return;

Standby can be lagging months from primary, and, theoretically, close
the gap in one sudden WAL leap...
It would still process the WAL one WAL record at a time, even if it's 
lagging months behind. It can't just jump over 2 billion XIDs.



Also, I think, that comparison sign should be >, not <.


Ah, good catch! And it shows that this needs more testing..

- Heikki




Re: GiST VACUUM

2019-06-24 Thread Michael Paquier
Heikki,

On Thu, Apr 04, 2019 at 06:15:21PM +0300, Heikki Linnakangas wrote:
> I think we should fix this in a similar manner in B-tree, too, but that can
> be done separately. For B-tree, we need to worry about
> backwards-compatibility, but that seems simple enough; we just need to
> continue to understand old deleted pages, where the deletion XID is stored
> in the page opaque field.

This is an open item present already for a couple of weeks.  Are you
planning to tackle that soon?
--
Michael




Re: GiST VACUUM

2019-04-04 Thread Andrey Borodin
Hi!

> On 4 Apr 2019, at 20:15, Heikki Linnakangas wrote:
> 
> On 25/03/2019 15:20, Heikki Linnakangas wrote:
>> On 24/03/2019 18:50, Andrey Borodin wrote:
>>> I was working on new version of gist check in amcheck and understand one 
>>> more thing:
>>> 
>>> /* Can this page be recycled yet? */
>>> bool
>>> gistPageRecyclable(Page page)
>>> {
>>>  return PageIsNew(page) ||
>>>  (GistPageIsDeleted(page) &&
>>>   TransactionIdPrecedes(GistPageGetDeleteXid(page), 
>>> RecentGlobalXmin));
>>> }
>>> 
>>> Here RecentGlobalXmin can wraparound and page will become unrecyclable for 
>>> half of xid cycle. Can we prevent it by resetting PageDeleteXid to 
>>> InvalidTransactionId before doing RecordFreeIndexPage()?
>>> (Seems like same applies to GIN...)
>> True, and B-tree has the same issue. I thought I saw a comment somewhere
>> in the B-tree code about that earlier, but now I can't find it. I
>> must've imagined it.
>> We could reset it, but that would require dirtying the page. That would
>> be just extra I/O overhead, if the page gets reused before XID
>> wraparound. We could avoid that if we stored the full XID+epoch, not
>> just XID. I think we should do that in GiST, at least, where this is
>> new. In the B-tree, it would require some extra code to deal with
>> backwards-compatibility, but maybe it would be worth it even there.
> 
> I suggest that we do the attached. It fixes this for GiST. The patch changes 
> expands the "deletion XID" to 64-bits, and changes where it's stored. Instead 
> of storing it pd_prune_xid, it's stored in the page contents. Luckily, a 
> deleted page has no real content.

So, we store the full xid right after the page header?
+static inline void
+GistPageSetDeleteXid(Page page, FullTransactionId deletexid)
+{
+   Assert(PageIsEmpty(page));
+   ((PageHeader) page)->pd_lower = MAXALIGN(SizeOfPageHeaderData) + 
sizeof(FullTransactionId);
+
+   *((FullTransactionId *) PageGetContents(page)) = deletexid;
+}

Usually we leave one ItemId (located at the invalid offset number) untouched. I do 
not know whether it is done for a reason or not


Also, I did not understand this optimization:
+   /*
+* We can skip this if the page was deleted so long ago, that no scan 
can possibly
+* still see it, even in a standby. One measure might be anything older 
than the
+* table's frozen-xid, but we don't have that at hand here. But 
anything older than
+* 2 billion, from the next XID, is surely old enough, because you 
would hit XID
+* wraparound at that point.
+*/
+   nextxid = ReadNextFullTransactionId();
+   diff = U64FromFullTransactionId(nextxid) - 
U64FromFullTransactionId(latestRemovedXid);
+   if (diff < 0x7fffffff)
+   return;

Standby can be lagging months from primary, and, theoretically, close the gap 
in one sudden WAL leap... Also, I think, that comparison sign should be >, not 
<.


Best regards, Andrey Borodin.



Re: GiST VACUUM

2019-04-04 Thread Heikki Linnakangas

On 25/03/2019 15:20, Heikki Linnakangas wrote:

On 24/03/2019 18:50, Andrey Borodin wrote:

I was working on new version of gist check in amcheck and understand one more 
thing:

/* Can this page be recycled yet? */
bool
gistPageRecyclable(Page page)
{
  return PageIsNew(page) ||
  (GistPageIsDeleted(page) &&
   TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin));
}

Here RecentGlobalXmin can wrap around and the page will become unrecyclable for half 
of the xid cycle. Can we prevent it by resetting PageDeleteXid to 
InvalidTransactionId before doing RecordFreeIndexPage()?
(Seems like the same applies to GIN...)


True, and B-tree has the same issue. I thought I saw a comment somewhere
in the B-tree code about that earlier, but now I can't find it. I
must've imagined it.

We could reset it, but that would require dirtying the page. That would
be just extra I/O overhead, if the page gets reused before XID
wraparound. We could avoid that if we stored the full XID+epoch, not
just XID. I think we should do that in GiST, at least, where this is
new. In the B-tree, it would require some extra code to deal with
backwards-compatibility, but maybe it would be worth it even there.


I suggest that we do the attached. It fixes this for GiST. The patch 
expands the "deletion XID" to 64 bits, and changes where it's 
stored. Instead of storing it in pd_prune_xid, it's stored in the page 
contents. Luckily, a deleted page has no real content.


I think we should fix this in a similar manner in B-tree, too, but that 
can be done separately. For B-tree, we need to worry about 
backwards-compatibility, but that seems simple enough; we just need to 
continue to understand old deleted pages, where the deletion XID is 
stored in the page opaque field.


- Heikki
>From b7897577c83a81ec04394ce7113d1d8a47804086 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Thu, 4 Apr 2019 18:06:48 +0300
Subject: [PATCH 1/2] Refactor checks for deleted GiST pages.

The explicit check in gistScanPage() isn't currently really necessary, as
a deleted page is always empty, so the loop would fall through without
doing anything, anyway. But it's a marginal optimization, and it gives a
nice place to attach a comment to explain how it works.
---
 src/backend/access/gist/gist.c| 40 ---
 src/backend/access/gist/gistget.c | 14 +++
 2 files changed, 29 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 2db790c840..028b06b264 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -693,14 +693,15 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			continue;
 		}
 
-		if (stack->blkno != GIST_ROOT_BLKNO &&
-			stack->parent->lsn < GistPageGetNSN(stack->page))
+		if ((stack->blkno != GIST_ROOT_BLKNO &&
+			 stack->parent->lsn < GistPageGetNSN(stack->page)) ||
+			GistPageIsDeleted(stack->page))
 		{
 			/*
-			 * Concurrent split detected. There's no guarantee that the
-			 * downlink for this page is consistent with the tuple we're
-			 * inserting anymore, so go back to parent and rechoose the best
-			 * child.
+			 * Concurrent split or page deletion detected. There's no
+			 * guarantee that the downlink for this page is consistent with
+			 * the tuple we're inserting anymore, so go back to parent and
+			 * rechoose the best child.
 			 */
 			UnlockReleaseBuffer(stack->buffer);
 			xlocked = false;
@@ -719,9 +720,6 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
-			/* currently, internal pages are never deleted */
-			Assert(!GistPageIsDeleted(stack->page));
-
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -842,12 +840,13 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 	 * leaf/inner is enough to recognize split for root
 	 */
 }
-else if (GistFollowRight(stack->page) ||
-		 stack->parent->lsn < GistPageGetNSN(stack->page))
+else if ((GistFollowRight(stack->page) ||
+		  stack->parent->lsn < GistPageGetNSN(stack->page)) &&
+		 GistPageIsDeleted(stack->page))
 {
 	/*
-	 * The page was split while we momentarily unlocked the
-	 * page. Go back to parent.
+	 * The page was split or deleted while we momentarily
+	 * unlocked the page. Go back to parent.
 	 */
 	UnlockReleaseBuffer(stack->buffer);
 	xlocked = false;
@@ -856,18 +855,6 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 }
 			}
 
-			/*
-			 * The page might have been deleted after we scanned the parent
-			 * and saw the downlink.
-			 */
-			if (GistPageIsDeleted(stack->page))
-			{
-UnlockReleaseBuffer(stack->buffer);
-xlocked = false;
-state.stack = stack = stack->parent;
-continue;
-	

Re: GiST VACUUM

2019-03-25 Thread Heikki Linnakangas

On 24/03/2019 18:50, Andrey Borodin wrote:

I was working on new version of gist check in amcheck and understand one more 
thing:

/* Can this page be recycled yet? */
bool
gistPageRecyclable(Page page)
{
 return PageIsNew(page) ||
 (GistPageIsDeleted(page) &&
  TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin));
}

Here RecentGlobalXmin can wrap around and the page will become unrecyclable for half 
of the xid cycle. Can we prevent it by resetting PageDeleteXid to 
InvalidTransactionId before doing RecordFreeIndexPage()?
(Seems like the same applies to GIN...)


True, and B-tree has the same issue. I thought I saw a comment somewhere 
in the B-tree code about that earlier, but now I can't find it. I 
must've imagined it.


We could reset it, but that would require dirtying the page. That would 
be just extra I/O overhead, if the page gets reused before XID 
wraparound. We could avoid that if we stored the full XID+epoch, not 
just XID. I think we should do that in GiST, at least, where this is 
new. In the B-tree, it would require some extra code to deal with 
backwards-compatibility, but maybe it would be worth it even there.

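With a full XID+epoch stored on the deleted page, the recyclability check could
become wraparound-proof, roughly like this.  This is only a sketch under that
assumption; GetFullRecentGlobalXmin() does not exist at this point and is assumed
here as a 64-bit counterpart of RecentGlobalXmin.

/* Can this page be recycled yet? (sketch, assuming a 64-bit delete XID) */
bool
gistPageRecyclable(Page page)
{
    return PageIsNew(page) ||
        (GistPageIsDeleted(page) &&
         FullTransactionIdPrecedes(GistPageGetDeleteXid(page),
                                   GetFullRecentGlobalXmin()));
}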

- Heikki



Re: GiST VACUUM

2019-03-24 Thread Andrey Borodin



> On 22 March 2019, at 17:03, Heikki Linnakangas wrote:
> 

I was working on new version of gist check in amcheck and understand one more 
thing:

/* Can this page be recycled yet? */
bool
gistPageRecyclable(Page page)
{
return PageIsNew(page) ||
(GistPageIsDeleted(page) &&
 TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin));
}

Here RecentGlobalXmin can wrap around and the page will become unrecyclable for half 
of the xid cycle. Can we prevent it by resetting PageDeleteXid to 
InvalidTransactionId before doing RecordFreeIndexPage()?
(Seems like the same applies to GIN...)

Best regards, Andrey Borodin.


Re: GiST VACUUM

2019-03-22 Thread Andrey Borodin



> On 22 March 2019, at 19:37, Heikki Linnakangas wrote:
> 
> On 21/03/2019 19:04, Heikki Linnakangas wrote:
>> Attached is the latest patch version, to be applied on top of the
>> IntegerSet patch.
> 
> And committed. Thanks Andrey!
> 
> - Heikki

Cool! Thank you very much! At the beginning I could not imagine how many caveats 
there are.

> On 22 March 2019, at 18:28, Heikki Linnakangas wrote:
> 
> * Currently, a search needs to scan all items on a page. If the keys are 
> small, that can be pretty slow. Divide each page further into e.g. 4 
> sub-pages, with a "bounding box" key for each sub-page, to speed up search.
BTW, I already have an intra-page indexing patch. But now it obviously needs a 
rebase :) Along with the gist amcheck patch.

Thanks!

Best regards, Andrey Borodin.


Re: GiST VACUUM

2019-03-22 Thread Heikki Linnakangas

On 22/03/2019 13:37, Heikki Linnakangas wrote:

On 21/03/2019 19:04, Heikki Linnakangas wrote:

Attached is the latest patch version, to be applied on top of the
IntegerSet patch.


And committed. Thanks Andrey!


This caused the buildfarm to go pink... I was able to reproduce it, by 
running the regression tests in one terminal, and repeatedly running 
"VACUUM;" in another terminal. Strange that it seems to happen all the 
time on the buildfarm, but never happened to me locally.


Anyway, I'm investigating.

- Heikki




Re: GiST VACUUM

2019-03-22 Thread Heikki Linnakangas

On 21/03/2019 19:04, Heikki Linnakangas wrote:

Attached is the latest patch version, to be applied on top of the
IntegerSet patch.


And committed. Thanks Andrey!

- Heikki



Re: GiST VACUUM

2019-03-22 Thread Heikki Linnakangas

On 22/03/2019 10:00, Andrey Borodin wrote:

On 22 March 2019, at 1:04, Heikki Linnakangas wrote:

PS. for Gist, we could almost use the LSN / NSN mechanism to detect
the case that a deleted page is reused: Add a new field to the GiST
page header, to store a new "deleteNSN" field. When a page is
deleted, the deleted page's deleteNSN is set to the LSN of the
deletion record. When the page is reused, the deleteNSN field is
kept unchanged. When you follow a downlink during search, if you
see that the page's deleteNSN > parent's LSN, you know that it was
concurrently deleted and recycled, and should be ignored. That
would allow reusing deleted pages immediately. Unfortunately that
would require adding a new field to the gist page header/footer,
which requires upgrade work :-(. Maybe one day, we'll bite the
bullet. Something to keep in mind, if we have to change the page
format anyway, for some reason.


Yeah, the same day we will get rid of invalid tuples. I can make a
patch for v13. Actually, I have a lot of patches that I want in GiST
in v13. Or v14.


Cool! Here's my wishlist:

* That deleteNSN thing
* Add a metapage to blk #0.
* Add a "level"-field to page header.
* Currently, a search needs to scan all items on a page. If the keys are 
small, that can be pretty slow. Divide each page further into e.g. 4 
sub-pages, with a "bounding box" key for each sub-page, to speed up search.


- Heikki



Re: GiST VACUUM

2019-03-22 Thread Andrey Borodin


> On 22 March 2019, at 1:04, Heikki Linnakangas wrote:
> ...
> When I started testing this, I quickly noticed that empty pages were not 
> being deleted nearly as much as I expected. I tracked it to this check in 
> gistdeletepage:
> 
>> +   if (GistFollowRight(leafPage)
>> +   || GistPageGetNSN(parentPage) > GistPageGetNSN(leafPage))
>> +   {
>> +   /* Don't mess with a concurrent page split. */
>> +   return false;
>> +   }
> 
> That NSN test was bogus. It prevented the leaf page from being reused, if the 
> parent page was *ever* split after the leaf page was created. I don't see any 
> reason to check the NSN here.
That's true. This check made sense only when the parent page was not locked (and 
it seems like the comparison should be the opposite). When both pages are locked, this 
test makes no sense at all. The check was incorrectly "fixed" by me when transitioning 
from single-scan delete to two-scan delete during summer 2018. Just wondering 
how hard it is to simply delete a page...

>> Though, I'm not sure it is important for GIN. Scariest thing that can
>> happen: it will return same tid twice. But it is doing bitmap scan,
>> you cannot return same bit twice...
> 
> Hmm. Could it return a completely unrelated tuple?
No, I do not think so, it will do comparisons on posting tree tuples.

> We don't always recheck the original index quals in a bitmap index scan, 
> IIRC. Also, a search might get confused if it's descending down a posting 
> tree, and lands on a different kind of a page, altogether.
Yes, the search will return an error, the user will reissue the query, and everything 
will be almost fine.

> PS. for Gist, we could almost use the LSN / NSN mechanism to detect the case 
> that a deleted page is reused: Add a new field to the GiST page header, to 
> store a new "deleteNSN" field. When a page is deleted, the deleted page's 
> deleteNSN is set to the LSN of the deletion record. When the page is reused, 
> the deleteNSN field is kept unchanged. When you follow a downlink during 
> search, if you see that the page's deleteNSN > parent's LSN, you know that it 
> was concurrently deleted and recycled, and should be ignored. That would 
> allow reusing deleted pages immediately. Unfortunately that would require 
> adding a new field to the gist page header/footer, which requires upgrade 
> work :-(. Maybe one day, we'll bite the bullet. Something to keep in mind, if 
> we have to change the page format anyway, for some reason.
Yeah, the same day we will get rid of invalid tuples. I can make a patch for 
v13. Actually, I have a lot of patches that I want in GiST in v13. Or v14.

Best regards, Andrey Borodin.


Re: GiST VACUUM

2019-03-21 Thread Heikki Linnakangas

On 21/03/2019 18:06, Andrey Borodin wrote:

On 21 March 2019, at 2:30, Heikki Linnakangas wrote: one remaining issue that needs to be fixed:

During Hot Standby, the B-tree code writes a WAL reord, when a
deleted page is recycled, to prevent the deletion from being
replayed too early in the hot standby. See _bt_getbuf() and
btree_xlog_reuse_page(). I think we need to do something similar in
GiST.

I'll try fixing that tomorrow, unless you beat me to it. Making
the changes is pretty straightforward, but it's a bit cumbersome to
test.


I've tried to deal with it and got stuck...


So, I came up with the attached. I basically copy-pasted the page-reuse 
WAL-logging stuff from nbtree.


When I started testing this, I quickly noticed that empty pages were not 
being deleted nearly as much as I expected. I tracked it to this check 
in gistdeletepage:



+   if (GistFollowRight(leafPage)
+   || GistPageGetNSN(parentPage) > GistPageGetNSN(leafPage))
+   {
+   /* Don't mess with a concurrent page split. */
+   return false;
+   }


That NSN test was bogus. It prevented the leaf page from being reused, 
if the parent page was *ever* split after the leaf page was created. I 
don't see any reason to check the NSN here. The NSN is normally used to 
detect if a (leaf) page has been concurrently split, when you descend 
the tree. We don't need to care about that here; as long as the 
FOLLOW_RIGHT flag is not set, the page has a downlink, and if we can 
find the downlink and the page is empty, we can delete it.

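In other words, the precondition in gistdeletepage reduces to just the
follow-right test; a sketch of the corrected check, per the reasoning above:

	if (GistFollowRight(leafPage))
	{
		/* Don't mess with a concurrent page split. */
		return false;
	}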

After removing that bogus NSN check, page reuse became much more 
effective. I've been testing this by running this test script repeatedly:


--
/*
create sequence gist_test_seq;
create table gist_point_tbl(id int4, p point);
create index gist_pointidx on gist_point_tbl using gist(p);
*/

insert into gist_point_tbl (id, p)
   select nextval('gist_test_seq'), point(nextval('gist_test_seq'), 
1000 + g) from generate_series(1, 1) g;


delete from gist_point_tbl where id < currval('gist_test_seq') - 2;
vacuum gist_point_tbl;

select pg_table_size('gist_point_tbl'), pg_indexes_size('gist_point_tbl');
--

It inserts a bunch of rows, deletes a bunch of older rows, and vacuums. 
The interesting thing here is that the key values keep "moving", so that 
new tuples are added to different places than where old ones are 
removed. That's the case where page reuse is needed.


Before this patch, the index bloated very quickly. With the patch, it 
still bloats, because we still don't delete internal pages. But it's a 
small fraction of the bloat you got before.


Attached is the latest patch version, to be applied on top of the 
IntegerSet patch.



I think we should make B-tree WAL record for this shared, remove
BlockNumber and other unused stuff, leaving just xid and db oid. And
reuse this record for B-tree, GiST and GIN (yeah, it is not checking
for that conflict).
Good point. For now, I didn't try to generalize this, but perhaps we 
should.



Though, I'm not sure it is important for GIN. Scariest thing that can
happen: it will return same tid twice. But it is doing bitmap scan,
you cannot return same bit twice...


Hmm. Could it return a completely unrelated tuple? We don't always 
recheck the original index quals in a bitmap index scan, IIRC. Also, a 
search might get confused if it's descending down a posting tree, and 
lands on a different kind of a page, altogether.


Alexander, you added the mechanism to GIN recently, to prevent pages 
from being reused too early (commit 52ac6cd2d0). Do we need something 
like B-tree's REUSE_PAGE records in GIN, too, to prevent the same bug 
from happening in hot standby?



PS. for Gist, we could almost use the LSN / NSN mechanism to detect the 
case that a deleted page is reused: Add a new field to the GiST page 
header, to store a new "deleteNSN" field. When a page is deleted, the 
deleted page's deleteNSN is set to the LSN of the deletion record. When 
the page is reused, the deleteNSN field is kept unchanged. When you 
follow a downlink during search, if you see that the page's deleteNSN > 
parent's LSN, you know that it was concurrently deleted and recycled, 
and should be ignored. That would allow reusing deleted pages 
immediately. Unfortunately that would require adding a new field to the 
gist page header/footer, which requires upgrade work :-(. Maybe one day, 
we'll bite the bullet. Something to keep in mind, if we have to change 
the page format anyway, for some reason.

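As a rough sketch of that idea (everything here is hypothetical, since no
deleteNSN field or accessor exists in the current page format):

/*
 * Hypothetical: detect that a downlink points to a page that was deleted
 * (and possibly recycled) after the parent page was last updated.
 */
static bool
gist_page_deleted_after_parent(Page page, XLogRecPtr parentlsn)
{
    return GistPageGetDeleteNSN(page) > parentlsn;  /* made-up accessor */
}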

- Heikki
>From d7a77ad483251b62514778d2235516f6f9237ad7 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 20 Mar 2019 20:24:44 +0200
Subject: [PATCH 2/2] Delete empty pages during GiST VACUUM

This commit teaches GiST to actually delete pages during VACUUM.

To do this we scan GiST two times. At first pass we make note of empty
pages and internal pages. At second pass we sca

Re: GiST VACUUM

2019-03-21 Thread Andrey Borodin



> On 21 March 2019, at 2:30, Heikki Linnakangas wrote:
> one remaining issue that needs to be fixed:
> 
> During Hot Standby, the B-tree code writes a WAL record, when a deleted 
> page is recycled, to prevent the deletion from being replayed too early 
> in the hot standby. See _bt_getbuf() and btree_xlog_reuse_page(). I 
> think we need to do something similar in GiST.
> 
> I'll try fixing that tomorrow, unless you beat me to it. Making the 
> changes is pretty straightforward, but it's a bit cumbersome to test.

I've tried to deal with it and got stuck... I think we should make the B-tree WAL 
record for this shared, remove BlockNumber and other unused stuff, leaving just 
the xid and db oid.
And reuse this record for B-tree, GiST and GIN (yeah, GIN is not checking for 
that conflict).

Though, I'm not sure it is important for GIN. The scariest thing that can happen: 
it will return the same tid twice. But it is doing a bitmap scan; you cannot return 
the same bit twice...

Eventually, hash, spgist and others will have the same thing too.

Best regards, Andrey Borodin.


Re: GiST VACUUM

2019-03-20 Thread Heikki Linnakangas

On 15/03/2019 20:25, Andrey Borodin wrote:

On 11 March 2019, at 20:03, Heikki Linnakangas wrote:

On 10/03/2019 18:40, Andrey Borodin wrote:

One thing still bothers me. Let's assume that we have internal
page with 2 deletable leaves. We lock these leaves in order of
items on internal page. Is it possible that 2nd page have
follow-right link on 1st and someone will lock 2nd page, try to
lock 1st and deadlock with VACUUM?


Hmm. If the follow-right flag is set on a page, it means that its
right sibling doesn't have a downlink in the parent yet.
Nevertheless, I think I'd sleep better, if we acquired the locks in
left-to-right order, to be safe.

>

Actually, I did not find lock coupling in the GiST code. But I decided
to lock just two pages at once (leaf, then parent, for every pair).
PFA v22 with this concurrency logic.


Good. I just noticed, that the README actually does say explicitly, that 
the child must be locked before the parent.


I rebased this over the new IntegerSet implementation, from the other 
thread, and did another round of refactoring, cleanups, etc. Attached is 
a new version of this patch. I'm also including the IntegerSet patch 
here, for convenience, but it's the same patch I posted at [1].


It's in pretty good shape, but there's one remaining issue that needs to be fixed:

During Hot Standby, the B-tree code writes a WAL record, when a deleted 
page is recycled, to prevent the deletion from being replayed too early 
in the hot standby. See _bt_getbuf() and btree_xlog_reuse_page(). I 
think we need to do something similar in GiST.


I'll try fixing that tomorrow, unless you beat me to it. Making the 
changes is pretty straightforward, but it's a bit cumbersome to test.


[1] 
https://www.postgresql.org/message-id/1035d8e6-cfd1-0c27-8902-40d8d45eb...@iki.fi


- Heikki
>From 4c05c69020334babdc1aa406c5032ae2861e5cb5 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 20 Mar 2019 02:26:08 +0200
Subject: [PATCH 1/2] Add IntegerSet, to hold large sets of 64-bit ints
 efficiently.

The set is implemented as a B-tree, with a compact representation at leaf
items, using Simple-8b algorithm, so that clusters of nearby values take
less space.

This doesn't include any use of the code yet, but we have two patches in
the works that would benefit from this:

* the GiST vacuum patch, to track empty GiST pages and internal GiST pages.

* Reducing memory usage, and also allowing more than 1 GB of memory to be
  used, to hold the dead TIDs in VACUUM.

This includes a unit test module, in src/test/modules/test_integerset.
It can be used to verify correctness, as a regression test, but if you run
it manully, it can also print memory usage and execution time of some of
the tests.

Author: Heikki Linnakangas, Andrey Borodin
Discussion: https://www.postgresql.org/message-id/b5e82599-1966-5783-733c-1a947ddb7...@iki.fi
---
 src/backend/lib/Makefile  |2 +-
 src/backend/lib/README|2 +
 src/backend/lib/integerset.c  | 1039 +
 src/include/lib/integerset.h  |   25 +
 src/test/modules/Makefile |1 +
 src/test/modules/test_integerset/.gitignore   |4 +
 src/test/modules/test_integerset/Makefile |   21 +
 src/test/modules/test_integerset/README   |7 +
 .../expected/test_integerset.out  |   14 +
 .../test_integerset/sql/test_integerset.sql   |   11 +
 .../test_integerset/test_integerset--1.0.sql  |8 +
 .../modules/test_integerset/test_integerset.c |  622 ++
 .../test_integerset/test_integerset.control   |4 +
 13 files changed, 1759 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/lib/integerset.c
 create mode 100644 src/include/lib/integerset.h
 create mode 100644 src/test/modules/test_integerset/.gitignore
 create mode 100644 src/test/modules/test_integerset/Makefile
 create mode 100644 src/test/modules/test_integerset/README
 create mode 100644 src/test/modules/test_integerset/expected/test_integerset.out
 create mode 100644 src/test/modules/test_integerset/sql/test_integerset.sql
 create mode 100644 src/test/modules/test_integerset/test_integerset--1.0.sql
 create mode 100644 src/test/modules/test_integerset/test_integerset.c
 create mode 100644 src/test/modules/test_integerset/test_integerset.control

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 191ea9bca26..3c1ee1df83a 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
-   ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
+   ilist.o integerset.o knapsack.o pairingheap.o rbtree.o stringinfo.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/README b/src/backend/lib/README
index ae5debe1bc6..f2fb591237d 100644
--- a/src/backe

Re: GiST VACUUM

2019-03-20 Thread Heikki Linnakangas

On 15/03/2019 20:25, Andrey Borodin wrote:

On 11 Mar 2019, at 20:03, Heikki Linnakangas  wrote:

On 10/03/2019 18:40, Andrey Borodin wrote:

One thing still bothers me. Let's assume that we have internal
page with 2 deletable leaves. We lock these leaves in order of
items on internal page. Is it possible that 2nd page have
follow-right link on 1st and someone will lock 2nd page, try to
lock 1st and deadlock with VACUUM?


Hmm. If the follow-right flag is set on a page, it means that its
right sibling doesn't have a downlink in the parent yet.
Nevertheless, I think I'd sleep better, if we acquired the locks in
left-to-right order, to be safe.

>

Actually, I did not find lock coupling in the GiST code. But I decided
to lock just two pages at once (leaf, then parent, for every pair).
PFA v22 with this concurrency logic.


Good. I just noticed that the README actually does say explicitly that
the child must be locked before the parent.


I rebased this over the new IntegerSet implementation, from the other 
thread, and did another round of refactoring, cleanups, etc. Attached is 
a new version of this patch. I'm also including the IntegerSet patch 
here, for convenience, but it's the same patch I posted at [1].


It's in pretty good shape, but there is one remaining issue that needs to be fixed:

During Hot Standby, the B-tree code writes a WAL record when a deleted 
page is recycled, to prevent the deletion from being replayed too early 
in the hot standby. See _bt_getbuf() and btree_xlog_reuse_page(). I 
think we need to do something similar in GiST.
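
For illustration, here is a rough sketch of what such a conflict-generating
"page reuse" record could look like on the GiST side, modeled on the B-tree
mechanism above. The record layout, the info flag value and the function name
are assumptions made for this sketch, not the code that was eventually committed:

/*
 * Hypothetical sketch, as if added to src/backend/access/gist/gistxlog.c.
 * Before handing out a deleted GiST page for reuse, emit a WAL record
 * carrying the XID that was stamped on the page when it was deleted.  The
 * record changes no page contents, so no buffer is registered; its only
 * purpose is to let hot-standby redo cancel queries that might still need
 * to see the old page contents, like the B-tree "reuse page" record does.
 */
typedef struct gistxlogPageReuse
{
	RelFileNode node;
	BlockNumber block;
	TransactionId latestRemovedXid;
} gistxlogPageReuse;

#define XLOG_GIST_PAGE_REUSE	0x60	/* info bit chosen arbitrarily here */

static void
gistXLogPageReuse(Relation rel, BlockNumber blkno, TransactionId deleteXid)
{
	gistxlogPageReuse xlrec_reuse;

	xlrec_reuse.node = rel->rd_node;
	xlrec_reuse.block = blkno;
	xlrec_reuse.latestRemovedXid = deleteXid;

	XLogBeginInsert();
	XLogRegisterData((char *) &xlrec_reuse, sizeof(xlrec_reuse));
	(void) XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_REUSE);
}

On the replay side, btree_xlog_reuse_page() resolves the conflict with
ResolveRecoveryConflictWithSnapshot(), and a GiST equivalent would presumably
do the same.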


I'll try fixing that tomorrow, unless you beat me to it. Making the 
changes is pretty straightforward, but it's a bit cumbersome to test.


[1] 
https://www.postgresql.org/message-id/1035d8e6-cfd1-0c27-8902-40d8d45eb...@iki.fi


- Heikki
>From 4c05c69020334babdc1aa406c5032ae2861e5cb5 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 20 Mar 2019 02:26:08 +0200
Subject: [PATCH 1/2] Add IntegerSet, to hold large sets of 64-bit ints
 efficiently.

The set is implemented as a B-tree, with a compact representation at leaf
items, using Simple-8b algorithm, so that clusters of nearby values take
less space.

This doesn't include any use of the code yet, but we have two patches in
the works that would benefit from this:

* the GiST vacuum patch, to track empty GiST pages and internal GiST pages.

* Reducing memory usage, and also allowing more than 1 GB of memory to be
  used, to hold the dead TIDs in VACUUM.

This includes a unit test module, in src/test/modules/test_integerset.
It can be used to verify correctness, as a regression test, but if you run
it manually, it can also print memory usage and execution time of some of
the tests.

Author: Heikki Linnakangas, Andrey Borodin
Discussion: https://www.postgresql.org/message-id/b5e82599-1966-5783-733c-1a947ddb7...@iki.fi
---
 src/backend/lib/Makefile  |2 +-
 src/backend/lib/README|2 +
 src/backend/lib/integerset.c  | 1039 +
 src/include/lib/integerset.h  |   25 +
 src/test/modules/Makefile |1 +
 src/test/modules/test_integerset/.gitignore   |4 +
 src/test/modules/test_integerset/Makefile |   21 +
 src/test/modules/test_integerset/README   |7 +
 .../expected/test_integerset.out  |   14 +
 .../test_integerset/sql/test_integerset.sql   |   11 +
 .../test_integerset/test_integerset--1.0.sql  |8 +
 .../modules/test_integerset/test_integerset.c |  622 ++
 .../test_integerset/test_integerset.control   |4 +
 13 files changed, 1759 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/lib/integerset.c
 create mode 100644 src/include/lib/integerset.h
 create mode 100644 src/test/modules/test_integerset/.gitignore
 create mode 100644 src/test/modules/test_integerset/Makefile
 create mode 100644 src/test/modules/test_integerset/README
 create mode 100644 src/test/modules/test_integerset/expected/test_integerset.out
 create mode 100644 src/test/modules/test_integerset/sql/test_integerset.sql
 create mode 100644 src/test/modules/test_integerset/test_integerset--1.0.sql
 create mode 100644 src/test/modules/test_integerset/test_integerset.c
 create mode 100644 src/test/modules/test_integerset/test_integerset.control

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 191ea9bca26..3c1ee1df83a 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
-   ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
+   ilist.o integerset.o knapsack.o pairingheap.o rbtree.o stringinfo.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/README b/src/backend/lib/README
index ae5debe1bc6..f2fb591237d 100644
--- a/src/backe

Re: GiST VACUUM

2019-03-15 Thread Jeff Janes
On Tue, Mar 5, 2019 at 8:21 AM Heikki Linnakangas  wrote:

> On 05/03/2019 02:26, Andrey Borodin wrote:
> >> I also tried your amcheck tool with this. It did not report any
> >> errors.
> >>
> >> Attached is also latest version of the patch itself. It is the
> >> same as your latest patch v19, except for some tiny comment
> >> kibitzing. I'll mark this as Ready for Committer in the commitfest
> >> app, and will try to commit it in the next couple of days.
> >
> > That's cool! I'll work on 2nd step of these patchset to make
> > blockset data structure prettier and less hacky.
>
> Committed the first patch. Thanks for the patch!
>

Thank you.  This is a transformational change; it will allow GiST indexes
larger than RAM to be used in some cases where they were simply not
feasible to use before.  On an HDD, it resulted in a 50-fold improvement in
vacuum time, and the machine went from unusably unresponsive to merely
sluggish during the vacuum.  On an SSD (albeit a very cheap laptop one, and
exposed from a Windows host to Ubuntu over VM VirtualBox) it is still a 30-fold
improvement, from a far faster baseline.  Even on an AWS instance with
a "GP2" SSD volume, which normally shows little benefit from sequential
reads, I get a 3-fold speed-up.

I also ran this through a lot of crash-recovery testing using simulated
torn-page writes using my traditional testing harness with high concurrency
(AWS c4.4xlarge and a1.4xlarge using 32 concurrent update processes) and
did not encounter any problems.  I tested both with btree_gist on a scalar
int, and on tsvector with each tsvector having 101 tokens.

I did notice that the space freed up in the index by vacuum doesn't seem to
get re-used very efficiently, but that is an ancestral problem independent
of this change.

Cheers,

Jeff


Re: GiST VACUUM

2019-03-15 Thread Andrey Borodin
Hi!

> On 11 Mar 2019, at 20:03, Heikki Linnakangas  wrote:
> 
> On 10/03/2019 18:40, Andrey Borodin wrote:
>> Here's new version of the patch for actual page deletion.
>> Changes:
>> 1. Fixed possible concurrency issue:
>> We were locking child page while holding lock on internal page. Notes
>> in GiST README recommend locking child before parent. Thus now we
>> unlock internal page before locking children for deletion, and lock
it again, check that everything is still in its place and delete
>> only then.
> 
> Good catch. The implementation is a bit broken, though. This segfaults:
Sorry for the noise. Few lines of old code leaked into the patch between 
testing and mailing.

> BTW, we don't seem to have test coverage for this in the regression tests. 
> The tests that delete some rows, where you updated the comment, doesn't 
> actually seem to produce any empty pages that could be deleted.
I've updated the test to delete more rows. But it did not trigger the previous bug 
anyway; we had to delete everything for this case.

> 
>> One thing still bothers me. Let's assume that we have internal page
>> with 2 deletable leaves. We lock these leaves in order of items on
>> internal page. Is it possible that 2nd page have follow-right link on
>> 1st and someone will lock 2nd page, try to lock 1st and deadlock with
>> VACUUM?
> 
> Hmm. If the follow-right flag is set on a page, it means that its right 
> sibling doesn't have a downlink in the parent yet. Nevertheless, I think I'd 
> sleep better, if we acquired the locks in left-to-right order, to be safe.
Actually, I did not find lock coupling in the GiST code. But I decided to lock 
just two pages at once (leaf, then parent, for every pair). PFA v22 with this 
concurrency logic.
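
To make the "leaf, then parent, for every pair" order concrete, here is a
simplified sketch of that locking dance; the function name and the exact
re-checks are assumptions for illustration, not the v22 patch itself:

/*
 * Sketch only, as if inside src/backend/access/gist/gistvacuum.c.  The leaf
 * is locked before its parent, following the README rule that a child must
 * be locked before the parent.  Because no locks were held while the
 * candidate pair was collected, everything is re-checked once both locks
 * are taken.
 */
static bool
gistdeleteemptyleaf(Relation rel, BlockNumber leafblk, BlockNumber parentblk)
{
	Buffer		leafbuf;
	Buffer		parentbuf;
	Page		leafpage;
	bool		deleted = false;

	leafbuf = ReadBuffer(rel, leafblk);
	LockBuffer(leafbuf, GIST_EXCLUSIVE);
	leafpage = BufferGetPage(leafbuf);

	parentbuf = ReadBuffer(rel, parentblk);
	LockBuffer(parentbuf, GIST_EXCLUSIVE);

	/*
	 * Re-check: the page must still be an empty leaf and must not be part
	 * of an incomplete split; the parent must still hold its downlink
	 * (that lookup is not sketched here).
	 */
	if (GistPageIsLeaf(leafpage) &&
		!GistFollowRight(leafpage) &&
		PageGetMaxOffsetNumber(leafpage) == 0)
	{
		/* ... remove the downlink from parentbuf, mark leafpage deleted ... */
		deleted = true;
	}

	UnlockReleaseBuffer(parentbuf);
	UnlockReleaseBuffer(leafbuf);
	return deleted;
}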

> 
> An easy cop-out would be to use LWLockConditionalAcquire, and bail out if we 
> can't get the lock. But if it's not too complicated, it'd be nice to acquire 
> the locks in the correct order to begin with.
> 
> I did a round of cleanup and moving things around, before I bumped into the 
> above issue. I'm including them here as separate patches, for easier review, 
> but they should of course be squashed into yours. For convenience, I'm 
> including your patches here as well, so that all the patches you need are in 
> the same place, but they are identical to what you posted.
I've rebased all these patches so that VACUUM should work on every commit. 
Let's just squash them for the next iteration, it was quite a messy rebase.
BTW, you deleted numEmptyPages; this makes the code cleaner, but that variable 
could stop gistvacuum_recycle_pages() early, once everything is already recycled.


Thanks!

Best regards, Andrey Borodin.



0001-Add-block-set-data-structure-v22.patch
Description: Binary data


0002-Delete-pages-during-GiST-VACUUM-v22.patch
Description: Binary data


0003-Minor-cleanup-v22.patch
Description: Binary data


0004-Move-the-page-deletion-logic-to-separate-function-v2.patch
Description: Binary data


0005-Remove-numEmptyPages-.-it-s-not-really-needed-v22.patch
Description: Binary data


0006-Misc-cleanup-v22.patch
Description: Binary data


Re: GiST VACUUM

2019-03-11 Thread Heikki Linnakangas
imit members
+ * and checks that block set behavior is similar to Bitmapset
+ */
+static void test_blockset_bms_compliance(int32_t test_limit)
+{
+	BlockSet bs = NULL;
+	Bitmapset *bms = NULL;
+	BlockNumber blockno;
+	int index;
+	int32_t point_index = 0;
+
+	for (int32_t i = 0; i < test_limit; i++)
+	{
+		blockno = random() & INT32_MAX;
+		/* bms does not support block numbers above INT32_MAX */
+		bs = blockset_set(bs, blockno);
+		bms = bms_add_member(bms, (int)blockno);
+	}
+
+	index = -1;
+	blockno = InvalidBlockNumber;
+
+	while (true)
+	{
+		point_index++;
+		BlockNumber next_bn = blockset_next(bs, blockno);
+		int next_index = bms_next_member(bms, index);
+
+
+		if (next_bn == InvalidBlockNumber && next_index == -2)
+			return; /* We have found everything */
+
+		if (((BlockNumber)next_index) != next_bn)
+		{
+			elog(ERROR,
+ "Bitmapset returned value %X different from block set %X,"
+ " test_limit %d, point index %d",
+ next_index, next_bn, test_limit, point_index);
+		}
+
+		if (!blockset_get(next_bn, bs))
+		{
+			elog(ERROR,
+ "Block set did not found present item %X"
+ " test_limit %d, point index %d",
+ next_bn, test_limit, point_index);
+		}
+
+		index = next_index;
+		blockno = next_bn;
+	}
+
+	for (int32_t i = 0; i < test_limit; i++)
+	{
+		blockno = random() & INT32_MAX;
+		if (blockset_get(blockno, bs) != bms_is_member((int)blockno, bms))
+		{
+			elog(ERROR,
+ "Block set did agree with bitmapset item %X"
+ " test_limit %d, point index %d",
+ blockno, test_limit, point_index);
+		}
+	}
+
+	blockset_free(bs);
+	bms_free(bms);
+}
+
+/* 
+ * This test is similar to test_blockset_bms_compliance()
+ * except that it shifts BlockNumbers by one bit to ensure that blockset
+ * operates correctly on values higher that INT32_MAX
+ * This function is copy-pasted from previous with the exception of barrel
+ * shifts for BlockNumbers. I've tried various refactorings, but they all
+ * looked ugly.
+ */
+static void test_blockset_big_block_numbers(int32_t test_limit)
+{
+	BlockSet bs = NULL;
+	Bitmapset *bms = NULL;
+	BlockNumber blockno;
+	int index;
+	int32_t point_index = 0;
+
+	for (int32_t i = 0; i < test_limit; i++)
+	{
+		blockno = random() & INT32_MAX;
+		/* bms does not support block numbers above INT32_MAX */
+		bs = blockset_set(bs, blockno << 1);
+		bms = bms_add_member(bms, (int)blockno);
+	}
+
+	index = -1;
+	blockno = InvalidBlockNumber;
+
+	while (true)
+	{
+		point_index++;
+		BlockNumber next_bn = blockset_next(bs, blockno);
+		int next_index = bms_next_member(bms, index);
+
+
+		if (next_bn == InvalidBlockNumber && next_index == -2)
+			return; /* We have found everything */
+
+		if (((BlockNumber)next_index) != (next_bn >> 1))
+		{
+			elog(ERROR,
+ "Bitmapset returned value %X different from block set %X,"
+ " test_limit %d, point index %d",
+ next_index, next_bn, test_limit, point_index);
+		}
+
+		if (!blockset_get(next_bn, bs))
+		{
+			elog(ERROR,
+ "Block set did not found present item %X"
+ " test_limit %d, point index %d",
+ next_bn, test_limit, point_index);
+		}
+
+		index = next_index;
+		blockno = next_bn;
+	}
+
+	for (int32_t i = 0; i < test_limit; i++)
+	{
+		blockno = random() & INT32_MAX;
+		if (blockset_get(blockno << 1, bs) != bms_is_member((int)blockno, bms))
+		{
+			elog(ERROR,
+ "Block set did agree with bitmapset item %X"
+ " test_limit %d, point index %d",
+ blockno, test_limit, point_index);
+		}
+	}
+
+	blockset_free(bs);
+	bms_free(bms);
+}
diff --git a/src/test/modules/test_blockset/test_blockset.control b/src/test/modules/test_blockset/test_blockset.control
new file mode 100644
index 000..fdb7598c5a7
--- /dev/null
+++ b/src/test/modules/test_blockset/test_blockset.control
@@ -0,0 +1,4 @@
+comment = 'Test code for block set library'
+default_version = '1.0'
+module_pathname = '$libdir/test_blockset'
+relocatable = true
-- 
2.20.1

>From 1e477f083cd639117944c7910db8aff0c763b4f6 Mon Sep 17 00:00:00 2001
From: Andrey 
Date: Fri, 8 Mar 2019 21:24:37 +0500
Subject: [PATCH 2/6] Delete pages during GiST VACUUM

This commit teaches GiST to actually delete pages during VACUUM.

To do this we scan GiST two times. At first pass we make notes
of empty pages and internal pages. At second pass we scan through
internal pages looking for references to empty leaf pages.
---
 src/backend/access/gist/README |  14 ++
 src/backend/access/gist/gist.c |  18 ++
 src/backend/access/gist/gistutil.c |   3 +-
 src/backend/access/gist/gistvacuum.c   | 218 -
 src/backend/access/gist/gistxlog.c |  60 +++
 src/backend/access/rmgrdesc/gistdesc.c |   3 +
 src/include/access/gist.h  |   4 +
 src/include/access/gist_private.h  |   7 +-
 src/include/access/gistxlog.h  

Re: GiST VACUUM

2019-03-10 Thread Andrey Borodin

> On 5 Mar 2019, at 18:21, Heikki Linnakangas  wrote:
> 
> On 05/03/2019 02:26, Andrey Borodin wrote:
>> 
>> That's cool! I'll work on 2nd step of these patchset to make
>> blockset data structure prettier and less hacky.
> 
> Committed the first patch. Thanks for the patch!

That's cool! Thanks!

> I'll change the status of this patch to "Waiting on Author", to reflect the 
> state of the 2nd patch, since you're working on the radix tree blockset stuff.

Here's new version of the patch for actual page deletion.
Changes:
1. Fixed possible concurrency issue:
We were locking child page while holding lock on internal page. Notes in GiST 
README recommend locking child before parent.
Thus now we unlock internal page before locking children for deletion, and lock 
it again, check that everything is still on it's place and delete only then.
One thing still bothers me. Let's assume that we have internal page with 2 
deletable leaves. We lock these leaves in order of items on internal page. Is 
it possible that the 2nd page has a follow-right link to the 1st, and someone will lock the 
2nd page, try to lock 1st and deadlock with VACUUM?
2. Added radix-tree-based block set to lib, with tests.

Best regards, Andrey Borodin.


0002-Delete-pages-during-GiST-VACUUM.patch
Description: Binary data


0001-Add-block-set-data-structure.patch
Description: Binary data


Re: GiST VACUUM

2019-03-05 Thread Heikki Linnakangas

On 05/03/2019 02:26, Andrey Borodin wrote:
I also tried your amcheck tool with this. It did not report any 
errors.


Attached is also latest version of the patch itself. It is the
same as your latest patch v19, except for some tiny comment
kibitzing. I'll mark this as Ready for Committer in the commitfest
app, and will try to commit it in the next couple of days.


That's cool! I'll work on 2nd step of these patchset to make
blockset data structure prettier and less hacky.


Committed the first patch. Thanks for the patch!

I did some last minute copy-editing on the comments, and fixed one 
little thing that we missed earlier: the check for "invalid tuples" that 
might be left over in pg_upgraded pre-9.1 indexes, was lost at some 
point. I added that check back. (It would be nice if GiST indexes had a 
metadata page with a version number, so we could avoid that work in the 
99% of cases that that check is not needed, but that's a different story.)


I'll change the status of this patch to "Waiting on Author", to reflect 
the state of the 2nd patch, since you're working on the radix tree 
blockset stuff.


- Heikki



Re: GiST VACUUM

2019-03-04 Thread Andrey Borodin
Hi!

Thanks for fixing gist amcheck! The idea of using these two patches together 
seems so obvious now, but never actually visited my mind before.

> On 4 Mar 2019, at 18:04, Heikki Linnakangas  wrote:
> Thanks! As I noted at 
> https://www.postgresql.org/message-id/2ff57b1f-01b4-eacf-36a2-485a12017f6e%40iki.fi,
>  the test patch left the index corrupt. I fixed it so that it leaves behind 
> incompletely split pages, without the corruption, see attached. It's similar 
> to yours, but let me recap what it does:
> 
> * Hacks gistbuild(), create 100 empty pages immediately after the root pages. 
> They are leaked, so they won't be reused until a VACUUM sees them and 
> puts them to the FSM
> 
> * Hacks gistinserttuples(), to leave the split incomplete with 50% 
> probability
> 
> * In vacuum, print a line to the log whenever it needs to "jump left"
> 
> I used this patch, with the attached test script that's similar to yours, but 
> it also tries to verify that the index returns correct results. It prints a 
> result set like this:
> 
>   sum
> -
> -364450
>  364450
> (2 rows)
> 
> If the two numbers add up to 0, the index seems valid. And you should see 
> "RESCAN" lines in the log, to show that jumping left happened. Because the 
> behavior is random and racy, you may need to run the script many times to see 
> both "RESCAN TRIGGERED BY NSN" and "RESCAN TRIGGERED BY FollowRight" cases. 
> Especially the "FollowRight" case happens less frequently than the NSN case, 
> you might need to run the script > 10 times to see it.
Great! I've repeated your tests on my machine, I observe similar frequencies of 
causes of rescan left jumps.

> I also tried your amcheck tool with this. It did not report any errors.
> 
> Attached is also latest version of the patch itself. It is the same as your 
> latest patch v19, except for some tiny comment kibitzing. I'll mark this as 
> Ready for Committer in the commitfest app, and will try to commit it in the 
> next couple of days.

That's cool! I'll work on 2nd step of these patchset to make blockset data 
structure prettier and less hacky.

Best regards, Andrey Borodin.


Re: GiST VACUUM

2019-03-04 Thread Heikki Linnakangas

On 04/01/2019 02:47, Andrey Borodin wrote:

On 2 Jan 2019, at 20:35, Heikki Linnakangas  wrote:

In patch #1, to do the vacuum scan in physical order:
...
I think this is ready to be committed, except that I didn't do any testing. We 
discussed the need for testing earlier. Did you write some test scripts for 
this, or how have you been testing?

Please see test I used to check left jumps for v18:
0001-Physical-GiST-scan-in-VACUUM-v18-with-test-modificat.patch
0002-Test-left-jumps-v18.patch

To trigger FollowRight, GiST sometimes "forgets" to clear the follow-right marker, 
simulating a crash of an insert. This fills the logs with "fixing incomplete split" 
messages. Search for "REMOVE THIS" to disable these ill-behavior triggers.
To trigger an NSN jump, GiST allocates an empty page after every real allocation.

In log output I see
2019-01-03 22:27:30.028 +05 [54596] WARNING:  RESCAN TRIGGERED BY NSN
WARNING:  RESCAN TRIGGERED BY NSN
2019-01-03 22:27:30.104 +05 [54596] WARNING:  RESCAN TRIGGERED BY FollowRight
This means that code paths were really executed (for v18).


Thanks! As I noted at 
https://www.postgresql.org/message-id/2ff57b1f-01b4-eacf-36a2-485a12017f6e%40iki.fi, 
the test patch left the index corrupt. I fixed it so that it leaves 
behind incompletely split pages, without the corruption, see attached. 
It's similar to yours, but let me recap what it does:


* Hacks gistbuild(), create 100 empty pages immediately after the root 
pages. They are leaked, so they won't be reused until a VACUUM sees 
them and puts them to the FSM


* Hacks gistinserttuples(), to leave the split incomplete with 50% 
probability


* In vacuum, print a line to the log whenever it needs to "jump left"

I used this patch, with the attached test script that's similar to 
yours, but it also tries to verify that the index returns correct 
results. It prints a result set like this:


   sum
-
 -364450
  364450
(2 rows)

If the two numbers add up to 0, the index seems valid. And you should 
see "RESCAN" lines in the log, to show that jumping left happened. 
Because the behavior is random and racy, you may need to run the script 
many times to see both "RESCAN TRIGGERED BY NSN" and "RESCAN TRIGGERED 
BY FollowRight" cases. Especially the "FollowRight" case happens less 
frequently than the NSN case, you might need to run the script > 10 
times to see it.


I also tried your amcheck tool with this. It did not report any errors.

Attached is also latest version of the patch itself. It is the same as 
your latest patch v19, except for some tiny comment kibitzing. I'll mark 
this as Ready for Committer in the commitfest app, and will try to 
commit it in the next couple of days.


- Heikki


gist-vacuum-test.sh
Description: application/shellscript
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 3f52b8f4dc..cad4a2a46e 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -1187,6 +1187,8 @@ gistinserttuple(GISTInsertState *state, GISTInsertStack *stack,
 			InvalidBuffer, InvalidBuffer, false, false);
 }
 
+static bool HACK = false;
+
 /* 
  * An extended workhorse version of gistinserttuple(). This version allows
  * inserting multiple tuples, or replacing a single tuple with multiple tuples.
@@ -1230,6 +1232,14 @@ gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 	CheckForSerializableConflictIn(state->r, NULL, stack->buffer);
 
 	/* Insert the tuple(s) to the page, splitting the page if necessary */
+	if (BufferIsValid(leftchild) && HACK)
+	{
+		/* skip actually inserting */
+		splitinfo = NULL;
+		is_split = false;
+	}
+	else
+	{
 	is_split = gistplacetopage(state->r, state->freespace, giststate,
 			   stack->buffer,
 			   tuples, ntup,
@@ -1238,6 +1248,7 @@ gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 			   ,
 			   true,
 			   state->heapRel);
+	}
 
 	/*
 	 * Before recursing up in case the page was split, release locks on the
@@ -1256,7 +1267,12 @@ gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 	 * didn't have to split, release it ourselves.
 	 */
 	if (splitinfo)
+	{
+		if (random() % 2 == 0)
+			HACK = true;
 		gistfinishsplit(state, stack, giststate, splitinfo, unlockbuf);
+		HACK = false;
+	}
 	else if (unlockbuf)
 		LockBuffer(stack->buffer, GIST_UNLOCK);
 
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index bd142a3560..fdfc54b009 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -201,6 +201,9 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	buildstate.indtuplesSize = 0;
 
+	for (int i = 0; i < 100; i++)
+		ReleaseBuffer(ReadBuffer(index, P_NEW));
+
 	/*
 	 * Do the heap scan.
 	 */
diff --git a/src/backend/access/gist/gistvac

Re: GiST VACUUM

2019-03-04 Thread Heikki Linnakangas

On 04/01/2019 21:26, Andrey Borodin wrote:

On 3 Jan 2019, at 23:47, Andrey Borodin  wrote:



* Bitmapset stores 32 bit signed integers, but BlockNumber is
unsigned. So this will fail with an index larger than 2^31
blocks. That's perhaps academical, I don't think anyone will try
to create a 16 TB GiST index any time soon. But it feels wrong to
introduce an arbitrary limitation like that.


Looks like bitmapset is unsuitable for collecting block numbers due
to the interface. Let's just create custom container for this? Or
there is one other option: for each block number throw away sign
bit and consider page potentially internal and potentially empty
leaf if (blkno & 0x7FFFFFFF) is in bitmapset. And then we will have
to create at least one 17Tb GiST to check it actually works.


Heikki, what do you think: is implementing our own radix tree for this a
viable solution? I've written a working implementation with a 4-level
statically typed tree. If we follow this route, there should probably
be tests for this data structure.


Yeah, seems reasonable.

- Heikki



Re: GiST VACUUM

2019-01-04 Thread Andrey Borodin
On 3 Jan 2019, at 23:47, Andrey Borodin  wrote:
> 
>> * Bitmapset stores 32 bit signed integers, but BlockNumber is unsigned. So 
>> this will fail with an index larger than 2^31 blocks. That's perhaps 
>> academical, I don't think anyone will try to create a 16 TB GiST index any 
>> time soon. But it feels wrong to introduce an arbitrary limitation like that.
> Looks like bitmapset is unsuitable for collecting block numbers due to the 
> interface. Let's just create custom container for this?
> Or there is one other option: for each block number throw away sign bit and 
> consider page potentially internal and potentially empty leaf if (blkno & 
> 0x7FFFFFFF) is in bitmapset.
> And then we will have to create at least one 17Tb GiST to check it actually 
> works.

Heikki, what do you think: is implementing our own radix tree for this a viable 
solution?
I've written a working implementation with a 4-level statically typed tree. If we 
follow this route, there should probably be tests for this data structure.
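
For readers who have not opened the attachment, the shape of such a statically
typed 4-level tree is roughly the following; this is only an illustrative sketch
keyed on the four bytes of a BlockNumber, not the code in radix_tree.diff:

/* Sketch only: one tree level per byte of the 32-bit block number. */
typedef struct BlockSetLeaf
{
	uint64		bitmap[4];		/* 256 bits, one per low-byte value */
} BlockSetLeaf;

typedef struct BlockSetL3
{
	BlockSetLeaf *child[256];
} BlockSetL3;

typedef struct BlockSetL2
{
	BlockSetL3 *child[256];
} BlockSetL2;

typedef struct BlockSetData		/* root level */
{
	BlockSetL2 *child[256];
} BlockSetData;

typedef BlockSetData *BlockSet;

/* Membership test: four pointer hops plus one bit test. */
static bool
blockset_get(BlockNumber blkno, BlockSet bs)
{
	BlockSetL2 *l2;
	BlockSetL3 *l3;
	BlockSetLeaf *leaf;

	if (bs == NULL || (l2 = bs->child[blkno >> 24]) == NULL)
		return false;
	if ((l3 = l2->child[(blkno >> 16) & 0xFF]) == NULL)
		return false;
	if ((leaf = l3->child[(blkno >> 8) & 0xFF]) == NULL)
		return false;
	return ((leaf->bitmap[(blkno & 0xFF) >> 6] >> (blkno & 63)) & 1) != 0;
}

Intermediate nodes are allocated on demand, so dense clusters of block numbers
share nodes; the IntegerSet structure the thread eventually settles on packs
such clusters even more tightly with Simple-8b-encoded leaf items.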

Best regards, Andrey Borodin.


radix_tree.diff
Description: Binary data


Re: GiST VACUUM

2019-01-03 Thread Andrey Borodin
Cool, thanks!

> On 2 Jan 2019, at 20:35, Heikki Linnakangas  wrote:
> 
> In patch #1, to do the vacuum scan in physical order:
> ...
> I think this is ready to be committed, except that I didn't do any testing. 
> We discussed the need for testing earlier. Did you write some test scripts 
> for this, or how have you been testing?
Please see test I used to check left jumps for v18:
0001-Physical-GiST-scan-in-VACUUM-v18-with-test-modificat.patch
0002-Test-left-jumps-v18.patch


0002-Test-left-jumps-v18.patch
Description: Binary data


0001-Physical-GiST-scan-in-VACUUM-v18-with-test-modificat.patch
Description: Binary data

To trigger FollowRight, GiST sometimes "forgets" to clear the follow-right marker, 
simulating a crash of an insert. This fills the logs with "fixing incomplete split" 
messages. Search for "REMOVE THIS" to disable these ill-behavior triggers.
To trigger an NSN jump, GiST allocates an empty page after every real allocation.

In log output I see
2019-01-03 22:27:30.028 +05 [54596] WARNING:  RESCAN TRIGGERED BY NSN
WARNING:  RESCAN TRIGGERED BY NSN
2019-01-03 22:27:30.104 +05 [54596] WARNING:  RESCAN TRIGGERED BY FollowRight
This means that code paths were really executed (for v18).

> 
> Patch #2:

> 
> * Bitmapset stores 32 bit signed integers, but BlockNumber is unsigned. So 
> this will fail with an index larger than 2^31 blocks. That's perhaps 
> academical, I don't think anyone will try to create a 16 TB GiST index any 
> time soon. But it feels wrong to introduce an arbitrary limitation like that.
Looks like bitmapset is unsuitable for collecting block numbers, due to its 
interface. Let's just create a custom container for this?
Or there is one other option: for each block number, throw away the sign bit and 
consider the page potentially internal and potentially an empty leaf if 
(blkno & 0x7FFFFFFF) is in the bitmapset.
And then we will have to create at least one 17 TB GiST index to check that it 
actually works.
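
To spell out the type problem and the sign-bit workaround (illustration only;
remember_block is a made-up helper name):

/*
 * BlockNumber is an unsigned 32-bit type, while Bitmapset members are
 * signed ints, so block numbers at or above 2^31 (an index larger than
 * 16 TB at the default 8 kB page size) do not fit directly.
 */
static Bitmapset *
remember_block(Bitmapset *bms, BlockNumber blkno)
{
	/*
	 * The proposed workaround: drop the sign bit.  Two different blocks can
	 * then map to the same bit, so a set bit only ever means "possibly an
	 * internal page" / "possibly an empty leaf", never a definite yes.
	 */
	return bms_add_member(bms, (int) (blkno & 0x7FFFFFFF));
}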

> * I was surprised that the bms_make_empty() function doesn't set the 'nwords' 
> field to match the allocated size. Perhaps that was intentional, so that you 
> don't need to scan the empty region at the end, when you scan through all 
> matching bits? Still, seems surprising, and needs a comment at least.
Explicitly set nwords to zero. Looking at the code now, I do not see this 
nwords==0 as a very good idea. Probably, it's effective, but it's hacky, 
creates implicit expectations in code.

> * We're now scanning all internal pages in the 2nd phase. Not only those 
> internal pages that contained downlinks to empty leaf pages. That's probably 
> OK, but the comments need updating on that.
Adjusted the comments. The if before the loop
> if (vstate.emptyPages > 0)
seems superfluous, but I kept it until we solve the problem with the 31-bit 
bitmapset.
> * In general, comments need to be fixed and added in several places. For 
> example, there's no clear explanation on what the "delete XID", stored in 
> pd_prune_xid, means. (I know what it is, because I'm familiar with the same 
> mechanism in B-tree, but it's not clear from the code itself.)
I've added comment to GistPageSetDeleteXid()

* In this check
> if (GistPageIsDeleted(page) && 
> TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin))
I've switched from using RecentGlobalDataXmin to RecentGlobalXmin, because we have 
done so in the similar mechanism for GIN (for uniformity with B-tree).


Thanks for working on this!


Best regards, Andrey Borodin.




0002-Delete-pages-during-GiST-VACUUM-v19.patch
Description: Binary data


0001-Physical-GiST-scan-in-VACUUM-v19.patch
Description: Binary data


Re: GiST VACUUM

2019-01-02 Thread Heikki Linnakangas
+			(opaque->rightlink < orig_blkno))
+		{
+			recurse_to = opaque->rightlink;
+		}
 
-if (RelationNeedsWAL(rel))
-{
-	XLogRecPtr	recptr;
+		/*
+		 * Scan over all items to see which ones need deleted according to the
+		 * callback function.
+		 */
+		if (callback)
+		{
+			OffsetNumber off;
 
-	recptr = gistXLogUpdate(buffer,
-			todelete, ntodelete,
-			NULL, 0, InvalidBuffer);
-	PageSetLSN(page, recptr);
-}
-else
-	PageSetLSN(page, gistGetFakeLSN(rel));
+			for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
+			{
+ItemId		iid = PageGetItemId(page, off);
+IndexTuple	idxtuple = (IndexTuple) PageGetItem(page, iid);
 
-END_CRIT_SECTION();
+if (callback(&(idxtuple->t_tid), callback_state))
+	todelete[ntodelete++] = off;
 			}
-
 		}
-		else
+
+		/*
+		 * Apply any needed deletes.  We issue just one WAL record per page,
+		 * so as to minimize WAL traffic.
+		 */
+		if (ntodelete)
 		{
-			/* check for split proceeded after look at parent */
-			pushStackIfSplited(page, stack);
+			START_CRIT_SECTION();
 
-			maxoff = PageGetMaxOffsetNumber(page);
+			MarkBufferDirty(buffer);
 
-			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+			PageIndexMultiDelete(page, todelete, ntodelete);
+			GistMarkTuplesDeleted(page);
+
+			if (RelationNeedsWAL(rel))
 			{
-iid = PageGetItemId(page, i);
-idxtuple = (IndexTuple) PageGetItem(page, iid);
-
-ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
-ptr->blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
-ptr->parentlsn = BufferGetLSNAtomic(buffer);
-ptr->next = stack->next;
-stack->next = ptr;
-
-if (GistTupleIsInvalid(idxtuple))
-	ereport(LOG,
-			(errmsg("index \"%s\" contains an inner tuple marked as invalid",
-	RelationGetRelationName(rel)),
-			 errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
-			 errhint("Please REINDEX it.")));
+XLogRecPtr	recptr;
+
+recptr = gistXLogUpdate(buffer,
+		todelete, ntodelete,
+		NULL, 0, InvalidBuffer);
+PageSetLSN(page, recptr);
 			}
-		}
+			else
+PageSetLSN(page, gistGetFakeLSN(rel));
 
-		UnlockReleaseBuffer(buffer);
+			END_CRIT_SECTION();
+
+			stats->tuples_removed += ntodelete;
+			/* must recompute maxoff */
+			maxoff = PageGetMaxOffsetNumber(page);
+		}
 
-		ptr = stack->next;
-		pfree(stack);
-		stack = ptr;
+		stats->num_index_tuples += maxoff - FirstOffsetNumber + 1;
 
-		vacuum_delay_point();
 	}
 
-	return stats;
+	UnlockReleaseBuffer(buffer);
+
+	/*
+	 * This is really tail recursion, but if the compiler is too stupid to
+	 * optimize it as such, we'd eat an uncomfortably large amount of stack
+	 * space per recursion level (due to the deletable[] array). A failure is
+	 * improbable since the number of levels isn't likely to be large ... but
+	 * just in case, let's hand-optimize into a loop.
+	 */
+	if (recurse_to != InvalidBlockNumber)
+	{
+		blkno = recurse_to;
+		goto restart;
+	}
 }
-- 
2.19.2

>From 62fbe0ce5506e006b92dbfb07aee7414040d982f Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 2 Jan 2019 16:00:16 +0200
Subject: [PATCH 2/2] Delete pages during GiST VACUUM v18-heikki

---
 src/backend/access/gist/README |  14 +++
 src/backend/access/gist/gist.c |  18 +++
 src/backend/access/gist/gistutil.c |   3 +-
 src/backend/access/gist/gistvacuum.c   | 152 -
 src/backend/access/gist/gistxlog.c |  60 ++
 src/backend/access/rmgrdesc/gistdesc.c |   3 +
 src/backend/nodes/bitmapset.c  |  16 +++
 src/include/access/gist.h  |   3 +
 src/include/access/gist_private.h  |   7 +-
 src/include/access/gistxlog.h  |  10 +-
 src/include/nodes/bitmapset.h  |   1 +
 src/test/regress/expected/gist.out |   4 +-
 src/test/regress/sql/gist.sql  |   4 +-
 13 files changed, 282 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 02228662b81..c84359de310 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -413,6 +413,20 @@ emptied yet; tuples never move upwards in the tree. The final emptying loops
 through buffers at a given level until all buffers at that level have been
 emptied, and then moves down to the next level.
 
+Bulk delete algorithm (VACUUM)
+--
+
+Function gistbulkdelete() is responsible for marking empty leaf pages as free
+so that they can be used it allocate newly split pages. To find this pages
+function scans index in physical order.
+
+Physical scan reads the entire index from the first page to last. This scan
+maintains information necessary to collect block numbers of internal pages
+that need cleansing and block number of empty leafs.
+
+Af

Re: GiST VACUUM

2019-01-02 Thread Heikki Linnakangas

On 28/10/2018 19:32, Andrey Borodin wrote:

Hi everyone!


On 2 Oct 2018, at 6:14, Michael Paquier  wrote:
Andrey, your latest patch does not apply.  I am moving this to the next
CF, waiting for your input.


I'm doing preps for CF.
Here's rebased version.


Thanks, I had another look at these.

In patch #1, to do the vacuum scan in physical order:

* The starting NSN was not acquired correctly for unlogged and temp 
relations. They don't use WAL, so their NSN values are based on the 
'unloggedLSN' counter, rather than current WAL insert pointer. So must 
use gistGetFakeLSN() rather than GetInsertRecPtr() for them. Fixed that.


* Adjusted comments a little bit, mostly by copy-pasting the 
better-worded comments from the corresponding nbtree code, and ran pgindent.


I think this is ready to be committed, except that I didn't do any 
testing. We discussed the need for testing earlier. Did you write some 
test scripts for this, or how have you been testing?



Patch #2:

* Bitmapset stores 32 bit signed integers, but BlockNumber is unsigned. 
So this will fail with an index larger than 2^31 blocks. That's perhaps 
academical, I don't think anyone will try to create a 16 TB GiST index 
any time soon. But it feels wrong to introduce an arbitrary limitation 
like that.


* I was surprised that the bms_make_empty() function doesn't set the 
'nwords' field to match the allocated size. Perhaps that was 
intentional, so that you don't need to scan the empty region at the end, 
when you scan through all matching bits? Still, seems surprising, and 
needs a comment at least.


* We're now scanning all internal pages in the 2nd phase. Not only those 
internal pages that contained downlinks to empty leaf pages. That's 
probably OK, but the comments need updating on that.


* In general, comments need to be fixed and added in several places. For 
example, there's no clear explanation on what the "delete XID", stored 
in pd_prune_xid, means. (I know what it is, because I'm familiar with 
the same mechanism in B-tree, but it's not clear from the code itself.)


These can be fixed, they're not show-stoppers, but patch #2 isn't quite 
ready yet.


- Heikki



Re: GiST VACUUM

2018-11-30 Thread Dmitry Dolgov
> On Sun, Oct 28, 2018 at 6:32 PM Andrey Borodin  wrote:
>
> Hi everyone!
>
> > On 2 Oct 2018, at 6:14, Michael Paquier  wrote:
> > Andrey, your latest patch does not apply.  I am moving this to the next
> > CF, waiting for your input.
>
> I'm doing preps for CF.
> Here's rebased version.

Looks like this patch has been waiting for input since August. Could any of the
reviewers (Heikki?) please take a look at the latest version? In the meantime
I'm moving it to the next CF.



Re: GiST VACUUM

2018-10-28 Thread Andrey Borodin
Hi everyone!

> On 2 Oct 2018, at 6:14, Michael Paquier  wrote:
> Andrey, your latest patch does not apply.  I am moving this to the next
> CF, waiting for your input.

I'm doing preps for CF.
Here's rebased version.

Best regards, Andrey Borodin.


0002-Delete-pages-during-GiST-VACUUM-v17.patch
Description: Binary data


0001-Physical-GiST-scan-in-VACUUM-v17.patch
Description: Binary data
 



Re: GiST VACUUM

2018-10-01 Thread Michael Paquier
On Mon, Aug 06, 2018 at 11:12:00PM +0500, Andrey Borodin wrote:
> Done. Added function bms_make_empty(int size)

Andrey, your latest patch does not apply.  I am moving this to the next
CF, waiting for your input.
--
Michael


signature.asc
Description: PGP signature


Re: GiST VACUUM

2018-08-06 Thread Andrey Borodin
Hi!

PFA v16.

> On 5 Aug 2018, at 21:45, Andrey Borodin  wrote:
>> On 5 Aug 2018, at 16:18, Heikki Linnakangas  wrote:
>> 
>> Hmm. A ListCell is 16 bytes, plus the AllocChunk header, 16 bytes. 32
>> bytes per internal page in total, while a bitmap consumes one bit per page, 
>> leaf or internal. If I'm doing
>> my math right, assuming the ratio of leaf pages vs internal pages is
>> 1:200, a List actually consumes more memory than a bitmap; 256 bits per
>> internal page, vs. 200 bits. An array, with 4 bytes per block number,
>> would be the winner here.
> We do not know size of this array beforehand. I can implement normal 
> ArrayList though (with repallocing array) or linked list of chunks. Or 
> anything from data structures zoo.
> Or just stick with bitmap (my preferred way).
Done.
>> 
>>> But I have to note that default growth strategy of Bitmap is not good: we 
>>> will be repallocing byte by byte.
>> 
>> True, that repallocing seems bad. You could force it to be pre-allocated
>> by setting the last bit. Or add a function to explicitly enlarge the bitmap.
> OK, I'll think of proper resize function (ensure capacity, to be precise).
Done. Added function bms_make_empty(int size)
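
A pre-sizing constructor along those lines could look roughly like this (a
sketch written as if inside src/backend/nodes/bitmapset.c, where the private
WORDNUM and BITMAPSET_SIZE macros are in scope; the patch's actual version
differs in how it treats nwords, which is what gets debated in review):

Bitmapset *
bms_make_empty(int size)
{
	Bitmapset  *result;
	int			wordnum = WORDNUM(size - 1);	/* size must be > 0 */

	/* Allocate and zero room for bits 0 .. size-1 up front */
	result = (Bitmapset *) palloc0(BITMAPSET_SIZE(wordnum + 1));
	result->nwords = wordnum + 1;
	return result;
}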

Best regards, Andrey Borodin.


0002-Delete-pages-during-GiST-VACUUM-v16.patch
Description: Binary data


0001-Physical-GiST-scan-in-VACUUM-v16.patch
Description: Binary data


Re: GiST VACUUM

2018-08-05 Thread Andrey Borodin
Hi!

> On 5 Aug 2018, at 16:18, Heikki Linnakangas  wrote:
> 
> On 31/07/18 23:06, Andrey Borodin wrote:
>>> On a typical GiST index, what's the ratio of leaf vs. internal
>>> pages? Perhaps an array would indeed be better.
> >
>> Typical GiST has around 200 tuples per internal page. I've switched
>> to List since it's more efficient than bitmap.
> 
> Hmm. A ListCell is 16 bytes, plus the AllocChunk header, 16 bytes. 32
> bytes per internal page in total, while a bitmap consumes one bit per page, 
> leaf or internal. If I'm doing
> my math right, assuming the ratio of leaf pages vs internal pages is
> 1:200, a List actually consumes more memory than a bitmap; 256 bits per
> internal page, vs. 200 bits. An array, with 4 bytes per block number,
> would be the winner here.
We do not know the size of this array beforehand. I can implement a normal ArrayList 
(with a repalloc'd array) or a linked list of chunks, or anything else from the 
data-structures zoo.
Or just stick with the bitmap (my preferred way).
> 
>> But I have to note that default growth strategy of Bitmap is not good: we 
>> will be repallocing byte by byte.
> 
> True, that repallocing seems bad. You could force it to be pre-allocated
> by setting the last bit. Or add a function to explicitly enlarge the bitmap.
OK, I'll think of proper resize function (ensure capacity, to be precise).

Best regards, Andrey Borodin.




Re: GiST VACUUM

2018-08-05 Thread Heikki Linnakangas

On 31/07/18 23:06, Andrey Borodin wrote:

On a typical GiST index, what's the ratio of leaf vs. internal
pages? Perhaps an array would indeed be better.

>

Typical GiST has around 200 tuples per internal page. I've switched
to List since it's more efficient than bitmap.


Hmm. A ListCell is 16 bytes, plus the AllocChunk header, 16 bytes. 32
bytes per internal page in total, while a bitmap consumes one bit per 
page, leaf or internal. If I'm doing

my math right, assuming the ratio of leaf pages vs internal pages is
1:200, a List actually consumes more memory than a bitmap; 256 bits per
internal page, vs. 200 bits. An array, with 4 bytes per block number,
would be the winner here.
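
Spelling out that arithmetic, with the assumed 1:200 internal-to-leaf ratio:

  List of internal pages:  16 B ListCell + 16 B chunk header = 32 B
                           = 256 bits per internal page
  Bitmap over all pages:   1 bit x (1 internal + 200 leaves) ~ 201 bits
                           per internal page
  Array of block numbers:  4 B = 32 bits per internal page

So the bitmap narrowly beats the List, and a plain array of internal-page block
numbers is smaller than either, as long as leaves outnumber internal pages by
roughly 200:1.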

But I have to note that default growth strategy of Bitmap is not 
good: we will be repallocing byte by byte.


True, that repallocing seems bad. You could force it to be pre-allocated
by setting the last bit. Or add a function to explicitly enlarge the bitmap.

- Heikki



Re: GiST VACUUM

2018-07-31 Thread Andrey Borodin
Hi! Thanks for looking into the patch!

> On 30 Jul 2018, at 18:39, Heikki Linnakangas  wrote:
> 
> On 29/07/18 14:47, Andrey Borodin wrote:
>> Fixed both problems. PFA v14.
> 
> Thanks, took a really quick look at this.
> 
> The text being added to README is outdated for these latest changes.
Fixed.
> 
>> In second step I still use paloc's memory, but only to store two
>> bitmaps: bitmap of internal pages and bitmap of empty leafs. Second
>> physical scan only reads internal pages. I can omit that bitmap, if
>> I'll scan everything. Also, I can replace emptyLeafs bitmap with
>> array\list, but I do not really think it will be big.
> 
> On a typical GiST index, what's the ratio of leaf vs. internal pages? Perhaps 
> an array would indeed be better.
Typical GiST has around 200 tuples per internal page. I've switched to List 
since it's more efficient than a bitmap.
> If you have a really large index, the bitmaps can take a fair amount of 
> memory, on top of the memory used for tracking the dead TIDs. I.e. that 
> memory will be in addition to maintenance_work_mem. That's not nice, but I 
> think it's OK in practice, and not worth spending too much effort to 
> eliminate. For a 1 TB index with default 8k block size, the two bitmaps will 
> take 32 MB of memory in total. If you're dealing with a database of that 
> size, you ought to have some memory to spare. But if an array would use less 
> memory, that'd be better.

> 
> If you go with bitmaps, please use the existing Bitmapset instead of rolling 
> your own. Saves some code, and it provides more optimized routines for 
> iterating through all the set bits, too (bms_next_member()). Another 
> possibility would be to use Tidbitmap, in the "lossy" mode, i.e. add the 
> pages with tbm_add_page(). That might save some memory, compared to 
> Bitmapset, if the bitmap is very sparse. Not sure how it compares with a 
> plain array.
Yeah, I've stopped reinventing the wheel there. But I have to note that the default 
growth strategy of Bitmapset is not good: we will be repallocing byte by byte.

> 
> A straightforward little optimization would be to skip scanning the internal 
> pages, when the first scan didn't find any empty pages. And stop the scan of 
> the internal pages as soon as all the empty pages have been recycled.
Done.

PFA v15.

Best regards, Andrey Borodin.


0002-Delete-pages-during-GiST-VACUUM-v15.patch
Description: Binary data


0001-Physical-GiST-scan-in-VACUUM-v15.patch
Description: Binary data


Re: GiST VACUUM

2018-07-30 Thread Heikki Linnakangas

On 29/07/18 14:47, Andrey Borodin wrote:

Fixed both problems. PFA v14.


Thanks, took a really quick look at this.

The text being added to README is outdated for these latest changes.


In second step I still use paloc's memory, but only to store two
bitmaps: bitmap of internal pages and bitmap of empty leafs. Second
physical scan only reads internal pages. I can omit that bitmap, if
I'll scan everything. Also, I can replace emptyLeafs bitmap with
array\list, but I do not really think it will be big.


On a typical GiST index, what's the ratio of leaf vs. internal pages? 
Perhaps an array would indeed be better. If you have a really large 
index, the bitmaps can take a fair amount of memory, on top of the 
memory used for tracking the dead TIDs. I.e. that memory will be in 
addition to maintenance_work_mem. That's not nice, but I think it's OK 
in practice, and not worth spending too much effort to eliminate. For a 
1 TB index with default 8k block size, the two bitmaps will take 32 MB 
of memory in total. If you're dealing with a database of that size, you 
ought to have some memory to spare. But if an array would use less 
memory, that'd be better.


If you go with bitmaps, please use the existing Bitmapset instead of 
rolling your own. Saves some code, and it provides more optimized 
routines for iterating through all the set bits, too 
(bms_next_member()). Another possibility would be to use Tidbitmap, in 
the "lossy" mode, i.e. add the pages with tbm_add_page(). That might 
save some memory, compared to Bitmapset, if the bitmap is very sparse. 
Not sure how it compares with a plain array.


A straightforward little optimization would be to skip scanning the 
internal pages, when the first scan didn't find any empty pages. And 
stop the scan of the internal pages as soon as all the empty pages have 
been recycled.
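
In code, that optimization amounts to something like the following sketch; the
vacuum-state fields and the helper are assumed from the patches in this thread,
this is not the committed version:

static void
gistvacuum_recycle_pages(GistVacState *vstate, Relation rel)
{
	BlockNumber npages = RelationGetNumberOfBlocks(rel);
	BlockNumber blkno;

	/* Skip the whole second pass if the first pass saw no empty leaves */
	if (vstate->emptyPages == 0)
		return;

	for (blkno = GIST_ROOT_BLKNO; blkno < npages; blkno++)
	{
		/* Stop as soon as every empty leaf has been recycled */
		if (vstate->emptyPages == 0)
			break;

		if (!gistvacuum_is_internal_page(vstate, blkno))	/* assumed helper */
			continue;

		/*
		 * ... scan this internal page for downlinks to empty leaves,
		 * delete those leaves, and decrement vstate->emptyPages ...
		 */
	}
}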


- Heikki



Re: GiST VACUUM

2018-07-29 Thread Andrey Borodin
Hi!

Thank you!

> On 29 Jul 2018, at 14:04, Thomas Munro  wrote:
> 
> On Tue, Jul 24, 2018 at 6:04 AM, Andrey Borodin  wrote:
>>> On 21 Jul 2018, at 17:11, Andrey Borodin  wrote:
>>> <0001-Physical-GiST-scan-in-VACUUM-v13.patch>
>> 
>> Just in case, here's second part of patch series with actual page deletion.
> 
> Hi Andrey,
> 
> Cfbot says:
> 
> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.7146
> 
> That's because you declared a new variable after some other
> statements.  You can't do that in old school C89.
Yep, mismerged patch steps and created misplaced declaration

> https://travis-ci.org/postgresql-cfbot/postgresql/builds/409401951
> 
> That segfaulted here:
> 
> #0 0x004ab620 in setinternal (state=,
> state=, blkno=368) at gistvacuum.c:44
> 44 state->internalPagesMap[blkno / 8] |= 1 << (blkno % 8);
> 
> internalPagesMap was NULL, or pointed to memory that was too small and
> happened to be near an unmapped region (unlikely?), or was some other
> corrupted address?
Yes, there was a conditionally uninitialized variable, mapNumPages. I do not know 
why it didn't trigger a failure last time, but now it reproduces reliably on my 
machine.
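
The initialization pattern that avoids this class of crash is to size and zero
both maps unconditionally, before any bit can ever be set; roughly (the field
names follow the backtrace above, the exact struct layout is an assumption):

static void
gistvacuum_init_page_maps(GistVacState *vstate, Relation rel)
{
	BlockNumber npages = RelationGetNumberOfBlocks(rel);

	/* one bit per page, rounded up to whole bytes, always allocated */
	vstate->mapNumPages = npages;
	vstate->internalPagesMap = palloc0((npages + 7) / 8);
	vstate->emptyLeafPagesMap = palloc0((npages + 7) / 8);
}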

Fixed both problems. PFA v14.

Best regards, Andrey Borodin.


0002-Delete-pages-during-GiST-VACUUM-v14.patch
Description: Binary data


0001-Physical-GiST-scan-in-VACUUM-v14.patch
Description: Binary data


Re: GiST VACUUM

2018-07-29 Thread Thomas Munro
On Tue, Jul 24, 2018 at 6:04 AM, Andrey Borodin  wrote:
>> On 21 Jul 2018, at 17:11, Andrey Borodin  wrote:
>> <0001-Physical-GiST-scan-in-VACUUM-v13.patch>
>
> Just in case, here's second part of patch series with actual page deletion.

Hi Andrey,

Cfbot says:

https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.7146

That's because you declared a new variable after some other
statements.  You can't do that in old school C89.

https://travis-ci.org/postgresql-cfbot/postgresql/builds/409401951

That segfaulted here:

#0 0x004ab620 in setinternal (state=,
state=, blkno=368) at gistvacuum.c:44
44 state->internalPagesMap[blkno / 8] |= 1 << (blkno % 8);

internalPagesMap was NULL, or pointed to memory that was too small and
happened to be near an unmapped region (unlikely?), or was some other
corrupted address?

-- 
Thomas Munro
http://www.enterprisedb.com



Re: GiST VACUUM

2018-07-23 Thread Andrey Borodin
Hi!

> On 21 Jul 2018, at 17:11, Andrey Borodin  wrote:
> 
> <0001-Physical-GiST-scan-in-VACUUM-v13.patch>

Just in case, here's second part of patch series with actual page deletion.

I was considering further decreasing memory footprint by using bloom filters 
instead of bitmap, but it will create seriously more work for cpu to compute 
hashes.

Best regards, Andrey Borodin.


0002-Delete-pages-during-GiST-VACUUM-v13.patch
Description: Binary data


0001-Physical-GiST-scan-in-VACUUM-v13.patch
Description: Binary data


Re: GiST VACUUM

2018-07-21 Thread Andrey Borodin
Hi!

> On 19 Jul 2018, at 23:26, Andrey Borodin  wrote:
> 
> I'm working on triggering left split during vacuum. Will get back when done. 
> Thanks!

Here's patch including some messy hacks to trigger NSN and FollowRight jumps 
during VACUUM.

To trigger FollowRight, GiST sometimes "forgets" to clear the follow-right marker, 
simulating a crash of an insert. This fills the logs with "fixing incomplete split" 
messages. Search for "REMOVE THIS" to disable these ill-behavior triggers.
To trigger an NSN jump, GiST allocates an empty page after every real allocation.

gistvacuumcleanup() was constantly generating left jumps because 0 was used 
instead of the real start NSN, so I moved NSN acquisition to gistvacuumscan(). 
Also fixed some comments.
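
For reference, the rescan ("jump left") condition being tuned here boils down
to roughly this check in the per-page vacuum routine (simplified; the helper
name is invented for the sketch):

static BlockNumber
gist_left_jump_target(GistVacState *vstate, Page page, BlockNumber orig_blkno)
{
	GISTPageOpaque opaque = GistPageGetOpaque(page);

	/*
	 * Rescan via the rightlink only if the page shows signs of having been
	 * split after the scan noted startNSN (or still carries the
	 * follow-right marker of an incomplete split), and only if the right
	 * sibling lives at a lower block number than the block this scan
	 * started from.
	 */
	if ((GistFollowRight(page) ||
		 vstate->startNSN < GistPageGetNSN(page)) &&
		opaque->rightlink != InvalidBlockNumber &&
		opaque->rightlink < orig_blkno)
		return opaque->rightlink;

	return InvalidBlockNumber;	/* no left jump needed */
}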

gistvacuumcleanup() will have same effect as gistbulkdelete(), is it OK?

To reproduce left-jumps run ./rescantest.sh
Script contain variables for my local paths.


Best regards, Andrey Borodin.


0001-Physical-GiST-scan-in-VACUUM-v13.patch
Description: Binary data


Re: GiST VACUUM

2018-07-19 Thread Andrey Borodin



> On 19 Jul 2018, at 16:28, Heikki Linnakangas  wrote:
> Hmm. So, while we are scanning the right sibling, which was moved to 
> lower-numbered block because of a concurrent split, the original page is 
> split again? That's OK, we've already scanned all the tuples on the original 
> page, before we recurse to deal with the right sibling. (The corresponding 
> B-tree code also releases the lock on the original page when recursing)
Seems right.

> 
> I did some refactoring, to bring this closer to the B-tree code, for the sake 
> of consistency. See attached patch. This also eliminates the 2nd pass by 
> gistvacuumcleanup(), in case we did that in the bulkdelete-phase already.
Thanks!

> 
> There was one crucial thing missing: in the outer loop, we must ensure that 
> we scan all pages, even those that were added after the vacuum started.
Correct. There is quite neat logic behind the order of acquiring npages, comparing, 
and vacuuming the page. The notes in the FIXMEs look correct, except for the function names.

> There's a comment explaining that in btvacuumscan(). This version fixes that.
> 
> I haven't done any testing on this. Do you have any test scripts you could 
> share?
I use just a simple test that sets up replication and does random inserts and 
vacuums. Nothing fancy, just a mutated script:
for i in $(seq 1 12); do
    size=$((100 * 2**$i))
    ./psql postgres -c "create table x as select cube(random()) c from generate_series(1,$size) y; create index on x using gist(c);"
    ./psql postgres -c "delete from x;"
    ./psql postgres -c "VACUUM x;"
    ./psql postgres -c "VACUUM x;"
    ./psql postgres -c "drop table x;"
    ./psql postgres -c "create table x as select cube(random()) c from generate_series(1,$size) y; create index on x using gist(c);"
    ./psql postgres -c "delete from x where (c~>1)>0.1;"
    ./psql postgres -c "VACUUM x;"
    ./psql postgres -c "insert into x select cube(random()) c from generate_series(1,$size) y;"
    ./psql postgres -c "VACUUM x;"
    ./psql postgres -c "delete from x where (c~>1)>0.1;"
    ./psql postgres -c "select pg_size_pretty(pg_relation_size('x_c_idx'));"
    ./psql postgres -c "VACUUM FULL x;"
    ./psql postgres -c "select pg_size_pretty(pg_relation_size('x_c_idx'));"
    ./psql postgres -c "drop table x;"
done

> I think we need some repeatable tests for the concurrent split cases.
It is hard to trigger left splits until we delete pages. I'll try to hack 
gistNewBuffer() to cause something similar.

> Even if it involves gdb or some other hacks that we can't include in the 
> regression test suite, we need something now, while we're hacking on this.
> 
> One subtle point, that I think is OK, but gave me a pause, and probably 
> deserves comment somewhere: A concurrent root split can turn a leaf page into 
> one internal (root) page, and two new leaf pages. The new root page is placed 
> in the same block as the old page, while both new leaf pages go to freshly 
> allocated blocks. If that happens while vacuum is running, might we miss the 
> new leaf pages? As the code stands, we don't do the "follow-right" dance on 
> internal pages, so we would not recurse into the new leaf pages. At first, I 
> thought that's a problem, but I think we can get away with it. The only 
> scenario where a root split happens on a leaf page, is when the index has 
> exactly one page, a single leaf page. Any subsequent root splits will split 
> an internal page rather than a leaf page, and we're not bothered by those. In 
> the case that a root split happens on a single-page index, we're OK, because 
> we will always scan that page either before, or after the split. If we scan 
> the single page before the split, we see all the leaf tuples on that page. If 
> we scan the single page after the split, it means that we start the scan 
> after the split, and we will see both leaf pages as we continue the scan.
Yes, only page 0 may change type, and page 0 cannot split to left.


I'm working on triggering left split during vacuum. Will get back when done. 
Thanks!

Best regards, Andrey Borodin.


Re: GiST VACUUM

2018-07-19 Thread Heikki Linnakangas

On 19/07/18 14:42, Andrey Borodin wrote:


19.07.2018, 15:20, "Heikki Linnakangas" :


On 19/07/18 13:52, Andrey Borodin wrote:

 Hi!

 On 19 Jul 2018, at 1:12, Heikki Linnakangas  wrote:

 Yeah, please, I think this is the way to go.


 Here's v11 divided into proposed steps.


Thanks, one quick question:

 /* We should not unlock buffer if we are going to jump left */
 if (needScan)
 {
     GistBDItem *ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
     ptr->buffer = buffer;
     ptr->next = bufferStack;
     bufferStack = ptr;
 }
 else
     UnlockReleaseBuffer(buffer);


Why? I don't see any need to keep the page locked, when we "jump left".


Because it can split to the left again, given that we release lock.


Hmm. So, while we are scanning the right sibling, which was moved to 
lower-numbered block because of a concurrent split, the original page is 
split again? That's OK, we've already scanned all the tuples on the 
original page, before we recurse to deal with the right sibling. (The 
corresponding B-tree code also releases the lock on the original page 
when recursing)


I did some refactoring, to bring this closer to the B-tree code, for the 
sake of consistency. See attached patch. This also eliminates the 2nd 
pass by gistvacuumcleanup(), in case we did that in the bulkdelete-phase 
already.


There was one crucial thing missing: in the outer loop, we must ensure 
that we scan all pages, even those that were added after the vacuum 
started. There's a comment explaining that in btvacuumscan(). This 
version fixes that.
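
The shape of that outer loop, mirroring btvacuumscan(), is roughly the following
sketch (the real code also locks against concurrent relation extension before
re-reading the length):

static void
gistvacuumscan_sketch(IndexVacuumInfo *info, GistVacState *vstate)
{
	Relation	rel = info->index;
	BlockNumber num_pages;
	BlockNumber blkno = GIST_ROOT_BLKNO;

	for (;;)
	{
		/* Re-read the relation length on every outer iteration ... */
		num_pages = RelationGetNumberOfBlocks(rel);

		/* ... and quit only when no new pages have appeared */
		if (blkno >= num_pages)
			break;

		for (; blkno < num_pages; blkno++)
			gistvacuumpage(vstate, blkno, blkno);
	}
}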


I haven't done any testing on this. Do you have any test scripts you 
could share? I think we need some repeatable tests for the concurrent 
split cases. Even if it involves gdb or some other hacks that we can't 
include in the regression test suite, we need something now, while we're 
hacking on this.


One subtle point, that I think is OK, but gave me a pause, and probably 
deserves comment somewhere: A concurrent root split can turn a leaf page 
into one internal (root) page, and two new leaf pages. The new root page 
is placed in the same block as the old page, while both new leaf pages 
go to freshly allocated blocks. If that happens while vacuum is running, 
might we miss the new leaf pages? As the code stands, we don't do the 
"follow-right" dance on internal pages, so we would not recurse into the 
new leaf pages. At first, I thought that's a problem, but I think we can 
get away with it. The only scenario where a root split happens on a leaf 
page, is when the index has exactly one page, a single leaf page. Any 
subsequent root splits will split an internal page rather than a leaf 
page, and we're not bothered by those. In the case that a root split 
happens on a single-page index, we're OK, because we will always scan 
that page either before, or after the split. If we scan the single page 
before the split, we see all the leaf tuples on that page. If we scan 
the single page after the split, it means that we start the scan after 
the split, and we will see both leaf pages as we continue the scan.


- Heikki
>From 9978fd22dd7b52b1b3f509f53fbafa505f68b573 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Thu, 19 Jul 2018 15:25:58 +0300
Subject: [PATCH 1/1] Physical GiST scan in VACUUM v12

---
 src/backend/access/gist/gistvacuum.c | 431 ---
 1 file changed, 244 insertions(+), 187 deletions(-)

diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 5948218c77..180cc6c63a 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -21,6 +21,38 @@
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
+/* Working state needed by gistbulkdelete */
+typedef struct
+{
+	IndexVacuumInfo *info;
+	IndexBulkDeleteResult *stats;
+	IndexBulkDeleteCallback callback;
+	void	   *callback_state;
+	GistNSN		startNSN;
+	BlockNumber totFreePages;	/* true total # of free pages */
+} GistVacState;
+
+static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state,
+			   GistNSN startNSN);
+static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
+			   BlockNumber orig_blkno);
+
+IndexBulkDeleteResult *
+gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state)
+{
+	GistNSN		startNSN;
+
+	/* allocate stats if first time through, else re-use existing struct */
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) 

Re: GiST VACUUM

2018-07-19 Thread Andrey Borodin
On 19.07.2018 at 15:20, Heikki Linnakangas wrote:

 On 19/07/18 13:52, Andrey Borodin wrote:

  Hi!

  On 19 July 2018, at 01:12, Heikki Linnakangas wrote:

  Yeah, please, I think this is the way to go.

  Here's v11 divided into proposed steps.

 Thanks, one quick question:

  /* We should not unlock buffer if we are going to jump left */
  if (needScan)
  {
      GistBDItem *ptr = (GistBDItem *) palloc(sizeof(GistBDItem));

      ptr->buffer = buffer;
      ptr->next = bufferStack;
      bufferStack = ptr;
  }
  else
      UnlockReleaseBuffer(buffer);

 Why? I don't see any need to keep the page locked, when we "jump left".

Because it can split to the left again, given that we release the lock.

Best regards, Andrey Borodin.

Re: GiST VACUUM

2018-07-19 Thread Heikki Linnakangas

On 19/07/18 13:52, Andrey Borodin wrote:

Hi!


On 19 July 2018, at 01:12, Heikki Linnakangas wrote:

Yeah, please, I think this is the way to go.


Here's v11 divided into proposed steps.


Thanks, one quick question:


/* We should not unlock buffer if we are going to jump left */
if (needScan)
{
    GistBDItem *ptr = (GistBDItem *) palloc(sizeof(GistBDItem));

    ptr->buffer = buffer;
    ptr->next = bufferStack;
    bufferStack = ptr;
}
else
    UnlockReleaseBuffer(buffer);


Why? I don't see any need to keep the page locked, when we "jump left".

- Heikki



Re: GiST VACUUM

2018-07-19 Thread Andrey Borodin
Hi!

> On 19 July 2018, at 01:12, Heikki Linnakangas wrote:
> 
> Yeah, please, I think this is the way to go.

Here's v11 divided into proposed steps.

In the second step I still use palloc'd memory, but only to store two bitmaps: 
a bitmap of internal pages and a bitmap of empty leaf pages. The second physical 
scan reads only the internal pages. I can omit that bitmap if I simply scan everything.
Also, I can replace the emptyLeafs bitmap with an array/list, but I do not 
think it will grow large.
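
A rough sketch of that bookkeeping, just to make it concrete (the names 
here are made up for illustration, not the names in the attached patch):

#include "nodes/bitmapset.h"

typedef struct
{
	Bitmapset  *internalPages;	/* blocks that hold internal pages */
	Bitmapset  *emptyLeafPages; /* blocks that hold completely empty leaves */
} GistEmptyPageState;

static void
remember_page(GistEmptyPageState *state, BlockNumber blkno, Page page)
{
	if (!GistPageIsLeaf(page))
		state->internalPages = bms_add_member(state->internalPages, (int) blkno);
	else if (PageGetMaxOffsetNumber(page) == 0)
		state->emptyLeafPages = bms_add_member(state->emptyLeafPages, (int) blkno);
}

The second pass then visits only the blocks in internalPages and checks 
which downlinks point into emptyLeafPages.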

Anyway, I propose focusing on the first step.

Best regards, Andrey Borodin.


0002-Delete-pages-during-GiST-VACUUM-v11.patch
Description: Binary data


0001-Physical-GiST-scan-in-VACUUM-v11.patch
Description: Binary data


Re: GiST VACUUM

2018-07-18 Thread Heikki Linnakangas

On 18/07/18 21:27, Andrey Borodin wrote:

Hi!


On 18 July 2018, at 16:02, Heikki Linnakangas wrote:

, but I think it would be better to split this into two patches as
follows:

1st patch: Scan the index in physical rather than logical order. No
attempt at deleting empty pages yet.

2nd patch: Add support for deleting empty pages.

I would be more comfortable reviewing and committing that first
patch, which just switches to doing physical-order scan, first.


This seems like a very disproportionate division of complexity. The first
patch (PFA) is very simple. All the work is done in one pass, without
remembering anything. Actually, you do not even need to rescan
rightlinks: there can be no splits to the left when no pages are
deleted.


Heh, good point.

I googled around and bumped into an older patch to do this: 
https://www.postgresql.org/message-id/1135121410099068%40web30j.yandex.ru. 
Unfortunately, Костя never got around to updating the patch, and it was 
forgotten. But the idea seemed sound even back then.


As noted in that thread, there might be deleted pages in the index in 
some rare circumstances, even though we don't recycle empty pages: if 
the index was upgraded from a very old version, as VACUUM FULL used to 
recycle empty pages, or if you crash just when extending the index, and 
end up with newly-initialized but unused pages that way. So we do need 
to handle the concurrent split scenario, even without empty page recycling.
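
A sketch of how the scan can cope with such pages (names follow the 
physical-scan structure discussed elsewhere in this thread; illustrative 
only, not the final code):

/* Inside the per-page routine, after reading and locking the buffer */
if (PageIsNew(page) || GistPageIsDeleted(page))
{
	/* Nothing to vacuum here, but report the block as reusable space */
	vstate->totFreePages++;
	RecordFreeIndexPage(info->index, blkno);
	UnlockReleaseBuffer(buffer);
	return;
}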



If you think it is the proper way to go - OK, I'll prepare a
better version of the attached diff (by omitting tail recursion and
adding more comments).


Yeah, please, I think this is the way to go.

- Heikki


