Re: why do hash index builds use smgrextend() for new splitpoint pages

2022-02-27 Thread Amit Kapila
On Sat, Feb 26, 2022 at 9:17 PM Melanie Plageman
 wrote:
>
> On Fri, Feb 25, 2022 at 11:17 PM Amit Kapila  wrote:
> >
> > On Sat, Feb 26, 2022 at 3:01 AM Melanie Plageman
> >  wrote:
> > >
> > > Since _hash_alloc_buckets() WAL-logs the last page of the
> > > splitpoint, is it safe to skip the smgrimmedsync()? What if the last
> > > page of the splitpoint doesn't end up having any tuples added to it
> > > during the index build and the redo pointer is moved past the WAL for
> > > this page and then later there is a crash sometime before this page
> > > makes it to permanent storage? Does it matter that this page is lost? If
> > > not, then why bother WAL-logging it?
> > >
> >
> > I think we don't care if the page is lost before we update the
> > meta-page in the caller because we will try to reallocate in that
> > case. But we do care after meta page update (having the updated
> > information about this extension via different masks) in which case we
> > won't lose this last page because it would have registered the sync
> > request for it via smgrextend before meta page update.
>
> and could it happen that during smgrextend() for the last page, a
> checkpoint starts and finishes between FileWrite() and
> register_dirty_segment(), then index build finishes, and then a crash
> occurs before another checkpoint completes the pending fsync for that
> last page?
>

Yeah, this seems to be possible, and the problem could then be that the
index's idea of the EOF and smgr's idea of the EOF differ, which could
lead to a problem when we try to get a new page via _hash_getnewbuf().
If this theory turns out to be true then we can probably get an error,
either because the disk is full or because the index requests a block
that is beyond the EOF as determined by
RelationGetNumberOfBlocksInFork() in _hash_getnewbuf().

Can we try to reproduce this scenario with the help of a debugger to
see if we are missing something?

-- 
With Regards,
Amit Kapila.




Re: why do hash index builds use smgrextend() for new splitpoint pages

2022-02-26 Thread Melanie Plageman
On Fri, Feb 25, 2022 at 11:17 PM Amit Kapila  wrote:
>
> On Sat, Feb 26, 2022 at 3:01 AM Melanie Plageman
>  wrote:
> >
> > Since _hash_alloc_buckets() WAL-logs the last page of the
> > splitpoint, is it safe to skip the smgrimmedsync()? What if the last
> > page of the splitpoint doesn't end up having any tuples added to it
> > during the index build and the redo pointer is moved past the WAL for
> > this page and then later there is a crash sometime before this page
> > makes it to permanent storage? Does it matter that this page is lost? If
> > not, then why bother WAL-logging it?
> >
>
> I think we don't care if the page is lost before we update the
> meta-page in the caller because we will try to reallocate in that
> case. But we do care after meta page update (having the updated
> information about this extension via different masks) in which case we
> won't lose this last page because it would have registered the sync
> request for it via smgrextend before meta page update.

and could it happen that during smgrextend() for the last page, a
checkpoint starts and finishes between FileWrite() and
register_dirty_segment(), then index build finishes, and then a crash
occurs before another checkpoint completes the pending fsync for that
last page?
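[Editor's note: the suspected ordering can be sketched as a toy model. This is illustrative C only; the struct and functions below are invented for the sketch and are not PostgreSQL code. The model assumes a checkpoint fsyncs only pages whose sync request was registered before that checkpoint processed the pending-request table.]

```c
/* Toy model of the suspected race (invented names, not PostgreSQL code). */
#include <assert.h>
#include <stdbool.h>

typedef struct
{
    bool written_to_os;     /* FileWrite() done: data in kernel cache only */
    bool sync_requested;    /* register_dirty_segment() done */
    bool durable;           /* page has reached permanent storage */
} PageState;

static void
run_checkpoint(PageState *p)
{
    /* the checkpointer absorbs registered sync requests and fsyncs them */
    if (p->written_to_os && p->sync_requested)
        p->durable = true;
}

static bool
lost_on_crash(const PageState *p)
{
    /* data sitting only in the kernel cache can vanish in a crash */
    return p->written_to_os && !p->durable;
}
```

In this model, if a checkpoint completes in the window between the write and the registration, the pending fsync request only lands in the next checkpoint cycle, and a crash before that cycle loses the page.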




Re: why do hash index builds use smgrextend() for new splitpoint pages

2022-02-25 Thread Amit Kapila
On Sat, Feb 26, 2022 at 3:01 AM Melanie Plageman
 wrote:
>
> Since _hash_alloc_buckets() WAL-logs the last page of the
> splitpoint, is it safe to skip the smgrimmedsync()? What if the last
> page of the splitpoint doesn't end up having any tuples added to it
> during the index build and the redo pointer is moved past the WAL for
> this page and then later there is a crash sometime before this page
> makes it to permanent storage? Does it matter that this page is lost? If
> not, then why bother WAL-logging it?
>

I think we don't care if the page is lost before we update the
meta-page in the caller, because in that case we will try to reallocate
it. But we do care after the meta-page update (which records the
updated information about this extension via the various mask fields),
in which case we won't lose this last page, because the sync request
for it would have been registered via smgrextend before the meta-page
update.

-- 
With Regards,
Amit Kapila.




Re: why do hash index builds use smgrextend() for new splitpoint pages

2022-02-25 Thread Melanie Plageman
On Thu, Feb 24, 2022 at 10:24 PM Amit Kapila  wrote:
>
> On Fri, Feb 25, 2022 at 4:41 AM Melanie Plageman
>  wrote:
> >
> > I'm trying to understand why hash indexes are built primarily in shared
> > buffers except when allocating a new splitpoint's worth of bucket pages
> > -- which is done with smgrextend() directly in _hash_alloc_buckets().
> >
> > Is this just so that the value returned by smgrnblocks() includes the
> > new splitpoint's worth of bucket pages?
> >
> > All writes of tuple data to pages in this new splitpoint will go
> > through shared buffers (via _hash_getnewbuf()).
> >
> > I asked this and got some thoughts from Robert in [1], but I still don't
> > really get it.
> >
> > When a new page is needed during the hash index build, why can't
> > _hash_expandtable() just call ReadBufferExtended() with P_NEW instead of
> > _hash_getnewbuf()? Does it have to do with the BUCKET_TO_BLKNO mapping?
> >
>
> We allocate the chunk of pages (power-of-2 groups) at the time of
> split which allows them to appear consecutively in an index. This
> helps us to compute the physical block number from bucket number
> easily (BUCKET_TO_BLKNO mapping) with some minimal control
> information.

got it, thanks.

Since _hash_alloc_buckets() WAL-logs the last page of the
splitpoint, is it safe to skip the smgrimmedsync()? What if the last
page of the splitpoint doesn't end up having any tuples added to it
during the index build and the redo pointer is moved past the WAL for
this page and then later there is a crash sometime before this page
makes it to permanent storage? Does it matter that this page is lost? If
not, then why bother WAL-logging it?

- Melanie




Re: why do hash index builds use smgrextend() for new splitpoint pages

2022-02-24 Thread Amit Kapila
On Fri, Feb 25, 2022 at 8:54 AM Amit Kapila  wrote:
>
> On Fri, Feb 25, 2022 at 4:41 AM Melanie Plageman
>  wrote:
> >
> > I'm trying to understand why hash indexes are built primarily in shared
> > buffers except when allocating a new splitpoint's worth of bucket pages
> > -- which is done with smgrextend() directly in _hash_alloc_buckets().
> >
> > Is this just so that the value returned by smgrnblocks() includes the
> > new splitpoint's worth of bucket pages?
> >
> > All writes of tuple data to pages in this new splitpoint will go
> > through shared buffers (via _hash_getnewbuf()).
> >
> > I asked this and got some thoughts from Robert in [1], but I still don't
> > really get it.
> >
> > When a new page is needed during the hash index build, why can't
> > _hash_expandtable() just call ReadBufferExtended() with P_NEW instead of
> > _hash_getnewbuf()? Does it have to do with the BUCKET_TO_BLKNO mapping?
> >
>
> We allocate the chunk of pages (power-of-2 groups) at the time of
> split which allows them to appear consecutively in an index.
>

I think allocating chunks of pages via "ReadBufferExtended() with
P_NEW" would be time-consuming compared to what we do now, since it
would extend the relation one page at a time.

-- 
With Regards,
Amit Kapila.




Re: why do hash index builds use smgrextend() for new splitpoint pages

2022-02-24 Thread Amit Kapila
On Fri, Feb 25, 2022 at 4:41 AM Melanie Plageman
 wrote:
>
> I'm trying to understand why hash indexes are built primarily in shared
> buffers except when allocating a new splitpoint's worth of bucket pages
> -- which is done with smgrextend() directly in _hash_alloc_buckets().
>
> Is this just so that the value returned by smgrnblocks() includes the
> new splitpoint's worth of bucket pages?
>
> All writes of tuple data to pages in this new splitpoint will go
> through shared buffers (via _hash_getnewbuf()).
>
> I asked this and got some thoughts from Robert in [1], but I still don't
> really get it.
>
> When a new page is needed during the hash index build, why can't
> _hash_expandtable() just call ReadBufferExtended() with P_NEW instead of
> _hash_getnewbuf()? Does it have to do with the BUCKET_TO_BLKNO mapping?
>

We allocate the chunk of pages (power-of-2 groups) at the time of the
split, which allows them to appear consecutively in the index. This
helps us compute the physical block number from the bucket number
easily (the BUCKET_TO_BLKNO mapping) with some minimal control
information.
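[Editor's note: a deliberately simplified sketch of that mapping. The helper names are invented; the real BUCKET_TO_BLKNO macro also consults the metapage's hashm_spares[] array to account for overflow and bitmap pages, and newer versions subdivide large splitpoints into phases. With those details ignored, consecutive allocation means bucket B lives at block B + 1, block 0 being the metapage.]

```c
/* Simplified model of splitpoint allocation and bucket-to-block mapping
 * (illustrative only, not the PostgreSQL macros). */
#include <assert.h>
#include <stdint.h>

/* number of new bucket pages allocated when splitpoint n is created:
 * splitpoint 0 holds bucket 0; splitpoint n >= 1 adds buckets
 * [2^(n-1), 2^n), i.e. a power-of-2 group */
static uint32_t
splitpoint_group_size(uint32_t n)
{
    return (n == 0) ? 1 : ((uint32_t) 1 << (n - 1));
}

/* simplified bucket-to-block mapping with no overflow/bitmap pages:
 * buckets are consecutive starting right after the metapage */
static uint32_t
bucket_to_blkno(uint32_t bucket)
{
    return bucket + 1;
}
```

This is why the group must be allocated as one consecutive chunk at split time: the mapping stays a cheap computation instead of requiring a lookup structure.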

-- 
With Regards,
Amit Kapila.




why do hash index builds use smgrextend() for new splitpoint pages

2022-02-24 Thread Melanie Plageman
I'm trying to understand why hash indexes are built primarily in shared
buffers except when allocating a new splitpoint's worth of bucket pages
-- which is done with smgrextend() directly in _hash_alloc_buckets().

Is this just so that the value returned by smgrnblocks() includes the
new splitpoint's worth of bucket pages?

All writes of tuple data to pages in this new splitpoint will go
through shared buffers (via _hash_getnewbuf()).

I asked this and got some thoughts from Robert in [1], but I still don't
really get it.

When a new page is needed during the hash index build, why can't
_hash_expandtable() just call ReadBufferExtended() with P_NEW instead of
_hash_getnewbuf()? Does it have to do with the BUCKET_TO_BLKNO mapping?

- Melanie

[1] 
https://www.postgresql.org/message-id/CA%2BTgmoa%2BQFFhkHgPxyxv6t8aVU0r7GZmu7z8io4vGG7RHPpGzA%40mail.gmail.com