Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-15 Thread Amit Kapila
On Wed, Mar 15, 2017 at 7:50 PM, Robert Haas  wrote:
> On Wed, Mar 15, 2017 at 9:18 AM, Stephen Frost  wrote:
>>> I think we have sufficient comments in code especially on top of
>>> function _hash_alloc_buckets().
>>
>> I don't see any comments regarding how we have to be sure to handle
>> an out-of-space case properly in the middle of a file because we've made
>> it sparse.
>>
>> I do see that mdwrite() should handle an out-of-disk-space case, though
>> that just makes me wonder what's different here compared to normal
>> relations, such that we don't have an issue with a sparse WAL'd hash
>> index but can't handle a normal relation being sparse.
>
> I agree.  I think that what hash indexes are doing here is
> inconsistent with what we do for btree indexes and the heap.  And I
> don't think it would be bad to fix that.  We could, of course, go the
> other way and do what Tom is suggesting - insist that everybody's got
> to be prepared for sparse files, but I would view that as something of
> a U-turn.  I think hash indexes are like this because nobody's really
> worried about them, since they haven't been crash-safe or performant.
> Now that we've made them crash-safe and are on the way to making them
> performant, fixing other things is, as Tom already said, a lot more
> interesting.
>
> Now, that having been said, I'm not sure it's a good idea to tinker
> with the behavior for v10.  We could change the new-splitpoint code so
> that it loops over all the pages in the new splitpoint and zeroes them
> all, instead of just the last one.
>

Yeah, but I am slightly afraid that, apart from consuming too much
space, it will make the operation a lot slower when we have to perform
the allocation step.  Another possibility is to do that when we
actually use/allocate the bucket.  As of now, _hash_getnewbuf assumes
that the block is pre-existing and avoids the actual read/extend by
using RBM_ZERO_AND_LOCK mode.  I think we can change it so that it
forces a new allocation (if the block doesn't exist) on demand.  I
agree this is a somewhat bigger change than what you have proposed to
circumvent the sparse-file behavior, but it may not be a ton more work
if we decide to fix it, and it has the advantage of allocating the
space on disk only when it is actually needed.
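
A minimal sketch of that idea, with a hypothetical function name and
none of the metapage bookkeeping the real change would need:

/*
 * Hypothetical sketch: extend the relation on first use of a block
 * instead of assuming _hash_alloc_buckets() already wrote a page at
 * the end of the splitpoint.  HASH_WRITE is BUFFER_LOCK_EXCLUSIVE.
 */
static Buffer
hash_getnewbuf_extend(Relation rel, BlockNumber blkno, ForkNumber forkNum)
{
    BlockNumber nblocks = RelationGetNumberOfBlocksInFork(rel, forkNum);
    Buffer      buf;

    if (blkno < nblocks)
    {
        /* Block exists on disk already; just zero it in shared buffers. */
        return ReadBufferExtended(rel, forkNum, blkno,
                                  RBM_ZERO_AND_LOCK, NULL);
    }

    /* Physically extend the fork, block by block, up to the target. */
    for (;;)
    {
        buf = ReadBufferExtended(rel, forkNum, P_NEW, RBM_NORMAL, NULL);
        if (BufferGetBlockNumber(buf) == blkno)
            break;
        ReleaseBuffer(buf);
    }
    LockBuffer(buf, HASH_WRITE);
    return buf;
}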

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-15 Thread Robert Haas
On Wed, Mar 15, 2017 at 11:02 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> That theory seems inconsistent with how mdextend() works.  My
>> understanding is that we zero-fill the new blocks before populating
>> them with actual data precisely to avoid running out of disk space due
>> to deferred allocation at the OS level.  If we don't care about
>> failures due to deferred allocation at the OS level, we can rip that
>> logic out and improve the performance of relation extension
>> considerably.
>
> See my reply to Stephen.  The fact that this fails to guarantee no
> ENOSPC on COW filesystems doesn't mean that it's not worth doing on
> other filesystems.  We're reducing the risk, not eliminating it,
> but reducing risk is still a worthwhile activity.

Well, then it would presumably be worth reducing for hash indexes, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-15 Thread Stephen Frost
Tom,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Robert Haas  writes:
> > That theory seems inconsistent with how mdextend() works.  My
> > understanding is that we zero-fill the new blocks before populating
> > them with actual data precisely to avoid running out of disk space due
> > to deferred allocation at the OS level.  If we don't care about
> > failures due to deferred allocation at the OS level, we can rip that
> > logic out and improve the performance of relation extension
> > considerably.
> 
> See my reply to Stephen.  The fact that this fails to guarantee no
> ENOSPC on COW filesystems doesn't mean that it's not worth doing on
> other filesystems.  We're reducing the risk, not eliminating it,
> but reducing risk is still a worthwhile activity.

Considering how much work we end up doing to extend a relation, and how
we know that's been a hotspot, I'm not entirely sure I agree that
avoiding the relatively infrequent out-of-disk-space failure when
extending the relation (instead of letting it happen when we go to
actually write data into the page) really is a good trade-off to make.

Thanks!

Stephen




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-15 Thread Tom Lane
Robert Haas  writes:
> That theory seems inconsistent with how mdextend() works.  My
> understanding is that we zero-fill the new blocks before populating
> them with actual data precisely to avoid running out of disk space due
> to deferred allocation at the OS level.  If we don't care about
> failures due to deferred allocation at the OS level, we can rip that
> logic out and improve the performance of relation extension
> considerably.

See my reply to Stephen.  The fact that this fails to guarantee no
ENOSPC on COW filesystems doesn't mean that it's not worth doing on
other filesystems.  We're reducing the risk, not eliminating it,
but reducing risk is still a worthwhile activity.

regards, tom lane




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-15 Thread Stephen Frost
Robert,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Wed, Mar 15, 2017 at 10:34 AM, Tom Lane  wrote:
> > FWIW, I'm not certain that Stephen is correct to claim that we have
> > some concrete problem with sparse files.  We certainly don't *depend*
> > on sparse storage anyplace else, nor write data in a way that would be
> > likely to trigger it; but I'm not aware that we need to work hard to
> > avoid it.
> 
> That theory seems inconsistent with how mdextend() works.  My
> understanding is that we zero-fill the new blocks before populating
> them with actual data precisely to avoid running out of disk space due
> to deferred allocation at the OS level.  If we don't care about
> failures due to deferred allocation at the OS level, we can rip that
> logic out and improve the performance of relation extension
> considerably.  If we do care about failures due to deferred
> allocation, then leaving holes in the file is a bad idea.

That is a fantastic point.

Thanks!

Stephen




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-15 Thread Robert Haas
On Tue, Mar 14, 2017 at 10:30 PM, Amit Kapila  wrote:
> On Tue, Mar 14, 2017 at 10:59 PM, Robert Haas  wrote:
>> On Mon, Mar 13, 2017 at 11:48 PM, Amit Kapila  wrote:
>>> We didn't find any issues with the above testing.
>>
>> Great!  I've committed the latest version of the patch, with some
>> cosmetic changes.
>>
>
> Thanks a lot.  I have noticed that the test case patch for "snapshot
> too old" is not committed.  Do you want to leave that out due to timing
> constraints, or do you want me to submit it separately so that we can
> deal with it on its own and perhaps take input from Kevin?

I've committed that now as well.  Let's see what the buildfarm thinks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-15 Thread Robert Haas
On Wed, Mar 15, 2017 at 10:34 AM, Tom Lane  wrote:
> FWIW, I'm not certain that Stephen is correct to claim that we have
> some concrete problem with sparse files.  We certainly don't *depend*
> on sparse storage anyplace else, nor write data in a way that would be
> likely to trigger it; but I'm not aware that we need to work hard to
> avoid it.

That theory seems inconsistent with how mdextend() works.  My
understanding is that we zero-fill the new blocks before populating
them with actual data precisely to avoid running out of disk space due
to deferred allocation at the OS level.  If we don't care about
failures due to deferred allocation at the OS level, we can rip that
logic out and improve the performance of relation extension
considerably.  If we do care about failures due to deferred
allocation, then leaving holes in the file is a bad idea.
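
The contrast, reduced to plain POSIX terms (an illustration only, not
PostgreSQL code): writing zeros at extend time forces most non-COW
filesystems to allocate blocks immediately, so ENOSPC surfaces during
the extension, while writing only the last block leaves a hole and
defers the failure to the first real write into it.

#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

/* mdextend()-style: allocate (and pay for) the block right now */
static int
extend_eager(int fd, off_t blkno)
{
    char        zeroes[BLCKSZ];

    memset(zeroes, 0, sizeof(zeroes));
    return pwrite(fd, zeroes, BLCKSZ, blkno * BLCKSZ) == BLCKSZ ? 0 : -1;
}

/* _hash_alloc_buckets()-style: write only the final block; every
 * block before it becomes a hole in the file */
static int
extend_sparse(int fd, off_t last_blkno)
{
    char        zeroes[BLCKSZ];

    memset(zeroes, 0, sizeof(zeroes));
    return pwrite(fd, zeroes, BLCKSZ, last_blkno * BLCKSZ) == BLCKSZ ? 0 : -1;
}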

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-15 Thread Stephen Frost
Tom,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> FWIW, I'm not certain that Stephen is correct to claim that we have
> some concrete problem with sparse files.  We certainly don't *depend*
> on sparse storage anyplace else, nor write data in a way that would be
> likely to trigger it; but I'm not aware that we need to work hard to
> avoid it.

I don't have any concrete evidence that there is an issue, just a
recollection of bringing up, in a discussion somewhere along the way,
exactly this idea of turning entirely empty 1G segments into sparse
files during VACUUM to free that space back to the filesystem, and
being told that it'd be bad because we would run into issues later when
we try to write those pages back out and discover we're out of disk
space.

Naturally, there's a lot to consider with such a change to VACUUM, like
how we would WAL-log it and how we'd make sure that nothing is about to
try to use any of those pages (perhaps start at the end of the segment
and lock the pages, or just try to get an exclusive lock on the table,
as we do when we truncate the relation because there's free space at
the end?), but it'd be a very interesting project to consider, if we
are willing to accept sparse heap files.
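
On the filesystem side, the mechanics could be as simple as punching a
hole over the empty segment; a Linux-specific, purely hypothetical
sketch:

#define _GNU_SOURCE
#include <fcntl.h>

#define SEGMENT_SIZE ((off_t) 1024 * 1024 * 1024)   /* 1G segment */

/* Free the disk blocks behind an all-empty segment file while keeping
 * its apparent size unchanged; later writes into it may then fail with
 * ENOSPC, which is exactly the concern discussed above. */
static int
punch_empty_segment(int fd)
{
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     0, SEGMENT_SIZE);
}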

Another interesting idea might be to somehow change the now-empty
relation into just a place-holder that says "assume I'm 1G in size and
empty" but actually have the file be shrunk.

Anyway, I tend to agree with the sentiment that we don't need this
patch to change the behavior here.  Perhaps it'll be good to see what
happens when people start using these sparse hash indexes; maybe that
will shed some light on whether we have anything to worry about here or
not, and if not, then we can consider having sparse files elsewhere.

Thanks!

Stephen




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-15 Thread Tom Lane
Stephen Frost  writes:
> I do see that mdwrite() should handle an out-of-disk-space case, though
> that just makes me wonder what's different here compared to normal
> relations, such that we don't have an issue with a sparse WAL'd hash
> index but can't handle a normal relation being sparse.

*Any* write has to be prepared to handle errors.  There's always a risk of
EIO, and on a COW filesystem you might well get ENOSPC even when you think
you're overwriting previously-allocated storage.  All that we are doing by
pre-allocating storage is reducing the risks a bit, not guaranteeing that
no error will happen.

regards, tom lane




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-15 Thread Tom Lane
Robert Haas  writes:
> Now, that having been said, I'm not sure it's a good idea to tinker
> with the behavior for v10.  We could change the new-splitpoint code so
> that it loops over all the pages in the new splitpoint and zeroes them
> all, instead of just the last one.

Why would we do that?  That would change the behavior from something
that's arguably OK (at least given the right filesystem) to something
that's clearly not very OK.

> It's too late to start making significant or controversial design
> changes at this point, and you could argue that this is a preexisting
> design defect which the WAL-logging patch wasn't necessarily obligated
> to fix.

Yes, and it is certainly that, and no this patch wasn't chartered to fix
it.  I don't have a problem with leaving things like this for v10.

FWIW, I'm not certain that Stephen is correct to claim that we have
some concrete problem with sparse files.  We certainly don't *depend*
on sparse storage anyplace else, nor write data in a way that would be
likely to trigger it; but I'm not aware that we need to work hard to
avoid it.

regards, tom lane




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-15 Thread Robert Haas
On Wed, Mar 15, 2017 at 9:18 AM, Stephen Frost  wrote:
>> I think we have sufficient comments in code especially on top of
>> function _hash_alloc_buckets().
>
> I don't see any comments regarding how we have to be sure to handle
> an out-of-space case properly in the middle of a file because we've made
> it sparse.
>
> I do see that mdwrite() should handle an out-of-disk-space case, though
> that just makes me wonder what's different here compared to normal
> relations, such that we don't have an issue with a sparse WAL'd hash
> index but can't handle a normal relation being sparse.

I agree.  I think that what hash indexes are doing here is
inconsistent with what we do for btree indexes and the heap.  And I
don't think it would be bad to fix that.  We could, of course, go the
other way and do what Tom is suggesting - insist that everybody's got
to be prepared for sparse files, but I would view that as something of
a U-turn.  I think hash indexes are like this because nobody's really
worried about them, since they haven't been crash-safe or performant.
Now that we've made them crash-safe and are on the way to making them
performant, fixing other things is, as Tom already said, a lot more
interesting.

Now, that having been said, I'm not sure it's a good idea to tinker
with the behavior for v10.  We could change the new-splitpoint code so
that it loops over all the pages in the new splitpoint and zeroes them
all, instead of just the last one.  If we all agree that's how it
should work, then it's probably not a lot of work to make the change.
But if what's needed is anything more than that or we don't all agree,
then we'd better just leave it alone and we can revisit it for v11.
It's too late to start making significant or controversial design
changes at this point, and you could argue that this is a preexisting
design defect which the WAL-logging patch wasn't necessarily obligated
to fix.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-15 Thread Stephen Frost
Amit,

* Amit Kapila (amit.kapil...@gmail.com) wrote:
> On Wed, Mar 15, 2017 at 12:53 AM, Stephen Frost  wrote:
> >  If that's the case then
> > this does seem to at least be less of an issue, though I hope we put in
> > appropriate comments about it.
> 
> I think we have sufficient comments in code especially on top of
> function _hash_alloc_buckets().

I don't see any comments regarding how we have to be sure to handle
an out-of-space case properly in the middle of a file because we've made
it sparse.

I do see that mdwrite() should handle an out-of-disk-space case, though
that just makes me wonder what's different here compared to normal
relations, such that we don't have an issue with a sparse WAL'd hash
index but can't handle a normal relation being sparse.

Additional comments here would be overkill if we think that the lower
layers are already all set up to handle such a case properly and we
don't have to do anything special, but I had understood that we didn't
generally think that sparse files would just work.  I'm certainly happy
to be wrong there because, if that's the case, it'd be a great way to
address the problems we have returning large chunks of free space in
the middle of a relation to the OS.  But if there is a real concern
here, then we need to work out what it is and probably add comments, or
possibly code, to address it.

Thanks!

Stephen




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-14 Thread Amit Kapila
On Wed, Mar 15, 2017 at 12:53 AM, Stephen Frost  wrote:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> Stephen Frost  writes:
>> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> >> It's true that as soon as we need another overflow page, that's going to
>> >> get dropped beyond the 2^{N+1}-1 point, and the *apparent* size of the
>> >> index will grow quite a lot.  But any modern filesystem should handle
>> >> that without much difficulty by treating the index as a sparse file.
>>
>> > Uh, last I heard we didn't allow or want sparse files in the backend
>> > because then we have to handle a possible out-of-disk-space failure on
>> > every write.
>>
>> For a hash index, this would happen during a bucket split, which would
>> need to be resilient against out-of-disk-space anyway.
>
> We wouldn't attempt to use the area of the file which is not yet
> allocated except when doing a bucket split?
>

That's right.

>  If that's the case then
> this does seem to at least be less of an issue, though I hope we put in
> appropriate comments about it.
>

I think we have sufficient comments in code especially on top of
function _hash_alloc_buckets().


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-14 Thread Amit Kapila
On Tue, Mar 14, 2017 at 10:59 PM, Robert Haas  wrote:
> On Mon, Mar 13, 2017 at 11:48 PM, Amit Kapila  wrote:
>> We didn't find any issues with the above testing.
>
> Great!  I've committed the latest version of the patch, with some
> cosmetic changes.
>

Thanks a lot.  I have noticed that the test case patch for "snapshot
too old" is not committed.  Do you want to leave that out due to timing
constraints, or do you want me to submit it separately so that we can
deal with it on its own and perhaps take input from Kevin?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-14 Thread Mark Kirkwood

On 15/03/17 06:29, Robert Haas wrote:
> Great!  I've committed the latest version of the patch, with some
> cosmetic changes.
>
> It would be astonishing if there weren't a bug or two left, but I
> think overall this is very solid work, and I think it's time to put
> this out there and see how things go.

Awesome, great work!

Cheers

Mark




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-14 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> >> It's true that as soon as we need another overflow page, that's going to
> >> get dropped beyond the 2^{N+1}-1 point, and the *apparent* size of the
> >> index will grow quite a lot.  But any modern filesystem should handle
> >> that without much difficulty by treating the index as a sparse file.
> 
> > Uh, last I heard we didn't allow or want sparse files in the backend
> > because then we have to handle a possible out-of-disk-space failure on
> > every write.
> 
> For a hash index, this would happen during a bucket split, which would
> need to be resilient against out-of-disk-space anyway.

We wouldn't attempt to use the area of the file which is not yet
allocated except when doing a bucket split?  If that's the case then
this does seem to at least be less of an issue, though I hope we put in
appropriate comments about it.

> >> There may be some work to be done in places like pg_basebackup to
> >> recognize and deal with sparse files, but it doesn't seem like a
> >> reason to panic.
> 
> > Well, and every file-based backup tool out there..
> 
> Weren't you the one leading the charge to deprecate use of file-based
> backup?

No, nor do I see how we would ever be able to deprecate file-based
backups.  If anything, I'd like to see us improve our support for
them.  I'm certainly curious where the notion that I was ever in favor
of deprecating them came from, particularly given all of the effort that
David and I have been pouring into our favorite file-based backup tool
over the past few years.

Thanks!

Stephen




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-14 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> It's true that as soon as we need another overflow page, that's going to
>> get dropped beyond the 2^{N+1}-1 point, and the *apparent* size of the
>> index will grow quite a lot.  But any modern filesystem should handle
>> that without much difficulty by treating the index as a sparse file.

> Uh, last I heard we didn't allow or want sparse files in the backend
> because then we have to handle a possible out-of-disk-space failure on
> every write.

For a hash index, this would happen during a bucket split, which would
need to be resilient against out-of-disk-space anyway.

>> There may be some work to be done in places like pg_basebackup to
>> recognize and deal with sparse files, but it doesn't seem like a
>> reason to panic.

> Well, and every file-based backup tool out there..

Weren't you the one leading the charge to deprecate use of file-based
backup?

regards, tom lane




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-14 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Robert Haas  writes:
> > On Tue, Mar 14, 2017 at 2:14 PM, Tom Lane  wrote:
> >> Robert Haas  writes:
> >>> It's become pretty clear to me that there are a bunch of other things
> >>> about hash indexes which are not exactly great, the worst of which is
> >>> the way they grow by DOUBLING IN SIZE.
> 
> >> Uh, what?  Growth should happen one bucket-split at a time.
> 
> > Technically, the buckets are created one at a time, but because of the
> > way hashm_spares works, the primary bucket pages for all bucket from
> > 2^N to 2^{N+1}-1 must be physically consecutive.  See
> > _hash_alloc_buckets.
> 
> Right, but we only fill those pages one at a time.
> 
> It's true that as soon as we need another overflow page, that's going to
> get dropped beyond the 2^{N+1}-1 point, and the *apparent* size of the
> index will grow quite a lot.  But any modern filesystem should handle
> that without much difficulty by treating the index as a sparse file.

Uh, last I heard we didn't allow or want sparse files in the backend
because then we have to handle a possible out-of-disk-space failure on
every write.

If we think they're OK to do, it'd be awfully nice to figure out a way
for VACUUM to turn an entirely-empty 1G chunk into a sparse file...

> There may be some work to be done in places like pg_basebackup to
> recognize and deal with sparse files, but it doesn't seem like a
> reason to panic.

Well, and every file-based backup tool out there..

Thanks!

Stephen




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-14 Thread Tom Lane
Robert Haas  writes:
> On Tue, Mar 14, 2017 at 2:14 PM, Tom Lane  wrote:
>> Robert Haas  writes:
>>> It's become pretty clear to me that there are a bunch of other things
>>> about hash indexes which are not exactly great, the worst of which is
>>> the way they grow by DOUBLING IN SIZE.

>> Uh, what?  Growth should happen one bucket-split at a time.

> Technically, the buckets are created one at a time, but because of the
> way hashm_spares works, the primary bucket pages for all buckets from
> 2^N to 2^{N+1}-1 must be physically consecutive.  See
> _hash_alloc_buckets.

Right, but we only fill those pages one at a time.

It's true that as soon as we need another overflow page, that's going to
get dropped beyond the 2^{N+1}-1 point, and the *apparent* size of the
index will grow quite a lot.  But any modern filesystem should handle
that without much difficulty by treating the index as a sparse file.

There may be some work to be done in places like pg_basebackup to
recognize and deal with sparse files, but it doesn't seem like a
reason to panic.

regards, tom lane




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-14 Thread Robert Haas
On Tue, Mar 14, 2017 at 2:14 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> It's become pretty clear to me that there are a bunch of other things
>> about hash indexes which are not exactly great, the worst of which is
>> the way they grow by DOUBLING IN SIZE.
>
> Uh, what?  Growth should happen one bucket-split at a time.

Technically, the buckets are created one at a time, but because of the
way hashm_spares works, the primary bucket pages for all buckets from
2^N to 2^{N+1}-1 must be physically consecutive.  See
_hash_alloc_buckets.
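
A standalone illustration of the layout being described (plain
arithmetic only; the real bookkeeping via hashm_spares differs in
detail):

#include <stdio.h>

int
main(void)
{
    /* Primary bucket pages for buckets 2^N .. 2^{N+1}-1 are laid out
     * as one physically consecutive run, so each group doubles the
     * bucket-page space allocated so far. */
    for (unsigned int n = 0; n < 10; n++)
        printf("group %u: buckets %4u .. %4u => %4u consecutive pages\n",
               n, 1u << n, (1u << (n + 1)) - 1, 1u << n);
    return 0;
}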

>> Other things that are not so great:
>
>> - no multi-column support
>> - no amcanunique support
>> - every insert dirties the metapage
>> - splitting is generally too aggressive; very few overflow pages are
>> ever created unless you have piles of duplicates
>
> Yeah.  It's a bit hard to see how to add multi-column support unless you
> give up the property of allowing queries on a subset of the index columns.
> Lack of amcanunique seems like mostly a round-tuit shortage.  The other
> two are implementation deficiencies that maybe we can remedy someday.
>
> Another thing I'd like to see is support for 64-bit hash values.
>
> But all of these were mainly blocked by people not wanting to sink effort
> into hash indexes as long as they were unusable for production due to lack
> of WAL support.  So this is a huge step forward.

Agreed, on all points.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-14 Thread Tom Lane
Robert Haas  writes:
> It's become pretty clear to me that there are a bunch of other things
> about hash indexes which are not exactly great, the worst of which is
> the way they grow by DOUBLING IN SIZE.

Uh, what?  Growth should happen one bucket-split at a time.

> Other things that are not so great:

> - no multi-column support
> - no amcanunique support
> - every insert dirties the metapage
> - splitting is generally too aggressive; very few overflow pages are
> ever created unless you have piles of duplicates

Yeah.  It's a bit hard to see how to add multi-column support unless you
give up the property of allowing queries on a subset of the index columns.
Lack of amcanunique seems like mostly a round-tuit shortage.  The other
two are implementation deficiencies that maybe we can remedy someday.

Another thing I'd like to see is support for 64-bit hash values.

But all of these were mainly blocked by people not wanting to sink effort
into hash indexes as long as they were unusable for production due to lack
of WAL support.  So this is a huge step forward.

regards, tom lane




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-14 Thread Robert Haas
On Tue, Mar 14, 2017 at 1:40 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> Great!  I've committed the latest version of the patch, with some
>> cosmetic changes.
>
> Woo hoo!  That's been a bee in the bonnet for, um, decades.

Yeah.  I'm pretty happy to see that go in.

It's become pretty clear to me that there are a bunch of other things
about hash indexes which are not exactly great, the worst of which is
the way they grow by DOUBLING IN SIZE.  Mithun has submitted a patch
which - if it's acceptable - would alleviate that to some degree by
splitting up each doubling into four separate increments.  But that
still could lead to very large growth increments on big indexes, like
your 32GB index suddenly growing - in one fell swoop - to 40GB.
That's admittedly a lot better than what will happen right now, which
is instantaneous growth to 64GB, but it's not great, and it's not
altogether clear how it could be fixed in a really satisfying way.
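
The arithmetic behind those numbers, as a trivial standalone sketch
(sizes in GB; the four-step scheme is my reading of Mithun's proposal):

#include <stdio.h>

int
main(void)
{
    double      size = 32.0;    /* index size just before the split */

    /* today: the next splitpoint allocates the doubling all at once */
    printf("one step: %.0f GB -> %.0f GB\n", size, size * 2.0);

    /* proposed: the same doubling spread over four allocations */
    for (int step = 1; step <= 4; step++)
        printf("step %d: -> %.0f GB\n", step, size + step * size / 4.0);
    return 0;
}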

Other things that are not so great:

- no multi-column support
- no amcanunique support
- every insert dirties the metapage
- splitting is generally too aggressive; very few overflow pages are
ever created unless you have piles of duplicates

Still, this is a big step forward.  I think it will be useful for
people with long text strings or other very wide data that they want
to index, and hopefully we'll get some feedback from users about where
this works well and where it doesn't.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-14 Thread Tom Lane
Robert Haas  writes:
> Great!  I've committed the latest version of the patch, with some
> cosmetic changes.

Woo hoo!  That's been a bee in the bonnet for, um, decades.

regards, tom lane




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-14 Thread Robert Haas
On Mon, Mar 13, 2017 at 11:48 PM, Amit Kapila  wrote:
> We didn't find any issues with the above testing.

Great!  I've committed the latest version of the patch, with some
cosmetic changes.

It would be astonishing if there weren't a bug or two left, but I
think overall this is very solid work, and I think it's time to put
this out there and see how things go.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-13 Thread Amit Kapila
On Mon, Mar 13, 2017 at 6:26 PM, Amit Kapila  wrote:
> On Thu, Mar 9, 2017 at 3:11 AM, Robert Haas  wrote:
>> On Tue, Mar 7, 2017 at 6:41 PM, Robert Haas  wrote:
>>>>> Great, thanks.  0001 looks good to me now, so committed.
>>>>
>>>> Committed 0002.
>>>
>>> Here are some initial review thoughts on 0003 based on a first read-through.
>>
>
>> It seems like a good test to do with this patch would be to set up a
>> pgbench test on the master with a hash index replacing the usual btree
>> index.  Then, set up a standby and run a read-only pgbench on the
>> standby while a read-write pgbench test runs on the master.  Maybe
>> you've already tried something like that?
>>
>
> I also think so, and apart from that I think it makes sense to perform
> a recovery test with Jeff Janes' tool and probably tests with
> wal_consistency_check.  These tests have now been running for the past
> seven hours or so, and I will keep them running for the whole night to
> see if there is any discrepancy.
>

We didn't find any issues with the above testing.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-13 Thread Amit Kapila
On Sun, Mar 12, 2017 at 8:06 AM, Robert Haas  wrote:
> On Sat, Mar 11, 2017 at 12:20 AM, Amit Kapila  wrote:
>>>  /*
>>> + * Change the shared buffer state in critical section,
>>> + * otherwise any error could make it unrecoverable after
>>> + * recovery.
>>> + */
>>> +START_CRIT_SECTION();
>>> +
>>> +/*
>>>   * Insert tuple on new page, using _hash_pgaddtup to ensure
>>>   * correct ordering by hashkey.  This is a tad inefficient
>>>   * since we may have to shuffle itempointers repeatedly.
>>>   * Possible future improvement: accumulate all the items for
>>>   * the new page and qsort them before insertion.
>>>   */
>>>  (void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
>>>
>>> +END_CRIT_SECTION();
>>>
>>> No way.  You have to start the critical section before making any page
>>> modifications and keep it alive until all changes have been logged.
>>>
>>
>> I think what we need to do here is to accumulate all the tuples that
>> need to be added to new bucket page till either that page has no more
>> space or there are no more tuples remaining in an old bucket.  Then in
>> a critical section, add them to the page using _hash_pgaddmultitup and
>> log the entire new bucket page contents as is currently done in patch
>> log_split_page().
>
> I agree.
>

Okay, I have changed it like that in the patch.

>> Now, here we can choose to log the individual
>> tuples as well instead of a complete page, however not sure if there
>> is any benefit for doing the same because XLogRecordAssemble() will
>> anyway remove the empty space from the page.  Let me know if you have
>> something else in mind.
>
> Well, if you have two pages that are 75% full, and you move a third of
> the tuples from one of them into the other, it's going to be about
> four times more efficient to log only the moved tuples than the whole
> page.
>

I don't see how this could happen during a split.  I mean, if you are
moving 25% of the tuples from an old bucket page to a new bucket page
which is 75% full, it will log the new bucket page only after it gets
full.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-13 Thread Amit Kapila
On Thu, Mar 9, 2017 at 3:11 AM, Robert Haas  wrote:
> On Tue, Mar 7, 2017 at 6:41 PM, Robert Haas  wrote:
>>>> Great, thanks.  0001 looks good to me now, so committed.
>>>
>>> Committed 0002.
>>
>> Here are some initial review thoughts on 0003 based on a first read-through.
>
> More thoughts on the main patch:
>
> The text you've added to the hash index README throws around the term
> "single WAL entry" pretty freely.  For example: "An insertion that
> causes an addition of an overflow page is logged as a single WAL entry
> preceded by a WAL entry for new overflow page required to insert a
> tuple."  When you say a single WAL entry, you make it sound like
> there's only one, but because it's preceded by another WAL entry,
> there are actually two.  I would rephrase this to just avoid using the
> word "single" this way.  For example: "If an insertion causes the
> addition of an overflow page, there will be one WAL entry for the new
> overflow page and a second entry for the insert itself."
>

Okay, changed as per your suggestion.  I have also removed the second
part of the above comment, which states that a newly allocated overflow
page can be used by concurrent transactions.  That won't be true after
fixing the release/reacquire stuff in _hash_addovflpage.  I have also
rephrased another similar usage of "single WAL entry".

> +mode anyway).  It would seem natural to complete the split in VACUUM, but since
> +splitting a bucket might require allocating a new page, it might fail if you
> +run out of disk space.  That would be bad during VACUUM - the reason for
> +running VACUUM in the first place might be that you run out of disk space,
> +and now VACUUM won't finish because you're out of disk space.  In contrast,
> +an insertion can require enlarging the physical file anyway.
>
> But VACUUM actually does try to complete the splits, so this is wrong, I think.
>

As discussed up thread, there is no need to change.

> +Squeeze operation moves tuples from one of the buckets later in the chain to
>
> A squeeze operation
>
> +As Squeeze operation involves writing multiple atomic operations, it is
>
> As a squeeze operation involves multiple atomic operations
>

Changed as per suggestion.

> +if (!xlrec.is_primary_bucket_page)
> +XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
>
> So, in hashbucketcleanup, the page is registered and therefore an FPI
> will be taken if this is the first reference to this page since the
> checkpoint, but hash_xlog_delete won't restore the FPI.  I think that
> exposes us to a torn-page hazard.
> _hash_freeovflpage/hash_xlog_squeeze_page seems to have the same
> disease, as does _hash_squeezebucket/hash_xlog_move_page_contents.
> Not sure exactly what the fix should be.
>

I have used REGBUF_NO_IMAGE while registering the buffer to avoid this
hazard, and during replay used XLogReadBufferForRedoExtended() to read
the block and take a cleanup lock on it.
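
Roughly, the replay side then looks like this (abridged sketch, not the
exact committed code; block 0 is the primary bucket page):

/* assumes the usual hash_xlog.c redo-routine context */
if (XLogRecHasBlockRef(record, 0))
{
    Buffer      bucketbuf;

    /* Restores an FPI if one is present and takes a cleanup lock,
     * mirroring the lock held on the primary bucket page during the
     * original operation. */
    (void) XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL,
                                         true,  /* get_cleanup_lock */
                                         &bucketbuf);

    /* ... replay the changes to the other registered blocks ... */

    if (BufferIsValid(bucketbuf))
        UnlockReleaseBuffer(bucketbuf);
}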

> +/*
> + * As bitmap page doesn't have standard page layout, so this will
> + * allow us to log the data.
> + */
> +XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);
> +XLogRegisterBufData(2, (char *) _page_bit, sizeof(uint32));
>
> If the page doesn't have a standard page layout, why are we passing
> REGBUF_STANDARD?  But I think you've hacked things so that bitmap
> pages DO have a standard page layout, per the change to
> _hash_initbitmapbuffer, in which case the comment seems wrong.
>

Yes, the comment is obsolete and I have removed it.  We need a standard
page layout for the bitmap page so that full-page writes won't exclude
the bitmap page data.

>
> +recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
> +
> +PageSetLSN(BufferGetPage(ovflbuf), recptr);
> +PageSetLSN(BufferGetPage(buf), recptr);
> +
> +if (BufferIsValid(mapbuf))
> +PageSetLSN(BufferGetPage(mapbuf), recptr);
> +
> +if (BufferIsValid(newmapbuf))
> +PageSetLSN(BufferGetPage(newmapbuf), recptr);
>
> I think you forgot to call PageSetLSN() on metabuf.  (There are 5
> block references in the write-ahead log record but only 4 calls to
> PageSetLSN ... surely that can't be right!)
>

Right, fixed.

> +/*
> + * We need to release the locks once the prev pointer of overflow bucket
> + * and next of left bucket are set, otherwise concurrent read might skip
> + * the bucket.
> + */
> +if (BufferIsValid(leftbuf))
> +UnlockReleaseBuffer(leftbuf);
> +UnlockReleaseBuffer(ovflbuf);
>
> I don't believe the comment.  Postponing the lock release wouldn't
> cause the concurrent read to skip the bucket.  It might cause it to be
> delayed unnecessarily, but the following comment already addresses
> that point.  I think you can just nuke this comment.
>

Okay, removed the comment.

> +xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf) ? 

Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-11 Thread Robert Haas
On Sat, Mar 11, 2017 at 12:20 AM, Amit Kapila  wrote:
>>  /*
>> + * Change the shared buffer state in critical section,
>> + * otherwise any error could make it unrecoverable after
>> + * recovery.
>> + */
>> +START_CRIT_SECTION();
>> +
>> +/*
>>   * Insert tuple on new page, using _hash_pgaddtup to ensure
>>   * correct ordering by hashkey.  This is a tad inefficient
>>   * since we may have to shuffle itempointers repeatedly.
>>   * Possible future improvement: accumulate all the items for
>>   * the new page and qsort them before insertion.
>>   */
>>  (void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
>>
>> +END_CRIT_SECTION();
>>
>> No way.  You have to start the critical section before making any page
>> modifications and keep it alive until all changes have been logged.
>>
>
> I think what we need to do here is to accumulate all the tuples that
> need to be added to new bucket page till either that page has no more
> space or there are no more tuples remaining in an old bucket.  Then in
> a critical section, add them to the page using _hash_pgaddmultitup and
> log the entire new bucket page contents as is currently done in patch
> log_split_page().

I agree.

> Now, here we can choose to log the individual
> tuples as well instead of a complete page, however not sure if there
> is any benefit for doing the same because XLogRecordAssemble() will
> anyway remove the empty space from the page.  Let me know if you have
> something else in mind.

Well, if you have two pages that are 75% full, and you move a third of
the tuples from one of them into the other, it's going to be about
four times more efficient to log only the moved tuples than the whole
page.  But logging the whole filled page wouldn't be wrong, just
somewhat inefficient.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-10 Thread Amit Kapila
On Thu, Mar 9, 2017 at 3:11 AM, Robert Haas  wrote:
> On Tue, Mar 7, 2017 at 6:41 PM, Robert Haas  wrote:
>>>> Great, thanks.  0001 looks good to me now, so committed.
>>>
>>> Committed 0002.
>>
>> Here are some initial review thoughts on 0003 based on a first read-through.
>
> More thoughts on the main patch:
>
>  /*
> + * Change the shared buffer state in critical section,
> + * otherwise any error could make it unrecoverable after
> + * recovery.
> + */
> +START_CRIT_SECTION();
> +
> +/*
>   * Insert tuple on new page, using _hash_pgaddtup to ensure
>   * correct ordering by hashkey.  This is a tad inefficient
>   * since we may have to shuffle itempointers repeatedly.
>   * Possible future improvement: accumulate all the items for
>   * the new page and qsort them before insertion.
>   */
>  (void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
>
> +END_CRIT_SECTION();
>
> No way.  You have to start the critical section before making any page
> modifications and keep it alive until all changes have been logged.
>

I think what we need to do here is to accumulate all the tuples that
need to be added to new bucket page till either that page has no more
space or there are no more tuples remaining in an old bucket.  Then in
a critical section, add them to the page using _hash_pgaddmultitup and
log the entire new bucket page contents as is currently done in patch
log_split_page().  Now, here we can choose to log the individual
tuples as well instead of a complete page, however not sure if there
is any benefit for doing the same because XLogRecordAssemble() will
anyway remove the empty space from the page.  Let me know if you have
something else in mind.
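
In outline, the do-side would then follow the standard pattern from
src/backend/access/transam/README; a sketch, assuming the tuples have
already been accumulated in itups[]/itup_offsets[] and that the record
constant is the one the patch's log_split_page() uses:

/* all page modifications and WAL logging inside one critical section */
START_CRIT_SECTION();

_hash_pgaddmultitup(rel, nbuf, itups, itup_offsets, nitups);
MarkBufferDirty(nbuf);

if (RelationNeedsWAL(rel))
{
    XLogRecPtr  recptr;

    XLogBeginInsert();
    /* log the whole new bucket page; XLogRecordAssemble() elides the
     * free space between pd_lower and pd_upper anyway */
    XLogRegisterBuffer(0, nbuf, REGBUF_STANDARD | REGBUF_FORCE_IMAGE);
    recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_PAGE);
    PageSetLSN(BufferGetPage(nbuf), recptr);
}

END_CRIT_SECTION();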

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-10 Thread Robert Haas
On Fri, Mar 10, 2017 at 8:02 AM, Amit Kapila  wrote:
> I was thinking that we would use the REGBUF_NO_IMAGE flag, as is used
> in the XLOG_HEAP2_VISIBLE record for the heap buffer; that will avoid
> any extra I/O and will make it safe as well.  I think that is what
> makes registering the buffer without setting the LSN safe for the
> XLOG_HEAP2_VISIBLE record.

Hmm, yeah, that might work, though I would probably want to spend some
time studying the patch and thinking about it before deciding for
sure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-10 Thread Amit Kapila
On Fri, Mar 10, 2017 at 6:21 PM, Robert Haas  wrote:
> On Fri, Mar 10, 2017 at 7:08 AM, Amit Kapila  wrote:
>> On Fri, Mar 10, 2017 at 8:49 AM, Robert Haas  wrote:
>>> On Thu, Mar 9, 2017 at 9:34 PM, Amit Kapila  wrote:
 Do we really need to set LSN on this page (or mark it dirty), if so
 why?  Are you worried about restoration of FPI or something else?
>>>
>>> I haven't thought through all of the possible consequences and am a
>>> bit too tired to do so just now, but doesn't it seem rather risky to
>>> invent a whole new way of using these xlog functions?
>>> src/backend/access/transam/README describes how to do write-ahead
>>> logging properly, and neither MarkBufferDirty() nor PageSetLSN() is
>>> described as an optional step.
>>
>> Just to salvage my point: I think this is not the first place where
>> we register a buffer but don't set the LSN.  For XLOG_HEAP2_VISIBLE,
>> we register the heap and vm buffers but set the LSN only
>> conditionally, on the heap buffer.  Having said that, I see the value
>> of your point and I am open to doing it that way if you feel that is
>> a better way.
>
> Right, we did that, and it seems to have worked.  But it was a scary
> exception that required a lot of thought.  Now, since we did it once,
> we could do it again, but I am not sure it is for the best.  In the
> case of vacuum, we knew that a vacuum on a large table could otherwise
> emit an FPI for every page, which would almost double the amount of
> write I/O generated by a vacuum - instead of WAL records + heap pages,
> you'd be writing FPIs + heap pages, a big increase.  Now here I think
> that the amount of write I/O that it's costing us is not so clear.
> Unless it's going to be really big, I'd rather do this in some way we
> can think is definitely safe.
>
> Also, if we want to avoid dirtying the primary bucket page by setting
> the LSN, IMHO the way to do that is to store the block number of the
> primary bucket page without registering the buffer, and then during
> recovery, lock that block.  That seems cleaner than hoping that we can
> take an FPI without setting the page LSN.
>

I was thinking that we would use the REGBUF_NO_IMAGE flag, as is used
in the XLOG_HEAP2_VISIBLE record for the heap buffer; that will avoid
any extra I/O and will make it safe as well.  I think that is what
makes registering the buffer without setting the LSN safe for the
XLOG_HEAP2_VISIBLE record.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-10 Thread Robert Haas
On Fri, Mar 10, 2017 at 7:08 AM, Amit Kapila  wrote:
> On Fri, Mar 10, 2017 at 8:49 AM, Robert Haas  wrote:
>> On Thu, Mar 9, 2017 at 9:34 PM, Amit Kapila  wrote:
>>> Do we really need to set LSN on this page (or mark it dirty), if so
>>> why?  Are you worried about restoration of FPI or something else?
>>
>> I haven't thought through all of the possible consequences and am a
>> bit too tired to do so just now, but doesn't it seem rather risky to
>> invent a whole new way of using these xlog functions?
>> src/backend/access/transam/README describes how to do write-ahead
>> logging properly, and neither MarkBufferDirty() nor PageSetLSN() is
>> described as an optional step.
>
> Just to salvage my point: I think this is not the first place where we
> register a buffer but don't set the LSN.  For XLOG_HEAP2_VISIBLE, we
> register the heap and vm buffers but set the LSN only conditionally,
> on the heap buffer.  Having said that, I see the value of your point
> and I am open to doing it that way if you feel that is a better way.

Right, we did that, and it seems to have worked.  But it was a scary
exception that required a lot of thought.  Now, since we did it once,
we could do it again, but I am not sure it is for the best.  In the
case of vacuum, we knew that a vacuum on a large table could otherwise
emit an FPI for every page, which would almost double the amount of
write I/O generated by a vacuum - instead of WAL records + heap pages,
you'd be writing FPIs + heap pages, a big increase.  Now here I think
that the amount of write I/O that it's costing us is not so clear.
Unless it's going to be really big, I'd rather do this in some way we
can think is definitely safe.

Also, if we want to avoid dirtying the primary bucket page by setting
the LSN, IMHO the way to do that is to store the block number of the
primary bucket page without registering the buffer, and then during
recovery, lock that block.  That seems cleaner than hoping that we can
take an FPI without setting the page LSN.
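
Sketched out, with hypothetical record fields (the record would also
need to carry the relfilenode, since no buffer is registered for that
block):

/* do side: identify the primary bucket page in the record's main data
 * instead of registering its buffer (prim_rnode/prim_blkno are
 * hypothetical fields) */
xlrec.prim_rnode = rel->rd_node;
xlrec.prim_blkno = BufferGetBlockNumber(bucket_buf);
XLogRegisterData((char *) &xlrec, sizeof(xlrec));

/* redo side: read that block explicitly and take the cleanup lock */
bucketbuf = XLogReadBufferExtended(xlrec->prim_rnode, MAIN_FORKNUM,
                                   xlrec->prim_blkno, RBM_NORMAL);
if (BufferIsValid(bucketbuf))
    LockBufferForCleanup(bucketbuf);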

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-10 Thread Amit Kapila
On Fri, Mar 10, 2017 at 8:49 AM, Robert Haas  wrote:
> On Thu, Mar 9, 2017 at 9:34 PM, Amit Kapila  wrote:
>> Do we really need to set LSN on this page (or mark it dirty), if so
>> why?  Are you worried about restoration of FPI or something else?
>
> I haven't thought through all of the possible consequences and am a
> bit too tired to do so just now, but doesn't it seem rather risky to
> invent a whole new way of using these xlog functions?
> src/backend/access/transam/README describes how to do write-ahead
> logging properly, and neither MarkBufferDirty() nor PageSetLSN() is
> described as an optional step.
>

Just to salvage my point: I think this is not the first place where we
register a buffer but don't set the LSN.  For XLOG_HEAP2_VISIBLE, we
register the heap and vm buffers but set the LSN only conditionally, on
the heap buffer.  Having said that, I see the value of your point and I
am open to doing it that way if you feel that is a better way.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-09 Thread Robert Haas
On Thu, Mar 9, 2017 at 9:34 PM, Amit Kapila  wrote:
> Do we really need to set LSN on this page (or mark it dirty), if so
> why?  Are you worried about restoration of FPI or something else?

I haven't thought through all of the possible consequences and am a
bit too tired to do so just now, but doesn't it seem rather risky to
invent a whole new way of using these xlog functions?
src/backend/access/transam/README describes how to do write-ahead
logging properly, and neither MarkBufferDirty() nor PageSetLSN() is
described as an optional step.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-09 Thread Amit Kapila
On Thu, Mar 9, 2017 at 11:15 PM, Robert Haas  wrote:
> On Thu, Mar 9, 2017 at 10:23 AM, Amit Kapila  wrote:
>> Right, if we use XLogReadBufferForRedoExtended() instead of
>> XLogReadBufferExtended()/LockBufferForCleanup in the replay routine,
>> then it will restore the FPI when required and can serve our purpose
>> of acquiring a cleanup lock on the primary bucket page.  Yet another
>> way could be to store some information like the block number,
>> relfilenode, and forknum (maybe we can get away with less info) of
>> the primary bucket in the fixed part of each of these records, and
>> then use that to read the page and take a cleanup lock on it.  Now,
>> since in most cases this buffer needs to be registered only when
>> there are multiple overflow pages, I think having that fixed
>> information might hurt in some cases.
>
> Just because the WAL record gets larger?  I think you could arrange to
> record the data only in the cases where it is needed, but I'm also not
> sure it would matter that much anyway.  Off-hand, it seems better than
> having to mark the primary bucket page dirty (because you have to set
> the LSN) every time any page in the chain is modified.
>

Do we really need to set LSN on this page (or mark it dirty), if so
why?  Are you worried about restoration of FPI or something else?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-09 Thread Robert Haas
On Thu, Mar 9, 2017 at 10:23 AM, Amit Kapila  wrote:
>> +mode anyway).  It would seem natural to complete the split in VACUUM, but since
>> +splitting a bucket might require allocating a new page, it might fail if you
>> +run out of disk space.  That would be bad during VACUUM - the reason for
>> +running VACUUM in the first place might be that you run out of disk space,
>> +and now VACUUM won't finish because you're out of disk space.  In contrast,
>> +an insertion can require enlarging the physical file anyway.
>>
>> But VACUUM actually does try to complete the splits, so this is wrong, I think.
>
> Vacuum just tries to clean the remaining tuples from the old bucket.
> Only an insert or split operation tries to complete the split.

Oh, right.  Sorry.

>> +if (!xlrec.is_primary_bucket_page)
>> +XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
>>
>> So, in hashbucketcleanup, the page is registered and therefore an FPI
>> will be taken if this is the first reference to this page since the
>> checkpoint, but hash_xlog_delete won't restore the FPI.  I think that
>> exposes us to a torn-page hazard.
>> _hash_freeovflpage/hash_xlog_squeeze_page seems to have the same
>> disease, as does _hash_squeezebucket/hash_xlog_move_page_contents.
>
> Right, if we use XLogReadBufferForRedoExtended() instead of
> XLogReadBufferExtended()/LockBufferForCleanup in the replay routine,
> then it will restore the FPI when required and can serve our purpose
> of acquiring a cleanup lock on the primary bucket page.  Yet another
> way could be to store some information like the block number,
> relfilenode, and forknum (maybe we can get away with less info) of the
> primary bucket in the fixed part of each of these records, and then
> use that to read the page and take a cleanup lock on it.  Now, since
> in most cases this buffer needs to be registered only when there are
> multiple overflow pages, I think having that fixed information might
> hurt in some cases.

Just because the WAL record gets larger?  I think you could arrange to
record the data only in the cases where it is needed, but I'm also not
sure it would matter that much anyway.  Off-hand, it seems better than
having to mark the primary bucket page dirty (because you have to set
the LSN) every time any page in the chain is modified.

>> +/*
>> + * Note that we still update the page even if it was restored from a full
>> + * page image, because the bucket flag is not included in the image.
>> + */
>>
>> Really?  Why not?  (Because it's part of the special space?)
>
> Yes, because it's part of the special space.  Do you want that to be
> mentioned in the comments?

Seems like a good idea.  Maybe just change the end to "because the
special space is not included in the image" to make it slightly more
explicit.
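
For illustration, the revised comment might then read:

    /*
     * Note that we still update the page even if it was restored from a full
     * page image, because the special space is not included in the image.
     */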

>> Doesn't hash_xlog_split_allocpage need to acquire cleanup locks to
>> guard against concurrent scans?
>
> I don't think so, because the regular code takes it on the old bucket
> to protect the split from concurrent inserts; the cleanup lock on the
> new bucket is just for the sake of consistency.

You might be right, but this definitely does create a state on the
standby that can't occur on the master: the bucket being split is
flagged as such, but the bucket being populated doesn't appear to
exist yet.  Maybe that's OK.  It makes me slightly nervous to have the
standby take a weaker lock than the master does, but perhaps it's
harmless.

>> And also hash_xlog_split_complete?
>
> In the regular code, we are not taking a cleanup lock for this operation.

You're right, sorry.

>> And hash_xlog_delete?  If the regular code path is getting a cleanup
>> lock to protect against concurrent scans, recovery better do it, too.
>>
>
> We are already taking a cleanup lock for this; I'm not sure what
> exactly is missed here.  Can you be more specific?

Never mind, I'm wrong.

>> +/*
>> + * There is no harm in releasing the lock on old bucket before new bucket
>> + * during replay as no other bucket splits can be happening concurrently.
>> + */
>> +if (BufferIsValid(oldbuf))
>> +UnlockReleaseBuffer(oldbuf);
>>
>> And I'm not sure this is safe either.  The issue isn't that there
>> might be another bucket split happening concurrently, but that there
>> might be scans that get confused by seeing the bucket split flag set
>> before the new bucket is created or the metapage updated.
>
> During a scan we check whether the bucket is being populated, so this
> should be safe.

Yeah, I guess so.  This business of the standby taking weaker locks
still seems scary to me, as in hash_xlog_split_allocpage above.

>>  Seems like
>> it would be safer to keep all the locks for this one.
>
> If you feel better that way, I can change it and add a comment saying
> something like "we could release the lock on the bucket being split
> early as well, but we keep it to be consistent with normal operation".

I think that might not be a bad idea.

-- 
Robert 

Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-09 Thread Amit Kapila
On Thu, Mar 9, 2017 at 12:25 AM, Robert Haas  wrote:
> On Wed, Mar 8, 2017 at 7:45 AM, Amit Kapila  wrote:
>> Okay, I can try, but note that currently there is no test related to
>> "snapshot too old" for any other indexes.
>
> Wow, that's surprising.  It seems the snapshot_too_old test only
> checks that this works for a table that has no indexes.  Have you,
> anyway, tested it manually?
>

Yes, I have tested it manually.  I think we need to ensure that the
modified tuple falls on the same page as the old tuple to make the test
work.  The slight difficulty with an index is ensuring that the
modified tuple gets inserted into the same page as the old tuple; this
is even more true for hash indexes.  Also, for the heap, I think it
relies on the hot-pruning stuff, whereas for an index we need to
perform a manual vacuum.  Basically, if we want, we can write a test
for an index, but I am not sure it is worth the pain to write one for
hash indexes when the test for btree is not there.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-09 Thread Amit Kapila
On Thu, Mar 9, 2017 at 3:11 AM, Robert Haas  wrote:
> On Tue, Mar 7, 2017 at 6:41 PM, Robert Haas  wrote:
>>>> Great, thanks.  0001 looks good to me now, so committed.
>>>
>>> Committed 0002.
>>
>> Here are some initial review thoughts on 0003 based on a first read-through.
>
> More thoughts on the main patch:
>
> +mode anyway).  It would seem natural to complete the split in VACUUM, but 
> since
> +splitting a bucket might require allocating a new page, it might fail if you
> +run out of disk space.  That would be bad during VACUUM - the reason for
> +running VACUUM in the first place might be that you run out of disk space,
> +and now VACUUM won't finish because you're out of disk space.  In contrast,
> +an insertion can require enlarging the physical file anyway.
>
> But VACUUM actually does try to complete the splits, so this is wrong, I think.
>

Vacuum just tries to clean the remaining tuples from the old bucket.
Only an insert or split operation tries to complete the split.

>
> +if (!xlrec.is_primary_bucket_page)
> +XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
>
> So, in hashbucketcleanup, the page is registered and therefore an FPI
> will be taken if this is the first reference to this page since the
> checkpoint, but hash_xlog_delete won't restore the FPI.  I think that
> exposes us to a torn-page hazard.
> _hash_freeovflpage/hash_xlog_squeeze_page seems to have the same
> disease, as does _hash_squeezebucket/hash_xlog_move_page_contents.
>

Right, if we use XLogReadBufferForRedoExtended() instead of
XLogReadBufferExtended()/LockBufferForCleanup in the replay routine,
then it will restore the FPI when required and can serve our purpose of
acquiring a cleanup lock on the primary bucket page.  Yet another way
could be to store some information like the block number, relfilenode,
and forknum (maybe we can get away with less info) of the primary
bucket in the fixed part of each of these records and then use that to
read the page and take a cleanup lock.  Now, as this buffer needs to be
registered only in cases where there are multiple overflow pages, I
think carrying the fixed information might hurt in some cases.
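
For illustration, a minimal sketch of the first approach in a redo
routine (the block id 0 and the surrounding structure are assumptions
for the example, not necessarily what the patch does):

    Buffer      bucketbuf;

    /*
     * XLogReadBufferForRedoExtended() restores the full-page image if
     * the record carries one, and with get_cleanup_lock = true it
     * acquires a cleanup lock instead of a plain exclusive lock.
     */
    if (XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true,
                                      &bucketbuf) == BLK_NEEDS_REDO)
    {
        /* apply the page changes here */
    }

    if (BufferIsValid(bucketbuf))
        UnlockReleaseBuffer(bucketbuf);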

>
> +/*
> + * As bitmap page doesn't have standard page layout, so this will
> + * allow us to log the data.
> + */
> +XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);
> +XLogRegisterBufData(2, (char *) _page_bit, sizeof(uint32));
>
> If the page doesn't have a standard page layout, why are we passing
> REGBUF_STANDARD?  But I think you've hacked things so that bitmap
> pages DO have a standard page layout, per the change to
> _hash_initbitmapbuffer, in which case the comment seems wrong.
>

Yes, the comment is obsolete, and I have removed it.  We need a
standard page layout for the bitmap page so that full-page writes won't
exclude the bitmap page data.
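
A rough sketch of that idea, along the lines of _hash_initbitmapbuffer
(treat the specifics as illustrative):

    uint32     *freep = HashPageGetBitmap(pg);

    MemSet(freep, 0xFF, bmsize);

    /*
     * Set pd_lower just past the end of the bitmap data, so that a
     * full-page image taken with REGBUF_STANDARD keeps the bitmap
     * words instead of discarding them as part of the "hole".
     */
    ((PageHeader) pg)->pd_lower = ((char *) freep + bmsize) - (char *) pg;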

>
> +/*
> + * Note that we still update the page even if it was restored from a full
> + * page image, because the bucket flag is not included in the image.
> + */
>
> Really?  Why not?  (Because it's part of the special space?)
>

Yes, because it's part of the special space.  Do you want that to be
mentioned in the comments?

> Doesn't hash_xlog_split_allocpage need to acquire cleanup locks to
> guard against concurrent scans?
>

I don't think so, because the regular code takes it on the old bucket
to protect the split from concurrent inserts; the cleanup lock on the
new bucket is just for the sake of consistency.

> And also hash_xlog_split_complete?

In the regular code, we are not taking a cleanup lock for this operation.

> And hash_xlog_delete?  If the regular code path is getting a cleanup
> lock to protect against concurrent scans, recovery better do it, too.
>

We are already taking a cleanup lock for this; I'm not sure what
exactly is missed here.  Can you be more specific?

> +/*
> + * There is no harm in releasing the lock on old bucket before new bucket
> + * during replay as no other bucket splits can be happening concurrently.
> + */
> +if (BufferIsValid(oldbuf))
> +UnlockReleaseBuffer(oldbuf);
>
> And I'm not sure this is safe either.  The issue isn't that there
> might be another bucket split happening concurrently, but that there
> might be scans that get confused by seeing the bucket split flag set
> before the new bucket is created or the metapage updated.
>

During a scan we check whether the bucket is being populated, so this
should be safe.

>  Seems like
> it would be safer to keep all the locks for this one.
>

If you feel better that way, I can change it and add a comment saying
something like "we could release the lock on the bucket being split
early as well, but we keep it to be consistent with normal operation".

>
> It seems like a good test to do with this patch would be to set up a
> pgbench test on the master with a hash index replacing the usual btree
> index.  Then, set 

Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-08 Thread Robert Haas
On Tue, Mar 7, 2017 at 6:41 PM, Robert Haas  wrote:
>>> Great, thanks.  0001 looks good to me now, so committed.
>>
>> Committed 0002.
>
> Here are some initial review thoughts on 0003 based on a first read-through.

More thoughts on the main patch:

The text you've added to the hash index README throws around the term
"single WAL entry" pretty freely.  For example: "An insertion that
causes an addition of an overflow page is logged as a single WAL entry
preceded by a WAL entry for new overflow page required to insert a
tuple."  When you say a single WAL entry, you make it sound like
there's only one, but because it's preceded by another WAL entry,
there are actually two.  I would rephrase this to just avoid using the
word "single" this way.  For example: "If an insertion causes the
addition of an overflow page, there will be one WAL entry for the new
overflow page and a second entry for the insert itself."

+mode anyway).  It would seem natural to complete the split in VACUUM, but since
+splitting a bucket might require allocating a new page, it might fail if you
+run out of disk space.  That would be bad during VACUUM - the reason for
+running VACUUM in the first place might be that you run out of disk space,
+and now VACUUM won't finish because you're out of disk space.  In contrast,
+an insertion can require enlarging the physical file anyway.

But VACUUM actually does try to complete the splits, so this is wrong, I think.

+Squeeze operation moves tuples from one of the buckets later in the chain to

A squeeze operation

+As Squeeze operation involves writing multiple atomic operations, it is

As a squeeze operation involves multiple atomic operations

+if (!xlrec.is_primary_bucket_page)
+XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);

So, in hashbucketcleanup, the page is registered and therefore an FPI
will be taken if this is the first reference to this page since the
checkpoint, but hash_xlog_delete won't restore the FPI.  I think that
exposes us to a torn-page hazard.
_hash_freeovflpage/hash_xlog_squeeze_page seems to have the same
disease, as does _hash_squeezebucket/hash_xlog_move_page_contents.
Not sure exactly what the fix should be.

+/*
+ * As bitmap page doesn't have standard page layout, so this will
+ * allow us to log the data.
+ */
+XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);
+XLogRegisterBufData(2, (char *) _page_bit, sizeof(uint32));

If the page doesn't have a standard page layout, why are we passing
REGBUF_STANDARD?  But I think you've hacked things so that bitmap
pages DO have a standard page layout, per the change to
_hash_initbitmapbuffer, in which case the comment seems wrong.

(The comment is also a little confusing grammatically, but let's iron
out the substantive issue first.)

+recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
+
+PageSetLSN(BufferGetPage(ovflbuf), recptr);
+PageSetLSN(BufferGetPage(buf), recptr);
+
+if (BufferIsValid(mapbuf))
+PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+if (BufferIsValid(newmapbuf))
+PageSetLSN(BufferGetPage(newmapbuf), recptr);

I think you forgot to call PageSetLSN() on metabuf.  (There are 5
block references in the write-ahead log record but only 4 calls to
PageSetLSN ... surely that can't be right!)
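
Presumably the missing call would be something like:

    PageSetLSN(BufferGetPage(metabuf), recptr);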

+/*
+ * We need to release the locks once the prev pointer of overflow bucket
+ * and next of left bucket are set, otherwise concurrent read might skip
+ * the bucket.
+ */
+if (BufferIsValid(leftbuf))
+UnlockReleaseBuffer(leftbuf);
+UnlockReleaseBuffer(ovflbuf);

I don't believe the comment.  Postponing the lock release wouldn't
cause the concurrent read to skip the bucket.  It might cause it to be
delayed unnecessarily, but the following comment already addresses
that point.  I think you can just nuke this comment.

+xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf) ? true : false;
+xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf) ? true : false;

Maybe just xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf); and similar?
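
That is:

    xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf);
    xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf);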

+/*
+ * We release locks on writebuf and bucketbuf at end of replay operation
+ * to ensure that we hold lock on primary bucket page till end of
+ * operation.  We can optimize by releasing the lock on write buffer as
+ * soon as the operation for same is complete, if it is not same as
+ * primary bucket page, but that doesn't seem to be worth complicating the
+ * code.
+ */
+if (BufferIsValid(writebuf))
+UnlockReleaseBuffer(writebuf);
+
+if (BufferIsValid(bucketbuf))
+UnlockReleaseBuffer(bucketbuf);

Here in hash_xlog_squeeze_page(), you wait until the very end to
release these locks, but in e.g. hash_xlog_addovflpage you release
them considerably earlier for reasons that seem like they would also

Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-08 Thread Robert Haas
On Wed, Mar 8, 2017 at 7:45 AM, Amit Kapila  wrote:
>> I still think this is a bad idea.  Releasing and reacquiring the lock
>> on the master doesn't prevent the standby from seeing intermediate
>> states; the comment, if I understand correctly, is just plain wrong.
> I can remove this release and reacquire stuff, but first, let me try
> to explain what intermediate state I am talking about here.  If we
> don't have a release/reacquire lock here, the standby could see an
> empty overflow page at that location, whereas the master will be able
> to see the new page only after the insertion of the tuple(s) in that
> bucket is finished.

That's true, but you keep speaking as if the release and reacquire
prevents the standby from seeing that state.  It doesn't.  Quite the
opposite: it allows the master to see that intermediate state as well.
I don't think there is a problem with the standby seeing an
intermediate state that can't be seen on the master, provided we've
set up the code to handle that intermediate state properly, which
hopefully we have.

>> + * Initialise the freed overflow page, here we can't complete zeroed the
>>
>> Don't use British spelling, and don't use a comma to join two
>> sentences.  Also "here we can't complete zeroed" doesn't seem right; I
>> don't know what that means.
>
> What the comment means to say is that we can't initialize the page as zero
> (something like below)
> MemSet(ovflpage, 0, BufferGetPageSize(ovflbuf));
>
> There is a typo in the sentence: /complete/completely

I think you can just say we initialize the page.  Saying we can't
initialize it by zeroing it seems pretty obvious, both from the code
and from the general theory of how stuff works.  If you did want to
explain it, I think you'd write something like "Just zeroing the page
won't work, because ."

>> +/*
>> + * We need to release and if required reacquire the lock on
>> + * rbuf to ensure that standby shouldn't see an intermediate
>> + * state of it.  If we don't release the lock, after replay of
>> + * XLOG_HASH_MOVE_PAGE_CONTENTS on standby users will be able to
>> + * view the results of partial deletion on rblkno.
>> + */
>> +LockBuffer(rbuf, BUFFER_LOCK_UNLOCK);
>>
>> If you DO release the lock, this will STILL be true, because what
>> matters on the standby is what the redo code does.
>
> That's right; what I intend to do here is to release the lock on the
> master where it will be released on the standby.  In this case, we
> can't ensure that what a user can see on the master is the same as on
> the standby after this WAL record is replayed, because on the master
> we hold an exclusive lock on the write buf, so no one can read the
> contents of the read buf (the read buf is located after the write
> buf), whereas on the standby it will be possible to read the contents
> of the read buf.  I think this is not a correctness issue, so we might
> want to leave it as it is.  What do you say?

Well, as in the case above, you're doing extra work to make sure that
every state which can be observed on the standby can also be observed
on the master.  But that has no benefit, and it does have a cost: you
have to release and reacquire a lock that you could otherwise retain,
saving CPU cycles and code complexity.  So I think the way you've got
it is not good.

>> + * We can log the exact changes made to meta page, however as no
>> + * concurrent operation could see the index during the replay of 
>> this
>> + * record, we can perform the operations during replay as they are
>> + * done here.
>>
>> Don't use a comma to join two sentences.  Also, I don't understand
>> anything up to the "however".
>>
>
> I mean the below changes made to the meta buf (refer to the existing code):
> metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
>
> metap->hashm_nmaps++;
>
> I think it will be clear if you look at both the DO and REDO
> operations of XLOG_HASH_INIT_BITMAP_PAGE.  Let me know if you still
> think that the comment needs to be changed.

I think it's not very clear.  I think what this boils down to is that
this is another place where you've got a comment to explain that you
didn't log the entire metapage, but I don't think anyone would expect
that anyway, so it doesn't seem like a very important thing to
explain.  If you want a comment here, maybe something like /* This is
safe only because nobody else can be modifying the index at this
stage; it's only visible to the transaction that is creating it */

>> + * Set pd_lower just past the end of the metadata.  This is to log
>> + * full_page_image of metapage in xloginsert.c.
>>
>> Why the underscores?
>>
>> + * won't be a leak on standby, as next split will consume this space.
>
> No specific reason, just trying to resemble full_page_writes, but can
> change if you feel it doesn't make much sense.

Well, usually I don't 

Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-08 Thread Amit Kapila
On Wed, Mar 8, 2017 at 5:11 AM, Robert Haas  wrote:
>
> Here are some initial review thoughts on 0003 based on a first read-through.
>
>
> +/*
> + * we need to release and reacquire the lock on overflow buffer to ensure
> + * that standby shouldn't see an intermediate state of it.
> + */
> +LockBuffer(ovflbuf, BUFFER_LOCK_UNLOCK);
> +LockBuffer(ovflbuf, BUFFER_LOCK_EXCLUSIVE);
>
> I still think this is a bad idea.  Releasing and reacquiring the lock
> on the master doesn't prevent the standby from seeing intermediate
> states; the comment, if I understand correctly, is just plain wrong.
>

I can remove this release and reacquire stuff, but first, let me try to
explain what intermediate state I am talking about here.  If we don't
have a release/reacquire lock here, the standby could see an empty
overflow page at that location, whereas the master will be able to see
the new page only after the insertion of the tuple(s) in that bucket is
finished.  I understand that there is not much functional impact from
this change, but I still wanted to make you understand why I have added
this code.  I am okay with removing this code if you still feel this is
just a marginal case and we should not bother about it.  Let me know
what you think.

>
> + * Initialise the freed overflow page, here we can't complete zeroed the
>
> Don't use British spelling, and don't use a comma to join two
> sentences.  Also "here we can't complete zeroed" doesn't seem right; I
> don't know what that means.

What the comment means to say is that we can't initialize the page as zero
(something like below)
MemSet(ovflpage, 0, BufferGetPageSize(ovflbuf));

There is a typo in the sentence: /complete/completely

>
> + * To replay meta page changes, we can log the entire metapage which
> + * doesn't seem advisable considering size of hash metapage or we can
> + * log the individual updated values which seems doable, but we 
> prefer
> + * to perform exact operations on metapage during replay as are done
> + * during actual operation.  That looks straight forward and has an
> + * advantage of using lesser space in WAL.
>
> This sounds like it is describing two alternatives that were not
> selected and then the one that was selected, but I don't understand
> the difference between the second non-selected alternative and what
> was actually picked.
>

Actually, the second alternative was picked.

> I'd just write "We need only log the one field
> from the metapage that has changed." or maybe drop the comment
> altogether.
>

I have added the comment so that others can understand why we don't
want to log the entire meta page (I think in some other indexes, we
always log the entire meta page, so such a question can arise).
However, after reading your comment, I think it is overkill and I
prefer to drop the entire comment unless you feel differently.

>
> +/*
> + * We need to release and if required reacquire the lock on
> + * rbuf to ensure that standby shouldn't see an intermediate
> + * state of it.  If we don't release the lock, after replay of
> + * XLOG_HASH_MOVE_PAGE_CONTENTS on standby users will be able to
> + * view the results of partial deletion on rblkno.
> + */
> +LockBuffer(rbuf, BUFFER_LOCK_UNLOCK);
>
> If you DO release the lock, this will STILL be true, because what
> matters on the standby is what the redo code does.
>

That's right; what I intend to do here is to release the lock on the
master where it will be released on the standby.  In this case, we
can't ensure that what a user can see on the master is the same as on
the standby after this WAL record is replayed, because on the master we
hold an exclusive lock on the write buf, so no one can read the
contents of the read buf (the read buf is located after the write buf),
whereas on the standby it will be possible to read the contents of the
read buf.  I think this is not a correctness issue, so we might want to
leave it as it is.  What do you say?

>
> + * We can log the exact changes made to meta page, however as no
> + * concurrent operation could see the index during the replay of this
> + * record, we can perform the operations during replay as they are
> + * done here.
>
> Don't use a comma to join two sentences.  Also, I don't understand
> anything up to the "however".
>

I mean the below changes made to the meta buf (refer to the existing code):
metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;

metap->hashm_nmaps++;

I think it will be clear if you look at both the DO and REDO operations
of XLOG_HASH_INIT_BITMAP_PAGE.  Let me know if you still think that the
comment needs to be changed.

>
> + * Set pd_lower just past the end of the metadata.  This is to log
> + * full_page_image of metapage in xloginsert.c.
>
> Why the underscores?
>
> + * won't be a leak on standby, as next split will 

Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-07 Thread Amit Kapila
On Wed, Mar 8, 2017 at 3:38 AM, Robert Haas  wrote:
> On Wed, Mar 1, 2017 at 4:18 AM, Robert Haas  wrote:
>> On Tue, Feb 28, 2017 at 7:31 PM, Amit Kapila  wrote:
>>> Yeah, actually those were added later in Enable-WAL-for-Hash* patch,
>>> but I think as this patch is standalone, so we should not remove it
>>> from their existing usage, I have added those back and rebased the
>>> remaining patches.
>>
>> Great, thanks.  0001 looks good to me now, so committed.
>
> Committed 0002.
>

Thanks.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-07 Thread Robert Haas
On Tue, Mar 7, 2017 at 5:08 PM, Robert Haas  wrote:
> On Wed, Mar 1, 2017 at 4:18 AM, Robert Haas  wrote:
>> On Tue, Feb 28, 2017 at 7:31 PM, Amit Kapila  wrote:
>>> Yeah, actually those were added later in Enable-WAL-for-Hash* patch,
>>> but I think as this patch is standalone, so we should not remove it
>>> from their existing usage, I have added those back and rebased the
>>> remaining patches.
>>
>> Great, thanks.  0001 looks good to me now, so committed.
>
> Committed 0002.

Here are some initial review thoughts on 0003 based on a first read-through.

+ affected by this setting.  Example include system catalogs.  For

Not grammatical.  Perhaps "affected by this setting, such as system catalogs".

-mark meta page dirty and release buffer content lock and pin
+ mark meta page dirty and write WAL for update of metapage
+ release buffer content lock and pin

Was the change in the indentation level intentional?

+accomodated in the initial set of buckets and it is not unusual to have large

Spelling.

+As deletion involves multiple atomic operations, it is quite possible that
+system crashes after (a) removing tuples from some of the bucket pages
+(b) before clearing the garbage flag (c) before updating the metapage.  If the

Need two commas and an "or": (a) removing tuples from some of the
bucket pages, (b) before clearing the garbage flag, or (c)

+quite possible, that system crashes before completing the operation on

No comma.  Also, "that the system crashes".

+the index will remain bloated and can impact performance of read and

and this can impact

+/*
+ * we need to release and reacquire the lock on overflow buffer to ensure
+ * that standby shouldn't see an intermediate state of it.
+ */
+LockBuffer(ovflbuf, BUFFER_LOCK_UNLOCK);
+LockBuffer(ovflbuf, BUFFER_LOCK_EXCLUSIVE);

I still think this is a bad idea.  Releasing and reacquiring the lock
on the master doesn't prevent the standby from seeing intermediate
states; the comment, if I understand correctly, is just plain wrong.
What it does do is make the code on the master more complicated, since
now the page returned by _hash_addovflpage might theoretically fill up
while you are busy releasing and reacquiring the lock.

+ *Add the tuples (itups) to wbuf in this function, we could do that in the
+ *caller as well.  The advantage of doing it here is we can easily write

Change comma to period.  Then: "We could easily do that in the caller,
but the advantage..."

+ * Initialise the freed overflow page, here we can't complete zeroed the

Don't use British spelling, and don't use a comma to join two
sentences.  Also "here we can't complete zeroed" doesn't seem right; I
don't know what that means.

+ * To replay meta page changes, we can log the entire metapage which
+ * doesn't seem advisable considering size of hash metapage or we can
+ * log the individual updated values which seems doable, but we prefer
+ * to perform exact operations on metapage during replay as are done
+ * during actual operation.  That looks straight forward and has an
+ * advantage of using lesser space in WAL.

This sounds like it is describing two alternatives that were not
selected and then the one that was selected, but I don't understand
the difference between the second non-selected alternative and what
was actually picked.  I'd just write "We need only log the one field
from the metapage that has changed." or maybe drop the comment
altogether.
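
For illustration, logging just the changed field might look roughly
like this (the block id and the choice of hashm_firstfree are only
examples, not necessarily what this record logs):

    XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
    XLogRegisterBufData(1, (char *) &metap->hashm_firstfree,
                        sizeof(uint32));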

+XLogEnsureRecordSpace(0, XLR_NORMAL_RDATAS + nitups);

The use of XLR_NORMAL_RDATAS here seems wrong.  Presumably you should
pass the number of rdatas you will actually need, which is some
constant plus nitups.  I think you should just write that constant
here directly, whether or not it happens to be equal to
XLR_NORMAL_RDATAS.
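
In other words, something like the following, where the fixed count of
4 is only illustrative:

    XLogEnsureRecordSpace(0, 4 + nitups);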

+/*
+ * We need to release and if required reacquire the lock on
+ * rbuf to ensure that standby shouldn't see an intermediate
+ * state of it.  If we don't release the lock, after replay of
+ * XLOG_HASH_MOVE_PAGE_CONTENTS on standby users will be able to
+ * view the results of partial deletion on rblkno.
+ */
+LockBuffer(rbuf, BUFFER_LOCK_UNLOCK);

If you DO release the lock, this will STILL be true, because what
matters on the standby is what the redo code does.  This code doesn't
even execute on the standby, so how could it possibly affect what
intermediate states are visible there?

+/*
+ * Here, we could have copied the entire metapage as well to WAL
+ * record and restore it as it is during replay.  However the size of
+ * metapage is not small (more than 600 bytes), so recording just the
+ * information required to construct metapage.

Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-07 Thread Robert Haas
On Wed, Mar 1, 2017 at 4:18 AM, Robert Haas  wrote:
> On Tue, Feb 28, 2017 at 7:31 PM, Amit Kapila  wrote:
>> Yeah, actually those were added later in Enable-WAL-for-Hash* patch,
>> but I think as this patch is standalone, so we should not remove it
>> from their existing usage, I have added those back and rebased the
>> remaining patches.
>
> Great, thanks.  0001 looks good to me now, so committed.

Committed 0002.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-03-01 Thread Robert Haas
On Tue, Feb 28, 2017 at 7:31 PM, Amit Kapila  wrote:
> Yeah, actually those were added later in Enable-WAL-for-Hash* patch,
> but I think as this patch is standalone, so we should not remove it
> from their existing usage, I have added those back and rebased the
> remaining patches.

Great, thanks.  0001 looks good to me now, so committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-02-27 Thread Robert Haas
On Thu, Feb 16, 2017 at 8:16 PM, Amit Kapila  wrote:
> Attached are refactoring patches.  The WAL patch needs some changes
> based on the above comments, so I will post it later.

After some study, I have committed 0001, and also committed 0002 and
0003 as a single commit, with only minor tweaks.

I haven't made a full study of 0004 yet, but I'm a bit concerned about
a couple of the hunks, specifically this one:

@@ -985,9 +962,9 @@ _hash_splitbucket_guts(Relation rel,

 if (PageGetFreeSpace(npage) < itemsz)
 {
-/* write out nbuf and drop lock, but keep pin */
-MarkBufferDirty(nbuf);
+/* drop lock, but keep pin */
 LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
+
 /* chain to a new overflow page */
 nbuf = _hash_addovflpage(rel, metabuf, nbuf,
(nbuf == bucket_nbuf) ? true : false);
 npage = BufferGetPage(nbuf);

And also this one:

@@ -1041,17 +1024,6 @@ _hash_splitbucket_guts(Relation rel,
  * To avoid deadlocks due to locking order of buckets, first lock the old
  * bucket and then the new bucket.
  */
-if (nbuf == bucket_nbuf)
-{
-MarkBufferDirty(bucket_nbuf);
-LockBuffer(bucket_nbuf, BUFFER_LOCK_UNLOCK);
-}
-else
-{
-MarkBufferDirty(nbuf);
-_hash_relbuf(rel, nbuf);
-}
-
 LockBuffer(bucket_obuf, BUFFER_LOCK_EXCLUSIVE);
 opage = BufferGetPage(bucket_obuf);
 oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);

I haven't quite grasped what's going on here, but it looks like those
MarkBufferDirty() calls didn't move someplace else, but rather just
vanished.  That would seem to be not good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-02-16 Thread Amit Kapila
On Thu, Feb 16, 2017 at 8:16 PM, Amit Kapila  wrote:
> On Thu, Feb 16, 2017 at 7:15 AM, Robert Haas  wrote:
>
> Attached are refactoring patches.  The WAL patch needs some changes
> based on the above comments, so I will post it later.
>

Attached is a rebased patch to enable WAL logging for hash indexes.
Note that this needs to be applied on top of the refactoring patches
[1] sent in the previous e-mail.

[1]  - 
https://www.postgresql.org/message-id/CAA4eK1JCk-abtsVMMP8xqGZktUH73ZmLaZ6b_%2B-oCRtRkdqPrQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


0006-Enable-WAL-for-Hash-Indexes.patch
Description: Binary data



Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-02-16 Thread Amit Kapila
On Thu, Feb 16, 2017 at 7:15 AM, Robert Haas  wrote:
> On Mon, Feb 13, 2017 at 10:22 AM, Amit Kapila  wrote:
>> As discussed, attached are refactoring patches and a patch to enable
>> WAL for the hash index on top of them.
>
> Thanks.  I think that the refactoring patches shouldn't add
> START_CRIT_SECTION() and END_CRIT_SECTION() calls; let's leave that
> for the final patch.  Nor should their comments reference the idea of
> writing WAL; that should also be for the final patch.
>

Okay, changed as per suggestion.

> PageGetFreeSpaceForMulTups doesn't seem like a great name.
> PageGetFreeSpaceForMultipleTuples?

Okay, changed as per suggestion.

>  Or maybe just use
> PageGetExactFreeSpace and then do the accounting in the caller.  I'm not
> sure it's really worth having a function just to subtract a multiple
> of sizeof(ItemIdData), and it would actually be more efficient to have
> the caller take care of this, since you wouldn't need to keep
> recalculating the value for every iteration of the loop.
>

I have tried this, but I think even if we track it outside, we still
need something like PageGetFreeSpaceForMultipleTuples for the cases
where we have to try the next write page(s) in which the data (a set of
index tuples) can be accommodated.
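
For reference, a sketch of what such a helper's accounting amounts to
(illustrative, not necessarily the committed code):

    Size
    PageGetFreeSpaceForMultipleTuples(Page page, int ntups)
    {
        Size        space = PageGetExactFreeSpace(page);

        /* each tuple also consumes one line pointer */
        if (space < ntups * sizeof(ItemIdData))
            return 0;

        return space - ntups * sizeof(ItemIdData);
    }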

> I think we ought to just rip (as a preliminary patch) out the support
> for freeing an overflow page in the middle of a bucket chain.  That
> code is impossible to hit, I believe, and instead of complicating the
> WAL machinery to cater to a nonexistent case, maybe we ought to just
> get rid of it.
>

I also could not think of a case where we need to care about a page in
the middle of the bucket chain, as per its current usage.  In
particular, are you worried about the code related to the next block
number in _hash_freeovflpage()?  Surely, we can remove it, but what
complexity are you referring to here?  There is some additional
book-keeping for the primary bucket page, but there is nothing much
about maintaining a backward link.  One more thing to note is that this
is an exposed API, so to avoid it getting used in some wrong way, we
might want to make it static.

> +   /*
> +* We need to release and if required reacquire the lock on
> +* rbuf to ensure that standby shouldn't see an intermediate
> +* state of it.
> +*/
>
> How does releasing and reacquiring the lock on the master affect
> whether the standby can see an intermediate state?

I am talking about the intermediate state of the standby with respect
to the master.  We will release the lock on the standby after replaying
the WAL for moving tuples from the read page to the write page, whereas
the master will hold it for a much longer period (till we move all the
tuples from the read page and free that page).  I understand that there
is no correctness issue here, but I was trying to make operations on
the master and standby similar; another thing is that it will help us
obey the coding rule of releasing the lock on buffers after writing the
WAL record.  I see no harm in maintaining the existing coding pattern;
however, if you think that is not important in this case, then I can
change the code to avoid releasing the lock on the read page.
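
For context, the coding rule referred to here is the usual WAL pattern,
roughly as follows (details elided; "info" stands in for the record
type):

    START_CRIT_SECTION();

    /* ... modify the page ... */
    MarkBufferDirty(buf);

    if (RelationNeedsWAL(rel))
    {
        XLogRecPtr  recptr;

        XLogBeginInsert();
        XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
        recptr = XLogInsert(RM_HASH_ID, info);
        PageSetLSN(BufferGetPage(buf), recptr);
    }

    END_CRIT_SECTION();

    /* only now give up the content lock */
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);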

>  That sounds
> totally wrong to me (and also doesn't belong in a refactoring patch
> anyway since we're not emitting any WAL here).
>

Agreed; even if we want to make such a change, it should be part of the
WAL patch.  So for now, I have reverted that change.

> -   Assert(PageIsNew(page));
>
> This worries me a bit.  What's going on here?
>

This is related to the below change.

+ /*
+ * Initialise the freed overflow page, here we can't complete zeroed the
+ * page as WAL replay routines expect pages to be initialized. See
+ * explanation of RBM_NORMAL mode atop XLogReadBufferExtended.
+ */
+ _hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));

Basically, we need this routine to initialize freed overflow pages.
Also, if you see the similar function in btree (_bt_pageinit(Page
page, Size size)), it doesn't have any such Assertion.

Attached are refactoring patches.  The WAL patch needs some changes
based on the above comments, so I will post it later.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


0001-Expose-a-new-API-_hash_pgaddmultitup.patch
Description: Binary data


0002-Expose-an-API-hashinitbitmapbuffer.patch
Description: Binary data


0003-Restructure-_hash_addovflpage.patch
Description: Binary data


0004-Restructure-split-bucket-code.patch
Description: Binary data


0005-Restructure-hash-index-creation.patch
Description: Binary data



Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-02-15 Thread Robert Haas
On Mon, Feb 13, 2017 at 10:22 AM, Amit Kapila  wrote:
> As discussed, attached are refactoring patches and a patch to enable
> WAL for the hash index on top of them.

Thanks.  I think that the refactoring patches shouldn't add
START_CRIT_SECTION() and END_CRIT_SECTION() calls; let's leave that
for the final patch.  Nor should their comments reference the idea of
writing WAL; that should also be for the final patch.

PageGetFreeSpaceForMulTups doesn't seem like a great name.
PageGetFreeSpaceForMultipleTuples?  Or maybe just use
PageGetExactFreeSpace and then do the accounting in the caller.  I'm not
sure it's really worth having a function just to subtract a multiple
of sizeof(ItemIdData), and it would actually be more efficient to have
the caller take care of this, since you wouldn't need to keep
recalculating the value for every iteration of the loop.
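
For illustration, the caller-side accounting might look roughly like
this (names are placeholders):

    Size        avail = PageGetExactFreeSpace(page);
    int         i;

    for (i = 0; i < nitups; i++)
    {
        Size        needed = MAXALIGN(IndexTupleSize(itups[i])) +
                             sizeof(ItemIdData);

        if (avail < needed)
            break;              /* current page is full */
        avail -= needed;        /* no per-iteration recomputation */
    }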

I think we ought to just rip (as a preliminary patch) out the support
for freeing an overflow page in the middle of a bucket chain.  That
code is impossible to hit, I believe, and instead of complicating the
WAL machinery to cater to a nonexistent case, maybe we ought to just
get rid of it.

+   /*
+* We need to release and if required reacquire the lock on
+* rbuf to ensure that standby shouldn't see an intermediate
+* state of it.
+*/

How does releasing and reacquiring the lock on the master affect
whether the standby can see an intermediate state?  That sounds
totally wrong to me (and also doesn't belong in a refactoring patch
anyway since we're not emitting any WAL here).

-   Assert(PageIsNew(page));

This worries me a bit.  What's going on here?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-02-15 Thread Kuntal Ghosh
On Mon, Feb 13, 2017 at 8:52 PM, Amit Kapila  wrote:
> As discussed, attached are refactoring patches and a patch to enable
> WAL for the hash index on top of them.

0006-Enable-WAL-for-Hash-Indexes.patch needs to be rebased after
commit 8da9a226369e9ceec7cef1.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-02-09 Thread Robert Haas
On Thu, Jan 12, 2017 at 10:23 PM, Amit Kapila  wrote:
> On Fri, Jan 13, 2017 at 1:04 AM, Jesper Pedersen
>  wrote:
>> On 12/27/2016 01:58 AM, Amit Kapila wrote:
>>> After recent commits 7819ba1e and 25216c98, this patch requires a
>>> rebase.  Attached is the rebased patch.
>>
>> This needs a rebase after commit e898437.
>
> Attached find the rebased patch.

Well, I've managed to break this again by committing more things.

I think it would be helpful if you broke this into a series of
preliminary refactoring patches and then a final patch that actually
adds WAL-logging. The things that look like preliminary refactoring to
me are:

- Adding _hash_pgaddmultitup and using it in various places.
- Adding and freeing overflow pages has been extensively reworked.
- Similarly, there is some refactoring of how bitmap pages get initialized.
- Index initialization has been rejiggered significantly.
- Bucket splits have been rejiggered.

Individually those changes don't look all that troublesome to review,
but altogether it's quite a lot.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-02-03 Thread Amit Kapila
On Sat, Feb 4, 2017 at 4:37 AM, Jeff Janes  wrote:
> On Thu, Jan 12, 2017 at 7:23 PM, Amit Kapila 
> wrote:
>>
>> On Fri, Jan 13, 2017 at 1:04 AM, Jesper Pedersen
>>  wrote:
>> > On 12/27/2016 01:58 AM, Amit Kapila wrote:
>> >>
>> >> After recent commits 7819ba1e and 25216c98, this patch requires a
>> >> rebase.  Attached is the rebased patch.
>> >>
>> >
>> > This needs a rebase after commit e898437.
>> >
>>
>> Attached find the rebased patch.
>>
>> Thanks!
>
>
> I've put this through a lot of crash-recovery testing using different
> variations of my previously posted testing harness, and have not had any
> problems.
>

Thanks for detailed testing of this patch.

> I occasionally get deadlocks which are properly detected.  I think this is
> just a user-space problem, not a bug, but will describe it anyway just in
> case...
>
> What happens is that connections occasionally create 10,000 nuisance tuples
> all using the same randomly chosen negative integer index value (the one the
> hash index is on), and then some time later delete those tuples using the
> memorized negative index value, to force the page split and squeeze to get
> exercised.  If two connections happen to choose the same negative index for
> their nuisance tuples, and then try to delete "their" tuples at the same
> time, and they are deleting the same 20,000 "nuisance" tuples concurrently
> with a bucket split/squeeze/something, they might see the tuples in a
> different order from each other and so deadlock against each other waiting
> on each others transaction locks due to "locked" tuples.  Is that a
> plausible explanation?
>

Yes, that is the right explanation for the deadlock you are seeing.  A
few days back, Jesper and Ashutosh Sharma also faced the same problem
[1] with a similar test (non-unique values in a hash index).  To
explain in brief how this can happen: the squeeze operation moves
tuples from the far end of the bucket to the rear end, and it is quite
possible that different backends see tuples with the same values in a
different order.


[1] - 
https://www.postgresql.org/message-id/CAE9k0PmBGrNO_s%2ByKn72VO--ypaXWemOsizsWr5cHjEv0X4ERg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-02-03 Thread Jeff Janes
On Thu, Jan 12, 2017 at 7:23 PM, Amit Kapila 
wrote:

> On Fri, Jan 13, 2017 at 1:04 AM, Jesper Pedersen
>  wrote:
> > On 12/27/2016 01:58 AM, Amit Kapila wrote:
> >>
> >> After recent commits 7819ba1e and 25216c98, this patch requires a
> >> rebase.  Attached is the rebased patch.
> >>
> >
> > This needs a rebase after commit e898437.
> >
>
> Attached find the rebased patch.
>
> Thanks!
>

I've put this through a lot of crash-recovery testing using different
variations of my previously posted testing harness, and have not had any
problems.

I occasionally get deadlocks which are properly detected.  I think this is
just a user-space problem, not a bug, but will describe it anyway just in
case...

What happens is that connections occasionally create 10,000 nuisance tuples
all using the same randomly chosen negative integer index value (the one
the hash index is on), and then some time later delete those tuples using
the memorized negative index value, to force the page split and squeeze to
get exercised.  If two connections happen to choose the same negative index
for their nuisance tuples, and then try to delete "their" tuples at the
same time, and they are deleting the same 20,000 "nuisance" tuples
concurrently with a bucket split/squeeze/something, they might see the
tuples in a different order from each other and so deadlock against each
other waiting on each others transaction locks due to "locked" tuples.  Is
that a plausible explanation?

Cheers,

Jeff


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-01-31 Thread Michael Paquier
On Fri, Jan 13, 2017 at 12:23 PM, Amit Kapila  wrote:
> On Fri, Jan 13, 2017 at 1:04 AM, Jesper Pedersen
>  wrote:
>> On 12/27/2016 01:58 AM, Amit Kapila wrote:
>>>
>>> After recent commits 7819ba1e and 25216c98, this patch requires a
>>> rebase.  Attached is the rebased patch.
>>>
>>
>> This needs a rebase after commit e898437.
>>
>
> Attached find the rebased patch.

The patch applies, and there have been no reviews since the last
version was submitted.  So it has been moved to CF 2017-03.
-- 
Michael




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2017-01-12 Thread Jesper Pedersen

On 12/27/2016 01:58 AM, Amit Kapila wrote:

After recent commits 7819ba1e and 25216c98, this patch requires a
rebase.  Attached is the rebased patch.



This needs a rebase after commit e898437.

Best regards,
 Jesper





Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-12-22 Thread Amit Kapila
On Thu, Dec 22, 2016 at 9:56 PM, Robert Haas  wrote:
> On Mon, Dec 5, 2016 at 2:46 AM, Amit Kapila  wrote:
>>> I'll review after that, since I have other things to review meanwhile.
>>
>> Attached, please find the rebased patch with this e-mail.  There is
>> no fundamental change in the patch except for adapting the new
>> locking strategy in the squeeze operation.  I have done sanity
>> testing of the patch with a master-standby setup, and Kuntal has
>> helped me verify it with his wal-consistency checker patch.
>
> This patch again needs a rebase,
>

I had completed it last night and left it running overnight tests to
check the sanity of the patch, but I guess now I need another rebase.
Anyway, I feel this is all for the betterment of the final patch.

> but before you do that I'd like to
> make it harder by applying the attached patch to remove
> _hash_chgbufaccess(), which I think is a bad plan for more or less the
> same reasons that motivated the removal of _hash_wrtbuf() in commit
> 25216c98938495fd741bf585dcbef45b3a9ffd40. I think there's probably
> more simplification and cleanup that can be done afterward in the wake
> of this; what I've done here is just a mechanical replacement of
> _hash_chgbufaccess() with LockBuffer() and/or MarkBufferDirty().

The patch has replaced the usage of HASH_READ/HASH_WRITE with
BUFFER_LOCK_SHARE/BUFFER_LOCK_EXCLUSIVE, which will make the hash code
use two types of defines for the same purpose.  Now, we could say it is
better to replace HASH_READ/HASH_WRITE in the whole hash index code, or
maybe the usage this patch introduces is okay; however, it seems that
btree uses similar terminology (BT_READ/BT_WRITE).  Other than that,
your patch looks good.
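
For reference, hash.h defines the hash lock macros directly in terms of
the buffer-lock constants, so the two usages are interchangeable:

    #define HASH_READ       BUFFER_LOCK_SHARE
    #define HASH_WRITE      BUFFER_LOCK_EXCLUSIVE
    #define HASH_NOLOCK     (-1)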


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-12-22 Thread Robert Haas
On Mon, Dec 5, 2016 at 2:46 AM, Amit Kapila  wrote:
>> I'll review after that, since I have other things to review meanwhile.
>
> Attached, please find the rebased patch with this e-mail.  There is
> no fundamental change in the patch except for adapting the new
> locking strategy in the squeeze operation.  I have done sanity
> testing of the patch with a master-standby setup, and Kuntal has
> helped me verify it with his wal-consistency checker patch.

This patch again needs a rebase, but before you do that I'd like to
make it harder by applying the attached patch to remove
_hash_chgbufaccess(), which I think is a bad plan for more or less the
same reasons that motivated the removal of _hash_wrtbuf() in commit
25216c98938495fd741bf585dcbef45b3a9ffd40. I think there's probably
more simplification and cleanup that can be done afterward in the wake
of this; what I've done here is just a mechanical replacement of
_hash_chgbufaccess() with LockBuffer() and/or MarkBufferDirty().  The
point is that having MarkBufferDirty() calls happen implicitly inside
some other function is not what we want for write-ahead logging.  Your
patch gets rid of that, too; this is just doing it somewhat more
thoroughly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0eeb37d..1fa087a 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -274,7 +274,7 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 	 * Reacquire the read lock here.
 	 */
 	if (BufferIsValid(so->hashso_curbuf))
-		_hash_chgbufaccess(rel, so->hashso_curbuf, HASH_NOLOCK, HASH_READ);
+		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
 
 	/*
 	 * If we've already initialized this scan, we can just advance it in the
@@ -354,7 +354,7 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 
 	/* Release read lock on current buffer, but keep it pinned */
 	if (BufferIsValid(so->hashso_curbuf))
-		_hash_chgbufaccess(rel, so->hashso_curbuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
 
 	/* Return current heap TID on success */
 	scan->xs_ctup.t_self = so->hashso_heappos;
@@ -524,7 +524,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	orig_ntuples = metap->hashm_ntuples;
	memcpy(&local_metapage, metap, sizeof(local_metapage));
 	/* release the lock, but keep pin */
-	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
 	/* Scan the buckets that we know exist */
 	cur_bucket = 0;
@@ -576,9 +576,9 @@ loop_top:
 			 * (and thus can't be further split), update our cached metapage
 			 * data.
 			 */
-			_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+			LockBuffer(metabuf, BUFFER_LOCK_SHARE);
			memcpy(&local_metapage, metap, sizeof(local_metapage));
-			_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+			LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 		}
 
 		bucket_buf = buf;
@@ -597,7 +597,7 @@ loop_top:
 	}
 
 	/* Write-lock metapage and check for split since we started */
-	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 	metap = HashPageGetMeta(BufferGetPage(metabuf));
 
 	if (cur_maxbucket != metap->hashm_maxbucket)
@@ -605,7 +605,7 @@ loop_top:
 		/* There's been a split, so process the additional bucket(s) */
 		cur_maxbucket = metap->hashm_maxbucket;
		memcpy(&local_metapage, metap, sizeof(local_metapage));
-		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 		goto loop_top;
 	}
 
@@ -821,7 +821,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 * page
 		 */
 		if (retain_pin)
-			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 		else
 			_hash_relbuf(rel, buf);
 
@@ -836,7 +836,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	if (buf != bucket_buf)
 	{
 		_hash_relbuf(rel, buf);
-		_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
+		LockBuffer(bucket_buf, BUFFER_LOCK_EXCLUSIVE);
 	}
 
 	/*
@@ -866,7 +866,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
 			bstrategy);
 	else
-		_hash_chgbufaccess(rel, bucket_buf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
 }
 
 void
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 59c4213..b51b487 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -102,7 +102,7 @@ restart_insert:
 		lowmask = metap->hashm_lowmask;
 
 		/* Release metapage lock, but keep pin. */
-		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
 		/*
 		 * If the previous 

Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-12-02 Thread Robert Haas
On Thu, Dec 1, 2016 at 6:51 PM, Amit Kapila  wrote:
>> Thanks.  I am thinking that it might make sense to try to get the
>> "microvacuum support for hash index" and "cache hash index meta page"
>> patches committed before this one, because I'm guessing they are much
>> simpler than this one, and I think therefore that the review of those
>> patches can probably move fairly quickly.
>
> I think it makes sense to move "cache hash index meta page" first;
> however, "microvacuum support for hash index" is based on the WAL
> patch, as the action in that patch (the delete op) also needs to be
> logged.  One idea could be to split the patch so that WAL logging can
> be done as a separate patch, but I am not sure if it is worth it.

The thing is, there's a fair amount of locking stupidity in what just got
committed because of the requirement that the TID doesn't decrease
within a page.  I'd really like to get that fixed.

>>  Of course, ideally I can
>> also start reviewing this one in the meantime.  Does that make sense
>> to you?
>>
>
> You can start reviewing some of the operations like "Create Index",
> "Insert".  However, some changes are required because of change in
> locking strategy for Vacuum.  I am planning to work on rebasing it and
> making required changes in next week.

I'll review after that, since I have other things to review meanwhile.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-12-01 Thread Amit Kapila
On Thu, Dec 1, 2016 at 9:44 PM, Robert Haas  wrote:
> On Thu, Dec 1, 2016 at 1:03 AM, Amit Kapila  wrote:
>> On Wed, Nov 9, 2016 at 7:40 PM, Amit Kapila  wrote:
>>> On Tue, Nov 8, 2016 at 10:56 PM, Jeff Janes  wrote:
>>>> Unless we want to wait until that work is committed before doing
>>>> more review and testing on this.
>>>
>>> The concurrent hash index patch is getting changed and some of the
>>> changes need changes in this patch as well.  So, I think after it gets
>>> somewhat stabilized, I will update this patch as well.
>>
>> Now that the concurrent hash index patch is committed [1], I will work
>> on rebasing this patch.  Note, I have moved this to the next CF.
>
> Thanks.  I am thinking that it might make sense to try to get the
> "microvacuum support for hash index" and "cache hash index meta page"
> patches committed before this one, because I'm guessing they are much
> simpler than this one, and I think therefore that the review of those
> patches can probably move fairly quickly.
>

I think it makes sense to move "cache hash index meta page" first;
however, "microvacuum support for hash index" is based on the WAL
patch, as the action in that patch (the delete op) also needs to be
logged.  One idea could be to split the patch so that WAL logging can
be done as a separate patch, but I am not sure if it is worth it.

>  Of course, ideally I can
> also start reviewing this one in the meantime.  Does that make sense
> to you?
>

You can start reviewing some of the operations like "Create Index"
and "Insert".  However, some changes are required because of the
change in locking strategy for Vacuum.  I am planning to work on
rebasing it and making the required changes next week.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-12-01 Thread Robert Haas
On Thu, Dec 1, 2016 at 1:03 AM, Amit Kapila  wrote:
> On Wed, Nov 9, 2016 at 7:40 PM, Amit Kapila  wrote:
>> On Tue, Nov 8, 2016 at 10:56 PM, Jeff Janes  wrote:
>>> Unless we want to wait until that work is committed before doing more review
>>> and testing on this.
>>
>> The concurrent hash index patch is getting changed, and some of those
>> changes need changes in this patch as well.  So, I think after it gets
>> somewhat stabilized, I will update this patch as well.
>
> Now that the concurrent hash index patch is committed [1], I will work
> on rebasing this patch.  Note that I have moved this to the next CF.

Thanks.  I am thinking that it might make sense to try to get the
"microvacuum support for hash index" and "cache hash index meta page"
patches committed before this one, because I'm guessing they are much
simpler than this one, and I think therefore that the review of those
patches can probably move fairly quickly.  Of course, ideally I can
also start reviewing this one in the meantime.  Does that make sense
to you?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-11-30 Thread Amit Kapila
On Wed, Nov 9, 2016 at 7:40 PM, Amit Kapila  wrote:
> On Tue, Nov 8, 2016 at 10:56 PM, Jeff Janes  wrote:
>>
>> Unless we want to wait until that work is committed before doing more review
>> and testing on this.
>>
>
> The concurrent hash index patch is getting changed, and some of those
> changes need changes in this patch as well.  So, I think after it gets
> somewhat stabilized, I will update this patch as well.
>

Now that the concurrent hash index patch is committed [1], I will work
on rebasing this patch.  Note that I have moved this to the next CF.


[1] - 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=6d46f4783efe457f74816a75173eb23ed8930020

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-11-09 Thread Amit Kapila
On Tue, Nov 8, 2016 at 10:56 PM, Jeff Janes  wrote:
> On Sat, Sep 24, 2016 at 10:00 PM, Amit Kapila 
> wrote:
>>
>> On Fri, Sep 23, 2016 at 5:34 PM, Amit Kapila 
>> wrote:
>> >
>> > I think I was slightly wrong here.  For full page writes, it does use
>> > RBM_ZERO_AND_LOCK mode to read the page, and for that mode we do not
>> > do the page verification check; we blindly set the page to zero and
>> > then overwrite it with the full page image.  So after my fix, you will
>> > not see the checksum failure error.  I have a fix ready, but am still
>> > doing some more verification.  If everything passes, I will share the
>> > patch in a day or so.
>> >
>>
>> The attached patch fixes the problem; now we do perform full page
>> writes for bitmap pages.  Apart from that, I have rebased the patch on
>> top of the latest concurrent index patch [1].  I have updated the
>> README as well to reflect the WAL logging related information for
>> different operations.
>>
>> With the attached patch, all the review comments and issues found till
>> now are addressed.
>
>
> This needs to be updated to apply over concurrent_hash_index_v10.patch.
>
> Unless we want to wait until that work is committed before doing more review
> and testing on this.
>

The concurrent hash index patch is getting changed, and some of those
changes need changes in this patch as well.  So, I think after it gets
somewhat stabilized, I will update this patch as well.  I am not sure
it is a good idea to update it for every version of the hash index
patch.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-11-08 Thread Jeff Janes
On Sat, Sep 24, 2016 at 10:00 PM, Amit Kapila 
wrote:

> On Fri, Sep 23, 2016 at 5:34 PM, Amit Kapila 
> wrote:
> >
> > I think I was slightly wrong here.  For full page writes, it does use
> > RBM_ZERO_AND_LOCK mode to read the page, and for that mode we do not
> > do the page verification check; we blindly set the page to zero and
> > then overwrite it with the full page image.  So after my fix, you will
> > not see the checksum failure error.  I have a fix ready, but am still
> > doing some more verification.  If everything passes, I will share the
> > patch in a day or so.
> >
>
> The attached patch fixes the problem; now we do perform full page
> writes for bitmap pages.  Apart from that, I have rebased the patch on
> top of the latest concurrent index patch [1].  I have updated the
> README as well to reflect the WAL logging related information for
> different operations.
>
> With the attached patch, all the review comments and issues found till
> now are addressed.
>

This needs to be updated to apply over concurrent_hash_index_v10.patch.

Unless we want to wait until that work is committed before doing more
review and testing on this.

Thanks,

Jeff


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-27 Thread Jesper Pedersen

On 09/25/2016 01:00 AM, Amit Kapila wrote:

The attached patch fixes the problem; now we do perform full page
writes for bitmap pages.  Apart from that, I have rebased the patch on
top of the latest concurrent index patch [1].  I have updated the
README as well to reflect the WAL logging related information for
different operations.

With the attached patch, all the review comments and issues found till
now are addressed.



I have been running various tests and applications with this patch
together with the CHI v8 patch [1].


As I haven't seen any failures and don't currently have additional
feedback, I'm moving this patch to "Ready for Committer" for their
feedback.


If others have comments, please move the patch status back in the
CommitFest application.


[1] 
https://www.postgresql.org/message-id/CAA4eK1%2BX%3D8sUd1UCZDZnE3D9CGi9kw%2Bkjxp2Tnw7SX5w8pLBNw%40mail.gmail.com


Best regards,
 Jesper



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-25 Thread Ashutosh Sharma
Hi All,

> I forgot to mention that Ashutosh has tested this patch for a day
> using Jeff's tool and he didn't find any problem.  Also, he has found
> a way to easily reproduce the problem.  Ashutosh, can you share your
> changes to the script with which you reproduced the problem?

I made slight changes in Jeff Janes's patch (crash_REL10.patch) to
reproduce the issue reported by him earlier [1]. I changed
crash_REL10.patch such that a torn page write happens only for a
bitmap page in a hash index. As described by Amit in [2] & [3], we
were not logging a full page image for the bitmap page, and that's the
reason we were not able to restore the bitmap page in case it is a
torn page, which would eventually result in the error below.

38422  01000 2016-09-19 16:25:50.061 PDT:WARNING:  page verification
failed, calculated checksum 65067 but expected 21260
38422  01000 2016-09-19 16:25:50.061 PDT:CONTEXT:  xlog redo at
3F/22053B50 for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
38422  XX001 2016-09-19 16:25:50.071 PDT:FATAL:  invalid page in block
9 of relation base/16384/17334

Jeff Janes's original patch had the following condition for a torn page write:

if (JJ_torn_page > 0 && counter++ > JJ_torn_page && !RecoveryInProgress())
    nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);

I have added some more conditions to the above if () to ensure that
the issue associated with the bitmap page gets reproduced easily:

if (JJ_torn_page > 0 && counter++ > JJ_torn_page &&
    !RecoveryInProgress() &&
    bucket_opaque->hasho_page_id == HASHO_PAGE_ID &&
    (bucket_opaque->hasho_flag & LH_BITMAP_PAGE))
    nbytes = FileWrite(v->mdfd_vfd, buffer, 24);

Also, please note that I am allowing only 24 bytes of data, i.e., just
the page header, to be written into the disk file, so that the issue
will be observed even if a single overflow page is added to the hash
index.

Note: You can find Jeff Janes's torn page write patch at [4].

[1] - 
https://www.postgresql.org/message-id/CAMkU%3D1zGcfNTWWxnud8j_p0vb1ENGcngkwqZgPUEnwZmAN%2BXQw%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/CAA4eK1LmQZGnYhSHXDDCOsSb_0U-gsxReEmSDRgCZr%3DAdKbTEg%40mail.gmail.com
[3] - 
https://www.postgresql.org/message-id/CAA4eK1%2Bk9tGPw7vOACRo%2B4h5n%3DutHnSuGgMoJrLANybAjNBW9w%40mail.gmail.com
[4] - 
https://www.postgresql.org/message-id/CAMkU%3D1xRt8jBBB7g_7K41W00%3Dbr9UrxMVn_rhWhKPLaHfEdM5A%40mail.gmail.com

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-24 Thread Amit Kapila
On Sun, Sep 25, 2016 at 10:30 AM, Amit Kapila  wrote:
> On Fri, Sep 23, 2016 at 5:34 PM, Amit Kapila  wrote:
>>
>> I think I was slightly wrong here.  For full page writes, it does use
>> RBM_ZERO_AND_LOCK mode to read the page, and for that mode we do not
>> do the page verification check; we blindly set the page to zero and
>> then overwrite it with the full page image.  So after my fix, you will
>> not see the checksum failure error.  I have a fix ready, but am still
>> doing some more verification.  If everything passes, I will share the
>> patch in a day or so.
>>
>
> The attached patch fixes the problem; now we do perform full page
> writes for bitmap pages.  Apart from that, I have rebased the patch on
> top of the latest concurrent index patch [1].  I have updated the
> README as well to reflect the WAL logging related information for
> different operations.
>
> With the attached patch, all the review comments and issues found till
> now are addressed.
>

I forgot to mention that Ashutosh has tested this patch for a day
using Jeff's tool and he didn't find any problem.  Also, he has found
a way to easily reproduce the problem.  Ashutosh, can you share your
changes to the script with which you reproduced the problem?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-23 Thread Amit Kapila
On Thu, Sep 22, 2016 at 10:16 AM, Amit Kapila  wrote:
> On Thu, Sep 22, 2016 at 8:51 AM, Jeff Janes  wrote:
>>
>>
>> Correct.  But any torn page write must be covered by the restoration of a
>> full page image during replay, shouldn't it?  And that restoration should
>> happen blindly, without first reading in the old page and verifying the
>> checksum.
>>
>
> Probably, but I think this is not currently the way it is handled and
> I don't want to change it.  AFAIU, what happens now is that we first
> read the old page (and we do page verification while reading the old
> page, displaying a *warning* if the checksum doesn't match) and then
> restore the image.  The relevant code path is
> XLogReadBufferForRedo()->XLogReadBufferExtended()->ReadBufferWithoutRelcache()->ReadBuffer_common()->PageIsVerified().
>
> Now, the point here is why we are seeing a FATAL instead of a WARNING
> for the bitmap page of the hash index.  The reason, as explained in my
> previous e-mail, is that for the bitmap page we are not logging a full
> page image, and we should fix that as explained there.  Once the full
> page image is logged, then the first time it reads the torn page it
> will use the flag RBM_ZERO_AND_LOCK, which will turn the FATAL error
> you are seeing into a WARNING.
>

I think I was slightly wrong here.  For full page writes, it does use
RBM_ZERO_AND_LOCK mode to read the page, and for that mode we do not
do the page verification check; we blindly set the page to zero and
then overwrite it with the full page image.  So after my fix, you will
not see the checksum failure error.  I have a fix ready, but am still
doing some more verification.  If everything passes, I will share the
patch in a day or so.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-21 Thread Amit Kapila
On Thu, Sep 22, 2016 at 8:51 AM, Jeff Janes  wrote:
> On Tue, Sep 20, 2016 at 10:27 PM, Amit Kapila 
> wrote:
>>
>> On Tue, Sep 20, 2016 at 10:24 PM, Jeff Janes  wrote:
>> > On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila 
>> > wrote:
>> >>
>> >>
>> >> Okay, thanks for pointing that out.  I have fixed it.  Apart from
>> >> that, I have changed _hash_alloc_buckets() to initialize the page
>> >> instead of making it completely zero, because of problems discussed
>> >> in another related thread [1].  I have also updated the README.
>> >>
>> >
>> > with v7 of the concurrent hash patch and v4 of the write ahead log patch
>> > and
>> > the latest relcache patch (I don't know how important that is to
>> > reproducing
>> > this, I suspect it is not), I once got this error:
>> >
>> >
>> > 38422  0 2016-09-19 16:25:50.055 PDT:LOG:  database system was
>> > interrupted; last known up at 2016-09-19 16:25:49 PDT
>> > 38422  0 2016-09-19 16:25:50.057 PDT:LOG:  database system was not
>> > properly shut down; automatic recovery in progress
>> > 38422  0 2016-09-19 16:25:50.057 PDT:LOG:  redo starts at
>> > 3F/2200DE90
>> > 38422  01000 2016-09-19 16:25:50.061 PDT:WARNING:  page verification
>> > failed,
>> > calculated checksum 65067 but expected 21260
>> > 38422  01000 2016-09-19 16:25:50.061 PDT:CONTEXT:  xlog redo at
>> > 3F/22053B50
>> > for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
>> > 38422  XX001 2016-09-19 16:25:50.071 PDT:FATAL:  invalid page in block 9
>> > of
>> > relation base/16384/17334
>> > 38422  XX001 2016-09-19 16:25:50.071 PDT:CONTEXT:  xlog redo at
>> > 3F/22053B50
>> > for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
>> >
>> >
>> > The original page with the invalid checksum is:
>> >
>>
>> I think this is an example of the torn page problem, which seems to be
>> happening because of the below code in your test.
>>
>> ! if (JJ_torn_page > 0 && counter++ > JJ_torn_page &&
>> !RecoveryInProgress()) {
>> !   nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);
>> ! ereport(FATAL,
>> ! (errcode(ERRCODE_DISK_FULL),
>> !  errmsg("could not write block %u of relation %s: wrote only %d of %d
>> bytes",
>> ! blocknum,
>> ! relpath(reln->smgr_rnode, forknum),
>> ! nbytes, BLCKSZ),
>> !  errhint("JJ is screwing with the database.")));
>> ! } else {
>> !   nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
>> ! }
>>
>> If you are running the above test with JJ_torn_page disabled, then it
>> is a different matter and we need to investigate it, but I assume you
>> are running with it enabled.
>>
>> I think this could happen if the actual change in the page is in the
>> 2/3 part of the page which you are not writing in the above code.  The
>> checksum in the page header, which is written as part of the partial
>> page write (the 1/3 part of the page), would have accounted for the
>> actual change you made, whereas after restart, when the page is read
>> again to apply redo, the checksum calculation won't include the change
>> made in the 2/3 part.
>
>
> Correct.  But any torn page write must be covered by the restoration of a
> full page image during replay, shouldn't it?  And that restoration should
> happen blindly, without first reading in the old page and verifying the
> checksum.
>

Probably, but I think this is not currently the way it is handled and
I don't want to change it.  AFAIU, what happens now is that we first
read the old page (and we do page verification while reading the old
page, displaying a *warning* if the checksum doesn't match) and then
restore the image.  The relevant code path is
XLogReadBufferForRedo()->XLogReadBufferExtended()->ReadBufferWithoutRelcache()->ReadBuffer_common()->PageIsVerified().

Now, the point here is why we are seeing a FATAL instead of a WARNING
for the bitmap page of the hash index.  The reason, as explained in my
previous e-mail, is that for the bitmap page we are not logging a full
page image, and we should fix that as explained there.  Once the full
page image is logged, then the first time it reads the torn page it
will use the flag RBM_ZERO_AND_LOCK, which will turn the FATAL error
you are seeing into a WARNING.
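
To illustrate, a compressed sketch of the mode selection just described
(not the actual XLogReadBufferForRedo() code; locking and error
handling omitted):

#include "postgres.h"
#include "access/xlogutils.h"
#include "storage/bufmgr.h"

/*
 * With a block image (FPI), the read uses RBM_ZERO_AND_LOCK, so
 * PageIsVerified() never runs and the torn page is simply overwritten;
 * without one, the RBM_NORMAL read verifies the page and warns on a
 * checksum mismatch.
 */
static Buffer
redo_read_sketch(XLogReaderState *record, uint8 block_id)
{
	RelFileNode rnode;
	ForkNumber	forknum;
	BlockNumber blkno;

	XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno);

	return XLogReadBufferExtended(rnode, forknum, blkno,
								  XLogRecHasBlockImage(record, block_id)
								  ? RBM_ZERO_AND_LOCK : RBM_NORMAL);
}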

I think this is the reason why Ashutosh is seeing a WARNING for the
heap page whereas you are seeing a FATAL for the bitmap page of the
hash index.

I don't think we should try to avoid the WARNING for torn pages as part
of this patch, even if that is a better course of action, but the FATAL
should certainly be fixed and a WARNING displayed instead, as is done
for heap pages.

>
>> Today, while thinking about this problem, I realized that currently
>> in the patch we are using REGBUF_NO_IMAGE for the bitmap page because
>> of one of the problems you reported [1].  That change will fix the
>> problem you reported, but it will expose bitmap pages to torn-page
>> hazards.  I think the right fix there is to make pd_lower equal to
>> pd_upper for the bitmap page, so that full page writes don't exclude
>> the data in the bitmap page.
>
>
> 

Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-21 Thread Jeff Janes
On Tue, Sep 20, 2016 at 10:27 PM, Amit Kapila 
wrote:

> On Tue, Sep 20, 2016 at 10:24 PM, Jeff Janes  wrote:
> > On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila 
> > wrote:
> >>
> >>
> >> Okay, thanks for pointing that out.  I have fixed it.  Apart from
> >> that, I have changed _hash_alloc_buckets() to initialize the page
> >> instead of making it completely zero, because of problems discussed
> >> in another related thread [1].  I have also updated the README.
> >>
> >
> > with v7 of the concurrent hash patch and v4 of the write ahead log patch
> and
> > the latest relcache patch (I don't know how important that is to
> reproducing
> > this, I suspect it is not), I once got this error:
> >
> >
> > 38422  0 2016-09-19 16:25:50.055 PDT:LOG:  database system was
> > interrupted; last known up at 2016-09-19 16:25:49 PDT
> > 38422  0 2016-09-19 16:25:50.057 PDT:LOG:  database system was not
> > properly shut down; automatic recovery in progress
> > 38422  0 2016-09-19 16:25:50.057 PDT:LOG:  redo starts at 3F/2200DE90
> > 38422  01000 2016-09-19 16:25:50.061 PDT:WARNING:  page verification
> failed,
> > calculated checksum 65067 but expected 21260
> > 38422  01000 2016-09-19 16:25:50.061 PDT:CONTEXT:  xlog redo at
> 3F/22053B50
> > for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
> > 38422  XX001 2016-09-19 16:25:50.071 PDT:FATAL:  invalid page in block 9
> of
> > relation base/16384/17334
> > 38422  XX001 2016-09-19 16:25:50.071 PDT:CONTEXT:  xlog redo at
> 3F/22053B50
> > for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
> >
> >
> > The original page with the invalid checksum is:
> >
>
> I think this is an example of the torn page problem, which seems to be
> happening because of the below code in your test.
>
> ! if (JJ_torn_page > 0 && counter++ > JJ_torn_page &&
> !RecoveryInProgress()) {
> !   nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);
> ! ereport(FATAL,
> ! (errcode(ERRCODE_DISK_FULL),
> !  errmsg("could not write block %u of relation %s: wrote only %d of %d
> bytes",
> ! blocknum,
> ! relpath(reln->smgr_rnode, forknum),
> ! nbytes, BLCKSZ),
> !  errhint("JJ is screwing with the database.")));
> ! } else {
> !   nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
> ! }
>
> If you are running the above test with JJ_torn_page disabled, then it
> is a different matter and we need to investigate it, but I assume you
> are running with it enabled.
>
> I think this could happen if the actual change in the page is in the
> 2/3 part of the page which you are not writing in the above code.  The
> checksum in the page header, which is written as part of the partial
> page write (the 1/3 part of the page), would have accounted for the
> actual change you made, whereas after restart, when the page is read
> again to apply redo, the checksum calculation won't include the change
> made in the 2/3 part.
>

Correct.  But any torn page write must be covered by the restoration of a
full page image during replay, shouldn't it?  And that restoration should
happen blindly, without first reading in the old page and verifying the
checksum.  Failure to restore the page from an FPI would be a bug.  (That
was the purpose for which I wrote this testing harness in the first place,
to verify that the restoration of FPIs happens correctly; although most of
the bugs it happens to uncover have been unrelated to that.)



>
> Today, Ashutosh shared the logs of his test run where he has shown a
> similar problem for a HEAP page.  I think this could happen, though
> rarely, for any page with the above kind of tests.
>

I think Ashutosh's examples are of warnings, not errors.  I think the
warnings occur when replay needs to read in the block (for reasons I don't
understand yet) but then doesn't care whether it passes the checksum or
not, because it will just be blown away by the replay anyway.


> Does this explanation explain the problem you are seeing?
>

If it can't survive artificial torn page writes, then it probably can't
survive real ones either.  So I am pretty sure it is a bug of some sort.
Perhaps the bug is that it is generating an ERROR when it should just be a
WARNING?


>
> >
> > If I ignore the checksum failure and re-start the system, the page gets
> > restored to be a bitmap page.
> >
>
> Okay, but have you ensured in some way that redo is applied to the bitmap page?
>


I haven't done that yet.  I can't start the system without destroying the
evidence, and I haven't figured out yet how to import a specific block from
a shut-down server into a bytea of a running server, in order to inspect it
using pageinspect.

> Today, while thinking about this problem, I realized that currently
> in the patch we are using REGBUF_NO_IMAGE for the bitmap page because
> of one of the problems you reported [1].  That change will fix the
> problem you reported, but it will expose bitmap pages to torn-page
> hazards.  I think the right fix there is to make pd_lower equal to
> pd_upper for the bitmap page, so that full page writes don't exclude
> the data in the bitmap page.

Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-20 Thread Amit Kapila
On Tue, Sep 20, 2016 at 10:24 PM, Jeff Janes  wrote:
> On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila 
> wrote:
>>
>>
>> Okay, thanks for pointing that out.  I have fixed it.  Apart from
>> that, I have changed _hash_alloc_buckets() to initialize the page
>> instead of making it completely zero, because of problems discussed
>> in another related thread [1].  I have also updated the README.
>>
>
> with v7 of the concurrent hash patch and v4 of the write ahead log patch and
> the latest relcache patch (I don't know how important that is to reproducing
> this, I suspect it is not), I once got this error:
>
>
> 38422  0 2016-09-19 16:25:50.055 PDT:LOG:  database system was
> interrupted; last known up at 2016-09-19 16:25:49 PDT
> 38422  0 2016-09-19 16:25:50.057 PDT:LOG:  database system was not
> properly shut down; automatic recovery in progress
> 38422  0 2016-09-19 16:25:50.057 PDT:LOG:  redo starts at 3F/2200DE90
> 38422  01000 2016-09-19 16:25:50.061 PDT:WARNING:  page verification failed,
> calculated checksum 65067 but expected 21260
> 38422  01000 2016-09-19 16:25:50.061 PDT:CONTEXT:  xlog redo at 3F/22053B50
> for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
> 38422  XX001 2016-09-19 16:25:50.071 PDT:FATAL:  invalid page in block 9 of
> relation base/16384/17334
> 38422  XX001 2016-09-19 16:25:50.071 PDT:CONTEXT:  xlog redo at 3F/22053B50
> for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
>
>
> The original page with the invalid checksum is:
>

I think this is an example of the torn page problem, which seems to be
happening because of the below code in your test.

! if (JJ_torn_page > 0 && counter++ > JJ_torn_page &&
!RecoveryInProgress()) {
!   nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);
! ereport(FATAL,
! (errcode(ERRCODE_DISK_FULL),
!  errmsg("could not write block %u of relation %s: wrote only %d of %d bytes",
! blocknum,
! relpath(reln->smgr_rnode, forknum),
! nbytes, BLCKSZ),
!  errhint("JJ is screwing with the database.")));
! } else {
!   nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
! }

If you are running the above test with JJ_torn_page disabled, then it
is a different matter and we need to investigate it, but I assume you
are running with it enabled.

I think this could happen if the actual change in the page is in the
2/3 part of the page which you are not writing in the above code.  The
checksum in the page header, which is written as part of the partial
page write (the 1/3 part of the page), would have accounted for the
actual change you made, whereas after restart, when the page is read
again to apply redo, the checksum calculation won't include the change
made in the 2/3 part.

Today, Ashutosh shared the logs of his test run where he has shown a
similar problem for a HEAP page.  I think this could happen, though
rarely, for any page with the above kind of tests.

Does this explanation explain the problem you are seeing?

>
> If I ignore the checksum failure and re-start the system, the page gets
> restored to be a bitmap page.
>

Okay, but have you ensured in some way that redo is applied to the bitmap page?


Today, while thinking about this problem, I realized that currently
in the patch we are using REGBUF_NO_IMAGE for the bitmap page because
of one of the problems you reported [1].  That change will fix the
problem you reported, but it will expose bitmap pages to torn-page
hazards.  I think the right fix there is to make pd_lower equal to
pd_upper for the bitmap page, so that full page writes don't exclude
the data in the bitmap page.

Thoughts?

[1] - 
https://www.postgresql.org/message-id/CAA4eK1KJOfVvFUmi6dcX9Y2-0PFHkomDzGuyoC%3DaD3Qj9WPpFA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-20 Thread Jesper Pedersen

On 09/16/2016 02:42 AM, Amit Kapila wrote:

Okay, thanks for pointing that out.  I have fixed it.  Apart from
that, I have changed _hash_alloc_buckets() to initialize the page
instead of making it completely zero, because of problems discussed
in another related thread [1].  I have also updated the README.



Thanks.

This needs a rebase against the CHI v8 [1] patch.

[1] 
https://www.postgresql.org/message-id/CAA4eK1+X=8sud1uczdzne3d9cgi9kw+kjxp2tnw7sx5w8pl...@mail.gmail.com


Best regards,
 Jesper



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-20 Thread Jeff Janes
On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila 
wrote:

>
> Okay, thanks for pointing that out.  I have fixed it.  Apart from
> that, I have changed _hash_alloc_buckets() to initialize the page
> instead of making it completely zero, because of problems discussed
> in another related thread [1].  I have also updated the README.
>
>
with v7 of the concurrent hash patch and v4 of the write ahead log patch and
the latest relcache patch (I don't know how important that is to
reproducing this, I suspect it is not), I once got this error:


38422  0 2016-09-19 16:25:50.055 PDT:LOG:  database system was
interrupted; last known up at 2016-09-19 16:25:49 PDT
38422  0 2016-09-19 16:25:50.057 PDT:LOG:  database system was not
properly shut down; automatic recovery in progress
38422  0 2016-09-19 16:25:50.057 PDT:LOG:  redo starts at 3F/2200DE90
38422  01000 2016-09-19 16:25:50.061 PDT:WARNING:  page verification
failed, calculated checksum 65067 but expected 21260
38422  01000 2016-09-19 16:25:50.061 PDT:CONTEXT:  xlog redo at 3F/22053B50
for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
38422  XX001 2016-09-19 16:25:50.071 PDT:FATAL:  invalid page in block 9 of
relation base/16384/17334
38422  XX001 2016-09-19 16:25:50.071 PDT:CONTEXT:  xlog redo at 3F/22053B50
for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T


The original page with the invalid checksum is:

$ od 16384_17334_9
000 32 00 015420 077347 020404 00 30 017760
020 017760 020004 00 00 00 00 00 00
040 00 00 00 00 00 00 00 00
*
0017760 07 02 17 17 07 00 02 177600
002

If I ignore the checksum failure and re-start the system, the page gets
restored to be a bitmap page.

Cheers,

Jeff


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-14 Thread Amit Kapila
On Wed, Sep 14, 2016 at 4:36 PM, Ashutosh Sharma  wrote:
> Hi All,
>
> Below is the backtrace for the issue reported in my earlier mail [1].
> From the call stack it looks like we are trying to release the lock on
> a metapage twice in _hash_expandtable().
>

Thanks for the call stack.  I think the below code in the patch is the
culprit.  Here we have already released the metapage lock, and then on
failure we are trying to release it again.

_hash_expandtable()
{
..

/* Release the metapage lock, before completing the split. */
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
..

if (!buf_nblkno)
{
_hash_relbuf(rel, buf_oblkno);
goto fail;
}
..
fail:
/* We didn't write the metapage, so just drop lock */
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
}

This is a problem in the concurrent hash index patch.  I will send the
fix in the next version of the patch.
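
One way to avoid the double release, sketched against the excerpt above
(a hypothetical restructuring, not necessarily the fix that will be
committed), is to exit through a label that skips the second unlock:

_hash_expandtable()
{
...
	/* Release the metapage lock, before completing the split. */
	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
...
	if (!buf_nblkno)
	{
		_hash_relbuf(rel, buf_oblkno);
		goto fail_nolock;	/* metapage lock was already dropped above */
	}
...
fail:
	/* We didn't write the metapage, so just drop lock */
	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
fail_nolock:
	...
}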


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-14 Thread Ashutosh Sharma
Hi All,

Below is the backtrace for the issue reported in my earlier mail [1].
From the call stack it looks like we are trying to release the lock on
a metapage twice in _hash_expandtable().

(gdb) bt
#0  0x007b01cf in LWLockRelease (lock=0x7f55f59d0570) at lwlock.c:1799
#1  0x0078990c in LockBuffer (buffer=2, mode=0) at bufmgr.c:3540
#2  0x004a5d3c in _hash_chgbufaccess (rel=0x7f5605b608c0,
buf=2, from_access=1, to_access=-1) at hashpage.c:331
#3  0x004a722d in _hash_expandtable (rel=0x7f5605b608c0,
metabuf=2) at hashpage.c:995
#4  0x004a316a in _hash_doinsert (rel=0x7f5605b608c0,
itup=0x2ba5030) at hashinsert.c:313
#5  0x004a0d85 in hashinsert (rel=0x7f5605b608c0,
values=0x7ffdf5409c70, isnull=0x7ffdf5409c50 "", ht_ctid=0x2c2e4f4,
heapRel=0x7f5605b58250,
checkUnique=UNIQUE_CHECK_NO) at hash.c:248
#6  0x004c5a16 in index_insert (indexRelation=0x7f5605b608c0,
values=0x7ffdf5409c70, isnull=0x7ffdf5409c50 "",
heap_t_ctid=0x2c2e4f4,
heapRelation=0x7f5605b58250, checkUnique=UNIQUE_CHECK_NO) at indexam.c:204
#7  0x0063f2d5 in ExecInsertIndexTuples (slot=0x2b9a2c0,
tupleid=0x2c2e4f4, estate=0x2b99c00, noDupErr=0 '\000',
specConflict=0x0,
arbiterIndexes=0x0) at execIndexing.c:388
#8  0x0066a1fc in ExecInsert (mtstate=0x2b99e50,
slot=0x2b9a2c0, planSlot=0x2b9a2c0, arbiterIndexes=0x0,
onconflict=ONCONFLICT_NONE,
estate=0x2b99c00, canSetTag=1 '\001') at nodeModifyTable.c:481
#9  0x0066b841 in ExecModifyTable (node=0x2b99e50) at
nodeModifyTable.c:1496
#10 0x00645e51 in ExecProcNode (node=0x2b99e50) at execProcnode.c:396
#11 0x006424d9 in ExecutePlan (estate=0x2b99c00,
planstate=0x2b99e50, use_parallel_mode=0 '\000', operation=CMD_INSERT,
sendTuples=0 '\000',
numberTuples=0, direction=ForwardScanDirection, dest=0xd73c20
) at execMain.c:1567
#12 0x0064087a in standard_ExecutorRun (queryDesc=0x2b89c70,
direction=ForwardScanDirection, count=0) at execMain.c:338
#13 0x7f55fe590605 in pgss_ExecutorRun (queryDesc=0x2b89c70,
direction=ForwardScanDirection, count=0) at pg_stat_statements.c:877
#14 0x00640751 in ExecutorRun (queryDesc=0x2b89c70,
direction=ForwardScanDirection, count=0) at execMain.c:284
#15 0x007c331e in ProcessQuery (plan=0x2b873f0,
sourceText=0x2b89bd0 "insert into foo (index) select $1 from
generate_series(1,1)",
params=0x2b89c20, dest=0xd73c20 ,
completionTag=0x7ffdf540a3f0 "") at pquery.c:187
#16 0x007c4a0c in PortalRunMulti (portal=0x2ae7930,
isTopLevel=1 '\001', setHoldSnapshot=0 '\000', dest=0xd73c20
,
altdest=0xd73c20 , completionTag=0x7ffdf540a3f0 "")
at pquery.c:1303
#17 0x007c4055 in PortalRun (portal=0x2ae7930,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x2b5dc90,
altdest=0x2b5dc90,
completionTag=0x7ffdf540a3f0 "") at pquery.c:815
#18 0x007bfb45 in exec_execute_message (portal_name=0x2b5d880
"", max_rows=9223372036854775807) at postgres.c:1977
#19 0x007c25c7 in PostgresMain (argc=1, argv=0x2b05df8,
dbname=0x2aca3c0 "postgres", username=0x2b05de0 "edb") at
postgres.c:4133
#20 0x00744d8f in BackendRun (port=0x2aefa60) at postmaster.c:4260
#21 0x00744523 in BackendStartup (port=0x2aefa60) at postmaster.c:3934
#22 0x00740f9c in ServerLoop () at postmaster.c:1691
#23 0x00740623 in PostmasterMain (argc=4, argv=0x2ac81f0) at
postmaster.c:1299
#24 0x006940fe in main (argc=4, argv=0x2ac81f0) at main.c:228

Please let me know if you need any further inputs.

[1]- 
https://www.postgresql.org/message-id/CAE9k0Pmxh-4NAr4GjzDDFHdBKDrKy2FV-Z%2B2Tp8vb2Kmxu%3D6zg%40mail.gmail.com

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Wed, Sep 14, 2016 at 2:45 PM, Ashutosh Sharma  wrote:
> Hi All,
>
> I am getting the following error when running the test script shared
> by Jeff [1].  The error is observed after executing the test script
> for around 3-4 hrs.
>
> 57869 INSERT XX000 2016-09-14 07:58:01.211 IST:ERROR:  lock
> buffer_content 1 is not held
> 57869 INSERT XX000 2016-09-14 07:58:01.211 IST:STATEMENT:  insert into
> foo (index) select $1 from generate_series(1,1)
>
> 124388 INSERT XX000 2016-09-14 11:24:13.593 IST:ERROR:  lock
> buffer_content 10 is not held
> 124388 INSERT XX000 2016-09-14 11:24:13.593 IST:STATEMENT:  insert
> into foo (index) select $1 from generate_series(1,1)
>
> 124381 INSERT XX000 2016-09-14 11:24:13.594 IST:ERROR:  lock
> buffer_content 10 is not held
> 124381 INSERT XX000 2016-09-14 11:24:13.594 IST:STATEMENT:  insert
> into foo (index) select $1 from generate_series(1,1)
>
> [1]- 
> https://www.postgresql.org/message-id/CAMkU%3D1xRt8jBBB7g_7K41W00%3Dbr9UrxMVn_rhWhKPLaHfEdM5A%40mail.gmail.com
>
> Please note that I am performing the test on the latest patches for
> concurrent hash index and WAL logging in hash index shared by Amit
> yesterday.
>
>
> With Regards,
> Ashutosh Sharma
> EnterpriseDB: 

Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-14 Thread Ashutosh Sharma
Hi All,

I am getting the following error when running the test script shared
by Jeff [1].  The error is observed after executing the test script
for around 3-4 hrs.

57869 INSERT XX000 2016-09-14 07:58:01.211 IST:ERROR:  lock
buffer_content 1 is not held
57869 INSERT XX000 2016-09-14 07:58:01.211 IST:STATEMENT:  insert into
foo (index) select $1 from generate_series(1,1)

124388 INSERT XX000 2016-09-14 11:24:13.593 IST:ERROR:  lock
buffer_content 10 is not held
124388 INSERT XX000 2016-09-14 11:24:13.593 IST:STATEMENT:  insert
into foo (index) select $1 from generate_series(1,1)

124381 INSERT XX000 2016-09-14 11:24:13.594 IST:ERROR:  lock
buffer_content 10 is not held
124381 INSERT XX000 2016-09-14 11:24:13.594 IST:STATEMENT:  insert
into foo (index) select $1 from generate_series(1,1)

[1]- 
https://www.postgresql.org/message-id/CAMkU%3D1xRt8jBBB7g_7K41W00%3Dbr9UrxMVn_rhWhKPLaHfEdM5A%40mail.gmail.com

Please note that I am performing the test on the latest patches for
concurrent hash index and WAL logging in hash index shared by Amit
yesterday.


With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com



On Wed, Sep 14, 2016 at 12:04 AM, Jesper Pedersen
 wrote:
> On 09/13/2016 07:41 AM, Amit Kapila wrote:
>>>
>>> README:
>>> +in_complete split flag.  The reader algorithm works correctly, as it
>>> will
>>> scan
>>>
>>> What flag ?
>>>
>>
>> The in-complete-split flag, which indicates that a split has to be
>> finished for that particular bucket.  The values of these flags are
>> LH_BUCKET_NEW_PAGE_SPLIT and LH_BUCKET_OLD_PAGE_SPLIT for the new and
>> old bucket respectively.  It is set in hasho_flag in the special area
>> of the page.  I have slightly expanded the definition in the README,
>> but I am not sure if it is a good idea to mention flags defined in
>> hash.h.  Let me know if it is still unclear or you want something
>> additional added to the README.
>>
>
> I think it is better now.
>
>>> hashxlog.c:
>>>
>>> hash_xlog_move_page_contents
>>> hash_xlog_squeeze_page
>>>
>>> Both have "bukcetbuf" (-> "bucketbuf"), and
>>>
>>> +   if (BufferIsValid(bukcetbuf));
>>>
>>> ->
>>>
>>> +   if (BufferIsValid(bucketbuf))
>>>
>>> and indent following line:
>>>
>>> LockBufferForCleanup(bukcetbuf);
>>>
>>> hash_xlog_delete
>>>
>>> has the "if" issue too.
>>>
>>
>> Fixed all the above cosmetic issues.
>>
>
> I meant there is an extra ';' on the "if" statements:
>
> +   if (BufferIsValid(bukcetbuf)); <--
>
> in hash_xlog_move_page_contents, hash_xlog_squeeze_page and
> hash_xlog_delete.
>
>
> Best regards,
>  Jesper
>
>
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-13 Thread Jesper Pedersen

On 09/13/2016 07:41 AM, Amit Kapila wrote:

README:
+in_complete split flag.  The reader algorithm works correctly, as it will
scan

What flag ?



The in-complete-split flag, which indicates that a split has to be
finished for that particular bucket.  The values of these flags are
LH_BUCKET_NEW_PAGE_SPLIT and LH_BUCKET_OLD_PAGE_SPLIT for the new and
old bucket respectively.  It is set in hasho_flag in the special area
of the page.  I have slightly expanded the definition in the README,
but I am not sure if it is a good idea to mention flags defined in
hash.h.  Let me know if it is still unclear or you want something
additional added to the README.
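
For reference, a minimal sketch of testing those flags (assuming the
flag names above as defined by the patch in hash.h; illustrative, not
code from the patch itself):

#include "postgres.h"
#include "access/hash.h"
#include "storage/bufpage.h"

/* Does this bucket's page say a split still has to be finished? */
static bool
bucket_has_incomplete_split(Page page)
{
	HashPageOpaque opaque = (HashPageOpaque) PageGetSpecialPointer(page);

	return (opaque->hasho_flag &
			(LH_BUCKET_OLD_PAGE_SPLIT | LH_BUCKET_NEW_PAGE_SPLIT)) != 0;
}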



I think it is better now.


hashxlog.c:

hash_xlog_move_page_contents
hash_xlog_squeeze_page

Both have "bukcetbuf" (-> "bucketbuf"), and

+   if (BufferIsValid(bukcetbuf));

->

+   if (BufferIsValid(bucketbuf))

and indent following line:

LockBufferForCleanup(bukcetbuf);

hash_xlog_delete

has the "if" issue too.



Fixed all the above cosmetic issues.



I meant there is an extra ';' on the "if" statements:

+   if (BufferIsValid(bukcetbuf)); <--

in hash_xlog_move_page_contents, hash_xlog_squeeze_page and 
hash_xlog_delete.


Best regards,
 Jesper





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-13 Thread Amit Kapila
On Mon, Sep 12, 2016 at 11:29 AM, Jeff Janes  wrote:
>
>
> My test program (as posted) injects crashes and then checks the
> post-crash-recovery system for consistency, so it cannot be run as-is
> without the WAL patch.  I also ran the test with crashing turned off (just
> change the JJ* variables at the top of do.sh to all be set to the empty
> string), and in that case I didn't see either problem, but it could just
> be that I didn't run it long enough.
>
> It should have been long enough to detect the rather common "buffer is
> not owned by resource owner Portal" problem, so that one I think is specific
> to the WAL patch (probably the part which tries to complete bucket splits
> when it detects one was started but not completed?)
>

Both problems seem to be due to the same reason, and it is in the
concurrent hash index patch.  What is happening here is that we are
trying to unpin the old bucket buffer twice.  We need to scan the old
bucket when there is an incomplete split; the issue is that the system
crashed during a split, and after restart, during the old bucket scan,
it tries to unpin the old primary bucket buffer twice when the new
bucket has additional overflow pages.  I will post the updated patch
on the concurrent hash index thread.  Once I post the patch, it would
be good if you try to reproduce the issue once more.
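
The kind of guard needed looks roughly like this (a hypothetical
sketch with an invented helper name; not the actual fix):

#include "postgres.h"
#include "access/hash.h"
#include "storage/bufmgr.h"

/*
 * Drop the pin on the old primary bucket page exactly once, even if the
 * scan later walks the new bucket's overflow pages and reaches another
 * release point.
 */
static void
release_old_bucket_once(Relation rel, Buffer *old_bucket_buf)
{
	if (BufferIsValid(*old_bucket_buf))
	{
		_hash_dropbuf(rel, *old_bucket_buf);
		*old_bucket_buf = InvalidBuffer;	/* forbid a second unpin */
	}
}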

Thanks to Ashutosh Sharma, who shared the call stack offlist after
reproducing the problem.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-12 Thread Jesper Pedersen

Hi,

On 09/07/2016 05:58 AM, Amit Kapila wrote:

Okay, I have fixed this issue as explained above.  Apart from that, I
have fixed another issue reported by Mark Kirkwood upthread and a few
other issues found during internal testing by Ashutosh Sharma.

The locking issue reported by Mark and Ashutosh is that the patch
didn't maintain the locking order while adding an overflow page as it
does in other write operations (lock the bucket pages first and then
the metapage to perform the write operation).  I have added comments
in _hash_addovflpage() to explain the locking order used in the
modified patch.

During stress testing with pgbench using a master-standby setup, we
found an issue which indicates that the WAL replay machinery doesn't
expect completely zeroed pages (see the explanation of RBM_NORMAL mode
atop XLogReadBufferExtended).  Previously, before freeing the overflow
page, we were zeroing it; now I have changed it to just initialize the
page so that it will be empty.

Apart from the above, I have added support for the old snapshot
threshold in the hash index code.

Thanks to Ashutosh Sharma for doing the testing of the patch and
helping me in analyzing some of the above issues.

I forgot to mention in my initial mail that Robert and I had some
off-list discussions about the design of this patch; many thanks to
him for providing inputs.
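
The zeroed-vs-initialized distinction described above boils down to
something like this (a minimal sketch assuming the freed overflow page
is reset in place; not the exact patch code):

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

/*
 * Replay in RBM_NORMAL mode rejects an all-zeros page, so instead of
 * MemSet(page, 0, BLCKSZ) give the freed page a valid, empty header.
 */
static void
reset_freed_overflow_page(Buffer ovflbuf)
{
	Page	page = BufferGetPage(ovflbuf);

	PageInit(page, BufferGetPageSize(ovflbuf), 0);	/* empty, no special area */
}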



Some initial feedback.

README:
+in_complete split flag.  The reader algorithm works correctly, as it 
will scan


What flag ?

hashxlog.c:

hash_xlog_move_page_contents
hash_xlog_squeeze_page

Both have "bukcetbuf" (-> "bucketbuf"), and

+   if (BufferIsValid(bukcetbuf));

->

+   if (BufferIsValid(bucketbuf))

and indent following line:

LockBufferForCleanup(bukcetbuf);

hash_xlog_delete

has the "if" issue too.

hash.h:

Move the XLog related content to hash_xlog.h

Best regards,
 Jesper



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-12 Thread Jeff Janes
On Sun, Sep 11, 2016 at 7:40 PM, Amit Kapila 
wrote:

> On Mon, Sep 12, 2016 at 7:00 AM, Jeff Janes  wrote:
> > On Thu, Sep 8, 2016 at 12:09 PM, Jeff Janes 
> wrote:
> >
> >>
> >> I plan to do testing using my own testing harness after changing it to
> >> insert a lot of dummy tuples (ones with negative values in the pseudo-pk
> >> column, which are never queried by the core part of the harness) and
> >> deleting them at random intervals.  I think that none of pgbench's
> built in
> >> tests are likely to give the bucket splitting and squeezing code very
> much
> >> exercise.
> >
> >
> >
> > I've implemented this, by adding lines 197 through 202 to the count.pl
> > script.  (I'm reattaching the test case)
> >
> > Within a few minutes of testing, I start getting Errors like these:
> >
> > 29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:ERROR:  buffer 2762 is not
> > owned by resource owner Portal
> > 29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:STATEMENT:  update foo set
> > count=count+1 where index=$1
> >
> >
> > In one test, I also got an error from my test harness itself indicating
> > tuples are transiently missing from the index, starting an hour into a
> test:
> >
> > child abnormal exit update did not update 1 row: key 9555 updated 0E0 at
> > count.pl line 194.\n  at count.pl line 208.
> > child abnormal exit update did not update 1 row: key 8870 updated 0E0 at
> > count.pl line 194.\n  at count.pl line 208.
> > child abnormal exit update did not update 1 row: key 8453 updated 0E0 at
> > count.pl line 194.\n  at count.pl line 208.
> >
> > Those key values should always find exactly one row to update.
> >
> > If the tuples were permanently missing from the index, I would keep
> getting
> > errors on the same key values very frequently.  But I don't get that, the
> > errors remain infrequent and are on a different value each time, so I think
> > the tuples are in the index but the scan somehow misses them, either
> while
> > the bucket is being split or while it is being squeezed.
> >
> > This is on a build without enable-asserts.
> >
> > Any ideas on how best to go about investigating this?
> >
>
> I think these symptoms indicate a bug in the concurrent hash index
> patch, but it could be that the problem can only be revealed with the
> WAL patch.  Is it possible to try this with just the concurrent hash
> index patch?  In any case, thanks for testing it; I will look into
> these issues.
>

My test program (as posted) injects crashes and then checks the
post-crash-recovery system for consistency, so it cannot be run as-is
without the WAL patch.  I also ran the test with crashing turned off (just
change the JJ* variables at the top of do.sh to all be set to the
empty string), and in that case I didn't see either problem, but it
could just be that I didn't run it long enough.

It should have been long enough to detect the rather common "buffer is
not owned by resource owner Portal" problem, so that one I think is
specific to the WAL patch (probably the part which tries to complete bucket
splits when it detects one was started but not completed?)


Cheers,

Jeff


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-11 Thread Amit Kapila
On Sun, Sep 11, 2016 at 3:01 PM, Mark Kirkwood
 wrote:
> On 11/09/16 19:16, Mark Kirkwood wrote:
>
>>
>>
>> On 11/09/16 17:01, Amit Kapila wrote:
>>>
>>> ...Do you think we can do some read-only
>>> workload benchmarking using this server?  If yes, then probably you
>>> can use the concurrent hash index patch [1] and the cache-the-metapage
>>> patch [2] (I think Mithun needs to rebase his patch) to do so.
>>>
>>>
>>>
>>> [1] -
>>> https://www.postgresql.org/message-id/caa4ek1j6b8o4pcepqrxnyblvbftonmjeem+qn0jzx31-obx...@mail.gmail.com
>>> [2] -
>>> https://www.postgresql.org/message-id/cad__ouhj29cebif_flge4t9vj_-cfxbwcxhjo+d_16txbem...@mail.gmail.com
>>>
>>>
>>
>> I can do that - are we checking for hangs/assertions or comparing
>> patched vs unpatched performance (for the metapage patch)?
>>
>>

The first step is to find any bugs and then do performance testing.
What I wanted for performance testing was to compare HEAD against all
three hash index patches together (concurrent hash index, cache the
meta page, WAL for hash index).  For HEAD, we want two sets of
numbers, one with hash indexes and another with btree indexes.  As
Jeff has found a few problems, I think it is better to fix those first
before going for performance tests.

>
> So, assuming the latter - testing performance with and without the metapage
> patch:
>
> For my 1st runs:
>
> - cpus 16, ram 16G
> - size 100, clients 32
>
> I'm seeing no difference in performance for a read-only (-S) pgbench
> workload (with everybody using hash indexes). I guess that's not
> surprising, as the db fits in RAM (1.6G and we have 16G). So I'll retry
> with a bigger dataset (I suspect size 2000 is needed).
>

I think you should try with -S -M prepared and with various client
counts (8,16,32,64,100...).
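
For example (database name and run length are placeholders):

pgbench -S -M prepared -c 8 -j 8 -T 600 bench
pgbench -S -M prepared -c 16 -j 16 -T 600 bench

and so on for 32, 64, and 100 clients.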


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-11 Thread Amit Kapila
On Mon, Sep 12, 2016 at 7:00 AM, Jeff Janes  wrote:
> On Thu, Sep 8, 2016 at 12:09 PM, Jeff Janes  wrote:
>
>>
>> I plan to do testing using my own testing harness after changing it to
>> insert a lot of dummy tuples (ones with negative values in the pseudo-pk
>> column, which are never queried by the core part of the harness) and
>> deleting them at random intervals.  I think that none of pgbench's built in
>> tests are likely to give the bucket splitting and squeezing code very much
>> exercise.
>
>
>
> I've implemented this, by adding lines 197 through 202 to the count.pl
> script.  (I'm reattaching the test case)
>
> Within a few minutes of testing, I start getting Errors like these:
>
> 29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:ERROR:  buffer 2762 is not
> owned by resource owner Portal
> 29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:STATEMENT:  update foo set
> count=count+1 where index=$1
>
>
> In one test, I also got an error from my test harness itself indicating
> tuples are transiently missing from the index, starting an hour into a test:
>
> child abnormal exit update did not update 1 row: key 9555 updated 0E0 at
> count.pl line 194.\n  at count.pl line 208.
> child abnormal exit update did not update 1 row: key 8870 updated 0E0 at
> count.pl line 194.\n  at count.pl line 208.
> child abnormal exit update did not update 1 row: key 8453 updated 0E0 at
> count.pl line 194.\n  at count.pl line 208.
>
> Those key values should always find exactly one row to update.
>
> If the tuples were permanently missing from the index, I would keep getting
> errors on the same key values very frequently.  But I don't get that, the
> errors remain infrequent and are on a different value each time, so I think
> the tuples are in the index but the scan somehow misses them, either while
> the bucket is being split or while it is being squeezed.
>
> This is on a build without enable-asserts.
>
> Any ideas on how best to go about investigating this?
>

I think these symptoms indicate a bug in the concurrent hash index
patch, but it could be that the problem can only be revealed with the
WAL patch.  Is it possible to try this with just the concurrent hash
index patch?  In any case, thanks for testing it; I will look into
these issues.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-11 Thread Jeff Janes
On Thu, Sep 8, 2016 at 12:09 PM, Jeff Janes  wrote:


> I plan to do testing using my own testing harness after changing it to
> insert a lot of dummy tuples (ones with negative values in the pseudo-pk
> column, which are never queried by the core part of the harness) and
> deleting them at random intervals.  I think that none of pgbench's built in
> tests are likely to give the bucket splitting and squeezing code very much
> exercise.
>


I've implemented this, by adding lines 197 through 202 to the count.pl
script.  (I'm reattaching the test case)

Within a few minutes of testing, I start getting Errors like these:

29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:ERROR:  buffer 2762 is not
owned by resource owner Portal
29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:STATEMENT:  update foo set
count=count+1 where index=$1


In one test, I also got an error from my test harness itself indicating
tuples are transiently missing from the index, starting an hour into a test:

child abnormal exit update did not update 1 row: key 9555 updated 0E0 at
count.pl line 194.\n  at count.pl line 208.
child abnormal exit update did not update 1 row: key 8870 updated 0E0 at
count.pl line 194.\n  at count.pl line 208.
child abnormal exit update did not update 1 row: key 8453 updated 0E0 at
count.pl line 194.\n  at count.pl line 208.

Those key values should always find exactly one row to update.

If the tuples were permanently missing from the index, I would keep getting
errors on the same key values very frequently.  But I don't get that, the
errors remain infrequent and are on a different value each time, so I think
the tuples are in the index but the scan somehow misses them, either while
the bucket is being split or while it is being squeezed.

This is on a build without enable-asserts.

Any ideas on how best to go about investigating this?

Cheers,

Jeff


count.pl
Description: Binary data


crash_REL10.patch
Description: Binary data


do.sh
Description: Bourne shell script

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-11 Thread Mark Kirkwood

On 11/09/16 19:16, Mark Kirkwood wrote:




On 11/09/16 17:01, Amit Kapila wrote:

...Do you think we can do some read-only
workload benchmarking using this server?  If yes, then probably you
can use the concurrent hash index patch [1] and the cache-the-metapage
patch [2] (I think Mithun needs to rebase his patch) to do so.



[1] - 
https://www.postgresql.org/message-id/caa4ek1j6b8o4pcepqrxnyblvbftonmjeem+qn0jzx31-obx...@mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/cad__ouhj29cebif_flge4t9vj_-cfxbwcxhjo+d_16txbem...@mail.gmail.com





I can do that - are we checking for hangs/assertions or comparing
patched vs unpatched performance (for the metapage patch)?





So, assuming the latter - testing performance with and without the 
metapage patch:


For my 1st runs:

- cpus 16, ram 16G
- size 100, clients 32

I'm seeing no difference in performance for a read-only (-S) pgbench
workload (with everybody using hash indexes). I guess that's not
surprising, as the db fits in RAM (1.6G and we have 16G). So I'll
retry with a bigger dataset (I suspect size 2000 is needed).


regards

Mark



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-11 Thread Mark Kirkwood



On 11/09/16 17:01, Amit Kapila wrote:

On Sun, Sep 11, 2016 at 4:10 AM, Mark Kirkwood
 wrote:


I performed several 10-hour runs on a size 100 database using 32 and
64 clients. For the last run I rebuilt with assertions enabled. No
hangs or assertion failures.


Thanks for the verification.  Do you think we can do some read-only
workload benchmarking using this server?  If yes, then probably you
can use the concurrent hash index patch [1] and the cache-the-metapage
patch [2] (I think Mithun needs to rebase his patch) to do so.



[1] - 
https://www.postgresql.org/message-id/caa4ek1j6b8o4pcepqrxnyblvbftonmjeem+qn0jzx31-obx...@mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/cad__ouhj29cebif_flge4t9vj_-cfxbwcxhjo+d_16txbem...@mail.gmail.com




I can do that - are we checking for hangs/assertions or comparing
patched vs unpatched performance (for the metapage patch)?


regards

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-10 Thread Amit Kapila
On Sun, Sep 11, 2016 at 4:10 AM, Mark Kirkwood
 wrote:
>
>
> I performed several 10-hour runs on a size 100 database using 32 and
> 64 clients. For the last run I rebuilt with assertions enabled. No
> hangs or assertion failures.
>

Thanks for verification.  Do you think we can do some read-only
workload benchmarking using this server?  If yes, then probably you
can use concurrent hash index patch [1] and cache the metapage patch
[2] (I think Mithun needs to rebase his patch) to do so.



[1] - 
https://www.postgresql.org/message-id/caa4ek1j6b8o4pcepqrxnyblvbftonmjeem+qn0jzx31-obx...@mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/cad__ouhj29cebif_flge4t9vj_-cfxbwcxhjo+d_16txbem...@mail.gmail.com


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-10 Thread Mark Kirkwood



On 09/09/16 14:50, Mark Kirkwood wrote:


Yeah, good suggestion about replacing (essentially) all the indexes 
with hash ones and testing. I did some short runs with this type of 
schema yesterday (actually to get a feel for whether hash performance vs 
btree is comparable - it does seem to be) - but longer ones 
with higher concurrency (as high as I can manage on a single-socket i7, 
anyway) are probably a good plan. If Ashutosh has access to seriously 
large numbers of cores then that is even better :-)





I managed to find a slightly bigger server (used our openstack cloud to 
run a 16 cpu vm). With the schema modified as follows:


bench=# \d pgbench_accounts
   Table "public.pgbench_accounts"
  Column  | Type  | Modifiers
--+---+---
 aid  | integer   | not null
 bid  | integer   |
 abalance | integer   |
 filler   | character(84) |
Indexes:
"pgbench_accounts_pkey" hash (aid)

bench=# \d pgbench_branches
   Table "public.pgbench_branches"
  Column  | Type  | Modifiers
--+---+---
 bid  | integer   | not null
 bbalance | integer   |
 filler   | character(88) |
Indexes:
"pgbench_branches_pkey" hash (bid)

bench=# \d pgbench_tellers
Table "public.pgbench_tellers"
  Column  | Type  | Modifiers
--+---+---
 tid  | integer   | not null
 bid  | integer   |
 tbalance | integer   |
 filler   | character(84) |
Indexes:
"pgbench_tellers_pkey" hash (tid)

bench=# \d pgbench_history
  Table "public.pgbench_history"
 Column |Type | Modifiers
+-+---
 tid| integer |
 bid| integer |
 aid| integer |
 delta  | integer |
 mtime  | timestamp without time zone |
 filler | character(22)   |
Indexes:
"pgbench_history_pkey" hash (bid)

performed several 10-hour runs on a size 100 database using 32 and 64 
clients. For the last run I rebuilt with assertions enabled. No hangs or 
assertion failures.


regards

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-08 Thread Amit Kapila
On Fri, Sep 9, 2016 at 12:39 AM, Jeff Janes  wrote:
>
> I plan to do testing using my own testing harness after changing it to
> insert a lot of dummy tuples (ones with negative values in the pseudo-pk
> column, which are never queried by the core part of the harness) and
> deleting them at random intervals.  I think that none of pgbench's built in
> tests are likely to give the bucket splitting and squeezing code very much
> exercise.
>

The hash index tests [1] written by Mithun do cover part of that code,
and we have done testing by extending those tests to cover the splitting
and squeezing parts of the code.

> Is there a way to gather statistics on how many of each type of WAL record
> are actually getting sent over the replication link?  The only way I can
> think of is to turn on WAL archiving as well as replication, then use
> pg_xlogdump to gather the stats.
>

Sounds sensible, but what do you want to learn by getting the number of
each type of WAL record?  I understand it is important to cover all
the WAL records for hash indexes (and I think Ashutosh has done that
during his tests [2]), but maybe sending the same record multiple
times could further strengthen the validation.

> I've run my original test for a while now and have not seen any problems.
> But I realized I forgot to compile with --enable-cassert, so I will have to
> redo it to make sure the assertion failures have been fixed.  In my original
> testing I did very rarely get a deadlock (or some kind of hang), and I
> haven't seen that again so far.  It was probably the same source as the one
> Mark observed, and so the same fix.
>

Thanks for the verification.


[1] - https://commitfest.postgresql.org/10/716/
[2] - 
https://www.postgresql.org/message-id/CAE9k0PkPumi4iWFuD%2BjHHkpcxn531%3DDJ8uH0dctsvF%2BdaZY6yQ%40mail.gmail.com
-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-08 Thread Mark Kirkwood

On 09/09/16 07:09, Jeff Janes wrote:

On Wed, Sep 7, 2016 at 3:29 AM, Ashutosh Sharma wrote:


> Thanks to Ashutosh Sharma for doing the testing of the patch and
> helping me in analyzing some of the above issues.

Hi All,

I would like to summarize the test-cases that I have executed for
validating WAL logging in the hash index feature.

1) I have mainly run the pgbench test with a read-write workload at a
scale factor of 1000 and various client counts (16, 64 and 128) for
durations of 30 minutes, 1 hour and 24 hours. I executed this test
on a highly configured power2 machine with 128 cores and 512 GB of RAM. I
ran the test-case both with and without a replication setup.

Please note that I have changed the schema of the pgbench tables created
during the initialisation phase.

The new schema of pgbench tables looks as shown below on both master
and standby:

postgres=# \d pgbench_accounts
   Table "public.pgbench_accounts"
  Column  | Type  | Modifiers
--+---+---
 aid  | integer   | not null
 bid  | integer   |
 abalance | integer   |
 filler   | character(84) |
Indexes:
"pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
"pgbench_accounts_bid" hash (bid)

postgres=# \d pgbench_history
  Table "public.pgbench_history"
 Column |Type | Modifiers
+-+---
 tid| integer |
 bid| integer |
 aid| integer |
 delta  | integer |
 mtime  | timestamp without time zone |
 filler | character(22)   |
Indexes:
"pgbench_history_bid" hash (bid)


Hi Ashutosh,

This schema will test the maintenance of hash indexes, but it will 
never use hash indexes for searching, so it limits the amount of test 
coverage you will get.  While searching shouldn't generate novel types 
of WAL records (that I know of), it will generate locking and timing 
issues that might uncover bugs (if there are any left to uncover, of 
course).


I would drop the primary key on pgbench_accounts and replace it with a 
hash index and test it that way (except I don't have a 128 core 
machine at my disposal, so really I am suggesting that you do this...)


The lack of primary key and the non-uniqueness of the hash index 
should not be an operational problem, because the built in pgbench 
runs never attempt to violate the constraints anyway.


In fact, I'd replace all of the indexes on the rest of the pgbench 
tables with hash indexes, too, just for additional testing.


I plan to do testing using my own testing harness after changing it to 
insert a lot of dummy tuples (ones with negative values in the 
pseudo-pk column, which are never queried by the core part of the 
harness) and deleting them at random intervals.  I think that none of 
pgbench's built in tests are likely to give the bucket splitting and 
squeezing code very much exercise.


Is there a way to gather statistics on how many of each type of WAL 
record are actually getting sent over the replication link?  The only 
way I can think of is to turn on WAL archiving as well as replication, 
then use pg_xlogdump to gather the stats.


I've run my original test for a while now and have not seen any 
problems.  But I realized I forgot to compile with --enable-cassert, so 
I will have to redo it to make sure the assertion failures have been 
fixed.  In my original testing I did very rarely get a deadlock (or 
some kind of hang), and I haven't seen that again so far.  It was 
probably the same source as the one Mark observed, and so the same fix.


Cheers,

Jeff


Yeah, good suggestion about replacing (essentially) all the indexes with 
hash ones and testing. I did some short runs with this type of schema 
yesterday (actually to get a feel for whether hash performance vs btree 
is comparable - it does seem to be) - but longer ones with higher 
concurrency (as high as I can manage on a single-socket i7, anyway) are 
probably a good plan. If Ashutosh has access to seriously large numbers 
of cores then that is even better :-)


Cheers

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-08 Thread Jeff Janes
On Wed, Sep 7, 2016 at 3:29 AM, Ashutosh Sharma 
wrote:

> > Thanks to Ashutosh Sharma for doing the testing of the patch and
> > helping me in analyzing some of the above issues.
>
> Hi All,
>
> I would like to summarize the test-cases that I have executed for
> validating WAL logging in the hash index feature.
>
> 1) I have mainly run the pgbench test with a read-write workload at a
> scale factor of 1000 and various client counts (16, 64 and 128) for
> durations of 30 minutes, 1 hour and 24 hours. I executed this test
> on a highly configured power2 machine with 128 cores and 512 GB of RAM. I
> ran the test-case both with and without a replication setup.
>
> Please note that I have changed the schema of the pgbench tables created
> during the initialisation phase.
>
> The new schema of pgbench tables looks as shown below on both master
> and standby:
>
> postgres=# \d pgbench_accounts
>Table "public.pgbench_accounts"
>   Column  | Type  | Modifiers
> --+---+---
>  aid  | integer   | not null
>  bid  | integer   |
>  abalance | integer   |
>  filler   | character(84) |
> Indexes:
> "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
> "pgbench_accounts_bid" hash (bid)
>
> postgres=# \d pgbench_history
>   Table "public.pgbench_history"
>  Column |Type | Modifiers
> +-+---
>  tid| integer |
>  bid| integer |
>  aid| integer |
>  delta  | integer |
>  mtime  | timestamp without time zone |
>  filler | character(22)   |
> Indexes:
> "pgbench_history_bid" hash (bid)
>
>
Hi Ashutosh,

This schema will test the maintenance of hash indexes, but it will never
use hash indexes for searching, so it limits the amount of test coverage
you will get.  While searching shouldn't generate novel types of WAL
records (that I know of), it will generate locking and timing issues that
might uncover bugs (if there are any left to uncover, of course).

I would drop the primary key on pgbench_accounts and replace it with a hash
index and test it that way (except I don't have a 128 core machine at my
disposal, so really I am suggesting that you do this...)

The lack of primary key and the non-uniqueness of the hash index should not
be an operational problem, because the built in pgbench runs never attempt
to violate the constraints anyway.
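
Concretely, something along these lines (untested; "bench" stands in for
whatever database you initialized with pgbench):

psql -d bench -c "ALTER TABLE pgbench_accounts
    DROP CONSTRAINT pgbench_accounts_pkey;"
psql -d bench -c "CREATE INDEX pgbench_accounts_aid
    ON pgbench_accounts USING hash (aid);"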

In fact, I'd replace all of the indexes on the rest of the pgbench tables
with hash indexes, too, just for additional testing.

I plan to do testing using my own testing harness after changing it to
insert a lot of dummy tuples (ones with negative values in the pseudo-pk
column, which are never queried by the core part of the harness) and
deleting them at random intervals.  I think that none of pgbench's built-in
tests are likely to give the bucket splitting and squeezing code very much
exercise.
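
In outline, the extra load is something like this (the table and column
names are placeholders for my harness's schema):

# bulk-insert dummy tuples with negative keys the core queries never touch
psql -d testdb -c "INSERT INTO test_tab (pk_col, val)
    SELECT -s, 0 FROM generate_series(1, 100000) s;"
# later, delete them again to drive overflow-page freeing and squeezing
psql -d testdb -c "DELETE FROM test_tab WHERE pk_col < 0;"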

Is there a way to gather statistics on how many of each type of WAL record
are actually getting sent over the replication link?  The only way I can
think of is to turn on WAL archiving as well as replication, then use
pg_xlogdump to gather the stats.
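
Roughly like this, assuming segments get archived somewhere convenient (the
archive path and segment names below are made up):

# on the master: archive_mode=on, archive_command='cp %p /mnt/archive/%f'
pg_xlogdump --stats=record \
    /mnt/archive/000000010000000000000010 \
    /mnt/archive/000000010000000000000020

--stats=record breaks the counts down by individual record type rather than
just by resource manager, which is the granularity I'm after here.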

I've run my original test for a while now and have not seen any problems.
But I realized I forgot to compile with --enable-cassert, so I will have to
redo it to make sure the assertion failures have been fixed.  In my
original testing I did very rarely get a deadlock (or some kind of hang),
and I haven't seen that again so far.  It was probably the same source as
the one Mark observed, and so the same fix.
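
The rebuild for that is just the usual (install prefix is illustrative):

./configure --enable-cassert --enable-debug --prefix=$HOME/pgsql-test
make -j4 && make install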

Cheers,

Jeff


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-07 Thread Amit Kapila
On Thu, Sep 8, 2016 at 10:02 AM, Mark Kirkwood
 wrote:
>
> Repeating my tests with these new patches applied points to the hang issue
> being solved. I tested several 10-minute runs (any of which was enough to
> elicit the hang previously). I'll do some longer ones, but looks good!
>

Thanks for verifying the issue and doing further testing of the patch.
It is really helpful.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-07 Thread Mark Kirkwood

On 07/09/16 21:58, Amit Kapila wrote:


On Wed, Aug 24, 2016 at 10:32 PM, Jeff Janes  wrote:

On Tue, Aug 23, 2016 at 10:05 PM, Amit Kapila 
wrote:

On Wed, Aug 24, 2016 at 2:37 AM, Jeff Janes  wrote:


After an intentionally created crash, I get an Assert triggering:

TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] &
(1<<((bitmapbit)%32", File: "hashovfl.c", Line: 553)

freep[0] is zero and bitmapbit is 16.


What is happening here is that when it tries to clear the bitmapbit,
it expects it to be set.  Now, I think the reason why it didn't
find the bit set could be that after the new overflow page was added
and the bit corresponding to it was set, you might have crashed the
system, and replay would not have set the bit.  Then, while freeing
the overflow page, it can hit the Assert you mentioned.  I think
the problem here could be that I am using REGBUF_STANDARD to log the
bitmap page updates, which seems to be causing the issue.  As the bitmap
page doesn't follow the standard page layout, its actual contents would
have been omitted when taking the full-page image, and then during
replay the bit would not have been set, because the page doesn't need REDO.
I think the fix here is to use REGBUF_NO_IMAGE, as we do for vm
buffers.

If you can send me the detailed steps for how you have produced the
problem, then I can verify after fixing whether you are seeing the
same problem or something else.



The test is rather awkward; it might be easier to just have me test it.


Okay, I have fixed this issue as explained above.  Apart from that, I
have fixed another issue reported by Mark Kirkwood upthread and a few
other issues found during internal testing by Ashutosh Sharma.

The locking issue reported by Mark and Ashutosh is that, while adding an
overflow page, the patch didn't maintain the locking order it uses in
other write operations (lock the bucket pages first and then the
metapage to perform the write operation).  I have added comments in
_hash_addovflpage() to explain the locking order used in the modified
patch.

During stress testing with pgbench using a master-standby setup, we
found an issue which indicates that the WAL replay machinery doesn't
expect completely zeroed pages (see the explanation of RBM_NORMAL mode
atop XLogReadBufferExtended).  Previously we zeroed the overflow page
before freeing it; now I have changed it to just initialize the page so
that it will be empty.

Apart from the above, I have added support for the old snapshot
threshold in the hash index code.

Thanks to Ashutosh Sharma for doing the testing of the patch and
helping me in analyzing some of the above issues.

I forgot to mention in my initial mail that Robert and I had some
off-list discussions about the design of this patch; many thanks to
him for providing inputs.




Repeating my tests with these new patches applied points to the hang 
issue being solved. I tested several 10-minute runs (any of which was 
enough to elicit the hang previously). I'll do some longer ones, but 
looks good!


regards

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-07 Thread Ashutosh Sharma
> Thanks to Ashutosh Sharma for doing the testing of the patch and
> helping me in analyzing some of the above issues.

Hi All,

I would like to summarize the test-cases that I have executed for
validating WAL logging in the hash index feature.

1) I have mainly run the pgbench test with a read-write workload at a
scale factor of 1000 and various client counts (16, 64 and 128) for
durations of 30 minutes, 1 hour and 24 hours. I executed this test
on a highly configured power2 machine with 128 cores and 512 GB of RAM. I
ran the test-case both with and without a replication setup.
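
The invocations were essentially plain pgbench, along the lines of (client
count and duration varied per run):

pgbench -i -s 1000 postgres
pgbench -c 64 -j 64 -T 3600 postgres   # default read-write workload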

Please note that I have changed the schema of the pgbench tables created
during the initialisation phase.

The new schema of pgbench tables looks as shown below on both master
and standby:

postgres=# \d pgbench_accounts
   Table "public.pgbench_accounts"
  Column  | Type  | Modifiers
--+---+---
 aid  | integer   | not null
 bid  | integer   |
 abalance | integer   |
 filler   | character(84) |
Indexes:
"pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
"pgbench_accounts_bid" hash (bid)

postgres=# \d pgbench_history
  Table "public.pgbench_history"
 Column |Type | Modifiers
+-+---
 tid| integer |
 bid| integer |
 aid| integer |
 delta  | integer |
 mtime  | timestamp without time zone |
 filler | character(22)   |
Indexes:
"pgbench_history_bid" hash (bid)


2) I have also made use of the following tools for analyzing the hash
index tables created on master and standby.

a) pgstattuple : Using this tool, I could verify tuple-level
statistics, which include the index's physical length, dead tuple
percentage and other info, on both master and standby. However, I saw
the error message below when using this tool on one of the hash index
tables, on both the master and standby servers.

postgres=# SELECT * FROM pgstattuple('pgbench_history_bid');
ERROR:  index "pgbench_history_bid" contains unexpected zero page at
block 2482297

To investigate this, I used the pageinspect contrib module and found
that the above block contains an uninitialised page. But this is quite
possible in a hash index: buckets are allocated in powers of 2, so we
may find some uninitialised bucket pages.
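
For example, page_header() on the block reported above can confirm this;
for an uninitialised page it should show zeroes in every field:

psql -d postgres -c "SELECT * FROM
    page_header(get_raw_page('pgbench_history_bid', 2482297));"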

b) pg_filedump : Using this tool, I took a dump of all the hash index
tables on master and standby and compared the dump files to verify
whether any hash index pages differ between master and standby.
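
In outline, the comparison was along these lines (paths and the
relfilenode are illustrative):

# locate the file behind the index
psql -d postgres -c "SELECT pg_relation_filepath('pgbench_accounts_bid');"
# dump the same relation file from both clusters and diff the results
pg_filedump -i /master/data/base/13290/16405  > master.dump
pg_filedump -i /standby/data/base/13290/16405 > standby.dump
diff master.dump standby.dump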

In short, this is all I did to verify the patch for WAL logging in hash
indexes.

I would like to thank Amit and Robert for their valuable inputs during
the hash index testing phase.

I am also planning to verify WAL logging in hash indexes using Kuntal's
patch for WAL consistency checking once it is stabilized.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-09-07 Thread Amit Kapila
On Wed, Sep 7, 2016 at 3:28 PM, Amit Kapila  wrote:
>
> Okay, I have fixed this issue as explained above.  Apart from that, I
> have fixed another issue reported by Mark Kirkwood upthread and a few
> other issues found during internal testing by Ashutosh Sharma.
>

Forgot to mention that you need to apply the latest concurrent hash
index patch [1] before applying this patch.


[1] - 
https://www.postgresql.org/message-id/CAA4eK1J6b8O4PcEPqRxNYbLVbfToNMJEEm%2Bqn0jZX31-obXrJw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-08-29 Thread Amit Kapila
On Tue, Aug 30, 2016 at 3:40 AM, Alvaro Herrera
 wrote:
> Amit Kapila wrote:
>
>> How about attached?
>
> That works; pushed.

Thanks.

> (I removed a few #includes from the new file.)
>

oops, copied from hash.h and forgot to remove those.

>> If you want, I think we can go one step further and move hash_redo to a
>> new file hash_xlog.c which is required for main patch, but we can
>> leave it for later as well.
>
> I think that can be a part of the main patch.
>

Okay, makes sense.  Any suggestions on splitting the main patch, or on
design/implementation in general, are welcome.

I will rebase the patches on the latest commit.  The current status is
that I have fixed the problems reported by Jeff Janes and Mark
Kirkwood.  Currently, we are doing stress testing of the patch
using pgbench; once that is done, I will post a new version.  I am
hoping that it will be complete this week.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-08-29 Thread Alvaro Herrera
Amit Kapila wrote:

> How about attached?

That works; pushed.  (I removed a few #includes from the new file.)

> If you want, I think we can go one step further and move hash_redo to a
> new file hash_xlog.c which is required for main patch, but we can
> leave it for later as well.

I think that can be a part of the main patch.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-08-26 Thread Amit Kapila
On Thu, Aug 25, 2016 at 6:54 PM, Alvaro Herrera
 wrote:
> Amit Kapila wrote:
>> On Wed, Aug 24, 2016 at 11:46 PM, Alvaro Herrera
>>  wrote:
>
>> > Can you split the new xlog-related stuff to a new file, say hash_xlog.h,
>> > instead of cramming it in hash.h?  Removing the existing #include
>> > "xlogreader.h" from hash.h would be nice.  I volunteer for pushing any
>> > preliminary header cleanup commits.
>>
>> So, what you are expecting here is that we create hash_xlog.h and move
>> necessary functions like hash_redo(), hash_desc() and hash_identify()
>> to it. Then do whatever else is required to build it successfully.
>> Once that is done, I can build my patches on top of it.  Is that
>> right?  If yes, I can work on it.
>
> Yes, thanks.
>

How about attached?  If you want, I think we can go one step further and
move hash_redo to a new file, hash_xlog.c, which is required for the main
patch, but we can leave it for later as well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


add_hash_xlog-v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-08-25 Thread Alvaro Herrera
Amit Kapila wrote:
> On Wed, Aug 24, 2016 at 11:46 PM, Alvaro Herrera
>  wrote:

> > Can you split the new xlog-related stuff to a new file, say hash_xlog.h,
> > instead of cramming it in hash.h?  Removing the existing #include
> > "xlogreader.h" from hash.h would be nice.  I volunteer for pushing any
> > preliminary header cleanup commits.
> 
> So, what you are expecting here is that we create hash_xlog.h and move
> necessary functions like hash_redo(), hash_desc() and hash_identify()
> to it. Then do whatever else is required to build it successfully.
> Once that is done, I can build my patches on top of it.  Is that
> right?  If yes, I can work on it.

Yes, thanks.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-08-24 Thread Amit Kapila
On Wed, Aug 24, 2016 at 11:46 PM, Alvaro Herrera
 wrote:
> Amit Kapila wrote:
>> $SUBJECT will make hash indexes reliable and usable on standby.
>
> Nice work.
>
> Can you split the new xlog-related stuff to a new file, say hash_xlog.h,
> instead of cramming it in hash.h?  Removing the existing #include
> "xlogreader.h" from hash.h would be nice.  I volunteer for pushing any
> preliminary header cleanup commits.
>

So, what you are expecting here is that we create hash_xlog.h and move
necessary functions like hash_redo(), hash_desc() and hash_identify()
to it. Then do whatever else is required to build it successfully.
Once that is done, I can build my patches on top of it.  Is that
right?  If yes, I can work on it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-08-24 Thread Alvaro Herrera
Amit Kapila wrote:
> $SUBJECT will make hash indexes reliable and usable on standby.

Nice work.

Can you split the new xlog-related stuff to a new file, say hash_xlog.h,
instead of cramming it in hash.h?  Removing the existing #include
"xlogreader.h" from hash.h would be nice.  I volunteer for pushing any
preliminary header cleanup commits.

I think it's silly that access/hash.h is the go-to header for hash
usage, including hash_any().  Not this patch's fault, of course, just
venting.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Write Ahead Logging for Hash Indexes

2016-08-24 Thread Jeff Janes
On Tue, Aug 23, 2016 at 10:05 PM, Amit Kapila 
wrote:

> On Wed, Aug 24, 2016 at 2:37 AM, Jeff Janes  wrote:
>
> >
> > After an intentionally created crash, I get an Assert triggering:
> >
> > TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] &
> > (1<<((bitmapbit)%32", File: "hashovfl.c", Line: 553)
> >
> > freep[0] is zero and bitmapbit is 16.
> >
>
> What is happening here is that when it tries to clear the bitmapbit,
> it expects it to be set.  Now, I think the reason why it didn't
> find the bit set could be that after the new overflow page was added
> and the bit corresponding to it was set, you might have crashed the
> system, and replay would not have set the bit.  Then, while freeing
> the overflow page, it can hit the Assert you mentioned.  I think
> the problem here could be that I am using REGBUF_STANDARD to log the
> bitmap page updates, which seems to be causing the issue.  As the bitmap
> page doesn't follow the standard page layout, its actual contents would
> have been omitted when taking the full-page image, and then during
> replay the bit would not have been set, because the page doesn't need REDO.
> I think the fix here is to use REGBUF_NO_IMAGE, as we do for vm
> buffers.
>
> If you can send me the detailed steps for how you have produced the
> problem, then I can verify after fixing whether you are seeing the
> same problem or something else.
>


The test is rather awkward; it might be easier to just have me test it.
But I've attached it.

There is a patch that needs to be applied and compiled (alongside your
patches, of course) to inject the crashes, a perl script which creates
the schema and does the updates, and a shell script which sets up the
cluster with the appropriate parameters and then calls the perl script in
a loop.

The top of the shell script has some hard-coded paths to the binaries and
to the test data directory (which is automatically deleted).

I run it like "sh do.sh >& do.err &"

It gives two different types of assertion failures:

$ fgrep TRAP: do.err |sort|uniq -c

 21 TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] &
(1<<((bitmapbit)%32", File: "hashovfl.c", Line: 553)
 32 TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c",
Line: 2506)

The second one is related to the intentional crashes, and so is not
relevant to you.

Cheers,

Jeff


count.pl
Description: Binary data


crash_REL10.patch
Description: Binary data


do.sh
Description: Bourne shell script

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

