Re: [HACKERS] Hash Indexes

2016-12-20 Thread Amit Kapila
On Tue, Dec 20, 2016 at 7:44 PM, Robert Haas  wrote:
> On Tue, Dec 20, 2016 at 9:01 AM, Amit Kapila  wrote:
>> On Tue, Dec 20, 2016 at 7:11 PM, Robert Haas  wrote:
>>> On Tue, Dec 20, 2016 at 4:51 AM, Amit Kapila  
>>> wrote:
 We have mainly four actions in the squeeze operation: add tuples to
 the write page, empty the overflow page, unlink the overflow page, and
 mark it free by setting the corresponding bit in the bitmap page.
 Now, if we don't log the changes to the write page and the freeing of
 the overflow page as one operation, then won't a query on the standby
 either see duplicate tuples or miss the tuples freed from the overflow
 page?
>>>
>>> No, I think you could have two operations:
>>>
>>> 1. Move tuples from the "read" page to the "write" page.
>>>
>>> 2. Unlink the overflow page from the chain and mark it free.
>>>
>>> If we fail after step 1, the bucket chain might end with an empty
>>> overflow page, but that's OK.
>>
>> If there is an empty page in the bucket chain, access to that page
>> will give an error.  (In the WAL patch we are initializing the page
>> instead of making it completely empty, so we might not see an error
>> in such a case.)
>
> It wouldn't be a new, uninitialized page.  It would be empty of
> tuples, not all-zeroes.
>

AFAIU, we initialize the page as all-zeros, but I think you are
envisioning that we need to change it to a new uninitialized page.

>> What advantage do you see in splitting the operation?
>
> It's simpler.  The code here is very complicated and trying to merge
> too many things into a single operation may make it even more
> complicated, increasing the risk of bugs and making the code hard to
> maintain without necessarily buying much performance.
>

Sure, if you find that way better, then we can change it, but the
current patch treats it as a single operation.  If, after looking at
the patch, you find it better to change it into two operations, I will
do so.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-12-20 Thread Robert Haas
On Tue, Dec 20, 2016 at 9:01 AM, Amit Kapila  wrote:
> On Tue, Dec 20, 2016 at 7:11 PM, Robert Haas  wrote:
>> On Tue, Dec 20, 2016 at 4:51 AM, Amit Kapila  wrote:
>>> We have mainly four actions in the squeeze operation: add tuples to
>>> the write page, empty the overflow page, unlink the overflow page,
>>> and mark it free by setting the corresponding bit in the bitmap
>>> page.  Now, if we don't log the changes to the write page and the
>>> freeing of the overflow page as one operation, then won't a query on
>>> the standby either see duplicate tuples or miss the tuples freed
>>> from the overflow page?
>>
>> No, I think you could have two operations:
>>
>> 1. Move tuples from the "read" page to the "write" page.
>>
>> 2. Unlink the overflow page from the chain and mark it free.
>>
>> If we fail after step 1, the bucket chain might end with an empty
>> overflow page, but that's OK.
>
> If there is an empty page in the bucket chain, access to that page
> will give an error.  (In the WAL patch we are initializing the page
> instead of making it completely empty, so we might not see an error in
> such a case.)

It wouldn't be a new, uninitialized page.  It would be empty of
tuples, not all-zeroes.

> What advantage do you see in splitting the operation?

It's simpler.  The code here is very complicated and trying to merge
too many things into a single operation may make it even more
complicated, increasing the risk of bugs and making the code hard to
maintain without necessarily buying much performance.

> Anyway, I think it is better to discuss this in the WAL patch thread.

OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-12-20 Thread Amit Kapila
On Tue, Dec 20, 2016 at 7:11 PM, Robert Haas  wrote:
> On Tue, Dec 20, 2016 at 4:51 AM, Amit Kapila  wrote:
>> We have mainly four actions in the squeeze operation: add tuples to
>> the write page, empty the overflow page, unlink the overflow page,
>> and mark it free by setting the corresponding bit in the bitmap page.
>> Now, if we don't log the changes to the write page and the freeing of
>> the overflow page as one operation, then won't a query on the standby
>> either see duplicate tuples or miss the tuples freed from the
>> overflow page?
>
> No, I think you could have two operations:
>
> 1. Move tuples from the "read" page to the "write" page.
>
> 2. Unlink the overflow page from the chain and mark it free.
>
> If we fail after step 1, the bucket chain might end with an empty
> overflow page, but that's OK.
>

If there is an empty page in the bucket chain, access to that page will
give an error.  (In the WAL patch we are initializing the page instead
of making it completely empty, so we might not see an error in such a
case.)  What advantage do you see in splitting the operation?
Anyway, I think it is better to discuss this in the WAL patch thread.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-12-20 Thread Robert Haas
On Tue, Dec 20, 2016 at 4:51 AM, Amit Kapila  wrote:
> We have mainly four actions for squeeze operation, add tuples to the
> write page, empty overflow page, unlinks overflow page, make it free
> by setting the corresponding bit in overflow page.  Now, if we don't
> log the changes to write page and freeing of overflow page as one
> operation, then won't query on standby can either see duplicate tuples
> or miss the tuples which are freed in overflow page.

No, I think you could have two operations:

1. Move tuples from the "read" page to the "write" page.

2. Unlink the overflow page from the chain and mark it free.

If we fail after step 1, the bucket chain might end with an empty
overflow page, but that's OK.
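
Roughly, as a sketch (the record types and helpers named here are
placeholders for whatever the WAL patch defines, not committed code):

/* Step 1: move tuples from the "read" page to the "write" page,
 * logged as one atomic action covering both pages. */
START_CRIT_SECTION();
/* copy the tuples into the write page, then delete them from the
 * read page */
MarkBufferDirty(wbuf);
MarkBufferDirty(rbuf);
/* XLogInsert(RM_HASH_ID, XLOG_HASH_MOVE_PAGE_CONTENTS); */
END_CRIT_SECTION();

/* Step 2: unlink the now-empty overflow page and mark it free,
 * logged as a second, independent action. */
START_CRIT_SECTION();
/* update the prev/next links in the chain and clear the page's bit
 * in the bitmap page */
/* XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE); */
END_CRIT_SECTION();

Because step 1 covers both pages in a single record, the standby can
never see the same tuples on the write page and the overflow page at
once; a crash between the steps just leaves an empty overflow page at
the end of the chain.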

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-12-20 Thread Amit Kapila
On Mon, Dec 19, 2016 at 11:05 PM, Robert Haas  wrote:
> On Sun, Dec 18, 2016 at 8:54 AM, Amit Kapila  wrote:
>>> I committed remove-hash-wrtbuf and fix_dirty_marking_v1 but I've got
>>> some reservations about fix_lock_chaining_v1.  ISTM that the natural
>>> fix here would be to change the API contract for _hash_freeovflpage so
>>> that it doesn't release the lock on the write buffer.  Why does it
>>> even do that?  I think that the only reason why _hash_freeovflpage
>>> should be getting wbuf as an argument is so that it can handle the
>>> case where wbuf happens to be the previous block correctly.
>>
>> Yeah, as of now that is the only case, but for the WAL patch, I think
>> we need to ensure that moving all the tuples to the page being
>> written and freeing the overflow page are logged together as an
>> atomic operation.
>
> Not really.  We can have one operation that empties the overflow page
> and another that unlinks it and makes it free.
>

We have mainly four actions in the squeeze operation: add tuples to the
write page, empty the overflow page, unlink the overflow page, and mark
it free by setting the corresponding bit in the bitmap page.  Now, if we
don't log the changes to the write page and the freeing of the overflow
page as one operation, then won't a query on the standby either see
duplicate tuples or miss the tuples freed from the overflow page?

>> Apart from that, it is theoretically possible that the write page
>> will remain locked while multiple overflow pages are freed (when the
>> page being written has enough space to accommodate tuples from
>> multiple overflow pages).  I am not sure it is worth worrying about
>> such a case because it will rarely happen in practice.  So, I have
>> prepared a patch to retain the lock on wbuf in _hash_freeovflpage(),
>> as you suggested.
>
> Committed.
>

Thanks.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-12-19 Thread Robert Haas
On Sun, Dec 18, 2016 at 8:54 AM, Amit Kapila  wrote:
>> I committed remove-hash-wrtbuf and fix_dirty_marking_v1 but I've got
>> some reservations about fix_lock_chaining_v1.  ISTM that the natural
>> fix here would be to change the API contract for _hash_freeovflpage so
>> that it doesn't release the lock on the write buffer.  Why does it
>> even do that?  I think that the only reason why _hash_freeovflpage
>> should be getting wbuf as an argument is so that it can handle the
>> case where wbuf happens to be the previous block correctly.
>
> Yeah, as of now that is the only case, but for the WAL patch, I think
> we need to ensure that moving all the tuples to the page being written
> and freeing the overflow page are logged together as an atomic
> operation.

Not really.  We can have one operation that empties the overflow page
and another that unlinks it and makes it free.

> Apart from that, it is theoretically possible that the write page will
> remain locked while multiple overflow pages are freed (when the page
> being written has enough space to accommodate tuples from multiple
> overflow pages).  I am not sure it is worth worrying about such a case
> because it will rarely happen in practice.  So, I have prepared a patch
> to retain the lock on wbuf in _hash_freeovflpage(), as you suggested.

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-12-18 Thread Amit Kapila
On Fri, Dec 16, 2016 at 9:57 PM, Robert Haas  wrote:
> On Thu, Dec 15, 2016 at 11:33 AM, Amit Kapila  wrote:
>> Attached are the two patches on top of remove-hash-wrtbuf.  The patch
>> fix_dirty_marking_v1.patch marks the buffer dirty in one of the
>> corner cases in _hash_freeovflpage() and avoids marking it dirty
>> unnecessarily in _hash_squeezebucket().  I think this can be combined
>> with the remove-hash-wrtbuf patch.  fix_lock_chaining_v1.patch fixes
>> the chaining behavior (lock the next overflow bucket page before
>> releasing the previous one) that was broken in _hash_freeovflpage().
>> These patches can be applied in series: first remove-hash-wrtbuf,
>> then fix_dirty_marking_v1.patch, and then fix_lock_chaining_v1.patch.
>
> I committed remove-hash-wrtbuf and fix_dirty_marking_v1 but I've got
> some reservations about fix_lock_chaining_v1.  ISTM that the natural
> fix here would be to change the API contract for _hash_freeovflpage so
> that it doesn't release the lock on the write buffer.  Why does it
> even do that?  I think that the only reason why _hash_freeovflpage
> should be getting wbuf as an argument is so that it can handle the
> case where wbuf happens to be the previous block correctly.
>

Yeah, as of now that is the only case, but for the WAL patch, I think we
need to ensure that moving all the tuples to the page being written and
freeing the overflow page are logged together as an atomic operation.
Apart from that, it is theoretically possible that the write page will
remain locked while multiple overflow pages are freed (when the page
being written has enough space to accommodate tuples from multiple
overflow pages).  I am not sure it is worth worrying about such a case
because it will rarely happen in practice.  So, I have prepared a patch
to retain the lock on wbuf in _hash_freeovflpage(), as you suggested.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


fix_lock_chaining_v2.patch
Description: Binary data



Re: [HACKERS] Hash Indexes

2016-12-16 Thread Robert Haas
On Thu, Dec 15, 2016 at 11:33 AM, Amit Kapila  wrote:
> Attached are the two patches on top of remove-hash-wrtbuf.  The patch
> fix_dirty_marking_v1.patch marks the buffer dirty in one of the corner
> cases in _hash_freeovflpage() and avoids marking it dirty unnecessarily
> in _hash_squeezebucket().  I think this can be combined with the
> remove-hash-wrtbuf patch.  fix_lock_chaining_v1.patch fixes the
> chaining behavior (lock the next overflow bucket page before releasing
> the previous one) that was broken in _hash_freeovflpage().  These
> patches can be applied in series: first remove-hash-wrtbuf, then
> fix_dirty_marking_v1.patch, and then fix_lock_chaining_v1.patch.

I committed remove-hash-wrtbuf and fix_dirty_marking_v1 but I've got
some reservations about fix_lock_chaining_v1.  ISTM that the natural
fix here would be to change the API contract for _hash_freeovflpage so
that it doesn't release the lock on the write buffer.  Why does it
even do that?  I think that the only reason why _hash_freeovflpage
should be getting wbuf as an argument is so that it can handle the
case where wbuf happens to be the previous block correctly.  Aside
from that there's no reason for it to touch wbuf.  If you fix it like
that then you should be able to avoid this rather ugly wart:

 * XXX Here, we are moving to next overflow page for writing without
 * ensuring if the previous write page is full.  This is annoying, but
 * should not hurt much in practice as that space will anyway be consumed
 * by future inserts.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-12-15 Thread Amit Kapila
On Wed, Dec 14, 2016 at 10:47 PM, Robert Haas  wrote:
> On Wed, Dec 14, 2016 at 4:27 AM, Amit Kapila  wrote:
>>
>> Yeah, it will fix the problem in hashbucketcleanup, but there are two
>> other problems that need to be fixed, for which I can send a separate
>> patch.  By the way, as mentioned earlier, the WAL patch already takes
>> care of removing _hash_wrtbuf and simplifies the logic wherever
>> possible without calling MarkBufferDirty multiple times under one
>> lock.  However, if you want to proceed with the current patch, I can
>> send you separate patches for the remaining problems addressed in the
>> bug-fix patch.
>
> That sounds good.
>

Attached are the two patches on top of remove-hash-wrtbuf.  The patch
fix_dirty_marking_v1.patch marks the buffer dirty in one of the corner
cases in _hash_freeovflpage() and avoids marking it dirty unnecessarily
in _hash_squeezebucket().  I think this can be combined with the
remove-hash-wrtbuf patch.  fix_lock_chaining_v1.patch fixes the chaining
behavior (lock the next overflow bucket page before releasing the
previous one) that was broken in _hash_freeovflpage().  These patches
can be applied in series: first remove-hash-wrtbuf, then
fix_dirty_marking_v1.patch, and then fix_lock_chaining_v1.patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


fix_dirty_marking_v1.patch
Description: Binary data


fix_lock_chaining_v1.patch
Description: Binary data



Re: [HACKERS] Hash Indexes

2016-12-14 Thread Robert Haas
On Wed, Dec 14, 2016 at 4:27 AM, Amit Kapila  wrote:
>> It's not required for enabling WAL.  You don't *have* to release and
>> reacquire the buffer lock; in fact, that just adds overhead.
>
> If we don't release the lock, it will break the general coding pattern
> for writing WAL (acquire pin and exclusive lock, MarkBufferDirty,
> write WAL, release lock).  Basically, I think that pattern exists to
> ensure we don't hold the lock across multiple atomic operations, or in
> this case to avoid calling MarkBufferDirty multiple times.

I think you're being too pedantic.  Yes, the general rule is to
release the lock after each WAL record, but no harm comes if we take
the lock, emit TWO WAL records, and then release it.

> It is possible that we call MarkBufferDirty multiple times (twice in
> hashbucketcleanup and then in _hash_squeezebucket) while holding the
> lock.  I don't think that is a big problem as of now, but I wanted to
> avoid it because I thought we would need that for the WAL patch.

I see no harm in calling MarkBufferDirty multiple times, either now or
after the WAL patch.  Of course we don't want to end up with tons of
extra calls - it's not totally free - but it's pretty cheap.

>>  Aside from hopefully fixing the bug we're talking about
>> here, this makes the logic in several places noticeably less
>> contorted.
>
> Yeah, it will fix the problem in hashbucketcleanup, but there are two
> other problems that need to be fixed, for which I can send a separate
> patch.  By the way, as mentioned earlier, the WAL patch already takes
> care of removing _hash_wrtbuf and simplifies the logic wherever
> possible without calling MarkBufferDirty multiple times under one
> lock.  However, if you want to proceed with the current patch, I can
> send you separate patches for the remaining problems addressed in the
> bug-fix patch.

That sounds good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-12-14 Thread Amit Kapila
On Tue, Dec 13, 2016 at 11:30 PM, Robert Haas  wrote:
> On Mon, Dec 12, 2016 at 9:21 PM, Amit Kapila  wrote:
>> The reason is to make the operations consistent between master and
>> standby.  In the WAL patch, we write a WAL record for clearing the
>> SPLIT_CLEANUP flag, and if we don't release the lock after writing
>> it, the operation can appear on the standby even before it does on
>> the master.  Currently, without WAL, there is no benefit to doing so,
>> and we could fix it by using MarkBufferDirty; however, I thought it
>> might be simpler to keep it this way since it is required for
>> enabling WAL.  OTOH, we can leave that for the WAL patch.  Let me
>> know which way you prefer.
>
> It's not required for enabling WAL.  You don't *have* to release and
> reacquire the buffer lock; in fact, that just adds overhead.
>

If we don't release the lock, it will break the general coding pattern
for writing WAL (acquire pin and exclusive lock, MarkBufferDirty, write
WAL, release lock).  Basically, I think that pattern exists to ensure
we don't hold the lock across multiple atomic operations, or in this
case to avoid calling MarkBufferDirty multiple times.
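
For reference, the pattern in question looks roughly like this (a
minimal sketch; XLOG_HASH_PLACEHOLDER stands in for whatever record
type the WAL patch will define, since hash has no WAL records today):

XLogRecPtr  recptr;

START_CRIT_SECTION();

/* 1. modify the page while holding a pin and exclusive lock */
opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP;
MarkBufferDirty(buf);

/* 2. WAL-log the change and stamp the page with the record's LSN */
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_PLACEHOLDER);
PageSetLSN(BufferGetPage(buf), recptr);

END_CRIT_SECTION();

/* 3. release the lock promptly, keeping the pin if still needed */
LockBuffer(buf, BUFFER_LOCK_UNLOCK);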

> You *do*
> have to be aware that the standby will perhaps see the intermediate
> state, because it won't hold the lock throughout.  But that does not
> mean that the master must release the lock.
>

Okay, but I think that will be avoided to a great extent because we do
follow the practice of releasing the lock immediately after writing
the WAL.

>>> I don't think we should be afraid to call MarkBufferDirty() instead of
>>> going through these (fairly stupid) hasham-specific APIs.
>>
>> Right, and anyway we will need to use it at many more call sites when
>> we enable WAL for hash indexes.
>
> I propose the attached patch, which removes _hash_wrtbuf() and causes
> those functions which previously called it to do MarkBufferDirty()
> directly.
>


It is possible that we call MarkBufferDirty multiple times (twice in
hashbucketcleanup and then in _hash_squeezebucket) while holding the
lock.  I don't think that is a big problem as of now, but I wanted to
avoid it because I thought we would need that for the WAL patch.

>  Aside from hopefully fixing the bug we're talking about
> here, this makes the logic in several places noticeably less
> contorted.
>

Yeah, it will fix the problem in hashbucketcleanup, but there are two
other problems that need to be fixed, for which I can send a separate
patch.  By the way, as mentioned earlier, the WAL patch already takes
care of removing _hash_wrtbuf and simplifies the logic wherever possible
without calling MarkBufferDirty multiple times under one lock.  However,
if you want to proceed with the current patch, I can send you separate
patches for the remaining problems addressed in the bug-fix patch.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-12-13 Thread Robert Haas
On Mon, Dec 12, 2016 at 9:21 PM, Amit Kapila  wrote:
> The reason is to make the operations consistent between master and
> standby.  In the WAL patch, we write a WAL record for clearing the
> SPLIT_CLEANUP flag, and if we don't release the lock after writing it,
> the operation can appear on the standby even before it does on the
> master.  Currently, without WAL, there is no benefit to doing so, and
> we could fix it by using MarkBufferDirty; however, I thought it might
> be simpler to keep it this way since it is required for enabling WAL.
> OTOH, we can leave that for the WAL patch.  Let me know which way you
> prefer.

It's not required for enabling WAL.  You don't *have* to release and
reacquire the buffer lock; in fact, that just adds overhead.  You *do*
have to be aware that the standby will perhaps see the intermediate
state, because it won't hold the lock throughout.  But that does not
mean that the master must release the lock.

>> I don't think we should be afraid to call MarkBufferDirty() instead of
>> going through these (fairly stupid) hasham-specific APIs.
>
> Right, and anyway we will need to use it at many more call sites when
> we enable WAL for hash indexes.

I propose the attached patch, which removes _hash_wrtbuf() and causes
those functions which previously called it to do MarkBufferDirty()
directly.  Aside from hopefully fixing the bug we're talking about
here, this makes the logic in several places noticeably less
contorted.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index f1511d0..0eeb37d 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -635,7 +635,8 @@ loop_top:
 		num_index_tuples = metap->hashm_ntuples;
 	}
 
-	_hash_wrtbuf(rel, metabuf);
+	MarkBufferDirty(metabuf);
+	_hash_relbuf(rel, metabuf);
 
 	/* return statistics */
 	if (stats == NULL)
@@ -724,7 +725,6 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		OffsetNumber deletable[MaxOffsetNumber];
 		int			ndeletable = 0;
 		bool		retain_pin = false;
-		bool		curr_page_dirty = false;
 
 		vacuum_delay_point();
 
@@ -805,7 +805,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		{
 			PageIndexMultiDelete(page, deletable, ndeletable);
 			bucket_dirty = true;
-			curr_page_dirty = true;
+			MarkBufferDirty(buf);
 		}
 
 		/* bail out if there are no more pages to scan. */
@@ -820,15 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 * release the lock on previous page after acquiring the lock on next
 		 * page
 		 */
-		if (curr_page_dirty)
-		{
-			if (retain_pin)
-_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
-			else
-_hash_wrtbuf(rel, buf);
-			curr_page_dirty = false;
-		}
-		else if (retain_pin)
+		if (retain_pin)
 			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
 		else
 			_hash_relbuf(rel, buf);
@@ -862,6 +854,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
 		bucket_opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP;
+		MarkBufferDirty(bucket_buf);
 	}
 
 	/*
@@ -873,7 +866,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
 			bstrategy);
 	else
-		_hash_chgbufaccess(rel, bucket_buf, HASH_WRITE, HASH_NOLOCK);
+		_hash_chgbufaccess(rel, bucket_buf, HASH_READ, HASH_NOLOCK);
 }
 
 void
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 572146a..59c4213 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -208,11 +208,12 @@ restart_insert:
 	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
 
 	/*
-	 * write and release the modified page.  if the page we modified was an
+	 * dirty and release the modified page.  if the page we modified was an
 	 * overflow page, we also need to separately drop the pin we retained on
 	 * the primary bucket page.
 	 */
-	_hash_wrtbuf(rel, buf);
+	MarkBufferDirty(buf);
+	_hash_relbuf(rel, buf);
 	if (buf != bucket_buf)
 		_hash_dropbuf(rel, bucket_buf);
 
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index e2d208e..8fbf494 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -149,10 +149,11 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 	/* logically chain overflow page to previous page */
 	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
+	MarkBufferDirty(buf);
 	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
-		_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
 	else
-		_hash_wrtbuf(rel, buf);
+		_hash_relbuf(rel, buf);
 
 	return ovflbuf;
 }
@@ -304,7 +305,8 @@ found:
 
 	/* mark page "in use" in t

Re: [HACKERS] Hash Indexes

2016-12-13 Thread Jesper Pedersen

On 12/11/2016 11:37 PM, Amit Kapila wrote:
> On Sun, Dec 11, 2016 at 11:54 AM, Amit Kapila  wrote:
>> On Wed, Dec 7, 2016 at 2:02 AM, Jeff Janes  wrote:
>>
>> With the above fixes, the test ran successfully for more than a day.
>>
> There was a small typo in the previous patch, which is fixed in the
> attached version.  I don't think it will impact the test results if
> you have already started the test with the previous patch, but if not,
> it is better to test with the attached one.

A mixed workload (primarily INSERT, DELETE, and VACUUM) is successful
here too using _v2.


Thanks !

Best regards,
 Jesper





Re: [HACKERS] Hash Indexes

2016-12-12 Thread Amit Kapila
On Tue, Dec 13, 2016 at 2:51 AM, Robert Haas  wrote:
> On Sun, Dec 11, 2016 at 1:24 AM, Amit Kapila  wrote:
>>
>> With the above fixes, the test ran successfully for more than a day.
>
> Instead of doing this:
>
> +_hash_chgbufaccess(rel, bucket_buf, HASH_WRITE, HASH_NOLOCK);
> +_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
>
> ...wouldn't it be better to just do MarkBufferDirty()?  There's no
> real reason to release the lock only to reacquire it again, is there?
>

The reason is to make the operations consistent between master and
standby.  In the WAL patch, we write a WAL record for clearing the
SPLIT_CLEANUP flag, and if we don't release the lock after writing it,
the operation can appear on the standby even before it does on the
master.  Currently, without WAL, there is no benefit to doing so, and we
could fix it by using MarkBufferDirty; however, I thought it might be
simpler to keep it this way since it is required for enabling WAL.
OTOH, we can leave that for the WAL patch.  Let me know which way you
prefer.

> I don't think we should be afraid to call MarkBufferDirty() instead of
> going through these (fairly stupid) hasham-specific APIs.
>

Right, and anyway we will need to use it at many more call sites when we
enable WAL for hash indexes.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-12-12 Thread Robert Haas
On Sun, Dec 11, 2016 at 1:24 AM, Amit Kapila  wrote:
> The reason for this, and for the similar error in vacuum, was that in
> one of the corner cases, after freeing the overflow page and updating
> the link for the previous bucket, we were not marking the buffer as
> dirty.  So, due to concurrent activity, the buffer containing the
> updated links got evicted, and later, when we tried to access the same
> buffer, we read back the old copy, which contained a link to the freed
> overflow page.
>
> Apart from the above issue, Kuntal noticed an assertion failure
> (Assert(bucket == new_bucket);) in hashbucketcleanup with the same
> test you provided.  The reason for that problem was that after
> deleting tuples in hashbucketcleanup, we were not marking the buffer
> as dirty, due to which the old copy of the overflow page reappeared
> and caused that failure.
>
> After fixing the above problem, we noticed another assertion failure
> (Assert(bucket == obucket);) in _hash_splitbucket_guts.  The reason
> for this problem was that after a split, vacuum failed to remove
> tuples from the old bucket that had been moved due to the split.
> During the next split from the same old bucket, we don't expect the
> old bucket to contain tuples from the previous split.  To fix this,
> whenever vacuum needs to perform split cleanup, it should update the
> metapage values (the masks required to calculate the bucket number) so
> that it doesn't miss cleaning those tuples.
> I believe this is the same assertion failure that Andreas reported in
> another thread [1].
>
> The next problem we encountered was that after running the same test
> somewhat longer, inserts were failing with the error "unexpected zero
> page at block ..".  After some analysis, I found that the lock
> chaining (lock the next overflow bucket page before releasing the
> previous one) was broken in one corner case in _hash_freeovflpage, due
> to which an insert got ahead of the squeeze-bucket operation and
> accessed the freed overflow page before the link to it had been
> updated.
>
> With the above fixes, the test ran successfully for more than a day.

Instead of doing this:

+_hash_chgbufaccess(rel, bucket_buf, HASH_WRITE, HASH_NOLOCK);
+_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);

...wouldn't it be better to just do MarkBufferDirty()?  There's no
real reason to release the lock only to reacquire it again, is there?
I don't think we should be afraid to call MarkBufferDirty() instead of
going through these (fairly stupid) hasham-specific APIs.
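
In other words, while continuing to hold the lock, one call suffices:

    MarkBufferDirty(bucket_buf);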

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-12-11 Thread Amit Kapila
On Mon, Dec 12, 2016 at 10:25 AM, Jeff Janes  wrote:
> On Sun, Dec 11, 2016 at 8:37 PM, Amit Kapila 
> wrote:
>>
>> On Sun, Dec 11, 2016 at 11:54 AM, Amit Kapila 
>> wrote:
>> > On Wed, Dec 7, 2016 at 2:02 AM, Jeff Janes  wrote:
>> >
>> > With the above fixes, the test ran successfully for more than a day.
>> >
>>
>> There was a small typo in the previous patch, which is fixed in the
>> attached version.  I don't think it will impact the test results if
>> you have already started the test with the previous patch, but if
>> not, it is better to test with the attached one.
>
>
> Thanks,  I've already been running the previous one for several hours, and
> so far it looks good.
>

Thanks.

>  I've tried forward porting it to the WAL patch to
> test that as well, but didn't have any luck doing so.
>

I think we can verify the WAL patch separately.  I am already working on it.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-12-11 Thread Jeff Janes
On Sun, Dec 11, 2016 at 8:37 PM, Amit Kapila 
wrote:

> On Sun, Dec 11, 2016 at 11:54 AM, Amit Kapila 
> wrote:
> > On Wed, Dec 7, 2016 at 2:02 AM, Jeff Janes  wrote:
> >
> > With the above fixes, the test ran successfully for more than a day.
> >
>
> There was a small typo in the previous patch, which is fixed in the
> attached version.  I don't think it will impact the test results if you
> have already started the test with the previous patch, but if not, it
> is better to test with the attached one.
>

Thanks,  I've already been running the previous one for several hours, and
so far it looks good.  I've tried forward porting it to the WAL patch to
test that as well, but didn't have any luck doing so.

Cheers,

Jeff


Re: [HACKERS] Hash Indexes

2016-12-11 Thread Amit Kapila
On Tue, Dec 6, 2016 at 1:29 PM, Jeff Janes  wrote:
> On Thu, Dec 1, 2016 at 10:54 PM, Amit Kapila 
> wrote:
>
> With the latest HASH WAL patch applied, I get different but apparently
> related errors
>
> 41993 UPDATE XX002 2016-12-05 22:28:45.333 PST:ERROR:  index "foo_index_idx"
> contains corrupted page at block 27602
> 41993 UPDATE XX002 2016-12-05 22:28:45.333 PST:HINT:  Please REINDEX it.
> 41993 UPDATE XX002 2016-12-05 22:28:45.333 PST:STATEMENT:  update foo set
> count=count+1 where index=$1
>

This is not a problem with the WAL patch per se.  It should be fixed by
the hash index bug-fix patch I sent upthread.  However, the WAL patch
needs to be rebased on top of the bug-fix patch, which I will do and
send after verification.  In the meantime, it would be great if you
could verify the posted fix.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-12-11 Thread Amit Kapila
On Sun, Dec 11, 2016 at 11:54 AM, Amit Kapila  wrote:
> On Wed, Dec 7, 2016 at 2:02 AM, Jeff Janes  wrote:
>
> With the above fixes, the test ran successfully for more than a day.
>

There was a small typo in the previous patch, which is fixed in the
attached version.  I don't think it will impact the test results if you
have already started the test with the previous patch, but if not, it is
better to test with the attached one.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


fix_hashindex_issues_v2.patch
Description: Binary data



Re: [HACKERS] Hash Indexes

2016-12-10 Thread Amit Kapila
On Wed, Dec 7, 2016 at 2:02 AM, Jeff Janes  wrote:
> On Tue, Dec 6, 2016 at 4:00 AM, Amit Kapila  wrote:
>>
>> On Tue, Dec 6, 2016 at 1:29 PM, Jeff Janes  wrote:
>> >
>> >
>> > I just occasionally insert a bunch of equal tuples, which have to be in
>> > overflow pages no matter how much splitting happens.
>> >
>> > I am getting vacuum errors against HEAD, after about 20 minutes or so (8
>> > cores).
>> >
>> > 49233  XX002 2016-12-05 23:06:44.087 PST:ERROR:  index "foo_index_idx"
>> > contains unexpected zero page at block 64941
>> > 49233  XX002 2016-12-05 23:06:44.087 PST:HINT:  Please REINDEX it.
>> > 49233  XX002 2016-12-05 23:06:44.087 PST:CONTEXT:  automatic vacuum of
>> > table
>> > "jjanes.public.foo"
>> >
>>
>> Thanks for the report.  This can happen due to vacuum trying to
>> access a new page.  Vacuum can do so if (a) the calculation for
>> maxbuckets (in the metapage) is wrong, or (b) it is trying to free the
>> overflow page twice.  Offhand, I don't see how that can happen in the
>> code.  I will investigate further to see if there is any other reason
>> why vacuum could access a new page.  BTW, have you run the test after
>> commit 2f4193c3?  That doesn't appear to be the cause of this problem,
>> but still, it is better to test after that fix.  I am trying to
>> reproduce the issue, but in the meantime, if by any chance you have a
>> call stack, please share it.
>
>
> It looks like I compiled the code for testing a few minutes before 2f4193c3
> went in.
>
> I've run it again with cb9dcbc1eebd8, after promoting the ERROR in
> _hash_checkpage, hashutil.c:174 to a PANIC so that I can get a coredump to
> feed to gdb.
>
> This time it was an INSERT, not autovac, that got the error:
>

The reason for this, and for the similar error in vacuum, was that in
one of the corner cases, after freeing the overflow page and updating
the link for the previous bucket, we were not marking the buffer as
dirty.  So, due to concurrent activity, the buffer containing the
updated links got evicted, and later, when we tried to access the same
buffer, we read back the old copy, which contained a link to the freed
overflow page.

Apart from the above issue, Kuntal noticed an assertion failure
(Assert(bucket == new_bucket);) in hashbucketcleanup with the same test
you provided.  The reason for that problem was that after deleting
tuples in hashbucketcleanup, we were not marking the buffer as dirty,
due to which the old copy of the overflow page reappeared and caused
that failure.

After fixing the above problem, we noticed another assertion failure
(Assert(bucket == obucket);) in _hash_splitbucket_guts.  The reason for
this problem was that after a split, vacuum failed to remove tuples
from the old bucket that had been moved due to the split.  During the
next split from the same old bucket, we don't expect the old bucket to
contain tuples from the previous split.  To fix this, whenever vacuum
needs to perform split cleanup, it should update the metapage values
(the masks required to calculate the bucket number) so that it doesn't
miss cleaning those tuples.  I believe this is the same assertion
failure that Andreas reported in another thread [1].
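
For reference, the bucket-number calculation that depends on those
metapage values is _hash_hashkey2bucket() in hashutil.c:

Bucket
_hash_hashkey2bucket(uint32 hashkey, Bucket maxbucket,
                     uint32 highmask, uint32 lowmask)
{
    Bucket      bucket;

    bucket = hashkey & highmask;
    if (bucket > maxbucket)
        bucket = bucket & lowmask;

    return bucket;
}

With stale maxbucket/highmask/lowmask values from before the split,
this computes a bucket number that no longer matches the bucket being
cleaned, which is what the failing assertion caught.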

The next problem we encountered was that after running the same test
somewhat longer, inserts were failing with the error "unexpected zero
page at block ..".  After some analysis, I found that the lock chaining
(lock the next overflow bucket page before releasing the previous one)
was broken in one corner case in _hash_freeovflpage, due to which an
insert got ahead of the squeeze-bucket operation and accessed the freed
overflow page before the link to it had been updated.
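
The rule being restored can be sketched like this (simplified, with
illustrative variable names):

/* Take the lock on the next overflow page *before* giving up the
 * current one, so a concurrent insert can never overtake the squeeze
 * and reach a page that is about to be freed. */
BlockNumber nextblkno = ovflopaque->hasho_nextblkno;
Buffer      nextbuf = InvalidBuffer;

if (BlockNumberIsValid(nextblkno))
    nextbuf = _hash_getbuf_with_strategy(rel, nextblkno, HASH_WRITE,
                                         LH_OVERFLOW_PAGE, bstrategy);

/* only now is it safe to release the previous page */
_hash_relbuf(rel, ovflbuf);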

With the above fixes, the test ran successfully for more than a day.

Many thanks to Kuntal and Dilip for helping me in analyzing and
testing the fixes for these problems.

[1] - https://www.postgresql.org/message-id/87y3zrzbu5.fsf_-_%40ansel.ydns.eu

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


fix_hashindex_issues_v1.patch
Description: Binary data



Re: [HACKERS] Hash Indexes

2016-12-06 Thread Jeff Janes
On Tue, Dec 6, 2016 at 4:00 AM, Amit Kapila  wrote:

> On Tue, Dec 6, 2016 at 1:29 PM, Jeff Janes  wrote:
> >
> >
> > I just occasionally insert a bunch of equal tuples, which have to be in
> > overflow pages no matter how much splitting happens.
> >
> > I am getting vacuum errors against HEAD, after about 20 minutes or so (8
> > cores).
> >
> > 49233  XX002 2016-12-05 23:06:44.087 PST:ERROR:  index "foo_index_idx"
> > contains unexpected zero page at block 64941
> > 49233  XX002 2016-12-05 23:06:44.087 PST:HINT:  Please REINDEX it.
> > 49233  XX002 2016-12-05 23:06:44.087 PST:CONTEXT:  automatic vacuum of
> > table
> > "jjanes.public.foo"
> >
>
> Thanks for the report.  This can happen due to vacuum trying to access
> a new page.  Vacuum can do so if (a) the calculation for maxbuckets
> (in the metapage) is wrong, or (b) it is trying to free the overflow
> page twice.  Offhand, I don't see how that can happen in the code.  I
> will investigate further to see if there is any other reason why vacuum
> could access a new page.  BTW, have you run the test after commit
> 2f4193c3?  That doesn't appear to be the cause of this problem, but
> still, it is better to test after that fix.  I am trying to reproduce
> the issue, but in the meantime, if by any chance you have a call stack,
> please share it.
>

It looks like I compiled the code for testing a few minutes before 2f4193c3
went in.

I've run it again with cb9dcbc1eebd8, after promoting the ERROR in
_hash_checkpage, hashutil.c:174 to a PANIC so that I can get a coredump to
feed to gdb.

This time it was an INSERT, not autovac, that got the error:

35495 INSERT XX002 2016-12-06 09:25:09.517 PST:PANIC:  XX002: index
"foo_index_idx" contains unexpected zero page at block 202328
35495 INSERT XX002 2016-12-06 09:25:09.517 PST:HINT:  Please REINDEX it.
35495 INSERT XX002 2016-12-06 09:25:09.517 PST:LOCATION:  _hash_checkpage,
hashutil.c:174
35495 INSERT XX002 2016-12-06 09:25:09.517 PST:STATEMENT:  insert into foo
(index) select $1 from generate_series(1,1)


#0  0x003838c325e5 in raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
64return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x003838c325e5 in raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x003838c33dc5 in abort () at abort.c:92
#2  0x007d6adf in errfinish (dummy=) at
elog.c:557
#3  0x00498d93 in _hash_checkpage (rel=0x7f4d030906a0, buf=, flags=) at hashutil.c:169
#4  0x004967cf in _hash_getbuf_with_strategy (rel=0x7f4d030906a0,
blkno=, access=2, flags=1, bstrategy=)
at hashpage.c:234
#5  0x00493dbb in hashbucketcleanup (rel=0x7f4d030906a0,
cur_bucket=14544, bucket_buf=7801, bucket_blkno=22864, bstrategy=0x0,
maxbucket=276687,
highmask=524287, lowmask=262143, tuples_removed=0x0,
num_index_tuples=0x0, split_cleanup=1 '\001', callback=0,
callback_state=0x0) at hash.c:799
#6  0x00497921 in _hash_expandtable (rel=0x7f4d030906a0,
metabuf=7961) at hashpage.c:668
#7  0x00495596 in _hash_doinsert (rel=0x7f4d030906a0,
itup=0x1f426b0) at hashinsert.c:236
#8  0x00494830 in hashinsert (rel=0x7f4d030906a0, values=, isnull=, ht_ctid=0x7f4d03076404,
heapRel=, checkUnique=) at
hash.c:247
#9  0x005c81bc in ExecInsertIndexTuples (slot=0x1f28940,
tupleid=0x7f4d03076404, estate=0x1f28280, noDupErr=0 '\000',
specConflict=0x0,
arbiterIndexes=0x0) at execIndexing.c:389
#10 0x005e74ad in ExecInsert (node=0x1f284d0) at
nodeModifyTable.c:496
#11 ExecModifyTable (node=0x1f284d0) at nodeModifyTable.c:1511
#12 0x005cc9d8 in ExecProcNode (node=0x1f284d0) at
execProcnode.c:396
#13 0x005ca53a in ExecutePlan (queryDesc=0x1f21a30,
direction=, count=0) at execMain.c:1567
#14 standard_ExecutorRun (queryDesc=0x1f21a30, direction=, count=0) at execMain.c:338
#15 0x7f4d0c1a6dfb in pgss_ExecutorRun (queryDesc=0x1f21a30,
direction=ForwardScanDirection, count=0) at pg_stat_statements.c:877
#16 0x006dfc8a in ProcessQuery (plan=,
sourceText=0x1f21990 "insert into foo (index) select $1 from
generate_series(1,1)",
params=0x1f219e0, dest=0xc191c0, completionTag=0x7ffe82a959d0 "") at
pquery.c:185
#17 0x006dfeda in PortalRunMulti (portal=0x1e86900, isTopLevel=1
'\001', setHoldSnapshot=0 '\000', dest=0xc191c0, altdest=0xc191c0,
completionTag=0x7ffe82a959d0 "") at pquery.c:1299
#18 0x006e056c in PortalRun (portal=0x1e86900,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x1eec870,
altdest=0x1eec870,
completionTag=0x7ffe82a959d0 "") at pquery.c:813
#19 0x006de832 in exec_execute_message (argc=,
argv=, dbname=0x1e933b8 "jjanes",
username=) at postgres.c:1977
#20 PostgresMain (argc=, argv=,
dbname=0x1e933b8 "jjanes", username=) at
postgres.c:4132
#21 0x0067e8a4 in BackendRun (argc=,
argv=) at postmaster.c:4274
#22 BackendStartup (argc=, argv=)
at postmaster.c:3946
#23 ServerLoop (argc=, argv=) at
postmaster.c:1704
#24 Postmas

Re: [HACKERS] Hash Indexes

2016-12-06 Thread Amit Kapila
On Tue, Dec 6, 2016 at 1:29 PM, Jeff Janes  wrote:
>
>
> I just occasionally insert a bunch of equal tuples, which have to be in
> overflow pages no matter how much splitting happens.
>
> I am getting vacuum errors against HEAD, after about 20 minutes or so (8
> cores).
>
> 49233  XX002 2016-12-05 23:06:44.087 PST:ERROR:  index "foo_index_idx"
> contains unexpected zero page at block 64941
> 49233  XX002 2016-12-05 23:06:44.087 PST:HINT:  Please REINDEX it.
> 49233  XX002 2016-12-05 23:06:44.087 PST:CONTEXT:  automatic vacuum of table
> "jjanes.public.foo"
>

Thanks for the report.  This can happen due to vacuum trying to access
a new page.  Vacuum can do so if (a) the calculation for maxbuckets
(in the metapage) is wrong, or (b) it is trying to free the overflow
page twice.  Offhand, I don't see how that can happen in the code.  I
will investigate further to see if there is any other reason why vacuum
could access a new page.  BTW, have you run the test after commit
2f4193c3?  That doesn't appear to be the cause of this problem, but
still, it is better to test after that fix.  I am trying to reproduce
the issue, but in the meantime, if by any chance you have a call stack,
please share it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-12-06 Thread Jeff Janes
On Thu, Dec 1, 2016 at 10:54 PM, Amit Kapila 
wrote:

> On Thu, Dec 1, 2016 at 8:56 PM, Robert Haas  wrote:
> > On Thu, Dec 1, 2016 at 12:43 AM, Amit Kapila 
> wrote:
> >> On Thu, Dec 1, 2016 at 2:18 AM, Robert Haas 
> wrote:
> >>> On Wed, Nov 23, 2016 at 8:50 AM, Amit Kapila 
> wrote:
>  [ new patch ]
> >>>
> >>> Committed with some further cosmetic changes.
> >>
> >> Thank you very much.
> >>
> >>> I think it would be worth testing this code with very long overflow
> >>> chains by hacking the fill factor up to 1000
> >>
> >> 1000 is not a valid value for fill factor.  Did you intend to say 100?
> >
> > No.  IIUC, 100 would mean split when the average bucket contains 1
> > page worth of tuples.
> >
>
> I also think so.
>
> >  I want to split when the average bucket
> > contains 10 pages worth of tuples.
> >
>
> Oh, I think what you mean is to hack the code to bump the fill factor
> and then test it.  I was confused about how a user could do that from
> a SQL command.
>

I just occasionally insert a bunch of equal tuples, which have to be in
overflow pages no matter how much splitting happens.

I am getting vacuum errors against HEAD, after about 20 minutes or so (8
cores).

49233  XX002 2016-12-05 23:06:44.087 PST:ERROR:  index "foo_index_idx"
contains unexpected zero page at block 64941
49233  XX002 2016-12-05 23:06:44.087 PST:HINT:  Please REINDEX it.
49233  XX002 2016-12-05 23:06:44.087 PST:CONTEXT:  automatic vacuum of
table "jjanes.public.foo"

The testing harness is attached.  It includes a lot of code to test
crash recovery, but all of that is turned off in this instance.  No
patches need to be applied to the server to get this one to run.


With the latest HASH WAL patch applied, I get different but apparently
related errors

41993 UPDATE XX002 2016-12-05 22:28:45.333 PST:ERROR:  index
"foo_index_idx" contains corrupted page at block 27602
41993 UPDATE XX002 2016-12-05 22:28:45.333 PST:HINT:  Please REINDEX it.
41993 UPDATE XX002 2016-12-05 22:28:45.333 PST:STATEMENT:  update foo set
count=count+1 where index=$1

Cheers,

Jeff


count.pl
Description: Binary data


do_nocrash.sh
Description: Bourne shell script



Re: [HACKERS] Hash Indexes

2016-12-05 Thread Robert Haas
On Fri, Dec 2, 2016 at 10:54 PM, Amit Kapila  wrote:
> On Sat, Dec 3, 2016 at 12:13 AM, Robert Haas  wrote:
>> On Fri, Dec 2, 2016 at 1:54 AM, Amit Kapila  wrote:
  I want to split when the average bucket
 contains 10 pages worth of tuples.
>>>
>>> Oh, I think what you mean is to hack the code to bump the fill
>>> factor and then test it.  I was confused about how a user could do
>>> that from a SQL command.
>>
>> Yes, that's why I said "hacking the fill factor up to 1000" when I
>> originally mentioned it.
>>
>> Actually, for hash indexes, there's no reason why we couldn't allow
>> fillfactor settings greater than 100, and it might be useful.
>
> Yeah, I agree with that, but as of now it might be tricky to support a
> different fillfactor range for just one of the index types.  Another
> idea could be to have an additional storage parameter, like
> split_bucket_length or something like that, for hash indexes,
> indicating that a split will occur after the average bucket contains
> "split_bucket_length * page" worth of tuples.  We do have additional
> storage parameters for other types of indexes, so having one for the
> hash index should not be a problem.

Agreed.

> I think this is important because a split immediately increases the
> hash index's size by approximately two times.  We might want to change
> that algorithm someday, but the above idea would prevent that in many
> cases.

Also agreed.

But the first thing is that you should probably do some testing in
that area via a quick hack to see if anything breaks in an obvious
way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-12-02 Thread Amit Kapila
On Sat, Dec 3, 2016 at 12:13 AM, Robert Haas  wrote:
> On Fri, Dec 2, 2016 at 1:54 AM, Amit Kapila  wrote:
>>>  I want to split when the average bucket
>>> contains 10 pages worth of tuples.
>>
>> Oh, I think what you mean is to hack the code to bump the fill factor
>> and then test it.  I was confused about how a user could do that from
>> a SQL command.
>
> Yes, that's why I said "hacking the fill factor up to 1000" when I
> originally mentioned it.
>
> Actually, for hash indexes, there's no reason why we couldn't allow
> fillfactor settings greater than 100, and it might be useful.
>

Yeah, I agree with that, but as of now it might be tricky to support a
different fillfactor range for just one of the index types.  Another
idea could be to have an additional storage parameter, like
split_bucket_length or something like that, for hash indexes, indicating
that a split will occur after the average bucket contains
"split_bucket_length * page" worth of tuples.  We do have additional
storage parameters for other types of indexes, so having one for the
hash index should not be a problem.
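
A hypothetical sketch of how such a parameter might be registered
(nothing like this exists in the tree; the extension-facing reloptions
API is used here for illustration, whereas a built-in option would go
into intRelOpts[] in reloptions.c):

/* hash-only storage parameter: average pages per bucket before a
 * split is triggered (name and bounds are illustrative) */
add_int_reloption(RELOPT_KIND_HASH,
                  "split_bucket_length",
                  "Average pages per bucket before triggering a split",
                  1,        /* default: split at ~1 page per bucket */
                  1,        /* min */
                  100);     /* max */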

I think this is important because a split immediately increases the hash
index's size by approximately two times.  We might want to change that
algorithm someday, but the above idea would prevent that in many cases.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-12-02 Thread Robert Haas
On Fri, Dec 2, 2016 at 1:54 AM, Amit Kapila  wrote:
>>  I want to split when the average bucket
>> contains 10 pages worth of tuples.
>
> Oh, I think what you mean is to hack the code to bump the fill factor
> and then test it.  I was confused about how a user could do that from
> a SQL command.

Yes, that's why I said "hacking the fill factor up to 1000" when I
originally mentioned it.

Actually, for hash indexes, there's no reason why we couldn't allow
fillfactor settings greater than 100, and it might be useful.
Possibly it should be the default.  Not 1000, certainly, but I'm not
sure that the current value of 75 is at all optimal.  The optimal
value might be 100 or 125 or 150 or something like that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-12-01 Thread Amit Kapila
On Thu, Dec 1, 2016 at 8:56 PM, Robert Haas  wrote:
> On Thu, Dec 1, 2016 at 12:43 AM, Amit Kapila  wrote:
>> On Thu, Dec 1, 2016 at 2:18 AM, Robert Haas  wrote:
>>> On Wed, Nov 23, 2016 at 8:50 AM, Amit Kapila  
>>> wrote:
 [ new patch ]
>>>
>>> Committed with some further cosmetic changes.
>>
>> Thank you very much.
>>
>>> I think it would be worth testing this code with very long overflow
>>> chains by hacking the fill factor up to 1000
>>
>> 1000 is not a valid value for fill factor.  Did you intend to say 100?
>
> No.  IIUC, 100 would mean split when the average bucket contains 1
> page worth of tuples.
>

I also think so.

>  I want to split when the average bucket
> contains 10 pages worth of tuples.
>

Oh, I think what you mean is to hack the code to bump the fill factor
and then test it.  I was confused about how a user could do that from a
SQL command.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-12-01 Thread Robert Haas
On Thu, Dec 1, 2016 at 12:43 AM, Amit Kapila  wrote:
> On Thu, Dec 1, 2016 at 2:18 AM, Robert Haas  wrote:
>> On Wed, Nov 23, 2016 at 8:50 AM, Amit Kapila  wrote:
>>> [ new patch ]
>>
>> Committed with some further cosmetic changes.
>
> Thank you very much.
>
>> I think it would be worth testing this code with very long overflow
>> chains by hacking the fill factor up to 1000
>
> 1000 is not a valid value for fill factor.  Did you intend to say 100?

No.  IIUC, 100 would mean split when the average bucket contains 1
page worth of tuples.  I want to split when the average bucket
contains 10 pages worth of tuples.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-11-30 Thread Amit Kapila
On Thu, Dec 1, 2016 at 2:18 AM, Robert Haas  wrote:
> On Wed, Nov 23, 2016 at 8:50 AM, Amit Kapila  wrote:
>> [ new patch ]
>
> Committed with some further cosmetic changes.
>

Thank you very much.

> I think it would be worth testing this code with very long overflow
> chains by hacking the fill factor up to 1000
>

1000 is not a valid value for fill factor.  Did you intend to say 100?

> or something of that
> sort, so that we get lots and lots of overflow pages before we start
> splitting.  I think that might find some bugs that aren't obvious
> right now because most buckets get split before they even have a
> single overflow bucket.
>
> Also, the deadlock hazards that we talked about upthread should
> probably be documented in the README somewhere, along with why we're
> OK with accepting those hazards.
>

That makes sense.  I will send a patch along those lines.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-11-30 Thread Robert Haas
On Wed, Nov 23, 2016 at 8:50 AM, Amit Kapila  wrote:
> [ new patch ]

Committed with some further cosmetic changes.  I guess I won't be very
surprised if this turns out to have a few bugs yet, but I think it's
in fairly good shape at this point.

I think it would be worth testing this code with very long overflow
chains by hacking the fill factor up to 1000 or something of that
sort, so that we get lots and lots of overflow pages before we start
splitting.  I think that might find some bugs that aren't obvious
right now because most buckets get split before they even have a
single overflow bucket.
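
Something as simple as this would do (a testing-only hack, never for
commit), assuming the default still comes from HASH_DEFAULT_FILLFACTOR
in src/include/access/hash.h:

/* make each bucket hold ~10 pages worth of tuples before it splits,
 * so that long overflow chains build up */
#define HASH_DEFAULT_FILLFACTOR     1000    /* stock value is 75 */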

Also, the deadlock hazards that we talked about upthread should
probably be documented in the README somewhere, along with why we're
OK with accepting those hazards.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-11-18 Thread Amit Kapila
On Fri, Nov 18, 2016 at 12:11 PM, Amit Kapila  wrote:
> On Thu, Nov 17, 2016 at 10:54 PM, Robert Haas  wrote:
>> On Thu, Nov 17, 2016 at 12:05 PM, Amit Kapila  
>> wrote:
>>
 I think this comment is saying that we'll release the pin on the
 primary bucket page for now, and then reacquire it later if the user
 reverses the scan direction.  But that doesn't sound very safe,
 because the bucket could be split in the meantime and the order in
 which tuples are returned could change.  I think we want that to
 remain stable within a single query execution.
>>>
>>> Isn't that possible even without the patch?  Basically, after reaching
>>> end of forward scan and for doing backward *all* scan, we need to
>>> perform portal rewind which will in turn call hashrescan where we will
>>> drop the lock on bucket and then again when we try to move cursor
>>> forward we acquire lock in _hash_first(), so in between when we don't
>>> have the lock, the split could happen and next scan results could
>>> differ.
>>
>> Well, the existing code doesn't drop the heavyweight lock at that
>> location, but your patch does drop the pin that serves the same
>> function, so I feel like there must be some difference.
>>
>
> Yes, but I am not sure if existing code is right.  Consider below scenario,
>
> Session-1
>
> Begin;
> Declare c cursor for select * from t4 where c1=1;
> Fetch forward all from c;  --here shared heavy-weight lock count becomes 1
> Fetch prior from c; --here shared heavy-weight lock count becomes 2
> close c; -- here, lock release will reduce the lock count and shared
> heavy-weight lock count becomes 1
>
> Now, if we try to insert from another session, such that it leads to
> bucket-split of the bucket for which session-1 had used a cursor, it
> will wait for session-1.
>

It will not wait, but will just skip the split because we are using a
try lock; however, the point remains that a select should not hold
bucket-level locks even after the cursor is closed.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-11-17 Thread Amit Kapila
On Thu, Nov 17, 2016 at 10:54 PM, Robert Haas  wrote:
> On Thu, Nov 17, 2016 at 12:05 PM, Amit Kapila  wrote:
>
>>> I think this comment is saying that we'll release the pin on the
>>> primary bucket page for now, and then reacquire it later if the user
>>> reverses the scan direction.  But that doesn't sound very safe,
>>> because the bucket could be split in the meantime and the order in
>>> which tuples are returned could change.  I think we want that to
>>> remain stable within a single query execution.
>>
>> Isn't that possible even without the patch?  Basically, after reaching
>> end of forward scan and for doing backward *all* scan, we need to
>> perform portal rewind which will in turn call hashrescan where we will
>> drop the lock on bucket and then again when we try to move cursor
>> forward we acquire lock in _hash_first(), so in between when we don't
>> have the lock, the split could happen and next scan results could
>> differ.
>
> Well, the existing code doesn't drop the heavyweight lock at that
> location, but your patch does drop the pin that serves the same
> function, so I feel like there must be some difference.
>

Yes, but I am not sure the existing code is right.  Consider the scenario below:

Session-1

Begin;
Declare c cursor for select * from t4 where c1=1;
Fetch forward all from c;  --here shared heavy-weight lock count becomes 1
Fetch prior from c; --here shared heavy-weight lock count becomes 2
close c; -- here, lock release will reduce the lock count and shared
heavy-weight lock count becomes 1

Now, if we try to insert from another session in a way that leads to a
split of the bucket for which session-1 had used the cursor, the insert
will wait for session-1, and can only proceed after session-1 commits.
I think that after the cursor is closed in session-1, the insert from
another session should succeed, don't you think so?
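
For illustration, the other session in that scenario could look like
this (t4's schema is assumed here; the point is just to insert enough
rows hashing to the same bucket that a split of it becomes necessary):

-- Session-2, while session-1's transaction is still open after "close c"
INSERT INTO t4 (c1) SELECT 1 FROM generate_series(1, 100000);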

>>> Come to think of it, I'm a little worried about the locking in
>>> _hash_squeezebucket().  It seems like we drop the lock on each "write"
>>> bucket page before taking the lock on the next one.  So a concurrent
>>> scan could get ahead of the cleanup process.  That would be bad,
>>> wouldn't it?
>>
>> Yeah, that would be bad if it happens, but no concurrent scan can
>> happen during squeeze phase because we take an exclusive lock on a
>> bucket page and maintain it throughout the operation.
>
> Well, that's completely unacceptable.  A major reason the current code
> uses heavyweight locks is because you can't hold lightweight locks
> across arbitrary amounts of work -- because, just to take one example,
> a process holding or waiting for an LWLock isn't interruptible.  The
> point of this redesign was to get rid of that, so that LWLocks are
> only held for short periods.  I dislike the lock-chaining approach
> (take the next lock before releasing the previous one) quite a bit and
> really would like to find a way to get rid of that, but the idea of
> holding a buffer lock across a complete traversal of an unbounded
> number of overflow buckets is far worse.  We've got to come up with a
> design that doesn't require that, or else completely redesign the
> bucket-squeezing stuff.
>

I think we can use the idea of lock-chaining (take the next lock
before releasing the previous one) for the squeeze phase to solve this
issue.  Basically, for the squeeze operation, what we need to ensure is
that no scan is inside the bucket when the squeeze starts, and that any
scan starting afterward always stays behind the write end of the
squeeze.  If we follow this, there shouldn't be any problem even for
backward scans, because a backward scan has to start at the first page
of the bucket chain and reach the last page by locking each page in
read mode, so it too stays behind the squeeze.
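
A rough sketch of the chaining step at the write end, reusing the
existing hash AM helpers (the surrounding chain-walking logic is
omitted, and next_wblkno stands for the next block in the chain):

    /*
     * Lock the next write page before releasing the current one, so a
     * concurrent scan following behind can never overtake the squeeze.
     */
    nextbuf = _hash_getbuf(rel, next_wblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
    _hash_relbuf(rel, wbuf);    /* release current only after next is locked */
    wbuf = nextbuf;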

> (Would it make any sense to change the order of the hash index patches
> we've got outstanding?  For instance, if we did the page-at-a-time
> stuff first, it would make life simpler for this patch in several
> ways, possibly including this issue.)
>

I agree that page-at-a-time can help hash indexes, but I don't think
it can help with this particular issue in the squeeze operation.  While
cleaning dead tuples, it would be okay even if a scan went ahead of
cleanup (given page-at-a-time mode), but for the squeeze we can't
afford that, because the squeeze can move tuples to an earlier bucket
page, and the scan would then miss those tuples.  Also, page-at-a-time
will help with cleaning tuples only for MVCC scans (it might not help
for unlogged-table scans or non-MVCC scans).  Another point is that we
don't have a patch for page-at-a-time scans ready at this stage.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-11-17 Thread Robert Haas
On Thu, Nov 17, 2016 at 12:05 PM, Amit Kapila  wrote:
> Are you expecting a comment change here?
>
>> +old_blkno = _hash_get_oldblock_from_newbucket(rel,
>> opaque->hasho_bucket);
>>
>> Couldn't you pass "bucket" here instead of "hasho->opaque_bucket"?  I
>> feel like I'm repeating this ad nauseam, but I really think it's bad
>> to rely on the special space instead of our own local variables!
>>
>
> Sure, we can pass bucket as well. However, if you see few lines below
> (while (BlockNumberIsValid(opaque->hasho_nextblkno))), we are already
> relying on special space to pass variables. In general, we are using
> special space to pass variables to functions in many other places in
> the code.  What exactly are you bothered about in accessing special
> space, if it is safe to do?

I don't want to rely on the special space to know which buffers we
have locked or pinned.  We obviously need the special space to find
the next and previous buffers in the block chain -- there's no other
way to know that.  However, we should be more careful about locks and
pins.  If the special space is corrupted in some way, we still
shouldn't get confused about which buffers we have locked or pinned.

>> I think this comment is saying that we'll release the pin on the
>> primary bucket page for now, and then reacquire it later if the user
>> reverses the scan direction.  But that doesn't sound very safe,
>> because the bucket could be split in the meantime and the order in
>> which tuples are returned could change.  I think we want that to
>> remain stable within a single query execution.
>
> Isn't that possible even without the patch?  Basically, after reaching
> end of forward scan and for doing backward *all* scan, we need to
> perform portal rewind which will in turn call hashrescan where we will
> drop the lock on bucket and then again when we try to move cursor
> forward we acquire lock in _hash_first(), so in between when we don't
> have the lock, the split could happen and next scan results could
> differ.

Well, the existing code doesn't drop the heavyweight lock at that
location, but your patch does drop the pin that serves the same
function, so I feel like there must be some difference.

>> Also, I think that so->hashso_skip_moved_tuples is badly designed.
>> There are two separate facts you need to know: (1) whether you are
>> scanning a bucket that was still being populated at the start of your
>> scan and (2) if yes, whether you are scanning the bucket being
>> populated or whether you are instead scanning the corresponding "old"
>> bucket.  You're trying to keep track of that using one Boolean, but
>> one Boolean only has two states and there are three possible states
>> here.
>
> So do you prefer to have two booleans to track those facts or have an
> uint8 with a tri-state value or something else?

I don't currently have a preference.

>> Come to think of it, I'm a little worried about the locking in
>> _hash_squeezebucket().  It seems like we drop the lock on each "write"
>> bucket page before taking the lock on the next one.  So a concurrent
>> scan could get ahead of the cleanup process.  That would be bad,
>> wouldn't it?
>
> Yeah, that would be bad if it happens, but no concurrent scan can
> happen during squeeze phase because we take an exclusive lock on a
> bucket page and maintain it throughout the operation.

Well, that's completely unacceptable.  A major reason the current code
uses heavyweight locks is because you can't hold lightweight locks
across arbitrary amounts of work -- because, just to take one example,
a process holding or waiting for an LWLock isn't interruptible.  The
point of this redesign was to get rid of that, so that LWLocks are
only held for short periods.  I dislike the lock-chaining approach
(take the next lock before releasing the previous one) quite a bit and
really would like to find a way to get rid of that, but the idea of
holding a buffer lock across a complete traversal of an unbounded
number of overflow buckets is far worse.  We've got to come up with a
design that doesn't require that, or else completely redesign the
bucket-squeezing stuff.

(Would it make any sense to change the order of the hash index patches
we've got outstanding?  For instance, if we did the page-at-a-time
stuff first, it would make life simpler for this patch in several
ways, possibly including this issue.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-11-17 Thread Amit Kapila
On Thu, Nov 17, 2016 at 3:08 AM, Robert Haas  wrote:
> On Sat, Nov 12, 2016 at 12:49 AM, Amit Kapila  wrote:
>> You are right and I have changed the code as per your suggestion.
>
> So...
>
> +/*
> + * We always maintain the pin on bucket page for whole scan operation,
> + * so releasing the additional pin we have acquired here.
> + */
> +if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
> +_hash_dropbuf(rel, *bufp);
>
> This relies on the page contents to know whether we took a pin; that
> seems like a bad plan.  We need to know intrinsically whether we took
> a pin.
>

Okay, I think we can do that, as we have the bucket buffer information
(hashso_bucket_buf) in HashScanOpaqueData.  We might need to pass this
information to _hash_readprev.

> + * If the bucket split is in progress, then we need to skip tuples that
> + * are moved from old bucket.  To ensure that vacuum doesn't clean any
> + * tuples from old or new buckets till this scan is in progress, maintain
> + * a pin on both of the buckets.  Here, we have to be cautious about
>
> It wouldn't be a problem if VACUUM removed tuples from the new bucket,
> because they'd have to be dead anyway.   It also wouldn't be a problem
> if it removed tuples from the old bucket that were actually dead.  The
> real issue isn't vacuum anyway, but the process of cleaning up after a
> split.  We need to hold the pin so that tuples being moved from the
> old bucket to the new bucket by the split don't get removed from the
> old bucket until our scan is done.
>

Are you expecting a comment change here?

> +old_blkno = _hash_get_oldblock_from_newbucket(rel,
> opaque->hasho_bucket);
>
> Couldn't you pass "bucket" here instead of "hasho->opaque_bucket"?  I
> feel like I'm repeating this ad nauseam, but I really think it's bad
> to rely on the special space instead of our own local variables!
>

Sure, we can pass bucket as well.  However, if you look a few lines
below (while (BlockNumberIsValid(opaque->hasho_nextblkno))), we are
already relying on the special space to pass variables.  In general, we
use the special space to pass variables to functions in many other
places in the code.  What exactly bothers you about accessing the
special space, if it is safe to do so?

> -/* we ran off the end of the bucket without finding a match */
> +/*
> + * We ran off the end of the bucket without finding a match.
> + * Release the pin on bucket buffers.  Normally, such pins are
> + * released at end of scan, however scrolling cursors can
> + * reacquire the bucket lock and pin in the same scan multiple
> + * times.
> + */
>  *bufP = so->hashso_curbuf = InvalidBuffer;
>  ItemPointerSetInvalid(current);
> +_hash_dropscanbuf(rel, so);
>
> I think this comment is saying that we'll release the pin on the
> primary bucket page for now, and then reacquire it later if the user
> reverses the scan direction.  But that doesn't sound very safe,
> because the bucket could be split in the meantime and the order in
> which tuples are returned could change.  I think we want that to
> remain stable within a single query execution.
>

Isn't that possible even without the patch?  Basically, after reaching
the end of a forward scan, to do a backward *all* scan we need to
perform a portal rewind, which will in turn call hashrescan, where we
drop the lock on the bucket; then, when we try to move the cursor
forward again, we acquire the lock in _hash_first().  So in between,
while we don't hold the lock, a split could happen and the next scan's
results could differ.

Also, the documentation mentions that "The SQL standard says that it is
implementation-dependent whether cursors are sensitive to concurrent
updates of the underlying data by default. In PostgreSQL, cursors are
insensitive by default, and can be made sensitive by specifying FOR
UPDATE.", which I think indicates that results can't be guaranteed to
match between forward and backward scans.

So, even if we try to come up with some solution for stable results in
some scenarios, I am not sure that can be guaranteed for all
scenarios.


> +/*
> + * setting hashso_skip_moved_tuples to false
> + * ensures that we don't check for tuples that are
> + * moved by split in old bucket and it also
> + * ensures that we won't retry to scan the old
> + * bucket once the scan for same is finished.
> + */
> +so->hashso_skip_moved_tuples = false;
>
> I think you've got a big problem here.  Suppose the user starts the
> scan in the new bucket and runs it forward until they end up in the
> old bucket.  Then they turn around and run the scan backward.  When
> they reach the beginning of the old bucket, they're going to stop, not
> move back to the new bucket, AFAICS.  Oops.

Re: [HACKERS] Hash Indexes

2016-11-16 Thread Robert Haas
On Sat, Nov 12, 2016 at 12:49 AM, Amit Kapila  wrote:
> You are right and I have changed the code as per your suggestion.

So...

+/*
+ * We always maintain the pin on bucket page for whole scan operation,
+ * so releasing the additional pin we have acquired here.
+ */
+if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+_hash_dropbuf(rel, *bufp);

This relies on the page contents to know whether we took a pin; that
seems like a bad plan.  We need to know intrinsically whether we took
a pin.

+ * If the bucket split is in progress, then we need to skip tuples that
+ * are moved from old bucket.  To ensure that vacuum doesn't clean any
+ * tuples from old or new buckets till this scan is in progress, maintain
+ * a pin on both of the buckets.  Here, we have to be cautious about

It wouldn't be a problem if VACUUM removed tuples from the new bucket,
because they'd have to be dead anyway.   It also wouldn't be a problem
if it removed tuples from the old bucket that were actually dead.  The
real issue isn't vacuum anyway, but the process of cleaning up after a
split.  We need to hold the pin so that tuples being moved from the
old bucket to the new bucket by the split don't get removed from the
old bucket until our scan is done.

+old_blkno = _hash_get_oldblock_from_newbucket(rel, opaque->hasho_bucket);

Couldn't you pass "bucket" here instead of "hasho->opaque_bucket"?  I
feel like I'm repeating this ad nauseam, but I really think it's bad
to rely on the special space instead of our own local variables!

-/* we ran off the end of the bucket without finding a match */
+/*
+ * We ran off the end of the bucket without finding a match.
+ * Release the pin on bucket buffers.  Normally, such pins are
+ * released at end of scan, however scrolling cursors can
+ * reacquire the bucket lock and pin in the same scan multiple
+ * times.
+ */
 *bufP = so->hashso_curbuf = InvalidBuffer;
 ItemPointerSetInvalid(current);
+_hash_dropscanbuf(rel, so);

I think this comment is saying that we'll release the pin on the
primary bucket page for now, and then reacquire it later if the user
reverses the scan direction.  But that doesn't sound very safe,
because the bucket could be split in the meantime and the order in
which tuples are returned could change.  I think we want that to
remain stable within a single query execution.

+_hash_readnext(rel, &buf, &page, &opaque,
+   (opaque->hasho_flag & LH_BUCKET_PAGE) ? true : false);

Same comment: don't rely on the special space to figure this out.
Keep track.  Also != 0 would be better than ? true : false.

+/*
+ * setting hashso_skip_moved_tuples to false
+ * ensures that we don't check for tuples that are
+ * moved by split in old bucket and it also
+ * ensures that we won't retry to scan the old
+ * bucket once the scan for same is finished.
+ */
+so->hashso_skip_moved_tuples = false;

I think you've got a big problem here.  Suppose the user starts the
scan in the new bucket and runs it forward until they end up in the
old bucket.  Then they turn around and run the scan backward.  When
they reach the beginning of the old bucket, they're going to stop, not
move back to the new bucket, AFAICS.  Oops.

_hash_first() has a related problem: a backward scan starts at the end
of the new bucket and moves backward, but it should start at the end
of the old bucket, and then when it reaches the beginning, flip to the
new bucket and move backward through that one.  Otherwise, a backward
scan and a forward scan don't return tuples in opposite order, which
they should.

I think what you need to do to fix both of these problems is a more
thorough job gluing the two buckets together.  I'd suggest that the
responsibility for switching between the two buckets should probably
be given to _hash_readprev() and _hash_readnext(), because every place
that needs to advance to the next or previous page that cares about
this.  Right now you are trying to handle it mostly in the functions
that call those functions, but that is prone to errors of omission.

Also, I think that so->hashso_skip_moved_tuples is badly designed.
There are two separate facts you need to know: (1) whether you are
scanning a bucket that was still being populated at the start of your
scan and (2) if yes, whether you are scanning the bucket being
populated or whether you are instead scanning the corresponding "old"
bucket.  You're trying to keep track of that using one Boolean, but
one Boolean only has two states and there are three possible states
here.
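
One way to encode those three states explicitly would be an enum along
these lines (a sketch; the type and value names here are invented, not
from the patch):

    typedef enum HashScanSplitState
    {
        HASHSO_BUCKET_STABLE,        /* bucket was not being populated at scan start */
        HASHSO_SCANNING_NEW_BUCKET,  /* scanning the bucket being populated */
        HASHSO_SCANNING_OLD_BUCKET   /* scanning the corresponding "old" bucket */
    } HashScanSplitState;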

+if (H_BUCKET_BEING_SPLIT(pag

Re: [HACKERS] Hash Indexes

2016-11-09 Thread Amit Kapila
On Thu, Nov 10, 2016 at 2:57 AM, Robert Haas  wrote:
>
> Some more review:
>
> The API contract of _hash_finish_split seems a bit unfortunate.   The
> caller is supposed to have obtained a cleanup lock on both the old and
> new buffers, but the first thing it does is walk the entire new bucket
> chain, completely ignoring the old one.  That means holding a cleanup
> lock on the old buffer across an unbounded number of I/O operations --
> which also means that you can't interrupt the query by pressing ^C,
> because an LWLock (on the old buffer) is held.
>

I see the problem you are talking about, but that was done to ensure
the locking order -- old bucket first and then new bucket -- as
otherwise there could be a deadlock risk.  However, I think we can
avoid holding the cleanup lock on the old bucket while we scan the new
bucket to form the hash table of TIDs.

>  Moreover, the
> requirement to hold a lock on the new buffer isn't convenient for
> either caller; they both have to go do it, so why not move it into the
> function?
>

Yeah, we can move the locking of the new bucket entirely into that function.

>  Perhaps the function should be changed so that it
> guarantees that a pin is held on the primary page of the existing
> bucket, but no locks are held.
>

Okay, so we can change the locking order as follows:
a. ensure a cleanup lock on the old bucket and check whether the (old)
bucket has a pending split.
b. if there is a pending split, release the lock on the old bucket, but
keep the pin.

The steps below will be performed by _hash_finish_split():

c. acquire the read content lock on the new bucket and form the hash
table of TIDs; in the process of forming the hash table, we need to
traverse the whole bucket chain.  While traversing the chain, release
the lock on the previous page (both lock and pin if it is not the
primary bucket page).
d. after the hash table is formed, conditionally acquire cleanup locks
on the old and new buckets; if we cannot get the cleanup lock on
either, just return from there.
e. perform the split operation.
f. release the lock and pin on the new bucket.
g. release the lock on the old bucket.  We don't want to release the
pin on the old bucket, as the caller acquired it before passing the
bucket to _hash_finish_split(), so releasing the pin should be the
responsibility of the caller.

Now, both callers need to ensure that they restart the operation from
the beginning, because after we release the lock on the old bucket,
somebody might have split it.

Does the above change in locking strategy sound okay?

> Where _hash_finish_split does LockBufferForCleanup(bucket_nbuf),
> should it instead be trying to get the lock conditionally and
> returning immediately if it doesn't get the lock?  Seems like a good
> idea...
>

Yeah, we can take the cleanup lock conditionally, but that would waste
the effort of forming the hash table if we don't get the cleanup lock
immediately.  Considering that incomplete splits should be a rare
occurrence, maybe that wasted effort is okay, but I am not sure.  Don't
you think we should avoid it?
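
The conditional variant itself is straightforward with the existing
bufmgr API -- a sketch, the cost being that the TID hash table built
above is thrown away on failure:

    if (!ConditionalLockBufferForCleanup(bucket_nbuf))
        return;     /* give up; a later caller will finish the split */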

>  * We're at the end of the old bucket chain, so we're done partitioning
>  * the tuples.  Mark the old and new buckets to indicate split is
>  * finished.
>  *
>  * To avoid deadlocks due to locking order of buckets, first lock the old
>  * bucket and then the new bucket.
>
> These comments have drifted too far from the code to which they refer.
> The first part is basically making the same point as the
> slightly-later comment /* indicate that split is finished */.
>

I think we can remove the second comment, /* indicate that split is
finished */.  Apart from that, the comment you have quoted seems to be
in line with the current code: at that point, we have finished
partitioning the tuples, so I don't understand what makes you think it
has drifted from the code.  Is it because of the second part of the
comment (To avoid deadlocks ...)?  If so, we can move that part a few
lines down, to where we actually perform the locking on the old and new
buckets.

> The use of _hash_relbuf, _hash_wrtbuf, and _hash_chgbufaccess is
> coming to seem like a horrible idea to me.  That's not your fault - it
> was like this before - but maybe in a followup patch we should
> consider ripping all of that out and just calling MarkBufferDirty(),
> ReleaseBuffer(), LockBuffer(), UnlockBuffer(), and/or
> UnlockReleaseBuffer() as appropriate.  As far as I can see, the
> current style is just obfuscating the code.
>

Okay, we can study that and try to change it in the way you are
suggesting.  It seems this was partially derived from the btree code,
where we have the function _bt_relbuf().  I am sure we won't need
_hash_wrtbuf after the WAL patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-11-09 Thread Robert Haas
On Wed, Nov 9, 2016 at 12:11 PM, Robert Haas  wrote:
> On Wed, Nov 9, 2016 at 11:41 AM, Amit Kapila  wrote:
>> On Wed, Nov 9, 2016 at 9:10 PM, Robert Haas  wrote:
>>> On Wed, Nov 9, 2016 at 9:04 AM, Amit Kapila  wrote:
>>> I think we can give a brief explanation right in the code comment.  I
>>> referred to "decreasing the TIDs"; you are referring to "moving tuples
>>> around".  But I think that moving the tuples to different locations is
>>> not the problem.  I think the problem is that a tuple might be
>>> assigned a lower spot in the item pointer array
>>
>> I think we both understand the problem and it is just matter of using
>> different words.  I will go with your suggestion and will try to
>> slightly adjust the README as well so that both places use same
>> terminology.
>
> Yes, I think we're on the same page.

Some more review:

The API contract of _hash_finish_split seems a bit unfortunate.   The
caller is supposed to have obtained a cleanup lock on both the old and
new buffers, but the first thing it does is walk the entire new bucket
chain, completely ignoring the old one.  That means holding a cleanup
lock on the old buffer across an unbounded number of I/O operations --
which also means that you can't interrupt the query by pressing ^C,
because an LWLock (on the old buffer) is held.  Moreover, the
requirement to hold a lock on the new buffer isn't convenient for
either caller; they both have to go do it, so why not move it into the
function?  Perhaps the function should be changed so that it
guarantees that a pin is held on the primary page of the existing
bucket, but no locks are held.

Where _hash_finish_split does LockBufferForCleanup(bucket_nbuf),
should it instead be trying to get the lock conditionally and
returning immediately if it doesn't get the lock?  Seems like a good
idea...

 * We're at the end of the old bucket chain, so we're done partitioning
 * the tuples.  Mark the old and new buckets to indicate split is
 * finished.
 *
 * To avoid deadlocks due to locking order of buckets, first lock the old
 * bucket and then the new bucket.

These comments have drifted too far from the code to which they refer.
The first part is basically making the same point as the
slightly-later comment /* indicate that split is finished */.

The use of _hash_relbuf, _hash_wrtbuf, and _hash_chgbufaccess is
coming to seem like a horrible idea to me.  That's not your fault - it
was like this before - but maybe in a followup patch we should
consider ripping all of that out and just calling MarkBufferDirty(),
ReleaseBuffer(), LockBuffer(), UnlockBuffer(), and/or
UnlockReleaseBuffer() as appropriate.  As far as I can see, the
current style is just obfuscating the code.

itupsize = new_itup->t_info & INDEX_SIZE_MASK;
new_itup->t_info &= ~INDEX_SIZE_MASK;
new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
new_itup->t_info |= itupsize;

If I'm not mistaken, you could omit the first, second, and fourth
lines here and keep only the third one, and it would do exactly the
same thing.  The first line saves the bits in INDEX_SIZE_MASK.  The
second line clears the bits in INDEX_SIZE_MASK.   The fourth line
re-sets the bits that were originally saved.
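
In other words (per the analysis above: since INDEX_MOVED_BY_SPLIT_MASK
lies outside INDEX_SIZE_MASK, OR-ing it in leaves the size bits
untouched), the four lines collapse to:

    new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;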

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-11-09 Thread Robert Haas
On Wed, Nov 9, 2016 at 11:41 AM, Amit Kapila  wrote:
> On Wed, Nov 9, 2016 at 9:10 PM, Robert Haas  wrote:
>> On Wed, Nov 9, 2016 at 9:04 AM, Amit Kapila  wrote:
>> I think we can give a brief explanation right in the code comment.  I
>> referred to "decreasing the TIDs"; you are referring to "moving tuples
>> around".  But I think that moving the tuples to different locations is
>> not the problem.  I think the problem is that a tuple might be
>> assigned a lower spot in the item pointer array
>
> I think we both understand the problem and it is just matter of using
> different words.  I will go with your suggestion and will try to
> slightly adjust the README as well so that both places use same
> terminology.

Yes, I think we're on the same page.

> Right, but we don't need that guarantee (there is no pending scan that
> has seen the flag after it is cleared) to clear the flags.  It was
> written in one of the previous patches where I was exploring the idea
> of using cleanup lock to clear the flags and then don't use the same
> during vacuum.  However, there were some problems in that design and I
> have changed the code, but forgot to update the comment.

OK, got it, thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-11-09 Thread Amit Kapila
On Wed, Nov 9, 2016 at 9:10 PM, Robert Haas  wrote:
> On Wed, Nov 9, 2016 at 9:04 AM, Amit Kapila  wrote:
>
> I think we can give a brief explanation right in the code comment.  I
> referred to "decreasing the TIDs"; you are referring to "moving tuples
> around".  But I think that moving the tuples to different locations is
> not the problem.  I think the problem is that a tuple might be
> assigned a lower spot in the item pointer array
>

I think we both understand the problem; it is just a matter of using
different words.  I will go with your suggestion and will try to
slightly adjust the README as well, so that both places use the same
terminology.


>>> +/*
>>> + * Acquiring cleanup lock to clear the split-in-progress flag ensures that
>>> + * there is no pending scan that has seen the flag after it is cleared.
>>> + */
>>> +_hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
>>> +opage = BufferGetPage(bucket_obuf);
>>> +oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
>>>
>>> I don't understand the comment, because the code *isn't* acquiring a
>>> cleanup lock.
>>
>> Oops, this is a remnant from one of the design approaches for clearing
>> these flags, which was later discarded due to issues. I will change this
>> to indicate an exclusive lock.
>
> Of course, an exclusive lock doesn't guarantee anything like that...
>

Right, but we don't need that guarantee (that there is no pending scan
that has seen the flag after it is cleared) in order to clear the
flags.  The comment was written for one of the previous patches, where
I was exploring the idea of using a cleanup lock to clear the flags and
then not using one during vacuum.  However, there were some problems
with that design; I changed the code but forgot to update the comment.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-11-09 Thread Robert Haas
On Wed, Nov 9, 2016 at 9:04 AM, Amit Kapila  wrote:
> + * This function expects that the caller has acquired a cleanup lock on the
> + * primary bucket page, and will with a write lock again held on the primary
> + * bucket page.  The lock won't necessarily be held continuously, though,
> + * because we'll release it when visiting overflow pages.
>
> Looks like typo in above comment.   /will with a write lock/will
> return with a write lock

Oh, yes.  Thanks.

>> + * During scan of overflow pages, first we need to lock the next bucket and
>> + * then release the lock on current bucket.  This ensures that any concurrent
>> + * scan started after we start cleaning the bucket will always be behind the
>> + * cleanup.  Allowing scans to cross vacuum will allow it to remove tuples
>> + * required for sanctity of scan.
>>
>> This comment says that it's bad if other scans can pass our cleanup
>> scan, but it doesn't explain why.  I think it's because we don't have
>> page-at-a-time mode yet,
>>
>
> Right.
>
>> and cleanup might decrease the TIDs for
>> existing index entries.
>>
>
> I think the reason is that cleanup might move tuples around during
> which it might move previously returned TID to a position earlier than
> its current position.  This is a problem because it restarts the scan
> from previously returned offset and try to find previously returned
> tuples TID.  This has been mentioned in README as below:
>
> + It is must to
> +keep scans behind cleanup, else vacuum could remove tuples that are required
> +to complete the scan as the scan that returns multiple tuples from the same
> +bucket page always restart the scan from the previous offset number from which
> +it has returned last tuple.
>
> We might want to slightly improve the README so that the reason is
> more clear and then mention in comments to refer README, but I am open
> either way, let me know which way you prefer?

I think we can give a brief explanation right in the code comment.  I
referred to "decreasing the TIDs"; you are referring to "moving tuples
around".  But I think that moving the tuples to different locations is
not the problem.  I think the problem is that a tuple might be
assigned a lower spot in the item pointer array - i.e. the TID
decreases.
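
To make the hazard concrete (an invented example, not code from the
patch):

    /*
     * Suppose the scan has just returned the tuple at offset 7 of some
     * bucket page and remembers that offset.  Cleanup then removes
     * tuples earlier on the page, compacting the line-pointer array so
     * that the not-yet-returned tuple formerly at offset 9 now sits at
     * offset 5.  When the scan resumes, it continues from offset 7
     * onward and never visits offset 5 -- that tuple is silently
     * skipped.
     */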

>> OK, a couple things here.  First, it seems like we could also delete
>> any tuples where ItemIdIsDead, and that seems worth doing.
>
> I think we can't do that because here we want to strictly rely on
> callback function for vacuum similar to btree. The reason is explained
> as below comment in function btvacuumpage().

OK, I see.  It would probably be good to comment this, then, so that
someone later doesn't get confused as I did.

> This looks okay to me. So if you agree with my reasoning for not
> including first part, then I can take that out and keep this part in
> next patch.

Cool.

>>  I think that might be
>> clearer.  When LH_BEING_POPULATED is set, the bucket is being filled -
>> that is, populated - from the old bucket.
>
> How about LH_BUCKET_BEING_POPULATED or may LH_BP_BEING_SPLIT where BP
> indicates Bucket page?

LH_BUCKET_BEING_POPULATED seems good to me.

>>  And maybe
>> LH_BUCKET_PAGE_HAS_GARBAGE -> LH_NEEDS_SPLIT_CLEANUP, too.
>>
>
> How about LH_BUCKET_NEEDS_SPLIT_CLEANUP or LH_BP_NEEDS_SPLIT_CLEANUP?
> I am slightly inclined to keep Bucket word, but let me know if you
> think it will make the length longer.

LH_BUCKET_NEEDS_SPLIT_CLEANUP seems good to me.

>>  How?  Can we just use an
>> if-then instead of a for-loop?
>
> I could see below two possibilities:
> First way -
>
> retry:
> mask = lowmask + 1;
> new_bucket = old_bucket | mask;
> if (new_bucket > maxbucket)
> {
> lowmask = lowmask >> 1;
> goto retry;
> }
>
> Second way -
> new_bucket = CALC_NEW_BUCKET(old_bucket,lowmask);
> if (new_bucket > maxbucket)
> {
> lowmask = lowmask >> 1;
> new_bucket = CALC_NEW_BUCKET(old_bucket, lowmask);
> }
>
> #define CALC_NEW_BUCKET(old_bucket, lowmask) \
> ((old_bucket) | ((lowmask) + 1))
>
> Do you have something else in mind?

Second one would be my preference.

>> I still don't like the names of these functions very much.  If you
>> said "get X from Y", it would be clear that you put in Y and you get
>> out X.  If you say "X 2 Y", it would be clear that you put in X and
>> you get out Y.  As it is, it's not very clear which is the input and
>> which is the output.
>
> Whatever exists earlier is input and the later one is output. For
> example in existing function _hash_get_indextuple_hashkey().  However,
> feel free to suggest better names here.  How about
> _hash_get_oldbucket2newblock() or _hash_get_newblock_from_oldbucket()
> or simply _hash_get_newblock()?

The problem with _hash_get_newblock() is that it sounds like you are
getting a new block in the relation, not the new bucket (or
corresponding block) for some old bucket.  The name isn't specific
enough to know what "new" means.

In general, I think "new" and "old" are not very good terminology

Re: [HACKERS] Hash Indexes

2016-11-09 Thread Amit Kapila
On Wed, Nov 9, 2016 at 1:23 AM, Robert Haas  wrote:
> On Mon, Nov 7, 2016 at 9:51 PM, Amit Kapila  wrote:
>> [ new patches ]
>
> Attached is yet another incremental patch with some suggested changes.
>
> + * This expects that the caller has acquired a cleanup lock on the target
> + * bucket (primary page of a bucket) and it is reponsibility of caller to
> + * release that lock.
>
> This is confusing, because it makes it sound like we retain the lock
> through the entire execution of the function, which isn't always true.
> I would say that caller must acquire a cleanup lock on the target
> primary bucket page before calling this function, and that on return
> that page will again be write-locked.  However, the lock might be
> temporarily released in the meantime, while visiting overflow pages.
> (Attached patch has a suggested rewrite.)
>

+ * This function expects that the caller has acquired a cleanup lock on the
+ * primary bucket page, and will with a write lock again held on the primary
+ * bucket page.  The lock won't necessarily be held continuously, though,
+ * because we'll release it when visiting overflow pages.

Looks like typo in above comment.   /will with a write lock/will
return with a write lock


> + * During scan of overflow pages, first we need to lock the next bucket and
> + * then release the lock on current bucket.  This ensures that any concurrent
> + * scan started after we start cleaning the bucket will always be behind the
> + * cleanup.  Allowing scans to cross vacuum will allow it to remove tuples
> + * required for sanctity of scan.
>
> This comment says that it's bad if other scans can pass our cleanup
> scan, but it doesn't explain why.  I think it's because we don't have
> page-at-a-time mode yet,
>

Right.

> and cleanup might decrease the TIDs for
> existing index entries.
>

I think the reason is that cleanup might move tuples around, and in
doing so it might move a previously returned TID to a position earlier
than its current one.  This is a problem because the scan restarts from
the previously returned offset and tries to find the previously
returned tuple's TID.  This is mentioned in the README as below:

+ It is must to
+keep scans behind cleanup, else vacuum could remove tuples that are required
+to complete the scan as the scan that returns multiple tuples from the same
+bucket page always restart the scan from the previous offset number from which
+it has returned last tuple.

We might want to slightly improve the README so that the reason is
clearer, and then have the comments refer to the README.  But I am open
either way -- let me know which way you prefer.

>
> +if (delay)
> +vacuum_delay_point();
>
> You don't really need "delay".  If we're not in a cost-accounted
> VACUUM, vacuum_delay_point() just turns into CHECK_FOR_INTERRUPTS(),
> which should be safe (and a good idea) regardless.  (Fixed in
> attached.)
>

Okay, that makes sense.

> +if (callback && callback(htup, callback_state))
> +{
> +/* mark the item for deletion */
> +deletable[ndeletable++] = offno;
> +if (tuples_removed)
> +*tuples_removed += 1;
> +}
> +else if (bucket_has_garbage)
> +{
> +/* delete the tuples that are moved by split. */
> +bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
> +  maxbucket,
> +  highmask,
> +  lowmask);
> +/* mark the item for deletion */
> +if (bucket != cur_bucket)
> +{
> +/*
> + * We expect tuples to either belong to curent bucket or
> + * new_bucket.  This is ensured because we don't allow
> + * further splits from bucket that contains garbage. See
> + * comments in _hash_expandtable.
> + */
> +Assert(bucket == new_bucket);
> +deletable[ndeletable++] = offno;
> +}
> +else if (num_index_tuples)
> +*num_index_tuples += 1;
> +}
> +else if (num_index_tuples)
> +*num_index_tuples += 1;
> +}
>
> OK, a couple things here.  First, it seems like we could also delete
> any tuples where ItemIdIsDead, and that seems worth doing.

I think we can't do that, because here we want to rely strictly on the
callback function for vacuum, similar to btree.  The reason is
explained by the following comment in the function btvacuumpage().

/*
* During Hot Standby we currently assume that
* XLOG_BTREE_VACUUM records do not produce conflicts. That is
* only true as long as the callback function depends only
* upon whether the index tuple refers to heap tuples removed
* in the initial heap scan.

Re: [HACKERS] Hash Indexes

2016-11-08 Thread Robert Haas
On Mon, Nov 7, 2016 at 9:51 PM, Amit Kapila  wrote:
> [ new patches ]

Attached is yet another incremental patch with some suggested changes.

+ * This expects that the caller has acquired a cleanup lock on the target
+ * bucket (primary page of a bucket) and it is reponsibility of caller to
+ * release that lock.

This is confusing, because it makes it sound like we retain the lock
through the entire execution of the function, which isn't always true.
I would say that caller must acquire a cleanup lock on the target
primary bucket page before calling this function, and that on return
that page will again be write-locked.  However, the lock might be
temporarily released in the meantime, while visiting overflow pages.
(Attached patch has a suggested rewrite.)

+ * During scan of overflow pages, first we need to lock the next bucket and
+ * then release the lock on current bucket.  This ensures that any concurrent
+ * scan started after we start cleaning the bucket will always be behind the
+ * cleanup.  Allowing scans to cross vacuum will allow it to remove tuples
+ * required for sanctity of scan.

This comment says that it's bad if other scans can pass our cleanup
scan, but it doesn't explain why.  I think it's because we don't have
page-at-a-time mode yet, and cleanup might decrease the TIDs for
existing index entries.  (Attached patch has a suggested rewrite, but
might need further adjustment if my understanding of the reasons is
incomplete.)

+if (delay)
+vacuum_delay_point();

You don't really need "delay".  If we're not in a cost-accounted
VACUUM, vacuum_delay_point() just turns into CHECK_FOR_INTERRUPTS(),
which should be safe (and a good idea) regardless.  (Fixed in
attached.)

+if (callback && callback(htup, callback_state))
+{
+/* mark the item for deletion */
+deletable[ndeletable++] = offno;
+if (tuples_removed)
+*tuples_removed += 1;
+}
+else if (bucket_has_garbage)
+{
+/* delete the tuples that are moved by split. */
+bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
+  maxbucket,
+  highmask,
+  lowmask);
+/* mark the item for deletion */
+if (bucket != cur_bucket)
+{
+/*
+ * We expect tuples to either belong to curent bucket or
+ * new_bucket.  This is ensured because we don't allow
+ * further splits from bucket that contains garbage. See
+ * comments in _hash_expandtable.
+ */
+Assert(bucket == new_bucket);
+deletable[ndeletable++] = offno;
+}
+else if (num_index_tuples)
+*num_index_tuples += 1;
+}
+else if (num_index_tuples)
+*num_index_tuples += 1;
+}

OK, a couple things here.  First, it seems like we could also delete
any tuples where ItemIdIsDead, and that seems worth doing. In fact, I
think we should check it prior to invoking the callback, because it's
probably quite substantially cheaper than the callback.  Second,
repeating deletable[ndeletable++] = offno and *num_index_tuples += 1
doesn't seem very clean to me; I think we should introduce a new bool
to track whether we're keeping the tuple or killing it, and then use
that to drive which of those things we do.  (Fixed in attached.)
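
A simplified sketch of that restructuring (variable names as in the
quoted code, with itemid standing for the line pointer at offno; the
tuples_removed/Assert bookkeeping is omitted):

    bool    kill_tuple = false;

    if (ItemIdIsDead(itemid))
        kill_tuple = true;          /* cheap check, done before the callback */
    else if (callback && callback(htup, callback_state))
        kill_tuple = true;          /* heap tuple is dead */
    else if (bucket_has_garbage &&
             _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
                                  maxbucket, highmask, lowmask) != cur_bucket)
        kill_tuple = true;          /* tuple moved by a completed split */

    if (kill_tuple)
        deletable[ndeletable++] = offno;
    else if (num_index_tuples)
        *num_index_tuples += 1;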

+if (H_HAS_GARBAGE(bucket_opaque) &&
+!H_INCOMPLETE_SPLIT(bucket_opaque))

This is the only place in the entire patch that use
H_INCOMPLETE_SPLIT(), and I'm wondering if that's really correct even
here. Don't you really want H_OLD_INCOMPLETE_SPLIT() here?  (And
couldn't we then remove H_INCOMPLETE_SPLIT() itself?) There's no
garbage to be removed from the "new" bucket until the next split, when
it will take on the role of the "old" bucket.

I think it would be a good idea to change this so that
LH_BUCKET_PAGE_HAS_GARBAGE doesn't get set until
LH_BUCKET_OLD_PAGE_SPLIT is cleared.  The current way is confusing,
because those tuples are NOT garbage until the split is completed!
Moreover, both of the places that care about
LH_BUCKET_PAGE_HAS_GARBAGE need to make sure that
LH_BUCKET_OLD_PAGE_SPLIT is clear before they do anything about
LH_BUCKET_PAGE_HAS_GARBAGE, so the change I am proposing would
actually simplify the code very slightly.

+#define H_OLD_INCOMPLETE_SPLIT(opaque)  ((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT)
+#define H_NEW_INCOMPLETE_SPLIT(opaque)  ((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)

The code isn't consistent about the use of these macros, or at least
not in a good way.  When you care about LH_BUCKET_OLD_PAGE_SPLIT, you
test i

Re: [HACKERS] Hash Indexes

2016-11-08 Thread Robert Haas
On Mon, Nov 7, 2016 at 9:51 PM, Amit Kapila  wrote:
>> Both the places _hash_squeezebucket() and  _hash_splitbucket can use
>> this optimization irrespective of rest of the patch.  I will prepare a
>> separate patch for these and send along with main patch after some
>> testing.
>
> Done as a separate patch skip_dead_tups_hash_index-v1.patch.

Thanks.  Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-11-04 Thread Amit Kapila
On Fri, Nov 4, 2016 at 9:37 PM, Robert Haas  wrote:
> On Tue, Nov 1, 2016 at 9:09 PM, Robert Haas  wrote:
>> On Mon, Oct 24, 2016 at 10:30 AM, Amit Kapila  
>> wrote:
>>> [ new patches ]
>>
>> I looked over parts of this today, mostly the hashinsert.c changes.
>
> Some more review...
>
> @@ -656,6 +678,10 @@ _hash_squeezebucket(Relation rel,
>  IndexTuple  itup;
>  Sizeitemsz;
>
> +/* skip dead tuples */
> +if (ItemIdIsDead(PageGetItemId(rpage, roffnum)))
> +continue;
>
> Is this an optimization independent of the rest of the patch, or is
> there something in this patch that necessitates it?
>

This specific case is independent of the rest of the patch, but the
same optimization is used in the function _hash_splitbucket_guts(),
where it is mandatory, because otherwise we would make a copy of the
tuple without copying its dead flag.

>  i.e. Could this
> small change be committed independently?

Both places, _hash_squeezebucket() and _hash_splitbucket(), can use
this optimization irrespective of the rest of the patch.  I will
prepare a separate patch for these and send it along with the main
patch after some testing.

>  If not, then I think it
> needs a better comment explaining why it is now mandatory.
>
> - *  Caller must hold exclusive lock on the target bucket.  This allows
> + *  Caller must hold cleanup lock on the target bucket.  This allows
>   *  us to safely lock multiple pages in the bucket.
>
> The notion of a lock on a bucket no longer really exists; with this
> patch, we'll now properly speak of a lock on a primary bucket page.
> Also, I think the bit about safely locking multiple pages is bizarre
> -- that's not the issue at all: the problem is that reorganizing a
> bucket might confuse concurrent scans into returning wrong answers.
>
> I've included a broader updating of that comment, and some other
> comment changes, in the attached incremental patch, which also
> refactors your changes to _hash_freeovflpage() a bit to avoid some
> code duplication.  Please consider this for inclusion in your next
> version.
>

Your modifications look good to me, so I will include them in the next version.

> In hashutil.c, I think that _hash_msb() is just a reimplementation of
> fls(), which you can rely on being present because we have our own
> implementation in src/port.  It's quite similar to yours but slightly
> shorter.  :-)   Also, some systems have a builtin fls() function which
> actually optimizes down to a single machine instruction, and which is
> therefore much faster than either version.
>

Agreed, will change as per suggestion.

> I don't like the fact that _hash_get_newblk() and _hash_get_oldblk()
> are working out the bucket number by using the HashOpaque structure
> within the bucket page they're examining.  First, it seems weird to
> pass the whole structure when you only need the bucket number out of
> it.  More importantly, the caller really ought to know what bucket
> they care about without having to discover it.
>
> For example, in _hash_doinsert(), we figure out the bucket into which
> we need to insert, and we store that in a variable called "bucket".
> Then from there we work out the primary bucket page's block number,
> which we store in "blkno".  We read the page into "buf" and put a
> pointer to that buffer's contents into "page" from which we discover
> the HashOpaque, a pointer to which we store into "pageopaque".  Then
> we pass that to _hash_get_newblk() which will go look into that
> structure to find the bucket number ... but couldn't we have just
> passed "bucket" instead?
>

Yes, it can be done.  However, note that pageopaque is not retrieved
only for passing to _hash_get_newblk(); it is also used in the code
below that, so we can't remove it.

>  Similarly, _hash_expandtable() has the value
> available in the variable "old_bucket".
>
> The only caller of _hash_get_oldblk() is _hash_first(), which has the
> bucket number available in a variable called "bucket".
>
> So it seems to me that these functions could be simplified to take the
> bucket number as an argument directly instead of the HashOpaque.
>

Okay, I agree that it is better to use the bucket number in both APIs,
so I will change them accordingly.

> Generally, this pattern recurs throughout the patch.  Every time you
> use the data in the page to figure something out which the caller
> already knew, you're introducing a risk of bugs: what if the answers
> don't match?   I think you should try to root out as much of that from
> this code as you can.
>

Okay, I will review the patch from that angle and see if I can improve it.

> As you may be able to tell, I'm working my way into this patch
> gradually, starting with peripheral parts and working toward the core
> of it.  Generally, I think it's in pretty good shape, but I still have
> quite a bit left to study.
>

Thanks.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Hash Indexes

2016-11-04 Thread Amit Kapila
On Fri, Nov 4, 2016 at 6:37 PM, Robert Haas  wrote:
> On Thu, Nov 3, 2016 at 6:25 AM, Amit Kapila  wrote:
>>> +nblkno = _hash_get_newblk(rel, pageopaque);
>>>
>>> I think this is not a great name for this function.  It's not clear
>>> what "new blocks" refers to, exactly.  I suggest
>>> FIND_SPLIT_BUCKET(metap, bucket) or OLD_BUCKET_TO_NEW_BUCKET(metap,
>>> bucket) returning a new bucket number.  I think that macro can be
>>> defined as something like this: bucket + (1 <<
>>> (fls(metap->hashm_maxbucket) - 1)).
>>>
>>
>> I think such a macro would not work for the usage of incomplete
>> splits.  The reason is that by the time we try to complete the split
>> of the current old bucket, the table half (lowmask, highmask,
>> maxbucket) would have changed and it could give you the bucket in new
>> table half.
>
> Can you provide an example of the scenario you are talking about here?
>

Consider the case below:

First half of table
0 1 2 3
Second half of table
4 5 6 7

Now suppose that while the split of bucket 2 (whose corresponding new
bucket will be 6) is in progress, the system crashes, and after restart
it splits bucket 3 (corresponding new bucket 7).  After that, it will
try to form a new table half with buckets ranging from 8, 9, ... 15.
Assume it creates bucket 8 by splitting bucket 0, and that it next
tries to split bucket 2: it will find an incomplete split and will
attempt to finish it.  At that time, if it calculates the new bucket
from the old bucket (2), it will get 10 (metap->hashm_maxbucket will be
8 for the third table half, and with the above macro it works out to
10), whereas we need 6.  That is why there is a check (if (new_bucket >
metap->hashm_maxbucket)) in _hash_get_newblk() that ensures it returns
the bucket number from the previous half.  The basic idea is that while
there is an incomplete split from the current bucket, no new split can
be started from that bucket, so the check in _hash_get_newblk() gives
us the correct value.

I can try to explain this again if the above is not clear enough.
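
To spell out the arithmetic (same numbers as the scenario above, with
the lowmask values the hash code would have for each table half):

    /*
     * Third half (buckets 8..15) in progress: maxbucket = 8, lowmask = 7.
     *
     *     new_bucket = 2 | (7 + 1) = 10;   -- 10 > maxbucket (8), so retry
     *     lowmask >>= 1;                   -- previous half: lowmask = 3
     *     new_bucket = 2 | (3 + 1) = 6;    -- 6 <= 8: the bucket the
     *                                      -- incomplete split had targeted
     */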

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-11-04 Thread Robert Haas
On Tue, Nov 1, 2016 at 9:09 PM, Robert Haas  wrote:
> On Mon, Oct 24, 2016 at 10:30 AM, Amit Kapila  wrote:
>> [ new patches ]
>
> I looked over parts of this today, mostly the hashinsert.c changes.

Some more review...

@@ -656,6 +678,10 @@ _hash_squeezebucket(Relation rel,
 IndexTuple  itup;
 Sizeitemsz;

+/* skip dead tuples */
+if (ItemIdIsDead(PageGetItemId(rpage, roffnum)))
+continue;

Is this an optimization independent of the rest of the patch, or is
there something in this patch that necessitates it?  i.e. Could this
small change be committed independently?  If not, then I think it
needs a better comment explaining why it is now mandatory.

- *  Caller must hold exclusive lock on the target bucket.  This allows
+ *  Caller must hold cleanup lock on the target bucket.  This allows
  *  us to safely lock multiple pages in the bucket.

The notion of a lock on a bucket no longer really exists; with this
patch, we'll now properly speak of a lock on a primary bucket page.
Also, I think the bit about safely locking multiple pages is bizarre
-- that's not the issue at all: the problem is that reorganizing a
bucket might confuse concurrent scans into returning wrong answers.

I've included a broader updating of that comment, and some other
comment changes, in the attached incremental patch, which also
refactors your changes to _hash_freeovflpage() a bit to avoid some
code duplication.  Please consider this for inclusion in your next
version.

In hashutil.c, I think that _hash_msb() is just a reimplementation of
fls(), which you can rely on being present because we have our own
implementation in src/port.  It's quite similar to yours but slightly
shorter.  :-)   Also, some systems have a builtin fls() function which
actually optimizes down to a single machine instruction, and which is
therefore much faster than either version.
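(For reference, a quick sketch of the fls() contract; this is an
illustration, not PostgreSQL's actual src/port implementation:

static int
fls_sketch(unsigned int mask)
{
    int     bit = 0;

    while (mask != 0)
    {
        mask >>= 1;     /* shift until no bits remain */
        bit++;          /* count the shifts */
    }
    return bit;         /* fls(0) == 0, fls(1) == 1, fls(8) == 4 */
}

so fls(x) - 1 is the zero-based position of the highest set bit.)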

I don't like the fact that _hash_get_newblk() and _hash_get_oldblk()
are working out the bucket number by using the HashOpaque structure
within the bucket page they're examining.  First, it seems weird to
pass the whole structure when you only need the bucket number out of
it.  More importantly, the caller really ought to know what bucket
they care about without having to discover it.

For example, in _hash_doinsert(), we figure out the bucket into which
we need to insert, and we store that in a variable called "bucket".
Then from there we work out the primary bucket page's block number,
which we store in "blkno".  We read the page into "buf" and put a
pointer to that buffer's contents into "page" from which we discover
the HashOpaque, a pointer to which we store into "pageopaque".  Then
we pass that to _hash_get_newblk() which will go look into that
structure to find the bucket number ... but couldn't we have just
passed "bucket" instead?  Similarly, _hash_expandtable() has the value
available in the variable "old_bucket".

The only caller of _hash_get_oldblk() is _hash_first(), which has the
bucket number available in a variable called "bucket".

So it seems to me that these functions could be simplified to take the
bucket number as an argument directly instead of the HashOpaque.
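In other words, the simplification would look roughly like this
(hypothetical prototypes for illustration; the exact names and types are
whatever the patch uses):

/* current shape: the callee re-derives the bucket from the page opaque */
BlockNumber _hash_get_newblk(Relation rel, HashPageOpaque opaque);

/* suggested shape: the caller passes the bucket number it already knows */
BlockNumber _hash_get_newblk(Relation rel, Bucket old_bucket);
BlockNumber _hash_get_oldblk(Relation rel, Bucket new_bucket);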

Generally, this pattern recurs throughout the patch.  Every time you
use the data in the page to figure something out which the caller
already knew, you're introducing a risk of bugs: what if the answers
don't match?   I think you should try to root out as much of that from
this code as you can.

As you may be able to tell, I'm working my way into this patch
gradually, starting with peripheral parts and working toward the core
of it.  Generally, I think it's in pretty good shape, but I still have
quite a bit left to study.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


hashovfl-tweaks.patch
Description: application/download



Re: [HACKERS] Hash Indexes

2016-11-04 Thread Robert Haas
On Fri, Oct 28, 2016 at 12:33 AM, Amit Kapila  wrote:
> On Fri, Oct 28, 2016 at 2:52 AM, Robert Haas  wrote:
>> On Mon, Oct 24, 2016 at 10:30 AM, Amit Kapila  
>> wrote:
 Amit, can you please split the buffer manager changes in this patch
 into a separate patch?
>>>
>>> Sure, attached patch extend_bufmgr_api_for_hash_index_v1.patch does that.
>>
>> The additional argument to ConditionalLockBuffer() doesn't seem to be
>> used anywhere in the main patch.  Do we actually need it?
>>
>
> No, with the latest patch for concurrent hash index, we don't need it.  I
> forgot to remove it.  Please find the updated patch attached.  The
> usage of the second parameter for ConditionalLockBuffer() is removed: as we
> don't want to allow I/O across content locks, the patch is changed
> to fall back to locking the metapage twice.

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-11-04 Thread Robert Haas
On Thu, Nov 3, 2016 at 6:25 AM, Amit Kapila  wrote:
>> +nblkno = _hash_get_newblk(rel, pageopaque);
>>
>> I think this is not a great name for this function.  It's not clear
>> what "new blocks" refers to, exactly.  I suggest
>> FIND_SPLIT_BUCKET(metap, bucket) or OLD_BUCKET_TO_NEW_BUCKET(metap,
>> bucket) returning a new bucket number.  I think that macro can be
>> defined as something like this: bucket + (1 <<
>> (fls(metap->hashm_maxbucket) - 1)).
>>
>
> I think such a macro would not work for the usage of incomplete
> splits.  The reason is that by the time we try to complete the split
> of the current old bucket, the table half (lowmask, highmask,
> maxbucket) would have changed and it could give you the bucket in new
> table half.

Can you provide an example of the scenario you are talking about here?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-11-03 Thread Amit Kapila
On Wed, Nov 2, 2016 at 6:39 AM, Robert Haas  wrote:
> On Mon, Oct 24, 2016 at 10:30 AM, Amit Kapila  wrote:
>> [ new patches ]
>
> I looked over parts of this today, mostly the hashinsert.c changes.
>
> +/*
> + * Copy bucket mapping info now;  The comment in _hash_expandtable where
> + * we copy this information and calls _hash_splitbucket explains why this
> + * is OK.
> + */
>
> So, I went and tried to find the comments to which this comment is
> referring and didn't have much luck.
>

I guess you have just tried to find it in the patch.  However, the
comment I am referring to above is an existing comment in
_hash_expandtable().  Refer to this comment:
/*
 * Copy bucket mapping info now; this saves re-accessing the meta page
 * inside _hash_splitbucket's inner loop. ...

> At the point this code is
> running, we have a pin but no lock on the metapage, so this is only
> safe if changing any of these fields requires a cleanup lock on the
> metapage.  If that's true,
>

No, that's not true; we need just an Exclusive content lock to update
those fields, and these fields should be copied while we hold a Share
content lock on the metapage.  In version-8 of the patch it was correct,
but in the last version it seems that during code rearrangement I moved
it.  I will change it so that these values are copied under the metapage
share content lock.  I think moving it just before the preceding for
loop should be okay; let me know if you think otherwise.
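A sketch of the ordering I mean, using the generic buffer-manager calls
for clarity (illustrative only; the field accesses follow the existing
_hash_expandtable() code):

    /*
     * Copy the bucket-mapping fields while holding a Share content lock
     * on the metapage; updating them requires an Exclusive content lock,
     * so they cannot change under us here.
     */
    LockBuffer(metabuf, BUFFER_LOCK_SHARE);
    metap = HashPageGetMeta(BufferGetPage(metabuf));
    maxbucket = metap->hashm_maxbucket;
    highmask = metap->hashm_highmask;
    lowmask = metap->hashm_lowmask;
    LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);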


> +nblkno = _hash_get_newblk(rel, pageopaque);
>
> I think this is not a great name for this function.  It's not clear
> what "new blocks" refers to, exactly.  I suggest
> FIND_SPLIT_BUCKET(metap, bucket) or OLD_BUCKET_TO_NEW_BUCKET(metap,
> bucket) returning a new bucket number.  I think that macro can be
> defined as something like this: bucket + (1 <<
> (fls(metap->hashm_maxbucket) - 1)).
>

I think such a macro would not work for the case of incomplete
splits.  The reason is that by the time we try to complete the split
of the current old bucket, the table half (lowmask, highmask,
maxbucket) would have changed, and it could give you a bucket in the new
table half.

> Then do nblkno =
> BUCKET_TO_BLKNO(metap, newbucket) to get the block number.  That'd all
> be considerably simpler than what you have for hash_get_newblk().
>

I think that to use BUCKET_TO_BLKNO we need either a share or exclusive
lock on the metapage, and since we need a lock on the metapage anyway to
find the new block from the old block, I thought it better to do it
inside _hash_get_newblk().

>
> Moving on ...
>
>          /*
>           * ovfl page exists; go get it.  if it doesn't have room, we'll
> -         * find out next pass through the loop test above.
> +         * find out next pass through the loop test above.  Retain the
> +         * pin, if it is a primary bucket page.
>           */
> -        _hash_relbuf(rel, buf);
> +        if (pageopaque->hasho_flag & LH_BUCKET_PAGE)
> +            _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
> +        else
> +            _hash_relbuf(rel, buf);
>
> It seems like it would be cheaper, safer, and clearer to test whether
> buf != bucket_buf here, rather than examining the page opaque data.
> That's what you do down at the bottom of the function when you ensure
> that the pin on the primary bucket page gets released, and it seems
> like it should work up here, too.
>
> +        bool        retain_pin = false;
> +
> +        /* page flags must be accessed before releasing lock on a page. */
> +        retain_pin = pageopaque->hasho_flag & LH_BUCKET_PAGE;
>
> Similarly.
>

Agreed, will change the usage as per your suggestion.

> I have also attached a patch with some suggested comment changes.
>

I will include it in next version of patch.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-11-01 Thread Robert Haas
On Mon, Oct 24, 2016 at 10:30 AM, Amit Kapila  wrote:
> [ new patches ]

I looked over parts of this today, mostly the hashinsert.c changes.

+/*
+ * Copy bucket mapping info now;  The comment in _hash_expandtable where
+ * we copy this information and calls _hash_splitbucket explains why this
+ * is OK.
+ */

So, I went and tried to find the comments to which this comment is
referring and didn't have much luck.  At the point this code is
running, we have a pin but no lock on the metapage, so this is only
safe if changing any of these fields requires a cleanup lock on the
metapage.  If that's true, it seems like you could just make the
comment say that; if it's false, you've got problems.

This code seems rather pointless anyway, the way it's written.  All of
these local variables are used in exactly one place, which is here:

+        _hash_finish_split(rel, metabuf, buf, nbuf, maxbucket,
+                           highmask, lowmask);

But you hold the same locks at the point where you copy those values
into local variables and the point where that code runs.  So if the
code is safe as written, you could instead just pass
metap->hashm_maxbucket, metap->hashm_highmask, and
metap->hashm_lowmask to that function instead of having these local
variables.  Or, for that matter, you could just let that function read
the data itself: it's got metabuf, after all.

+ * In future, if we want to finish the splits during insertion in new
+ * bucket, we must ensure the locking order such that old bucket is locked
+ * before new bucket.

Not if the locks are conditional anyway.

+nblkno = _hash_get_newblk(rel, pageopaque);

I think this is not a great name for this function.  It's not clear
what "new blocks" refers to, exactly.  I suggest
FIND_SPLIT_BUCKET(metap, bucket) or OLD_BUCKET_TO_NEW_BUCKET(metap,
bucket) returning a new bucket number.  I think that macro can be
defined as something like this: bucket + (1 <<
(fls(metap->hashm_maxbucket) - 1)). Then do nblkno =
BUCKET_TO_BLKNO(metap, newbucket) to get the block number.  That'd all
be considerably simpler than what you have for hash_get_newblk().

Here's some test code I wrote, which seems to work:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <strings.h>

int
newbucket(int bucket, int nbuckets)
{
    assert(bucket < nbuckets);
    return bucket + (1 << (fls(nbuckets) - 1));
}

int
main(int argc, char **argv)
{
    int         nbuckets = 1;
    int         restartat = 1;
    int         splitbucket = 0;

    while (splitbucket < 32)
    {
        printf("old bucket %d splits to new bucket %d\n", splitbucket,
               newbucket(splitbucket, nbuckets));
        if (++splitbucket >= restartat)
        {
            splitbucket = 0;
            restartat *= 2;
        }
        ++nbuckets;
    }

    exit(0);
}

Moving on ...

         /*
          * ovfl page exists; go get it.  if it doesn't have room, we'll
-         * find out next pass through the loop test above.
+         * find out next pass through the loop test above.  Retain the
+         * pin, if it is a primary bucket page.
          */
-        _hash_relbuf(rel, buf);
+        if (pageopaque->hasho_flag & LH_BUCKET_PAGE)
+            _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+        else
+            _hash_relbuf(rel, buf);

It seems like it would be cheaper, safer, and clearer to test whether
buf != bucket_buf here, rather than examining the page opaque data.
That's what you do down at the bottom of the function when you ensure
that the pin on the primary bucket page gets released, and it seems
like it should work up here, too.

+        bool        retain_pin = false;
+
+        /* page flags must be accessed before releasing lock on a page. */
+        retain_pin = pageopaque->hasho_flag & LH_BUCKET_PAGE;

Similarly.
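A sketch of the suggested test, assuming (as at the bottom of the
function) that bucket_buf holds the pinned primary bucket page;
illustrative only:

    if (buf != bucket_buf)
        _hash_relbuf(rel, buf);     /* overflow page: release lock and pin */
    else
        _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);  /* keep pin */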

I have also attached a patch with some suggested comment changes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


hashinsert-comments.patch
Description: application/download



Re: [HACKERS] Hash Indexes

2016-10-27 Thread Amit Kapila
On Fri, Oct 28, 2016 at 2:52 AM, Robert Haas  wrote:
> On Mon, Oct 24, 2016 at 10:30 AM, Amit Kapila  wrote:
>>> Amit, can you please split the buffer manager changes in this patch
>>> into a separate patch?
>>
>> Sure, attached patch extend_bufmgr_api_for_hash_index_v1.patch does that.
>
> The additional argument to ConditionalLockBuffer() doesn't seem to be
> used anywhere in the main patch.  Do we actually need it?
>

No, with the latest patch for concurrent hash index, we don't need it.  I
forgot to remove it.  Please find the updated patch attached.  The
usage of the second parameter for ConditionalLockBuffer() is removed: as we
don't want to allow I/O across content locks, the patch is changed
to fall back to locking the metapage twice.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


extend_bufmgr_api_for_hash_index_v2.patch
Description: Binary data



Re: [HACKERS] Hash Indexes

2016-10-27 Thread Robert Haas
On Mon, Oct 24, 2016 at 10:30 AM, Amit Kapila  wrote:
>> Amit, can you please split the buffer manager changes in this patch
>> into a separate patch?
>
> Sure, attached patch extend_bufmgr_api_for_hash_index_v1.patch does that.

The additional argument to ConditionalLockBuffer() doesn't seem to be
used anywhere in the main patch.  Do we actually need it?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-10-24 Thread Amit Kapila
On Mon, Oct 24, 2016 at 8:00 PM, Amit Kapila  wrote:
> On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas  wrote:
>> On Wed, Sep 28, 2016 at 3:04 PM, Robert Haas  wrote:
>
>
> Thanks for the valuable feedback.
>

Forgot to mention that, in addition to fixing the review comments, I
have made an additional change to skip dead tuples while copying
tuples from the old bucket to the new bucket during a split.  This was
previously not possible because split and scan were blocking
operations (a split used to take an Exclusive lock on the bucket, and a
scan used to hold a Share lock on the bucket until the operation ended),
but now it is possible, and during a scan some of the tuples can be
marked as dead.  Similarly, during the squeeze operation, we now skip
dead tuples while moving tuples across buckets.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-10-20 Thread Robert Haas
On Tue, Oct 18, 2016 at 8:27 PM, Amit Kapila  wrote:
> By this problem, I mean to say deadlocks for suspended scans, that can
> happen in btree for non-MVCC or other types of scans where we don't
> release pin during scan.  In my mind, we have below options:
>
> a. problem of deadlocks for suspended scans should be tackled as a
> separate patch as it exists for other indexes (at least for some type
> of scans).
> b. Implement page-scan mode and then we won't have deadlock problem
> for MVCC scans.
> c. Let's not care for non-MVCC scans unless we have some way to hit
> those for hash indexes and proceed with Dead tuple marking idea.  I
> think even if we don't care for non-MVCC scans, we might hit this
> problem (deadlocks) when the index relation is unlogged.
>
> Here, even if we want to go with (b), I think we can handle it in a
> separate patch, unless you think otherwise.

After some off-list discussion with Amit, I think I get his point
here: the deadlock hazard which is introduced by this patch already
exists for btree and has for a long time, and nobody's gotten around
to fixing it (although 2ed5b87f96d473962ec5230fd820abfeaccb2069
improved things).  So it's probably OK for hash indexes to have the
same issue.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-10-18 Thread Amit Kapila
On Wed, Oct 19, 2016 at 5:57 AM, Amit Kapila  wrote:
> On Tue, Oct 18, 2016 at 10:52 PM, Robert Haas  wrote:
>> On Tue, Oct 18, 2016 at 5:37 AM, Amit Kapila  wrote:
>>> On Wed, Oct 5, 2016 at 10:22 PM, Robert Haas  wrote:
 On Tue, Oct 4, 2016 at 12:36 AM, Amit Kapila  
 wrote:
> I think one way to avoid the risk of deadlock in above scenario is to
> take the cleanup lock conditionally, if we get the cleanup lock then
> we will delete the items as we are doing in patch now, else it will
> just mark the tuples as dead and ensure that it won't try to remove
> tuples that are moved-by-split.  Now, I think the question is how will
> these dead tuples be removed.  We anyway need a separate mechanism to
> clear dead tuples for hash indexes as during scans we are marking the
> tuples as dead if corresponding tuple in heap is dead which are not
> removed later.  This is already taken care in btree code via
> kill_prior_tuple optimization.  So I think clearing of dead tuples can
> be handled by a separate patch.

 That seems like it could work.
>>>
>>> I have implemented this idea and it works for MVCC scans.  However, I
>>> think this might not work for non-MVCC scans.  Consider a case where
>>> in Process-1, hash scan has returned one row and before it could check
>>> its validity in heap, vacuum marks that tuple as dead and removed the
>>> entry from heap and some new tuple has been placed at that offset in
>>> heap.
>>
>> Oops, that's bad.
>>
>>> Now when Process-1 checks the validity in heap, it will check
>>> for a different tuple than what the index tuple was supposed to check.
>>> If we want, we can make it work similar to what btree does as being
>>> discussed on thread [1], but for that we need to introduce page-scan
>>> mode as well in hash indexes.   However, do we really want to solve
>>> this problem as part of this patch when this exists for other index am
>>> as well?
>>
>> For what other index AM does this problem exist?
>>
>
> By this problem, I mean to say deadlocks for suspended scans, that can
> happen in btree for non-MVCC or other types of scans where we don't
> release pin during scan.  In my mind, we have below options:
>
> a. problem of deadlocks for suspended scans should be tackled as a
> separate patch as it exists for other indexes (at least for some type
> of scans).
> b. Implement page-scan mode and then we won't have deadlock problem
> for MVCC scans.
> c. Let's not care for non-MVCC scans unless we have some way to hit
> those for hash indexes and proceed with Dead tuple marking idea.  I
> think even if we don't care for non-MVCC scans, we might hit this
> problem (deadlocks) when the index relation is unlogged.
>

oops, my last sentence is wrong. What I wanted to say is: "I think
even if we don't care for non-MVCC scans, we might hit the problem of
TID reuse when the index relation is unlogged."

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-10-18 Thread Amit Kapila
On Tue, Oct 18, 2016 at 10:52 PM, Robert Haas  wrote:
> On Tue, Oct 18, 2016 at 5:37 AM, Amit Kapila  wrote:
>> On Wed, Oct 5, 2016 at 10:22 PM, Robert Haas  wrote:
>>> On Tue, Oct 4, 2016 at 12:36 AM, Amit Kapila  
>>> wrote:
 I think one way to avoid the risk of deadlock in above scenario is to
 take the cleanup lock conditionally, if we get the cleanup lock then
 we will delete the items as we are doing in patch now, else it will
 just mark the tuples as dead and ensure that it won't try to remove
 tuples that are moved-by-split.  Now, I think the question is how will
 these dead tuples be removed.  We anyway need a separate mechanism to
 clear dead tuples for hash indexes as during scans we are marking the
 tuples as dead if corresponding tuple in heap is dead which are not
 removed later.  This is already taken care in btree code via
 kill_prior_tuple optimization.  So I think clearing of dead tuples can
 be handled by a separate patch.
>>>
>>> That seems like it could work.
>>
>> I have implemented this idea and it works for MVCC scans.  However, I
>> think this might not work for non-MVCC scans.  Consider a case where
>> in Process-1, hash scan has returned one row and before it could check
>> its validity in heap, vacuum marks that tuple as dead and removed the
>> entry from heap and some new tuple has been placed at that offset in
>> heap.
>
> Oops, that's bad.
>
>> Now when Process-1 checks the validity in heap, it will check
>> for a different tuple than what the index tuple was supposed to check.
>> If we want, we can make it work similar to what btree does as being
>> discussed on thread [1], but for that we need to introduce page-scan
>> mode as well in hash indexes.   However, do we really want to solve
>> this problem as part of this patch when this exists for other index am
>> as well?
>
> For what other index AM does this problem exist?
>

By this problem, I mean deadlocks for suspended scans, which can
happen in btree for non-MVCC or other types of scans where we don't
release the pin during the scan.  In my mind, we have the options below:

a. The problem of deadlocks for suspended scans should be tackled as a
separate patch, as it exists for other indexes (at least for some types
of scans).
b. Implement page-scan mode, and then we won't have the deadlock problem
for MVCC scans.
c. Let's not care about non-MVCC scans unless we have some way to hit
those for hash indexes, and proceed with the dead-tuple marking idea.  I
think even if we don't care about non-MVCC scans, we might hit this
problem (deadlocks) when the index relation is unlogged.

Here, even if we want to go with (b), I think we can handle it in a
separate patch, unless you think otherwise.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-10-18 Thread Andres Freund
On 2016-10-18 13:38:14 -0400, Tom Lane wrote:
> Robert Haas  writes:
> > On Tue, Oct 18, 2016 at 5:37 AM, Amit Kapila  
> > wrote:
> >> I have implemented this idea and it works for MVCC scans.  However, I
> >> think this might not work for non-MVCC scans.  Consider a case where
> >> in Process-1, hash scan has returned one row and before it could check
> >> its validity in heap, vacuum marks that tuple as dead and removed the
> >> entry from heap and some new tuple has been placed at that offset in
> >> heap.
> 
> > Oops, that's bad.
> 
> Do we care?  Under what circumstances would a hash index be used for a
> non-MVCC scan?

Uniqueness checks are the most important one that comes to mind.

Andres




Re: [HACKERS] Hash Indexes

2016-10-18 Thread Tom Lane
Robert Haas  writes:
> On Tue, Oct 18, 2016 at 5:37 AM, Amit Kapila  wrote:
>> I have implemented this idea and it works for MVCC scans.  However, I
>> think this might not work for non-MVCC scans.  Consider a case where
>> in Process-1, hash scan has returned one row and before it could check
>> its validity in heap, vacuum marks that tuple as dead and removed the
>> entry from heap and some new tuple has been placed at that offset in
>> heap.

> Oops, that's bad.

Do we care?  Under what circumstances would a hash index be used for a
non-MVCC scan?

regards, tom lane




Re: [HACKERS] Hash Indexes

2016-10-18 Thread Robert Haas
On Tue, Oct 18, 2016 at 5:37 AM, Amit Kapila  wrote:
> On Wed, Oct 5, 2016 at 10:22 PM, Robert Haas  wrote:
>> On Tue, Oct 4, 2016 at 12:36 AM, Amit Kapila  wrote:
>>> I think one way to avoid the risk of deadlock in above scenario is to
>>> take the cleanup lock conditionally, if we get the cleanup lock then
>>> we will delete the items as we are doing in patch now, else it will
>>> just mark the tuples as dead and ensure that it won't try to remove
>>> tuples that are moved-by-split.  Now, I think the question is how will
>>> these dead tuples be removed.  We anyway need a separate mechanism to
>>> clear dead tuples for hash indexes as during scans we are marking the
>>> tuples as dead if corresponding tuple in heap is dead which are not
>>> removed later.  This is already taken care in btree code via
>>> kill_prior_tuple optimization.  So I think clearing of dead tuples can
>>> be handled by a separate patch.
>>
>> That seems like it could work.
>
> I have implemented this idea and it works for MVCC scans.  However, I
> think this might not work for non-MVCC scans.  Consider a case where
> in Process-1, hash scan has returned one row and before it could check
> its validity in heap, vacuum marks that tuple as dead and removed the
> entry from heap and some new tuple has been placed at that offset in
> heap.

Oops, that's bad.

> Now when Process-1 checks the validity in heap, it will check
> for a different tuple than what the index tuple was supposed to check.
> If we want, we can make it work similar to what btree does as being
> discussed on thread [1], but for that we need to introduce page-scan
> mode as well in hash indexes.   However, do we really want to solve
> this problem as part of this patch when this exists for other index am
> as well?

For what other index AM does this problem exist?  Kevin has been
careful not to create this problem for btree, or at least I think he
has.  That's why the pin still has to be held on the index page when
it's a non-MVCC scan.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-10-18 Thread Amit Kapila
On Wed, Oct 5, 2016 at 10:22 PM, Robert Haas  wrote:
> On Tue, Oct 4, 2016 at 12:36 AM, Amit Kapila  wrote:
>> I think one way to avoid the risk of deadlock in above scenario is to
>> take the cleanup lock conditionally, if we get the cleanup lock then
>> we will delete the items as we are doing in patch now, else it will
>> just mark the tuples as dead and ensure that it won't try to remove
>> tuples that are moved-by-split.  Now, I think the question is how will
>> these dead tuples be removed.  We anyway need a separate mechanism to
>> clear dead tuples for hash indexes as during scans we are marking the
>> tuples as dead if corresponding tuple in heap is dead which are not
>> removed later.  This is already taken care in btree code via
>> kill_prior_tuple optimization.  So I think clearing of dead tuples can
>> be handled by a separate patch.
>
> That seems like it could work.
>

I have implemented this idea and it works for MVCC scans.  However, I
think this might not work for non-MVCC scans.  Consider a case where
in Process-1 a hash scan has returned one row, and before it could check
its validity in the heap, vacuum marks that tuple as dead, removes the
entry from the heap, and some new tuple is placed at that offset in the
heap.  Now when Process-1 checks the validity in the heap, it will check
a different tuple than the one the index tuple was supposed to check.
If we want, we can make it work similarly to what btree does, as is being
discussed on thread [1], but for that we need to introduce page-scan
mode in hash indexes as well.  However, do we really want to solve
this problem as part of this patch when it exists for other index AMs
as well?


[1]  - 
https://www.postgresql.org/message-id/CACjxUsNtBXe1OfRp%3DacB%2B8QFAVWJ%3Dnr55_HMmqQYceCzVGF4tQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-10-10 Thread Amit Kapila
On Mon, Oct 10, 2016 at 10:07 PM, Jeff Janes  wrote:
> On Mon, Oct 10, 2016 at 5:55 AM, Amit Kapila 
> wrote:
>>
>> On Thu, Sep 29, 2016 at 8:27 PM, Amit Kapila 
>> wrote:
>> > On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas 
>> > wrote:
>> >
>> >> Another thing I don't quite understand about this algorithm is that in
>> >> order to conditionally lock the target primary bucket page, we'd first
>> >> need to read and pin it.  And that doesn't seem like a good thing to
>> >> do while we're holding a shared content lock on the metapage, because
>> >> of the principle that we don't want to hold content locks across I/O.
>> >>
>> >
>>
>> Aren't we already doing this during BufferAlloc() when the buffer
>> selected by StrategyGetBuffer() is dirty?
>
>
> Right, you probably shouldn't allocate another buffer while holding a
> content lock on a different one, if you can help it.
>

I don't see a problem with that, but I guess the simple rule is that
we should not hold content locks for a long duration, which could
happen if we do I/O or need to allocate a new buffer.

> But, BufferAlloc
> doesn't do that internally, does it?
>

You are right that BufferAlloc() doesn't allocate a new buffer while
holding a content lock on another buffer, but it does perform I/O while
holding a content lock.

>  It is only a problem if you make it be
> one by the way you use it.  Am I missing something?
>
>>
>>
>> > I think we can release metapage content lock before reading the buffer.
>> >
>>
>> On thinking about this again, if we release the metapage content lock
>> before reading and pinning the primary bucket page, then we need to
>> take it again to verify if the split has happened during the time we
>> don't have a lock on a metapage.  Releasing and again taking content
>> lock on metapage is not
>> good from the performance aspect.  Do you have some other idea for this?
>
>
> Doesn't the relcache patch effectively deal with this?  If this is a
> sticking point, maybe the relcache patch could be incorporated into this
> one.
>

Yeah, the relcache patch would eliminate the need for metapage locking,
but that is not a blocking point.  As this patch is mainly to enable
WAL logging, there is no urgency to incorporate the relcache patch,
even if we have to go with an algorithm where we need to take the
metapage lock twice to verify the splits.  Having said that, I am
okay with combining the two patches (the patch in this thread and the
cache-the-metapage patch) if Robert and/or others are also in favour of
that.  If we don't want to hold a content lock across another
ReadBuffer call, then another option could be to modify the read
algorithm as below:

read the metapage
compute bucket number for target hash key based on metapage contents
read the required block
loop:
    acquire shared content lock on metapage
    recompute bucket number for target hash key based on metapage contents
    if the recomputed block number is not the same as the block number we read
        release metapage content lock
        read the recomputed block number
    else
        break;
if (we can't get a shared content lock on the target bucket without blocking)
    loop:
        release metapage content lock
        take a shared content lock on the target primary bucket page
        take a shared content lock on the metapage
        if (previously-computed target bucket has not been split)
            break;

The basic change here is that we first compute the target block number
*without* locking the metapage, and then, after locking the metapage, if
the two don't match, we read the recomputed block number.
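For concreteness, a rough sketch of that retry loop in C (compute_blkno()
here is a hypothetical helper that maps the hash key to a block using the
metapage contents; the buffer-manager calls are the standard ones):

    BlockNumber blkno, new_blkno;
    Buffer      buf;

    blkno = compute_blkno(metap_unlocked, hashkey);  /* no metapage lock held */
    buf = ReadBuffer(rel, blkno);
    for (;;)
    {
        LockBuffer(metabuf, BUFFER_LOCK_SHARE);
        new_blkno = compute_blkno(metap, hashkey);   /* recompute under lock */
        if (new_blkno == blkno)
            break;                      /* exit still holding the metapage lock */
        LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);     /* no I/O under content lock */
        ReleaseBuffer(buf);
        blkno = new_blkno;
        buf = ReadBuffer(rel, blkno);   /* read the recomputed block */
    }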

Thoughts?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-10-10 Thread Jeff Janes
On Mon, Oct 10, 2016 at 5:55 AM, Amit Kapila 
wrote:

> On Thu, Sep 29, 2016 at 8:27 PM, Amit Kapila 
> wrote:
> > On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas 
> wrote:
> >
> >> Another thing I don't quite understand about this algorithm is that in
> >> order to conditionally lock the target primary bucket page, we'd first
> >> need to read and pin it.  And that doesn't seem like a good thing to
> >> do while we're holding a shared content lock on the metapage, because
> >> of the principle that we don't want to hold content locks across I/O.
> >>
> >
>
> Aren't we already doing this during BufferAlloc() when the buffer
> selected by StrategyGetBuffer() is dirty?
>

Right, you probably shouldn't allocate another buffer while holding a
content lock on a different one, if you can help it. But, BufferAlloc
doesn't do that internally, does it?  It is only a problem if you make it
be one by the way you use it.  Am I missing something?


>
> > I think we can release metapage content lock before reading the buffer.
> >
>
> On thinking about this again, if we release the metapage content lock
> before reading and pinning the primary bucket page, then we need to
> take it again to verify if the split has happened during the time we
> don't have a lock on a metapage.  Releasing and again taking content
> lock on metapage is not
> good from the performance aspect.  Do you have some other idea for this?
>

Doesn't the relcache patch effectively deal with this?  If this is a
sticking point, maybe the relcache patch could be incorporated into this
one.

Cheers,

Jeff


Re: [HACKERS] Hash Indexes

2016-10-10 Thread Amit Kapila
On Thu, Sep 29, 2016 at 8:27 PM, Amit Kapila  wrote:
> On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas  wrote:
>
>> Another thing I don't quite understand about this algorithm is that in
>> order to conditionally lock the target primary bucket page, we'd first
>> need to read and pin it.  And that doesn't seem like a good thing to
>> do while we're holding a shared content lock on the metapage, because
>> of the principle that we don't want to hold content locks across I/O.
>>
>

Aren't we already doing this during BufferAlloc() when the buffer
selected by StrategyGetBuffer() is dirty?

> I think we can release metapage content lock before reading the buffer.
>

On thinking about this again: if we release the metapage content lock
before reading and pinning the primary bucket page, then we need to
take it again to verify whether a split has happened while we did not
have a lock on the metapage.  Releasing and re-taking the content lock
on the metapage is not good from a performance perspective.  Do you
have some other idea for this?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-10-06 Thread Amit Kapila
On Wed, Oct 5, 2016 at 10:22 PM, Robert Haas  wrote:
> On Tue, Oct 4, 2016 at 12:36 AM, Amit Kapila  wrote:
>> I think one way to avoid the risk of deadlock in above scenario is to
>> take the cleanup lock conditionally, if we get the cleanup lock then
>> we will delete the items as we are doing in patch now, else it will
>> just mark the tuples as dead and ensure that it won't try to remove
>> tuples that are moved-by-split.  Now, I think the question is how will
>> these dead tuples be removed.  We anyway need a separate mechanism to
>> clear dead tuples for hash indexes as during scans we are marking the
>> tuples as dead if corresponding tuple in heap is dead which are not
>> removed later.  This is already taken care in btree code via
>> kill_prior_tuple optimization.  So I think clearing of dead tuples can
>> be handled by a separate patch.
>
> That seems like it could work.  The hash scan code will need to be
> made smart enough to ignore any tuples marked dead, if it isn't
> already.
>

It already takes care of ignoring killed tuples, in the code below,
though the way it does so is much less efficient compared to btree.
Basically, it fetches the matched tuple and then checks whether it is
dead, whereas btree does the same check while matching the key.  It
might be more efficient to do it before matching the hash key, but I
think that is a matter for a separate patch.

hashgettuple()
{
    ...
    /*
     * Skip killed tuples if asked to.
     */
    if (scan->ignore_killed_tuples)
    ...
}



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-10-05 Thread Robert Haas
On Tue, Oct 4, 2016 at 12:36 AM, Amit Kapila  wrote:
> I think one way to avoid the risk of deadlock in above scenario is to
> take the cleanup lock conditionally, if we get the cleanup lock then
> we will delete the items as we are doing in patch now, else it will
> just mark the tuples as dead and ensure that it won't try to remove
> tuples that are moved-by-split.  Now, I think the question is how will
> these dead tuples be removed.  We anyway need a separate mechanism to
> clear dead tuples for hash indexes as during scans we are marking the
> tuples as dead if corresponding tuple in heap is dead which are not
> removed later.  This is already taken care in btree code via
> kill_prior_tuple optimization.  So I think clearing of dead tuples can
> be handled by a separate patch.

That seems like it could work.  The hash scan code will need to be
made smart enough to ignore any tuples marked dead, if it isn't
already.  More aggressive cleanup can be left for another patch.

> I have also thought about using page-scan-at-a-time idea which has
> been discussed upthread[1], but I think we can't completely eliminate
> the need to out-wait scans (cleanup lock requirement) for scans that
> are started when split-in-progress or for non-MVCC scans as described
> in that e-mail [1].  We might be able to find some way to solve the
> problem with this approach, but I think it will be slightly
> complicated and much more work is required as compare to previous
> approach.

There are several levels of aggressiveness here with different locking
requirements:

1. Mark line items dead without reorganizing the page.  Needs an
exclusive content lock, no more.  Even a shared content lock may be
OK, as for other opportunistic bit-flipping.
2. Mark line items dead and compact the tuple data.  If a pin is
sufficient to look at tuple data, as it is for the heap, then a
cleanup lock is required here.  But if we always hold a shared content
lock when looking at the tuple data, it might be possible to do this
with just an exclusive content lock.
3. Remove dead line items completely, compacting the tuple data and
the item-pointer array.  Doing this with only an exclusive content
lock certainly needs page-at-a-time mode because otherwise a searcher
that resumes a scan later might resume from the wrong place.  It also
needs the guarantee mentioned for point #2, namely that nobody will be
examining the tuple data without a shared content lock.
4. Squeezing the bucket.  This is probably always going to require a
cleanup lock, because otherwise it's pretty unclear how a concurrent
scan could be made safe.  I suppose the scan could remember every TID
it has seen, somehow detect that a squeeze had happened, and rescan
the whole bucket ignoring TIDs already returned, but that seems to
require the client to use potentially unbounded amounts of memory to
remember already-returned TIDs, plus an as-yet-uninvented mechanism
for detecting that a squeeze has happened.  So this seems like a
dead-end to me.

I think that it is very much worthwhile to reduce the required lock
strength from cleanup-lock to exclusive-lock in as many cases as
possible, but I don't think it will be possible to completely
eliminate the need to take the cleanup lock in some cases.  However,
if we can always take the cleanup lock conditionally and never be in a
situation where it's absolutely required, we should be OK - and even
level (1) gives you that.
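As a concrete instance of level 1, here is a sketch using the generic
page and buffer-manager calls (illustrative only; buf and offnum are
assumed to be in scope, and WAL considerations are ignored):

    Page        page;
    ItemId      itemid;

    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
    page = BufferGetPage(buf);
    itemid = PageGetItemId(page, offnum);
    ItemIdMarkDead(itemid);             /* flip the flag; nothing moves */
    MarkBufferDirtyHint(buf, true);     /* hint-style change, as for kill_prior_tuple */
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);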

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-10-03 Thread Amit Kapila
On Tue, Oct 4, 2016 at 10:06 AM, Amit Kapila  wrote:
> On Thu, Sep 29, 2016 at 8:27 PM, Amit Kapila  wrote:
>> On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas  wrote:
>>> On Wed, Sep 28, 2016 at 3:04 PM, Robert Haas  wrote:
>>>
>>> As I was looking at the old text regarding deadlock risk, I realized
>>> what may be a serious problem.  Suppose process A is performing a scan
>>> of some hash index.  While the scan is suspended, it attempts to take
>>> a lock and is blocked by process B.  Process B, meanwhile, is running
>>> VACUUM on that hash index.  Eventually, it will do
>>> LockBufferForCleanup() on the hash bucket on which process A holds a
>>> buffer pin, resulting in an undetected deadlock. In the current
>>> coding, A would hold a heavyweight lock and B would attempt to acquire
>>> a conflicting heavyweight lock, and the deadlock detector would kill
>>> one of them.  This patch probably breaks that.  I notice that that's
>>> the only place where we attempt to acquire a buffer cleanup lock
>>> unconditionally; every place else, we acquire the lock conditionally,
>>> so there's no deadlock risk.  Once we resolve this problem, the
>>> paragraph about deadlock risk in this section should be revised to
>>> explain whatever solution we come up with.
>>>
>>> By the way, since VACUUM must run in its own transaction, B can't be
>>> holding arbitrary locks, but that doesn't seem quite sufficient to get
>>> us out of the woods.  It will at least hold ShareUpdateExclusiveLock
>>> on the relation being vacuumed, and process A could attempt to acquire
>>> that same lock.
>>>
>>
>> Right, I think there is a danger of deadlock in above situation.
>> Needs some more thoughts.
>>
>
> I think one way to avoid the risk of deadlock in above scenario is to
> take the cleanup lock conditionally, if we get the cleanup lock then
> we will delete the items as we are doing in patch now, else it will
> just mark the tuples as dead and ensure that it won't try to remove
> tuples that are moved-by-split.  Now, I think the question is how will
> these dead tuples be removed.  We anyway need a separate mechanism to
> clear dead tuples for hash indexes as during scans we are marking the
> tuples as dead if corresponding tuple in heap is dead which are not
> removed later.  This is already taken care in btree code via
> kill_prior_tuple optimization.  So I think clearing of dead tuples can
> be handled by a separate patch.
>

I think we can also remove the dead tuples the next time vacuum
visits the bucket and is able to acquire the cleanup lock.  Right now,
we are just checking whether the corresponding heap tuple is dead; we
can add a further check so that if the current item is already marked
dead in the index, it is considered in the list of deletable items.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-10-03 Thread Amit Kapila
On Thu, Sep 29, 2016 at 8:27 PM, Amit Kapila  wrote:
> On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas  wrote:
>> On Wed, Sep 28, 2016 at 3:04 PM, Robert Haas  wrote:
>>
>> As I was looking at the old text regarding deadlock risk, I realized
>> what may be a serious problem.  Suppose process A is performing a scan
>> of some hash index.  While the scan is suspended, it attempts to take
>> a lock and is blocked by process B.  Process B, meanwhile, is running
>> VACUUM on that hash index.  Eventually, it will do
>> LockBufferForCleanup() on the hash bucket on which process A holds a
>> buffer pin, resulting in an undetected deadlock. In the current
>> coding, A would hold a heavyweight lock and B would attempt to acquire
>> a conflicting heavyweight lock, and the deadlock detector would kill
>> one of them.  This patch probably breaks that.  I notice that that's
>> the only place where we attempt to acquire a buffer cleanup lock
>> unconditionally; every place else, we acquire the lock conditionally,
>> so there's no deadlock risk.  Once we resolve this problem, the
>> paragraph about deadlock risk in this section should be revised to
>> explain whatever solution we come up with.
>>
>> By the way, since VACUUM must run in its own transaction, B can't be
>> holding arbitrary locks, but that doesn't seem quite sufficient to get
>> us out of the woods.  It will at least hold ShareUpdateExclusiveLock
>> on the relation being vacuumed, and process A could attempt to acquire
>> that same lock.
>>
>
> Right, I think there is a danger of deadlock in above situation.
> Needs some more thoughts.
>

I think one way to avoid the risk of deadlock in the above scenario is to
take the cleanup lock conditionally: if we get the cleanup lock, then
we delete the items as we do in the patch now; otherwise we just mark
the tuples as dead and ensure that we don't try to remove tuples that
are moved-by-split.  Now, I think the question is how these dead tuples
will be removed.  We anyway need a separate mechanism to clear dead
tuples for hash indexes, as during scans we mark tuples as dead if the
corresponding heap tuple is dead, and they are not removed later.  This
is already taken care of in the btree code via the kill_prior_tuple
optimization.  So I think clearing of dead tuples can be handled by a
separate patch.
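A sketch of the conditional shape being proposed, using the existing
buffer-manager API (illustrative only):

    if (ConditionalLockBufferForCleanup(buf))
    {
        /* got the cleanup lock without waiting: physically delete items */
        ...
    }
    else
    {
        /*
         * Could not get it without risking an undetected deadlock: take
         * only an exclusive content lock, mark tuples dead, and skip
         * tuples that are moved-by-split.
         */
        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        ...
    }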

I have also thought about using the page-scan-at-a-time idea which has
been discussed upthread [1], but I think we can't completely eliminate
the need to out-wait scans (the cleanup lock requirement) for scans that
are started while a split is in progress, or for non-MVCC scans, as
described in that e-mail [1].  We might be able to find some way to
solve the problem with this approach, but I think it will be slightly
complicated, and much more work is required compared to the previous
approach.

What is your preference among the above approaches to resolve this
problem?  Or let me know if you have a better idea to solve it.


[1] - 
https://www.postgresql.org/message-id/CAA4eK1Jj1UqneTXrywr%3DGg87vgmnMma87LuscN_r3hKaHd%3DL2g%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-10-03 Thread Tom Lane
Jeff Janes  writes:
> I've done a simple comparison using pgbench's default transaction, in which
> all the primary keys have been dropped and replaced with indexes of either
> hash or btree type, alternating over many rounds.

> I run 'pgbench -c16 -j16 -T 900 -M prepared' on an 8 core machine with a
> scale of 40.  All the data fits in RAM, but not in shared_buffers (128MB).

> I find a 4% improvement for hash indexes over btree indexes, 9324.744
> vs 9727.766.  The difference is significant at p-value of 1.9e-9.

Thanks for doing this work!

> The four versions of hash indexes (HEAD, concurrent, wal, cache, applied
> cumulatively) have no statistically significant difference in performance
> from each other.

Interesting.

> I think I don't see improvement in hash performance with the concurrent and
> cache patches because I don't have enough cores to get to the contention
> that those patches are targeted at.

Possibly.  However, if the cache patch is not a prerequisite to the WAL
fixes, IMO somebody would have to demonstrate that it has a measurable
performance benefit before it would get in.  It certainly doesn't look
like it's simplifying the code, so I wouldn't take it otherwise.

I think, though, that this is enough to put to bed the argument that
we should toss the hash AM entirely.  If it's already competitive with
btree today, despite the lack of attention that it's gotten, there is
reason to hope that it will be a significant win (for some use-cases,
obviously) in future.  We should now get back to reviewing these patches
on their own merits.

regards, tom lane




Re: [HACKERS] Hash Indexes

2016-10-03 Thread Jeff Janes
On Thu, Sep 29, 2016 at 5:14 PM, Robert Haas  wrote:

> On Thu, Sep 29, 2016 at 8:07 PM, Peter Geoghegan  wrote:
> > On Wed, Sep 28, 2016 at 8:06 PM, Andres Freund 
> wrote:
> >> On 2016-09-28 15:04:30 -0400, Robert Haas wrote:
> >>> Andres already
> >>> stated that he things working on btree-over-hash would be more
> >>> beneficial than fixing hash, but at this point it seems like he's the
> >>> only one who takes that position.
> >>
> >> Note that I did *NOT* take that position. I was saying that I think we
> >> should evaluate whether that's not a better approach, doing some simple
> >> performance comparisons.
> >
> > I, for one, agree with this position.
>
> Well, I, for one, find it frustrating.  It seems pretty unhelpful to
> bring this up only after the code has already been written.  The first
> post on this thread was on May 10th.  The first version of the patch
> was posted on June 16th.  This position was first articulated on
> September 15th.
>
> But, by all means, please feel free to do the performance comparison
> and post the results.  I'd be curious to see them myself.
>


I've done a simple comparison using pgbench's default transaction, in which
all the primary keys have been dropped and replaced with indexes of either
hash or btree type, alternating over many rounds.

I run 'pgbench -c16 -j16 -T 900 -M prepared' on an 8 core machine with a
scale of 40.  All the data fits in RAM, but not in shared_buffers (128MB).

I find a 4% improvement for hash indexes over btree indexes, 9324.744
vs 9727.766.  The difference is significant at p-value of 1.9e-9.

The four versions of hash indexes (HEAD, concurrent, wal, cache, applied
cumulatively) have no statistically significant difference in performance
from each other.

I certainly don't see how btree-over-hash-over-integer could be better than
direct btree-over-integer.

I think I don't see improvement in hash performance with the concurrent and
cache patches because I don't have enough cores to get to the contention
that those patches are targeted at.  But since the concurrent patch is a
prerequisite to the wal patch, that is enough to justify it even without a
demonstrated performance boost.  A 4% gain is not astonishing, but is nice
to have provided we can get it without giving up crash safety.

Cheers,

Jeff


Re: [HACKERS] Hash Indexes

2016-10-02 Thread Michael Paquier
On Mon, Oct 3, 2016 at 12:42 AM, Pavel Stehule  wrote:
>
>
> 2016-10-02 12:40 GMT+02:00 Michael Paquier :
>>
>> On Sun, Oct 2, 2016 at 3:31 AM, Greg Stark  wrote:
>> > On Fri, Sep 30, 2016 at 2:11 AM, Robert Haas 
>> > wrote:
>> >> For one thing, we can stop shipping a totally broken feature in release
>> >> after release
>> >
>> > For what it's worth I'm for any patch that can accomplish that.
>> >
>> > In retrospect I think we should have done the hash-over-btree thing
>> > ten years ago but we didn't and if Amit's patch makes hash indexes
>> > recoverable today then go for it.
>>
>> +1.
>
> +1

And moved to the next CF to let it breathe.
-- 
Michael




Re: [HACKERS] Hash Indexes

2016-10-02 Thread Pavel Stehule
2016-10-02 12:40 GMT+02:00 Michael Paquier :

> On Sun, Oct 2, 2016 at 3:31 AM, Greg Stark  wrote:
> > On Fri, Sep 30, 2016 at 2:11 AM, Robert Haas 
> wrote:
> >> For one thing, we can stop shipping a totally broken feature in release
> after release
> >
> > For what it's worth I'm for any patch that can accomplish that.
> >
> > In retrospect I think we should have done the hash-over-btree thing
> > ten years ago but we didn't and if Amit's patch makes hash indexes
> > recoverable today then go for it.
>
> +1.
>

+1

Pavel


> --
> Michael
>


Re: [HACKERS] Hash Indexes

2016-10-02 Thread Michael Paquier
On Sun, Oct 2, 2016 at 3:31 AM, Greg Stark  wrote:
> On Fri, Sep 30, 2016 at 2:11 AM, Robert Haas  wrote:
>> For one thing, we can stop shipping a totally broken feature in release 
>> after release
>
> For what it's worth I'm for any patch that can accomplish that.
>
> In retrospect I think we should have done the hash-over-btree thing
> ten years ago but we didn't and if Amit's patch makes hash indexes
> recoverable today then go for it.

+1.
-- 
Michael




Re: [HACKERS] Hash Indexes

2016-10-01 Thread Greg Stark
On Fri, Sep 30, 2016 at 2:11 AM, Robert Haas  wrote:
> For one thing, we can stop shipping a totally broken feature in release after 
> release

For what it's worth I'm for any patch that can accomplish that.

In retrospect I think we should have done the hash-over-btree thing
ten years ago but we didn't and if Amit's patch makes hash indexes
recoverable today then go for it.

-- 
greg




Re: [HACKERS] Hash Indexes

2016-10-01 Thread ktm

Andres Freund :


On 2016-09-30 17:39:04 +0100, Peter Geoghegan wrote:

On Fri, Sep 30, 2016 at 4:46 PM, Robert Haas  wrote:
> I would just be very disappointed if, after the amount of work that
> Amit and others have put into this project, the code gets rejected
> because somebody thinks a different project would have been more worth
> doing.

I wouldn't presume to tell anyone else how to spend their time, and am
not concerned about this making the hash index code any less useful
from the user's perspective.


Me neither.

I'm concerned that this is a heck of a lot of work, and I don't think
we've reached the end of it by a good bit. I think it would have, and
probably still is, a more efficient use of time to go for the
hash-via-btree method, and rip out the current hash indexes.  But that's
just me.

I find it more than a bit odd to be accused of trying to waste others
time by saying this, and that this is too late because time has already
been invested. Especially the latter never has been a standard in the
community, and while excruciatingly painful when one is the person(s)
having invested the time, it probably shouldn't be.



> The fact that we have hash indexes already and cannot remove them
> because too much other code depends on hash opclasses is also, in my
> opinion, a sufficiently good reason to pursue improving them.

I think that Andres was suggesting that hash index opclasses would be
usable with hash-over-btree, so you might still not end up with the
wart of having hash opclasses without hash indexes (an idea that has
been proposed and rejected at least once before now). Andres?


Yes, that was what I was pretty much thinking. I was kind of guessing
that this might be easiest implemented as a separate AM ("hash2" ;))
that's just a layer on top of nbtree.

Greetings,

Andres Freund


Hi,

There have been benchmarks posted over the years where even the
non-WAL-logged hash index outperformed the equivalent btree usage. You
cannot argue against O(1) algorithm behavior. We need to have a usable
hash index so that others can help improve it.
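For anyone who wants to try this locally, a minimal comparison could look
like the following (a sketch only -- table and column names are made up,
and until the WAL patch lands the hash index here is the non-crash-safe
one under discussion):

    -- hypothetical test table, names are illustrative
    CREATE TABLE lookup_test (id integer, payload text);
    INSERT INTO lookup_test
        SELECT g, md5(g::text) FROM generate_series(1, 1000000) g;

    CREATE INDEX lookup_btree ON lookup_test USING btree (id);
    CREATE INDEX lookup_hash  ON lookup_test USING hash  (id);

    -- equality-only lookups are where the O(1) behavior can show up
    EXPLAIN ANALYZE SELECT * FROM lookup_test WHERE id = 42;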


My 2 cents.

Regards,
Ken






Re: [HACKERS] Hash Indexes

2016-09-30 Thread Amit Kapila
On 30-Sep-2016 10:26 PM, "Peter Geoghegan"  wrote:
>
> On Fri, Sep 30, 2016 at 4:46 PM, Robert Haas wrote:
> > I would just be very disappointed if, after the amount of work that
> > Amit and others have put into this project, the code gets rejected
> > because somebody thinks a different project would have been more worth
> > doing.
>
> I wouldn't presume to tell anyone else how to spend their time, and am
> not concerned about this patch making the hash index code any less
> useful from the user's perspective. If this is how we remove the wart
> of hash indexes not being WAL-logged, that's fine by me. I'm trying to
> be helpful.
>

If that is fine, then I think we should do that.  I want to bring it to
your notice that we have already seen and reported that with the proposed
set of patches, hash indexes are a good bit faster than btree, so that adds
additional value in making them WAL-logged.

> > As Tom said upthread: 'But to kick the hash AM as such to the curb is
> > to say "sorry, there will never be O(1) index lookups in Postgres".'  I
> > think that's correct and a sufficiently-good reason to pursue this
> > work, regardless of the merits (or lack of merits) of hash-over-btree.
>
> I don't think that "O(1) index lookups" is a useful guarantee with a
> very expensive constant factor.

The constant factor doesn't play much of a role when the data has no
duplicates or only a few duplicates.

> Amit seemed to agree with this, since
> he spoke of the importance of both theoretical performance benefits
> and practically realizable performance benefits.
>

No, I don't agree with that; rather, I think hash indexes are theoretically
faster than btree, and we have seen that practically as well for quite a few
cases (for read workloads - when used with unique data and also in nested
loops).
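To illustrate the nested-loop case (hypothetical tables and names; whether
the planner actually picks this plan depends on statistics and costs):

    -- each outer row probes the inner table by equality, which is exactly
    -- the lookup pattern where a hash index can win
    CREATE INDEX order_items_hash ON order_items USING hash (order_id);

    EXPLAIN ANALYZE
    SELECT o.id, i.product_id
    FROM orders o
    JOIN order_items i ON i.order_id = o.id
    WHERE o.placed_on = current_date;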

With Regards,
Amit Kapila


Re: [HACKERS] Hash Indexes

2016-09-30 Thread Andres Freund
On 2016-09-30 17:39:04 +0100, Peter Geoghegan wrote:
> On Fri, Sep 30, 2016 at 4:46 PM, Robert Haas  wrote:
> > I would just be very disappointed if, after the amount of work that
> > Amit and others have put into this project, the code gets rejected
> > because somebody thinks a different project would have been more worth
> > doing.
> 
> I wouldn't presume to tell anyone else how to spend their time, and am
> not concerned about this making the hash index code any less useful
> from the user's perspective.

Me neither.

I'm concerned that this is a heck of a lot of work, and I don't think
we've reached the end of it by a good bit. I think it would have, and
probably still is, a more efficient use of time to go for the
hash-via-btree method, and rip out the current hash indexes.  But that's
just me.

I find it more than a bit odd to be accused of trying to waste others'
time by saying this, and that this is too late because time has already
been invested. Especially the latter never has been a standard in the
community, and while excruciatingly painful when one is the person(s)
having invested the time, it probably shouldn't be.


> > The fact that we have hash indexes already and cannot remove them
> > because too much other code depends on hash opclasses is also, in my
> > opinion, a sufficiently good reason to pursue improving them.
> 
> I think that Andres was suggesting that hash index opclasses would be
> usable with hash-over-btree, so you might still not end up with the
> wart of having hash opclasses without hash indexes (an idea that has
> been proposed and rejected at least once before now). Andres?

Yes, that was what I was pretty much thinking. I was kind of guessing
that this might be easiest implemented as a separate AM ("hash2" ;))
that's just a layer on top of nbtree.
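To make that concrete: even today, without a new AM, one can approximate
hash-over-btree with a btree expression index over a hash of the key (a
sketch only, not what a real "hash2" AM would look like; note that hash
collisions force a recheck of the real key):

    -- btree over the 32-bit hash of a wide text key: small, equality-only
    CREATE INDEX t_key_hash ON t (hashtext(key));

    -- different keys can share a hash value, so recheck the key itself
    SELECT * FROM t
    WHERE hashtext(key) = hashtext('some value')
      AND key = 'some value';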

Greetings,

Andres Freund




Re: [HACKERS] Hash Indexes

2016-09-30 Thread Tom Lane
Peter Geoghegan  writes:
> On Fri, Sep 30, 2016 at 4:46 PM, Robert Haas  wrote:
>> The fact that we have hash indexes already and cannot remove them
>> because too much other code depends on hash opclasses is also, in my
>> opinion, a sufficiently good reason to pursue improving them.

> I think that Andres was suggesting that hash index opclasses would be
> usable with hash-over-btree, so you might still not end up with the
> wart of having hash opclasses without hash indexes (an idea that has
> been proposed and rejected at least once before now). Andres?

That's an interesting point.  If we were to flat-out replace the hash AM
with a hash-over-btree AM, the existing hash opclasses would just migrate
to that unchanged.  But if someone wanted to add hash-over-btree alongside
the hash AM, it would be necessary to clone all those opclass entries,
or else find a way for the two AMs to share pg_opclass etc entries.
Either one of those is kind of annoying.  (Although if we did do the work
of implementing the latter, it might come in handy in future; you could
certainly imagine that there will be cases like a next-generation GIST AM
wanting to reuse the opclasses of existing GIST, say.)
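For reference, the opclass entries that would have to migrate or be
cloned are visible in the catalogs; a read-only query to list them:

    -- operator classes currently tied to the hash AM
    SELECT opc.opcname, typ.typname
    FROM pg_opclass opc
    JOIN pg_am am ON am.oid = opc.opcmethod
    JOIN pg_type typ ON typ.oid = opc.opcintype
    WHERE am.amname = 'hash'
    ORDER BY opc.opcname;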

But having said that, I remain opposed to removing the hash AM.
If someone wants to implement hash-over-btree, that's cool with me,
but I'd much rather put it in beside plain hash and let them duke
it out in the field.

regards, tom lane




Re: [HACKERS] Hash Indexes

2016-09-30 Thread Peter Geoghegan
On Fri, Sep 30, 2016 at 4:46 PM, Robert Haas  wrote:
> I would just be very disappointed if, after the amount of work that
> Amit and others have put into this project, the code gets rejected
> because somebody thinks a different project would have been more worth
> doing.

I wouldn't presume to tell anyone else how to spend their time, and am
not concerned about this patch making the hash index code any less
useful from the user's perspective. If this is how we remove the wart
of hash indexes not being WAL-logged, that's fine by me. I'm trying to
be helpful.

> As Tom said upthread: 'But to kick the hash AM as such to the curb is
> to say "sorry, there will never be O(1) index lookups in Postgres".'  I
> think that's correct and a sufficiently-good reason to pursue this
> work, regardless of the merits (or lack of merits) of hash-over-btree.

I don't think that "O(1) index lookups" is a useful guarantee with a
very expensive constant factor. Amit seemed to agree with this, since
he spoke of the importance of both theoretical performance benefits
and practically realizable performance benefits.

> The fact that we have hash indexes already and cannot remove them
> because too much other code depends on hash opclasses is also, in my
> opinion, a sufficiently good reason to pursue improving them.

I think that Andres was suggesting that hash index opclasses would be
usable with hash-over-btree, so you might still not end up with the
wart of having hash opclasses without hash indexes (an idea that has
been proposed and rejected at least once before).

-- 
Peter Geoghegan




Re: [HACKERS] Hash Indexes

2016-09-30 Thread Peter Geoghegan
On Fri, Sep 30, 2016 at 4:46 PM, Robert Haas  wrote:
> I would just be very disappointed if, after the amount of work that
> Amit and others have put into this project, the code gets rejected
> because somebody thinks a different project would have been more worth
> doing.

I wouldn't presume to tell anyone else how to spend their time, and am
not concerned about this making the hash index code any less useful
from the user's perspective. If this is how we remove the wart of hash
indexes not being WAL-logged, that's fine by me. I am trying to be
helpful.

> As Tom said upthread: 'But to kick the hash AM as such to the curb is
> to say "sorry, there will never be O(1) index lookups in Postgres".'  I
> think that's correct and a sufficiently-good reason to pursue this
> work, regardless of the merits (or lack of merits) of hash-over-btree.

I don't think that "O(1) index lookups" is a useful guarantee with a
very expensive constant factor. Amit said: "I think any discussion on
whether we should consider not improving the current hash indexes is only
meaningful if someone has code which can prove both theoretically
and practically that it is better than hash indexes for all usages",
so I think that he shares this view.

> The fact that we have hash indexes already and cannot remove them
> because too much other code depends on hash opclasses is also, in my
> opinion, a sufficiently good reason to pursue improving them.

I think that Andres was suggesting that hash index opclasses would be
usable with hash-over-btree, so you might still not end up with the
wart of having hash opclasses without hash indexes (an idea that has
been proposed and rejected at least once before now). Andres?

To be clear: I haven't expressed any opinion on this patch.

-- 
Peter Geoghegan




Re: [HACKERS] Hash Indexes

2016-09-30 Thread Robert Haas
On Fri, Sep 30, 2016 at 7:47 AM, Peter Geoghegan  wrote:
> My only firm position is that it wouldn't be very hard to investigate
> hash-over-btree to Andres' satisfaction, say, so, why not? I'm
> surprised that this has caused consternation -- ISTM that Andres'
> suggestion is *perfectly* reasonable. It doesn't appear to be an
> objection to anything in particular.

I would just be very disappointed if, after the amount of work that
Amit and others have put into this project, the code gets rejected
because somebody thinks a different project would have been more worth
doing.  As Tom said upthread: 'But to kick the hash AM as such to the
curb is to say "sorry, there will never be O(1) index lookups in
Postgres".'  I
think that's correct and a sufficiently-good reason to pursue this
work, regardless of the merits (or lack of merits) of hash-over-btree.
The fact that we have hash indexes already and cannot remove them
because too much other code depends on hash opclasses is also, in my
opinion, a sufficiently good reason to pursue improving them.  I don't
think the project needs the additional justification of outperforming
a hash-over-btree in order to exist, even if such a comparison could
be done fairly, which I suspect is harder than you're crediting.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-09-30 Thread Peter Geoghegan
On Fri, Sep 30, 2016 at 9:07 AM, Amit Kapila  wrote:
> Considering that I missed the real intention of their suggestions, I think
> such a serious objection to any work should be discussed on the list.  To
> answer the actual objection, I have already mentioned upthread that we can
> deduce from the current tests done by Jesper and Mithun that there are some
> cases where a hash index will be better than hash-over-btree (tests done
> over integer columns).  I think any discussion on whether we should
> consider not improving the current hash indexes is only meaningful if
> someone has code which can prove both theoretically and practically that
> it is better than hash indexes for all usages.

I cannot speak for Andres, but you judged my intent here correctly. I
have no firm position on any of this just yet; I haven't even read the
patch. I just think that it is worth doing some simple analysis of a
hash-over-btree implementation, with simple prototyping and a simple
test-case. I would consider that a due-diligence thing, because,
honestly, it seems obvious to me that it should be at least checked
out informally.

I wasn't aware that there was already some analysis of this. Robert
did just acknowledge that it is *possible* that "the hash-over-btree
idea completely trounces hash indexes", so the general tone of this
thread suggested to me that there was little or no analysis of
hash-over-btree. I'm willing to believe that I'm wrong to be
dismissive of the hash AM in general, and I'm even willing to be
flexible on crediting the hash AM with being less optimized overall
(assuming we can see a way past that).

My only firm position is that it wouldn't be very hard to investigate
hash-over-btree to Andres' satisfaction, say, so, why not? I'm
surprised that this has caused consternation -- ISTM that Andres'
suggestion is *perfectly* reasonable. It doesn't appear to be an
objection to anything in particular.

-- 
Peter Geoghegan




Re: [HACKERS] Hash Indexes

2016-09-30 Thread Amit Kapila
On 30-Sep-2016 6:24 AM, "Robert Haas"  wrote:
>
> On Thu, Sep 29, 2016 at 8:29 PM, Andres Freund  wrote:
> > On September 29, 2016 5:28:00 PM PDT, Robert Haas wrote:
> >>On Thu, Sep 29, 2016 at 8:16 PM, Andres Freund 
> >>wrote:
>  Well, I, for one, find it frustrating.  It seems pretty unhelpful to
>  bring this up only after the code has already been written.
> >>>
> >>> I brought this up in person at pgcon too.
> >>
> >>To whom?  In what context?
> >
> > Amit, over dinner.
>
> OK, well, I can't really comment on that, then, except to say that if
> you waited three months to follow up on the mailing list, you probably
> can't blame Amit if he thought that it was more of a casual suggestion
> than a serious objection.  Maybe it was?  I don't know.
>

Both of them have talked about hash indexes with me offline. Peter
mentioned that it would be better to improve btree rather than hash
indexes. IIRC, Andres asked me mainly about what use cases I have in mind
for hash indexes, and then we had some further discussion on the same
topic, where he was not convinced that there is any big use case for hash
indexes, even though there may be some. In that discussion, as he says and
I don't doubt him, he would have told me about the alternative, but it was
not apparent to me that he was expecting some sort of comparison.

What I got from both discussions was a friendly gesture that it might be a
better use of my time if I worked on some other problem.  I really respect
suggestions from both of them, but it was nowhere clear to me that either
of them was expecting any comparison with the other approach.

Considering that I missed the real intention of their suggestions, I think
such a serious objection to any work should be discussed on the list.  To
answer the actual objection, I have already mentioned upthread that we can
deduce from the current tests done by Jesper and Mithun that there are some
cases where a hash index will be better than hash-over-btree (tests done
over integer columns).  I think any discussion on whether we should
consider not improving the current hash indexes is only meaningful if
someone has code which can prove both theoretically and practically that it
is better than hash indexes for all usages.

Note - excuse me for the formatting of this email, as I am traveling and
using my phone.

With Regards,
Amit Kapila.


Re: [HACKERS] Hash Indexes

2016-09-29 Thread Robert Haas
On Thu, Sep 29, 2016 at 8:53 PM, Peter Geoghegan  wrote:
> On Fri, Sep 30, 2016 at 1:14 AM, Robert Haas  wrote:
>>> I, for one, agree with this position.
>>
>> Well, I, for one, find it frustrating.  It seems pretty unhelpful to
>> bring this up only after the code has already been written.  The first
>> post on this thread was on May 10th.  The first version of the patch
>> was posted on June 16th.  This position was first articulated on
>> September 15th.
>
> Really, what do we have to lose at this point? It's not very difficult
> to do what Andres proposes.

Well, first of all, I can't, because I don't really understand what
tests he has in mind.  Maybe somebody else does, in which case perhaps
they could do the work and post the results.  If the tests really are
simple, that shouldn't be much of a burden.

But, second, suppose we do the tests and find out that the
hash-over-btree idea completely trounces hash indexes.  What then?  I
don't think that would really prove anything because, as I said in my
email to Andres, the current hash index code is severely
under-optimized, so it's not really an apples-to-apples comparison.
But even if it did prove something, is the idea then that Amit (with
help from Mithun and Ashutosh Sharma) should throw away the ~8 months
of development work that's been done on hash indexes in favor of
starting all over with a new and probably harder project to build a
whole new AM, and just leave hash indexes broken?  That doesn't seem
like a very reasonable thing to ask.  Leaving hash indexes broken
fixes no problem that we have.

On the other hand, applying those patches (after they've been suitably
reviewed and fixed up) does fix several things.  For one thing, we can
stop shipping a totally broken feature in release after release.  For
another thing, those hash indexes do in fact outperform btree on some
workloads, and with more work they can probably beat btree on more
workloads.  And if somebody later wants to write hash-over-btree and
that turns out to be better still, great!  I'm not blocking anyone
from doing that.

The only argument that's been advanced for not fixing hash indexes is
that we'd then have to give people accurate guidance on whether to
choose hash or btree, but that would also be true of a hypothetical
hash-over-btree AM.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-09-29 Thread Robert Haas
On Thu, Sep 29, 2016 at 8:29 PM, Andres Freund  wrote:
> On September 29, 2016 5:28:00 PM PDT, Robert Haas  
> wrote:
>>On Thu, Sep 29, 2016 at 8:16 PM, Andres Freund 
>>wrote:
 Well, I, for one, find it frustrating.  It seems pretty unhelpful to
 bring this up only after the code has already been written.
>>>
>>> I brought this up in person at pgcon too.
>>
>>To whom?  In what context?
>
> Amit, over dinner.

OK, well, I can't really comment on that, then, except to say that if
you waited three months to follow up on the mailing list, you probably
can't blame Amit if he thought that it was more of a casual suggestion
than a serious objection.  Maybe it was?  I don't know.

For my part, I don't really understand how you think that we could
find anything out via relatively simple tests.  The hash index code is
horribly under-maintained, which is why Amit is able to get large
performance improvements out of improving it.  If you compare it to
btree in some way, it's probably going to lose.  But I don't think
that answers the question of whether a hash AM that somebody's put
some work into will win or lose against a hypothetical hash-over-btree
AM that nobody's written.  Even if it wins, is that really a reason to
leave the hash index code itself in a state of disrepair?  We probably
would have removed it already except that the infrastructure is used
for hash joins and hash aggregation, so we really can't.
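For context, the shared infrastructure is each type's hash opclass
support function, which the executor uses for plans like the following
(illustrative, made-up tables; nothing here touches a hash index):

    EXPLAIN SELECT * FROM orders o JOIN customers c ON c.id = o.customer_id;
    -- on sufficiently large tables this typically shows a Hash Join node,
    -- built from the same hash opclass machinery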

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-09-29 Thread Peter Geoghegan
On Fri, Sep 30, 2016 at 1:14 AM, Robert Haas  wrote:
>> I, for one, agree with this position.
>
> Well, I, for one, find it frustrating.  It seems pretty unhelpful to
> bring this up only after the code has already been written.  The first
> post on this thread was on May 10th.  The first version of the patch
> was posted on June 16th.  This position was first articulated on
> September 15th.

Really, what do we have to lose at this point? It's not very difficult
to do what Andres proposes.

-- 
Peter Geoghegan




Re: [HACKERS] Hash Indexes

2016-09-29 Thread Peter Geoghegan
On Fri, Sep 30, 2016 at 1:29 AM, Andres Freund  wrote:
>>To whom?  In what context?
>
> Amit, over dinner.

In case it matters, I also talked to Amit about this privately.


-- 
Peter Geoghegan




Re: [HACKERS] Hash Indexes

2016-09-29 Thread Andres Freund


On September 29, 2016 5:28:00 PM PDT, Robert Haas  wrote:
>On Thu, Sep 29, 2016 at 8:16 PM, Andres Freund 
>wrote:
>>> Well, I, for one, find it frustrating.  It seems pretty unhelpful to
>>> bring this up only after the code has already been written.
>>
>> I brought this up in person at pgcon too.
>
>To whom?  In what context?

Amit, over dinner.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [HACKERS] Hash Indexes

2016-09-29 Thread Robert Haas
On Thu, Sep 29, 2016 at 8:16 PM, Andres Freund  wrote:
>> Well, I, for one, find it frustrating.  It seems pretty unhelpful to
>> bring this up only after the code has already been written.
>
> I brought this up in person at pgcon too.

To whom?  In what context?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-09-29 Thread Andres Freund
On 2016-09-29 20:14:40 -0400, Robert Haas wrote:
> On Thu, Sep 29, 2016 at 8:07 PM, Peter Geoghegan  wrote:
> > On Wed, Sep 28, 2016 at 8:06 PM, Andres Freund  wrote:
> >> On 2016-09-28 15:04:30 -0400, Robert Haas wrote:
> >>> Andres already
> >>> stated that he thinks working on hash-over-btree would be more
> >>> beneficial than fixing hash, but at this point it seems like he's the
> >>> only one who takes that position.
> >>
> >> Note that I did *NOT* take that position. I was saying that I think we
> >> should evaluate whether that's not a better approach, doing some simple
> >> performance comparisons.
> >
> > I, for one, agree with this position.
>
> Well, I, for one, find it frustrating.  It seems pretty unhelpful to
> bring this up only after the code has already been written.

I brought this up in person at pgcon too.




Re: [HACKERS] Hash Indexes

2016-09-29 Thread Robert Haas
On Thu, Sep 29, 2016 at 8:07 PM, Peter Geoghegan  wrote:
> On Wed, Sep 28, 2016 at 8:06 PM, Andres Freund  wrote:
>> On 2016-09-28 15:04:30 -0400, Robert Haas wrote:
>>> Andres already
>>> stated that he thinks working on hash-over-btree would be more
>>> beneficial than fixing hash, but at this point it seems like he's the
>>> only one who takes that position.
>>
>> Note that I did *NOT* take that position. I was saying that I think we
>> should evaluate whether that's not a better approach, doing some simple
>> performance comparisons.
>
> I, for one, agree with this position.

Well, I, for one, find it frustrating.  It seems pretty unhelpful to
bring this up only after the code has already been written.  The first
post on this thread was on May 10th.  The first version of the patch
was posted on June 16th.  This position was first articulated on
September 15th.

But, by all means, please feel free to do the performance comparison
and post the results.  I'd be curious to see them myself.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-09-29 Thread Peter Geoghegan
On Wed, Sep 28, 2016 at 8:06 PM, Andres Freund  wrote:
> On 2016-09-28 15:04:30 -0400, Robert Haas wrote:
>> Andres already
>> stated that he thinks working on hash-over-btree would be more
>> beneficial than fixing hash, but at this point it seems like he's the
>> only one who takes that position.
>
> Note that I did *NOT* take that position. I was saying that I think we
> should evaluate whether that's not a better approach, doing some simple
> performance comparisons.

I, for one, agree with this position.

-- 
Peter Geoghegan




Re: [HACKERS] Hash Indexes

2016-09-29 Thread Amit Kapila
On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas  wrote:
> On Wed, Sep 28, 2016 at 3:04 PM, Robert Haas  wrote:
>> I'll write another email with my thoughts about the rest of the patch.
>
> I think that the README changes for this patch need a fairly large
> amount of additional work.  Here are a few things I notice:
>
> - The confusion between buckets and pages hasn't been completely
> cleared up.  If you read the beginning of the README, the terminology
> is clearly set forth.  It says:
>
>>> A hash index consists of two or more "buckets", into which tuples are 
>>> placed whenever their hash key maps to the bucket number.  Each bucket in 
>>> the hash index comprises one or more index pages.  The bucket's first page 
>>> is permanently assigned to it when the bucket is created. Additional pages, 
>>> called "overflow pages", are added if the bucket receives too many tuples 
>>> to fit in the primary bucket page."
>
> But later on, you say:
>
>>> Scan will take a lock in shared mode on the primary bucket or on one of the 
>>> overflow page.
>
> So the correct terminology here would be "primary bucket page" not
> "primary bucket".
>
> - In addition, notice that there are two English errors in this
> sentence: the word "the" needs to be added to the beginning of the
> sentence, and the last word needs to be "pages" rather than "page".
> There are a considerable number of similar minor errors; if you can't
> fix them, I'll make a pass over it and clean it up.
>
> - The whole "lock definitions" section seems to me to be pretty loose
> and imprecise about what is happening.  For example, it uses the term
> "split-in-progress" without first defining it.  The sentence quoted
> above says that scans take a lock in shared mode either on the primary
> page or on one of the overflow pages, but it's not acceptable to document
> code by saying that it will do either A or B without explaining which
> one!  In
> fact, I think that a scan will take a content lock first on the
> primary bucket page and then on each overflow page in sequence,
> retaining a pin on the primary buffer page throughout the scan.  So it
> is not one or the other but both in a particular sequence, and that
> can and should be explained.
>
> Another problem with this section is that even when it's precise about
> what is going on, it's probably duplicating what is or should be in
> the following sections where the algorithms for each operation are
> explained.  In the original wording, this section explains what each
> lock protects, and then the following sections explain the algorithms
> in the context of those definitions.  Now, this section contains a
> sketch of the algorithm, and then the following sections lay it out
> again in more detail.  The question of what each lock protects has
> been lost.  Here's an attempt at some text to replace what you have
> here:
>
> ===
> Concurrency control for hash indexes is provided using buffer content
> locks, buffer pins, and cleanup locks.   Here as elsewhere in
> PostgreSQL, cleanup lock means that we hold an exclusive lock on the
> buffer and have observed at some point after acquiring the lock that
> we hold the only pin on that buffer.  For hash indexes, a cleanup lock
> on a primary bucket page represents the right to perform an arbitrary
> reorganization of the entire bucket, while a cleanup lock on an
> overflow page represents the right to perform a reorganization of just
> that page.  Therefore, scans retain a pin on both the primary bucket
> page and the overflow page they are currently scanning, if any.
>

I don't think we take cleanup lock on overflow page, so I will edit that part.

> Splitting a bucket requires a cleanup lock on both the old and new
> primary bucket pages.  VACUUM therefore takes a cleanup lock on every
> bucket page in turn in order to remove tuples.  It can also remove tuples
> copied to a new bucket by any previous split operation, because the
> cleanup lock taken on the primary bucket page guarantees that no scans
> which started prior to the most recent split can still be in progress.
> After cleaning each page individually, it attempts to take a cleanup
> lock on the primary bucket page in order to "squeeze" the bucket down
> to the minimum possible number of pages.
> ===
>
> As I was looking at the old text regarding deadlock risk, I realized
> what may be a serious problem.  Suppose process A is performing a scan
> of some hash index.  While the scan is suspended, it attempts to take
> a lock and is blocked by process B.  Process B, meanwhile, is running
> VACUUM on that hash index.  Eventually, it will do
> LockBufferForCleanup() on the hash bucket on which process A holds a
> buffer pin, resulting in an undetected deadlock. In the current
> coding, A would hold a heavyweight lock and B would attempt to acquire
> a conflicting heavyweight lock, and the deadlock detector would kill
> one of them.  This patch probably breaks that.  I notice that that's
> the only place where we attempt to acquire a buffer cleanup lock
> unconditionally; every place else, we acquire the lock conditionally, so
> there's no deadlock risk.

Re: [HACKERS] Hash Indexes

2016-09-28 Thread Robert Haas
On Wed, Sep 28, 2016 at 3:04 PM, Robert Haas  wrote:
> I'll write another email with my thoughts about the rest of the patch.

I think that the README changes for this patch need a fairly large
amount of additional work.  Here are a few things I notice:

- The confusion between buckets and pages hasn't been completely
cleared up.  If you read the beginning of the README, the terminology
is clearly set forth.  It says:

>> A hash index consists of two or more "buckets", into which tuples are placed 
>> whenever their hash key maps to the bucket number.  Each bucket in the hash 
>> index comprises one or more index pages.  The bucket's first page is 
>> permanently assigned to it when the bucket is created. Additional pages, 
>> called "overflow pages", are added if the bucket receives too many tuples to 
>> fit in the primary bucket page."

But later on, you say:

>> Scan will take a lock in shared mode on the primary bucket or on one of the 
>> overflow page.

So the correct terminology here would be "primary bucket page" not
"primary bucket".

- In addition, notice that there are two English errors in this
sentence: the word "the" needs to be added to the beginning of the
sentence, and the last word needs to be "pages" rather than "page".
There are a considerable number of similar minor errors; if you can't
fix them, I'll make a pass over it and clean it up.

- The whole "lock definitions" section seems to me to be pretty loose
and imprecise about what is happening.  For example, it uses the term
"split-in-progress" without first defining it.  The sentence quoted
above says that scans take a lock in shared mode either on the primary
page or on one of the overflow pages, but it's not acceptable to document
code by saying that it will do either A or B without explaining which
one!  In
fact, I think that a scan will take a content lock first on the
primary bucket page and then on each overflow page in sequence,
retaining a pin on the primary buffer page throughout the scan.  So it
is not one or the other but both in a particular sequence, and that
can and should be explained.

Another problem with this section is that even when it's precise about
what is going on, it's probably duplicating what is or should be in
the following sections where the algorithms for each operation are
explained.  In the original wording, this section explains what each
lock protects, and then the following sections explain the algorithms
in the context of those definitions.  Now, this section contains a
sketch of the algorithm, and then the following sections lay it out
again in more detail.  The question of what each lock protects has
been lost.  Here's an attempt at some text to replace what you have
here:

===
Concurrency control for hash indexes is provided using buffer content
locks, buffer pins, and cleanup locks.   Here as elsewhere in
PostgreSQL, cleanup lock means that we hold an exclusive lock on the
buffer and have observed at some point after acquiring the lock that
we hold the only pin on that buffer.  For hash indexes, a cleanup lock
on a primary bucket page represents the right to perform an arbitrary
reorganization of the entire bucket, while a cleanup lock on an
overflow page represents the right to perform a reorganization of just
that page.  Therefore, scans retain a pin on both the primary bucket
page and the overflow page they are currently scanning, if any.
Splitting a bucket requires a cleanup lock on both the old and new
primary bucket pages.  VACUUM therefore takes a cleanup lock on every
bucket page in turn in order to remove tuples.  It can also remove tuples
copied to a new bucket by any previous split operation, because the
cleanup lock taken on the primary bucket page guarantees that no scans
which started prior to the most recent split can still be in progress.
After cleaning each page individually, it attempts to take a cleanup
lock on the primary bucket page in order to "squeeze" the bucket down
to the minimum possible number of pages.
===

As I was looking at the old text regarding deadlock risk, I realized
what may be a serious problem.  Suppose process A is performing a scan
of some hash index.  While the scan is suspended, it attempts to take
a lock and is blocked by process B.  Process B, meanwhile, is running
VACUUM on that hash index.  Eventually, it will do
LockBufferForCleanup() on the hash bucket on which process A holds a
buffer pin, resulting in an undetected deadlock. In the current
coding, A would hold a heavyweight lock and B would attempt to acquire
a conflicting heavyweight lock, and the deadlock detector would kill
one of them.  This patch probably breaks that.  I notice that that's
the only place where we attempt to acquire a buffer cleanup lock
unconditionally; every place else, we acquire the lock conditionally,
so there's no deadlock risk.  Once we resolve this problem, the
paragraph about deadlock risk in this section should be revised to
explain whatever solution we come up with.
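One shape such a solution could take, sketched only to show the idea
(hashbucket_squeeze() is a hypothetical name; the buffer-manager calls are
the existing ones as I understand them):

    Buffer      bucket_buf = ReadBuffer(rel, bucket_blkno);

    if (ConditionalLockBufferForCleanup(bucket_buf))
    {
        /* we hold the only pin: safe to reorganize the whole bucket */
        hashbucket_squeeze(rel, bucket_buf);
        UnlockReleaseBuffer(bucket_buf);
    }
    else
    {
        /*
         * Someone else holds a pin; skip the squeeze rather than wait,
         * so we can never deadlock against a suspended scan.
         */
        ReleaseBuffer(bucket_buf);
    }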

By the 

Re: [HACKERS] Hash Indexes

2016-09-28 Thread Robert Haas
On Wed, Sep 28, 2016 at 3:06 PM, Andres Freund  wrote:
> On 2016-09-28 15:04:30 -0400, Robert Haas wrote:
>> Andres already
>> stated that he thinks working on hash-over-btree would be more
>> beneficial than fixing hash, but at this point it seems like he's the
>> only one who takes that position.
>
> Note that I did *NOT* take that position. I was saying that I think we
> should evaluate whether that's not a better approach, doing some simple
> performance comparisons.

OK, sorry.  I evidently misunderstood your position, for which I apologize.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-09-28 Thread Andres Freund
On 2016-09-28 15:04:30 -0400, Robert Haas wrote:
> Andres already
> stated that he thinks working on hash-over-btree would be more
> beneficial than fixing hash, but at this point it seems like he's the
> only one who takes that position.

Note that I did *NOT* take that position. I was saying that I think we
should evaluate whether that's not a better approach, doing some simple
performance comparisons.

Greetings,

Andres Freund




Re: [HACKERS] Hash Indexes

2016-09-28 Thread Robert Haas
On Tue, Sep 27, 2016 at 3:06 PM, Jesper Pedersen wrote:
> I have been running various tests and applications with this patch together
> with the WAL v5 patch [1].
>
> As I haven't seen any failures and don't currently have additional
> feedback, I'm moving this patch to "Ready for Committer" for their
> feedback.

Cool!  Thanks for reviewing.

Amit, can you please split the buffer manager changes in this patch
into a separate patch?  I think those changes can be committed first
and then we can try to deal with the rest of it.  Instead of adding
ConditionalLockBufferShared, I think we should add an "int mode"
argument to the existing ConditionalLockBuffer() function.  That way
is more consistent with LockBuffer().  It means an API break for any
third-party code that's calling this function, but that doesn't seem
like a big problem.  There are only 10 callers of
ConditionalLockBuffer() in our source tree and only one of those is in
contrib, so probably there isn't much third-party code that will be
affected by this, and I think it's worth it for the long-term
cleanliness.

As for CheckBufferForCleanup, I think that looks OK, but: (1) please
add an Assert() that we hold an exclusive lock on the buffer, using
LWLockHeldByMeInMode; and (2) I think we should rename it to something
like IsBufferCleanupOK.  Then, when it's used, it reads like English:
if (IsBufferCleanupOK(buf)) { /* clean up the buffer */ }.
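In other words, something along these lines (signatures sketched from the
suggestion above; the actual patch may of course differ in detail):

    /*
     * Existing function, extended with a mode argument mirroring
     * LockBuffer(): callers pass BUFFER_LOCK_SHARE or BUFFER_LOCK_EXCLUSIVE.
     * Returns true if the content lock was acquired without blocking.
     */
    bool ConditionalLockBuffer(Buffer buffer, int mode);

    /*
     * Renamed from CheckBufferForCleanup; asserts that the caller holds
     * the buffer's content lock exclusively, then reports whether our pin
     * is the only one, i.e. whether cleanup is allowed.
     */
    bool IsBufferCleanupOK(Buffer buffer);

    if (IsBufferCleanupOK(buf))
    {
        /* clean up the buffer */
    }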

I'll write another email with my thoughts about the rest of the patch.
For the record, Amit and I have had extensive discussions about this
effort off-list, and as Amit noted in his original post, the design is
based on suggestions which I previously posted to the list suggesting
how the issues with hash indexes might get fixed.  Therefore, I don't
expect to have too many basic disagreements regarding the design of
the patch; if anyone else does, please speak up.  Andres already
stated that he thinks working on hash-over-btree would be more
beneficial than fixing hash, but at this point it seems like he's the
only one who takes that position.  Even if we accept that working on
the hash AM is a reasonable thing to do, it doesn't follow that the
design Amit has adopted here is ideal.  I think it's reasonably good,
but that's only to be expected considering that I drafted the original
version of it and have been involved in subsequent discussions;
someone else might dislike something that I thought was OK, and any
such opinions certainly deserve a fair hearing.  To be clear, It's
been a long time since I've looked at any of the actual code in this
patch and I have at no point studied it deeply, so I expect that I may
find a fair number of things that I'm not happy with in detail, and
I'll write those up along with any design-level concerns that I do
have.  This should in no way forestall review from anyone else who
wants to get involved.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Hash Indexes

2016-09-27 Thread Jesper Pedersen

On 09/20/2016 09:02 AM, Amit Kapila wrote:

On Fri, Sep 16, 2016 at 11:22 AM, Amit Kapila  wrote:


I do want to work on it, but it is always possible that due to some
other work this might get delayed.  Also, I think there is always a
chance that while doing that work, we face some problem due to which
we might not be able to use that optimization.  So I will go with your
suggestion of removing hashscan.c and its usage for now, and then if
required we will pull it back.  If nobody else thinks otherwise, I
will update this in the next patch version.



In the attached patch, I have removed the support for hashscans.  I
think it might improve performance by a few percent (especially for
single-row-fetch transactions), as we no longer have the registration
and destruction of hashscans.




I have been running various tests and applications with this patch
together with the WAL v5 patch [1].


As I haven't seen any failures and don't currently have additional
feedback, I'm moving this patch to "Ready for Committer" for their feedback.


If others have comments, move the patch status back in the CommitFest 
application, please.


[1] 
https://www.postgresql.org/message-id/CAA4eK1KE%3D%2BkkowyYD0vmch%3Dph4ND3H1tViAB%2B0cWTHqjZDDfqg%40mail.gmail.com


Best regards,
 Jesper





Re: [HACKERS] Hash Indexes

2016-09-24 Thread Mark Kirkwood



On 25/09/16 18:18, Amit Kapila wrote:

On Sat, Sep 24, 2016 at 10:49 PM, Greg Stark  wrote:

On Thu, Sep 22, 2016 at 3:23 AM, Tom Lane  wrote:

But to kick the hash AM as such to the curb is to say
"sorry, there will never be O(1) index lookups in Postgres".

Well there's plenty of halfway solutions for that. We could move hash
indexes to contrib or even have them in core as experimental_hash or
unlogged_hash until the day they achieve their potential.

We definitely shouldn't discourage people from working on hash indexes


Okay, but to me it appears that naming it experimental_hash or moving
it to contrib could discourage people, or at the very least people
will be less motivated.  Thinking along those lines a year or so back
would have been a wise direction, but now, when a lot of work has
already been done for hash indexes (patches to make them WAL-enabled
and more concurrent and performant, and a pageinspect module, are
available) and still more is in progress, that sounds like a step
backward rather than a step forward.



+1

I think so too - I've seen many email threads over the years on this
list that essentially state "we need hash indexes WAL-logged to make
progress with them"... and Amit et al have done this (more than this,
obviously - made 'em better too), and I'm astonished that folk are
suggesting anything other than 'commit this great patch now!'...


regards

Mark




Re: [HACKERS] Hash Indexes

2016-09-24 Thread Amit Kapila
On Sat, Sep 24, 2016 at 10:49 PM, Greg Stark  wrote:
> On Thu, Sep 22, 2016 at 3:23 AM, Tom Lane  wrote:
>> But to kick the hash AM as such to the curb is to say
>> "sorry, there will never be O(1) index lookups in Postgres".
>
> Well there's plenty of halfway solutions for that. We could move hash
> indexes to contrib or even have them in core as experimental_hash or
> unlogged_hash until the day they achieve their potential.
>
> We definitely shouldn't discourage people from working on hash indexes
>

Okay, but to me it appears that naming it experimental_hash or moving
it to contrib could discourage people, or at the very least people
will be less motivated.  Thinking along those lines a year or so back
would have been a wise direction, but now, when a lot of work has
already been done for hash indexes (patches to make them WAL-enabled
and more concurrent and performant, and a pageinspect module, are
available) and still more is in progress, that sounds like a step
backward rather than a step forward.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash Indexes

2016-09-24 Thread Tom Lane
Greg Stark  writes:
> On Thu, Sep 22, 2016 at 3:23 AM, Tom Lane  wrote:
>> But to kick the hash AM as such to the curb is to say
>> "sorry, there will never be O(1) index lookups in Postgres".

> Well there's plenty of halfway solutions for that. We could move hash
> indexes to contrib or even have them in core as experimental_hash or
> unlogged_hash until the day they achieve their potential.

> We definitely shouldn't discourage people from working on hash indexes
> but we probably shouldn't have released ten years worth of a feature
> marked "please don't use this" that's guaranteed to corrupt your
> database and cause weird problems if you use it in any of a number of
> supported situations (including non-replicated system recovery that
> has been a bedrock feature of Postgres for over a decade).

Obviously that has not been a good situation, but we lack a time
machine to retroactively make it better, so I don't see much point
in fretting over what should have been done in the past.

> Arguably adding a hashed btree opclass and relegating the existing
> code to an experimental state would actually encourage development
> since a) Users would actually be likely to use the hashed btree
> opclass so any work on a real hash opclass would have a real userbase
> ready and waiting for delivery, b) delivering a real hash opclass
> wouldn't involve convincing users to unlearn a million instructions
> warning not to use this feature and c) The fear of breaking existing
> users use cases and databases would be less and pg_upgrade would be an
> ignorable problem at least until the day comes for the big cutover of
> the default to the new opclass.

I'm not following your point here.  There is no hash-over-btree AM and
nobody (including Andres) has volunteered to create one.  Meanwhile,
we have a patch in hand to WAL-enable the hash AM.  Why would we do
anything other than apply that patch and stop saying hash is deprecated?

regards, tom lane



