Re: [HACKERS] Parallel Bitmap Heap Scan - Prefetch pages are not updated properly

2017-04-04 Thread Robert Haas
On Tue, Apr 4, 2017 at 6:24 AM, Dilip Kumar  wrote:
> While analyzing the coverage for the prefetching part, I found an
> issue where prefetch_pages was not updated properly when executing in
> parallel mode.
>
> The attached patch fixes this.

Wow, OK.  Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Parallel Bitmap Heap Scan - Prefetch pages are not updated properly

2017-04-04 Thread Dilip Kumar
While analyzing the coverage for the prefetching part, I found an
issue where prefetch_pages was not updated properly when executing in
parallel mode.

The attached patch fixes this.
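
For context, the parallel-mode prefetch accounting this fix concerns
follows the convention worked out in the review downthread; roughly
like the sketch below (this illustrates the surrounding convention,
not the attached patch itself):

/*
 * Each worker claims a prefetch slot under the shared mutex; only if
 * the claim succeeds does it issue the prefetch.  pstate is the
 * shared ParallelBitmapState.
 */
bool        do_prefetch = false;

SpinLockAcquire(&pstate->mutex);
if (pstate->prefetch_pages < pstate->prefetch_target)
{
    pstate->prefetch_pages++;
    do_prefetch = true;
}
SpinLockRelease(&pstate->mutex);

if (do_prefetch)
    PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);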

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


parallel_bitmap_prefetch_fix.patch
Description: Binary data



Re: [HACKERS] Parallel bitmap heap scan

2017-03-27 Thread Dilip Kumar
On Mon, Mar 27, 2017 at 5:58 PM, Robert Haas  wrote:
> Failing to pass DSA_ALLOC_HUGE is an obvious oversight, but I think
> the ptbase == NULL check shouldn't be needed, because we're not
> passing DSA_ALLOC_NO_OOM.  And that's good, because this is going to
> be called from SH_CREATE, which isn't prepared for a NULL return
> anyway.
>
> Am I all wet?

Yes, you are right that we are not passing DSA_ALLOC_NO_OOM, so
dsa_allocate_extended will raise an error if the allocation fails.
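
To spell out the flag semantics in question (a minimal sketch of the
dsa_allocate_extended contract as I understand it):

/* Without DSA_ALLOC_NO_OOM, failure raises an ERROR; no check needed. */
dsa_pointer dp = dsa_allocate_extended(area, size, DSA_ALLOC_HUGE);

/* With DSA_ALLOC_NO_OOM, the caller must check the result instead. */
dp = dsa_allocate_extended(area, size, DSA_ALLOC_HUGE | DSA_ALLOC_NO_OOM);
if (!DsaPointerIsValid(dp))
    elog(ERROR, "out of memory");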


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel bitmap heap scan

2017-03-27 Thread Robert Haas
On Mon, Mar 27, 2017 at 5:02 AM, Dilip Kumar  wrote:
> On Mon, Mar 27, 2017 at 12:53 PM, Rafia Sabih  wrote:
>> Recently, while testing TPC-H at scale factor 300, I stumbled onto an
>> error for Q6.  The test environment is as follows:
>> work_mem = 1GB,
>> shared_buffers = 10GB,
>> effective_cache_size = 10GB
>> random_page_cost = seq_page_cost = 0.1
>>
>> The error message is: ERROR:  invalid DSA memory alloc request size 1610612740
>> In case required, the stack trace is as follows,
>
> Thanks for reporting.  In pagetable_allocate, the DSA_ALLOC_HUGE flag
> was missing while allocating memory from the DSA.  I have also handled
> a NULL pointer return from dsa_get_address.
>
> Please verify with the attached patch.

Failing to pass DSA_ALLOC_HUGE is an obvious oversight, but I think
the ptbase == NULL check shouldn't be needed, because we're not
passing DSA_ALLOC_NO_OOM.  And that's good, because this is going to
be called from SH_CREATE, which isn't prepared for a NULL return
anyway.

Am I all wet?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

2017-03-27 Thread Dilip Kumar
On Mon, Mar 27, 2017 at 12:53 PM, Rafia Sabih  wrote:
> Recently, while testing TPC-H at scale factor 300, I stumbled onto an
> error for Q6.  The test environment is as follows:
> work_mem = 1GB,
> shared_buffers = 10GB,
> effective_cache_size = 10GB
> random_page_cost = seq_page_cost = 0.1
>
> The error message is: ERROR:  invalid DSA memory alloc request size 1610612740
> In case required, the stack trace is as follows,

Thanks for reporting.  In pagetable_allocate, the DSA_ALLOC_HUGE flag
was missing while allocating memory from the DSA.  I have also handled
a NULL pointer return from dsa_get_address.

Please verify with the attached patch.
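
For context: the failing request (1610612740 bytes, about 1.5GB)
exceeds the 1GB MaxAllocSize cap that dsa_allocate_extended enforces
unless DSA_ALLOC_HUGE is passed.  A simplified sketch of the shape of
the fix in pagetable_allocate (the attached patch is authoritative,
and the serial-allocation path is omitted here):

static inline void *
pagetable_allocate(pagetable_hash *pagetable, Size size)
{
    TIDBitmap  *tbm = (TIDBitmap *) pagetable->private_data;
    PTEntryArray *ptbase;

    /* Requests above MaxAllocSize are rejected without the HUGE flag. */
    tbm->dsapagetable = dsa_allocate_extended(tbm->dsa,
                                              sizeof(PTEntryArray) + size,
                                              DSA_ALLOC_HUGE |
                                              DSA_ALLOC_ZERO);
    ptbase = dsa_get_address(tbm->dsa, tbm->dsapagetable);
    return ptbase->ptentry;
}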


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


bitmap_hugepage_fix.patch
Description: Binary data



Re: [HACKERS] Parallel bitmap heap scan

2017-03-27 Thread Rafia Sabih
On Wed, Mar 8, 2017 at 11:52 PM, Robert Haas  wrote:
> On Wed, Mar 8, 2017 at 1:18 PM, Tom Lane  wrote:
>> Robert Haas  writes:
>>> What I'm using is:
>>
>>> Configured with:
>>> --prefix=/Applications/Xcode.app/Contents/Developer/usr
>>> --with-gxx-include-dir=/usr/include/c++/4.2.1
>>> Apple LLVM version 7.0.2 (clang-700.1.81)
>>> Target: x86_64-apple-darwin14.5.0
>>> Thread model: posix
>>
>> Hm.  I noticed that longfin didn't spit up on it either, despite having
>> -Werror turned on.  That's a slightly newer version, but still Apple's
>> clang:
>>
>> $ gcc -v
>> Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr 
>> --with-gxx-include-dir=/usr/include/c++/4.2.1
>> Apple LLVM version 8.0.0 (clang-800.0.42.1)
>> Target: x86_64-apple-darwin16.4.0
>> Thread model: posix
>> InstalledDir: 
>> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
>
> Yeah.  I think on my previous MacBook Pro, you could do this without
> generating a warning:
>
> int x;
> printf("%d\n", x);
>
> The compiler on this one detects that case, but that seems to be about
> as far as it goes.

Recently, while testing TPC-H at scale factor 300, I stumbled onto an
error for Q6.  The test environment is as follows:
work_mem = 1GB,
shared_buffers = 10GB,
effective_cache_size = 10GB
random_page_cost = seq_page_cost = 0.1

The error message is: ERROR:  invalid DSA memory alloc request size 1610612740
In case required, the stack trace is as follows,

#0  errfinish (dummy=0) at elog.c:415
#1  0x10738820 in elog_finish (elevel=20, fmt=0x109115d0
"invalid DSA memory alloc request size %zu") at elog.c:1378
#2  0x10778824 in dsa_allocate_extended (area=0x1001b53b138,
size=1610612740, flags=4) at dsa.c:676
#3  0x103a3e60 in pagetable_allocate (pagetable=0x1001b52a590,
size=1610612736) at tidbitmap.c:1521
#4  0x103a0488 in pagetable_grow (tb=0x1001b52a590,
newsize=33554432) at ../../../src/include/lib/simplehash.h:379
#5  0x103a07b0 in pagetable_insert (tb=0x1001b52a590,
key=34962730, found=0x3fffc705ba48 "") at
../../../src/include/lib/simplehash.h:504
#6  0x103a3354 in tbm_get_pageentry (tbm=0x1001b526fa0,
pageno=34962730) at tidbitmap.c:1235
#7  0x103a1614 in tbm_add_tuples (tbm=0x1001b526fa0,
tids=0x1001b51d22a, ntids=1, recheck=0 '\000') at tidbitmap.c:425
#8  0x100fdf24 in btgetbitmap (scan=0x1001b51c4a0,
tbm=0x1001b526fa0) at nbtree.c:460
#9  0x100f2c48 in index_getbitmap (scan=0x1001b51c4a0,
bitmap=0x1001b526fa0) at indexam.c:726
#10 0x1033c7d8 in MultiExecBitmapIndexScan
(node=0x1001b51c180) at nodeBitmapIndexscan.c:91
#11 0x1031c0b4 in MultiExecProcNode (node=0x1001b51c180) at
execProcnode.c:611
#12 0x1033a8a8 in BitmapHeapNext (node=0x1001b5187a8) at
nodeBitmapHeapscan.c:143
#13 0x1032a094 in ExecScanFetch (node=0x1001b5187a8,
accessMtd=0x1033a6c8 , recheckMtd=0x1033bab8
) at execScan.c:95
#14 0x1032a194 in ExecScan (node=0x1001b5187a8,
accessMtd=0x1033a6c8 , recheckMtd=0x1033bab8
) at execScan.c:162
#15 0x1033bb88 in ExecBitmapHeapScan (node=0x1001b5187a8) at
nodeBitmapHeapscan.c:667

Please feel free to ask if any more information is required for this.
-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/




Re: [HACKERS] Parallel bitmap heap scan

2017-03-08 Thread Robert Haas
On Wed, Mar 8, 2017 at 1:18 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> What I'm using is:
>
>> Configured with:
>> --prefix=/Applications/Xcode.app/Contents/Developer/usr
>> --with-gxx-include-dir=/usr/include/c++/4.2.1
>> Apple LLVM version 7.0.2 (clang-700.1.81)
>> Target: x86_64-apple-darwin14.5.0
>> Thread model: posix
>
> Hm.  I noticed that longfin didn't spit up on it either, despite having
> -Werror turned on.  That's a slightly newer version, but still Apple's
> clang:
>
> $ gcc -v
> Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr 
> --with-gxx-include-dir=/usr/include/c++/4.2.1
> Apple LLVM version 8.0.0 (clang-800.0.42.1)
> Target: x86_64-apple-darwin16.4.0
> Thread model: posix
> InstalledDir: 
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Yeah.  I think on my previous MacBook Pro, you could do this without
generating a warning:

int x;
printf("%d\n", x);

The compiler on this one detects that case, but that seems to be about
as far as it goes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

2017-03-08 Thread Tom Lane
Robert Haas  writes:
> What I'm using is:

> Configured with:
> --prefix=/Applications/Xcode.app/Contents/Developer/usr
> --with-gxx-include-dir=/usr/include/c++/4.2.1
> Apple LLVM version 7.0.2 (clang-700.1.81)
> Target: x86_64-apple-darwin14.5.0
> Thread model: posix

Hm.  I noticed that longfin didn't spit up on it either, despite having
-Werror turned on.  That's a slightly newer version, but still Apple's
clang:

$ gcc -v
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr 
--with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 8.0.0 (clang-800.0.42.1)
Target: x86_64-apple-darwin16.4.0
Thread model: posix
InstalledDir: 
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

regards, tom lane




Re: [HACKERS] Parallel bitmap heap scan

2017-03-08 Thread Robert Haas
On Wed, Mar 8, 2017 at 12:53 PM, Tom Lane  wrote:
>> Thanks.  Sorry for the hassle; my compiler isn't as picky about this
>> as I would like, and apparently Dilip's isn't either.
>
> Might be interesting to see whether -O level affects it.  In principle,
> whether you get the warning should depend on how much the compiler has
> analyzed the logic flow ...

What I'm using is:

Configured with:
--prefix=/Applications/Xcode.app/Contents/Developer/usr
--with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin14.5.0
Thread model: posix

While I haven't experimented with this too extensively, my general
impression is that this thing is extremely tolerant of uninitialized
variables.  I just tried compiling nodeBitmapHeapscan.c with -Wall
-Werror and each of -O0, -O1, -O2, and -O3, and none of those produced
any warnings.  I've been reluctant to go to the hassle of installing a
different compiler...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

2017-03-08 Thread Tom Lane
Robert Haas  writes:
> On Wed, Mar 8, 2017 at 12:45 PM, Tom Lane  wrote:
>> Jeff Janes  writes:
>>> I'm getting some compiler warnings in gcc version 4.4.7 20120313 (Red Hat
>>> 4.4.7-17) (GCC):

>> Me too.  Fix pushed.

> Thanks.  Sorry for the hassle; my compiler isn't as picky about this
> as I would like, and apparently Dilip's isn't either.

Might be interesting to see whether -O level affects it.  In principle,
whether you get the warning should depend on how much the compiler has
analyzed the logic flow ...

regards, tom lane




Re: [HACKERS] Parallel bitmap heap scan

2017-03-08 Thread Robert Haas
On Wed, Mar 8, 2017 at 12:45 PM, Tom Lane  wrote:
> Jeff Janes  writes:
>> I'm getting some compiler warnings in gcc version 4.4.7 20120313 (Red Hat
>> 4.4.7-17) (GCC):
>
> Me too.  Fix pushed.

Thanks.  Sorry for the hassle; my compiler isn't as picky about this
as I would like, and apparently Dilip's isn't either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

2017-03-08 Thread Tom Lane
Jeff Janes  writes:
> I'm getting some compiler warnings in gcc version 4.4.7 20120313 (Red Hat
> 4.4.7-17) (GCC):

Me too.  Fix pushed.

regards, tom lane




Re: [HACKERS] Parallel bitmap heap scan

2017-03-08 Thread Jeff Janes
On Wed, Mar 8, 2017 at 9:08 AM, Robert Haas  wrote:

> On Wed, Mar 8, 2017 at 11:20 AM, Dilip Kumar  wrote:
> > Right, done that way
>
> This didn't compile because you bobbled some code in
> src/backend/nodes, but it was a trivial mistake so I fixed it.
>
> Committed with that fix and a bunch of minor cosmetic changes.
>


I'm getting some compiler warnings in gcc version 4.4.7 20120313 (Red Hat
4.4.7-17) (GCC):

nodeBitmapHeapscan.c: In function 'BitmapHeapNext':
nodeBitmapHeapscan.c:79: warning: 'tbmiterator' may be used uninitialized in this function
nodeBitmapHeapscan.c:80: warning: 'shared_tbmiterator' may be used uninitialized in this function
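
(The usual remedy for this class of warning, and presumably the shape
of the fix pushed here, is to give both iterators a defined value on
every path; a sketch:)

/* Initialize at declaration so every branch sees a defined value. */
TBMIterator *tbmiterator = NULL;
TBMSharedIterator *shared_tbmiterator = NULL;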

Cheers,

Jeff


Re: [HACKERS] Parallel bitmap heap scan

2017-03-08 Thread Robert Haas
On Wed, Mar 8, 2017 at 11:20 AM, Dilip Kumar  wrote:
> Right, done that way

This didn't compile because you bobbled some code in
src/backend/nodes, but it was a trivial mistake so I fixed it.

Committed with that fix and a bunch of minor cosmetic changes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

2017-03-08 Thread Dilip Kumar
On Wed, Mar 8, 2017 at 8:28 PM, Robert Haas  wrote:
> How about adding a regression test?

Added
>
> bitmap_subplan_mark_shared could use castNode(), which seems like it
> would be better style.  Maybe some other places, too.
>
> + ParallelBitmapPopulate
> + Waiting for the leader to populate the TidBitmap.
> +
> +
>
> If you build the documentation, you'll find that this doesn't come out
> right; you need to add 1 to the value of the nearest preceding
> "morerows".  (I fixed a similar issue with 0001 while committing.)

Fixed
>
> +/*---
> + * Check the current state
> + * If state is
> + * BM_INITIAL : We become the leader and set it to BM_INPROGRESS
> + * BM_INPROGRESS : We need to wait till leader creates bitmap
> + * BM_FINISHED   : bitmap is ready so no need to wait
> + *---
>
> The formatting of this comment is slightly off - the comment for
> BM_INITIAL isn't aligned the same as the others.  But I would just
> delete the whole comment, since more or less it recapitulates the
> function header comment anyway.

Removed.
>
> I wonder if BitmapShouldInitializeSharedState couldn't be written a
> little more compactly overall, like this:
>
> {
> SharedBitmapState   state;
>
> while (1)
> {
> SpinLockAcquire(&pstate->mutex);
> state = pstate->state;
> if (pstate->state == BM_INITIAL)
> pstate->state = BM_INPROGRESS;
> SpinLockRelease(&pstate->mutex);
>
> /* If we are leader or leader has already created a TIDBITMAP */
> if (state != BM_INPROGRESS)
> break;
>
> /* Sleep until leader finishes creating the bitmap */
> ConditionVariableSleep(&pstate->cv, WAIT_EVENT_PARALLEL_BITMAP_SCAN);
> }
>
> ConditionVariableCancelSleep();
>
> return (state == BM_INITIAL);
> }

This looks good; done this way.

>
> +/*
> + * By this time we have already populated the TBM and
> + * initialized the shared iterators so set the state to
> + * BM_FINISHED and wake up others.
> + */
> +SpinLockAcquire(&pstate->mutex);
> +pstate->state = BM_FINISHED;
> +SpinLockRelease(&pstate->mutex);
> +ConditionVariableBroadcast(&pstate->cv);
>
> I think it would be good to have a function for this, like
> BitmapDoneInitializingSharedState(), and just call that function here.

Done
>
> +SpinLockAcquire(&pstate->mutex);
> +
> +/*
> + * Recheck under the mutex, If some other process has already 
> done
> + * the enough prefetch then we need not to do anything.
> + */
> +if (pstate->prefetch_pages >= pstate->prefetch_target)
> +SpinLockRelease(&pstate->mutex);
> +return;
> +SpinLockRelease(&pstate->mutex);
>
> I think it would be clearer to write this as:
>
> SpinLockAcquire(&pstate->mutex);
> do_prefetch = (pstate->prefetch_pages < pstate->prefetch_target);
> SpinLockRelease(&pstate->mutex);
> if (!do_prefetch)
>     return;
>
> Then it's more obvious what's going on with the spinlock.  But
> actually what I would do is also roll in the increment to prefetch
> pages in there, so that you don't have to reacquire the spinlock after
> calling PrefetchBuffer:
>
> bool do_prefetch = false;
> SpinLockAcquire(&pstate->mutex);
> if (pstate->prefetch_pages < pstate->prefetch_target)
> {
> pstate->prefetch_pages++;
> do_prefetch = true;
> }
> SpinLockRelease(&pstate->mutex);
>
> That seems like it will reduce the amount of excess prefetching
> considerably, and also simplify the code and cut the spinlock
> acquisitions by 50%.
>
Right, done that way
> Overall I think this is in pretty good shape.

Thanks.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


0003-parallel-bitmap-heapscan-v9.patch
Description: Binary data



Re: [HACKERS] Parallel bitmap heap scan

2017-03-08 Thread Robert Haas
Reviewing 0003:

How about adding a regression test?

bitmap_subplan_mark_shared could use castNode(), which seems like it
would be better style.  Maybe some other places, too.

+ ParallelBitmapPopulate
+ Waiting for the leader to populate the TidBitmap.
+
+

If you build the documentation, you'll find that this doesn't come out
right; you need to add 1 to the value of the nearest preceding
"morerows".  (I fixed a similar issue with 0001 while committing.)

+/*---
+ * Check the current state
+ * If state is
+ * BM_INITIAL : We become the leader and set it to BM_INPROGRESS
+ * BM_INPROGRESS : We need to wait till leader creates bitmap
+ * BM_FINISHED   : bitmap is ready so no need to wait
+ *---

The formatting of this comment is slightly off - the comment for
BM_INITIAL isn't aligned the same as the others.  But I would just
delete the whole comment, since more or less it recapitulates the
function header comment anyway.

I wonder if BitmapShouldInitializeSharedState couldn't be written a
little more compactly overall, like this:

{
SharedBitmapState   state;

while (1)
{
SpinLockAcquire(&pstate->mutex);
state = pstate->state;
if (pstate->state == BM_INITIAL)
pstate->state = BM_INPROGRESS;
SpinLockRelease(&pstate->mutex);

/* If we are leader or leader has already created a TIDBITMAP */
if (state != BM_INPROGRESS)
break;

/* Sleep until leader finishes creating the bitmap */
ConditionVariableSleep(&pstate->cv, WAIT_EVENT_PARALLEL_BITMAP_SCAN);
}

ConditionVariableCancelSleep();

return (state == BM_INITIAL);
}
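
A caller would then look roughly like this (a sketch; names as used
elsewhere in this thread, with the helper suggested just below):

if (BitmapShouldInitializeSharedState(pstate))
{
    /* We are the elected leader: build the shared TIDBitmap. */
    tbm = (TIDBitmap *) MultiExecProcNode(outerPlanState(node));

    /* Publish the shared iterators, then wake up the waiters. */
    BitmapDoneInitializingSharedState(pstate);
}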

+/*
+ * By this time we have already populated the TBM and
+ * initialized the shared iterators so set the state to
+ * BM_FINISHED and wake up others.
+ */
+SpinLockAcquire(&pstate->mutex);
+pstate->state = BM_FINISHED;
+SpinLockRelease(&pstate->mutex);
+ConditionVariableBroadcast(&pstate->cv);

I think it would be good to have a function for this, like
BitmapDoneInitializingSharedState(), and just call that function here.
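
Such a helper would just wrap the quoted sequence, something like this
(assuming the ParallelBitmapState naming used elsewhere in this patch):

static inline void
BitmapDoneInitializingSharedState(ParallelBitmapState *pstate)
{
    SpinLockAcquire(&pstate->mutex);
    pstate->state = BM_FINISHED;
    SpinLockRelease(&pstate->mutex);
    ConditionVariableBroadcast(&pstate->cv);
}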

+SpinLockAcquire(&pstate->mutex);
+
+/*
+ * Recheck under the mutex.  If some other process has already done
+ * enough prefetching, then we need not do anything.
+ */
+if (pstate->prefetch_pages >= pstate->prefetch_target)
+SpinLockRelease(&pstate->mutex);
+return;
+SpinLockRelease(&pstate->mutex);

I think it would be clearer to write this as:

SpinLockAcquire(&pstate->mutex);
do_prefetch = (pstate->prefetch_pages < pstate->prefetch_target);
SpinLockRelease(&pstate->mutex);
if (!do_prefetch)
    return;

Then it's more obvious what's going on with the spinlock.  But
actually what I would do is also roll in the increment to prefetch
pages in there, so that you don't have to reacquire the spinlock after
calling PrefetchBuffer:

bool do_prefetch = false;
SpinLockAcquire(&pstate->mutex);
if (pstate->prefetch_pages < pstate->prefetch_target)
{
pstate->prefetch_pages++;
do_prefetch = true;
}
SpinLockRelease(&pstate->mutex);

That seems like it will reduce the amount of excess prefetching
considerably, and also simplify the code and cut the spinlock
acquisitions by 50%.

Overall I think this is in pretty good shape.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

2017-03-08 Thread Dilip Kumar
On Wed, Mar 8, 2017 at 6:42 PM, Robert Haas  wrote:
> I don't think I understand exactly why this system might be prone to a
> little bit of extra prefetching - can you explain further?
Let me explain with an example.  Suppose there are 2 workers
prefetching jointly, and prefetch_target is 3.  When both workers have
prefetched 1 page each, the prefetch_pages count will be 2, so both
workers will go for the next prefetch, because prefetch_pages is still
less than prefetch_target.  So both workers prefetch again, and in
total they prefetch 4 pages.

> I don't think it will hurt anything as long as we are talking about a small
> number of extra prefetches here and there.

I completely agree with this point; I mentioned it in the mail so that
it doesn't go unnoticed.

And whatever extra prefetching we have done, we will use it anyway,
unless we stop execution early because of some LIMIT clause.

>
>> Fixed
>
> Committed 0001.

Thanks


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel bitmap heap scan

2017-03-08 Thread Robert Haas
On Wed, Mar 8, 2017 at 1:49 AM, Dilip Kumar  wrote:
> On Tue, Mar 7, 2017 at 10:10 PM, Dilip Kumar  wrote:
>>> It's not about speed.  It's about not forgetting to prefetch.  Suppose
>>> that worker 1 becomes the prefetch worker but then doesn't return to
>>> the Bitmap Heap Scan node for a long time because it's busy in some
>>> other part of the plan tree.  Now you just stop prefetching; that's
>>> bad.  You want prefetching to continue regardless of which workers are
>>> busy doing what; as long as SOME worker is executing the parallel
>>> bitmap heap scan, prefetching should continue as needed.
>>
>> Right, I missed this part. I will fix this.
> I have fixed this part.  After doing that, I realised that if
> multiple processes are prefetching, then in boundary cases (e.g.
> suppose prefetch_target is 3 and prefetch_pages is at 2) there may be
> some extra prefetching, but those prefetched blocks will eventually
> be used.  Another solution to this problem is that we can increase
> prefetch_pages in advance and then call tbm_shared_iterate; this will
> avoid the extra prefetching.  But I am not sure what will be best
> here.

I don't think I understand exactly why this system might be prone to a
little bit of extra prefetching - can you explain further? I don't
think it will hurt anything as long as we are talking about a small
number of extra prefetches here and there.

> Fixed

Committed 0001.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

2017-03-07 Thread Dilip Kumar
On Tue, Mar 7, 2017 at 10:10 PM, Dilip Kumar  wrote:
>> It's not about speed.  It's about not forgetting to prefetch.  Suppose
>> that worker 1 becomes the prefetch worker but then doesn't return to
>> the Bitmap Heap Scan node for a long time because it's busy in some
>> other part of the plan tree.  Now you just stop prefetching; that's
>> bad.  You want prefetching to continue regardless of which workers are
>> busy doing what; as long as SOME worker is executing the parallel
>> bitmap heap scan, prefetching should continue as needed.
>
> Right, I missed this part. I will fix this.
I have fixed this part.  After doing that, I realised that if multiple
processes are prefetching, then in boundary cases (e.g. suppose
prefetch_target is 3 and prefetch_pages is at 2) there may be some
extra prefetching, but those prefetched blocks will eventually be
used.  Another solution to this problem is that we can increase
prefetch_pages in advance and then call tbm_shared_iterate; this will
avoid the extra prefetching.  But I am not sure what will be best
here.


On Tue, Mar 7, 2017 at 9:41 PM, Robert Haas  wrote:
> +if (--ptbase->refcount == 0)
> +dsa_free(dsa, istate->pagetable);
> +
> +if (istate->spages)
> +{
> +ptpages = dsa_get_address(dsa, istate->spages);
> +if (--ptpages->refcount == 0)
> +dsa_free(dsa, istate->spages);
> +}
> +if (istate->schunks)
> +{
> +ptchunks = dsa_get_address(dsa, istate->schunks);
> +if (--ptchunks->refcount == 0)
> +dsa_free(dsa, istate->schunks);
> +}
>
> This doesn't involve any locking, which I think will happen to work
> with the current usage pattern but doesn't seem very robust in
> general.  I think you either need the refcounts to be protected by a
> spinlock, or maybe better, use pg_atomic_uint32 for them.  You want
> something like if (pg_atomic_sub_fetch_u32(&refcount, 1) == 0) {
> dsa_free(...) }
>
> Otherwise, there's no guarantee it will get freed exactly once.

Fixed


On Tue, Mar 7, 2017 at 9:34 PM, Robert Haas  wrote:
> You've still got this:
>
> +   if (DsaPointerIsValid(node->pstate->tbmiterator))
> +   tbm_free_shared_area(dsa, node->pstate->tbmiterator);
> +
> +   if (DsaPointerIsValid(node->pstate->prefetch_iterator))
> +   dsa_free(dsa, node->pstate->prefetch_iterator);
>
> I'm trying to get to a point where both calls use
> tbm_free_shared_area() - i.e. no peeking behind the abstraction layer.

Fixed

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


0001-tidbitmap-support-shared-v8.patch
Description: Binary data


0003-parallel-bitmap-heapscan-v8.patch
Description: Binary data



Re: [HACKERS] Parallel bitmap heap scan

2017-03-07 Thread Dilip Kumar
On Tue, Mar 7, 2017 at 10:07 PM, Robert Haas  wrote:
> It's not about speed.  It's about not forgetting to prefetch.  Suppose
> that worker 1 becomes the prefetch worker but then doesn't return to
> the Bitmap Heap Scan node for a long time because it's busy in some
> other part of the plan tree.  Now you just stop prefetching; that's
> bad.  You want prefetching to continue regardless of which workers are
> busy doing what; as long as SOME worker is executing the parallel
> bitmap heap scan, prefetching should continue as needed.

Right, I missed this part. I will fix this.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel bitmap heap scan

2017-03-07 Thread Robert Haas
On Tue, Mar 7, 2017 at 11:27 AM, Dilip Kumar  wrote:
> On Tue, Mar 7, 2017 at 9:44 PM, Robert Haas  wrote:
>> I mean, IIUC, the call to PrefetchBuffer() is not done under any lock.
>> And that's the slow part.  The tiny amount of time we spend updating
>> the prefetch information under the mutex should be insignificant
>> compared to the cost of actually reading the buffer.  Unless I'm
>> missing something.
>
> Okay, but IIUC, PrefetchBuffer is an async call to load the buffer if
> it's not already in shared buffers.  So if, instead of one process
> making multiple async calls to PrefetchBuffer, we spread those calls
> across multiple processes, will it be any faster?  Or are you
> thinking that at least we can make the BufTableLookup call parallel,
> because that is not an async call?

It's not about speed.  It's about not forgetting to prefetch.  Suppose
that worker 1 becomes the prefetch worker but then doesn't return to
the Bitmap Heap Scan node for a long time because it's busy in some
other part of the plan tree.  Now you just stop prefetching; that's
bad.  You want prefetching to continue regardless of which workers are
busy doing what; as long as SOME worker is executing the parallel
bitmap heap scan, prefetching should continue as needed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

2017-03-07 Thread Dilip Kumar
On Tue, Mar 7, 2017 at 9:44 PM, Robert Haas  wrote:
> I mean, IIUC, the call to PrefetchBuffer() is not done under any lock.
> And that's the slow part.  The tiny amount of time we spend updating
> the prefetch information under the mutex should be insignificant
> compared to the cost of actually reading the buffer.  Unless I'm
> missing something.

Okay, but IIUC, PrefetchBuffer is an async call to load the buffer if
it's not already in shared buffers.  So if, instead of one process
making multiple async calls to PrefetchBuffer, we spread those calls
across multiple processes, will it be any faster?  Or are you thinking
that at least we can make the BufTableLookup call parallel, because
that is not an async call?
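
For reference, the call in question is driven from the prefetch
iterator roughly like this (a sketch; names follow
nodeBitmapHeapscan.c and the patch's shared-iterator API):

/*
 * One step of prefetching from the shared iterator.  PrefetchBuffer
 * only issues an async read hint (e.g. posix_fadvise) when the block
 * is not already in shared buffers.
 */
TBMIterateResult *tbmpre = tbm_shared_iterate(prefetch_iterator);

if (tbmpre != NULL)
    PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);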


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel bitmap heap scan

2017-03-07 Thread Robert Haas
On Tue, Mar 7, 2017 at 11:09 AM, Dilip Kumar  wrote:
> On Tue, Mar 7, 2017 at 9:34 PM, Robert Haas  wrote:
>>
>> +   if (DsaPointerIsValid(node->pstate->tbmiterator))
>> +   tbm_free_shared_area(dsa, node->pstate->tbmiterator);
>> +
>> +   if (DsaPointerIsValid(node->pstate->prefetch_iterator))
>> +   dsa_free(dsa, node->pstate->prefetch_iterator);
>
> As per the latest code, both should be calling tbm_free_shared_area,
> because tbm_free_shared_area is capable of handling that.  My silly
> mistake; I will fix it.

Thanks.  I don't think I believe this rationale:

+   /*
+    * If one of the processes has already identified that we need
+    * to do prefetching, then let it perform the prefetch and allow
+    * the others to proceed with the work in hand.  Another option
+    * could be that we allow all of them to participate in
+    * prefetching.  But most of this work is done under a mutex or
+    * LWLock, so ultimately we may end up prefetching sequentially.
+    */

I mean, IIUC, the call to PrefetchBuffer() is not done under any lock.
And that's the slow part.  The tiny amount of time we spend updating
the prefetch information under the mutex should be insignificant
compared to the cost of actually reading the buffer.  Unless I'm
missing something.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

2017-03-07 Thread Robert Haas
On Tue, Feb 28, 2017 at 10:48 AM, Dilip Kumar  wrote:
> 0001- same as previous with some changes for freeing the shared memory stuff.

+if (--ptbase->refcount == 0)
+dsa_free(dsa, istate->pagetable);
+
+if (istate->spages)
+{
+ptpages = dsa_get_address(dsa, istate->spages);
+if (--ptpages->refcount == 0)
+dsa_free(dsa, istate->spages);
+}
+if (istate->schunks)
+{
+ptchunks = dsa_get_address(dsa, istate->schunks);
+if (--ptchunks->refcount == 0)
+dsa_free(dsa, istate->schunks);
+}

This doesn't involve any locking, which I think will happen to work
with the current usage pattern but doesn't seem very robust in
general.  I think you either need the refcounts to be protected by a
spinlock, or maybe better, use pg_atomic_uint32 for them.  You want
something like if (pg_atomic_sub_fetch_u32(&refcount, 1) == 0) {
dsa_free(...) }

Otherwise, there's no guarantee it will get freed exactly once.
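
Applied to the page table array above, the atomic version of the free
path would look roughly like this (a sketch, assuming refcount becomes
a pg_atomic_uint32):

/* The last backend to drop its reference performs the single free. */
if (pg_atomic_sub_fetch_u32(&ptbase->refcount, 1) == 0)
    dsa_free(dsa, istate->pagetable);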

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

2017-03-07 Thread Dilip Kumar
On Tue, Mar 7, 2017 at 9:34 PM, Robert Haas  wrote:
>
> +   if (DsaPointerIsValid(node->pstate->tbmiterator))
> +   tbm_free_shared_area(dsa, node->pstate->tbmiterator);
> +
> +   if (DsaPointerIsValid(node->pstate->prefetch_iterator))
> +   dsa_free(dsa, node->pstate->prefetch_iterator);

As per the latest code, both should be calling tbm_free_shared_area,
because tbm_free_shared_area is capable of handling that.  My silly
mistake; I will fix it.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel bitmap heap scan

2017-03-07 Thread Robert Haas
On Mon, Mar 6, 2017 at 12:35 AM, Dilip Kumar  wrote:
> On Thu, Mar 2, 2017 at 6:52 PM, Robert Haas  wrote:
>> 0002 wasn't quite careful enough about the placement of #ifdef
>> USE_PREFETCH, but otherwise looks OK.  Committed after changing that
>> and getting rid of the local variable prefetch_iterator, which seemed
>> to be adding rather than removing complexity after this refactoring.
>
> 0003 is rebased after this commit.

You've still got this:

+   if (DsaPointerIsValid(node->pstate->tbmiterator))
+   tbm_free_shared_area(dsa, node->pstate->tbmiterator);
+
+   if (DsaPointerIsValid(node->pstate->prefetch_iterator))
+   dsa_free(dsa, node->pstate->prefetch_iterator);

I'm trying to get to a point where both calls use
tbm_free_shared_area() - i.e. no peeking behind the abstraction layer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

2017-03-05 Thread Dilip Kumar
On Thu, Mar 2, 2017 at 6:52 PM, Robert Haas  wrote:
> 0002 wasn't quite careful enough about the placement of #ifdef
> USE_PREFETCH, but otherwise looks OK.  Committed after changing that
> and getting rid of the local variable prefetch_iterator, which seemed
> to be adding rather than removing complexity after this refactoring.

0003 is rebased after this commit.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


0003-parallel-bitmap-heapscan-v7.patch
Description: Binary data



Re: [HACKERS] Parallel bitmap heap scan

2017-03-02 Thread Robert Haas
On Tue, Feb 28, 2017 at 9:18 PM, Dilip Kumar  wrote:
> 0001- same as previous with some changes for freeing the shared memory stuff.
> 0002- nodeBitmapHeapScan refactoring, this applies independently
> 0003- actual parallel bitmap stuff applies on top of 0001 and 0002

0002 wasn't quite careful enough about the placement of #ifdef
USE_PREFETCH, but otherwise looks OK.  Committed after changing that
and getting rid of the local variable prefetch_iterator, which seemed
to be adding rather than removing complexity after this refactoring.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

2017-02-28 Thread Dilip Kumar
On Sun, Feb 26, 2017 at 9:26 PM, Robert Haas  wrote:
> I'm not entirely sure about the calling convention for
> tbm_free_shared_area() but the rest seems OK.

Changed
>
>> 2. Now tbm_free does not free any of the shared members, which can
>> be accessed by other workers, so tbm_free is safe to call from
>> ExecEndBitmapHeapScan without any safety check or refcount.
>
> That also seems fine. We ended up with something very similar in the
> Parallel Index Scan patch.
>
>> 0002:
>> 1. We don't need ExecShutdownBitmapHeapScan anymore because now we are
>> not freeing any shared member in ExecEndBitmapHeapScan.
>> 2. In ExecReScanBitmapHeapScan we will call tbm_free_shared_area to
>> free the shared members of the TBM.
>> 3. After that, we will free the TBMSharedIteratorState that we
>> allocated using tbm_prepare_shared_iterate.
>
> Check.  But I think tbm_free_shared_area() should also free the object
> itself, instead of making the caller do that separately.

Right, done that way.
>
> +if (DsaPointerIsValid(node->pstate->tbmiterator))
> +{
> +/* First we free the shared TBM members using the shared state */
> +tbm_free_shared_area(dsa, node->pstate->tbmiterator);
> +dsa_free(dsa, node->pstate->tbmiterator);
> +}
> +
> +if (DsaPointerIsValid(node->pstate->prefetch_iterator))
> +dsa_free(dsa, node->pstate->prefetch_iterator);
>
> The fact that these cases aren't symmetric suggests that your
> abstraction is leaky.  I'm guessing that you can't call
> tbm_free_shared_area because the two iterators share one copy of the
> underlying iteration arrays, and the TBM code isn't smart enough to
> avoid freeing them twice.  You're going to have to come up with a
> better solution to that problem; nodeBitmapHeapScan.c shouldn't know
> about the way the underlying storage details are managed.  (Maybe you
> need to reference-count the iterator arrays?)

Converted the iterator arrays to a structure and maintained a
refcount.  I had to do the same thing for the pagetable, because that
is also shared across iterators.

>
> +if (node->inited)
> +goto start_iterate;
>
> My first programming teacher told me not to use goto.  I've
> occasionally violated that rule, but I need a better reason than you
> have here.  It looks very easy to avoid.

Changed
>
> +pbms_set_parallel(outerPlanState(node));
>
> I think this should be a flag in the plan, and the planner should set
> it correctly, instead of having it be a flag in the executor that the
> executor sets.  Also, the flag itself should really be called
> something that involves the word "shared" rather than "parallel",
> because the bitmap will not be created in parallel, but it will be
> shared.
Done
>
> Have you checked whether this patch causes any regression in the
> non-parallel case?  It adds a bunch of "if" statements, so it might.
> Hopefully not, but it needs to be carefully tested.

With the new patch I tested on my local machine, performing multiple
executions, and it doesn't show any regression.  The attached file
perf_result contains the explain analyze output for one of the queries
(head, patch with 0 workers, and patch with 2 workers).  I will run
the test with all the TPC-H queries that use a bitmap plan on the
bigger machine and will post the results next week.
>
> @@ -48,10 +48,11 @@
>  #include "utils/snapmgr.h"
>  #include "utils/tqual.h"
>
> -
>  static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
>
> Unnecessary.
Fixed
>
> +static bool pbms_is_leader(ParallelBitmapState *pbminfo);
> +static void pbms_set_parallel(PlanState *node);
>
> I don't think this "pbms" terminology is very good.  It's dissimilar
> to the way other functions in this file are named.  Maybe
> BitmapShouldInitializeSharedState().
Changed
>
> I think that some of the bits that this function makes conditional on
> pstate should be factored out into inline functions.  Like:
>
> -if (node->prefetch_target >= node->prefetch_maximum)
> - /* don't increase any further */ ;
> -else if (node->prefetch_target >= node->prefetch_maximum / 2)
> -node->prefetch_target = node->prefetch_maximum;
> -else if (node->prefetch_target > 0)
> -node->prefetch_target *= 2;
> -else
> -node->prefetch_target++;
> +if (!pstate)
> +{
> +if (node->prefetch_target >= node->prefetch_maximum)
> + /* don't increase any further */ ;
> +else if (node->prefetch_target >= node->prefetch_maximum / 2)
> +node->prefetch_target = node->prefetch_maximum;
> +else if (node->prefetch_target > 0)
> +node->prefetch_target *= 2;
> +else
> +node->prefetch_target++;
> +}
> +else if (pstate->prefetch_target < node->prefetch_maximum)
> +   

Re: [HACKERS] Parallel bitmap heap scan

2017-02-26 Thread Robert Haas
On Mon, Feb 27, 2017 at 10:30 AM, Dilip Kumar  wrote:
> On Sun, Feb 26, 2017 at 9:26 PM, Robert Haas  wrote:
>> tbm_free_shared_area because the two iterators share one copy of the
>> underlying iteration arrays, and the TBM code isn't smart enough to
>> avoid freeing them twice.  You're going to have to come up with a
>> better solution to that problem; nodeBitmapHeapScan.c shouldn't know
>> about the way the underlying storage details are managed.  (Maybe you
>> need to reference-count the iterator arrays?)
>
> Yeah, I also think the current way doesn't look very clean.
> Currently, these arrays are just integer arrays; maybe we can use the
> first slot of the array for the reference count, or convert them to a
> structure which has space for a reference count and an integer array.
> What do you suggest?

Maybe something like typedef struct { int refcnt; SomeType
somename[FLEXIBLE_ARRAY_MEMBER]; } SomeOtherType; would be a good
approach.
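
Concretely, for the shared page array that might look like the sketch
below (illustrative names; the refcount lives in the same DSA chunk as
the array, and can use the pg_atomic_uint32 refinement suggested
elsewhere in the thread):

typedef struct PTEntryArray
{
    pg_atomic_uint32 refcount;      /* iterators sharing this array */
    PagetableEntry  ptentry[FLEXIBLE_ARRAY_MEMBER];
} PTEntryArray;

dsa_pointer dp = dsa_allocate0(dsa, offsetof(PTEntryArray, ptentry) +
                               nentries * sizeof(PagetableEntry));
PTEntryArray *ptbase = (PTEntryArray *) dsa_get_address(dsa, dp);

pg_atomic_init_u32(&ptbase->refcount, 1);   /* creator holds one ref */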

>> +if (node->inited)
>> +goto start_iterate;
>>
>> My first programming teacher told me not to use goto.  I've
>> occasionally violated that rule, but I need a better reason than you
>> have here.  It looks very easy to avoid.
>
> Yes, this can be avoided; I was just trying to get rid of multi-level
> if nesting and ended up with the "goto".

That's what I figured.

>> +pbms_set_parallel(outerPlanState(node));
>>
>> I think this should be a flag in the plan, and the planner should set
>> it correctly, instead of having it be a flag in the executor that the
>> executor sets.  Also, the flag itself should really be called
>> something that involves the word "shared" rather than "parallel",
>> because the bitmap will not be created in parallel, but it will be
>> shared.
>
> Earlier, I thought that it would not be a good idea to set that flag
> in the BitmapIndexScan path, because the same path can be used either
> under a ParallelBitmapHeapPath or under a normal BitmapHeapPath.  But
> now, after putting some thought into it, I realised that we can do
> that in create_plan.  Therefore, I will change this.

Cool.

>> pbms_is_leader() looks horrifically inefficient.  Every worker will
>> reacquire the spinlock for every tuple.  You should only have to enter
>> this spinlock-acquiring loop for the first tuple.  After that, either
>> you became the leader, did the initialization, and set the state to
>> PBM_FINISHED, or you waited until the pre-existing leader did the
>> same.  You should have a backend-local flag that keeps you from
>> touching the spinlock for every tuple.  I wouldn't be surprised if
>> fixing this nets a noticeable performance increase for this patch at
>> high worker counts.
>
> I think there is some confusion.  If you notice, the code below will
> avoid calling pbms_is_leader for every tuple: it will be called only
> the first time.  After that, node->inited will be true and it will
> jump directly to start_iterate for subsequent calls.  Am I missing
> something?
>
>> +if (node->inited)
>> +goto start_iterate;

Oh, OK.  I guess I was just confused, then.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

2017-02-26 Thread Dilip Kumar
On Sun, Feb 26, 2017 at 9:26 PM, Robert Haas  wrote:

Thanks for the review; I will work on these.  There are some comments
on which I need suggestions.

> tbm_free_shared_area because the two iterators share one copy of the
> underlying iteration arrays, and the TBM code isn't smart enough to
> avoid freeing them twice.  You're going to have to come up with a
> better solution to that problem; nodeBitmapHeapScan.c shouldn't know
> about the way the underlying storage details are managed.  (Maybe you
> need to reference-count the iterator arrays?)

Yeah, I also think the current way doesn't look very clean.
Currently, these arrays are just integer arrays; maybe we can use the
first slot of the array for the reference count, or convert them to a
structure which has space for a reference count and an integer array.
What do you suggest?

>
> +if (node->inited)
> +goto start_iterate;
>
> My first programming teacher told me not to use goto.  I've
> occasionally violated that rule, but I need a better reason than you
> have here.  It looks very easy to avoid.

Yes, this can be avoided; I was just trying to get rid of multi-level
if nesting and ended up with the "goto".

>
> +pbms_set_parallel(outerPlanState(node));
>
> I think this should be a flag in the plan, and the planner should set
> it correctly, instead of having it be a flag in the executor that the
> executor sets.  Also, the flag itself should really be called
> something that involves the word "shared" rather than "parallel",
> because the bitmap will not be created in parallel, but it will be
> shared.

Earlier, I thought that it would not be a good idea to set that flag
in the BitmapIndexScan path, because the same path can be used either
under a ParallelBitmapHeapPath or under a normal BitmapHeapPath.  But
now, after putting some thought into it, I realised that we can do
that in create_plan.  Therefore, I will change this.

>
> Have you checked whether this patch causes any regression in the
> non-parallel case?  It adds a bunch of "if" statements, so it might.
> Hopefully not, but it needs to be carefully tested.

During the initial patch development, I tested this aspect as well,
but never published any results.  I will do that along with my next
patch and post the results.
>
> pbms_is_leader() looks horrifically inefficient.  Every worker will
> reacquire the spinlock for every tuple.  You should only have to enter
> this spinlock-acquiring loop for the first tuple.  After that, either
> you became the leader, did the initialization, and set the state to
> PBM_FINISHED, or you waited until the pre-existing leader did the
> same.  You should have a backend-local flag that keeps you from
> touching the spinlock for every tuple.  I wouldn't be surprised if
> fixing this nets a noticeable performance increase for this patch at
> high worker counts.

I think there is some confusion.  If you notice, the code below will
avoid calling pbms_is_leader for every tuple: it will be called only
the first time.  After that, node->inited will be true and it will
jump directly to start_iterate for subsequent calls.  Am I missing
something?

> +if (node->inited)
> +goto start_iterate;


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel bitmap heap scan

2017-02-26 Thread Robert Haas
On Tue, Feb 21, 2017 at 10:20 AM, Dilip Kumar  wrote:
> 0001:
> 1. I have added a new interface, "tbm_free_shared_area(dsa_area *dsa,
> dsa_pointer dp)", which will be responsible for freeing all the shared
> members (pagetable, ptpage and ptchunk).  Actually, we cannot do this
> in tbm_free itself because only the leader will have a local tbm with
> references to all these pointers, and our parallel bitmap leader may
> not necessarily be the actual backend.

I'm not entirely sure about the calling convention for
tbm_free_shared_area() but the rest seems OK.

> 2. Now tbm_free does not free any of the shared members, which can be
> accessed by other workers, so tbm_free is safe to call from
> ExecEndBitmapHeapScan without any safety check or refcount.

That also seems fine. We ended up with something very similar in the
Parallel Index Scan patch.

> 0002:
> 1. We don't need ExecShutdownBitmapHeapScan anymore because now we are
> not freeing any shared member in ExecEndBitmapHeapScan.
> 2. In ExecReScanBitmapHeapScan we will call tbm_free_shared_area to
> free the shared members of the TBM.
> 3. After that, we will free the TBMSharedIteratorState that we
> allocated using tbm_prepare_shared_iterate.

Check.  But I think tbm_free_shared_area() should also free the object
itself, instead of making the caller do that separately.

+if (DsaPointerIsValid(node->pstate->tbmiterator))
+{
+/* First we free the shared TBM members using the shared state */
+tbm_free_shared_area(dsa, node->pstate->tbmiterator);
+dsa_free(dsa, node->pstate->tbmiterator);
+}
+
+if (DsaPointerIsValid(node->pstate->prefetch_iterator))
+dsa_free(dsa, node->pstate->prefetch_iterator);

The fact that these cases aren't symmetric suggests that your
abstraction is leaky.  I'm guessing that you can't call
tbm_free_shared_area because the two iterators share one copy of the
underlying iteration arrays, and the TBM code isn't smart enough to
avoid freeing them twice.  You're going to have to come up with a
better solution to that problem; nodeBitmapHeapScan.c shouldn't know
about the way the underlying storage details are managed.  (Maybe you
need to reference-count the iterator arrays?)

+if (node->inited)
+goto start_iterate;

My first programming teacher told me not to use goto.  I've
occasionally violated that rule, but I need a better reason than you
have here.  It looks very easy to avoid.

+pbms_set_parallel(outerPlanState(node));

I think this should be a flag in the plan, and the planner should set
it correctly, instead of having it be a flag in the executor that the
executor sets.  Also, the flag itself should really be called
something that involves the word "shared" rather than "parallel",
because the bitmap will not be created in parallel, but it will be
shared.

Have you checked whether this patch causes any regression in the
non-parallel case?  It adds a bunch of "if" statements, so it might.
Hopefully not, but it needs to be carefully tested.

@@ -48,10 +48,11 @@
 #include "utils/snapmgr.h"
 #include "utils/tqual.h"

-
 static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);

Unnecessary.

+static bool pbms_is_leader(ParallelBitmapState *pbminfo);
+static void pbms_set_parallel(PlanState *node);

I don't think this "pbms" terminology is very good.  It's dissimilar
to the way other functions in this file are named.  Maybe
BitmapShouldInitializeSharedState().

I think that some of the bits that this function makes conditional on
pstate should be factored out into inline functions.  Like:

-if (node->prefetch_target >= node->prefetch_maximum)
- /* don't increase any further */ ;
-else if (node->prefetch_target >= node->prefetch_maximum / 2)
-node->prefetch_target = node->prefetch_maximum;
-else if (node->prefetch_target > 0)
-node->prefetch_target *= 2;
-else
-node->prefetch_target++;
+if (!pstate)
+{
+if (node->prefetch_target >= node->prefetch_maximum)
+ /* don't increase any further */ ;
+else if (node->prefetch_target >= node->prefetch_maximum / 2)
+node->prefetch_target = node->prefetch_maximum;
+else if (node->prefetch_target > 0)
+node->prefetch_target *= 2;
+else
+node->prefetch_target++;
+}
+else if (pstate->prefetch_target < node->prefetch_maximum)
+{
+SpinLockAcquire(&pstate->prefetch_mutex);
+if (pstate->prefetch_target >= node->prefetch_maximum)
+ /* don't increase any further */ ;
+else if (pstate->prefetch_target >=
+ node->prefetch_maximum / 2)
+pstate->prefetch_target = node->prefetch_maximum;

Re: [HACKERS] Parallel bitmap heap scan

2017-02-20 Thread Dilip Kumar
On Mon, Feb 20, 2017 at 11:18 PM, Robert Haas  wrote:
> On Mon, Feb 20, 2017 at 5:55 AM, Thomas Munro  wrote:
>> On Sun, Feb 19, 2017 at 9:59 PM, Dilip Kumar  wrote:
>>> So basically, what I want to propose is that only during
>>> ExecReScanBitmapHeapScan can we free all the DSA pointers, because
>>> at that time we can be sure that all the workers have completed
>>> their task and we are safe to free.  (And we don't free any DSA
>>> memory in ExecEndBitmapHeapScan.)
>>
>> I think this works.
>
> OK.

In my latest version of the patch, I have fixed it as described above,
i.e. we only clean up in ExecReScanBitmapHeapScan.

For getting this there is some change in both the patches.

0001:
1. I have added a new interface, "tbm_free_shared_area(dsa_area *dsa,
dsa_pointer dp)", which will be responsible for freeing all the shared
members (pagetable, ptpage and ptchunk).  Actually, we cannot do this
in tbm_free itself because only the leader will have a local tbm with
references to all these pointers, and our parallel bitmap leader may
not necessarily be the actual backend.

If we wanted to get this done through the tbm, then we would need to
create a local tbm in each worker and copy the shared member
references into it using tbm_attach_shared_iterate, as we were doing
in the earlier version.

2. Now tbm_free does not free any of the shared members, which can be
accessed by other workers, so tbm_free is safe to call from
ExecEndBitmapHeapScan without any safety check or refcount.

0002:
1. We don't need ExecShutdownBitmapHeapScan anymore because now we are
not freeing any shared member in ExecEndBitmapHeapScan.

2. In ExecReScanBitmapHeapScan we will call tbm_free_shared_area to
free the shared members of the TBM.
3. After that, we will free the TBMSharedIteratorState that we
allocated using tbm_prepare_shared_iterate.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


0001-tidbitmap-support-shared-v5.patch
Description: Binary data


0002-parallel-bitmap-heapscan-v5.patch
Description: Binary data



Re: [HACKERS] Parallel bitmap heap scan

2017-02-20 Thread Tom Lane
Robert Haas  writes:
> On Mon, Feb 20, 2017 at 6:19 AM, Tom Lane  wrote:
>> The thing that you really have to worry about for this kind of proposal
>> is "what if the query errors out and we never get to ExecEndNode"?
>> It's particularly nasty if you're talking about parallel queries where
>> maybe only one or some of the processes involved detect an error.

> I think that's not actually a problem, because we've already got code
> to make sure that all DSM resources associated with the query get
> blown away in that case.  Of course, that code might have bugs, but if
> it does, I think it's better to try to fix those bugs than to insert
> some belt-and-suspenders mechanism for reclaiming every possible chunk
> of memory in retail fashion, just like we blow up es_query_cxt rather
> than trying to pfree allocations individually.

Actually, I think we're saying the same thing: rely on the general DSM
cleanup mechanism, don't insert extra stuff that you expect will get
done by executor shutdown.

regards, tom lane




Re: [HACKERS] Parallel bitmap heap scan

2017-02-20 Thread Robert Haas
On Mon, Feb 20, 2017 at 6:19 AM, Tom Lane  wrote:
> Thomas Munro  writes:
>> One practical problem that came up was the need for executor nodes to
>> get a chance to do that kind of cleanup before the DSM segment is
>> detached.  In my patch series I introduced a new node API
>> ExecNodeDetach to allow for that.  Andres objected that the need for
>> that is evidence that the existing protocol is broken and should be
>> fixed instead.  I'm looking into that.
>
> The thing that you really have to worry about for this kind of proposal
> is "what if the query errors out and we never get to ExecEndNode"?
> It's particularly nasty if you're talking about parallel queries where
> maybe only one or some of the processes involved detect an error.

I think that's not actually a problem, because we've already got code
to make sure that all DSM resources associated with the query get
blown away in that case.  Of course, that code might have bugs, but if
it does, I think it's better to try to fix those bugs than to insert
some belt-and-suspenders mechanism for reclaiming every possible chunk
of memory in retail fashion, just like we blow up es_query_cxt rather
than trying to pfree allocations individually.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

2017-02-20 Thread Robert Haas
On Mon, Feb 20, 2017 at 5:55 AM, Thomas Munro
 wrote:
> On Sun, Feb 19, 2017 at 9:59 PM, Dilip Kumar  wrote:
>> So basically, what I want to propose is that only during
>> ExecReScanBitmapHeapScan can we free all the DSA pointers, because at
>> that time we can be sure that all the workers have completed their
>> tasks and it is safe to free them. (And we don't free any DSA memory
>> at ExecEndBitmapHeapScan.)
>
> I think this works.

OK.

> Some hand-wavy thoughts on this topic in the context of hash joins:
>
> The argument for cleaning up sooner rather than later would be that it
> could reduce the total peak memory usage of large execution plans.  Is
> that a reasonable goal and can we achieve it?  I suspect the answer is
> yes in theory but no in practice, and we don't even try to achieve it
> in non-parallel queries as far as I know.

We're pretty stupid about causing nodes to stop eating up resources as
early as we could; for example, when a Limit is filled, we don't make
any attempt to have scans underneath it release pins or memory or
anything else.  But we don't usually let the same node consume memory
multiple times. ExecReScanBitmapHeapScan frees all of the memory used
for the previous bitmap in the non-parallel case, so it should
probably do that in the parallel case also.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

2017-02-19 Thread Dilip Kumar
On Sun, Feb 19, 2017 at 10:34 PM, Robert Haas  wrote:
> Yeah, but it looks like ExecReScanGather gets rid of the workers, but
> reuses the existing DSM.  I'm not quite sure what happens to the DSA.
> It looks like it probably just hangs around from the previous
> iteration, which means that any allocations will also hang around.

Yes, right, they hang around. But during rescan
(ExecReScanBitmapHeapScan) we can free all these DSA pointers, right?
That means that before reallocating the DSA pointers we would already
have gotten rid of the old ones.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel bitmap heap scan

2017-02-19 Thread Tom Lane
Thomas Munro  writes:
> One practical problem that came up was the need for executor nodes to
> get a chance to do that kind of cleanup before the DSM segment is
> detached.  In my patch series I introduced a new node API
> ExecNodeDetach to allow for that.  Andres objected that the need for
> that is evidence that the existing protocol is broken and should be
> fixed instead.  I'm looking into that.

The thing that you really have to worry about for this kind of proposal
is "what if the query errors out and we never get to ExecEndNode"?
It's particularly nasty if you're talking about parallel queries where
maybe only one or some of the processes involved detect an error.

regards, tom lane




Re: [HACKERS] Parallel bitmap heap scan

2017-02-19 Thread Thomas Munro
On Sun, Feb 19, 2017 at 10:34 PM, Robert Haas  wrote:
> On Sun, Feb 19, 2017 at 9:59 PM, Dilip Kumar  wrote:
>> I can imagine it can get executed over and over if the plan is
>> something like the one below.
>>
>> NestLoopJoin
>>   -> SeqScan
>>   -> Gather
>>         -> Parallel Bitmap Heap Scan
>>
>> But in such a case, every time the inner node of the NLJ, i.e. the
>> Gather, will be rescanned, which in turn shuts down the workers.
>
> Yeah, but it looks like ExecReScanGather gets rid of the workers, but
> reuses the existing DSM.  I'm not quite sure what happens to the DSA.
> It looks like it probably just hangs around from the previous
> iteration, which means that any allocations will also hang around.

Yes, it hangs around.  Being able to reuse state in a rescan is a
feature: you might be able to reuse a hash table or a bitmap.

In the Parallel Shared Hash patch, the last participant to detach from
the shared hash table barrier gets a different return code and runs
some cleanup code.  The alternative was to make the leader wait for
the workers to finish accessing the hash table so that it could do
that.  I had it that way in an early version, but my goal is to
minimise synchronisation points so now it's just 'last to leave turns
out the lights' with no waiting.
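
For what it's worth, a sketch of that 'last to leave' pattern using
the Barrier API from my patch series (names are from that
work-in-progress, and the cleanup function is hypothetical):

/* Each participant detaches as soon as it has finished probing;
 * nobody waits, and only the last one to detach cleans up. */
if (BarrierArriveAndDetach(&shared->barrier))
{
    /* we were the last attached participant: turn out the lights */
    shared_hash_table_destroy(shared);   /* hypothetical cleanup */
}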

One practical problem that came up was the need for executor nodes to
get a chance to do that kind of cleanup before the DSM segment is
detached.  In my patch series I introduced a new node API
ExecNodeDetach to allow for that.  Andres objected that the need for
that is evidence that the existing protocol is broken and should be
fixed instead.  I'm looking into that.

On Sun, Feb 19, 2017 at 9:59 PM, Dilip Kumar  wrote:
> So basically, what I want to propose is that only during
> ExecReScanBitmapHeapScan can we free all the DSA pointers, because at
> that time we can be sure that all the workers have completed their
> tasks and it is safe to free them. (And we don't free any DSA memory
> at ExecEndBitmapHeapScan.)

I think this works.

I also grappled a bit with the question of whether it's actually worth
trying to free DSA memory when you're finished with it, eating
precious CPU cycles at the end of a join, or just letting the
executor's DSA area get nuked at end of parallel execution.  As you
say, there is a special case for rescans to avoid leaks.  I described
this as a potential approach in a TODO note in my v5 patch, but
currently my code just does the clean-up every time on the grounds
that it's simple and hasn't shown up as a performance problem yet.

Some hand-wavy thoughts on this topic in the context of hash joins:

The argument for cleaning up sooner rather than later would be that it
could reduce the total peak memory usage of large execution plans.  Is
that a reasonable goal and can we achieve it?  I suspect the answer is
yes in theory but no in practice, and we don't even try to achieve it
in non-parallel queries as far as I know.

My understanding is that PostgreSQL's planner can generate left-deep,
bushy and right-deep hash join plans:

N nested left-deep hash joins:  Each hash join is on the outer side of
its parent, supplying tuples to probe the parent hash table.  Their
probe phases overlap, so all N hash tables must exist and be fully
loaded at the same time.

N nested right-deep hash joins:  Each hash join is on the inner side
of its parent, supplying tuples to build the hash table of its parent.
Theoretically you only need two full hash tables in memory at peak,
because you'll finish probing each one while building its parent's hash
table and then not need the child's hash table again (unless you need
to rescan).

N nested bushy hash joins:  Somewhere in between.

But we don't actually take advantage of that opportunity to reduce
peak memory today.  We always have N live hash tables and don't free
them until standard_ExecutorEnd runs ExecProcEnd on the top of the
plan.  Perhaps we could teach hash tables to free themselves ASAP at
the end of their probe phase unless they know a rescan is possible.
But that just opens a whole can of worms:  if we care about total peak
memory usage, should it become a planner goal that might favour
right-deep hash joins?  I guess similar questions must arise for
bitmap heap scan and anything else holding memory that it doesn't
technically need anymore, and parallel query doesn't really change
anything about the situation, except maybe that Gather nodes provide a
point of scoping somewhere in between 'eager destruction' and 'hog all
the space until end of plan' which makes things a bit better.  I don't
know anywhere near enough about query planners to say whether such
ideas about planning are reasonable, and am quite aware that it's
difficult terrain, and I have other fish to fry, so I'm going to put
down the can opener and back away.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel bitmap heap scan

2017-02-19 Thread Robert Haas
On Sun, Feb 19, 2017 at 9:59 PM, Dilip Kumar  wrote:
> I can imagine it can get executed over and over if the plan is
> something like the one below.
>
> NestLoopJoin
>   -> SeqScan
>   -> Gather
>         -> Parallel Bitmap Heap Scan
>
> But in such a case, every time the inner node of the NLJ, i.e. the
> Gather, will be rescanned, which in turn shuts down the workers.

Yeah, but it looks like ExecReScanGather gets rid of the workers, but
reuses the existing DSM.  I'm not quite sure what happens to the DSA.
It looks like it probably just hangs around from the previous
iteration, which means that any allocations will also hang around.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

2017-02-19 Thread Dilip Kumar
On Sun, Feb 19, 2017 at 7:44 PM, Robert Haas  wrote:
> It's probably OK if tbm_free() doesn't free the memory allocated from
> DSA, and we just let cleanup at end of query do it.  However, that
> could cause some trouble if the Parallel Bitmap Heap Scan gets
> executed over and over and keeps allocating more and more memory from
> DSA.

Is it possible that Parallel Bitmap Heap Scan will be executed
multiple times without shutting down the workers?

I can imagine it can get executed over and over if the plan is
something like the one below.

NestLoopJoin
  -> SeqScan
  -> Gather
        -> Parallel Bitmap Heap Scan

But in such a case, every time the inner node of the NLJ, i.e. the
Gather, will be rescanned, which in turn shuts down the workers.

So basically, what I want to propose is that only during
ExecReScanBitmapHeapScan can we free all the DSA pointers, because at
that time we can be sure that all the workers have completed their
tasks and it is safe to free them. (And we don't free any DSA memory
at ExecEndBitmapHeapScan.)

> I think the way to fix that would be to maintain a reference
> count that starts at 1 when the iterator arrays are created and gets
> incremented every time a TBMSharedIteratorState is created.  It gets
> decremented when the TIDBitmap that has the iterator arrays is
> destroyed, and each time a TBMSharedIteratorState is destroyed.
> When it reaches 0, the process that reduces the reference count to 0
> calls dsa_free on the DSA pointers for pagetable, spages, and schunks.
> (Also, if a TIDBitmap is freed before iteration begins, it frees the
> DSA pointer for the pagetable only; the others won't have values yet.)


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel bitmap heap scan

2017-02-19 Thread Robert Haas
On Sun, Feb 19, 2017 at 7:18 PM, Dilip Kumar  wrote:
> I have observed one problem with 0002, and I thought of sharing it
> before fixing it, because we might have encountered the same problem
> in some other patches, e.g. parallel shared hash, and there might
> already be a way to handle it.
>
> The problem is that in ExecEndBitmapHeapScan we do tbm_free, which
> frees the local pagetable memory and the shared memory using a
> callback routine. If some worker other than the backend (the actual
> backend for the client which is executing the Gather node) becomes
> the leader (the worker responsible for creating the shared TBM), it
> can finish its work and free the shared memory by calling
> ExecEndBitmapHeapScan while other workers are still operating on that
> shared memory.
>
> So far I have never observed this issue in our tests, maybe because
> either the backend becomes the leader, or by the time the
> (non-backend) leader frees the TBM the others have also finished
> their work.
>
> Solution:
> - One solution to this problem is that the leader should not complete
> the scan until all the other workers have finished their work.
> - Alternatively, we don't make anyone wait, but arrange that only the
> last worker to finish the bitmap scan frees the memory. One problem
> with this solution is that the last worker can also be a non-leader,
> which has access to the shared memory, but then how do we free the
> local pagetable memory (it's local to the leader)?
>
> Any suggestions on this?

It's probably OK if tbm_free() doesn't free the memory allocated from
DSA, and we just let cleanup at end of query do it.  However, that
could cause some trouble if the Parallel Bitmap Heap Scan gets
executed over and over and keeps allocating more and more memory from
DSA.  I think the way to fix that would be to maintain a reference
count that starts at 1 when the iterator arrays are created and gets
incremented every time a TBMSharedIteratorState is created.  It gets
decremented when the TIDBitmap that has the iterator arrays is
destroyed, and each time a TBMSharedIteratorState is destroyed.
When it reaches 0, the process that reduces the reference count to 0
calls dsa_free on the DSA pointers for pagetable, spages, and schunks.
(Also, if a TIDBitmap is freed before iteration begins, it frees the
DSA pointer for the pagetable only; the others won't have values yet.)
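
In rough pseudo-C, the scheme would look like this (the refcount
field and the release function are invented for illustration; the rest
of the struct layout is simplified):

typedef struct TBMSharedIteratorState
{
    pg_atomic_uint32 refcount;   /* 1 when the iterator arrays are created */
    dsa_pointer pagetable;
    dsa_pointer spages;
    dsa_pointer schunks;
    /* ... iteration position, lock, etc. ... */
} TBMSharedIteratorState;

static void
tbm_shared_state_release(dsa_area *dsa, TBMSharedIteratorState *istate)
{
    /* whoever drops the last reference frees the DSA allocations */
    if (pg_atomic_sub_fetch_u32(&istate->refcount, 1) == 0)
    {
        if (DsaPointerIsValid(istate->pagetable))
            dsa_free(dsa, istate->pagetable);
        if (DsaPointerIsValid(istate->spages))
            dsa_free(dsa, istate->spages);
        if (DsaPointerIsValid(istate->schunks))
            dsa_free(dsa, istate->schunks);
    }
}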

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

2017-02-19 Thread Amit Kapila
On Sun, Feb 19, 2017 at 7:18 PM, Dilip Kumar  wrote:
> On Sat, Feb 18, 2017 at 10:45 PM, Dilip Kumar  wrote:
>> in 0002:
>> - Improved comments.
>> - Code refactoring in BitmapHeapNext.
>> - Removed local tbm creation in BitmapHeapNext: as per the new
>> tidbitmap it's of no use.
>
> I have observed one problem with 0002, and I thought of sharing it
> before fixing it, because we might have encountered the same problem
> in some other patches, e.g. parallel shared hash, and there might
> already be a way to handle it.
>
> The problem is that in ExecEndBitmapHeapScan we do tbm_free, which
> frees the local pagetable memory and the shared memory using a
> callback routine. If some worker other than the backend (the actual
> backend for the client which is executing the Gather node) becomes
> the leader (the worker responsible for creating the shared TBM), it
> can finish its work and free the shared memory by calling
> ExecEndBitmapHeapScan while other workers are still operating on that
> shared memory.
>
> So far I have never observed this issue in our tests, maybe because
> either the backend becomes the leader, or by the time the
> (non-backend) leader frees the TBM the others have also finished
> their work.
>
> Solution:
> - One solution to this problem is that the leader should not complete
> the scan until all the other workers have finished their work.

For Parallel Seq Scan, we do wait for the parallel workers to finish
before freeing the shared memory.  Why is the case different here?
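
For reference, that Parallel Seq Scan behaviour comes from
ExecParallelFinish, which in simplified form does this (a sketch, not
the verbatim source):

void
ExecParallelFinish(ParallelExecutorInfo *pei)
{
    if (pei->finished)
        return;

    /* wait for all workers to exit before touching shared state */
    WaitForParallelWorkersToFinish(pei->pcxt);

    /* ... accumulate instrumentation from the workers, etc. ... */

    pei->finished = true;
}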


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel bitmap heap scan

2017-02-19 Thread Dilip Kumar
On Sat, Feb 18, 2017 at 10:45 PM, Dilip Kumar  wrote:
> in 0002:
> - Improved comments.
> - Code refactoring in BitmapHeapNext.
> - Removed local tbm creation in BitmapHeapNext: as per the new
> tidbitmap it's of no use.

I have observed one problem with 0002, and I thought of sharing it
before fixing it, because we might have encountered the same problem
in some other patches, e.g. parallel shared hash, and there might
already be a way to handle it.

The problem is that in ExecEndBitmapHeapScan we do tbm_free, which
frees the local pagetable memory and the shared memory using a
callback routine. If some worker other than the backend (the actual
backend for the client which is executing the Gather node) becomes
the leader (the worker responsible for creating the shared TBM), it
can finish its work and free the shared memory by calling
ExecEndBitmapHeapScan while other workers are still operating on that
shared memory.

So far I have never observed this issue in our tests, maybe because
either the backend becomes the leader, or by the time the
(non-backend) leader frees the TBM the others have also finished
their work.

Solution:
- One solution to this problem is that the leader should not complete
the scan until all the other workers have finished their work.
- Alternatively, we don't make anyone wait, but arrange that only the
last worker to finish the bitmap scan frees the memory. One problem
with this solution is that the last worker can also be a non-leader,
which has access to the shared memory, but then how do we free the
local pagetable memory (it's local to the leader)?

Any suggestions on this?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel bitmap heap scan

2017-02-18 Thread Dilip Kumar
On Fri, Feb 17, 2017 at 2:01 AM, Robert Haas  wrote:
> + * in order to share relptrs of the chunk and the pages arrays and other
> + * TBM related information with other workers.
>
> No relptrs any more.

Fixed
>
> +    dsa_pointer dsapagetable;     /* dsa_pointer to the element array */
> +    dsa_pointer dsapagetableold;  /* dsa_pointer to the old element array */
> +    dsa_area   *dsa;              /* reference to per-query dsa area */
> +    dsa_pointer ptpages;          /* dsa_pointer to the page array */
> +    dsa_pointer ptchunks;         /* dsa_pointer to the chunk array */
>
> Let's put the DSA pointer first and then the other stuff after it.
> That seems more logical.

Done that way
>
> +typedef struct TBMSharedIteratorState
> +{
> +    int         spageptr;    /* next spages index */
> +    int         schunkptr;   /* next schunks index */
> +    int         schunkbit;   /* next bit to check in current schunk */
> +    LWLock      lock;        /* lock to protect above members */
> +    dsa_pointer pagetable;   /* dsa pointers to head of pagetable data */
> +    dsa_pointer spages;      /* dsa pointer to page array */
> +    dsa_pointer schunks;     /* dsa pointer to chunk array */
> +    int         nentries;    /* number of entries in pagetable */
> +    int         maxentries;  /* limit on same to meet maxbytes */
> +    int         npages;      /* number of exact entries in pagetable */
> +    int         nchunks;     /* number of lossy entries in pagetable */
> +} TBMSharedIteratorState;
>
> I think you've got this largely backwards from the most logical order.
> Let's order it like this: nentries, maxentries, npages, nchunks,
> pagetable, spages, schunks, lock (to protect BELOW members), spageptr,
> schunkptr, schunkbit.

Done
>
> +struct TBMSharedIterator
> +{
> +    PagetableEntry *ptbase;   /* pointer to the pagetable element array */
> +    dsa_area   *dsa;          /* reference to per-query dsa area */
> +    int        *ptpages;      /* sorted exact page index list */
> +    int        *ptchunks;     /* sorted lossy page index list */
> +    TBMSharedIteratorState *state;  /* shared members */
> +    TBMIterateResult output;  /* MUST BE LAST (because variable-size) */
> +};
>
> Do we really need "dsa" here when it's already present in the shared
> state?  It doesn't seem like we even use it for anything.  It's needed
> to create each backend's TBMSharedIterator, but not afterwards, I
> think.

Right, removed.

>
> In terms of ordering, I'd move "state" to the top of the structure,
> just like "tbm" comes first in a TBMIterator.

Yeah, that looks better. Done that way.
>
> + * total memory consumption.  If dsa passed to this function is not NULL
> + * then the memory for storing elements of the underlying page table will
> + * be allocated from the DSA.
>
> Notice that you capitalized "DSA" in two different ways in the same
> sentence; I'd go for the all-caps version.  Also, you need the word
> "the" before the first one.

Fixed, all such instances.

>
> +    if (tbm->status == TBM_HASH && (tbm->iterating == TBM_NOT_ITERATING))
>
> Excess parentheses.
Fixed
>
> + * tbm_prepare_shared_iterate - prepare to iterator through a TIDBitmap
> + * by multiple workers using shared iterator.
>
> Prepare to iterate, not prepare to iterator.  I suggest rephrasing
> this as "prepare shared iteration state for a TIDBitmap".

Fixed.
>
> + * The TBMSharedIteratorState will be allocated from DSA so that multiple
> + * worker can attach to it and iterate jointly.
>
> Maybe change to "The necessary shared state will be allocated from the
> DSA passed to tbm_create, so that multiple workers can attach to it
> and iterate jointly".

Done.
>
> + * This will convert the pagetable hash into page and chunk array of the 
> index
> + * into pagetable array so that multiple workers can use it to get the actual
> + * page entry.
>
> I think you can leave off everything starting from "so that".  It's
> basically redundant with what you already said.

Done
>
> +    dsa_pointer iterator;
> +    TBMSharedIteratorState *iterator_state;
>
> These aren't really great variable names, because the latter isn't the
> state associated with the former.  They're both the same object.
> Maybe just "dp" and "istate".

Done
>
> I think this function should Assert(tbm->dsa != NULL) and
> Assert(tbm->iterating != TBM_ITERATING_PRIVATE), and similarly
> tbm_begin_iterate should Assert(tbm->iterating != TBM_ITERATING_SHARED).

Done
>
> +/*
> + * If we have a hashtable, create and fill the sorted page lists, unless
> + * we already did that for a previous iterator.  Note that the lists are
> + * attached to the bitmap not the iterator, so they can be used by more
> + * than one iterator.  However, we keep dsa_pointer to these in the
> + * shared iterator so that other workers can access them directly.
>

Re: [HACKERS] Parallel bitmap heap scan

2017-02-16 Thread Robert Haas
On Thu, Feb 16, 2017 at 10:54 AM, Dilip Kumar  wrote:
> interface_dsa_allocate0.patch : Provides a new interface, dsa_allocate0,
> which is used by 0001

Committed.  I moved the function prototype for dsa_allocate0() next to
the existing prototype for dsa_allocate().  Please try to think about
things like this when you add functions or prototypes or structure
members: put them in a natural place where they are close to related
things, not just wherever it happens to be convenient.

> gather_shutdown_childeren_first.patch : Processing the child node

This same problem came up on the Parallel Hash thread.  I mentioned
this proposed fix over there; let's see what he (or anyone else) has
to say about it.

> 0001: tidbitmap change to provide the interfaces for shared iterator.

+ * in order to share relptrs of the chunk and the pages arrays and other
+ * TBM related information with other workers.

No relptrs any more.

+    dsa_pointer dsapagetable;     /* dsa_pointer to the element array */
+    dsa_pointer dsapagetableold;  /* dsa_pointer to the old element array */
+    dsa_area   *dsa;              /* reference to per-query dsa area */
+    dsa_pointer ptpages;          /* dsa_pointer to the page array */
+    dsa_pointer ptchunks;         /* dsa_pointer to the chunk array */

Let's put the DSA pointer first and then the other stuff after it.
That seems more logical.

+typedef struct TBMSharedIteratorState
+{
+    int         spageptr;    /* next spages index */
+    int         schunkptr;   /* next schunks index */
+    int         schunkbit;   /* next bit to check in current schunk */
+    LWLock      lock;        /* lock to protect above members */
+    dsa_pointer pagetable;   /* dsa pointers to head of pagetable data */
+    dsa_pointer spages;      /* dsa pointer to page array */
+    dsa_pointer schunks;     /* dsa pointer to chunk array */
+    int         nentries;    /* number of entries in pagetable */
+    int         maxentries;  /* limit on same to meet maxbytes */
+    int         npages;      /* number of exact entries in pagetable */
+    int         nchunks;     /* number of lossy entries in pagetable */
+} TBMSharedIteratorState;

I think you've got this largely backwards from the most logical order.
Let's order it like this: nentries, maxentries, npages, nchunks,
pagetable, spages, schunks, lock (to protect BELOW members), spageptr,
schunkptr, schunkbit.

+struct TBMSharedIterator
+{
+    PagetableEntry *ptbase;   /* pointer to the pagetable element array */
+    dsa_area   *dsa;          /* reference to per-query dsa area */
+    int        *ptpages;      /* sorted exact page index list */
+    int        *ptchunks;     /* sorted lossy page index list */
+    TBMSharedIteratorState *state;  /* shared members */
+    TBMIterateResult output;  /* MUST BE LAST (because variable-size) */
+};

Do we really need "dsa" here when it's already present in the shared
state?  It doesn't seem like we even use it for anything.  It's needed
to create each backend's TBMSharedIterator, but not afterwards, I
think.

In terms of ordering, I'd move "state" to the top of the structure,
just like "tbm" comes first in a TBMIterator.

+ * total memory consumption.  If dsa passed to this function is not NULL
+ * then the memory for storing elements of the underlying page table will
+ * be allocated from the DSA.

Notice that you capitalized "DSA" in two different ways in the same
sentence; I'd go for the all-caps version.  Also, you need the word
"the" before the first one.

+    if (tbm->status == TBM_HASH && (tbm->iterating == TBM_NOT_ITERATING))

Excess parentheses.

+ * tbm_prepare_shared_iterate - prepare to iterator through a TIDBitmap
+ * by multiple workers using shared iterator.

Prepare to iterate, not prepare to iterator.  I suggest rephrasing
this as "prepare shared iteration state for a TIDBitmap".

+ * The TBMSharedIteratorState will be allocated from DSA so that multiple
+ * worker can attach to it and iterate jointly.

Maybe change to "The necessary shared state will be allocated from the
DSA passed to tbm_create, so that multiple workers can attach to it
and iterate jointly".

+ * This will convert the pagetable hash into page and chunk array of the index
+ * into pagetable array so that multiple workers can use it to get the actual
+ * page entry.

I think you can leave off everything starting from "so that".  It's
basically redundant with what you already said.

+    dsa_pointer iterator;
+    TBMSharedIteratorState *iterator_state;

These aren't really great variable names, because the latter isn't the
state associated with the former.  They're both the same object.
Maybe just "dp" and "istate".

I think this function should Assert(tbm->dsa != NULL) and
Assert(tbm->iterating != TBM_ITERATING_PRIVATE), and similarly
tbm_begin_iterate should Assert(tbm->iterating != TBM_ITERATING_SHARED).

+/*
+ * If

Re: [HACKERS] Parallel bitmap heap scan

2017-02-16 Thread Dilip Kumar
On Tue, Feb 14, 2017 at 10:18 PM, Robert Haas  wrote:
> What is the point of having a TBMSharedIterator contain a TIDBitmap
> pointer?  All the values in that TIDBitmap are populated from the
> TBMSharedInfo, but it seems to me that the fields that are copied over
> unchanged (like status and nchunks) could just be used directly from
> the TBMSharedInfo, and the fields that are computed (like relpages and
> relchunks) could be stored directly in the TBMSharedIterator.
> tbm_shared_iterate() is already separate code from tbm_iterate(), so
> it can be changed to refer to whichever data structure contains the
> data it needs.

Done

>
> Why is TBMSharedInfo separate from TBMSharedIteratorState?  It seems
> like you could merge those two into a single structure.

Done
>
> I think we can simplify things here a bit by having shared iterators
> not support single-page mode.  In the backend-private case,
> tbm_begin_iterate() really doesn't need to do anything with the
> pagetable in the TBM_ONE_PAGE case, but for a shared iterator, we've
> got to anyway copy the pagetable into shared memory.  So it seems to
> me that it would be simpler just to transform it into a standard
> iteration array while we're at it, instead of copying it into entry1.
> In other words, I suggest removing both "entry1" and "status" from
> TBMSharedInfo and making tbm_prepare_shared_iterate smarter to
> compensate.

Done

>
> I think "dsa_data" is poorly named; it should be something like
> "dsapagetable" in TIDBitmap and just "pagetable" in TBMSharedInfo.  I
> think you should should move the "base", "relpages", and "relchunks"
> into TBMSharedIterator and give them better names, like "ptbase",
> "ptpages" and "ptchunks".  "base" isn't clear that we're talking about
> the pagetable's base as opposed to anything else, and "relpages" and
> "relchunks" are named based on the fact that the pointers are relative
> rather than named based on what data they point at, which doesn't seem
> like the right choice.

Done
>
> I suggest putting the parallel functions next to each other in the
> file: tbm_begin_iterate(), tbm_prepare_shared_iterate(),
> tbm_iterate(), tbm_shared_iterate(), tbm_end_iterate(),
> tbm_end_shared_iterate().

Done
>
> +    if (tbm->dsa == NULL)
> +        return pfree(pointer);
>
> Don't try to return a value of type void.  The correct spelling of
> this is { pfree(pointer); return; }.  Formatted appropriately across 4
> lines, of course.

Fixed

>
> +/*
> + * If TBM is in iterating phase that means pagetable is already created
> + * and we have come here during tbm_free. By this time we are already
> + * detached from the DSA because the GatherNode would have been shutdown.
> + */
> +    if (tbm->iterating)
> +        return;
>
> This seems like something of a kludge, although not a real easy one to
> fix.  The problem here is that tidbitmap.c ideally shouldn't have to
> know that it's being used by the executor or that there's a Gather
> node involved, and certainly not the order of operations around
> shutdown.  It should really be the problem of 0002 to handle this kind
> of problem, without 0001 having to know anything about it.  It strikes
> me that it would probably be a good idea to have Gather clean up its
> children before it gets cleaned up itself.  So in ExecShutdownNode, we
> could do something like this:
>
> diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
> index 0dd95c6..5ccc2e8 100644
> --- a/src/backend/executor/execProcnode.c
> +++ b/src/backend/executor/execProcnode.c
> @@ -815,6 +815,8 @@ ExecShutdownNode(PlanState *node)
>      if (node == NULL)
>          return false;
>
> +    planstate_tree_walker(node, ExecShutdownNode, NULL);
> +
>      switch (nodeTag(node))
>      {
>          case T_GatherState:
> @@ -824,5 +826,5 @@ ExecShutdownNode(PlanState *node)
>              break;
>      }
>
> -    return planstate_tree_walker(node, ExecShutdownNode, NULL);
> +    return false;
>  }
>
> Also, in ExecEndGather, something like this:
>
> diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
> index a1a3561..32c97d3 100644
> --- a/src/backend/executor/nodeGather.c
> +++ b/src/backend/executor/nodeGather.c
> @@ -229,10 +229,10 @@ ExecGather(GatherState *node)
>  void
>  ExecEndGather(GatherState *node)
>  {
> +    ExecEndNode(outerPlanState(node));  /* let children clean up first */
>      ExecShutdownGather(node);
>      ExecFreeExprContext(&node->ps);
>      ExecClearTuple(node->ps.ps_ResultTupleSlot);
> -    ExecEndNode(outerPlanState(node));
>  }
>
>  /*
>
> With those changes and an ExecShutdownBitmapHeapScan() called from
> ExecShutdownNode(), it should be possible (I think) for us to always
> have the bitmap heap scan node shut down before the Gather node shuts
> down, which I think would let you avoid having a special case for this
> inside the TBM code.

Done
(gather_shutdown_children_first.patch does what y

Re: [HACKERS] Parallel bitmap heap scan

2017-02-15 Thread Robert Haas
On Wed, Feb 15, 2017 at 7:17 AM, Dilip Kumar  wrote:
> This is easily doable, and it will look much cleaner. While doing it
> I am facing one problem related to relptr_store and relptr_access.
> The problem is that when I call "relptr_store(base, rp, val)" with
> base and val being the same address, it stores the offset as 0, which
> is fine; but when we use relptr_access with a 0 offset, it returns
> NULL instead of giving the actual address. I expect it to return the
> base directly when the offset is 0. (I am facing this problem because
> in this case (TBM_ONE_PAGE) there will be only one page, whose address
> will be the same as the base.)
>
> There are multiple possible solutions to this, but none of them looks
> clean to me.
> sol1: If relptr_access returns NULL, directly use the base address as
> our PagetableEntry.
> sol2: Instead of using base as the start of the element array, use
> some modified base, e.g. base = base - some offset.
> sol3: Change relptr_access to not return NULL in this case. But this
> would change the current behaviour of the interface, and we would
> need to analyze the side effects.

Hmm, yeah, that's a problem.  How about not using relative pointers
here, and instead just using array indexes?

>> I don't see why you think you need to add sizeof(dsa_pointer) to the
>> allocation and store an extra copy of the dsa_pointer in the
>> additional space.   You are already storing it in tbm->dsa_data and
>> that seems good enough.  pagetable_free() needs the value, but it has
>> a pointer to the TIDBitmap and could just pass tbm->dsa_data directly
>> instead of fishing the pointer out of the extra space.
>
> If you look at the code of SH_GROW, it first needs to allocate the
> bigger chunk, copy the data from the smaller chunk into it, and then
> free the smaller chunk.

Oh, I see.

> So by the time it calls pagetable_free, it has already called
> pagetable_allocate for the newer, bigger chunk of memory; so
> tbm->dsa_data now points to the latest memory, but pagetable_free
> wants to free the older one.
>
> One solution could be to keep two dsa_pointers in the TBM, one
> pointing to the old memory and the other to the new. (After this we
> will hit the same relptr_access problem again, because the first
> entry's pointer will be the same as the base pointer.)

Yes, two dsa_pointers seems fine.  Maybe that's not totally beautiful,
but it seems better than what you had before.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

2017-02-15 Thread Dilip Kumar
On Tue, Feb 14, 2017 at 10:18 PM, Robert Haas  wrote:
> Thanks, the external interface to this looks much cleaner now.
> Scrutinizing the internals:

Thanks for the comments; I am working on them. I have a few doubts
about some of the comments.

> I suggest removing both "entry1" and "status" from
> TBMSharedInfo and making tbm_prepare_shared_iterate smarter to
> compensate.

This is easily doable, and it will look much cleaner. While doing it
I am facing one problem related to relptr_store and relptr_access.
The problem is that when I call "relptr_store(base, rp, val)" with
base and val being the same address, it stores the offset as 0, which
is fine; but when we use relptr_access with a 0 offset, it returns
NULL instead of giving the actual address. I expect it to return the
base directly when the offset is 0. (I am facing this problem because
in this case (TBM_ONE_PAGE) there will be only one page, whose address
will be the same as the base.)

One might ask why we did not face this issue while converting the
pagetable to the page array, because at least one entry will be there
whose address should be the same as the base. But coincidentally we
did not hit that problem, because in that case, as of now, we were
storing a dsa_pointer at the head of the element array, so the first
pagetable element was not the same as the base.

There are multiple possible solutions to this, but none of them looks
clean to me.
sol1: If relptr_access returns NULL, directly use the base address as
our PagetableEntry.
sol2: Instead of using base as the start of the element array, use
some modified base, e.g. base = base - some offset.
sol3: Change relptr_access to not return NULL in this case. But this
would change the current behaviour of the interface, and we would
need to analyze the side effects.
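
A standalone toy that reproduces the ambiguity (this only mimics the
relptr semantics; it is not the real macros from
src/include/utils/relptr.h):

#include <stdio.h>
#include <stddef.h>

typedef struct { size_t off; } toy_relptr;

static void toy_relptr_store(char *base, toy_relptr *rp, void *val)
{
    rp->off = (char *) val - base;      /* base itself => off == 0 */
}

static void *toy_relptr_access(char *base, toy_relptr rp)
{
    return rp.off == 0 ? NULL : base + rp.off;  /* 0 doubles as "null" */
}

int main(void)
{
    char arena[64];
    toy_relptr rp;

    toy_relptr_store(arena, &rp, arena);
    /* prints a null pointer, not &arena[0] -- that is the problem */
    printf("%p\n", toy_relptr_access(arena, rp));
    return 0;
}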

>
> I don't see why you think you need to add sizeof(dsa_pointer) to the
> allocation and store an extra copy of the dsa_pointer in the
> additional space.   You are already storing it in tbm->dsa_data and
> that seems good enough.  pagetable_free() needs the value, but it has
> a pointer to the TIDBitmap and could just pass tbm->dsa_data directly
> instead of fishing the pointer out of the extra space.

If you look at the code of SH_GROW, it first needs to allocate the
bigger chunk, copy the data from the smaller chunk into it, and then
free the smaller chunk.

So by the time it calls pagetable_free, it has already called
pagetable_allocate for the newer, bigger chunk of memory; so
tbm->dsa_data now points to the latest memory, but pagetable_free
wants to free the older one.

One solution could be to keep two dsa_pointers in the TBM, one
pointing to the old memory and the other to the new. (After this we
will hit the same relptr_access problem again, because the first
entry's pointer will be the same as the base pointer.)

Please share your thoughts.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel bitmap heap scan

2017-02-14 Thread Robert Haas
On Tue, Feb 14, 2017 at 12:18 AM, Dilip Kumar  wrote:
> Fixed.

Thanks, the external interface to this looks much cleaner now.
Scrutinizing the internals:

What is the point of having a TBMSharedIterator contain a TIDBitmap
pointer?  All the values in that TIDBitmap are populated from the
TBMSharedInfo, but it seems to me that the fields that are copied over
unchanged (like status and nchunks) could just be used directly from
the TBMSharedInfo, and the fields that are computed (like relpages and
relchunks) could be stored directly in the TBMSharedIterator.
tbm_shared_iterate() is already separate code from tbm_iterate(), so
it can be changed to refer to whichever data structure contains the
data it needs.

Why is TBMSharedInfo separate from TBMSharedIteratorState?  It seems
like you could merge those two into a single structure.

I think we can simplify things here a bit by having shared iterators
not support single-page mode.  In the backend-private case,
tbm_begin_iterate() really doesn't need to do anything with the
pagetable in the TBM_ONE_PAGE case, but for a shared iterator, we've
got to anyway copy the pagetable into shared memory.  So it seems to
me that it would be simpler just to transform it into a standard
iteration array while we're at it, instead of copying it into entry1.
In other words, I suggest removing both "entry1" and "status" from
TBMSharedInfo and making tbm_prepare_shared_iterate smarter to
compensate.

I think "dsa_data" is poorly named; it should be something like
"dsapagetable" in TIDBitmap and just "pagetable" in TBMSharedInfo.  I
think you should should move the "base", "relpages", and "relchunks"
into TBMSharedIterator and give them better names, like "ptbase",
"ptpages" and "ptchunks".  "base" isn't clear that we're talking about
the pagetable's base as opposed to anything else, and "relpages" and
"relchunks" are named based on the fact that the pointers are relative
rather than named based on what data they point at, which doesn't seem
like the right choice.

I suggest putting the parallel functions next to each other in the
file: tbm_begin_iterate(), tbm_prepare_shared_iterate(),
tbm_iterate(), tbm_shared_iterate(), tbm_end_iterate(),
tbm_end_shared_iterate().

+    if (tbm->dsa == NULL)
+        return pfree(pointer);

Don't try to return a value of type void.  The correct spelling of
this is { pfree(pointer); return; }.  Formatted appropriately across 4
lines, of course.

+/*
+ * If TBM is in iterating phase that means pagetable is already created
+ * and we have come here during tbm_free. By this time we are already
+ * detached from the DSA because the GatherNode would have been shutdown.
+ */
+    if (tbm->iterating)
+        return;

This seems like something of a kludge, although not a real easy one to
fix.  The problem here is that tidbitmap.c ideally shouldn't have to
know that it's being used by the executor or that there's a Gather
node involved, and certainly not the order of operations around
shutdown.  It should really be the problem of 0002 to handle this kind
of problem, without 0001 having to know anything about it.  It strikes
me that it would probably be a good idea to have Gather clean up its
children before it gets cleaned up itself.  So in ExecShutdownNode, we
could do something like this:

diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 0dd95c6..5ccc2e8 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -815,6 +815,8 @@ ExecShutdownNode(PlanState *node)
     if (node == NULL)
         return false;
 
+    planstate_tree_walker(node, ExecShutdownNode, NULL);
+
     switch (nodeTag(node))
     {
         case T_GatherState:
@@ -824,5 +826,5 @@ ExecShutdownNode(PlanState *node)
             break;
     }
 
-    return planstate_tree_walker(node, ExecShutdownNode, NULL);
+    return false;
 }

Also, in ExecEndGather, something like this:

diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index a1a3561..32c97d3 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -229,10 +229,10 @@ ExecGather(GatherState *node)
 void
 ExecEndGather(GatherState *node)
 {
+    ExecEndNode(outerPlanState(node));  /* let children clean up first */
     ExecShutdownGather(node);
     ExecFreeExprContext(&node->ps);
     ExecClearTuple(node->ps.ps_ResultTupleSlot);
-    ExecEndNode(outerPlanState(node));
 }

 /*

With those changes and an ExecShutdownBitmapHeapScan() called from
ExecShutdownNode(), it should be possible (I think) for us to always
have the bitmap heap scan node shut down before the Gather node shuts
down, which I think would let you avoid having a special case for this
inside the TBM code.

+    char       *ptr;
+    dsaptr = dsa_allocate(tbm->dsa, size + sizeof(dsa_pointer));
+    tbm->dsa_data = dsaptr;
+    ptr = dsa_get_address(tbm->dsa, dsaptr);
+    memset(ptr, 0, size + sizeof(ds

Re: [HACKERS] Parallel bitmap heap scan

2017-02-13 Thread Dilip Kumar
On Tue, Feb 14, 2017 at 6:51 AM, Haribabu Kommi
 wrote:
> +#if SIZEOF_DSA_POINTER == 4
> +typedef uint32 dsa_pointer;
> +#else
> +typedef uint64 dsa_pointer;
> +#endif
>
> I feel the declaration of the above typedef can be moved into the
> section above, if we are going with the current move into the
> postgres.h file.

Robert has already posted a patch [1] for this, so we don't need to do
anything here. Therefore, I included dsa.h in tidbitmap.h.

[1] 
https://www.postgresql.org/message-id/CA%2BTgmoZ%3DF%3DGkxV0YEv-A8tb%2BAEGy_Qa7GSiJ8deBKFATnzfEug%40mail.gmail.com
>
> +/*
> + * tbm_alloc_shared
> + *
> + * Callback function for allocating the memory for hashtable elements.
> + * It allocates memory from DSA if tbm holds a reference to a dsa.
> + */
> +static inline void *
> +pagetable_allocate(pagetable_hash *pagetable, Size size)
>
> Function name and comment mismatch?

Fixed.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


0001-tidbitmap-support-shared-v2.patch
Description: Binary data


0002-parallel-bitmap-heapscan-v2.patch
Description: Binary data



Re: [HACKERS] Parallel bitmap heap scan

2017-02-13 Thread Robert Haas
On Mon, Feb 13, 2017 at 8:48 AM, Dilip Kumar  wrote:
> On Mon, Feb 13, 2017 at 6:24 PM, Robert Haas  wrote:
>> I don't think it's acceptable (or necessary) to move the DSA
>> definitions into postgres.h.  Why do you think you need to do that,
>> vs. just including dsa.h in a few more places?
>
> I need to access dsa_pointer in tidbitmap.h, which is included from
> FRONTEND code as well. Now, the problem is that dsa.h includes
> "port/atomics.h", but atomics.h cannot be included if FRONTEND is
> defined.
>
> #ifndef ATOMICS_H
> #define ATOMICS_H
> #ifdef FRONTEND
> #error "atomics.h may not be included from frontend code"
> #endif
>
> Is there any other solution to this?

Well, any problem like this generally has a bunch of solutions, so
I'll say yes.  I spent a good chunk of today studying the issue and
started a new thread devoted specifically to it:

https://www.postgresql.org/message-id/CA%2BTgmoZ%3DF%3DGkxV0YEv-A8tb%2BAEGy_Qa7GSiJ8deBKFATnzfEug%40mail.gmail.com

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

2017-02-13 Thread Haribabu Kommi
On Tue, Feb 14, 2017 at 12:48 AM, Dilip Kumar  wrote:

> On Mon, Feb 13, 2017 at 6:24 PM, Robert Haas 
> wrote:
> > I don't think it's acceptable (or necessary) to move the DSA
> > definitions into postgres.h.  Why do you think you need to do that,
> > vs. just including dsa.h in a few more places?
>
> I need to access dsa_pointer in tidbitmap.h, which is included from
> FRONTEND code as well. Now, the problem is that dsa.h includes
> "port/atomics.h", but atomics.h cannot be included if FRONTEND is
> defined.
>
> #ifndef ATOMICS_H
> #define ATOMICS_H
> #ifdef FRONTEND
> #error "atomics.h may not be included from frontend code"
> #endif
>
> Is there any other solution to this?


How about creating another header file with the parallel changes and
including it only in the necessary places?

The following are my observations while going through the patch.

+#if SIZEOF_DSA_POINTER == 4
+typedef uint32 dsa_pointer;
+#else
+typedef uint64 dsa_pointer;
+#endif

I feel the declaration of the above typedef can be moved into the
section above, if we are going with the current move into the
postgres.h file.

+/*
+ * tbm_alloc_shared
+ *
+ * Callback function for allocating the memory for hashtable elements.
+ * It allocates memory from DSA if tbm holds a reference to a dsa.
+ */
+static inline void *
+pagetable_allocate(pagetable_hash *pagetable, Size size)

Function name and comment mismatch?


Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] Parallel bitmap heap scan

2017-02-13 Thread Dilip Kumar
On Mon, Feb 13, 2017 at 6:24 PM, Robert Haas  wrote:
> I don't think it's acceptable (or necessary) to move the DSA
> definitions into postgres.h.  Why do you think you need to do that,
> vs. just including dsa.h in a few more places?

I need to access dsa_pointer in tidbitmap.h, which is included from
FRONTEND code as well. Now, the problem is that dsa.h includes
"port/atomics.h", but atomics.h cannot be included if FRONTEND is
defined.

#ifndef ATOMICS_H
#define ATOMICS_H
#ifdef FRONTEND
#error "atomics.h may not be included from frontend code"
#endif

Is there any other solution to this?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel bitmap heap scan

2017-02-13 Thread Robert Haas
On Sat, Feb 11, 2017 at 1:41 AM, Dilip Kumar  wrote:
> I tried my best, please check the latest patch (0001).

I don't think it's acceptable (or necessary) to move the DSA
definitions into postgres.h.  Why do you think you need to do that,
vs. just including dsa.h in a few more places?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

2017-02-10 Thread Dilip Kumar
On Wed, Feb 8, 2017 at 8:07 AM, Robert Haas  wrote:

Thanks for the detailed review and your valuable feedback.

> I think it would be useful to break the remaining patch into two
> parts, one that enhances the tidbitmap.c API and another that uses
> that to implement Parallel Bitmap Heap Scan.

I have done that.

> There are several cosmetic problems here.  You may have noticed that
> all function prototypes in PostgreSQL header files are explicitly
> declared extern; yours should be, too.  Also, there is extra
> whitespace before some of the variable names here, like
> "ParallelTBMInfo * tbminfo" instead of "ParallelTBMInfo *tbminfo". If
> that is due to pgindent, the solution is to add the typedef names to
> your local typedef list.  Also, tbm_restore_shared_members doesn't
> actually exist.

Fixed

> 1. Add an additional argument to tbm_create(), dsa_area *dsa.  If it's
> NULL, we allocate a backend-private memory for the hash entries as
> normal.  If it's true, we use the dsa_area to store the hash entries,
> using the infrastructure added by your 0002 and revised in
> c3c4f6e1740be256178cd7551d5b8a7677159b74.  (You can use a flag in the
> BitmapIndexScan and BitmapOr to decide whether to pass NULL or
> es_query_dsa.)
Done

>
> 2. Add a new function dsa_pointer tbm_prepare_shared_iterate(TIDBitmap
> *tbm) which allocates shared iteration arrays using the DSA passed to
> tbm_create() and returns a dsa_pointer to the result.  Arrange this so
> that if it's called more than once, each call returns a separate
> iterator, so that you can call it once to get the main iterator and a
> second time for the prefetch iterator, but have both of those point to
> the same underlying iteration arrays.
>
> 3. Add a new function TBMSharedIterator
> *tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer) which is called
> once per backend and gets passed the dsa_pointer from the previous
> step.
>
> 4. Add a new function TBMIterateResult
> *tbm_shared_iterate(TBMSharedIterator *) to fetch the next result.

I have implemented these three APIs as you explained, with one
change: in tbm_attach_shared_iterate I need to pass the TBM as well,
because each worker creates its own copy of the TBM and needs to get
the TBM-related information from the shared location. Although I try
to access some of the information, like nchunks and npages, directly
from the shared location, other information, like the base pointer to
the DSA element array, RelptrPagetableEntry, etc., should be local to
each worker (after conversion from a dsa_pointer to a local pointer).
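
So the intended call pattern in the executor looks roughly like this
(a sketch; DSM plumbing and error handling omitted, and the extra tbm
argument is the change I mentioned above):

/* in the leader, after the bitmap has been built: */
dsa_pointer dp = tbm_prepare_shared_iterate(tbm);
/* ... dp is passed to the workers through the parallel DSM ... */

/* in every participant (leader and workers): */
TBMSharedIterator *it = tbm_attach_shared_iterate(dsa, dp, tbm);
TBMIterateResult *tbmres;

while ((tbmres = tbm_shared_iterate(it)) != NULL)
{
    /* fetch and scan heap block tbmres->blockno ... */
}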

>
> As compared with your proposal, this involves only 3 functions instead
> of 7 (plus one new argument to another function), and I think it's a
> lot clearer what those functions are actually doing.  You don't need
> tbm_estimate_parallel_tbminfo() any more because the data being passed
> from one backend to another is precisely a dsa_pointer, and the bitmap
> scan can just leave space for that in its own estimator function.  You
> don't need tbm_update_shared_members() separately from
> tbm_begin_shared_iterate() separately from tbm_init_shared_iterator()
> because tbm_prepare_shared_iterate() can do all that stuff.  You don't
> need tbm_set_parallel() because the additional argument to
> tbm_create() takes care of that need.

Right

>
> The way you've got ParallelTBMInfo set up right now doesn't respect
> the separation of concerns between nodeBitmapHeapscan.c and
> tidbitmap.c properly. tidbitmap.c shouldn't know that the caller
> intends on creating two iterators, one of which is for prefetching.
> The design above hopefully addresses that: each call to
> tbm_prepare_shared_iterate returns a dsa_pointer to a separate chunk
> of shared iterator state, but tidbitmap.c does not need to know how
> many times that will be called.

Done
>
> Apart from the above, this patch will need a rebase over today's
> commits,

Done

> and please make sure all functions you add have high-quality
> header comments.

I tried my best; please check the latest patch (0001).

Apart from those, there are some more changes.
1. I have moved the dsa_pointer and dsa_area declarations from dsa.h
to postgres.h; an independent patch is attached for that. This is
because we need to declare function prototypes with dsa_pointer in
tidbitmap.h, but tidbitmap.h is also used in FRONTEND code and
therefore cannot include dsa.h (as it contains atomics.h).

2. I noticed one defect in my code related to handling TBM_ONE_PAGE.
In the initial version there was no problem, but it got introduced
somewhere in the intermediate versions.

I have fixed it. There were two options for doing that:

1. Don't switch to TBM_ONE_PAGE in parallel mode (ideally this cannot
happen; only a bad estimate can get us there) and go directly to
TBM_HASH.
2. Keep space in the shared information for sharing a PagetableEntry.

I have implemented the 2nd (in the initial versions I had implemented
the 1st).

Note: The patch is also rebased on top of guc_parallel_index.

Re: [HACKERS] Parallel bitmap heap scan

2017-02-09 Thread Robert Haas
On Wed, Feb 8, 2017 at 10:59 AM, Robert Haas  wrote:
> Looks good to me.  If nobody has further ideas here, I'll push this
> and your previous patch tomorrow.

Done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

2017-02-08 Thread Robert Haas
On Wed, Feb 8, 2017 at 8:58 AM, Dilip Kumar  wrote:
> On Wed, Feb 8, 2017 at 7:01 PM, Robert Haas  wrote:
>> You can store whatever you want in SH_TYPE's private_data member.
>> SH_ALLOCATE and SH_FREE both get a pointer to the SH_TYPE, so they
>> have access to that.  Hmm, but there's no way to get that set in
>> SH_CREATE before SH_ALLOCATE is called.  Maybe we need to add a
>> private_data argument to SH_CREATE.  execGrouping.c could use that
>> instead of frobbing private_data directly:
>>
>> -    hashtable->hashtab = tuplehash_create(tablecxt, nbuckets);
>> -    hashtable->hashtab->private_data = hashtable;
>> +    hashtable->hashtab = tuplehash_create(tablecxt, nbuckets, hashtable);
>
> Okay, will go ahead as you suggested. Patch attached for the same.

Looks good to me.  If nobody has further ideas here, I'll push this
and your previous patch tomorrow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

2017-02-08 Thread Dilip Kumar
On Wed, Feb 8, 2017 at 7:01 PM, Robert Haas  wrote:
> You can store whatever you want in SH_TYPE's private_data member.
> SH_ALLOCATE and SH_FREE both get a pointer to the SH_TYPE, so they
> have access to that.  Hmm, but there's no way to get that set in
> SH_CREATE before SH_ALLOCATE is called.  Maybe we need to add a
> private_data argument to SH_CREATE.  execGrouping.c could use that
> instead of frobbing private_data directly:
>
> -hashtable->hashtab = tuplehash_create(tablecxt, nbuckets);
> -hashtable->hashtab->private_data = hashtable;
> +hashtable->hashtab = tuplehash_create(tablecxt, nbuckets, hashtable);

Okay, will go ahead as you suggested. Patch attached for the same.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


hash_create_fix.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-02-08 Thread Robert Haas
On Wed, Feb 8, 2017 at 1:59 AM, Dilip Kumar  wrote:
> IIUC, tbm_prepare_shared_iterate will be called only by the leader,
> for tbmiterator as well as for the prefetch_iterator. And,
> tbm_attach_shared_iterate will be called by each backend and for both
> the iterators.

That's what I had in mind.

> IMHO, tbm_attach_shared_iterate should return TBMIterator with
> reference to TBMSharedIterator inside it. The reason behind same is
> that we can not keep TBMIterateResult inside TBMSharedIterator
> otherwise, results will also go in shared memory but we want to have
> local result memory for each worker so that other worker doesn't
> disturb it.

No, I don't agree.  I think TBMSharedIterator should be an unshared
structure created by tbm_attach_shared_iterate, which can internally
contain backend-private state like a TBMIterateResult, and which can
also contain a pointer to the shared-memory structure previously
created by tbm_prepare_shared_iterate.  That thing needs to be called
something other than a TBMSharedIterator, like TBMSharedIterationState
or something.
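
To illustrate the split being described, a minimal sketch (type and
field names are assumptions, not the committed layout;
TBMSharedIterationState is the name floated above):

typedef struct TBMSharedIterationState
{
    /* lives in DSA memory; one copy shared by all participants */
    LWLock      lock;           /* serializes advancing the iterator */
    int         spageptr;       /* next spages index */
    int         schunkptr;      /* next schunks index */
    int         schunkbit;      /* next bit to check in current schunk */
    /* ... dsa_pointers to the shared spages/schunks arrays ... */
} TBMSharedIterationState;

typedef struct TBMSharedIterator
{
    /* backend-private; created by tbm_attach_shared_iterate() */
    TBMSharedIterationState *state;     /* points into shared memory */
    TBMIterateResult output;    /* MUST BE LAST (because variable-size) */
} TBMSharedIterator;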

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-02-08 Thread Robert Haas
On Wed, Feb 8, 2017 at 4:20 AM, Dilip Kumar  wrote:
> The new SH_CREATE(MemoryContext ctx, uint32 nelements) don't have any
> option to supply arguments to it. Our callback functions need access
> to TBM.
>
> Is it expected that if the user of SH_CREATE who doesn't want to pass
> a "MemoryContext" then we can pass arguments instead of ctx?

You can store whatever you want in SH_TYPE's private_data member.
SH_ALLOCATE and SH_FREE both get a pointer to the SH_TYPE, so they
have access to that.  Hmm, but there's no way to get that set in
SH_CREATE before SH_ALLOCATE is called.  Maybe we need to add a
private_data argument to SH_CREATE.  execGrouping.c could use that
instead of frobbing private_data directly:

-hashtable->hashtab = tuplehash_create(tablecxt, nbuckets);
-hashtable->hashtab->private_data = hashtable;
+hashtable->hashtab = tuplehash_create(tablecxt, nbuckets, hashtable);

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-02-08 Thread Robert Haas
On Wed, Feb 8, 2017 at 5:21 AM, Dilip Kumar  wrote:
> On Wed, Feb 8, 2017 at 3:44 AM, Robert Haas  wrote:
>>>> +#ifndef SH_USE_NONDEFAULT_ALLOCATOR
>>>> +
>>>
>>> That should probably be documented in the file header.
>>
>> Right.  OK, did that and a few other cleanups, and committed.
>
> I think we need to have prototype for the default allocator outside of
> #ifndef SH_USE_NONDEFAULT_ALLOCATOR. Because the file e.g. tidbitmap.c
> who wants to use SH_USE_NONDEFAULT_ALLOCATOR will provide the
> allocator function definition but it can not have the declaration of
> those function as that function take SH_TYPE as input and that will be
> only defined once we include the simplehash.h.
>
> So basically we can not declare prototype before including
> simplehash.h for allocator. And, if we don't declare we will get
> "implicit declaration warning" because simplehash itself is using
> those functions.
>
> The solution is simplehash.h, should always declare it, and provide
> the definitions only if SH_USE_NONDEFAULT_ALLOCATOR is not defined.
> Attached patch does that.

Makes sense, will commit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-02-08 Thread Dilip Kumar
On Wed, Feb 8, 2017 at 3:44 AM, Robert Haas  wrote:
>>> +#ifndef SH_USE_NONDEFAULT_ALLOCATOR
>>> +
>>
>> That should probably be documented in the file header.
>
> Right.  OK, did that and a few other cleanups, and committed.

I think we need to have prototypes for the default allocator outside
of #ifndef SH_USE_NONDEFAULT_ALLOCATOR. A file such as tidbitmap.c
that wants to use SH_USE_NONDEFAULT_ALLOCATOR will provide the
allocator function definitions, but it cannot declare those functions
itself, because they take SH_TYPE as input and that type is only
defined once we include simplehash.h.

So basically we cannot declare the allocator prototypes before
including simplehash.h. And if we don't declare them, we will get an
"implicit declaration" warning, because simplehash itself uses those
functions.

The solution is that simplehash.h should always declare the
prototypes, and provide the definitions only if
SH_USE_NONDEFAULT_ALLOCATOR is not defined. The attached patch does
that.
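
Roughly, the shape is as follows (a sketch of the idea, not the exact
patch):

/* prototypes are emitted unconditionally, so the generated hash table
 * code can call the allocator without implicit-declaration warnings */
SH_SCOPE void *SH_ALLOCATE(SH_TYPE *type, Size size);
SH_SCOPE void SH_FREE(SH_TYPE *type, void *pointer);

#ifndef SH_USE_NONDEFAULT_ALLOCATOR

/* default definitions, skipped when the including file (e.g.
 * tidbitmap.c) supplies its own allocator */
SH_SCOPE void *
SH_ALLOCATE(SH_TYPE *type, Size size)
{
    return MemoryContextAllocExtended(type->ctx, size,
                                      MCXT_ALLOC_HUGE | MCXT_ALLOC_ZERO);
}

SH_SCOPE void
SH_FREE(SH_TYPE *type, void *pointer)
{
    pfree(pointer);
}

#endif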


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


hash_allocate_fix.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-02-08 Thread Dilip Kumar
On Wed, Feb 8, 2017 at 3:44 AM, Robert Haas  wrote:
>>> +#ifndef SH_USE_NONDEFAULT_ALLOCATOR
>>> +
>>
>> That should probably be documented in the file header.
>
> Right.  OK, did that and a few other cleanups, and committed.

The new SH_CREATE(MemoryContext ctx, uint32 nelements) doesn't have
any option for supplying arguments to it, and our callback functions
need access to the TBM.

Is it expected that a user of SH_CREATE who doesn't want to pass a
"MemoryContext" can pass its own argument in place of ctx?

Something like this?
if (!tbm->dsa)
   tbm->pagetable = pagetable_create(tbm->mcxt, 128);
else
   tbm->pagetable = pagetable_create((MemoryContext)tbm, 128);

And in the allocation function, we can access this context and cast
it back to the tbm?

As shown below.
static void *
pagetable_allocate(pagetable_hash *pagetable, Size size)
{
    TIDBitmap  *tbm = (TIDBitmap *) pagetable->ctx; /* cast ctx back */
    ...
}


So is it expected to work the way I explained above, or are we
missing an arg parameter to SH_CREATE (and in the SH_TYPE structure),
or do you have some other way in mind?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-02-07 Thread Dilip Kumar
On Wed, Feb 8, 2017 at 8:07 AM, Robert Haas  wrote:

Thanks for your input; I have a few queries about these comments.

> 2. Add a new function dsa_pointer tbm_prepare_shared_iterate(TIDBitmap
> *tbm) which allocates shared iteration arrays using the DSA passed to
> tbm_create() and returns a dsa_pointer to the result.  Arrange this so
> that if it's called more than once, each call returns a separate
> iterator, so that you can call it once to get the main iterator and a
> second time for the prefetch iterator, but have both of those point to
> the same underlying iteration arrays.
>
> 3. Add a new function TBMSharedIterator
> *tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer) which is called
> once per backend and gets passed the dsa_pointer from the previous
> step.

IIUC, tbm_prepare_shared_iterate will be called only by the leader,
for the tbmiterator as well as for the prefetch_iterator, and
tbm_attach_shared_iterate will be called by each backend, for both
iterators.

IMHO, tbm_attach_shared_iterate should return a TBMIterator with a
reference to the TBMSharedIterator inside it. The reason is that we
cannot keep the TBMIterateResult inside the TBMSharedIterator;
otherwise the results would also go into shared memory, but we want
each worker to have local result memory so that other workers don't
disturb it.

Another option is that we change tbm_shared_iterate as explained below:

TBMIterateResult *tbm_shared_iterate(TBMSharedIterator *,
                                     TBMIterateResult *result);

Now, if the result passed to this API is NULL, it will allocate
memory for the result, and that way we will have local result memory;
if it's not NULL, we will use that memory to store the results.
BitmapHeapNode already has a reference to the TBMIterateResult, so we
should have no problem passing this reference to tbm_shared_iterate.
I think this option looks better than the one explained above.

Please suggest.

>
> 4. Add a new function TBMIterateResult
> *tbm_shared_iterate(TBMSharedIterator *) to fetch the next result.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-02-07 Thread Robert Haas
On Sun, Feb 5, 2017 at 9:48 AM, Dilip Kumar  wrote:
> Here are the latest version of the patch, which address all the
> comments given by Robert.

I think it would be useful to break the remaining patch into two
parts, one that enhances the tidbitmap.c API and another that uses
that to implement Parallel Bitmap Heap Scan.  In this review I'm going
to focus on the changes to tidbitmap.c and tidbitmap.h.

+extern void tbm_restore_shared_members(TIDBitmap *tbm,
+   ParallelTBMInfo * stbm);
+extern void tbm_update_shared_members(TIDBitmap *tbm,
+  ParallelTBMInfo * parallel_tbm);
+voidtbm_set_parallel(TIDBitmap *tbm, void *area);
+extern Size tbm_estimate_parallel_tbminfo(Size size);
+TIDBitmap  *tbm_attach(ParallelTBMInfo * parallel_tbm, void *area);
+TBMIterator *tbm_begin_shared_iterate(TIDBitmap *tbm,
+ ParallelTBMInfo * parallel_tbm, bool prefetch);
+TBMIterateResult *tbm_shared_iterate(TBMIterator *iterator);
+voidtbm_init_shared_iterator(ParallelTBMInfo * tbminfo);

There are several cosmetic problems here.  You may have noticed that
all function prototypes in PostgreSQL header files are explicitly
declared extern; yours should be, too.  Also, there is extra
whitespace before some of the variable names here, like
"ParallelTBMInfo * tbminfo" instead of "ParallelTBMInfo *tbminfo". If
that is due to pgindent, the solution is to add the typedef names to
your local typedef list.  Also, tbm_restore_shared_members doesn't
actually exist.

More broadly, this seems like an extremely complicated API.  Even
ignoring the function that doesn't exist, that's still 7 different
functions just for shared iteration, which seems like a lot.  I
suggest the following API:

1. Add an additional argument to tbm_create(), dsa_area *dsa.  If it's
NULL, we allocate a backend-private memory for the hash entries as
normal.  If it's true, we use the dsa_area to store the hash entries,
using the infrastructure added by your 0002 and revised in
c3c4f6e1740be256178cd7551d5b8a7677159b74.  (You can use a flag in the
BitmapIndexScan and BitmapOr to decide whether to pass NULL or
es_query_dsa.)

2. Add a new function dsa_pointer tbm_prepare_shared_iterate(TIDBitmap
*tbm) which allocates shared iteration arrays using the DSA passed to
tbm_create() and returns a dsa_pointer to the result.  Arrange this so
that if it's called more than once, each call returns a separate
iterator, so that you can call it once to get the main iterator and a
second time for the prefetch iterator, but have both of those point to
the same underlying iteration arrays.

3. Add a new function TBMSharedIterator
*tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer) which is called
once per backend and gets passed the dsa_pointer from the previous
step.

4. Add a new function TBMIterateResult
*tbm_shared_iterate(TBMSharedIterator *) to fetch the next result.
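
Put together, an illustrative usage sketch of this proposed API
(shapes assumed from the description above, not committed code):

dsa_area   *dsa = estate->es_query_dsa;     /* per point 1 above */

/* Leader: build the bitmap with DSA-backed hash entries */
TIDBitmap  *tbm = tbm_create(work_mem * 1024L, dsa);
/* ... bitmap index scans fill the bitmap ... */

/* Leader: one shared iterator for the main scan, a second one for
 * prefetching, both pointing at the same iteration arrays */
dsa_pointer it_main = tbm_prepare_shared_iterate(tbm);
dsa_pointer it_prefetch = tbm_prepare_shared_iterate(tbm);

/* Each participant (leader and workers): attach, then iterate */
TBMSharedIterator *si = tbm_attach_shared_iterate(dsa, it_main);
TBMIterateResult *tbmres;

while ((tbmres = tbm_shared_iterate(si)) != NULL)
{
    /* fetch and process heap block tbmres->blockno */
}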

As compared with your proposal, this involves only 3 functions instead
of 7 (plus one new argument to another function), and I think it's a
lot clearer what those functions are actually doing.  You don't need
tbm_estimate_parallel_tbminfo() any more because the data being passed
from one backend to another is precisely a dsa_pointer, and the bitmap
scan can just leave space for that in its own estimator function.  You
don't need tbm_update_shared_members() separately from
tbm_begin_shared_iterate() separately from tbm_init_shared_iterator()
because tbm_prepare_shared_iterate() can do all that stuff.  You don't
need tbm_set_parallel() because the additional argument to
tbm_create() takes care of that need.

The way you've got ParallelTBMInfo set up right now doesn't respect
the separation of concerns between nodeBitmapHeapscan.c and
tidbitmap.c properly. tidbitmap.c shouldn't know that the caller
intends on creating two iterators, one of which is for prefetching.
The design above hopefully addresses that: each call to
tbm_prepare_shared_iterate returns a dsa_pointer to a separate chunk
of shared iterator state, but tidbitmap.c does not need to know how
many times that will be called.

Apart from the above, this patch will need a rebase over today's
commits, and please make sure all functions you add have high-quality
header comments.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-02-07 Thread Robert Haas
On Tue, Feb 7, 2017 at 4:45 PM, Andres Freund  wrote:
> On 2017-02-07 16:36:55 -0500, Robert Haas wrote:
>> On Tue, Feb 7, 2017 at 4:15 PM, Andres Freund  wrote:
>> > FWIW, I think it'd have been better to not add the new callbacks as
>> > parameters to *_create(), but rather have them be "templatized" like the
>> > rest of simplehash.  That'd require that callback to check the context,
>> > to know whether it should use shared memory or not, but that seems fine
>> > to me.  Right now this pushes the head of simplehash above a
>> > cacheline...
>>
>> Something like the attached?
>
> Yes.
>
>> +#ifndef SH_USE_NONDEFAULT_ALLOCATOR
>> +
>
> That should probably be documented in the file header.

Right.  OK, did that and a few other cleanups, and committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-02-07 Thread Andres Freund
On 2017-02-07 16:36:55 -0500, Robert Haas wrote:
> On Tue, Feb 7, 2017 at 4:15 PM, Andres Freund  wrote:
> > FWIW, I think it'd have been better to not add the new callbacks as
> > parameters to *_create(), but rather have them be "templatized" like the
> > rest of simplehash.  That'd require that callback to check the context,
> > to know whether it should use shared memory or not, but that seems fine
> > to me.  Right now this pushes the head of simplehash above a
> > cacheline...
> 
> Something like the attached?

Yes.

> +#ifndef SH_USE_NONDEFAULT_ALLOCATOR
> +

That should probably be documented in the file header.


Thanks!

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-02-07 Thread Robert Haas
On Tue, Feb 7, 2017 at 4:15 PM, Andres Freund  wrote:
> FWIW, I think it'd have been better to not add the new callbacks as
> parameters to *_create(), but rather have them be "templatized" like the
> rest of simplehash.  That'd require that callback to check the context,
> to know whether it should use shared memory or not, but that seems fine
> to me.  Right now this pushes the head of simplehash above a
> cacheline...

Something like the attached?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


rework-simplehash-allocator.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-02-07 Thread Robert Haas
On Tue, Feb 7, 2017 at 4:13 PM, Jeff Janes  wrote:
> I'm getting compiler errors:
>
> In file included from execGrouping.c:47:
> ../../../src/include/lib/simplehash.h:91: error: redefinition of typedef
> 'simplehash_allocate'
> ../../../src/include/lib/simplehash.h:91: note: previous declaration of
> 'simplehash_allocate' was here
> ../../../src/include/lib/simplehash.h:92: error: redefinition of typedef
> 'simplehash_free'
> ../../../src/include/lib/simplehash.h:92: note: previous declaration of
> 'simplehash_free' was here

Thanks, I'll stick an #ifdef guard around that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-02-07 Thread Andres Freund
On 2017-02-07 13:13:43 -0800, Jeff Janes wrote:
> On Tue, Feb 7, 2017 at 1:03 PM, Robert Haas  wrote:
>
> > On Sun, Feb 5, 2017 at 9:48 AM, Dilip Kumar  wrote:
> > >> I think maybe we should rename the functions to element_allocate,
> > >> element_free, and element_allocator_ctx, or something like that.  The
> > >> current names aren't capitalized consistently with other things in
> > >> this module, and putting the word "element" in there would make it
> > >> more clear what the purpose of this thing is.
> > >
> > > Fixed as per the suggestion
> >
> > Eh, not really.  You changed the memory context to be called
> > element_allocator_ctx, rather than changing the args passed to the
> > element allocator to have that name, which is what I had in mind.
> >
> > I did some assorted renaming and other cosmetic improvements and committed
> > 0002.
> >
>
> I'm getting compiler errors:
>
> In file included from execGrouping.c:47:
> ../../../src/include/lib/simplehash.h:91: error: redefinition of typedef
> 'simplehash_allocate'
> ../../../src/include/lib/simplehash.h:91: note: previous declaration of
> 'simplehash_allocate' was here
> ../../../src/include/lib/simplehash.h:92: error: redefinition of typedef
> 'simplehash_free'
> ../../../src/include/lib/simplehash.h:92: note: previous declaration of
> 'simplehash_free' was here

Oh yea, that too - you can't just redefine a typedef like that according
to C89 :(


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-02-07 Thread Andres Freund
On 2017-02-07 16:03:43 -0500, Robert Haas wrote:
> On Sun, Feb 5, 2017 at 9:48 AM, Dilip Kumar  wrote:
> >> I think maybe we should rename the functions to element_allocate,
> >> element_free, and element_allocator_ctx, or something like that.  The
> >> current names aren't capitalized consistently with other things in
> >> this module, and putting the word "element" in there would make it
> >> more clear what the purpose of this thing is.
> >
> > Fixed as per the suggestion
> 
> Eh, not really.  You changed the memory context to be called
> element_allocator_ctx, rather than changing the args passed to the
> element allocator to have that name, which is what I had in mind.
> 
> I did some assorted renaming and other cosmetic improvements and committed 
> 0002.

FWIW, I think it'd have been better to not add the new callbacks as
parameters to *_create(), but rather have them be "templatized" like the
rest of simplehash.  That'd require that callback to check the context,
to know whether it should use shared memory or not, but that seems fine
to me.  Right now this pushes the head of simplehash above a
cacheline...

I'm also doubtful about the naming of the default callbacks:
+/* default memory allocator function */
+static void *
+SH_DEFAULT_ALLOC(Size size, void *args)
+{
+   MemoryContext context = (MemoryContext) args;
+
+   return MemoryContextAllocExtended(context, size,
+ MCXT_ALLOC_HUGE | MCXT_ALLOC_ZERO);
+}

SH_DEFAULT_ALLOC sounds like it's a name that's one of the prefixed
names (like SH_CREATE actually becomes pagetable_create and such) - but
afaics it's not.  Which afaics means that this'll generate symbol
conflicts if one translation unit uses multiple simplehash.h style
hashes.  Afaics these should either be prefixed, or static inline
functions.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-02-07 Thread Jeff Janes
On Tue, Feb 7, 2017 at 1:03 PM, Robert Haas  wrote:

> On Sun, Feb 5, 2017 at 9:48 AM, Dilip Kumar  wrote:
> >> I think maybe we should rename the functions to element_allocate,
> >> element_free, and element_allocator_ctx, or something like that.  The
> >> current names aren't capitalized consistently with other things in
> >> this module, and putting the word "element" in there would make it
> >> more clear what the purpose of this thing is.
> >
> > Fixed as per the suggestion
>
> Eh, not really.  You changed the memory context to be called
> element_allocator_ctx, rather than changing the args passed to the
> element allocator to have that name, which is what I had in mind.
>
> I did some assorted renaming and other cosmetic improvements and committed
> 0002.
>

I'm getting compiler errors:

In file included from execGrouping.c:47:
../../../src/include/lib/simplehash.h:91: error: redefinition of typedef
'simplehash_allocate'
../../../src/include/lib/simplehash.h:91: note: previous declaration of
'simplehash_allocate' was here
../../../src/include/lib/simplehash.h:92: error: redefinition of typedef
'simplehash_free'
../../../src/include/lib/simplehash.h:92: note: previous declaration of
'simplehash_free' was here


gcc version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC)

Cheers,

Jeff


Re: [HACKERS] Parallel bitmap heap scan

2017-02-07 Thread Robert Haas
On Sun, Feb 5, 2017 at 9:48 AM, Dilip Kumar  wrote:
>> I think maybe we should rename the functions to element_allocate,
>> element_free, and element_allocator_ctx, or something like that.  The
>> current names aren't capitalized consistently with other things in
>> this module, and putting the word "element" in there would make it
>> more clear what the purpose of this thing is.
>
> Fixed as per the suggestion

Eh, not really.  You changed the memory context to be called
element_allocator_ctx, rather than changing the args passed to the
element allocator to have that name, which is what I had in mind.

I did some assorted renaming and other cosmetic improvements and committed 0002.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-02-05 Thread Dilip Kumar
On Wed, Feb 1, 2017 at 11:09 PM, Robert Haas  wrote:

Here are the latest versions of the patches, which address all the
comments given by Robert.

> https://www.postgresql.org/message-id/20161017201558.cr34stc6zvxy3...@alap3.anarazel.de
> that you should try to share only the iteration arrays, I believe that
> he meant the results of tbm_begin_iterate() -- that is,
> iterator->spageptr, chunkptr, schunkbit, spages, and schunks.  I'm
> perhaps putting words into his mouth, but what he said was that we
> could avoid sharing "the whole underlying hash".  But the patch set
> you've implemented here doesn't behave that way.  Instead, you've got
> the array containing the hash table elements shared (which is good)
> plus there's a sort of dummy hash table in every worker which copies
> some but not all of the members of the original hash table, leading to
> the otherwise-unnecessary if-test that Haribabu is complaining about.
> So the hash table is kinda-shared-kinda-not-shared, which I don't
> *think* is really what Andres had in mind.
>
> In short, I think SH_COPY (implemented here as pagetable_copy) needs
> to go away.  The work of tbm_begin_iterate() should be done before we
> begin the shared scan and the results of that work should be shared.

I have removed SH_COPY, and now the leader performs
tbm_begin_shared_iterate before waking up the workers. Basically, the
leader creates the page and chunk arrays, and each is an array of
"relptr" relative pointers (offlist suggestion by Robert).

> What happens right now (it appears) is that every backend does that
> work based on the same hash table and we just assume they all get the
> same answer.  And we really do assume that, because
> pbms_parallel_iterate() assumes it can shuttle private state back and
> forth between iterator in different backends and nothing will break;
> but those backends aren't actually using the same iteration arrays.
> They are using different iteration arrays that should have the same
> contents because they were all derived from the same semi-shared hash
> table.  That's pretty fragile, and a waste of CPU cycles if the hash
> table is large (every backend does its own sort).
>
> On a related note, I think it's unacceptable to make the internal
> details of TBMIterator public.  You've moved it from tidbitmap.c to
> tidbitmap.h so that nodeBitmapHeapScan.c can tinker with the guts of
> the TBMIterator, but that's not OK.  Those details need to remain
> private to tidbitmap.c.
Fixed

> pbms_parallel_iterate() is a nasty kludge; we
> need some better solution.  The knowledge of how a shared iterator
> should iterate needs to be private to tidbitmap.c, not off in the
> executor someplace.  And I think the entries need to actually be
> updated in shared memory directly, not copied back and forth between a
> backend-private iterator and a shared iterator.

I have fixed this; now there is a new function called
tbm_shared_iterate which iterates directly using the shared iterator,
so there is no need to copy members back and forth between the local
and shared iterators.

>
> Also, pbms_parallel_iterate() can't hold a spinlock around a call to
> tbm_iterate().  Note the coding rules for spinlocks mentioned in
> spin.h and src/backend/storage/lmgr/README.  I think the right thing
> to do here is to use an LWLock in a new tranche (add an entry to
> BuiltinTrancheIds).

Done that way.
>
> In 0002, you have this:
>
> +tb->alloc = palloc(sizeof(SH_ALLOCATOR));
>
> This should be using MemoryContextAlloc(ctx, ...) rather than palloc.
> Otherwise the allocator object is in a different context from
> everything else in the hash table.  But TBH, I don't really see why we
> want this to be a separate object.  Why not just put
> HashAlloc/HashFree/args into SH_TYPE directly?  That avoids some
> pointer chasing and doesn't really seem to cost anything (except that
> SH_CREATE will grow a slightly longer argument sequence).

Done as suggested
>
> I think maybe we should rename the functions to element_allocate,
> element_free, and element_allocator_ctx, or something like that.  The
> current names aren't capitalized consistently with other things in
> this module, and putting the word "element" in there would make it
> more clear what the purpose of this thing is.

Fixed as per the suggestion

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


0002-hash-support-alloc-free-v16.patch
Description: Binary data


0003-parallel-bitmap-heap-scan-v16.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-02-01 Thread Dilip Kumar
On Wed, Feb 1, 2017 at 11:09 PM, Robert Haas  wrote:

Hi Robert,

Thanks for the review.

> When Andres wrote in
> https://www.postgresql.org/message-id/20161017201558.cr34stc6zvxy3...@alap3.anarazel.de
> that you should try to share only the iteration arrays, I believe that
> he meant the results of tbm_begin_iterate() -- that is,
> iterator->spageptr, chunkptr, schunkbit, spages, and schunks.  I'm
> perhaps putting words into his mouth, but what he said was that we
> could avoid sharing "the whole underlying hash".  But the patch set
> you've implemented here doesn't behave that way.  Instead, you've got
> the array containing the hash table elements shared (which is good)

Here is my analysis of why we selected this instead of just sharing
the iterator:

The problem is that each entry of the iteration array is just a
pointer to a hash entry. So if we shared only the iterator, the
iterator elements would point into the local memory of other workers.
I considered one more option: during tbm_begin_iterate, instead of
keeping a pointer to the hash entry, we could make a copy of it in
the DSA area; but that would keep two copies of each hash table
element for a short duration, which is not good. I think what we are
doing currently is better than that.

> The work of tbm_begin_iterate() should be done before we
> begin the shared scan and the results of that work should be shared.
> What happens right now (it appears) is that every backend does that
> work based on the same hash table and we just assume they all get the
> same answer.

Actually, tbm_begin_iterate processes each element of the hash table
and converts it into the chunk and page arrays; currently this is
done by each worker, and it's pretty cheap. Suppose I try to do this
only in the first worker; then the implementation will look something
like this:

1. First, we need to create the tbm->spages and tbm->schunks arrays,
and both should be arrays of dsa_pointers.
2. Then each worker will process these arrays and convert them into
arrays of its local pointers.
3. With the current solution, where all hash elements are stored in
one large DSA chunk, how are we going to divide them into multiple
dsa_pointers?

I will work on the remaining comments.

> And we really do assume that, because
> pbms_parallel_iterate() assumes it can shuttle private state back and
> forth between iterator in different backends and nothing will break;
> but those backends aren't actually using the same iteration arrays.
> They are using different iteration arrays that should have the same
> contents because they were all derived from the same semi-shared hash
> table.  That's pretty fragile, and a waste of CPU cycles if the hash
> table is large (every backend does its own sort).
>
> On a related note, I think it's unacceptable to make the internal
> details of TBMIterator public.  You've moved it from tidbitmap.c to
> tidbitmap.h so that nodeBitmapHeapScan.c can tinker with the guts of
> the TBMIterator, but that's not OK.  Those details need to remain
> private to tidbitmap.c. pbms_parallel_iterate() is a nasty kludge; we
> need some better solution.  The knowledge of how a shared iterator
> should iterate needs to be private to tidbitmap.c, not off in the
> executor someplace.  And I think the entries need to actually be
> updated in shared memory directly, not copied back and forth between a
> backend-private iterator and a shared iterator.
>
> Also, pbms_parallel_iterate() can't hold a spinlock around a call to
> tbm_iterate().  Note the coding rules for spinlocks mentioned in
> spin.h and src/backend/storage/lmgr/README.  I think the right thing
> to do here is to use an LWLock in a new tranche (add an entry to
> BuiltinTrancheIds).
>
> In 0002, you have this:
>
> +tb->alloc = palloc(sizeof(SH_ALLOCATOR));
>
> This should be using MemoryContextAlloc(ctx, ...) rather than palloc.
> Otherwise the allocator object is in a different context from
> everything else in the hash table.  But TBH, I don't really see why we
> want this to be a separate object.  Why not just put
> HashAlloc/HashFree/args into SH_TYPE directly?  That avoids some
> pointer chasing and doesn't really seem to cost anything (except that
> SH_CREATE will grow a slightly longer argument sequence).
>
> I think maybe we should rename the functions to element_allocate,
> element_free, and element_allocator_ctx, or something like that.  The
> current names aren't capitalized consistently with other things in
> this module, and putting the word "element" in there would make it
> more clear what the purpose of this thing is.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-02-01 Thread Robert Haas
On Tue, Jan 31, 2017 at 6:05 AM, Dilip Kumar  wrote:
>> 0002-hash-support-alloc-free-v14.patch:
>>
>>
>> + if (tb->alloc)
>> + {
>>
>> The memory for tb->alloc is allocated always, is the if check still
>> required?
>
> In parallel case, only first worker will call SH_CREATE, other worker
> will only do palloc for pagetable and copy the reference from main
> worker, so they will not have allocator.

When Andres wrote in
https://www.postgresql.org/message-id/20161017201558.cr34stc6zvxy3...@alap3.anarazel.de
that you should try to share only the iteration arrays, I believe that
he meant the results of tbm_begin_iterate() -- that is,
iterator->spageptr, chunkptr, schunkbit, spages, and schunks.  I'm
perhaps putting words into his mouth, but what he said was that we
could avoid sharing "the whole underlying hash".  But the patch set
you've implemented here doesn't behave that way.  Instead, you've got
the array containing the hash table elements shared (which is good)
plus there's a sort of dummy hash table in every worker which copies
some but not all of the members of the original hash table, leading to
the otherwise-unnecessary if-test that Haribabu is complaining about.
So the hash table is kinda-shared-kinda-not-shared, which I don't
*think* is really what Andres had in mind.

In short, I think SH_COPY (implemented here as pagetable_copy) needs
to go away.  The work of tbm_begin_iterate() should be done before we
begin the shared scan and the results of that work should be shared.
What happens right now (it appears) is that every backend does that
work based on the same hash table and we just assume they all get the
same answer.  And we really do assume that, because
pbms_parallel_iterate() assumes it can shuttle private state back and
forth between iterator in different backends and nothing will break;
but those backends aren't actually using the same iteration arrays.
They are using different iteration arrays that should have the same
contents because they were all derived from the same semi-shared hash
table.  That's pretty fragile, and a waste of CPU cycles if the hash
table is large (every backend does its own sort).

On a related note, I think it's unacceptable to make the internal
details of TBMIterator public.  You've moved it from tidbitmap.c to
tidbitmap.h so that nodeBitmapHeapScan.c can tinker with the guts of
the TBMIterator, but that's not OK.  Those details need to remain
private to tidbitmap.c. pbms_parallel_iterate() is a nasty kludge; we
need some better solution.  The knowledge of how a shared iterator
should iterate needs to be private to tidbitmap.c, not off in the
executor someplace.  And I think the entries need to actually be
updated in shared memory directly, not copied back and forth between a
backend-private iterator and a shared iterator.

Also, pbms_parallel_iterate() can't hold a spinlock around a call to
tbm_iterate().  Note the coding rules for spinlocks mentioned in
spin.h and src/backend/storage/lmgr/README.  I think the right thing
to do here is to use an LWLock in a new tranche (add an entry to
BuiltinTrancheIds).

In 0002, you have this:

+tb->alloc = palloc(sizeof(SH_ALLOCATOR));

This should be using MemoryContextAlloc(ctx, ...) rather than palloc.
Otherwise the allocator object is in a different context from
everything else in the hash table.  But TBH, I don't really see why we
want this to be a separate object.  Why not just put
HashAlloc/HashFree/args into SH_TYPE directly?  That avoids some
pointer chasing and doesn't really seem to cost anything (except that
SH_CREATE will grow a slightly longer argument sequence).
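
For illustration, a minimal sketch of that suggestion (abridged; the
callback names come from the patch under review, and the remaining
members are elided):

typedef struct SH_TYPE
{
    uint64      size;           /* size of data, in elements */
    SH_ELEMENT_TYPE *data;      /* hash buckets */
    MemoryContext ctx;
    /* element allocator folded in, instead of a palloc'd SH_ALLOCATOR */
    void       *(*HashAlloc) (Size size, void *args);
    void        (*HashFree) (void *pointer, void *args);
    void       *args;
    /* ... remaining members unchanged ... */
} SH_TYPE;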

I think maybe we should rename the functions to element_allocate,
element_free, and element_allocator_ctx, or something like that.  The
current names aren't capitalized consistently with other things in
this module, and putting the word "element" in there would make it
more clear what the purpose of this thing is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-01-31 Thread Dilip Kumar
On Tue, Jan 31, 2017 at 7:47 AM, Haribabu Kommi
 wrote:
> Thanks for the update. I have some comments
>
Thanks for the review.

>
> 0002-hash-support-alloc-free-v14.patch:
>
>
> + if (tb->alloc)
> + {
>
> The memory for tb->alloc is allocated always, is the if check still
> required?

In the parallel case, only the first worker will call SH_CREATE;
other workers will only palloc the pagetable and copy the reference
from the main worker, so they will not have an allocator.


> 0003-parallel-bitmap-heap-scan-v14.patch:
>
>
> + * and set parallel flag in lower level bitmap index scan. Later
> + * bitmap index node will use this flag to indicate tidbitmap that
> + * it needs to create an shared page table.
> + */
>
> I feel better to mention, where this flag is used, so that it will be more
> clear.

Done
>
>
> + * Mark tidbitmap as shared and also store DSA area in it.
> + * marking tidbitmap as shared is indication that this tidbitmap
> + * should be created in shared memory. DSA area will be used for
>
> The flag of shared is set in tidbitmap structure itself, but the
> comment is mentioned that tidbitmpa should be created in shared memory.
> I think that is the page table that needs to be created in shared memory.
> Providing some more information here will be helpful.
>
Done

>
> - node->tbmres = tbmres = tbm_iterate(tbmiterator);
> + node->tbmres = tbmres = pbms_parallel_iterate(tbmiterator,
> + pbminfo ? &pbminfo->tbmiterator : NULL);
>
> Instead Passing both normal and paralle iterators to the functions
> and checking inside again for NULL, How about just adding check
> in the caller itself? Or if you prefer the current method, then try to
> explain the details of input in the function header and more description.
>
Done as you said.

>
> + /* Increase prefetch target if it's not yet at the max. */
> + if (node->prefetch_target < node->prefetch_maximum)
>
> I didn't evaluate all scenarios, but the above code may have a problem,
> In case of parallel mode the the prefetch_target is fetched from node
> and not from the pbminfo. Later it gets from the pminfo and update that.
> May be this code needs to rewrite.
>
Fixed it.

>
> + TBMIterator *iterator = node->prefetch_iterator;
>
> Why another variable? Why can't we use the prefetch_iterator itself?
> Currently node->tbmiterator and node->prefetch_iterator are initialized
> irrespective of whether is it a parallel one or not. But while using, there
> is a check to use the parallel one in case of parallel. if it is the case,
> why can't we avoid the initialization itself?

Fixed
>
>
> + else
> + needWait = false;
>
> By default needWait is false. Just set that to true only for the
> case of PBM_INPROGRESS
>

Actually, inside the while loop, suppose we first set needWait = true
when the state is PBM_INPROGRESS and go for ConditionVariableSleep.
After coming out of the sleep, we need to check whether PBM_FINISHED
is now set or whether we need to sleep again; in the former case we
need to reset needWait to false. This could also be done by directly
returning if it's PBM_FINISHED, but I want to keep the logic below
common:
+ SpinLockRelease(&pbminfo->state_mutex);
+
+ /* If we are leader or leader has already created a TIDBITMAP */
+ if (leader || !needWait)
+ break;
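
For clarity, the overall shape of that loop is roughly as follows
(the PBM_* states and state_mutex are from the patch; the PBM_INITIAL
name, the cv member, and the wait-event constant are assumed here,
and ConditionVariablePrepareToSleep/Cancel are omitted for brevity):

for (;;)
{
    bool        leader = false;
    bool        needWait = false;

    SpinLockAcquire(&pbminfo->state_mutex);
    if (pbminfo->state == PBM_INITIAL)
    {
        pbminfo->state = PBM_INPROGRESS;
        leader = true;          /* this participant builds the bitmap */
    }
    else if (pbminfo->state == PBM_INPROGRESS)
        needWait = true;        /* someone else is building it */
    SpinLockRelease(&pbminfo->state_mutex);

    /* If we are leader or leader has already created a TIDBITMAP */
    if (leader || !needWait)
        break;

    /* otherwise sleep, then recheck the state on wakeup */
    ConditionVariableSleep(&pbminfo->cv, WAIT_EVENT_PARALLEL_BITMAP_SCAN);
}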

> + * so that during free we can directly get the dsa_pointe and free it.
>
> dsa_pointe -> dsa_pointer
Done

>
>
> +typedef struct
> +{
> + TIDBitmap  *tbm; /* TIDBitmap we're iterating over */
> + int spageptr; /* next spages index */
> + int schunkptr; /* next schunks index */
> + int schunkbit; /* next bit to check in current schunk */
> + TBMIterateResult output; /* MUST BE LAST (because variable-size) */
> +} TBMIterator;
>
> I didn't find the need of moving this structure. Can you point me where it
> is used.

Because pbms_parallel_iterate needs to access this structure, I moved
it to tidbitmap.h.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


0003-parallel-bitmap-heap-scan-v15.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-01-30 Thread Michael Paquier
On Tue, Jan 31, 2017 at 11:17 AM, Haribabu Kommi
 wrote:
> On Fri, Jan 27, 2017 at 5:32 PM, Dilip Kumar  wrote:
>> On Tue, Jan 24, 2017 at 10:18 AM, Dilip Kumar 
>> wrote:
>> > I have changed as per the comments. 0002 and 0003 are changed, 0001 is
>> > still the same.
>>
>> There is just one line change in 0003 compared to older version, all
>> other patches are the same.
>
> Thanks for the update. I have some comments

This review is too fresh to be addressed, so I have moved this patch
to the next CF.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-01-30 Thread Haribabu Kommi
On Fri, Jan 27, 2017 at 5:32 PM, Dilip Kumar  wrote:

> On Tue, Jan 24, 2017 at 10:18 AM, Dilip Kumar 
> wrote:
> > I have changed as per the comments. 0002 and 0003 are changed, 0001 is
> > still the same.
>
> 2 days back my colleague Rafia, reported one issue (offlist) that
> parallel bitmap node is not scaling as good as other nodes e.g
> parallel sequence scan and parallel index scan.
>
> After some perf analysis, I found that there was one unconditional
> spin lock in parallel bitmap patch which we were taking for checking
> the prefetch target. Basically, we were always taking the lock and
> checking the prefetch_target is reached to prefetch_maximum. So even
> after it will reach to prefetch_maximum we will acquire the lock and
> check after every tuple. I have changed that logic, now I will check
> the condition first if we need to increase the target then will take
> the lock and recheck the condition.
>
> There is just one line change in 0003 compared to older version, all
> other patches are the same.
>
> Some performance data to show that new parallel bitmap patch performs
> way better than the previous version.
> TPCH scale 20, work_mem 64MB, shared buffers 8GB (4 workers)
> machine intel 8 socket machine
>
>       v13 (time in ms)    v14 (time in ms)
> Q6    37065.841           18202.903
> Q14   15229.569           11341.121


Thanks for the update. I have some comments


0002-hash-support-alloc-free-v14.patch:


+ if (tb->alloc)
+ {

The memory for tb->alloc is always allocated; is the if check still
required?



0003-parallel-bitmap-heap-scan-v14.patch:


+ * and set parallel flag in lower level bitmap index scan. Later
+ * bitmap index node will use this flag to indicate tidbitmap that
+ * it needs to create an shared page table.
+ */

I feel it's better to mention where this flag is used, so that it
will be clearer.


+ * Mark tidbitmap as shared and also store DSA area in it.
+ * marking tidbitmap as shared is indication that this tidbitmap
+ * should be created in shared memory. DSA area will be used for

The shared flag is set in the tidbitmap structure itself, but the
comment says that the tidbitmap should be created in shared memory.
I think it is the page table that needs to be created in shared
memory. Providing some more information here would be helpful.


- node->tbmres = tbmres = tbm_iterate(tbmiterator);
+ node->tbmres = tbmres = pbms_parallel_iterate(tbmiterator,
+ pbminfo ? &pbminfo->tbmiterator : NULL);

Instead of passing both the normal and parallel iterators to the
function and checking again for NULL inside, how about just adding
the check in the caller itself? Or, if you prefer the current method,
try to explain the details of the inputs in the function header with
more description.


+ /* Increase prefetch target if it's not yet at the max. */
+ if (node->prefetch_target < node->prefetch_maximum)

I didn't evaluate all scenarios, but the above code may have a
problem: in parallel mode the prefetch_target is fetched from the
node and not from the pbminfo, but later it is fetched from the
pbminfo and updated there. Maybe this code needs a rewrite.


+ TBMIterator *iterator = node->prefetch_iterator;

Why another variable? Why can't we use prefetch_iterator itself?
Currently node->tbmiterator and node->prefetch_iterator are
initialized irrespective of whether it is a parallel scan or not, but
at the point of use there is a check to use the parallel one in the
parallel case. If that is so, why can't we avoid the initialization
itself?


+ else
+ needWait = false;

By default needWait is false; just set it to true only in the
PBM_INPROGRESS case.

+ * so that during free we can directly get the dsa_pointe and free it.

dsa_pointe -> dsa_pointer


+typedef struct
+{
+ TIDBitmap  *tbm; /* TIDBitmap we're iterating over */
+ int spageptr; /* next spages index */
+ int schunkptr; /* next schunks index */
+ int schunkbit; /* next bit to check in current schunk */
+ TBMIterateResult output; /* MUST BE LAST (because variable-size) */
+} TBMIterator;

I didn't see the need for moving this structure. Can you point me to
where it is used?

Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] Parallel bitmap heap scan

2017-01-27 Thread Robert Haas
On Fri, Jan 27, 2017 at 1:32 AM, Dilip Kumar  wrote:
> There is just one line change in 0003 compared to older version, all
> other patches are the same.

I spent some time looking at 0001 (and how those changes are used in
0003) and I thought it looked good, so I committed 0001.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-01-26 Thread Dilip Kumar
On Tue, Jan 24, 2017 at 10:18 AM, Dilip Kumar  wrote:
> I have changed as per the comments. 0002 and 0003 are changed, 0001 is
> still the same.

Two days back my colleague Rafia reported an issue (offlist): the
parallel bitmap node is not scaling as well as other nodes, e.g.
parallel sequential scan and parallel index scan.

After some perf analysis, I found that there was one unconditional
spinlock in the parallel bitmap patch which we were taking to check
the prefetch target. Basically, we were always taking the lock and
checking whether prefetch_target had reached prefetch_maximum; so
even after it reached prefetch_maximum, we would acquire the lock and
check after every tuple. I have changed that logic: now I check the
condition first, and only if we need to increase the target do I take
the lock and recheck the condition.
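
In shape, the change is a double-checked test (the mutex and field
names here are assumptions for illustration, not the exact patch):

/* unlocked read: racy, but the worst case is one extra lock cycle */
if (pbminfo->prefetch_target < node->prefetch_maximum)
{
    SpinLockAcquire(&pbminfo->prefetch_mutex);
    /* recheck under the lock; another worker may have bumped it */
    if (pbminfo->prefetch_target < node->prefetch_maximum)
        pbminfo->prefetch_target++;
    SpinLockRelease(&pbminfo->prefetch_mutex);
}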

There is just a one-line change in 0003 compared to the older
version; all other patches are the same.

Some performance data to show that the new parallel bitmap patch
performs much better than the previous version:
TPCH scale 20, work_mem 64MB, shared buffers 8GB (4 workers)
Machine: an Intel 8-socket machine

      v13 (time in ms)    v14 (time in ms)
Q6    37065.841           18202.903
Q14   15229.569           11341.121


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


0001-opt-parallelcost-refactoring-v14.patch
Description: Binary data


0002-hash-support-alloc-free-v14.patch
Description: Binary data


0003-parallel-bitmap-heap-scan-v14.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-01-23 Thread Dilip Kumar
On Mon, Jan 23, 2017 at 1:52 PM, Haribabu Kommi
 wrote:
> I reviewed 0002-hash-support-alloc-free-v12.patch, some minor comments.
>
> - SH_TYPE*tb;
> - uint64 size;
> + SH_TYPE *tb;
> + uint64 size;
>
> The above change may not be required.
>
> + if (tb->alloc)
> + {
> + tb->alloc->HashFree(tb->data, tb->alloc->args);
> + pfree(tb->alloc);
> + }
>
> The above code tries to free the tb->alloc memory. In case if the user
> has provide the alloc structure to SH_CREATE function and the same
> pointer is stored in the tb structure. And in free function freeing that
> memory may cause problem.
>
> So either explicitly mentioning that the input must a palloc'ed data or
> by default allocate memory and copy the input data into allocated
> memory.

I have made the changes as per the comments. 0002 and 0003 are
changed; 0001 is still the same.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


0001-opt-parallelcost-refactoring-v13.patch
Description: Binary data


0002-hash-support-alloc-free-v13.patch
Description: Binary data


0003-parallel-bitmap-heap-scan-v13.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-01-23 Thread Haribabu Kommi
On Mon, Jan 23, 2017 at 3:42 PM, Dilip Kumar  wrote:

> On Thu, Jan 19, 2017 at 12:26 AM, Robert Haas 
> wrote:
> >> Patch 0001 and 0003 required to rebase on the latest head. 0002 is
> >> still the same.
> >
> > I've committed the first half of 0001.
> Thanks.  0001 and 0003 required rebasing after this commit.


I reviewed 0002-hash-support-alloc-free-v12.patch; some minor comments:

- SH_TYPE*tb;
- uint64 size;
+ SH_TYPE *tb;
+ uint64 size;

The above change may not be required.

+ if (tb->alloc)
+ {
+ tb->alloc->HashFree(tb->data, tb->alloc->args);
+ pfree(tb->alloc);
+ }

The above code tries to free the tb->alloc memory. But the user may
have provided the alloc structure to the SH_CREATE function, in which
case the same pointer is stored in the tb structure, and freeing that
memory in the free function may cause a problem.

So either explicitly mention that the input must be palloc'ed data,
or by default allocate memory and copy the input data into the
allocated memory.

Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] Parallel bitmap heap scan

2017-01-22 Thread Dilip Kumar
On Thu, Jan 19, 2017 at 12:26 AM, Robert Haas  wrote:
>> Patch 0001 and 0003 required to rebase on the latest head. 0002 is
>> still the same.
>
> I've committed the first half of 0001.
Thanks.  0001 and 0003 required rebasing after this commit.



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


0001-opt-parallelcost-refactoring-v12.patch
Description: Binary data


0002-hash-support-alloc-free-v12.patch
Description: Binary data


0003-parallel-bitmap-heap-scan-v12.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-01-18 Thread Robert Haas
On Wed, Jan 18, 2017 at 12:14 AM, Dilip Kumar  wrote:
> On Fri, Jan 13, 2017 at 6:36 PM, Dilip Kumar  wrote:
>>
>> Please verify with the new patch.
>
> Patch 0001 and 0003 required to rebase on the latest head. 0002 is
> still the same.

I've committed the first half of 0001.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-01-17 Thread Dilip Kumar
On Fri, Jan 13, 2017 at 6:36 PM, Dilip Kumar  wrote:
>
> Please verify with the new patch.

Patches 0001 and 0003 required a rebase on the latest head; 0002 is
still the same.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


0001-opt-parallelcost-refactoring-v11.patch
Description: Binary data


0002-hash-support-alloc-free-v11.patch
Description: Binary data


0003-parallel-bitmap-heap-scan-v11.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-01-13 Thread Dilip Kumar
On Thu, Jan 12, 2017 at 5:47 PM, Dilip Kumar  wrote:
>> Hello Dilip,
>> I was trying to test the performance of all the parallel query related
>> patches on TPC-H queries, I found that when parallel hash [1 ]and your
>> patch are applied then Q10 and Q14 were hanged, however, without any
>> one of these patches, these queries are running fine. Following is the
>> stack trace,
>
> Thanks for reporting, I will look into this.
I have a setup for TPC-H scale factor 5, and I could reproduce the
hang you reported with Q10. I have fixed the issue in v10 (only 0003
is changed; the other patches are unchanged). However, after fixing
this issue it crashes with the parallel shared hash patch, and the
call stack is the same as what you reported on the parallel shared
hash thread [1].

[1] 
https://www.postgresql.org/message-id/CAOGQiiOJji9GbLwfT0vBuixdkgAsxzZjTpPxsO6BiTLD--SzYg%40mail.gmail.com

Please verify with the new patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


0003-parallel-bitmap-heap-scan-v10.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-01-12 Thread Dilip Kumar
On Thu, Jan 12, 2017 at 5:22 PM, Rafia Sabih
 wrote:
> Hello Dilip,
> I was trying to test the performance of all the parallel query related
> patches on TPC-H queries, I found that when parallel hash [1 ]and your
> patch are applied then Q10 and Q14 were hanged, however, without any
> one of these patches, these queries are running fine. Following is the
> stack trace,

Thanks for reporting, I will look into this.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-01-12 Thread Rafia Sabih
On Wed, Jan 11, 2017 at 5:33 PM, tushar  wrote:
>
> On 01/10/2017 05:16 PM, Dilip Kumar wrote:
>>
>>   Please try attached patch and confirm from your
>> side.
>
> Thanks,issue seems to be fixed now.
>
>
> --
> regards,tushar
> EnterpriseDB  https://www.enterprisedb.com/
> The Enterprise PostgreSQL Company
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


Hello Dilip,
I was trying to test the performance of all the parallel query related
patches on TPC-H queries. I found that when parallel hash [1] and your
patch are applied, Q10 and Q14 hang; however, without either of these
patches, these queries run fine. Following is the
stack trace,

At the master:

#0  0x3fff8bef7de8 in __epoll_wait_nocancel () from /lib64/power8/libc.so.6
#1  0x104ea184 in WaitEventSetWaitBlock (set=0x10039b2d360,
cur_timeout=-1, occurred_events=0x3fffd4a55a98, nevents=1) at
latch.c:998
#2  0x104e9fc8 in WaitEventSetWait (set=0x10039b2d360,
timeout=-1, occurred_events=0x3fffd4a55a98, nevents=1,
wait_event_info=134217730) at latch.c:950
#3  0x104e9484 in WaitLatchOrSocket (latch=0x3fff85128ab4,
wakeEvents=1, sock=-1, timeout=-1, wait_event_info=134217730) at
latch.c:350
#4  0x104e931c in WaitLatch (latch=0x3fff85128ab4,
wakeEvents=1, timeout=0, wait_event_info=134217730) at latch.c:304
#5  0x1032a378 in gather_readnext (gatherstate=0x10039aeeb98)
at nodeGather.c:393
#6  0x1032a044 in gather_getnext (gatherstate=0x10039aeeb98)
at nodeGather.c:293
#7  0x10329e94 in ExecGather (node=0x10039aeeb98) at nodeGather.c:234
#8  0x103086f0 in ExecProcNode (node=0x10039aeeb98) at
execProcnode.c:521
#9  0x1031f550 in fetch_input_tuple (aggstate=0x10039aed220)
at nodeAgg.c:587
#10 0x103229a4 in agg_retrieve_direct (aggstate=0x10039aed220)
at nodeAgg.c:2211

At one of the worker processes,

#0  0x103879f8 in pagetable_insert (tb=0x10039abcac0,
key=6491, found=0x3fffd4a54b88 "") at
../../../src/include/lib/simplehash.h:571
#1  0x10389964 in tbm_get_pageentry (tbm=0x10039ab5778,
pageno=6491) at tidbitmap.c:876
#2  0x10388608 in tbm_add_tuples (tbm=0x10039ab5778,
tids=0x10039a94c30, ntids=1, recheck=0 '\000') at tidbitmap.c:340
#3  0x100f5e54 in btgetbitmap (scan=0x10039ab3c80,
tbm=0x10039ab5778) at nbtree.c:439
#4  0x100eab7c in index_getbitmap (scan=0x10039ab3c80,
bitmap=0x10039ab5778) at indexam.c:687
#5  0x10328acc in MultiExecBitmapIndexScan
(node=0x10039a98630) at nodeBitmapIndexscan.c:98
#6  0x103088d8 in MultiExecProcNode (node=0x10039a98630) at
execProcnode.c:591
#7  0x10326c70 in BitmapHeapNext (node=0x10039a98770) at
nodeBitmapHeapscan.c:164
#8  0x10316440 in ExecScanFetch (node=0x10039a98770,
accessMtd=0x10326b70 , recheckMtd=0x10327bb4
) at execScan.c:95
#9  0x10316590 in ExecScan (node=0x10039a98770,
accessMtd=0x10326b70 , recheckMtd=0x10327bb4
) at execScan.c:180
#10 0x10327c84 in ExecBitmapHeapScan (node=0x10039a98770) at
nodeBitmapHeapscan.c:623

At other workers,

#0  0x3fff8bef7de8 in __epoll_wait_nocancel () from /lib64/power8/libc.so.6
#1  0x104ea184 in WaitEventSetWaitBlock (set=0x10039a24670,
cur_timeout=-1, occurred_events=0x3fffd4a55ed8, nevents=1) at
latch.c:998
#2  0x104e9fc8 in WaitEventSetWait (set=0x10039a24670,
timeout=-1, occurred_events=0x3fffd4a55ed8, nevents=1,
wait_event_info=134217737) at latch.c:950
#3  0x1051a3dc in ConditionVariableSleep (cv=0x3fff7c2d01ac,
wait_event_info=134217737) at condition_variable.c:132
#4  0x103283e0 in pbms_is_leader (pbminfo=0x3fff7c2d0178) at
nodeBitmapHeapscan.c:900
#5  0x10326c38 in BitmapHeapNext (node=0x10039a95f50) at
nodeBitmapHeapscan.c:153
#6  0x10316440 in ExecScanFetch (node=0x10039a95f50,
accessMtd=0x10326b70 , recheckMtd=0x10327bb4
) at execScan.c:95
#7  0x10316590 in ExecScan (node=0x10039a95f50,
accessMtd=0x10326b70 , recheckMtd=0x10327bb4
) at execScan.c:180
#8  0x10327c84 in ExecBitmapHeapScan (node=0x10039a95f50) at
nodeBitmapHeapscan.c:623
#9  0x10308588 in ExecProcNode (node=0x10039a95f50) at
execProcnode.c:443
#10 0x103310d8 in ExecHashJoinOuterGetTuple
(outerNode=0x10039a95f50, hjstate=0x10039a96808,
hashvalue=0x3fffd4a5639c) at nodeHashjoin.c:936

I was using TPC-H with scale factor 20; please let me know if there is
anything more you require in this regard.

[1] 
https://www.postgresql.org/message-id/CAEepm%3D1vGcv6LBrxZeqPb_rPxfraidWAF_8_4z2ZMQ%2B7DOjj9w%40mail.gmail.com
-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/




Re: [HACKERS] Parallel bitmap heap scan

2017-01-11 Thread tushar

On 01/10/2017 05:16 PM, Dilip Kumar wrote:

  Please try the attached patch and confirm from your
side.

Thanks, the issue seems to be fixed now.

--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company





Re: [HACKERS] Parallel bitmap heap scan

2017-01-10 Thread Dilip Kumar
On Tue, Jan 10, 2017 at 2:19 PM, tushar  wrote:
> We found a regression: earlier the testcase was working fine (against the
> older patches of Parallel Bitmap Heap Scan), but now we get a server crash
> against the v8 patches.
>
> Testcase (one of the tables of TPC-H):
>
> postgres=#explain analyze verbose
> SELECT SUM(l_extendedprice) FROM lineitem
> WHERE (l_shipdate >= '1995-01-01'::date)
> AND (l_shipdate <='1996-03-31'::date);

While fixing some of the review comments in v7 and v8, I had allocated
new memory for the pagetable and missed initializing it.

Thanks for reporting. I have fixed it in v9; after the fix, the query
runs fine for me. Please try the attached patch and confirm from your
side.
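
To illustrate the bug: simplehash.h expects the entry array handed back
by its allocator to be zeroed, since a zeroed status byte marks a slot
as empty, so an allocator callback that returns non-zeroed memory
produces exactly this kind of misbehavior. A minimal sketch of a
correct callback (names here are illustrative; the patch's actual code
may differ):

static void *
pagetable_allocate(pagetable_hash *pagetable, Size size)
{
    TIDBitmap  *tbm = (TIDBitmap *) pagetable->private_data;

    /*
     * MCXT_ALLOC_ZERO is the crucial part: a zeroed entry is an empty
     * entry, so the fresh table starts out with no stale-looking slots.
     */
    return MemoryContextAllocExtended(tbm->mcxt, size,
                                      MCXT_ALLOC_HUGE | MCXT_ALLOC_ZERO);
}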

postgres=# explain analyze verbose
SELECT SUM(l_extendedprice) FROM lineitem
WHERE (l_shipdate >= '1995-01-01'::date)
AND (l_shipdate <='1996-03-31'::date);

QUERY PLAN

-----------------------------------------------------------------------------
 Finalize Aggregate  (cost=798002.46..798002.47 rows=1 width=32)
(actual time=15501.245..15501.245 rows=1 loops=1)
   Output: sum(l_extendedprice)
   ->  Gather  (cost=798002.24..798002.45 rows=2 width=32) (actual
time=15494.358..15498.919 rows=3 loops=1)
 Output: (PARTIAL sum(l_extendedprice))
 Workers Planned: 2
 Workers Launched: 2
 ->  Partial Aggregate  (cost=797002.24..797002.25 rows=1
width=32) (actual time=15492.937..15492.937 rows=1 loops=3)
   Output: PARTIAL sum(l_extendedprice)
   Worker 0: actual time=15491.218..15491.219 rows=1 loops=1
   Worker 1: actual time=15493.514..15493.514 rows=1 loops=1
           ->  Parallel Bitmap Heap Scan on public.lineitem  (cost=147461.75..791014.31 rows=2395170 width=8) (actual time=8553.301..15061.333 rows=1892947 loops=3)
 Output: l_extendedprice
 Recheck Cond: ((lineitem.l_shipdate >=
'1995-01-01'::date) AND (lineitem.l_shipdate <= '1996-03-31'::date))
 Rows Removed by Index Recheck: 6451177
 Heap Blocks: exact=27963 lossy=164938
 Worker 0: actual time=8548.957..15054.511
rows=1887239 loops=1
 Worker 1: actual time=8554.817..15050.317
rows=1902477 loops=1
                 ->  Bitmap Index Scan on idx_lineitem_shipdate  (cost=0.00..146024.65 rows=5748409 width=0) (actual time=8533.701..8533.701 rows=5678841 loops=1)
   Index Cond: ((lineitem.l_shipdate >=
'1995-01-01'::date) AND (lineitem.l_shipdate <= '1996-03-31'::date))
 Planning time: 2.742 ms
 Execution time: 15509.696 ms
(21 rows)



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


0001-opt-parallelcost-refactoring-v9.patch
Description: Binary data


0002-hash-support-alloc-free-v9.patch
Description: Binary data


0003-parallel-bitmap-heap-scan-v9.patch
Description: Binary data



Re: [HACKERS] Parallel bitmap heap scan

2017-01-10 Thread tushar

On 01/10/2017 11:29 AM, tushar wrote:

On 01/09/2017 07:22 PM, Dilip Kumar wrote:

Thanks, Tushar. I have fixed it. The defect was in 0002. I have also
observed another issue related to code refactoring: actually, there
was some code present in 0001 which was supposed to be in 0003.

Thanks, I have checked at my end and it is fixed now.

We found a regression: earlier the testcase was working fine (against
the older patches of Parallel Bitmap Heap Scan), but now we get a
server crash against the v8 patches.

Testcase (one of the tables of TPC-H):

postgres=#explain analyze verbose
SELECT SUM(l_extendedprice) FROM lineitem
WHERE (l_shipdate >= '1995-01-01'::date)
AND (l_shipdate <='1996-03-31'::date);
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back 
the current transaction and exit, because another server process exited 
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and 
repeat your command.

server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
connection to server was lost

Here is the stack trace (two core dump files were generated):

[centos@centos-cpula bin]$ gdb -q -c data/core.25434 /home/centos/PG10_10jan/postgresql/edbpsql/bin/postgres
Reading symbols from /home/centos/PG10_10jan/postgresql/edbpsql/bin/postgres...done.
[New Thread 25434]
Missing separate debuginfo for
Try: yum --enablerepo='*-debug*' install /usr/lib/debug/.build-id/7f/719af91ee951b4fcb6647e7868f95f766a616b
Reading symbols from /usr/lib64/libssl.so.10...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libssl.so.10
Reading symbols from /usr/lib64/libcrypto.so.10...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libcrypto.so.10
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libgssapi_krb5.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgssapi_krb5.so.2
Reading symbols from /lib64/libkrb5.so.3...(no debugging symbols found)...done.
Loaded symbols for /lib64/libkrb5.so.3
Reading symbols from /lib64/libcom_err.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libcom_err.so.2
Reading symbols from /lib64/libk5crypto.so.3...(no debugging symbols found)...done.
Loaded symbols for /lib64/libk5crypto.so.3
Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libkrb5support.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib64/libkrb5support.so.0
Reading symbols from /lib64/libkeyutils.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libkeyutils.so.1
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /lib64/libselinux.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libselinux.so.1
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Core was generated by `postgres: bgworker: parallel worker for PID 25433  '.
Program terminated with signal 11, Segmentation fault.
#0  0x006f2fa6 in pagetable_destroy (tb=0x2079bf0) at ../../../src/include/lib/simplehash.h:361
361             tb->alloc->HashFree(tb->data, tb->alloc->args);
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.192.el6.x86_64 keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-57.el6.x86_64 libcom_err-1.41.12-22.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 openssl-1.0.1e-48.el6_8.1.x86_64 zlib-1.2.3-29.el6.x86_64

(gdb) bt
#0  0x006f2fa6 in pagetable_destroy (tb=0x2079bf0) at ../../../src/include/lib/simplehash.h:361
#1  0x006f3b52 in tbm_free (tbm=0x2077fe0) at tidbitmap.c:296
#2  0x006ab29b in ExecEndBitmapHeapScan (node=0x207e760) at nodeBitmapHeapscan.c:717
#3  0x00691701 in ExecEndNode (node=0x207e760) at execProcnode.c:689
#4  0x006a8f86 in ExecEndAgg (node=0x207e878) at nodeAgg.c:3563
#5  0x000

Re: [HACKERS] Parallel bitmap heap scan

2017-01-09 Thread tushar

On 01/09/2017 07:22 PM, Dilip Kumar wrote:

Thanks, Tushar. I have fixed it. The defect was in 0002. I have also
observed another issue related to code refactoring: actually, there
was some code present in 0001 which was supposed to be in 0003.

Thanks, I have checked at my end and it is fixed now.

--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company





Re: [HACKERS] Parallel bitmap heap scan

2017-01-09 Thread Dilip Kumar
On Mon, Jan 9, 2017 at 5:01 PM, tushar  wrote:
> Thanks, Dilip. The issue is reproducible if we use the '--enable-cassert'
> switch in configure. We are able to reproduce it only with '--enable-cassert'.

Thanks, Tushar. I have fixed it. The defect was in 0002. I have also
observed another issue related to code refactoring: actually, there
was some code present in 0001 which was supposed to be in 0003.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


0001-opt-parallelcost-refactoring-v8.patch
Description: Binary data


0002-hash-support-alloc-free-v8.patch
Description: Binary data


0003-parallel-bitmap-heap-scan-v8.patch
Description: Binary data



Re: [HACKERS] Parallel bitmap heap scan

2017-01-09 Thread tushar

On 01/09/2017 04:36 PM, Dilip Kumar wrote:

I have taken the latest code, applied all 3 patches and compiled.
Initdb is working fine for me.

Can you please verify: do you have any extra patch along with my patch?
Did you properly clean the code?
Thanks, Dilip. The issue is reproducible if we use the '--enable-cassert'
switch in configure. We are able to reproduce it only with
'--enable-cassert'.


--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company





Re: [HACKERS] Parallel bitmap heap scan

2017-01-09 Thread Dilip Kumar
On Mon, Jan 9, 2017 at 3:07 PM, tushar  wrote:
> creating directory data ... ok
> creating subdirectories ... ok
> selecting default max_connections ... 100
> selecting default shared_buffers ... 128MB
> selecting dynamic shared memory implementation ... posix
> creating configuration files ... ok
> running bootstrap script ... ok
> performing post-bootstrap initialization ... sh: line 1: 30709 Segmentation
> fault "/home/centos/PG10_9ja/postgresql/edbpsql/bin/postgres" --single -F -O
> -j -c search_path=pg_catalog -c exit_on_error=true template1 > /dev/null
> child process exited with exit code 139
> initdb: removing data directory "data"

I have taken the latest code, applied all 3 patches and compiled.
Initdb is working fine for me.

Can you please verify: do you have any extra patch along with my patch?
Did you properly clean the code?

I have asked one of my colleagues to verify this, and the result is the
same: no crash.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel bitmap heap scan

2017-01-09 Thread tushar

On 01/09/2017 01:05 PM, Dilip Kumar wrote:

This patch can be used by 0003-parallel-bitmap-heap-scan-v7.patch
attached in the mail, and parallel-index-scan [2] can also be rebased on
this, if this gets committed.

After applying your patches against the fresh sources of PG v10, I am
not able to perform initdb:


[centos@tusharcentos7 bin]$ ./initdb -D data
The files belonging to this database system will be owned by user "centos".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

creating directory data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... sh: line 1: 30709 Segmentation fault "/home/centos/PG10_9ja/postgresql/edbpsql/bin/postgres" --single -F -O -j -c search_path=pg_catalog -c exit_on_error=true template1 > /dev/null
child process exited with exit code 139
initdb: removing data directory "data"
[centos@tusharcentos7 bin]$

--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company





Re: [HACKERS] Parallel bitmap heap scan

2017-01-08 Thread Dilip Kumar
On Fri, Jan 6, 2017 at 10:47 AM, Amit Khandekar  wrote:
> This looks good now.

Thanks..
>
>
> Further points below 

Thanks for the review.


> = nodeBItmapHeapscan.c ==
>
>
> In BitmapHeapNext(), the following code is relevant only for tbm
> returned from MultiExecProcNode().
> if (!tbm || !IsA(tbm, TIDBitmap))
> elog(ERROR, "unrecognized result from subplan");

Fixed
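
For reference, the fixed-up shape is simply the serial code path's:

    tbm = (TIDBitmap *) MultiExecProcNode(outerPlanState(node));

    if (!tbm || !IsA(tbm, TIDBitmap))
        elog(ERROR, "unrecognized result from subplan");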

> --
>
> BitmapHeapScanState->is_leader field looks unnecessary. Instead, a
> local variable is_leader in BitmapHeapNext() will solve the purpose.
> This is because is_leader is used only in BitmapHeapNext().

Fixed
>
> --
>
> In BitmapHeapNext(), just before tbm_restore_shared_members() is
> called, we create tbm using tbm_create(). I think tbm_create() does
> not make sense for shared tbm. Whatever fields are required, will be
> restored in tbm_restore_shared_members(). The other fields which do
> not make sense in a restored tbm are not required to be initialized
> using tbm_create(). So I think tbm_restore_shared_members() itself can
> call makeNode(TIDBitmap). (Also it is not required to initialize
> tbm->allocator; see note below in tidbitmap.c section). So
> tbm_restore_shared_members() itself can call tbm_set_parallel().
> Looking at all this, it looks better to have the function name changed
> to tbm_attach(parallel_tbm) or tbm_restore(parallel_tbm) rather than
> tbm_restore_shared_members(). The function header anyways (rightly)
> says  : Attach worker to shared TID bitmap.

Fixed
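
A rough sketch of what tbm_attach() boils down to under this scheme;
the ParallelTIDBitmap field names are illustrative, and only
makeNode(), tbm_set_parallel() and the shared counters come from the
discussion above:

static TIDBitmap *
tbm_attach(ParallelTIDBitmap *parallel_tbm, dsa_area *dsa)
{
    TIDBitmap  *tbm = makeNode(TIDBitmap);

    /* restore only the fields a read-only shared bitmap needs */
    tbm->nentries = parallel_tbm->nentries;
    tbm->npages = parallel_tbm->npages;
    tbm->nchunks = parallel_tbm->nchunks;

    /* mark it shared, so the worker never tries to modify it locally */
    tbm_set_parallel(tbm, dsa);

    return tbm;
}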

>
> -
>
> From what I understand, the leader worker does not have to create its
> iterators before waking up the other workers, as long as it makes sure
> it copies tbm fields into shared memory before waking workers. But in
> the patch, tbm_begin_iterate() is called *before* the
> ConditionVariableBroadcast() is called. So I feel, we can shift the
> code inside the "if (node->is_leader)" to a place inside the "if
> (pbminfo == NULL || pbms_is_leader(pbminfo))" condition block, just
> after MultiExecProcNode() call. (And we won't even need is_leader
> local variable as well). This way now the other workers will start
> working sooner.

Correct, Fixed.

>
>
>
> == tidbitmap.c ===
>
>
> The new #include's are not in alphabetical order.

Done..
>
> --
>
> ParallelTIDBitmap.inited is unused, and I believe, not required.

Fixed
>
> --
>
> For leader worker, the new TIDBitmap fields added for parallel bitmap
> *are* valid while the tid is being built. So the below comment should
> be shifted accordingly :
> /* these are valid when iterating is true: */
> Better still, the shared tbm-related fields can be kept in the end,
> and a comment should be added that these are for shared tbm.
>

Done
> --
>
> It seems, the comment below the last ParallelTIDBitmap field is not relevant :
> /* table to dsa_pointer's array. */

Fixed..
>
> --
>
> tbm->allocator field does not seem to be required. A new allocator can
> be just palloc'ed in tbm_create_pagetable(), and passed to
> pagetable_create(). SH_CREATE() stores this allocator in the
> SH_TYPE->alloc field, and fetches the same whenever it needs it for
> calling any of the allocator functions. So we can remove the
> tbm->allocator field and shift "palloc(sizeof(pagetable_alloc))" call
> from tbm_create() to tbm_create_pagetable().

Done
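
Roughly, the refactored creation path then looks like this; the
three-argument pagetable_create() is assumed from the 0002 allocator
patch, and the callback names follow the HashAlloc/HashFree fields seen
in the stack traces:

static void
tbm_create_pagetable(TIDBitmap *tbm)
{
    pagetable_alloc *allocator = palloc(sizeof(pagetable_alloc));

    /*
     * SH_CREATE() stores the allocator in SH_TYPE->alloc and fetches it
     * whenever it needs to (re)allocate, so TIDBitmap itself no longer
     * needs an allocator field.
     */
    allocator->args = tbm;
    allocator->HashAlloc = tbm_alloc_shared;
    allocator->HashFree = tbm_free_shared;

    tbm->pagetable = pagetable_create(tbm->mcxt, 128, allocator);
    tbm->status = TBM_HASH;
}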
>
> --
>
> In tbm_free() :
> I think we should call pagetable_destroy() even when tbm is shared, so
> that the hash implementation gets a chance to free the hash table
> structure. I understand that the hash table structure itself is not
> big, so it will only be a small memory leak, but it's better to not
> assume that. Instead, let SH_DESTROY() call HashFree(). Then, in
> tbm_free_shared(), we can skip the dsa_free() call if tbm->iterating
> is false. Basically, tbm bitmap implementation should deduce from the
> bitmap state whether it should free the shared data, rather than
> preventing a call to SH_DESTROY().

Fixed
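
Taken literally, the suggested discipline looks like this; the HashFree
signature follows the tb->alloc->HashFree(tb->data, tb->alloc->args)
call in the crash report above, and the field names are illustrative:

static void
tbm_free_shared(void *data, void *args)
{
    TIDBitmap  *tbm = (TIDBitmap *) args;

    /*
     * Invoked from SH_DESTROY() via HashFree.  The bitmap decides from
     * its own state whether the DSA-backed array may be released: per
     * the suggestion above, dsa_free() is skipped while tbm->iterating
     * is false.
     */
    if (tbm->iterating)
        dsa_free(tbm->dsa, tbm->dsa_data);
}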
>
> ---
>
> In tbm_begin_iterate(), for shared tbm, internal structures from
> simplehash.h are assumed to be known. For e.g., the hash entries will
> always be present in one single array, and also the entry status is
> evaluated using pagetable_IN_USE. Is simplehash.h designed keeping in
> mind that these things are suppose to be exposed ?
>
> I understand that the hash table handle is local to the leader worker,
> and so it is not accessible to other workers. And so, we cannot use
> pagetable_iterate() to scan the hash table. So, how about copying the
> SH_TYPE structure and making it accessible to other workers ? If we
> have something like SH_ATTACH() or SH_COPY(), this will copy the
> relevant fields that are sufficient to restore the SH_TYPE structure,
> and other workers can start using this hash table after assigning dsa
> array back to tb->data. Something like HASH_ATTACH used in dynahash.c.

This looks cleaner, and also avoids processing
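
A hypothetical shape for such an SH_COPY()/SH_ATTACH(); nothing by this
name exists in simplehash.h, so this is purely a sketch of the
proposal, reusing the geometry fields simplehash already tracks and an
invented pagetable_shared struct for the leader's published copy:

static pagetable_hash *
pagetable_attach(pagetable_shared *shared, PagetableEntry *dsa_entries)
{
    pagetable_hash *tb = palloc0(sizeof(pagetable_hash));

    /* restore the geometry the leader saved into shared memory */
    tb->size = shared->size;
    tb->sizemask = (uint32) (shared->size - 1);
    tb->members = shared->members;

    /* the entries themselves already live in the DSA-backed array */
    tb->data = dsa_entries;

    return tb;
}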

Re: [HACKERS] Parallel bitmap heap scan

2017-01-05 Thread Amit Khandekar
Sorry for the delay in my next response. I still haven't exhaustively
gone through all changes, but meanwhile, below are some more points.

On 26 November 2016 at 18:18, Dilip Kumar  wrote:
> On Tue, Nov 22, 2016 at 9:05 AM, Dilip Kumar  wrote:
>> On Fri, Nov 18, 2016 at 9:59 AM, Amit Khandekar  
>> wrote:
>>
>> Thanks for the review..
>
> I have worked on these comments..
>>
>>> In pbms_is_leader() , I didn't clearly understand the significance of
>>> the for-loop. If it is a worker, it can call
>>> ConditionVariablePrepareToSleep() followed by
>>> ConditionVariableSleep(). Once it comes out of
>>> ConditionVariableSleep(), isn't it guaranteed that the leader has
>>> finished the bitmap ? If yes, then it looks like it is not necessary
>>> to again iterate and go back through the pbminfo->state checking.
>>> Also, with this, variable queuedSelf also might not be needed. But I
>>> might be missing something here.
>>
>> I have taken this logic from example posted on conditional variable thread[1]
>>
>> for (;;)
>> {
>> ConditionVariablePrepareToSleep(cv);
>> if (condition for which we are waiting is satisfied)
>> break;
>> ConditionVariableSleep();
>> }
>> ConditionVariableCancelSleep();
>>
>> [1] 
>> https://www.postgresql.org/message-id/CA%2BTgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr%3DC56Xng%40mail.gmail.com

With the latest patches, this looks fine to me.
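
For reference, a sketch of how pbms_is_leader() builds on that pattern;
the PBM_* state names and the wait event are illustrative, while the
ConditionVariableSleep() arguments follow the signature visible in the
worker stack traces earlier in the thread:

static bool
pbms_is_leader(ParallelBitmapInfo *pbminfo)
{
    bool        leader = false;

    for (;;)
    {
        ConditionVariablePrepareToSleep(&pbminfo->cv);

        SpinLockAcquire(&pbminfo->state_mutex);
        if (pbminfo->state == PBM_INITIAL)
        {
            pbminfo->state = PBM_INPROGRESS;    /* we build the bitmap */
            leader = true;
        }
        SpinLockRelease(&pbminfo->state_mutex);

        /* done if we became the leader, or if the leader has finished */
        if (leader || pbminfo->state == PBM_FINISHED)
            break;

        ConditionVariableSleep(&pbminfo->cv, WAIT_EVENT_PARALLEL_BITMAP_SCAN);
    }
    ConditionVariableCancelSleep();

    return leader;
}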

>>> tbm_iterate() call under SpinLock :
>>> For parallel tbm iteration, tbm_iterate() is called while SpinLock is
>>> held. Generally we try to keep code inside Spinlock call limited to a
>>> few lines, and that too without occurrence of a function call.
>>> Although tbm_iterate() code itself looks safe under a spinlock, I was
>>> checking if we can squeeze SpinlockAcquire() and SpinLockRelease()
>>> closer to each other. One thought is :  in tbm_iterate(), acquire the
>>> SpinLock before the while loop that iterates over lossy chunks. Then,
>>> if both chunk and per-page data remain, release spinlock just before
>>> returning (the first return stmt). And then just before scanning
>>> bitmap of an exact page, i.e. just after "if (iterator->spageptr <
>>> tbm->npages)", save the page handle, increment iterator->spageptr,
>>> release Spinlock, and then use the saved page handle to iterate over
>>> the page bitmap.
>>
>> The main reason to keep the spinlock out of this function is to avoid
>> changes inside it; also, this function takes a local iterator as
>> input, which doesn't have a spinlock reference. But that can be
>> changed; we can pass the shared iterator to it.
>>
>> I will think about this logic and try to update in next version.
> This issue is still not addressed.
> The logic inside tbm_iterate() uses the same variables, like spageptr,
> in multiple places. IMHO this complete logic needs to be done under
> one spinlock.

I think we both agree that the mutex handle can be passed to
tbm_iterate() to do this. I have yet to give more thought to how
clumsy it will be if we add SpinLockRelease() calls at intermediate
places in the function rather than at the end.
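
One way the narrowed critical section could look: claim the cursor
position under the spinlock, then scan the claimed page after releasing
it. A sketch only; the iterator field names are illustrative:

static PagetableEntry *
tbm_shared_iterate_page(TBMSharedIterator *iterator)
{
    PagetableEntry *page = NULL;

    SpinLockAcquire(&iterator->mutex);
    if (iterator->spageptr < iterator->npages)
    {
        /* claim exactly one exact page for this worker */
        page = iterator->spages[iterator->spageptr];
        iterator->spageptr++;
    }
    SpinLockRelease(&iterator->mutex);

    /* the per-page bitmap in page->words is scanned outside the lock */
    return page;
}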


>
>>
>>>
>>> prefetch_pages() call under Spinlock :
>>> Here again, prefetch_pages() is called while pbminfo->prefetch_mutex
>>> Spinlock is held. Effectively, heavy functions like PrefetchBuffer()
>>> would get called while under the Spinlock. These can even ereport().
>>> One option is to use mutex lock for this purpose. But I think that
>>> would slow things down. Moreover, the complete set of prefetch pages
>>> would be scanned by a single worker, and others might wait for this
>>> one. Instead, what I am thinking is: grab the pbminfo->prefetch_mutex
>>> Spinlock only while incrementing pbminfo->prefetch_pages. The rest
>>> part viz : iterating over the prefetch pages, and doing the
>>> PrefetchBuffer() need not be synchronised using this
>>> pgbinfo->prefetch_mutex Spinlock. pbms_parallel_iterate() already has
>>> its own iterator spinlock. Only thing is, workers may not do the
>>> actual PrefetchBuffer() sequentially. One of them might shoot ahead
>>> and prefetch 3-4 pages while the other is lagging with the
>>> sequentially lesser page number; but I believe this is fine, as long
>>> as they all prefetch all the required blocks.
>>
>> I agree with your point, will try to fix this as well.
>
> I have fixed this part.

This looks good now.
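
Sketched out, the agreed scheme keeps only the shared counter update
under prefetch_mutex, while PrefetchBuffer() runs unlocked; names
beyond pbminfo->prefetch_mutex and pbminfo->prefetch_pages are
illustrative:

static void
pbms_prefetch_one(ParallelBitmapInfo *pbminfo,
                  TBMIterator *prefetch_iterator, Relation heapRel)
{
    bool        claimed = false;

    /* the spinlock covers only the bookkeeping, never the I/O */
    SpinLockAcquire(&pbminfo->prefetch_mutex);
    if (pbminfo->prefetch_pages < pbminfo->prefetch_target)
    {
        pbminfo->prefetch_pages++;
        claimed = true;
    }
    SpinLockRelease(&pbminfo->prefetch_mutex);

    if (claimed)
    {
        TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);

        /* workers may issue these out of order; that is fine */
        if (tbmpre != NULL)
            PrefetchBuffer(heapRel, MAIN_FORKNUM, tbmpre->blockno);
    }
}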


Further points below 



= nodeBItmapHeapscan.c ==


In BitmapHeapNext(), the following code is relevant only for tbm
returned from MultiExecProcNode().
if (!tbm || !IsA(tbm, TIDBitmap))
elog(ERROR, "unrecognized result from subplan");

So it should be moved just below MultiExecProcNode(), so that it does
not get called when it is created from tbm_create().

--

BitmapHeapScanState->is_leader field looks unnecessary. Instead, a
local variable is_leader in BitmapHeapNext() will solve the purpose.
This is because is_leader is used only in BitmapHeapNext().

--

In BitmapHeapNext(), just before tbm_restore_shared_members() is
called, we create tbm using tbm_create(). I think tbm_create() does not
make sense for shared tbm. Whatever fields are required will be
restored in tbm_restore_shared_members(). The other fields, which do
not make sense in a restored tbm, are not required to be initialized
using tbm_create().
