Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-03-08 Thread Robert Haas
On Thu, Mar 8, 2018 at 11:45 AM, Tom Lane  wrote:
> Prabhat Sahu  writes:
>> On Wed, Mar 7, 2018 at 7:51 PM, Robert Haas  wrote:
>>> That looks like the background worker got killed by the OOM killer.  How
>>> much memory do you have in the machine where this occurred?
>
>> I ran the testcase on my local machine with the following configuration:
>> Environment: CentOS 7 (64-bit)
>> HD : 100GB
>> RAM: 4GB
>> Processor: 4
>
> If you only have 4GB of physical RAM, it hardly seems surprising that
> trying to use 8GB of maintenance_work_mem would draw the wrath of the
> OOM killer.

Yup.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
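
For readers reproducing this: the whole maintenance_work_mem budget must fit
comfortably in physical RAM, since parallel CREATE INDEX divides that budget
among the participating processes. A minimal sketch of a safer session-level
setting for a 4GB machine (the 1GB figure is illustrative, not a tuned
recommendation):

postgres=# SHOW maintenance_work_mem;
postgres=# SET maintenance_work_mem = '1GB';
postgres=# CREATE INDEX pgb_acc_idx3 ON pgbench_accounts (aid, abalance, filler);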



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-03-08 Thread Tom Lane
Prabhat Sahu  writes:
> On Wed, Mar 7, 2018 at 7:51 PM, Robert Haas  wrote:
>> That looks like the background worker got killed by the OOM killer.  How
>> much memory do you have in the machine where this occurred?

> I ran the testcase on my local machine with the following configuration:
> Environment: CentOS 7 (64-bit)
> HD : 100GB
> RAM: 4GB
> Processor: 4

If you only have 4GB of physical RAM, it hardly seems surprising that
trying to use 8GB of maintenance_work_mem would draw the wrath of the
OOM killer.

regards, tom lane



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-03-08 Thread Prabhat Sahu
On Wed, Mar 7, 2018 at 7:51 PM, Robert Haas  wrote:

> On Wed, Mar 7, 2018 at 8:59 AM, Prabhat Sahu <
> prabhat.s...@enterprisedb.com> wrote:
>>
>> 2018-03-07 19:24:44.263 IST [54400] LOG:  background worker "parallel
>> worker" (PID 54482) was terminated by signal 9: Killed
>>
>
> That looks like the background worker got killed by the OOM killer.  How
> much memory do you have in the machine where this occurred?
>
I ran the testcase on my local machine with the following configuration:

Environment: CentOS 7 (64-bit)
HD : 100GB
RAM: 4GB
Processor: 4

I have narrowed down the testcase as below, which also reproduces the same
crash.

-- GUCs under postgresql.conf
maintenance_work_mem = 8GB

./pgbench -i -s 500 -d postgres

postgres=# create index pgb_acc_idx3 on pgbench_accounts(aid,
abalance,filler);
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

--

With Regards,

Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Corporation

The Postgres Database Company



On Thu, Mar 8, 2018 at 7:12 AM, Andres Freund  wrote:

>
>
> On March 7, 2018 5:40:18 PM PST, Peter Geoghegan  wrote:
> >On Wed, Mar 7, 2018 at 5:16 PM, Tomas Vondra
> > wrote:
> >> FWIW that's usually written to the system log. Does dmesg say
> >something
> >> about the kill?
> >
> >While it would be nice to confirm that it was indeed the OOM killer,
> >either way the crash happened because SIGKILL was sent to a parallel
> >worker. There is no reason to suspect a bug.
>
> Not impossible there's a leak somewhere though.
>
> Andres
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-03-07 Thread Andres Freund


On March 7, 2018 5:40:18 PM PST, Peter Geoghegan  wrote:
>On Wed, Mar 7, 2018 at 5:16 PM, Tomas Vondra
> wrote:
>> FWIW that's usually written to the system log. Does dmesg say
>something
>> about the kill?
>
>While it would be nice to confirm that it was indeed the OOM killer,
>either way the crash happened because SIGKILL was sent to a parallel
>worker. There is no reason to suspect a bug.

Not impossible there's a leak somewhere though.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
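
On Linux, OOM-killer activity lands in the kernel log, so either of these
should confirm or rule it out (message wording varies across kernel versions):

$ dmesg | grep -iE 'out of memory|killed process'
$ journalctl -k | grep -i 'killed process'    # on systemd-based systems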



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-03-07 Thread Robert Haas
On Wed, Mar 7, 2018 at 8:59 AM, Prabhat Sahu 
wrote:
>
> 2018-03-07 19:24:44.263 IST [54400] LOG:  background worker "parallel
> worker" (PID 54482) was terminated by signal 9: Killed
>

That looks like the background worker got killed by the OOM killer.  How
much memory do you have in the machine where this occurred?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-03-07 Thread Prabhat Sahu
On Wed, Mar 7, 2018 at 7:16 PM, Robert Haas  wrote:

> On Wed, Mar 7, 2018 at 8:13 AM, Prabhat Sahu <
> prabhat.s...@enterprisedb.com> wrote:
>
>> Hi all,
>>
>> While testing this feature I found a crash on PG head with parallel
>> create index using pgbench tables.
>>
>> -- GUCs under postgresql.conf
>> max_parallel_maintenance_workers = 16
>> max_parallel_workers = 16
>> max_parallel_workers_per_gather = 8
>> maintenance_work_mem = 8GB
>> max_wal_size = 4GB
>>
>> ./pgbench -i -s 500 -d postgres
>>
>> postgres=# create index pgb_acc_idx3 on pgbench_accounts(aid,
>> abalance,filler);
>> WARNING:  terminating connection because of crash of another server
>> process
>> DETAIL:  The postmaster has commanded this server process to roll back
>> the current transaction and exit, because another server process exited
>> abnormally and possibly corrupted shared memory.
>> HINT:  In a moment you should be able to reconnect to the database and
>> repeat your command.
>> server closed the connection unexpectedly
>> This probably means the server terminated abnormally
>> before or while processing the request.
>> The connection to the server was lost. Attempting reset: Failed.
>> !>
>>
>
> That makes it look like perhaps one of the worker backends crashed.  Did
> you get a message in the logfile that might indicate the nature of the
> crash?  Something with PANIC or TRAP, perhaps?
>


I am not able to see any PANIC/TRAP in the log file.
Here are the contents:

[edb@localhost bin]$ cat logsnew
2018-03-07 19:21:20.922 IST [54400] LOG:  listening on IPv6 address "::1",
port 5432
2018-03-07 19:21:20.922 IST [54400] LOG:  listening on IPv4 address
"127.0.0.1", port 5432
2018-03-07 19:21:20.925 IST [54400] LOG:  listening on Unix socket
"/tmp/.s.PGSQL.5432"
2018-03-07 19:21:20.936 IST [54401] LOG:  database system was shut down at
2018-03-07 19:21:20 IST
2018-03-07 19:21:20.939 IST [54400] LOG:  database system is ready to
accept connections
2018-03-07 19:24:44.263 IST [54400] LOG:  background worker "parallel
worker" (PID 54482) was terminated by signal 9: Killed
2018-03-07 19:24:44.286 IST [54400] LOG:  terminating any other active
server processes
2018-03-07 19:24:44.297 IST [54405] WARNING:  terminating connection
because of crash of another server process
2018-03-07 19:24:44.297 IST [54405] DETAIL:  The postmaster has commanded
this server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2018-03-07 19:24:44.297 IST [54405] HINT:  In a moment you should be able
to reconnect to the database and repeat your command.
2018-03-07 19:24:44.301 IST [54478] WARNING:  terminating connection
because of crash of another server process
2018-03-07 19:24:44.301 IST [54478] DETAIL:  The postmaster has commanded
this server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2018-03-07 19:24:44.301 IST [54478] HINT:  In a moment you should be able
to reconnect to the database and repeat your command.
2018-03-07 19:24:44.494 IST [54504] FATAL:  the database system is in
recovery mode
2018-03-07 19:24:44.496 IST [54400] LOG:  all server processes terminated;
reinitializing
2018-03-07 19:24:44.513 IST [54505] LOG:  database system was interrupted;
last known up at 2018-03-07 19:22:54 IST
2018-03-07 19:24:44.552 IST [54505] LOG:  database system was not properly
shut down; automatic recovery in progress
2018-03-07 19:24:44.554 IST [54505] LOG:  redo starts at 0/AB401A38
2018-03-07 19:25:14.712 IST [54505] LOG:  invalid record length at
1/818B8D80: wanted 24, got 0
2018-03-07 19:25:14.714 IST [54505] LOG:  redo done at 1/818B8D48
2018-03-07 19:25:14.714 IST [54505] LOG:  last completed transaction was at
log time 2018-03-07 19:24:05.322402+05:30
2018-03-07 19:25:16.887 IST [54400] LOG:  database system is ready to
accept connections



--

With Regards,

Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Corporation

The Postgres Database Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-03-07 Thread Robert Haas
On Wed, Mar 7, 2018 at 8:13 AM, Prabhat Sahu 
wrote:

> Hi all,
>
> While testing this feature I found a crash on PG head with parallel create
> index using pgbench tables.
>
> -- GUCs under postgresql.conf
> max_parallel_maintenance_workers = 16
> max_parallel_workers = 16
> max_parallel_workers_per_gather = 8
> maintenance_work_mem = 8GB
> max_wal_size = 4GB
>
> ./pgbench -i -s 500 -d postgres
>
> postgres=# create index pgb_acc_idx3 on pgbench_accounts(aid,
> abalance,filler);
> WARNING:  terminating connection because of crash of another server process
> DETAIL:  The postmaster has commanded this server process to roll back the
> current transaction and exit, because another server process exited
> abnormally and possibly corrupted shared memory.
> HINT:  In a moment you should be able to reconnect to the database and
> repeat your command.
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> !>
>

That makes it look like perhaps one of the worker backends crashed.  Did
you get a message in the logfile that might indicate the nature of the
crash?  Something with PANIC or TRAP, perhaps?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
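
One quick way to scan a server log for the messages Robert mentions (the file
name here matches the log file shown later in the thread):

$ grep -nE 'PANIC|TRAP|terminated by signal' logsnew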


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-03-07 Thread Prabhat Sahu
Hi all,

While testing this feature I found a crash on PG head with parallel create
index using pgbench tables.

-- GUCs under postgresql.conf
max_parallel_maintenance_workers = 16
max_parallel_workers = 16
max_parallel_workers_per_gather = 8
maintenance_work_mem = 8GB
max_wal_size = 4GB

./pgbench -i -s 500 -d postgres

postgres=# create index pgb_acc_idx3 on pgbench_accounts(aid,
abalance,filler);
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>


--

With Regards,

Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Corporation

The Postgres Database Company

On Thu, Feb 8, 2018 at 6:15 PM, Robert Haas  wrote:

> On Tue, Feb 6, 2018 at 3:53 PM, Robert Haas  wrote:
> > On Tue, Feb 6, 2018 at 2:11 PM, Tom Lane  wrote:
> >> Robert Haas  writes:
> >>> Unfortunately valgrind does not work at all on my laptop -- the server
> >>> appears to start, but as soon as you try to connect, the whole thing
> >>> dies with an error claiming that the startup process has failed.  So I
> >>> can't easily test this at the moment.  I'll try to get it working,
> >>> here or elsewhere, but thought I'd send the above reply first.
> >>
> >> Do you want somebody who does have a working valgrind installation
> >> (ie me) to take responsibility for pushing this patch?
> >
> > I committed it before seeing this.  It probably would've been better
> > if you had done it, but I assume Peter tested it, so let's see what
> > the BF thinks.
>
> skink and lousyjack seem happy now, so I think it worked.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-08 Thread Robert Haas
On Tue, Feb 6, 2018 at 3:53 PM, Robert Haas  wrote:
> On Tue, Feb 6, 2018 at 2:11 PM, Tom Lane  wrote:
>> Robert Haas  writes:
>>> Unfortunately valgrind does not work at all on my laptop -- the server
>>> appears to start, but as soon as you try to connect, the whole thing
>>> dies with an error claiming that the startup process has failed.  So I
>>> can't easily test this at the moment.  I'll try to get it working,
>>> here or elsewhere, but thought I'd send the above reply first.
>>
>> Do you want somebody who does have a working valgrind installation
>> (ie me) to take responsibility for pushing this patch?
>
> I committed it before seeing this.  It probably would've been better
> if you had done it, but I assume Peter tested it, so let's see what
> the BF thinks.

skink and lousyjack seem happy now, so I think it worked.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-06 Thread Tomas Vondra


On 02/06/2018 10:39 PM, Peter Geoghegan wrote:
> On Tue, Feb 6, 2018 at 1:30 PM, Tomas Vondra
>  wrote:
>> I have little idea what -Og exactly means. It seems to be focused on
>> debugging experience, and so still does some of the optimizations.
> 
> As I understand it, -Og allows any optimization that does not hamper
> walking through code with a debugger.
> 
>> Which
>> I think would explain why skink was not detecting some of the failures
>> for a long time.
> 
> I think that skink didn't detect failures until now because the code
> wasn't exercised until parallel CREATE INDEX was added, simply because
> the function LogicalTapeFreeze() was never reached (though that's not
> the only reason, it is the most obvious one).
> 

Maybe. What I had in mind was a different thread from November,
discussing some non-deterministic valgrind failures:

https://www.postgresql.org/message-id/flat/20171125200014.qbewtip5oydqsklt%40alap3.anarazel.de#20171125200014.qbewtip5oydqs...@alap3.anarazel.de

But you're right that it may be irrelevant here. As I said, it was mostly
just a random comment about valgrind.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-06 Thread Peter Geoghegan
On Tue, Feb 6, 2018 at 1:30 PM, Tomas Vondra
 wrote:
> I have little idea what -Og exactly means. It seems to be focused on
> debugging experience, and so still does some of the optimizations.

As I understand it, -Og allows any optimization that does not hamper
walking through code with a debugger.

> Which
> I think would explain why skink was not detecting some of the failures
> for a long time.

I think that skink didn't detect failures until now because the code
wasn't exercised until parallel CREATE INDEX was added, simply because
the function LogicalTapeFreeze() was never reached (though that's not
the only reason, it is the most obvious one).

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-06 Thread Tomas Vondra


On 02/06/2018 10:14 PM, Peter Geoghegan wrote:
> On Tue, Feb 6, 2018 at 1:04 PM, Tomas Vondra
>  wrote:
>> Did you do a test with "-O0"? In my experience that makes valgrind tests
>> much more reliable and repeatable. Some time ago we've seen cases that
>> were failing for me but not for others, and I suspect it was due to me
>> using "-O0".
> 
> FWIW, I use -O1 when configure is run for Valgrind. I also turn off
> assertions (this is all scripted). According to the Valgrind manual:
> 
> "With -O1 line numbers in error messages can be inaccurate, although
> generally speaking running Memcheck on code compiled at -O1 works
> fairly well, and the speed improvement compared to running -O0 is
> quite significant. Use of -O2 and above is not recommended as Memcheck
> occasionally reports uninitialised-value errors which don’t really
> exist."
> 

OK, although I was suggesting that the optimizations may actually have the
opposite effect - valgrind missing some of the invalid memory accesses
(until the compiler decides not to use them for some reason, causing sudden
valgrind failures).

> The manual does also say that there might even be some problems with
> -O1 at a later point, but it sounds like it's probably worth it to me.
> Skink uses -Og, FWIW.
> 

I have little idea what -Og exactly means. It seems to be focused on
debugging experience, and so still does some of the optimizations. Which
I think would explain why skink was not detecting some of the failures
for a long time.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
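
The effect Tomas describes is easy to demonstrate outside PostgreSQL; a tiny
hypothetical example:

#include <stdio.h>

int
main(void)
{
	int		x;			/* deliberately left uninitialized */

	/*
	 * Compiled at -O0, Memcheck reports "Conditional jump or move
	 * depends on uninitialised value(s)" for this branch.  At higher
	 * optimization levels the compiler may fold the branch away
	 * entirely, and the report silently disappears, until some later
	 * compiler or code change brings it back.
	 */
	if (x > 0)
		printf("positive\n");
	return 0;
}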



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-06 Thread Peter Geoghegan
On Tue, Feb 6, 2018 at 1:04 PM, Tomas Vondra
 wrote:
> Did you do a test with "-O0"? In my experience that makes valgrind tests
> much more reliable and repeatable. Some time ago we've seen cases that
> were failing for me but not for others, and I suspect it was due to me
> using "-O0".

FWIW, I use -O1 when configure is run for Valgrind. I also turn off
assertions (this is all scripted). According to the Valgrind manual:

"With -O1 line numbers in error messages can be inaccurate, although
generally speaking running Memcheck on code compiled at -O1 works
fairly well, and the speed improvement compared to running -O0 is
quite significant. Use of -O2 and above is not recommended as Memcheck
occasionally reports uninitialised-value errors which don’t really
exist."

The manual does also say that there might even be some problems with
-O1 at a later point, but it sounds like it's probably worth it to me.
Skink uses -Og, FWIW.

-- 
Peter Geoghegan
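
For anyone wanting to reproduce such a run: a sketch of a Valgrind-friendly
build, assuming the usual source tree layout (USE_VALGRIND compiles in
PostgreSQL's Valgrind client requests; the exact Valgrind options are a
matter of taste):

$ ./configure CFLAGS='-O1' CPPFLAGS='-DUSE_VALGRIND' --enable-debug
$ make -j4 && make install
$ valgrind --leak-check=no --trace-children=yes \
      --suppressions=src/tools/valgrind.supp \
      postgres -D "$PGDATA"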



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-06 Thread Tomas Vondra
On 02/06/2018 09:56 PM, Peter Geoghegan wrote:
> On Tue, Feb 6, 2018 at 12:53 PM, Robert Haas  wrote:
>>> Do you want somebody who does have a working valgrind installation
>>> (ie me) to take responsibility for pushing this patch?
>>
>> I committed it before seeing this.  It probably would've been better
>> if you had done it, but I assume Peter tested it, so let's see what
>> the BF thinks.
> 
> I did test it with a full "make installcheck" + valgrind-3.11.0. I'd
> be very surprised if this doesn't make the buildfarm go green.
> 

Did you do a test with "-O0"? In my experience that makes valgrind tests
much more reliable and repeatable. Some time ago we've seen cases that
were failing for me but not for others, and I suspect it was due to me
using "-O0".

(This is more a random comment than a suggestion that your patch won't
make the buildfarm green.)

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-06 Thread Peter Geoghegan
On Tue, Feb 6, 2018 at 12:53 PM, Robert Haas  wrote:
>> Do you want somebody who does have a working valgrind installation
>> (ie me) to take responsibility for pushing this patch?
>
> I committed it before seeing this.  It probably would've been better
> if you had done it, but I assume Peter tested it, so let's see what
> the BF thinks.

I did test it with a full "make installcheck" + valgrind-3.11.0. I'd
be very surprised if this doesn't make the buildfarm go green.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-06 Thread Robert Haas
On Tue, Feb 6, 2018 at 2:11 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> Unfortunately valgrind does not work at all on my laptop -- the server
>> appears to start, but as soon as you try to connect, the whole thing
>> dies with an error claiming that the startup process has failed.  So I
>> can't easily test this at the moment.  I'll try to get it working,
>> here or elsewhere, but thought I'd send the above reply first.
>
> Do you want somebody who does have a working valgrind installation
> (ie me) to take responsibility for pushing this patch?

I committed it before seeing this.  It probably would've been better
if you had done it, but I assume Peter tested it, so let's see what
the BF thinks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-06 Thread Tom Lane
Robert Haas  writes:
> Unfortunately valgrind does not work at all on my laptop -- the server
> appears to start, but as soon as you try to connect, the whole thing
> dies with an error claiming that the startup process has failed.  So I
> can't easily test this at the moment.  I'll try to get it working,
> here or elsewhere, but thought I'd send the above reply first.

Do you want somebody who does have a working valgrind installation
(ie me) to take responsibility for pushing this patch?

regards, tom lane



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-06 Thread Peter Geoghegan
On Mon, Feb 5, 2018 at 1:45 PM, Peter Geoghegan  wrote:
>> So, I guess another option might be to call VALGRIND_MAKE_MEM_DEFINED
>> on the buffer.  "We know what we're doing, trust us!"
>>
>> In some ways, that seems better than inserting a suppression, because
>> it only affects the memory in the buffer.
>
> I think that that would also work, and would be simpler, but also
> slightly inferior to using the proposed suppression. If there is
> garbage in logtape.c buffers, we still generally don't want to do
> anything important on the basis of those values. We make one exception
> with the suppression, which is a pretty typical kind of exception to
> make -- don't worry if we write() poisoned bytes, since those are
> bound to be alignment related.
>
> OTOH, as I've said we are generally bound to write some kind of
> logtape.c garbage, which will almost certainly not be of the
> uninitialized memory variety. So, while I feel that the suppression is
> better, the advantage is likely microscopic.

Attached patch does it to the tail of the buffer, as Tom suggested on
the -committers thread.

Note that there is one other place in logtape.c that can write a
partial block like this: LogicalTapeRewindForRead(). I haven't
bothered to do anything there, since it cannot possibly be affected by
this issue for the same reason that serial sorts cannot be -- it's
code that is only used by a tuplesort that really needs to spill to
disk, and merge multiple runs (or for tapes that have already been
frozen, that are expected to never reallocate logtape.c buffers).

-- 
Peter Geoghegan
From a67d4ae602247067a367265acfe83d35a264f40c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Tue, 6 Feb 2018 09:55:25 -0800
Subject: [PATCH] Mark logtape.c buffer's tail as defined to Valgrind.

LogicalTapeFreeze() may write out its first block when it is dirty but
not full, and then immediately read the first block back in from its
BufFile as a BLCKSZ-width block.  This can only occur in rare cases
where next to no tuples were written out, which is only possible with
parallel external tuplesorts.  While this is pointless, it's also
harmless.  However, this can cause Valgrind to complain about a write()
of uninitialized bytes, so mark the tail of the buffer as defined to
Valgrind.

Note that LogicalTapeFreeze() has always tended to write out some amount
of garbage bytes.  All that changed with parallel tuplesort is that the
garbage is sometimes uninitialized memory rather than memory containing
stale data from the tapeset.

Per buildfarm members lousyjack and skink.

Peter Geoghegan, based on a suggestion from Robert Haas and Tom Lane.
---
 src/backend/utils/sort/logtape.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 6b7c10b..1a02aba 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -86,6 +86,7 @@
 #include "storage/buffile.h"
 #include "utils/builtins.h"
 #include "utils/logtape.h"
+#include "utils/memdebug.h"
 #include "utils/memutils.h"
 
 /*
@@ -874,6 +875,13 @@ LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum, TapeShare *share)
 	 */
 	if (lt->dirty)
 	{
+		/*
+		 * Since garbage bytes at the tail of the buffer may remain
+		 * uninitialized in the case of worker tuplesorts with very little
+		 * input, mark the tail as defined
+		 */
+		VALGRIND_MAKE_MEM_DEFINED(lt->buffer + lt->nbytes,
+								  lt->buffer_size - lt->nbytes);
 		TapeBlockSetNBytes(lt->buffer, lt->nbytes);
 		ltsWriteBlock(lts, lt->curBlockNumber, (void *) lt->buffer);
 		lt->writing = false;
-- 
2.7.4



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-05 Thread Peter Geoghegan
On Mon, Feb 5, 2018 at 1:39 PM, Tels  wrote:
> Are the uninitialized bytes that are written out "whatever was in the
> memory previously" or just some "0x00 bytes from the allocation but not
> yet overwritten by the PG code"?
>
> Because the first sounds like it could be a security problem - if random
> junk bytes go out to the disk, and stay there, information could
> inadvertently leak to permanent storage.

But you can say the same thing about *any* of the
write()-of-uninitialized-bytes Valgrind suppressions that already
exist. There are quite a few of those.

That just isn't part of our security model.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-05 Thread Peter Geoghegan
On Mon, Feb 5, 2018 at 1:27 PM, Robert Haas  wrote:
> On Mon, Feb 5, 2018 at 1:03 PM, Peter Geoghegan  wrote:
>> It certainly is common. In the case of logtape.c, we almost always
>> write out some garbage bytes, even with serial sorts. The only
>> difference here is the *sense* in which they're garbage: they're
>> uninitialized bytes, which Valgrind cares about, rather than bytes from
>> previous writes that are left behind in the buffer, which Valgrind
>> does not care about.

I should clarify what I meant here -- it is very common when we have
to freeze a tape, like when we do a serial external randomAccess
tuplesort, or a parallel worker's tuplesort. It shouldn't happen
otherwise. Note that there is a general pattern of dumping out the
current buffer just as the next one is needed, in order to make sure
that the linked list pointer correctly points to the
next/soon-to-be-current block. Note also that the majority of routines
declared within logtape.c can only be used on frozen tapes.

I am pretty confident that I've scoped this correctly by targeting
LogicalTapeFreeze().

> So, I guess another option might be to call VALGRIND_MAKE_MEM_DEFINED
> on the buffer.  "We know what we're doing, trust us!"
>
> In some ways, that seems better than inserting a suppression, because
> it only affects the memory in the buffer.

I think that that would also work, and would be simpler, but also
slightly inferior to using the proposed suppression. If there is
garbage in logtape.c buffers, we still generally don't want to do
anything important on the basis of those values. We make one exception
with the suppression, which is a pretty typical kind of exception to
make -- don't worry if we write() poisoned bytes, since those are
bound to be alignment related.

OTOH, as I've said we are generally bound to write some kind of
logtape.c garbage, which will almost certainly not be of the
uninitialized memory variety. So, while I feel that the suppression is
better, the advantage is likely microscopic.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-05 Thread Tels

On Mon, February 5, 2018 4:27 pm, Robert Haas wrote:
> On Mon, Feb 5, 2018 at 1:03 PM, Peter Geoghegan  wrote:
>> It certainly is common. In the case of logtape.c, we almost always
>> write out some garbage bytes, even with serial sorts. The only
>> difference here is the *sense* in which they're garbage: they're
>> uninitialized bytes, which Valgrind cares about, rather than bytes from
>> previous writes that are left behind in the buffer, which Valgrind
>> does not care about.
>
> /me face-palms.
>
> So, I guess another option might be to call VALGRIND_MAKE_MEM_DEFINED
> on the buffer.  "We know what we're doing, trust us!"
>
> In some ways, that seems better than inserting a suppression, because
> it only affects the memory in the buffer.
>
> Anybody else want to express an opinion here?

Are the uninitialized bytes that are written out "whatever was in the
memory previously" or just some "0x00 bytes from the allocation but not
yet overwritten by the PG code"?

Because the first sounds like it could be a security problem - if random
junk bytes go out to the disk, and stay there, information could
inadvertently leak to permanent storage.

Best regards,

Tels



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-05 Thread Robert Haas
On Mon, Feb 5, 2018 at 1:03 PM, Peter Geoghegan  wrote:
> It certainly is common. In the case of logtape.c, we almost always
> write out some garbage bytes, even with serial sorts. The only
> difference here is the *sense* in which they're garbage: they're
> uninitialized bytes, which Valgrind cares about, rather than bytes from
> previous writes that are left behind in the buffer, which Valgrind
> does not care about.

/me face-palms.

So, I guess another option might be to call VALGRIND_MAKE_MEM_DEFINED
on the buffer.  "We know what we're doing, trust us!"

In some ways, that seems better than inserting a suppression, because
it only affects the memory in the buffer.

Anybody else want to express an opinion here?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-05 Thread Robert Haas
On Fri, Feb 2, 2018 at 10:26 PM, Peter Geoghegan  wrote:
> My proposed commit message
> has a full explanation of the Valgrind issue, which I won't repeat
> here. Go read it before reading the rest of this e-mail.

I'm going to paste the first two sentences of your proposed commit
message in here for the convenience of other readers, since I want to
reply to them.

# LogicalTapeFreeze() may write out its first block when it is dirty but
# not full, and then immediately read the first block back in from its
# BufFile as a BLCKSZ-width block.  This can only occur in rare cases
# where next to no tuples were written out, which is only possible with
# parallel external tuplesorts.

So, if I understand correctly what you're saying here, valgrind is
totally cool with us writing out an only-partially-initialized block
to a disk file, but it's got a real problem with us reading that data
back into the same memory space it already occupies.  That's a little
odd.  I presume that it's common for the tail of the final block
written to be uninitialized, but normally when we then go read block
0, that's some other, fully initialized block.

It seems like it would be pretty easy to just suppress the useless
read when we've already got the correct data, and I'd lean toward
going that direction since it's a valid optimization anyway.  But I'd
like to hear some opinions from people who use and think about
valgrind more than I do (Tom, Andres, Noah, ...?).

> It might seem like my suppression is overly broad, or not broad
> enough, since it essentially targets LogicalTapeFreeze(). I don't
> think it is, though, because this can occur in two places within
> LogicalTapeFreeze() -- it can occur in the place we actually saw the
> issue on lousyjack, from the ltsReadBlock() call within
> LogicalTapeFreeze(), as well as a second place -- when
> BufFileExportShared() is called. I found that you have to tweak code
> to prevent it happening in the first place before you'll see it happen
> in the second place.

I don't quite see how that would happen, because BufFileExportShared,
at least AFAICS, doesn't touch the buffer?

Unfortunately valgrind does not work at all on my laptop -- the server
appears to start, but as soon as you try to connect, the whole thing
dies with an error claiming that the startup process has failed.  So I
can't easily test this at the moment.  I'll try to get it working,
here or elsewhere, but thought I'd send the above reply first.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-02 Thread Peter Geoghegan
On Fri, Feb 2, 2018 at 4:31 PM, Peter Geoghegan  wrote:
> On Fri, Feb 2, 2018 at 1:58 PM, Andres Freund  wrote:
>> Not saying you're wrong, but you should include a comment on why this is
>> a benign warning. Presumably it's some padding memory somewhere, but
>> it's not obvious from the above bleat.
>
> Sure. This looks slightly more complicated than first anticipated, but
> I'll keep everyone posted.

I couldn't make up my mind if it was best to prevent the uninitialized
write(), or to instead just add a suppression. I eventually decided
upon the suppression -- see attached patch. My proposed commit message
has a full explanation of the Valgrind issue, which I won't repeat
here. Go read it before reading the rest of this e-mail.

It might seem like my suppression is overly broad, or not broad
enough, since it essentially targets LogicalTapeFreeze(). I don't
think it is, though, because this can occur in two places within
LogicalTapeFreeze() -- it can occur in the place we actually saw the
issue on lousyjack, from the ltsReadBlock() call within
LogicalTapeFreeze(), as well as a second place -- when
BufFileExportShared() is called. I found that you have to tweak the code
to prevent it happening in the first place before you'll see it happen
in the second place. I see no point in actually playing whack-a-mole
for a totally benign issue like this, though, which made me finally
decide upon the suppression approach.

Bear in mind that a third way of fixing this would be to allocate
logtape.c buffers using palloc0() rather than palloc() (though I don't
like that idea at all). For serial external sorts, the logtape.c
buffers are guaranteed to have been written to/initialized at least
once as part of spilling a sort to disk. Parallel external sorts don't
quite guarantee that, which is why we run into this Valgrind issue.

-- 
Peter Geoghegan
From e51b9d411a859072e6a3f8b2f77e32e0f47d409a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Fri, 2 Feb 2018 17:47:13 -0800
Subject: [PATCH] Add logtape.c Valgrind suppression.

LogicalTapeFreeze() may write out its first block when it is dirty but
not full, and then immediately read the first block back in from its
BufFile as a BLCKSZ-width block.  This can only occur in rare cases
where next to no tuples were written out, which is only possible with
parallel external tuplesorts.  While this is pointless, it's also
harmless.

However, this issue causes Valgrind to complain about a write() of
uninitialized bytes, so a suppression is needed.  This is because
BufFile block reading will flush out its whole dirty stdio-style temp
buffer, writing bytes that include values originating from
poisoned/uninitialized logtape.c buffer bytes.

Author: Peter Geoghegan
---
 src/tools/valgrind.supp | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index af03051..22039c0 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -112,6 +112,16 @@
 	fun:BootStrapXLOG
 }
 
+# LogicalTapeFreeze() needs this for calls from parallel worker with few tuples
+{
+	padding_buffile_dump_buffer
+	Memcheck:Param
+	write(buf)
+
+	...
+	fun:LogicalTapeFreeze
+}
+
 {
 	bootstrap_write_relmap_overlap
 	Memcheck:Overlap
-- 
2.7.4



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-02 Thread Peter Geoghegan
On Fri, Feb 2, 2018 at 1:58 PM, Andres Freund  wrote:
> Not saying you're wrong, but you should include a comment on why this is
> a benign warning. Presumably it's some padding memory somewhere, but
> it's not obvious from the above bleat.

Sure. This looks slightly more complicated than first anticipated, but
I'll keep everyone posted.

Valgrind suppression aside, this raises another question. The stack
trace shows that the error happens during the creation of a new TOAST
table (CheckAndCreateToastTable()). I wonder if I should also pass
down a flag that makes sure that parallelism is never even attempted
from that path, to match TRUNCATE's suppression of parallel index
builds during its reindexing. It really shouldn't be a problem as
things stand, but maybe it's better to be consistent about "useless"
parallel CREATE INDEX attempts, and suppress them here too.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-02 Thread Andres Freund
On 2018-02-02 13:35:59 -0800, Peter Geoghegan wrote:
> On Fri, Feb 2, 2018 at 10:38 AM, Peter Geoghegan  wrote:
> > On Fri, Feb 2, 2018 at 10:37 AM, Robert Haas  wrote:
> >> If you could keep an eye on the buildfarm and investigate anything
> >> that breaks, I would appreciate it.
> 
> > I can keep an eye on it throughout the day.
> 
> There is a benign Valgrind error that causes the lousyjack animal to
> report failure. It looks like this:
> 
> ==6850== Syscall param write(buf) points to uninitialised byte(s)
> ==6850==at 0x4E4D534: write (in /usr/lib64/libpthread-2.26.so)
> ==6850==by 0x82328F: FileWrite (fd.c:2017)
> ==6850==by 0x8261AD: BufFileDumpBuffer (buffile.c:513)
> ==6850==by 0x826569: BufFileFlush (buffile.c:657)
> ==6850==by 0x8262FB: BufFileRead (buffile.c:561)
> ==6850==by 0x9F6C79: ltsReadBlock (logtape.c:273)
> ==6850==by 0x9F7ACF: LogicalTapeFreeze (logtape.c:906)
> ==6850==by 0xA05B0D: worker_freeze_result_tape (tuplesort.c:4477)
> ==6850==by 0xA05BC6: worker_nomergeruns (tuplesort.c:4499)
> ==6850==by 0x9FCA1E: tuplesort_performsort (tuplesort.c:1823)

Not saying you're wrong, but you should include a comment on why this is
a benign warning. Presumably it's some padding memory somewhere, but
it's not obvious from the above bleat.

Greetings,

Andres Freund



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-02 Thread Peter Geoghegan
On Fri, Feb 2, 2018 at 10:38 AM, Peter Geoghegan  wrote:
> On Fri, Feb 2, 2018 at 10:37 AM, Robert Haas  wrote:
>> If you could keep an eye on the buildfarm and investigate anything
>> that breaks, I would appreciate it.

> I can keep an eye on it throughout the day.

There is a benign Valgrind error that causes the lousyjack animal to
report failure. It looks like this:

==6850== Syscall param write(buf) points to uninitialised byte(s)
==6850==at 0x4E4D534: write (in /usr/lib64/libpthread-2.26.so)
==6850==by 0x82328F: FileWrite (fd.c:2017)
==6850==by 0x8261AD: BufFileDumpBuffer (buffile.c:513)
==6850==by 0x826569: BufFileFlush (buffile.c:657)
==6850==by 0x8262FB: BufFileRead (buffile.c:561)
==6850==by 0x9F6C79: ltsReadBlock (logtape.c:273)
==6850==by 0x9F7ACF: LogicalTapeFreeze (logtape.c:906)
==6850==by 0xA05B0D: worker_freeze_result_tape (tuplesort.c:4477)
==6850==by 0xA05BC6: worker_nomergeruns (tuplesort.c:4499)
==6850==by 0x9FCA1E: tuplesort_performsort (tuplesort.c:1823)

I'll need to go and write a Valgrind suppression for this. I'll get to
it later today.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-02 Thread Robert Haas
On Fri, Feb 2, 2018 at 3:35 PM, Peter Geoghegan  wrote:
> On Fri, Feb 2, 2018 at 12:30 PM, Robert Haas  wrote:
>> For the record, I typically construct the list of reviewers by reading
>> over the thread and adding all the people whose names I find there in
>> chronological order, excluding things that are clearly not review
>> (like "Bumped to next CF.") and opinions on narrow questions that
>> don't indicate that any code-reading or testing was done (like "+1 for
>> calling the GUC foo_bar_baz rather than quux_bletch".)  I saw that you
>> copied Corey on the original email, but I see no posts from him on the
>> thread, which is why he didn't get included in the commit message.
>
> I did credit him in my own proposed commit message. I know that it's
> not part of your workflow to preserve that, but I had assumed that
> that would at least be taken into account.

Ah.  Sorry, I didn't look at that.  I try to remember to look at
proposed commit messages, but not everyone includes them, which is
probably part of the reason I don't always remember to look for them.
Or maybe I just have failed to adequately develop that habit...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-02 Thread Peter Geoghegan
On Fri, Feb 2, 2018 at 12:30 PM, Robert Haas  wrote:
> For the record, I typically construct the list of reviewers by reading
> over the thread and adding all the people whose names I find there in
> chronological order, excluding things that are clearly not review
> (like "Bumped to next CF.") and opinions on narrow questions that
> don't indicate that any code-reading or testing was done (like "+1 for
> calling the GUC foo_bar_baz rather than quux_bletch".)  I saw that you
> copied Corey on the original email, but I see no posts from him on the
> thread, which is why he didn't get included in the commit message.

I did credit him in my own proposed commit message. I know that it's
not part of your workflow to preserve that, but I had assumed that
that would at least be taken into account.

Anyway, mistakes like this happen. I'm glad that we now have the
reviewer credit list, so that they can be corrected afterwards.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-02 Thread Robert Haas
On Fri, Feb 2, 2018 at 3:23 PM, Peter Geoghegan  wrote:
> On Fri, Feb 2, 2018 at 10:38 AM, Peter Geoghegan  wrote:
>> Thanks everyone
>
> I would like to acknowledge the assistance of Corey Huinker with early
> testing of the patch (this took place in 2016, and much of it was not
> on-list). Even though he wasn't credited in the commit message, he
> should appear in the V11 release notes reviewer list IMV. His
> contribution certainly merits it.

For the record, I typically construct the list of reviewers by reading
over the thread and adding all the people whose names I find there in
chronological order, excluding things that are clearly not review
(like "Bumped to next CF.") and opinions on narrow questions that
don't indicate that any code-reading or testing was done (like "+1 for
calling the GUC foo_bar_baz rather than quux_bletch".)  I saw that you
copied Corey on the original email, but I see no posts from him on the
thread, which is why he didn't get included in the commit message.
While I have no problem with him being included in the release notes,
I obviously can't know about activity that happens entirely off-list.
If you mentioned somewhere in the 200+ messages on this topic that he
should be included, I missed that, too.  I think it's much harder to
give credit adequately when contributions are off-list; letting
everyone know what's going on is why we have a list.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-02 Thread Peter Geoghegan
On Fri, Feb 2, 2018 at 10:37 AM, Robert Haas  wrote:
> And that patch you attached is also, now, committed.
>
> If you could keep an eye on the buildfarm and investigate anything
> that breaks, I would appreciate it.

Fantastic!

I can keep an eye on it throughout the day.

Thanks everyone
-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-02 Thread Robert Haas
On Fri, Feb 2, 2018 at 11:16 AM, Peter Geoghegan  wrote:
> Attached patch has these changes.

And that patch you attached is also, now, committed.

If you could keep an eye on the buildfarm and investigate anything
that breaks, I would appreciate it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-27 Thread Amit Kapila
On Fri, Jan 26, 2018 at 12:36 PM, Amit Kapila  wrote:
> On Fri, Jan 26, 2018 at 12:00 PM, Peter Geoghegan  wrote:
>> On Thu, Jan 25, 2018 at 10:00 PM, Amit Kapila  
>> wrote:
 At this point, my preferred solution is for someone to go implement
 Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems
 like the logical person for the job).

>>>
>>> I can implement it and share a prototype patch with you which you can
>>> use to test parallel sort stuff.
>>
>> That would be great. Thank you.
>>
>>> I would like to highlight the
>>> difference you will see with WaitForParallelWorkersToAttach as
>>> compared to WaitForParallelWorkersToFinish() is that the former will
>>> tell you how many of nworkers_launched workers were actually launched,
>>> whereas the latter gives an error if any of the expected workers is not
>>> launched.  I feel the former is good, and your proposed way of calling it
>>> after the leader is done with its work has alleviated the minor
>>> disadvantage of this API, which is that we need to wait for workers to
>>> start up.
>>
>> I'm not sure that it makes much difference, though, since in the end
>> WaitForParallelWorkersToFinish() is called anyway, much like
>> nodeGather.c. Have I missed something?
>>
>
> Nope, you are right.  I had in my mind that if we have something like
> what I am proposing, then we don't even need to detect failures in
> WaitForParallelWorkersToFinish and we can finish the work without
> failing.
>
>> I had imagined that WaitForParallelWorkersToAttach() would give me an
>> error in the style of WaitForParallelWorkersToFinish(), without
>> actually waiting for the parallel workers to finish.
>>
>
> I think that is also doable.  I will give it a try and report back if
> I see any problem with it.
>

I have written the patch for the above API and posted it on a new
thread [1].  Do let me know, either here or on that thread, whether the
patch meets your needs.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1%2Be2MzyouF5bg%3DOtyhDSX%2B%3DAo%3D3htN%3DT-r_6s3gCtKFiw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
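
In outline, the call sequence being converged on looks something like this; a
sketch using the API names under discussion in the thread (not the final
committed code), with a hypothetical caller where only the ordering of the
calls matters:

#include "postgres.h"
#include "access/parallel.h"

static void
leader_sort_sketch(ParallelContext *pcxt)
{
	LaunchParallelWorkers(pcxt);

	/* The leader first performs its own share of the sort ... */

	/*
	 * ... then checks that every launched worker actually attached.
	 * As discussed above, this raises an error on fork failure rather
	 * than returning an updated worker count, and it does not wait
	 * for the workers to finish their sorts.
	 */
	WaitForParallelWorkersToAttach(pcxt);

	/* Consume the worker runs, then wait for a clean shutdown. */
	WaitForParallelWorkersToFinish(pcxt);
}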



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-26 Thread Robert Haas
On Fri, Jan 26, 2018 at 6:40 PM, Thomas Munro
 wrote:
> Thanks for looking into this.  Yeah.  I think you're right that it
> could add a bit of overhead in some cases (ie if you receive a lot of
> signals that AREN'T caused by fork failure, then you'll enter
> HandleParallelMessage() every time unnecessarily), and it does feel a
> bit kludgy.  The best idea I have to fix that so far is like this: (1)
> add a member fork_failure_count to struct BackgroundWorkerArray, (2)
> in do_start_bgworker() whenever fork fails, do
> ++BackgroundWorkerData->fork_failure_count (ie before a signal is sent
> to the leader), (3) in procsignal_sigusr1_handler where we normally do
> a bunch of CheckProcSignal(PROCSIG_XXX) stuff, if
> (BackgroundWorkerData->fork_failure_count !=
> last_observed_fork_failure_count) HandleParallelMessageInterrupt().
> As far as I know, as long as fork_failure_count is (say) int32 (ie not
> prone to tearing) then no locking is required due to the barriers
> implicit in the syscalls involved there.  This is still slightly more
> pessimistic than it needs to be (the failed fork may be for someone
> else's ParallelContext), but only in rare cases so it would be
> practically as good as precise PROCSIG delivery.  It's just that we
> aren't allowed to deliver PROCSIGs from the postmaster.  We are
> allowed to communicate through BackgroundWorkerData, and there is a
> precedent for cluster-visible event counters in there already.

I could sign on to that plan, but I don't think we should hold this
patch up for it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-26 Thread Thomas Munro
On Fri, Jan 26, 2018 at 7:30 PM, Peter Geoghegan  wrote:
> On Thu, Jan 25, 2018 at 10:00 PM, Amit Kapila  wrote:
>> However, now I see that you and Thomas are trying to find a different
>> way to overcome this problem differently, so not sure if I should go
>> ahead or not.  I have seen that you told you wanted to look at
>> Thomas's proposed stuff carefully tomorrow, so I will wait for you
>> guys to decide which way is appropriate.
>
> I suspect that the overhead of Thomas' experimental approach is going
> to causes problems in certain cases. Cases that are hard to foresee.
> That patch makes HandleParallelMessages() set ParallelMessagePending
> artificially, pending confirmation of having launched all workers.
>
> It was an interesting experiment, but I think that your
> WaitForParallelWorkersToAttach() idea has a better chance of working
> out.

Thanks for looking into this.  Yeah.  I think you're right that it
could add a bit of overhead in some cases (ie if you receive a lot of
signals that AREN'T caused by fork failure, then you'll enter
HandleParallelMessage() every time unnecessarily), and it does feel a
bit kludgy.  The best idea I have to fix that so far is like this: (1)
add a member fork_failure_count to struct BackgroundWorkerArray, (2)
in do_start_bgworker() whenever fork fails, do
++BackgroundWorkerData->fork_failure_count (ie before a signal is sent
to the leader), (3) in procsignal_sigusr1_handler where we normally do
a bunch of CheckProcSignal(PROCSIG_XXX) stuff, if
(BackgroundWorkerData->fork_failure_count !=
last_observed_fork_failure_count) HandleParallelMessageInterrupt().
As far as I know, as long as fork_failure_count is (say) int32 (ie not
prone to tearing) then no locking is required due to the barriers
implicit in the syscalls involved there.  This is still slightly more
pessimistic than it needs to be (the failed fork may be for someone
else's ParallelContext), but only in rare cases so it would be
practically as good as precise PROCSIG delivery.  It's just that we
aren't allowed to deliver PROCSIGs from the postmaster.  We are
allowed to communicate through BackgroundWorkerData, and there is a
precedent for cluster-visible event counters in there already.

I think you should proceed with Amit's plan.  If we ever make a plan
like the above work in future, it'd render that redundant by turning
every CFI() into a cancellation point for fork failure, but I'm not
planning to investigate further given the muted response to my
scheming in this area so far.

-- 
Thomas Munro
http://www.enterprisedb.com
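
Rendered as code fragments, the three steps Thomas describes might look like
this; purely illustrative, since the scheme was never implemented
(last_observed_fork_failure_count is a hypothetical name):

/*
 * (1) Hypothetical new counter in shared memory, alongside the existing
 * members of struct BackgroundWorkerArray.
 */
uint32		fork_failure_count;

/* Per-backend copy of the last counter value seen. */
static uint32 last_observed_fork_failure_count = 0;

/* (2) In do_start_bgworker(), when fork() fails, before the leader
 * is signalled: */
BackgroundWorkerData->fork_failure_count++;

/* (3) In procsignal_sigusr1_handler(), next to the existing
 * CheckProcSignal(PROCSIG_XXX) checks: */
if (BackgroundWorkerData->fork_failure_count !=
	last_observed_fork_failure_count)
{
	last_observed_fork_failure_count =
		BackgroundWorkerData->fork_failure_count;
	HandleParallelMessageInterrupt();
}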



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-26 Thread Peter Geoghegan
On Fri, Jan 26, 2018 at 11:17 AM, Robert Haas  wrote:
> Hmm, I like the idea of making it a #define instead of having it
> depend on parallel_leader_participation.  Let's do that.  If the
> consensus is later that it was the wrong decision, it'll be easy to
> change it back.

WFM.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-26 Thread Robert Haas
On Fri, Jan 26, 2018 at 2:04 PM, Peter Geoghegan  wrote:
> On Fri, Jan 26, 2018 at 10:33 AM, Robert Haas  wrote:
>> I'm busy with other things, so no rush.
>
> Got it.
>
> There is one question that I should probably get clarity on ahead of
> the next revision, which is: Should I rip out the code that disallows
> a "degenerate parallel CREATE INDEX" when
> parallel_leader_participation=off, or should I instead rip out any
> code that deals with parallel_leader_participation, and always have
> the leader participate as a worker?
>
> If I did the latter, then leader non-participation would live on as a
> #define debug option within nbtsort.c. It definitely seems like we'd
> want to preserve that at a minimum.

Hmm, I like the idea of making it a #define instead of having it
depend on parallel_leader_participation.  Let's do that.  If the
consensus is later that it was the wrong decision, it'll be easy to
change it back.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-26 Thread Peter Geoghegan
On Fri, Jan 26, 2018 at 10:33 AM, Robert Haas  wrote:
> I'm busy with other things, so no rush.

Got it.

There is one question that I should probably get clarity on ahead of
the next revision, which is: Should I rip out the code that disallows
a "degenerate parallel CREATE INDEX" when
parallel_leader_participation=off, or should I instead rip out any
code that deals with parallel_leader_participation, and always have
the leader participate as a worker?

If I did the latter, then leader non-participation would live on as a
#define debug option within nbtsort.c. It definitely seems like we'd
want to preserve that at a minimum.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-26 Thread Robert Haas
On Fri, Jan 26, 2018 at 1:17 PM, Peter Geoghegan  wrote:
> On Fri, Jan 26, 2018 at 10:01 AM, Robert Haas  wrote:
>> On Fri, Jan 26, 2018 at 1:30 AM, Peter Geoghegan  wrote:
>>> I had imagined that WaitForParallelWorkersToAttach() would give me an
>>> error in the style of WaitForParallelWorkersToFinish(), without
>>> actually waiting for the parallel workers to finish.
>>
>> +1.  If we're going to go that route, and that seems to be the
>> consensus, then I think an error is more appropriate than returning an
>> updated worker count.
>
> Great.
>
> Should I wait for Amit's WaitForParallelWorkersToAttach() patch to be
> posted, reviewed, and committed, or would you like to see what I came
> up with ("The next revision of the patch will make the
> leader-participates-as-worker spool/Tuplesortstate start and finish
> sorting before the main leader spool/Tuplesortstate is even started")
> today?

I'm busy with other things, so no rush.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-26 Thread Peter Geoghegan
On Fri, Jan 26, 2018 at 10:01 AM, Robert Haas  wrote:
> On Fri, Jan 26, 2018 at 1:30 AM, Peter Geoghegan  wrote:
>> I had imagined that WaitForParallelWorkersToAttach() would give me an
>> error in the style of WaitForParallelWorkersToFinish(), without
>> actually waiting for the parallel workers to finish.
>
> +1.  If we're going to go that route, and that seems to be the
> consensus, then I think an error is more appropriate than returning an
> updated worker count.

Great.

Should I wait for Amit's WaitForParallelWorkersToAttach() patch to be
posted, reviewed, and committed, or would you like to see what I came
up with ("The next revision of the patch will make the
leader-participates-as-worker spool/Tuplesortstate start and finish
sorting before the main leader spool/Tuplesortstate is even started")
today?

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-26 Thread Robert Haas
On Fri, Jan 26, 2018 at 1:30 AM, Peter Geoghegan  wrote:
> I had imagined that WaitForParallelWorkersToAttach() would give me an
> error in the style of WaitForParallelWorkersToFinish(), without
> actually waiting for the parallel workers to finish.

+1.  If we're going to go that route, and that seems to be the
consensus, then I think an error is more appropriate than returning an
updated worker count.

On the question of whether this is better or worse than using
barriers, I'm not entirely sure.  I understand that various objections
to the Barrier concept have been raised, but I'm not personally
convinced by any of them.  On the other hand, if we only have to call
WaitForParallelWorkersToAttach after the leader finishes its own sort,
then there's no latency advantage to the barrier approach.  I suspect
we might still end up reworking this if we add the ability for new
workers to join an index build in medias res at some point in the
future -- but, as Peter points out, maybe the whole algorithm would
get reworked in that scenario.  So, since other people like
WaitForParallelWorkersToAttach, I think we can just go with that for
now.  I don't want to kill this patch with unnecessary nitpicking.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-25 Thread Amit Kapila
On Fri, Jan 26, 2018 at 12:00 PM, Peter Geoghegan  wrote:
> On Thu, Jan 25, 2018 at 10:00 PM, Amit Kapila  wrote:
>>> At this point, my preferred solution is for someone to go implement
>>> Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems
>>> like the logical person for the job).
>>>
>>
>> I can implement it and share a prototype patch with you which you can
>> use to test parallel sort stuff.
>
> That would be great. Thank you.
>
>> I would like to highlight the
>> difference which you will see with WaitForParallelWorkersToAttach as
>> compared to WaitForParallelWorkersToFinish(), which is that the former will
>> give you how many of nworkers_launched workers are actually launched,
>> whereas the latter gives an error if any of the expected workers is not
>> launched.  I feel the former is good, and your proposed way of calling it
>> after the leader is done with its work has alleviated the minor
>> disadvantage of this API, which is that we need to wait for workers to startup.
>
> I'm not sure that it makes much difference, though, since in the end
> WaitForParallelWorkersToFinish() is called anyway, much like
> nodeGather.c. Have I missed something?
>

Nope, you are right.  I had in my mind that if we have something like
what I am proposing, then we don't even need to detect failures in
WaitForParallelWorkersToFinish and we can finish the work without
failing.

> I had imagined that WaitForParallelWorkersToAttach() would give me an
> error in the style of WaitForParallelWorkersToFinish(), without
> actually waiting for the parallel workers to finish.
>

I think that is also doable.  I will give it a try and report back if
I see any problem with it.  However, it might take me some time, as I
am busy with a few other things and am planning to take two days off
for personal reasons.  OTOH, if it turns out to be simple (which I
expect it will be), then I will report back early.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-25 Thread Peter Geoghegan
On Thu, Jan 25, 2018 at 10:00 PM, Amit Kapila  wrote:
>> At this point, my preferred solution is for someone to go implement
>> Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems
>> like the logical person for the job).
>>
>
> I can implement it and share a prototype patch with you which you can
> use to test parallel sort stuff.

That would be great. Thank you.

> I would like to highlight the
> difference which you will see with WaitForParallelWorkersToAttach as
> compared to WaitForParallelWorkersToFinish(), which is that the former will
> give you how many of nworkers_launched workers are actually launched,
> whereas the latter gives an error if any of the expected workers is not
> launched.  I feel the former is good, and your proposed way of calling it
> after the leader is done with its work has alleviated the minor
> disadvantage of this API, which is that we need to wait for workers to startup.

I'm not sure that it makes much difference, though, since in the end
WaitForParallelWorkersToFinish() is called anyway, much like
nodeGather.c. Have I missed something?

I had imagined that WaitForParallelWorkersToAttach() would give me an
error in the style of WaitForParallelWorkersToFinish(), without
actually waiting for the parallel workers to finish.

> However, now I see that you and Thomas are trying to find a different
> way to overcome this problem, so I am not sure if I should go ahead
> or not.  I have seen that you said you wanted to look at Thomas's
> proposed stuff carefully tomorrow, so I will wait for you guys to
> decide which way is appropriate.

I suspect that the overhead of Thomas' experimental approach is going
to cause problems in certain cases -- cases that are hard to foresee.
That patch makes HandleParallelMessages() set ParallelMessagePending
artificially, pending confirmation of having launched all workers.

It was an interesting experiment, but I think that your
WaitForParallelWorkersToAttach() idea has a better chance of working
out.

>> Once that's committed, I can
>> post a new version of the patch that uses that new infrastructure --
>> I'll add a call to the new function, without changing anything else.
>> Failing that, we could actually just use
>> WaitForParallelWorkersToFinish(). I still don't want to use a barrier,
>> mostly because it complicates parallel_leader_participation=off,
>> something that Amit is in agreement with [2][3].
>>
>
> I think if we want we can use the barrier APIs to solve this problem, but
> I kind of have a feeling that it doesn't seem to be the most
> appropriate API, especially because an existing API like
> WaitForParallelWorkersToFinish() can serve the need in a similar way.

I can't see a way in which using a barrier can have less complexity. I
think it will have quite a bit more, and I suspect that you share this
feeling.

> Just to conclude, following are proposed ways to solve this problem:
>
> 1. Implement a new API WaitForParallelWorkersToAttach and use that to
> solve this problem.  Peter G. and Amit think this is a good way to
> solve this problem.
> 2. Use existing API WaitForParallelWorkersToFinish to solve this
> problem.  Peter G. feels that if the API mentioned in #1 is not available,
> we can use this to solve the problem, and I agree with that position.
> Thomas is not against it.
> 3. Use Thomas's new way to detect such failures.  It is not clear to
> me at this stage if any one of us has accepted it as the way to
> proceed, but Thomas and Peter G. want to investigate it further.
> 4. Use of the Barrier API to solve this problem.  Robert appears to be
> strongly in favor of this approach.

That's a good summary.

The next revision of the patch will make the
leader-participates-as-worker spool/Tuplesortstate start and finish
sorting before the main leader spool/Tuplesortstate is even started.
I did this with the intention of making it very clear that my approach
does not assume a number of participants up-front -- that is actually
something we only need a final answer on at the point that the leader
merges, which is logically the last possible moment.

Hopefully this will reassure Robert. It is quite a small change, but
leads to a slightly cleaner organization within nbtsort.c, since
_bt_begin_parallel() is the only point that has to deal with leader
participation. Another minor advantage is that this makes the
trace_sort overheads/duration for each of the two tuplesorts within the
leader non-overlapping (when the leader participates as a worker).

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-25 Thread Amit Kapila
On Fri, Jan 26, 2018 at 11:30 AM, Amit Kapila  wrote:
> On Thu, Jan 25, 2018 at 1:24 AM, Peter Geoghegan  wrote:
>> On Tue, Jan 23, 2018 at 9:43 PM, Amit Kapila  wrote:
>>> Right, but what if the worker dies due to something like proc_exit(1)
>>> before calling BarrierArriveAndWait?  I think this
>>> is part of the problem we have solved in
>>> WaitForParallelWorkersToFinish such that if the worker exits abruptly
>>> at any point for some reason, the system should not hang.
>>
>> I have used Thomas' chaos-monkey-fork-process.patch to verify:
>>
>> 1. The problem of fork failure causing nbtsort.c to wait forever is a
>> real problem. Sure enough, the coding pattern within
>> _bt_leader_heapscan() can cause us to wait forever even with commit
>> 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a
>> consequence of the patch not using tuple queues (it uses the new
>> tuplesort sharing thing instead).
>>
>> 2. Simply adding a single call to WaitForParallelWorkersToFinish()
>> within _bt_leader_heapscan() before waiting on our condition variable
>> fixes the problem -- errors are reliably propagated, and we never end
>> up waiting forever.
>>
>> 3. This short term fix works just as well with
>> parallel_leader_participation=off.
>>
>> At this point, my preferred solution is for someone to go implement
>> Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems
>> like the logical person for the job).
>>
>
> I can implement it and share a prototype patch with you which you can
> use to test parallel sort stuff.  I would like to highlight the
> difference which you will see with WaitForParallelWorkersToAttach as
> compare to WaitForParallelWorkersToFinish() is that the former will
> give you how many of nworkers_launched workers are actually launched
> whereas latter gives an error if any of the expected workers is not
> launched.  I feel former is good and your proposed way of calling it
> after the leader is done with its work has alleviated the minor
> disadvantage of this API which is that we need for workers to startup.
>

s/we need for workers to startup/we need to wait for workers to startup/

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-25 Thread Amit Kapila
On Thu, Jan 25, 2018 at 1:24 AM, Peter Geoghegan  wrote:
> On Tue, Jan 23, 2018 at 9:43 PM, Amit Kapila  wrote:
>> Right, but what if the worker dies due to something like proc_exit(1)
>> before calling BarrierArriveAndWait?  I think this
>> is part of the problem we have solved in
>> WaitForParallelWorkersToFinish such that if the worker exits abruptly
>> at any point for some reason, the system should not hang.
>
> I have used Thomas' chaos-monkey-fork-process.patch to verify:
>
> 1. The problem of fork failure causing nbtsort.c to wait forever is a
> real problem. Sure enough, the coding pattern within
> _bt_leader_heapscan() can cause us to wait forever even with commit
> 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a
> consequence of the patch not using tuple queues (it uses the new
> tuplesort sharing thing instead).
>
> 2. Simply adding a single call to WaitForParallelWorkersToFinish()
> within _bt_leader_heapscan() before waiting on our condition variable
> fixes the problem -- errors are reliably propagated, and we never end
> up waiting forever.
>
> 3. This short term fix works just as well with
> parallel_leader_participation=off.
>
> At this point, my preferred solution is for someone to go implement
> Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems
> like the logical person for the job).
>

I can implement it and share a prototype patch with you which you can
use to test parallel sort stuff.  I would like to highlight the
difference which you will see with WaitForParallelWorkersToAttach as
compared to WaitForParallelWorkersToFinish(), which is that the former will
give you how many of nworkers_launched workers are actually launched,
whereas the latter gives an error if any of the expected workers is not
launched.  I feel the former is good, and your proposed way of calling it
after the leader is done with its work has alleviated the minor
disadvantage of this API, which is that we need for workers to startup.

However, now I see that you and Thomas are trying to find a different
way to overcome this problem, so I am not sure if I should go ahead
or not.  I have seen that you said you wanted to look at Thomas's
proposed stuff carefully tomorrow, so I will wait for you guys to
decide which way is appropriate.

> Once that's committed, I can
> post a new version of the patch that uses that new infrastructure --
> I'll add a call to the new function, without changing anything else.
> Failing that, we could actually just use
> WaitForParallelWorkersToFinish(). I still don't want to use a barrier,
> mostly because it complicates parallel_leader_participation=off,
> something that Amit is in agreement with [2][3].
>

I think if we want we can use the barrier APIs to solve this problem, but
I kind of have a feeling that it doesn't seem to be the most
appropriate API, especially because an existing API like
WaitForParallelWorkersToFinish() can serve the need in a similar way.

Just to conclude, following are proposed ways to solve this problem:

1. Implement a new API WaitForParallelWorkersToAttach and use that to
solve this problem.  Peter G. and Amit think this is a good way to
solve this problem.
2. Use existing API WaitForParallelWorkersToFinish to solve this
problem.  Peter G. feels that if the API mentioned in #1 is not available,
we can use this to solve the problem, and I agree with that position.
Thomas is not against it.
3. Use Thomas's new way to detect such failures.  It is not clear to
me at this stage if any one of us has accepted it as the way to
proceed, but Thomas and Peter G. want to investigate it further.
4. Use of the Barrier API to solve this problem.  Robert appears to be
strongly in favor of this approach.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-24 Thread Thomas Munro
On Thu, Jan 25, 2018 at 9:28 AM, Peter Geoghegan  wrote:
> On Wed, Jan 24, 2018 at 12:13 PM, Thomas Munro
>  wrote:
>> On Thu, Jan 25, 2018 at 8:54 AM, Peter Geoghegan  wrote:
>>> I have used Thomas' chaos-monkey-fork-process.patch to verify:
>>>
>>> 1. The problem of fork failure causing nbtsort.c to wait forever is a
>>> real problem. Sure enough, the coding pattern within
>>> _bt_leader_heapscan() can cause us to wait forever even with commit
>>> 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a
>>> consequence of the patch not using tuple queues (it uses the new
>>> tuplesort sharing thing instead).
>>
>> Just curious: does the attached also help?
>
> I can still reproduce the problem without the fix I described (which
> does work), using your patch instead.
>
> Offhand, I suspect that the way you set ParallelMessagePending may not
> always leave it set when it should be.

Here's a version that works, and a minimal repro test module thing.
Without 0003 applied, it hangs.  With 0003 applied, it does this:

postgres=# call test_fork_failure();
CALL
postgres=# call test_fork_failure();
CALL
postgres=# call test_fork_failure();
ERROR:  lost connection to parallel worker
postgres=# call test_fork_failure();
ERROR:  lost connection to parallel worker

I won't be surprised if 0003 is judged to be a horrendous abuse of the
interrupt system, but these patches might at least be useful for
understanding the problem.

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Chaos-monkey-fork-failure.patch
Description: Binary data


0002-A-simple-test-module-that-hangs-on-fork-failure.patch
Description: Binary data


0003-Pessimistic-fork-failure-detector.patch
Description: Binary data


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-24 Thread Peter Geoghegan
On Wed, Jan 24, 2018 at 12:13 PM, Thomas Munro
 wrote:
> On Thu, Jan 25, 2018 at 8:54 AM, Peter Geoghegan  wrote:
>> I have used Thomas' chaos-monkey-fork-process.patch to verify:
>>
>> 1. The problem of fork failure causing nbtsort.c to wait forever is a
>> real problem. Sure enough, the coding pattern within
>> _bt_leader_heapscan() can cause us to wait forever even with commit
>> 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a
>> consequence of the patch not using tuple queues (it uses the new
>> tuplesort sharing thing instead).
>
> Just curious: does the attached also help?

I can still reproduce the problem without the fix I described (which
does work), using your patch instead.

Offhand, I suspect that the way you set ParallelMessagePending may not
always leave it set when it should be.

>> 2. Simply adding a single call to WaitForParallelWorkersToFinish()
>> within _bt_leader_heapscan() before waiting on our condition variable
>> fixes the problem -- errors are reliably propagated, and we never end
>> up waiting forever.
>
> That does seem like a nice, simple solution and I am not against it.
> The niggling thing that bothers me about it, though, is that it
> requires the client of parallel.c to follow a slightly complicated
> protocol or risk a rare obscure failure mode, and recognise the cases
> where that's necessary.  Specifically, if you're not blocking in a
> shm_mq wait loop, then you must make a call to this new interface
> before you do any other kind of latch wait, but if you get that wrong
> you'll probably not notice since fork failure is rare!  It seems like
> it'd be nicer if we could figure out a way to make it so that any
> latch/CFI loop would automatically be safe against fork failure.

It would certainly be nicer, but I don't see much risk if we add a
comment next to nworkers_launched that said: Don't trust this until
you've called (Amit's proposed) WaitForParallelWorkersToAttach()
function, unless you're using the tuple queue infrastructure, which
lets you not need to directly care about the distinction between a
launched worker never starting, and a launched worker successfully
completing.
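
For what it's worth, here is one way such a comment could read, as an
annotation on the field in src/include/access/parallel.h (the wording
below is only illustrative, not an actual patch):

    /*
     * nworkers_launched only counts workers whose registration succeeded,
     * which in no way guarantees that they will start up and initialize.
     * Don't trust it as a count of workers that will actually show up
     * unless you are using the tuple queue infrastructure, or have first
     * called the proposed WaitForParallelWorkersToAttach().
     */
    int         nworkers_launched;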

While I agree with what Robert said on the other thread -- "I guess
that works, but it seems more like blind luck than good design.
Parallel CREATE INDEX fails to be as "lucky" as Gather" -- that
doesn't mean that that situation cannot be formalized. And even if it
isn't formalized, then I think that that will probably be because
Gather ends up doing almost the same thing.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-24 Thread Thomas Munro
On Thu, Jan 25, 2018 at 8:54 AM, Peter Geoghegan  wrote:
> I have used Thomas' chaos-monkey-fork-process.patch to verify:
>
> 1. The problem of fork failure causing nbtsort.c to wait forever is a
> real problem. Sure enough, the coding pattern within
> _bt_leader_heapscan() can cause us to wait forever even with commit
> 2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a
> consequence of the patch not using tuple queues (it uses the new
> tuplesort sharing thing instead).

Just curious: does the attached also help?

> 2. Simply adding a single call to WaitForParallelWorkersToFinish()
> within _bt_leader_heapscan() before waiting on our condition variable
> fixes the problem -- errors are reliably propagated, and we never end
> up waiting forever.

That does seem like a nice, simple solution and I am not against it.
The niggling thing that bothers me about it, though, is that it
requires the client of parallel.c to follow a slightly complicated
protocol or risk a rare obscure failure mode, and recognise the cases
where that's necessary.  Specifically, if you're not blocking in a
shm_mq wait loop, then you must make a call to this new interface
before you do any other kind of latch wait, but if you get that wrong
you'll probably not notice since fork failure is rare!  It seems like
it'd be nicer if we could figure out a way to make it so that any
latch/CFI loop would automatically be safe against fork failure.  The
attached (if it actually works, I dunno) is the worst way, but I
wonder if there is some way to traffic just a teensy bit more
information from postmaster to leader so that it could be efficient...

-- 
Thomas Munro
http://www.enterprisedb.com


pessimistic-fork-failure-detector-v2.patch
Description: Binary data


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-24 Thread Peter Geoghegan
On Tue, Jan 23, 2018 at 9:43 PM, Amit Kapila  wrote:
> Right, but what if the worker dies due to something like proc_exit(1)
> before calling BarrierArriveAndWait?  I think this
> is part of the problem we have solved in
> WaitForParallelWorkersToFinish such that if the worker exits abruptly
> at any point for some reason, the system should not hang.

I have used Thomas' chaos-monkey-fork-process.patch to verify:

1. The problem of fork failure causing nbtsort.c to wait forever is a
real problem. Sure enough, the coding pattern within
_bt_leader_heapscan() can cause us to wait forever even with commit
2badb5afb89cd569500ef7c3b23c7a9d11718f2f, more or less as a
consequence of the patch not using tuple queues (it uses the new
tuplesort sharing thing instead).

2. Simply adding a single call to WaitForParallelWorkersToFinish()
within _bt_leader_heapscan() before waiting on our condition variable
fixes the problem -- errors are reliably propagated, and we never end
up waiting forever.

3. This short term fix works just as well with
parallel_leader_participation=off.
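
To make point 2 concrete, here is a minimal sketch of that ordering in
the leader's wait routine.  The struct, field, and wait-event names
(BTLeader, BTShared, nparticipantsdone, workersdonecv,
WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN) are assumptions for illustration,
not the patch's actual code:

static void
_bt_leader_wait_for_workers(BTLeader *btleader)
{
    BTShared   *btshared = btleader->btshared;

    /*
     * Propagates any error raised in a worker, and -- crucially -- turns
     * fork failure into an error rather than an eternal wait below.
     */
    WaitForParallelWorkersToFinish(btleader->pcxt);

    for (;;)
    {
        bool        done;

        SpinLockAcquire(&btshared->mutex);
        done = (btshared->nparticipantsdone ==
                btleader->nparticipanttuplesorts);
        SpinLockRelease(&btshared->mutex);

        if (done)
            break;
        ConditionVariableSleep(&btshared->workersdonecv,
                               WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN);
    }
    ConditionVariableCancelSleep();
}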

At this point, my preferred solution is for someone to go implement
Amit's WaitForParallelWorkersToAttach() idea [1] (Amit himself seems
like the logical person for the job). Once that's committed, I can
post a new version of the patch that uses that new infrastructure --
I'll add a call to the new function, without changing anything else.
Failing that, we could actually just use
WaitForParallelWorkersToFinish(). I still don't want to use a barrier,
mostly because it complicates parallel_leader_participation=off,
something that Amit is in agreement with [2][3].

For now, I am waiting for feedback from Robert on next steps.

[1] 
https://postgr.es/m/CAH2-Wzm6dF=g9lywthgcqzrc4dzbe-8tv28yvg0xj8q6e4+...@mail.gmail.com
[2] 
https://postgr.es/m/CAA4eK1LEFd28p1kw2Fst9LzgBgfMbDEq9wPh9jWFC0ye6ce62A%40mail.gmail.com
[3] 
https://postgr.es/m/caa4ek1+a0of4m231vbgpr_0ygg_bnmrgzlib7wqde-fybsy...@mail.gmail.com
-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-23 Thread Thomas Munro
On Wed, Jan 24, 2018 at 6:43 PM, Amit Kapila  wrote:
> On Wed, Jan 24, 2018 at 10:36 AM, Thomas Munro
>  wrote:
>> On Wed, Jan 24, 2018 at 5:59 PM, Amit Kapila  wrote:
>>>>> I am going to repeat my previous suggestion that we use a Barrier here.
>>>>> Given the discussion subsequent to my original proposal, this can be a
>>>>> lot simpler than what I suggested originally.  Each worker does
>>>>> BarrierAttach() before beginning to read tuples (exiting if the phase
>>>>> returned is non-zero) and BarrierArriveAndDetach() when it's done
>>>>> sorting.  The leader does BarrierAttach() before launching workers and
>>>>> BarrierArriveAndWait() when it's done sorting.
>>>
>>> How does the leader detect if one of the workers does BarrierAttach and
>>> then fails (either exits or errors out) before doing
>>> BarrierArriveAndDetach?
>>
>> If you attach and then exit cleanly, that's a programming error and
>> would cause anyone who runs BarrierArriveAndWait() to hang forever.
>>
>
> Right, but what if the worker dies due to something like proc_exit(1)
> before calling BarrierArriveAndWait?  I think this
> is part of the problem we have solved in
> WaitForParallelWorkersToFinish such that if the worker exits abruptly
> at any point for some reason, the system should not hang.

Actually what I said before is no longer true: after commit 2badb5af,
if you exit unexpectedly then the new ParallelWorkerShutdown() exit
hook delivers PROCSIG_PARALLEL_MESSAGE (apparently after detaching
from the error queue) and the leader aborts when it tries to read the
error queue.  I just hacked Parallel Hash like this:

BarrierAttach(build_barrier);
+   if (ParallelWorkerNumber == 0)
+   {
+       pg_usleep(100);
+       proc_exit(1);
+   }

Now I see:

postgres=# select count(*) from foox r join foox s on r.a = s.a;
ERROR:  lost connection to parallel worker

Using a debugger I can see the leader raising that error with this stack:

HandleParallelMessages at parallel.c:890
ProcessInterrupts at postgres.c:3053
ConditionVariableSleep(cv=0x00010a62e4c8,
wait_event_info=134217737) at condition_variable.c:151
BarrierArriveAndWait(barrier=0x00010a62e4b0,
wait_event_info=134217737) at barrier.c:191
MultiExecParallelHash(node=0x7ffcd9050b10) at nodeHash.c:312
MultiExecHash(node=0x7ffcd9050b10) at nodeHash.c:112
MultiExecProcNode(node=0x7ffcd9050b10) at execProcnode.c:502
ExecParallelHashJoin [inlined]
ExecHashJoinImpl(pstate=0x7ffcda01baa0, parallel='\x01') at
nodeHashjoin.c:291
ExecParallelHashJoin(pstate=0x7ffcda01baa0) at nodeHashjoin.c:582

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-23 Thread Amit Kapila
On Wed, Jan 24, 2018 at 10:36 AM, Thomas Munro
 wrote:
> On Wed, Jan 24, 2018 at 5:59 PM, Amit Kapila  wrote:
>>>> I am going to repeat my previous suggestion that we use a Barrier here.
>>>> Given the discussion subsequent to my original proposal, this can be a
>>>> lot simpler than what I suggested originally.  Each worker does
>>>> BarrierAttach() before beginning to read tuples (exiting if the phase
>>>> returned is non-zero) and BarrierArriveAndDetach() when it's done
>>>> sorting.  The leader does BarrierAttach() before launching workers and
>>>> BarrierArriveAndWait() when it's done sorting.
>>
>> How does the leader detect if one of the workers does BarrierAttach and
>> then fails (either exits or errors out) before doing
>> BarrierArriveAndDetach?
>
> If you attach and then exit cleanly, that's a programming error and
> would cause anyone who runs BarrierArriveAndWait() to hang forever.
>

Right, but what if the worker dies due to something like proc_exit(1)
before calling BarrierArriveAndWait?  I think this
is part of the problem we have solved in
WaitForParallelWorkersToFinish such that if the worker exits abruptly
at any point for some reason, the system should not hang.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-23 Thread Thomas Munro
On Wed, Jan 24, 2018 at 5:59 PM, Amit Kapila  wrote:
>>> I am going to repeat my previous suggestion that we use a Barrier here.
>>> Given the discussion subsequent to my original proposal, this can be a
>>> lot simpler than what I suggested originally.  Each worker does
>>> BarrierAttach() before beginning to read tuples (exiting if the phase
>>> returned is non-zero) and BarrierArriveAndDetach() when it's done
>>> sorting.  The leader does BarrierAttach() before launching workers and
>>> BarrierArriveAndWait() when it's done sorting.
>
> How does the leader detect if one of the workers does BarrierAttach and
> then fails (either exits or errors out) before doing
> BarrierArriveAndDetach?

If you attach and then exit cleanly, that's a programming error and
would cause anyone who runs BarrierArriveAndWait() to hang forever.
If you attach and raise an error, the leader will receive an error
message via CFI() and will then raise an error itself and terminate
all workers during cleanup.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-23 Thread Amit Kapila
On Wed, Jan 24, 2018 at 12:20 AM, Peter Geoghegan  wrote:
> On Tue, Jan 23, 2018 at 10:36 AM, Robert Haas  wrote:
>> As Amit says, what remains is the case where fork() fails or the
>> worker dies before it reaches the line in ParallelWorkerMain that
>> reads shm_mq_set_sender(mq, MyProc).  In those cases, no error will be
>> signaled until you call WaitForParallelWorkersToFinish().  If you wait
>> prior to that point for a number of workers equal to
>> nworkers_launched, you will wait forever in those cases.
>
> Another option might be to actually call
> WaitForParallelWorkersToFinish() in place of a condition variable or
> barrier, as Amit suggested at one point.
>

Yes, the only thing that is slightly worrying about using
WaitForParallelWorkersToFinish is that the backend leader needs to wait
for the workers to finish rather than just finishing its sort-related
work.  I think there shouldn't be much difference between when the sort
is done and when the workers actually finish the remaining resource
cleanup.  OTOH, if we are not okay with this solution and want to go
with some kind of usage of barriers to solve this problem, then we can
evaluate that as well, but I feel it is better if we can use the
method which is used in other parallelism code to solve this problem
(which is to use WaitForParallelWorkersToFinish).

>> I am going to repeat my previous suggestion that we use a Barrier here.
>> Given the discussion subsequent to my original proposal, this can be a
>> lot simpler than what I suggested originally.  Each worker does
>> BarrierAttach() before beginning to read tuples (exiting if the phase
>> returned is non-zero) and BarrierArriveAndDetach() when it's done
>> sorting.  The leader does BarrierAttach() before launching workers and
>> BarrierArriveAndWait() when it's done sorting.
>>

How does the leader detect if one of the workers does BarrierAttach and
then fails (either exits or errors out) before doing
BarrierArriveAndDetach?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-23 Thread Thomas Munro
On Fri, Jan 19, 2018 at 6:22 AM, Robert Haas  wrote:
>> (3)
>> erm, maybe it's a problem that errors occurring in workers while the
>> leader is waiting at a barrier won't unblock the leader (we don't
>> detach from barriers on abort/exit) -- I'll look into this.
>
> I think if there's an ERROR, the general parallelism machinery is
> going to arrange to kill every worker, so nothing matters in that case
> unless barrier waits ignore interrupts, which I'm pretty sure they
> don't.  (Also: if they do, I'll hit the ceiling; that would be awful.)

(After talking this through with Robert off-list).  Right, the
CHECK_FOR_INTERRUPTS() in ConditionVariableSleep() handles errors from
parallel workers.  There is no problem here.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-23 Thread Robert Haas
On Tue, Jan 23, 2018 at 2:11 PM, Peter Geoghegan  wrote:
> Finally, it's still not clear to me why nodeGather.c's use of
> parallel_leader_participation=off doesn't suffer from similar problems
> [1].

Thomas and I just concluded that it does.  See my email on the other
thread just now.

I thought that I had the failure cases all nailed down here now, but I
guess not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-23 Thread Peter Geoghegan
On Tue, Jan 23, 2018 at 10:50 AM, Peter Geoghegan  wrote:
> On Tue, Jan 23, 2018 at 10:36 AM, Robert Haas  wrote:
>> I am going to repeat my previous suggestion that we use a Barrier here.
>> Given the discussion subsequent to my original proposal, this can be a
>> lot simpler than what I suggested originally.  Each worker does
>> BarrierAttach() before beginning to read tuples (exiting if the phase
>> returned is non-zero) and BarrierArriveAndDetach() when it's done
>> sorting.  The leader does BarrierAttach() before launching workers and
>> BarrierArriveAndWait() when it's done sorting.  If we don't do this,
>> we're going to have to invent some other mechanism to count the
>> participants that actually initialize successfully, but that seems
>> like it's just duplicating code.
>
> I think that this closes the door to leader non-participation as
> anything other than a developer-only debug option, which might be
> fine. If parallel_leader_participation=off (or some way of getting the
> same behavior through a #define) is to be retained, then an artificial
> wait is required as a substitute for the leader's participation as a
> worker.

This idea of an artificial wait seems pretty grotty to me. If we made
it one second, would that be okay with Valgrind builds? And when it
wasn't sufficient, wouldn't we be back to waiting forever?

Finally, it's still not clear to me why nodeGather.c's use of
parallel_leader_participation=off doesn't suffer from similar problems
[1].

[1] 
https://postgr.es/m/CAH2-Wz=cAMX5btE1s=aTz7CLwzpEPm_NsUhAMAo5t5=1i9v...@mail.gmail.com
-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-23 Thread Peter Geoghegan
On Tue, Jan 23, 2018 at 10:36 AM, Robert Haas  wrote:
> As Amit says, what remains is the case where fork() fails or the
> worker dies before it reaches the line in ParallelWorkerMain that
> reads shm_mq_set_sender(mq, MyProc).  In those cases, no error will be
> signaled until you call WaitForParallelWorkersToFinish().  If you wait
> prior to that point for a number of workers equal to
> nworkers_launched, you will wait forever in those cases.

Another option might be to actually call
WaitForParallelWorkersToFinish() in place of a condition variable or
barrier, as Amit suggested at one point.

> I am going to repeat my previous suggestion that we use a Barrier here.
> Given the discussion subsequent to my original proposal, this can be a
> lot simpler than what I suggested originally.  Each worker does
> BarrierAttach() before beginning to read tuples (exiting if the phase
> returned is non-zero) and BarrierArriveAndDetach() when it's done
> sorting.  The leader does BarrierAttach() before launching workers and
> BarrierArriveAndWait() when it's done sorting.  If we don't do this,
> we're going to have to invent some other mechanism to count the
> participants that actually initialize successfully, but that seems
> like it's just duplicating code.

I think that this closes the door to leader non-participation as
anything other than a developer-only debug option, which might be
fine. If parallel_leader_participation=off (or some way of getting the
same behavior through a #define) is to be retained, then an artificial
wait is required as a substitute for the leader's participation as a
worker.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-23 Thread Robert Haas
On Mon, Jan 22, 2018 at 10:13 PM, Peter Geoghegan  wrote:
> _bt_leader_heapscan() can detect when workers exit early, at least in
> the vast majority of cases. It can do this simply by processing
> interrupts and automatically propagating any error -- nothing special
> about that. It can also detect when workers have finished
> successfully, because of course, that's the main reason for its
> existence. What remains, exactly?

As Amit says, what remains is the case where fork() fails or the
worker dies before it reaches the line in ParallelWorkerMain that
reads shm_mq_set_sender(mq, MyProc).  In those cases, no error will be
signaled until you call WaitForParallelWorkersToFinish().  If you wait
prior to that point for a number of workers equal to
nworkers_launched, you will wait forever in those cases.

I am going to repeat my previous suggestion that we use a Barrier here.
Given the discussion subsequent to my original proposal, this can be a
lot simpler than what I suggested originally.  Each worker does
BarrierAttach() before beginning to read tuples (exiting if the phase
returned is non-zero) and BarrierArriveAndDetach() when it's done
sorting.  The leader does BarrierAttach() before launching workers and
BarrierArriveAndWait() when it's done sorting.  If we don't do this,
we're going to have to invent some other mechanism to count the
participants that actually initialize successfully, but that seems
like it's just duplicating code.
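
In outline, this maps onto the existing barrier.h API like so (a sketch
only; "build_barrier", living in the shared DSM segment, is an assumed
name, and this is not code from any posted patch):

/* Worker, before reading any tuples: */
if (BarrierAttach(build_barrier) > 0)
{
    /* Sorting phase already over; too late to contribute anything. */
    BarrierDetach(build_barrier);
    proc_exit(0);
}
/* ... scan and sort this worker's share of the input ... */
BarrierArriveAndDetach(build_barrier);      /* done sorting; don't wait */

/* Leader: */
BarrierAttach(build_barrier);               /* before launching workers */
LaunchParallelWorkers(pcxt);
/* ... leader sorts its own share, if participating ... */
BarrierArriveAndWait(build_barrier, 0);     /* returns once every attached
                                             * worker has arrived */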

This proposal has some minor advantages even when no fork() failure or
similar occurs.  If, for example, one or more workers take a long time
to start, the leader doesn't have to wait for them before writing out
the index.  As soon as all the workers that attached to the Barrier
have arrived at the end of phase 0, the leader can build a new tape
set from all of the tapes that exist at that time.  It does not need
to wait for the remaining workers to start up and create empty tapes.
This is only a minor advantage since we probably shouldn't be doing
CREATE INDEX in parallel in the first place if the index build is so
short that this scenario is likely to occur, but we get it basically
for free, so why not?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-22 Thread Amit Kapila
On Tue, Jan 23, 2018 at 8:43 AM, Peter Geoghegan  wrote:
> On Mon, Jan 22, 2018 at 6:45 PM, Amit Kapila  wrote:
>>> FWIW, I don't think that that's really much of a difference.
>>>
>>> ExecParallelFinish() calls WaitForParallelWorkersToFinish(), which is
>>> similar to how _bt_end_parallel() calls
>>> WaitForParallelWorkersToFinish() in the patch. The
>>> _bt_leader_heapscan() condition variable wait for workers that you
>>> refer to is quite a bit like how gather_readnext() behaves. It
>>> generally checks to make sure that all tuple queues are done.
>>> gather_readnext() can wait for developments using WaitLatch(), to make
>>> sure every tuple queue is visited, with all output reliably consumed.
>>>
>>
>> The difference lies in the fact that in gather_readnext, we use the tuple
>> queue mechanism, which has the capability to detect that the workers
>> have stopped/exited, whereas _bt_leader_heapscan doesn't have any such
>> capability, so I think it will loop forever.
>
> _bt_leader_heapscan() can detect when workers exit early, at least in
> the vast majority of cases. It can do this simply by processing
> interrupts and automatically propagating any error -- nothing special
> about that. It can also detect when workers have finished
> successfully, because of course, that's the main reason for its
> existence. What remains, exactly?
>

Will it be able to detect a fork failure, or a worker exiting before
attaching to the error queue?  I think you can try it once by forcing fork
failure in do_start_bgworker and seeing the behavior of
_bt_leader_heapscan.  I could have tried it and let you know the results,
but the latest patch doesn't seem to apply cleanly.

> I don't know that much about tuple queues, but from a quick read I
> guess you might be talking about shm_mq_receive() +
> shm_mq_wait_internal(). It's not obvious that that will work in all
> cases ("Note that if handle == NULL, and the process fails to attach,
> we'll potentially get stuck here forever"). Also, I don't see how this
> addresses the parallel_leader_participation issue I raised.
>

I am talking about shm_mq_receive->shm_mq_counterparty_gone.  In
shm_mq_counterparty_gone, it can detect if the worker is gone by using
GetBackgroundWorkerPid.
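
For reference, shm_mq_counterparty_gone() looks roughly like this
(a simplified paraphrase, not verbatim source):

static bool
shm_mq_counterparty_gone(shm_mq *mq, BackgroundWorkerHandle *handle)
{
    pid_t       pid;

    /* If the queue has been detached, the counterparty is surely gone. */
    if (mq->mq_detached)
        return true;

    /* If we have a handle, ask the postmaster about the worker. */
    if (handle != NULL)
    {
        BgwHandleStatus status = GetBackgroundWorkerPid(handle, &pid);

        /*
         * Neither running nor still waiting to start: the worker died
         * (or fork failed), so it will never attach to this queue.
         */
        if (status != BGWH_STARTED && status != BGWH_NOT_YET_STARTED)
        {
            mq->mq_detached = true;     /* make it official */
            return true;
        }
    }
    return false;
}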

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-22 Thread Amit Kapila
On Tue, Jan 23, 2018 at 1:45 AM, Peter Geoghegan  wrote:
> On Mon, Jan 22, 2018 at 3:52 AM, Amit Kapila  wrote:
>> The difference is that nodeGather.c doesn't have any logic like the
>> one you have in _bt_leader_heapscan where the patch waits for each
>> worker to increment nparticipantsdone.  For Gather node, we do such a
>> thing (wait for all workers to finish) by calling
>> WaitForParallelWorkersToFinish which will have the capability after
>> Robert's patch to detect if any worker is exited abnormally (fork
>> failure or failed before attaching to the error queue).
>
> FWIW, I don't think that that's really much of a difference.
>
> ExecParallelFinish() calls WaitForParallelWorkersToFinish(), which is
> similar to how _bt_end_parallel() calls
> WaitForParallelWorkersToFinish() in the patch. The
> _bt_leader_heapscan() condition variable wait for workers that you
> refer to is quite a bit like how gather_readnext() behaves. It
> generally checks to make sure that all tuple queues are done.
> gather_readnext() can wait for developments using WaitLatch(), to make
> sure every tuple queue is visited, with all output reliably consumed.
>

The difference lies in the fact that in gather_readnext, we use the tuple
queue mechanism, which has the capability to detect that the workers
have stopped/exited, whereas _bt_leader_heapscan doesn't have any such
capability, so I think it will loop forever.
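
The relevant part of gather_readnext() looks roughly like this (a
simplified paraphrase from memory, with the reader bookkeeping elided):
reading a tuple queue surfaces worker death, because shm_mq_receive()
returns SHM_MQ_DETACHED once the counterparty is known to be gone.

for (;;)
{
    TupleQueueReader *reader = gatherstate->reader[gatherstate->nextreader];
    HeapTuple   tup;
    bool        readerdone;

    CHECK_FOR_INTERRUPTS();
    tup = TupleQueueReaderNext(reader, true, &readerdone);
    if (readerdone)
    {
        /* This worker finished or died; stop polling its queue. */
        if (--gatherstate->nreaders == 0)
            return NULL;                /* every queue is drained */
        continue;
    }
    if (tup != NULL)
        return tup;

    /* No tuple ready in any queue: sleep until something happens. */
    WaitLatch(MyLatch, WL_LATCH_SET, 0, WAIT_EVENT_EXECUTE_GATHER);
    ResetLatch(MyLatch);
}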

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-22 Thread Amit Kapila
On Mon, Jan 22, 2018 at 10:36 AM, Peter Geoghegan  wrote:
> On Sun, Jan 21, 2018 at 6:34 PM, Amit Kapila  wrote:
>>> Why is this okay for Gather nodes, though? nodeGather.c looks at
>>> pcxt->nworkers_launched during initialization, and appears to at least
>>> trust it to indicate that more than zero actually-launched workers
>>> will also show up when "nworkers_launched > 0". This trust seems critical
>>> when parallel_leader_participation is off, because "node->nreaders ==
>>> 0" overrides the parallel_leader_participation GUC's setting (note
>>> that node->nreaders comes directly from pcxt->nworkers_launched). If
>>> zero workers show up, and parallel_leader_participation is off, but
>>> pcxt->nworkers_launched/node->nreaders is non-zero, won't the Gather
>>> never make forward progress?
>>
>> Ideally, that situation should be detected and we should throw an
>> error, but that doesn't happen today.  However, it will be handled
>> with Robert's patch on the other thread for CF entry [1].
>
> I knew that, but I was confused by your sketch of the
> WaitForParallelWorkerToAttach() API [1]. Specifically, your suggestion
> that the problem was unique to nbtsort.c, or was at least something
> that nbtsort.c had to take a special interest in. It now appears more
> like a general problem with a general solution, and likely one that
> won't need *any* changes to code in places like nodeGather.c (or
> nbtsort.c, in the case of my patch).
>
> I guess that you meant that parallel CREATE INDEX is the first thing
> to care about the *precise* number of nworkers_launched -- that is
> kind of a new thing. That doesn't seem like it makes any practical
> difference to us, though. I don't see why nbtsort.c should take a
> special interest in this problem, for example by calling
> WaitForParallelWorkerToAttach() itself. I may have missed something,
> but right now ISTM that it would be risky to make the API anything
> other than what both nodeGather.c and nbtsort.c already expect (that
> they'll either have nworkers_launched workers show up, or be able to
> propagate an error).
>

The difference is that nodeGather.c doesn't have any logic like the
one you have in _bt_leader_heapscan where the patch waits for each
worker to increment nparticipantsdone.  For Gather node, we do such a
thing (wait for all workers to finish) by calling
WaitForParallelWorkersToFinish which will have the capability after
Robert's patch to detect if any worker is exited abnormally (fork
failure or failed before attaching to the error queue).


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-21 Thread Peter Geoghegan
On Sun, Jan 21, 2018 at 6:34 PM, Amit Kapila  wrote:
>> Why is this okay for Gather nodes, though? nodeGather.c looks at
>> pcxt->nworkers_launched during initialization, and appears to at least
>> trust it to indicate that more than zero actually-launched workers
>> will also show up when "nworkers_launched > 0". This trust seems critical
>> when parallel_leader_participation is off, because "node->nreaders ==
>> 0" overrides the parallel_leader_participation GUC's setting (note
>> that node->nreaders comes directly from pcxt->nworkers_launched). If
>> zero workers show up, and parallel_leader_participation is off, but
>> pcxt->nworkers_launched/node->nreaders is non-zero, won't the Gather
>> never make forward progress?
>
> Ideally, that situation should be detected and we should throw an
> error, but that doesn't happen today.  However, it will be handled
> with Robert's patch on the other thread for CF entry [1].

I knew that, but I was confused by your sketch of the
WaitForParallelWorkerToAttach() API [1]. Specifically, your suggestion
that the problem was unique to nbtsort.c, or was at least something
that nbtsort.c had to take a special interest in. It now appears more
like a general problem with a general solution, and likely one that
won't need *any* changes to code in places like nodeGather.c (or
nbtsort.c, in the case of my patch).

I guess that you meant that parallel CREATE INDEX is the first thing
to care about the *precise* number of nworkers_launched -- that is
kind of a new thing. That doesn't seem like it makes any practical
difference to us, though. I don't see why nbtsort.c should take a
special interest in this problem, for example by calling
WaitForParallelWorkerToAttach() itself. I may have missed something,
but right now ISTM that it would be risky to make the API anything
other than what both nodeGather.c and nbtsort.c already expect (that
they'll either have nworkers_launched workers show up, or be able to
propagate an error).

[1] 
https://postgr.es/m/caa4ek1kzvxtcff8inhceviupxp4ywcs3rzuwjfqmttf75x2...@mail.gmail.com
-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-21 Thread Amit Kapila
On Mon, Jan 22, 2018 at 12:50 AM, Peter Geoghegan  wrote:
> On Sat, Jan 20, 2018 at 8:38 PM, Amit Kapila  wrote:
>> It would have been better if there were some comments besides that
>> field, but I think it has been covered at another place in the code.
>> See comments in LaunchParallelWorkers().
>>
>> /*
>> * Start workers.
>> *
>> * The caller must be able to tolerate ending up with fewer workers than
>> * expected, so there is no need to throw an error here if registration
>> * fails.  It wouldn't help much anyway, because registering the worker in
>> * no way guarantees that it will start up and initialize successfully.
>> */
>
> Why is this okay for Gather nodes, though? nodeGather.c looks at
> pcxt->nworkers_launched during initialization, and appears to at least
> trust it to indicate that more than zero actually-launched workers
> will also show up when "nworkers_launched > 0". This trust seems critical
> when parallel_leader_participation is off, because "node->nreaders ==
> 0" overrides the parallel_leader_participation GUC's setting (note
> that node->nreaders comes directly from pcxt->nworkers_launched). If
> zero workers show up, and parallel_leader_participation is off, but
> pcxt->nworkers_launched/node->nreaders is non-zero, won't the Gather
> never make forward progress?
>

Ideally, that situation should be detected and we should throw an
error, but that doesn't happen today.  However, it will be handled
with Robert's patch on the other thread for CF entry [1].


[1] - https://commitfest.postgresql.org/16/1341/

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-21 Thread Peter Geoghegan
On Sat, Jan 20, 2018 at 8:38 PM, Amit Kapila  wrote:
> It would have been better if there were some comments beside that
> field, but I think it has been covered at another place in the code.
> See comments in LaunchParallelWorkers().
>
> /*
> * Start workers.
> *
> * The caller must be able to tolerate ending up with fewer workers than
> * expected, so there is no need to throw an error here if registration
> * fails.  It wouldn't help much anyway, because registering the worker in
> * no way guarantees that it will start up and initialize successfully.
> */

Why is this okay for Gather nodes, though? nodeGather.c looks at
pcxt->nworkers_launched during initialization, and appears to at least
trust it to indicate that more than zero actually-launched workers
will also show up when "nworkers_launched > 0". This trust seems critical
when parallel_leader_participation is off, because "node->nreaders ==
0" overrides the parallel_leader_participation GUC's setting (note
that node->nreaders comes directly from pcxt->nworkers_launched). If
zero workers show up, and parallel_leader_participation is off, but
pcxt->nworkers_launched/node->nreaders is non-zero, won't the Gather
never make forward progress?
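
The logic in question, paraphrased from ExecGather() in nodeGather.c
(simplified and from memory, so treat the details as approximate):

LaunchParallelWorkers(pcxt);
node->nworkers_launched = pcxt->nworkers_launched;
if (pcxt->nworkers_launched > 0)
{
    ExecParallelCreateReaders(node->pei);
    node->nreaders = pcxt->nworkers_launched;
}

/*
 * With parallel_leader_participation=off this is false whenever
 * nreaders > 0 -- so if the "launched" workers never actually start,
 * nobody produces tuples and the Gather waits forever.
 */
node->need_to_scan_locally = (node->nreaders == 0) ||
    parallel_leader_participation;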

Parallel CREATE INDEX does go a bit further. It assumes that
nworkers_launched *exactly* indicates the number of workers that
successfully underwent parallel initialization, and therefore can be
expected to show up.

Is there actually a meaningful difference between the way
nworkers_launched is depended upon in each case, though?

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-20 Thread Amit Kapila
On Sun, Jan 21, 2018 at 1:39 AM, Peter Geoghegan  wrote:
> On Fri, Jan 19, 2018 at 9:29 PM, Amit Kapila  wrote:
>
> Actually, though it doesn't really look like it from the way things
> are structured within nbtsort.c, I don't need to wait for workers to
> start up (call the WaitForParallelWorkerToAttach() function you
> sketched) before doing any real work within the leader. The leader can
> participate as a worker, and only do this check afterwards. That will
> work because the leader Tuplesortstate has yet to do any real work.
> Nothing stops me from adding a new function to tuplesort, for the
> leader, that lets the leader say: "New plan -- you should now expect
> this many participants" (leader takes this reliable number from
> eventual call to WaitForParallelWorkerToAttach()).
>
> I admit that I had no idea that there is this issue with
> nworkers_launched until very recently. But then, that field has
> absolutely no comments.
>

It would have been better if there were some comments beside that
field, but I think it has been covered at another place in the code.
See comments in LaunchParallelWorkers().

/*
* Start workers.
*
* The caller must be able to tolerate ending up with fewer workers than
* expected, so there is no need to throw an error here if registration
* fails.  It wouldn't help much anyway, because registering the worker in
* no way guarantees that it will start up and initialize successfully.
*/

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-20 Thread Peter Geoghegan
On Fri, Jan 19, 2018 at 9:29 PM, Amit Kapila  wrote:
> I think I can see why this patch needs that.  Is it mainly for the
> work you are doing in _bt_leader_heapscan where you are waiting for
> all the workers to be finished?

Yes, though it's also needed for the leader tuplesort. It needs to be
able to discover worker runs by looking for temp files named 0 through
to $NWORKERS - 1.

The problem with seeing who shows up after a period of time, and
having the leader arbitrarily determine that to be the total number of
participants (while blocking further participants from joining) is
that I don't know *how long to wait*. This would probably work okay
for parallel CREATE INDEX, when the leader participates as a worker,
because you can check only when the leader is finished acting as a
worker. It stands to reason that that's enough time for worker
processes to at least show up, and be seen to show up. We can use the
duration of the leader's participation as a worker as a natural way to
decide how long to wait.

But what when the leader doesn't participate as a worker, for whatever
reason? Other uses for parallel tuplesort might typically have much
less leader participation as compared to parallel CREATE INDEX. In
short, ISTM that seeing who shows up is a bad strategy for parallel
tuplesort.

> I think till now we haven't had any such requirement, but if it is a must
> for this patch, then I don't think it is tough to do.  We need to
> write an API WaitForParallelWorkerToAttach() and then call it for each
> launched worker, or maybe WaitForParallelWorkersToAttach(), which can
> wait for all workers to attach and report how many have successfully
> attached.  It will have the functionality of
> WaitForBackgroundWorkerStartup, and additionally it needs to check whether
> the worker is attached to the error queue.  We already have a similar
> API (WaitForReplicationWorkerAttach) for logical replication workers
> as well.  Note that it might have a slight impact on performance
> because with this you need to wait for the workers to start up before
> doing any actual work, but I don't think it should be noticeable for
> large operations, especially operations like parallel CREATE INDEX.

Actually, though it doesn't really look like it from the way things
are structured within nbtsort.c, I don't need to wait for workers to
start up (call the WaitForParallelWorkerToAttach() function you
sketched) before doing any real work within the leader. The leader can
participate as a worker, and only do this check afterwards. That will
work because the leader Tuplesortstate has yet to do any real work.
Nothing stops me from adding a new function to tuplesort, for the
leader, that lets the leader say: "New plan -- you should now expect
this many participants" (leader takes this reliable number from
eventual call to WaitForParallelWorkerToAttach()).

I admit that I had no idea that there is this issue with
nworkers_launched until very recently. But then, that field has
absolutely no comments.

--
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-19 Thread Amit Kapila
On Sat, Jan 20, 2018 at 7:03 AM, Peter Geoghegan  wrote:
> On Thu, Jan 18, 2018 at 5:53 PM, Peter Geoghegan  wrote:
>
> Attached patch details:
>
> * The patch synchronizes processes using the approach just described.
> Note that this allowed me to remove several #include statements within
> tuplesort.c.
>
> * The patch uses only a single condition variable for a single wait
> within nbtsort.c, for the leader. No barriers are used at all (and, as
> I said, tuplesort.c doesn't use condition variables anymore). Since
> things are now very simple, I can't imagine anyone still arguing for
> the use of barriers.
>
> Note that I'm still relying on nworkers_launched as the single source
> of truth on the number of participants that can be expected to
> eventually show up (even if they end up doing zero real work). This
> should be fine, because I think that it will end up being formally
> guaranteed to be reliable by the work currently underway from Robert
> and Amit. But even if I'm wrong about things going that way, and it
> turns out that the leader needs to decide how many putative launched
> workers don't "get lost" due to fork() failure (the approach which
> Amit apparently advocates), then there *still* isn't much that needs
> to change.
>
> Ultimately, the leader needs to have the exact number of workers that
> participated, because that's fundamental to the tuplesort approach to
> parallel sort.
>

I think I can see why this patch needs that.  Is it mainly for the
work you are doing in _bt_leader_heapscan where you are waiting for
all the workers to be finished?

> If necessary, the leader can just figure it out in
> whatever way it likes at one central point within nbtsort.c, before
> the leader calls its main spool's tuplesort_begin_index_btree() --
> that can happen fairly late in the process. Actually doing that (and
> not just using nworkers_launched) seems questionable to me, because it
> would be almost the first thing that the leader would do after
> starting parallel mode -- why not just have the parallel
> infrastructure do it for us, and for everyone else?
>

I think we don't have any such requirement till now, but if it is a
must for this patch, then I don't think it is tough to do that.  We
need to write an API WaitForParallelWorkerToAttach() and then call it
for each launched worker, or maybe WaitForParallelWorkersToAttach(),
which can wait for all workers to attach and report how many have
successfully attached.  It will have the functionality of
WaitForBackgroundWorkerStartup, and additionally it needs to check
whether the worker is attached to the error queue.  We already have a
similar API (WaitForReplicationWorkerAttach) for logical replication
workers.  Note that it might have a slight impact on performance,
because with this you need to wait for the workers to start up before
doing any actual work, but I don't think it should be noticeable for
large operations, especially operations like parallel create index.
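
Here is a rough C sketch of what such an API might look like, built
from the existing pieces mentioned above (WaitForBackgroundWorkerStartup
and the per-worker error queue).  It is only an illustration under those
assumptions: it samples the queue once instead of sleeping on the
leader's latch and retrying, as a real implementation would have to.

#include "postgres.h"
#include "access/parallel.h"
#include "postmaster/bgworker.h"
#include "storage/shm_mq.h"

/* Sketch: returns how many launched workers attached successfully. */
static int
WaitForParallelWorkersToAttach_sketch(ParallelContext *pcxt)
{
    int         nattached = 0;
    int         i;

    for (i = 0; i < pcxt->nworkers_launched; i++)
    {
        pid_t       pid;

        /* Blocks until the postmaster starts (or fails to start) worker i. */
        if (WaitForBackgroundWorkerStartup(pcxt->worker[i].bgwhandle,
                                           &pid) != BGWH_STARTED)
            continue;           /* e.g. fork() failure: worker never ran */

        /*
         * Being started is not enough: the worker must also attach to
         * its error queue so that any error it raises gets reported.
         * The queue's sender is set once the worker attaches.
         */
        if (shm_mq_get_sender(shm_mq_get_queue(pcxt->worker[i].error_mqh))
            != NULL)
            nattached++;
    }

    return nattached;
}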


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-19 Thread Peter Geoghegan
On Fri, Jan 19, 2018 at 8:44 PM, Amit Kapila  wrote:
> Right, but I think using parallel_leader_participation, you can do it
> reliably and probably write some regression tests which can complete
> in a predictable time.

Do what reliably? Guarantee that the leader will not participate as a
worker, but that workers will be used? If so, yes, you can get that.

The only issue is that you may not be able to launch parallel workers
due to hitting a limit like max_parallel_workers, in which case you'll
get a serial index build despite everything. Nothing we can do about
that, though.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-19 Thread Amit Kapila
On Sat, Jan 20, 2018 at 8:33 AM, Peter Geoghegan  wrote:
> On Fri, Jan 19, 2018 at 6:52 PM, Amit Kapila  wrote:
>
>> BTW, is there any other way for "parallel create index" to force the
>> work to be done by workers?  I am insisting on having something which
>> can test the code path in workers because we have found quite a few
>> bugs using that idea.
>
> I agree that this is essential (more so than supporting
> parallel_leader_participation). You can use the parallel_workers table
> storage parameter for this. When the storage param has been set, we
> don't care about the amount of memory available to each worker. You
> can stress-test the implementation as needed. (The storage param does
> care about max_parallel_maintenance_workers, but you can set that as
> high as you like.)
>

Right, but I think using parallel_leader_participation, you can do it
reliably and probably write some regression tests which can complete
in a predictable time.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-19 Thread Amit Kapila
On Sat, Jan 20, 2018 at 2:57 AM, Thomas Munro
 wrote:
> On Sat, Jan 20, 2018 at 6:32 AM, Robert Haas  wrote:
>> On Fri, Jan 19, 2018 at 12:16 PM, Peter Geoghegan  wrote:
>>> Clarity on what I should do about parallel_leader_participation in the
>>> next revision would be useful at this point. You seem to either want
>>> me to remove it from consideration entirely, or to remove the code
>>> that specifically disallows a "degenerate parallel CREATE INDEX". I
>>> need a final answer on that.
>>
>> Right.  I do think that we should do one of those things, and I lean
>> towards removing it entirely, but I'm not entirely sure.  Rather
>> than making an executive decision immediately, I'd like to wait a few
>> days to give others a chance to comment. I am hoping that we might get
>> some other opinions, especially from Thomas who implemented
>> parallel_leader_participation, or maybe Amit who has been reviewing
>> recently, or anyone else who is paying attention to this thread.
>
> Well, I see parallel_leader_participation as having these reasons to exist:
>
> 1.  Gather could in rare circumstances not run the plan in the leader.
> This can hide bugs.  It's good to be able to force that behaviour for
> testing.
>

Or the reverse is also possible, which means the workers won't get a
chance to run the plan, in which case we can use
parallel_leader_participation = off to test worker behavior.  As said
before, I see only that as the reason to keep
parallel_leader_participation in this patch.  If we decide to do it
that way, then I think we should remove the code that specifically
disallows a "degenerate parallel CREATE INDEX", as that seems to be
confusing.  If we go this way, then I think we should use the wording
suggested by Robert in one of his emails [1] to describe the usage of
parallel_leader_participation.

BTW, is there any other way for "parallel create index" to force the
work to be done by workers?  I am insisting on having something which
can test the code path in workers because we have found quite a few
bugs using that idea.

[1] - 
https://www.postgresql.org/message-id/CA%2BTgmoYN-YQU9JsGQcqFLovZ-C%2BXgp1_xhJQad%3DcunGG-_p5gg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-19 Thread Peter Geoghegan
On Fri, Jan 19, 2018 at 4:52 AM, Robert Haas  wrote:
>> Other than that, looks good to me. It's a simple patch with a clear purpose.
>
> Committed.

Cool.

Clarity on what I should do about parallel_leader_participation in the
next revision would be useful at this point. You seem to either want
me to remove it from consideration entirely, or to remove the code
that specifically disallows a "degenerate parallel CREATE INDEX". I
need a final answer on that.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-19 Thread Robert Haas
On Thu, Jan 18, 2018 at 2:05 PM, Peter Geoghegan  wrote:
> Review of your patch:
>
> * SerializedReindexState could use some comments. At least a one liner
> stating its basic purpose.

Added a comment.

> * The "System index reindexing support" comment block could do with a
> passing acknowledgement of the fact that this is serialized for
> parallel workers.

Done.

> * Maybe the "Serialize reindex state" comment within
> InitializeParallelDSM() should instead say something like "Serialize
> indexes-pending-reindex state".

That would require corresponding changes in a bunch of other places,
possibly including the function names.  I think it's better to keep
the function names shorter and the comments matching the function
names, so I did not make this change.

> Other than that, looks good to me. It's a simple patch with a clear purpose.

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-18 Thread Peter Geoghegan
On Thu, Jan 18, 2018 at 9:22 AM, Robert Haas  wrote:
> I just went back to the thread on "parallel.c oblivion of
> worker-startup failures" and refreshed my memory about what's going on
> over there.  What's going on over there is (1) currently,
> nworkers_launched can be over-reported in a scenario that doesn't
> error out or crash and (2) I'm proposing to tighten things up so that
> this is no longer the case.

I think that we need to be able to rely on nworkers_launched to not
over-report the number of workers launched. To be fair to Amit, I
haven't actually gone off and studied the problem myself, so it's not
fair to dismiss his point of view. It nevertheless seems to me that it
makes life an awful lot easier to be able to rely on
nworkers_launched.

> So, thinking about this, I think that my proposal to use dynamic
> barriers here seems like it will work regardless of who wins that
> argument.  Your proposal to use static barriers and decrement the
> party size based on the number of participants which fail to start
> will work if I win that argument, but will not work if Amit wins that
> argument.

Sorry, but I've changed my mind. I don't think barriers owned by
tuplesort.c will work for us (though I think we will still need a
synchronization primitive within nbtsort.c). The design that Robert
sketched for using barriers seemed fine at first. But then I realized:
what about the case where you have 2 spools?

I now understand why Thomas thought that I'd end up using static
barriers, because I now see that dynamic barriers have problems of
their own if used by tuplesort.c, even with the trick of only having
participants actually participate on the condition that they show up
before the party is over (before there is no tuples left for the
worker to consume). The idea of the leader using nworkers_launched as
the assumed-launched number of workers is pretty much baked into my
patch, because my patch makes tuplesorts composable (e.g. nbtsort.c
uses two tuplesorts when there is a unique index build/2 spools).

Do individual workers need to be prepared to back out of the main
spool's sort, but not the spool2 sort (for unique index builds), or
vice-versa? Clearly that's untenable, because they're going to need to
have both as long as they're participating in a parallel CREATE INDEX
(of a unique index) -- IndexBuildHeapScan() expects both at the same
time, but there is a race condition when launching workers with 2
spools. So does nbtsort.c need to own the barrier instead? If it does,
and if that barrier subsumes the responsibilities of tuplesort.c's
condition variables, then I don't see how that can avoid causing a
mess due to confusion about phases across tuplesorts/spools.

nbtsort.c *will* need some synchronization primitive, actually, (I'm
thinking of a condition variable), but only because of the fact that
nbtsort.c happens to want to aggregate statistics about the sort at
the end (for pg_index) -- this doesn't seem like tuplesort's problem
at all. In general, it's very natural to just call
tuplesort_leader_wait(), and have all the relevant details
encapsulated within tuplesort.c. We could make tuplesort_leader_wait()
totally optional, and just use the barrier within nbtsort.c for the
wait (more on that later).
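
For illustration, a tuplesort_leader_wait()-style wait built on a
condition variable might look like the following sketch; the shared
struct and its field names are assumptions, not the patch's code.

#include "postgres.h"
#include "storage/condition_variable.h"
#include "storage/spin.h"

/* Stand-in for state in the shared memory segment. */
typedef struct BTSharedSketch
{
    slock_t     mutex;              /* SpinLockInit() at setup time */
    int         workersFinished;    /* protected by mutex */
    ConditionVariable workersDoneCv;    /* ConditionVariableInit() at setup */
} BTSharedSketch;

/* Worker side: check in once this participant's sort work is done. */
static void
worker_report_finished(BTSharedSketch *shared)
{
    SpinLockAcquire(&shared->mutex);
    shared->workersFinished++;
    SpinLockRelease(&shared->mutex);
    ConditionVariableSignal(&shared->workersDoneCv);
}

/* Leader side: sleep until all participants have checked in. */
static void
leader_wait(BTSharedSketch *shared, int nparticipants)
{
    for (;;)
    {
        int         finished;

        SpinLockAcquire(&shared->mutex);
        finished = shared->workersFinished;
        SpinLockRelease(&shared->mutex);

        if (finished >= nparticipants)
            break;
        ConditionVariableSleep(&shared->workersDoneCv, 0 /* wait event */);
    }
    ConditionVariableCancelSleep();
}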

> In particular, for parallel query, there is absolutely zero guarantee
> that every worker reaches every plan node.  For a parallel utility
> command, things seem a little better: we can assume  that workers are
> started only for one particular purpose.  But even that might not be
> true in the future.

I expect workers that are reported launched to show up eventually, or
report failure. They don't strictly have to do any work beyond just
showing up (finding no tuples, reaching tuplesort_performsort(), then
finally reaching tuplesort_end()). The spool2 issue I describe above
shows why this is. They own the state (tuplesort tuples) that they
consume, and may possibly have 2 or more tuplesorts. If they cannot do
the bare minimum of checking in with us, then we're in big trouble,
because that's indistinguishable from their having actually sorted
some tuples without our knowing that the leader ultimately gets to
consume them.
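
That bare minimum can be sketched in a few lines of C -- the scenario is
hypothetical, though the tuplesort calls are real:

#include "postgres.h"
#include "utils/tuplesort.h"

/*
 * A worker that finds no tuples still runs its sort(s) to completion
 * and shuts them down, so the leader's accounting stays correct.
 */
static void
minimal_worker_participation(Tuplesortstate *sortstate)
{
    /* Nothing was put into the sort: it is trivially "sorted". */
    tuplesort_performsort(sortstate);

    /* Shut down, letting the leader account for this participant. */
    tuplesort_end(sortstate);
}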

It wouldn't be impossible to use barriers for everything. That just
seems to be incompatible with tuplesorts being composable. Long ago,
nbtsort.c actually did the sorting, too. If that was still true, then
it would be rather a lot more like parallel hashjoin, I think. You
could then just have one barrier for one state machine (with one or
two spools). It seems clear that we should avoid teaching tuplesort.c
about nbtsort.c.

> But, hang on a minute.  Why do the workers need to wait for the leader
> anyway?  Can't they just exit once they're done sorting?  I think the
> original reason why this ended up in the patch is that we needed the
> leader to assume ownership of the tapes to avoid having the tapes get
> blown away when the worker exits.  But, 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-18 Thread Peter Geoghegan
On Thu, Jan 18, 2018 at 10:17 AM, Robert Haas  wrote:
>> Should I go ahead and restore builds on catalogs, and remove those
>> comments, on the assumption that your patch will be committed before
>> mine? Obviously parallel index builds on catalogs don't matter. OTOH,
>> why not? Perhaps it's like the debate around HOT that took place over
>> 10 years ago, where Tom insisted that HOT work with catalogs on
>> general principle.
>
> Yes, I think so.  If you (or someone else) can review that patch, I'll
> go ahead and commit it, and then your patch can treat it as a solved
> problem.  I'm not really worried about the cycles; the amount of
> effort required here is surely very small compared to all of the other
> things that have to be done when starting a parallel worker.

Review of your patch:

* SerializedReindexState could use some comments. At least a one liner
stating its basic purpose.

* The "System index reindexing support" comment block could do with a
passing acknowledgement of the fact that this is serialized for
parallel workers.

* Maybe the "Serialize reindex state" comment within
InitializeParallelDSM() should instead say something like "Serialize
indexes-pending-reindex state".

Other than that, looks good to me. It's a simple patch with a clear purpose.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-18 Thread Peter Geoghegan
On Thu, Jan 18, 2018 at 10:27 AM, Robert Haas  wrote:
> On Thu, Jan 18, 2018 at 1:14 PM, Peter Geoghegan  wrote:
>> That seems pretty far fetched.
>
> I don't think it is, and there are plenty of other examples.  All you
> need is a query plan that involves significant CPU work both below the
> Gather node and above the Gather node.  It's not difficult to find
> plans like that; there are TPC-H queries that generate plans like
> that.

You need to have a very selective qual in the worker that eliminates
most input (keeps the worker busy), and yet manages to keep the leader
busy rather than waiting on input from the Gather node.

>> But even if it wasn't, my position
>> would not change. This could happen only because the planner
>> determined that it was the cheapest plan when
>> parallel_leader_participation happened to be off. But clearly a
>> "degenerate parallel CREATE INDEX" will never be faster than a serial
>> CREATE INDEX, and there is a simple way to always avoid one. So why
>> not do so?
>
> That's an excellent argument for making parallel CREATE INDEX ignore
> parallel_leader_participation entirely.

I'm done making arguments about parallel_leader_participation. Tell me
what you want, and I'll do it.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-18 Thread Peter Geoghegan
On Thu, Jan 18, 2018 at 6:14 AM, Amit Kapila  wrote:
> I see your point.  OTOH, I think we should have something for testing
> purposes, as that helps in catching bugs and makes it easy to write
> tests that cover the worker part of the code.

This is about the question of whether or not we want to allow
parallel_leader_participation to prevent or allow a parallel CREATE
INDEX that has 1 parallel worker that does all the sorting, with the
leader simply consuming its output without doing any merging (a
"degenerate parallel CREATE INDEX"). It is perhaps only secondarily
about the question of ripping out parallel_leader_participation
entirely.

> Can you please elaborate what part of optimizer are you talking about
> where without leader participation partial path will always lose to a
> serial sequential scan path?

See my remarks to Robert just now.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-18 Thread Robert Haas
On Thu, Jan 18, 2018 at 1:14 PM, Peter Geoghegan  wrote:
> That seems pretty far fetched.

I don't think it is, and there are plenty of other examples.  All you
need is a query plan that involves significant CPU work both below the
Gather node and above the Gather node.  It's not difficult to find
plans like that; there are TPC-H queries that generate plans like
that.

> But even if it wasn't, my position
> would not change. This could happen only because the planner
> determined that it was the cheapest plan when
> parallel_leader_participation happened to be off. But clearly a
> "degenerate parallel CREATE INDEX" will never be faster than a serial
> CREATE INDEX, and there is a simple way to always avoid one. So why
> not do so?

That's an excellent argument for making parallel CREATE INDEX ignore
parallel_leader_participation entirely.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-18 Thread Robert Haas
On Wed, Jan 17, 2018 at 10:22 PM, Peter Geoghegan  wrote:
> If you think it's worth the cycles, then I have no objection. I will
> point out that this means that everything that I say about
> ReindexIsProcessingIndex() no longer applies, because the relevant
> state will now be propagated. It doesn't need to be mentioned at all,
> and I don't even need to forbid builds on catalogs.
>
> Should I go ahead and restore builds on catalogs, and remove those
> comments, on the assumption that your patch will be committed before
> mine? Obviously parallel index builds on catalogs don't matter. OTOH,
> why not? Perhaps it's like the debate around HOT that took place over
> 10 years ago, where Tom insisted that HOT work with catalogs on
> general principle.

Yes, I think so.  If you (or someone else) can review that patch, I'll
go ahead and commit it, and then your patch can treat it as a solved
problem.  I'm not really worried about the cycles; the amount of
effort required here is surely very small compared to all of the other
things that have to be done when starting a parallel worker.

I'm not as dogmatic as Tom is about the idea that everything must
support system catalogs or it's not worth doing, but I do think it's better
if it can be done that way with reasonable effort.  When each new
feature comes with a set of unsupported corner cases, it becomes hard
for users to understand what will and will not actually work.  Now,
really big features like parallel query or partitioning or logical
replication generally do need to exclude some things in v1 or you can
never finish the project, but in this case plugging the gap seems
quite feasible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-18 Thread Peter Geoghegan
On Thu, Jan 18, 2018 at 6:21 AM, Robert Haas  wrote:
> Amit's reply to this part drew my attention to it.  I think this is
> entirely false.  Consider an aggregate that doesn't support partial
> aggregation, and a plan that looks like this:
>
> Aggregate
> -> Gather
>   -> Parallel Seq Scan
> Filter: something fairly selective
>
> It is quite possible for this to be superior to a non-parallel plan
> even with only 1 worker and no parallel leader participation.  The
> worker can evaluate the filter qual, and the leader can evaluate the
> aggregate.  If the CPU costs of doing those computations are high
> enough to outweigh the costs of shuffling tuples between backends, we
> win.

That seems pretty far fetched. But even if it wasn't, my position
would not change. This could happen only because the planner
determined that it was the cheapest plan when
parallel_leader_participation happened to be off. But clearly a
"degenerate parallel CREATE INDEX" will never be faster than a serial
CREATE INDEX, and there is a simple way to always avoid one. So why
not do so?

I give up. I'll go ahead and make parallel_leader_participation=off
allow a degenerate parallel CREATE INDEX in the next version. I think
that it will make parallel_leader_participation less useful, with no
upside, but there doesn't seem to be much more that I can do about
that.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-18 Thread Robert Haas
On Thu, Jan 18, 2018 at 5:49 AM, Thomas Munro
 wrote:
> I'm mostly away from my computer this week -- sorry about that,

Yeah, seriously.  Since when is it OK for hackers to ever be away from
their computers?  :-)

> The idea I mentioned would only work if nworkers_launched is never
> over-reported in a scenario that doesn't error out or crash, and never
> under-reported in any scenario.  Otherwise static barriers may be even
> less useful than I thought.

I just went back to the thread on "parallel.c oblivion of
worker-startup failures" and refreshed my memory about what's going on
over there.  What's going on over there is (1) currently,
nworkers_launched can be over-reported in a scenario that doesn't
error out or crash and (2) I'm proposing to tighten things up so that
this is no longer the case.  Amit proposed making it the
responsibility of code that uses parallel.c to cope with
nworkers_launched being larger than the number that actually launched,
and my counter-proposal was to make it reliably ERROR when they don't
all launch.

So, thinking about this, I think that my proposal to use dynamic
barriers here seems like it will work regardless of who wins that
argument.  Your proposal to use static barriers and decrement the
party size based on the number of participants which fail to start
will work if I win that argument, but will not work if Amit wins that
argument.

It seems to me in general that dynamic barriers are to be preferred in
almost every circumstance, because static barriers require a longer
chain of assumptions.  We can't assume that the number of guests we
invite to the party will match the number that actually show up, so,
in the case of a static barrier, we have to make sure to adjust the
party size if some of the guests end up having to stay home with a
sick kid or their car breaks down or if they decide to go to the
neighbor's party instead.  Absentee guests are not intrinsically a
problem, but we have to make sure that we account for them in a
completely water-tight fashion.  On the other hand, with a dynamic
barrier, we don't need to worry about the guests that don't show up;
we only need to worry about the guests that DO show up.  As they come
in the door, we count them; as they leave, we count them again.  When
the numbers are equal, the party's over.  That seems more robust.

In particular, for parallel query, there is absolutely zero guarantee
that every worker reaches every plan node.  For a parallel utility
command, things seem a little better: we can assume  that workers are
started only for one particular purpose.  But even that might not be
true in the future.  For example, imagine a parallel CREATE INDEX on a
partitioned table that cascades to all children.  One can easily
imagine wanting to use the same workers for the whole operation and
spread them out across the pool of tasks much as Parallel Append does.
There's a good chance this will be faster than doing each index build
in turn with maximum parallelism.  And then the static barrier thing
goes right out the window again, because the number of participants is
determined dynamically.

I really struggle to think of any case where a static barrier is
better.   I mean, suppose we have an existing party and then decide to
hold a baking contest.  We'll use a barrier to separate the baking
phase from the judging phase.  One might think that, since the number
of participants is already decided, someone could initialize the
barrier with that number rather than making everyone attach.  But it
doesn't really work, because there's a race: while one process is
creating the barrier with participants = 10, the doctor's beeper goes
off and he leaves the party.  Now there could be some situation in
which we are absolutely certain that we know how many participants
we've got and it won't change, but I suspect that in almost every
scenario deciding to use a static barrier is going to be immediately
followed by a lot of angst about how we can be sure that the number of
participants will always be correct.
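
To make the pattern concrete, here is a sketch of the dynamic-barrier
flow using the existing Barrier API; the phase constant and the work
function are invented for illustration, and this is not the patch's
code.

#include "postgres.h"
#include "storage/barrier.h"

#define SKETCH_PHASE_DONE 1         /* hypothetical final phase */

extern void do_scan_and_sort(void); /* hypothetical work function */

/*
 * The leader creates the barrier empty -- BarrierInit(barrier, 0) --
 * and each participant counts itself in and out, so the party size
 * never has to be fixed in advance.
 */
static void
participant_main(Barrier *build_barrier)
{
    int         phase = BarrierAttach(build_barrier);

    if (phase >= SKETCH_PHASE_DONE)
    {
        /* Arrived after the party ended; leave without delaying anyone. */
        BarrierDetach(build_barrier);
        return;
    }

    do_scan_and_sort();

    /* Count ourselves out; the last arrival advances the phase. */
    BarrierArriveAndDetach(build_barrier);
}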

>> Okay. I'll work on adopting dynamic barriers in the way you described.
>> I just wanted to make sure that we're all on the same page about what
>> that looks like.
>
> Looking at Robert's sketch, a few thoughts: (1) it's not OK to attach
> and then just exit, you'll need to detach from the barrier both in the
> case where the worker exits early because the phase is too high and
> the case where you attach in time to help and run to completion;

In the first case, I guess this is because otherwise the other
participants will wait for us even though we're not really there any
more.  In the second case, I'm not sure why it matters whether we
detach.  If we've reached the highest possible phase number, nobody's
going to wait any more, so who cares?  (I mean, apart from tidiness.)

> (2) maybe workers could use BarrierArriveAndDetach() at the end (the
> leader needs to use BarrierArriveAndWait(), 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-18 Thread Amit Kapila
On Thu, Jan 18, 2018 at 4:19 PM, Thomas Munro
 wrote:
> Hi,
>
> I'm mostly away from my computer this week -- sorry about that, but
> here are a couple of quick answers to questions directed at me:
>
> On Thu, Jan 18, 2018 at 4:22 PM, Peter Geoghegan  wrote:
>> On Wed, Jan 17, 2018 at 10:40 AM, Robert Haas  wrote:
>
>>> It's true that the leader will know the value of nworkers_launched,
>>> but as the comment in LaunchParallelWorkers() says: "The caller must
>>> be able to tolerate ending up with fewer workers than expected, so
>>> there is no need to throw an error here if registration fails.  It
>>> wouldn't help much anyway, because registering the worker in no way
>>> guarantees that it will start up and initialize successfully."  So it
>>> seems to me that a much better plan than having the leader try to
>>> figure out how many workers failed to launch would be to just keep a
>>> count of how many workers did in fact launch.
>
> (If nworkers_launched can be silently over-reported, then does
> parallel_leader_participation = off have a bug?
>

Yes, and it is being discussed in CF entry [1].

[1] - https://commitfest.postgresql.org/16/1341/

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-18 Thread Thomas Munro
Hi,

I'm mostly away from my computer this week -- sorry about that, but
here are a couple of quick answers to questions directed at me:

On Thu, Jan 18, 2018 at 4:22 PM, Peter Geoghegan  wrote:
> On Wed, Jan 17, 2018 at 10:40 AM, Robert Haas  wrote:
>>> While it certainly did occur to me that that was kind of weird, and I
>>> struggled with it on my own for a little while, I ultimately agreed
>>> with Thomas that it added something to have ltsConcatWorkerTapes()
>>> call some buffile function in every iteration of its loop.
>>> (BufFileView() + BufFileViewAppend() are code that Thomas actually
>>> wrote, though I added the asserts and comments myself.)
>>
>> Hmm, well, if Thomas contributed code to this patch, then he needs to
>> be listed as an author.  I went searching for an email on this thread
>> (or any other) where he posted code for this, thinking that there
>> might be some discussion explaining the motivation, but I didn't find
>> any.  I'm still in favor of erasing this distinction.
>
> I cleared this with Thomas recently, on this very thread, and got a +1
> from him on not listing him as an author. Still, I have no problem
> crediting Thomas as an author instead of a reviewer, even though
> you're now asking me to remove what little code he actually authored.
> The distinction between secondary author and reviewer is often
> blurred, anyway.

The confusion comes about because I gave some small code fragments to
Rushabh for the BufFileView stuff off-list, when suggesting ideas for
how to integrate Peter's patch with some ancestor of my SharedFileSet
patch.  It was just a sketch and whether or not any traces remain in
the final commit, please credit me as a reviewer.  I need to review
more patches!  /me ducks

No objections from me if you hate the "view" idea or implementation
and think it's better to make a destructive append-BufFile-to-BufFile
operation instead.

On Thu, Jan 18, 2018 at 4:28 PM, Peter Geoghegan  wrote:
> On Wed, Jan 17, 2018 at 6:20 PM, Robert Haas  wrote:
>> I had forgotten about the previous discussion.  The sketch in my
>> previous email supposed that we would use dynamic barriers since the
>> whole point, after all, is to handle the fact that we don't know how
>> many participants will really show up.  Thomas's idea seems to be that
>> the leader will initialize the barrier based on the anticipated number
>> of participants and then tell it to forget about the participants that
>> don't materialize.  Of course, that would require that the leader
>> somehow figure out how many participants didn't show up so that it can
>> deduct then from the counter in the barrier.  And how is it going to
>> do that?
>
> I don't know; Thomas?

The idea I mentioned would only work if nworkers_launched is never
over-reported in a scenario that doesn't error out or crash, and never
under-reported in any scenario.  Otherwise static barriers may be even
less useful than I thought.

>> It's true that the leader will know the value of nworkers_launched,
>> but as the comment in LaunchParallelWorkers() says: "The caller must
>> be able to tolerate ending up with fewer workers than expected, so
>> there is no need to throw an error here if registration fails.  It
>> wouldn't help much anyway, because registering the worker in no way
>> guarantees that it will start up and initialize successfully."  So it
>> seems to me that a much better plan than having the leader try to
>> figure out how many workers failed to launch would be to just keep a
>> count of how many workers did in fact launch.

(If nworkers_launched can be silently over-reported, then does
parallel_leader_participation = off have a bug?  If no workers really
launched and reached the main executor loop but nworkers_launched > 0,
then no one is running the plan.)

>> So my position (at least until Thomas or Andres shows up and tells me
>> why I'm wrong) is that you can use the Barrier API just as it is
>> without any yak-shaving, just by following the sketch I set out
>> before.  The additional API I proposed in that sketch isn't really
>> required, although it might be more efficient.  But it doesn't really
>> matter: if that comes along later, it will be trivial to adjust the
>> code to take advantage of it.

Yeah, the dynamic Barrier API was intended for things like this.  I
was only trying to provide a simpler-to-use alternative that I thought
might work for this particular case (but not executor nodes, which
have another source of uncertainty about party size).  It sounds like
it's not actually workable though, and the dynamic API may be the only
way.  So the patch would have to deal with explicit phases.

> Okay. I'll work on adopting dynamic barriers in the way you described.
> I just wanted to make sure that we're all on the same page about what
> that looks like.

Looking at Robert's sketch, a few thoughts: (1) it's not OK to attach
and then just exit, 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-17 Thread Peter Geoghegan
On Wed, Jan 17, 2018 at 6:20 PM, Robert Haas  wrote:
> I had forgotten about the previous discussion.  The sketch in my
> previous email supposed that we would use dynamic barriers since the
> whole point, after all, is to handle the fact that we don't know how
> many participants will really show up.  Thomas's idea seems to be that
> the leader will initialize the barrier based on the anticipated number
> of participants and then tell it to forget about the participants that
> don't materialize.  Of course, that would require that the leader
> somehow figure out how many participants didn't show up so that it can
> deduct then from the counter in the barrier.  And how is it going to
> do that?

I don't know; Thomas?

> It's true that the leader will know the value of nworkers_launched,
> but as the comment in LaunchParallelWorkers() says: "The caller must
> be able to tolerate ending up with fewer workers than expected, so
> there is no need to throw an error here if registration fails.  It
> wouldn't help much anyway, because registering the worker in no way
> guarantees that it will start up and initialize successfully."  So it
> seems to me that a much better plan than having the leader try to
> figure out how many workers failed to launch would be to just keep a
> count of how many workers did in fact launch.

> So my position (at least until Thomas or Andres shows up and tells me
> why I'm wrong) is that you can use the Barrier API just as it is
> without any yak-shaving, just by following the sketch I set out
> before.  The additional API I proposed in that sketch isn't really
> required, although it might be more efficient.  But it doesn't really
> matter: if that comes along later, it will be trivial to adjust the
> code to take advantage of it.

Okay. I'll work on adopting dynamic barriers in the way you described.
I just wanted to make sure that we're all on the same page about what
that looks like.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-17 Thread Peter Geoghegan
On Wed, Jan 17, 2018 at 10:40 AM, Robert Haas  wrote:
>> While it certainly did occur to me that that was kind of weird, and I
>> struggled with it on my own for a little while, I ultimately agreed
>> with Thomas that it added something to have ltsConcatWorkerTapes()
>> call some buffile function in every iteration of its loop.
>> (BufFileView() + BufFileViewAppend() are code that Thomas actually
>> wrote, though I added the asserts and comments myself.)
>
> Hmm, well, if Thomas contributed code to this patch, then he needs to
> be listed as an author.  I went searching for an email on this thread
> (or any other) where he posted code for this, thinking that there
> might be some discussion explaining the motivation, but I didn't find
> any.  I'm still in favor of erasing this distinction.

I cleared this with Thomas recently, on this very thread, and got a +1
from him on not listing him as an author. Still, I have no problem
crediting Thomas as an author instead of a reviewer, even though
you're now asking me to remove what little code he actually authored.
The distinction between secondary author and reviewer is often
blurred, anyway.

Whether or not Thomas is formally a co-author is ambiguous, and not
something that I feel strongly about (there is no ambiguity about the
fact that he made a very useful contribution, though -- he certainly
did, both directly and indirectly). I already went out of my way to
ensure that Heikki receives a credit for parallel CREATE INDEX in the
v11 release notes, even though I don't think that there is any formal
rule requiring me to do so -- he *didn't* write even one line of code
in this patch. (That was just my take on another ambiguous question
about authorship.)

I suggest that we revisit this when you're just about to commit the
patch. Or you can just add his name -- I like to err on the side of
being inclusive.

>> If you think about this in terms of the interface rather than the
>> implementation, then it may make more sense. The encapsulation adds
>> something which might pay off later, such as when extendBufFile()
>> needs to work with a concatenated set of BufFiles. And even right now,
>> I cannot simply reuse the BufFile without then losing the assert that
>> is currently in BufFileViewAppend() (must not have associated shared
>> fileset assert). So I'd end up asserting less (rather than more) there
>> if BufFileView() was removed.
>
> I would see the encapsulation as having some value if the original
> BufFile remained valid and the new view were also valid.  Then the
> BufFileView operation is a bit like a copy-on-write filesystem
> snapshot: you have the original, which you can do stuff with, and you
> have a copy, which can be manipulated independently, but the copying
> is cheap.  But here the BufFile gobbles up the original so I don't see
> the point.

I'll see what I can do about this.

>> I think I still get the gist of what you're saying, though. I've come
>> up with a new structure that is a noticeable improvement on what I
>> had. Importantly, the new structure let me add a number of
>> parallelism-agnostic asserts that make sure that every ambuild routine
>> that supports parallelism gets the details right.
>
> Yes, that looks better.  I'm slightly dubious that the new Asserts()
> are worthwhile, but I guess it's OK.

Bear in mind that the asserts basically amount to a check that the am
propagated indexInfo->ii_Concurrent correctly within workers. It's nice
to be able to do this in a way that applies equally well to the serial
case.

> But I think it would be better
> to ditch the if-statement and do it like this:
>
> Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
> Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin)
> : !TransactionIdIsValid(OldestXmin));
> Assert(snapshot == SnapshotAny || !anyvisible);
>
> Also, I think you've got a little more than you need in terms of
> comments.  I would keep the comments for the serial case and parallel
> case and drop the earlier one that basically says the same thing:

Okay.

>> (ReindexIsProcessingIndex() issue with non-catalog tables)
>
> I agree that it isn't particularly likely, but if somebody found it
> worthwhile to insert guards against those cases, maybe we should
> preserve them instead of abandoning them.  It shouldn't be that hard
> to propagate those values from the leader to the workers.  The main
> difficulty there seems to be that we're creating the parallel context
> in nbtsort.c, while the state that would need to be propagated is
> private to index.c, but there are several ways to solve that problem.
> It looks to me like the most robust approach would be to just make
> that part of what parallel.c naturally does.  Patch for that attached.

If you think it's worth the cycles, then I have no objection. I will
point out that this means that everything that I say about
ReindexIsProcessingIndex() no longer applies, because the relevant
state 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-17 Thread Peter Geoghegan
On Wed, Jan 17, 2018 at 10:27 AM, Robert Haas  wrote:
> On Wed, Jan 17, 2018 at 12:27 PM, Peter Geoghegan  wrote:
>> I think that both problems (the live _bt_parallel_scan_and_sort() bug,
>> as well as the general issue with needing to account for parallel
>> worker fork() failure) are likely solvable by not using
>> tuplesort_leader_wait(), and instead calling
>> WaitForParallelWorkersToFinish(). Which you suggested already.
>
> I'm wondering if this shouldn't instead be handled by using the new
> Barrier facilities.

> While I find the Barrier API slightly confusing -- and I suspect I'm
> not entirely alone -- I don't think that's a good excuse for
> reinventing the wheel.  The problem of needing to wait for every
> process that does A (in this case, read tuples from the scan) to also
> do B (in this case, finish sorting those tuples) is a very general one
> that is deserving of a general solution.  Unless somebody comes up
> with a better plan, Barrier seems to be the way to do that in
> PostgreSQL.
>
> I don't think using WaitForParallelWorkersToFinish() is a good idea.
> That would require workers to hold onto their tuplesorts until after
> losing the ability to send messages to the leader, which doesn't sound
> like a very good plan.  We don't want workers to detach from their
> error queues until the bitter end, lest errors go unreported.

What you say here sounds convincing to me. I actually brought up the
idea of using the barrier abstraction a little over a month ago. I was
discouraged by a complicated sounding issue raised by Thomas [1]. At
the time, I figured that the barrier abstraction was a nice to have,
but not really essential. That idea doesn't hold up under scrutiny. I
need to be able to use barriers.

There seems to be some yak shaving involved in getting the barrier
abstraction to do exactly what is required, as Thomas went into at the
time. How should that prerequisite work be structured? For example,
should a patch be spun off for that part?

I may not be the most qualified person for this job, since Thomas
considered two alternative approaches (to making the static barrier
abstraction forget about never-launched participants) without ever
settling on one of them.

[1] 
https://postgr.es/m/CAEepm=03YnefpCeB=Z67HtQAOEMuhKGyPCY_S1TeH=9a2rr...@mail.gmail.com
-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-17 Thread Robert Haas
On Mon, Jan 15, 2018 at 7:54 PM, Peter Geoghegan  wrote:
>> BufFileView() looks fairly pointless.  It basically creates a copy of
>> the input and, in so doing, destroys the input, which is a lot like
>> returning the input parameter except that it uses more cycles.  It
>> does do a few things.
>
> While it certainly did occur to me that that was kind of weird, and I
> struggled with it on my own for a little while, I ultimately agreed
> with Thomas that it added something to have ltsConcatWorkerTapes()
> call some buffile function in every iteration of its loop.
> (BufFileView() + BufFileViewAppend() are code that Thomas actually
> wrote, though I added the asserts and comments myself.)

Hmm, well, if Thomas contributed code to this patch, then he needs to
be listed as an author.  I went searching for an email on this thread
(or any other) where he posted code for this, thinking that there
might be some discussion explaining the motivation, but I didn't find
any.  I'm still in favor of erasing this distinction.

> If you think about this in terms of the interface rather than the
> implementation, then it may make more sense. The encapsulation adds
> something which might pay off later, such as when extendBufFile()
> needs to work with a concatenated set of BufFiles. And even right now,
> I cannot simply reuse the BufFile without then losing the assert that
> is currently in BufFileViewAppend() (must not have associated shared
> fileset assert). So I'd end up asserting less (rather than more) there
> if BufFileView() was removed.

I would see the encapsulation as having some value if the original
BufFile remained valid and the new view were also valid.  Then the
BufFileView operation is a bit like a copy-on-write filesystem
snapshot: you have the original, which you can do stuff with, and you
have a copy, which can be manipulated independently, but the copying
is cheap.  But here the BufFile gobbles up the original so I don't see
the point.

The  Assert(target->fileset == NULL) that would be lost in
BufFileViewAppend has no value anyway, AFAICS.  There is also
Assert(source->readOnly) given which the presence or absence of the
fileset makes no difference.  And if, as you say, extendBufFile were
eventually made to work here, this Assert would presumably get removed
anyway; I think we'd likely want the additional files to get
associated with the shared file set rather than being locally
temporary files.

> It wastes some cycles to not simply use the BufFile directly, but not
> terribly many in the grand scheme of things. This happens once per
> external sort operation.

I'm not at all concerned about the loss of cycles.  I'm concerned
about making the mechanism more complicated to understand and maintain
for future readers of the code.  When experienced hackers see code
that doesn't seem to accomplish anything, they (or at least I) tend to
assume that there must be a hidden reason for it to be there and spend
time trying to figure out what it is.  If there actually is no hidden
purpose, then that study is a waste of time and we can spare them the
trouble by getting rid of it now.

>> In miscadmin.h, I'd put the prototype for the new GUC next to
>> max_worker_processes, not maintenance_work_mem.
>
> But then I'd really have to put it next to max_worker_processes in
> globals.c, too. That would mean that it would go under "Primary
> determinants of sizes of shared-memory structures" within globals.c,
> which seems wrong to me. What do you think?

OK, that's a fair point.

> I think I still get the gist of what you're saying, though. I've come
> up with a new structure that is a noticeable improvement on what I
> had. Importantly, the new structure let me add a number of
> parallelism-agnostic asserts that make sure that every ambuild routine
> that supports parallelism gets the details right.

Yes, that looks better.  I'm slightly dubious that the new Asserts()
are worthwhile, but I guess it's OK.  But I think it would be better
to ditch the if-statement and do it like this:

Assert(snapshot == SnapshotAny || IsMVCCSnapshot(snapshot));
Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin)
: !TransactionIdIsValid(OldestXmin));
Assert(snapshot == SnapshotAny || !anyvisible);

Also, I think you've got a little more than you need in terms of
comments.  I would keep the comments for the serial case and parallel
case and drop the earlier one that basically says the same thing:

+ * (Note that parallel case never has us register/unregister snapshot, and
+ * provides appropriate snapshot for us.)

> There is a little bit of ambiguity about other cases, though -- that's
> the secondary point I tried to make within that comment block, and the
> part that you took issue with. To put this secondary point another
> way: It's possible that we'd fail to detect it if someone's comparator
> went bananas and decided it was okay to do SQL access (that resulted
> in an index scan of the 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-17 Thread Peter Geoghegan
On Wed, Jan 17, 2018 at 5:47 AM, Amit Kapila  wrote:
> I could still reproduce it. I think the way you have fixed it has a
> race condition.  In _bt_parallel_scan_and_sort(), the value of
> brokenhotchain is set after you signal the leader that the worker is
> done (by incrementing workersFinished). Now, the leader is free to
> decide based on the current shared state which can give the wrong
> value.  Similarly, I think the value of havedead and reltuples can
> also be wrong.

> You neither seem to have fixed nor responded to the second problem
> mentioned in my email upthread [1].  To reiterate, the problem is that
> we can't assume that the workers we have launched will always start
> and finish. It is possible that postmaster fails to start the worker
> due to fork failure. In such conditions, tuplesort_leader_wait will
> hang indefinitely because it will wait for the workersFinished count
> to become equal to launched workers (+1, if leader participates) which
> will never happen.  Am I missing something due to which this won't be
> a problem?

I think that both problems (the live _bt_parallel_scan_and_sort() bug,
as well as the general issue with needing to account for parallel
worker fork() failure) are likely solvable by not using
tuplesort_leader_wait(), and instead calling
WaitForParallelWorkersToFinish(). Which you suggested already.

Separately, I will need to monitor that bugfix patch, and check its
progress, to make sure that what I add is comparable to what
ultimately gets committed for parallel query.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-17 Thread Amit Kapila
On Tue, Jan 16, 2018 at 6:24 AM, Peter Geoghegan  wrote:
> On Fri, Jan 12, 2018 at 10:28 AM, Robert Haas  wrote:
>> More comments:
>
> Attached patch has all open issues worked through, including those
> that I respond to or comment on below, as well as the other feedback
> from your previous e-mails. Note also that I fixed the issue that Amit
> raised,
>

I could still reproduce it. I think the way you have fixed it has a
race condition.  In _bt_parallel_scan_and_sort(), the value of
brokenhotchain is set after you signal the leader that the worker is
done (by incrementing workersFinished). Now, the leader is free to
decide based on the current shared state which can give the wrong
value.  Similarly, I think the value of havedead and reltuples can
also be wrong.

You neither seem to have fixed nor responded to the second problem
mentioned in my email upthread [1].  To reiterate, the problem is that
we can't assume that the workers we have launched will always start
and finish. It is possible that postmaster fails to start the worker
due to fork failure. In such conditions, tuplesort_leader_wait will
hang indefinitely because it will wait for the workersFinished count
to become equal to launched workers (+1, if leader participates) which
will never happen.  Am I missing something due to which this won't be
a problem?

Now, I think one argument is that such a problem can happen in a
parallel query, so it is not the responsibility of this patch to solve
it.  However, we already have a patch (there are some review comments
that need to be addressed in the proposed patch) to solve it, and this
patch is adding a new code path which has similar symptoms and which
can't be fixed with the already proposed patch.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1%2BizMyxzFD6k81Deyar35YJ5qdpbRTUp9cQvo%2BniQom7Q%40mail.gmail.com


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-16 Thread Prabhat Sahu
Hi all,

I have been continuing to test the parallel create index patch. So far
I haven't come across any sort of issue or regression with the patches.
Here are a few performance numbers for the latest round of testing, which
was performed on top of the 6th Jan patch submitted by Peter.

Testing was done on an OpenStack instance with:

CPU: 8
RAM: 16GB
HD: 640 GB

postgres=# select pg_size_pretty(pg_total_relation_size('lineitem'));
 pg_size_pretty
----------------
 93 GB
(1 row)

-- Test 1.
max_parallel_workers_maintenance = 2
max_parallel_workers = 16
max_parallel_workers_per_gather = 8
maintenance_work_mem = 1GB
max_wal_size = 4GB

-- Test 2.
max_parallel_workers_maintenance = 4
max_parallel_workers = 16
max_parallel_workers_per_gather = 8
maintenance_work_mem = 2GB
max_wal_size = 4GB

-- Test 3.
max_parallel_workers_maintenance = 8
max_parallel_workers = 16
max_parallel_workers_per_gather = 8
maintenance_work_mem = 4GB
max_wal_size = 4GB

NOTE: All the time taken entries are the median of 3 consecutive runs for
the same B-tree index creation query.

Time taken for parallel index creation.  Test 1 / Test 2 / Test 3
correspond to max_parallel_workers_maintenance = 2 / 4 / 8; each line
shows: time without patch / time with patch / % change.

Index on "bigint" column:
CREATE INDEX li_ordkey_idx1 ON lineitem(l_orderkey);
  Test 1: 1062446.462 ms (17:42.446) / 1024972.273 ms (17:04.972) /  3.52 %
  Test 2: 1053468.945 ms (17:33.469) /  896375.543 ms (14:56.376) / 17.75 %
  Test 3: 1082920.703 ms (18:02.921) /  932550.058 ms (15:32.550) / 13.88 %

Index on "integer" column:
CREATE INDEX li_lineno_idx2 ON lineitem(l_linenumber);
  Test 1: 1538285.499 ms (25:38.285) / 1201008.423 ms (20:01.008) / 21.92 %
  Test 2: 1529837.023 ms (25:29.837) / 1014188.489 ms (16:54.188) / 33.70 %
  Test 3: 1642160.947 ms (27:22.161) /  978518.253 ms (16:18.518) / 40.41 %

Index on "numeric" column:
CREATE INDEX li_qty_idx3 ON lineitem(l_quantity);
  Test 1: 3968102.568 ms (01:06:08.103) / 2359304.405 ms (39:19.304) / 40.54 %
  Test 2: 4129510.930 ms (01:08:49.511) / 1680201.644 ms (28:00.202) / 59.31 %
  Test 3: 4348248.210 ms (01:12:28.248) / 1490461.879 ms (24:50.462) / 65.72 %

Index on "character" column:
CREATE INDEX li_lnst_idx4 ON lineitem(l_linestatus);
  Test 1: 1510273.931 ms (25:10.274) / 1240265.301 ms (20:40.265) / 17.87 %
  Test 2: 1516842.985 ms (25:16.843) /  995730.092 ms (16:35.730) / 34.35 %
  Test 3: 1580789.375 ms (26:20.789) /  984975.746 ms (16:24.976) / 37.69 %

Index on "date" column:
CREATE INDEX li_shipdt_idx5 ON lineitem(l_shipdate);
  Test 1: 1483603.274 ms (24:43.603) / 1189704.930 ms (19:49.705) / 19.80 %
  Test 2: 1498348.925 ms (24:58.349) / 1040421.626 ms (17:20.422) / 30.56 %
  Test 3: 1653651.499 ms (27:33.651) / 1016305.794 ms (16:56.306) / 38.54 %

Index on "character varying" column:
CREATE INDEX li_comment_idx6 ON lineitem(l_comment);
  Test 1: 6945953.838 ms (01:55:45.954) / 4329696.334 ms (01:12:09.696) / 37.66 %
  Test 2: 6818556.437 ms (01:53:38.556) / 2834034.054 ms (47:14.034) / 58.43 %
  Test 3: 6942285.711 ms (01:55:42.286) / 2648430.902 ms (44:08.431) / 61.85 %

Composite index on "numeric", "character" columns:
CREATE INDEX li_qtylnst_idx34 ON lineitem (l_quantity, l_linestatus);
  Test 1: 4961563.400 ms (01:22:41.563) / 2959722.178 ms (49:19.722) / 40.34 %
  Test 2: 5242809.501 ms (01:27:22.810) / 2077463.136 ms (34:37.463) / 60.37 %
  Test 3: 5576765.727 ms (01:32:56.766) / 1755829.420 ms (29:15.829) / 68.51 %

Composite index on "date", "character varying" columns:
CREATE INDEX li_shipdtcomment_idx56 ON lineitem (l_shipdate, l_comment);
  Test 1: 4693318.077 ms (01:18:13.318) / 3181494.454 ms (53:01.494) / 32.21 %
  Test 2: 4627624.682 ms (01:17:07.625) / 2613289.211 ms (43:33.289) / 43.52 %
  Test 3: 4719242.965 ms (01:18:39.243) / 2685516.832 ms (44:45.517) / 43.09 %


Thanks & Regards,
Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
www.enterprisedb.com


On Tue, Jan 16, 2018 at 6:24 AM, Peter Geoghegan  wrote:

> On Fri, Jan 12, 2018 at 10:28 AM, Robert Haas 
> wrote:
> > More comments:
>
> Attached patch has all open issues worked through, including those
> that I respond to or comment on below, as well as the other feedback
> from your previous e-mails. Note also that I fixed the issue that Amit
> raised, as well as the GetOldestXmin()-argument bug that I noticed in
> passing when responding to Amit. I also worked on the attribution in
> the commit message.
>
> Before getting to my responses to your most recent round of feedback,
> I want to first talk about some refactoring that I decided to do. As
> you can see from the master branch, tuplesort_performsort() isn't
> necessarily reached for spool2, even when we start out with a spool2
> (that is, for many unique index builds, spool2 never even does a
> tuplesort_performsort()). We may instead decide to shut down spool2
> when it has no (dead) tuples. I made this work just as well for the
> parallel case in this latest revision. I had to teach tuplesort.c to
> accept an early tuplesort_end() for LEADER() -- it had to be prepared
> to release still-waiting 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-15 Thread Peter Geoghegan
On Sun, Jan 14, 2018 at 8:25 PM, Amit Kapila  wrote:
> On Sun, Jan 14, 2018 at 1:43 AM, Peter Geoghegan  wrote:
>> On Sat, Jan 13, 2018 at 4:32 AM, Amit Kapila  wrote:
>>> Yeah, but this would mean that now with parallel create index, it is
>>> possible that some tuples from the transaction would end up in index
>>> and others won't.
>>
>> You mean some tuples from some past transaction that deleted a bunch
>> of tuples and committed, but not before someone acquired a still-held
>> snapshot that didn't see the deleter's transaction as committed yet?
>>
>
> I think I am talking about something different.  Let me try to explain
> in some more detail.  Consider that a transaction T-1 has deleted two
> tuples from tab-1, the first on page-1 and the second on page-2, and
> committed.  There is a parallel transaction T-2 which has an open
> snapshot/query due to which oldestXmin will be smaller than T-1.  Now,
> in another session, we start a parallel Create Index on tab-1 which
> launches one worker.  The worker decides to scan page-1 and finds that
> the deleted tuple on page-1 is Recently Dead, so it will include it in
> the index.  In the meantime, transaction T-2 gets committed/aborted,
> which allows oldestXmin to become greater than the value of transaction
> T-1, and now the leader decides to scan page-2 with a freshly computed
> oldestXmin and finds that the tuple on that page is Dead, so it decides
> not to include it in the index.  So, this leads to a situation where
> some tuples deleted by the transaction end up in the index whereas
> others won't.  Note that I am not arguing that there is any fundamental
> problem with this, but just want to highlight that such a case doesn't
> seem to exist with serial Create Index.

I must have not done a good job of explaining myself ("You mean some
tuples from some past transaction..."), because this is exactly what I
meant, and was exactly how I understood your original remarks from
Saturday.

In summary, while I do agree that this is different to what we see
with serial index builds, I still don't think that this is a concern
for us.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-13 Thread Peter Geoghegan
On Sat, Jan 13, 2018 at 4:32 AM, Amit Kapila  wrote:
> Yeah, but this would mean that now with parallel create index, it is
> possible that some tuples from the transaction would end up in index
> and others won't.

You mean some tuples from some past transaction that deleted a bunch
of tuples and committed, but not before someone acquired a still-held
snapshot that didn't see the deleter's transaction as committed yet?

I guess that that is different, but it doesn't matter. All that
matters is that in the end, the index contains all entries for all
heap tuples visible to any possible snapshot (though possibly
excluding some existing old snapshots iff we detect broken HOT chains
during builds).

> In general, this makes me slightly nervous mainly
> because such a case won't be possible without the parallel option for
> create index, but if you and Robert are okay with it as there is no
> fundamental problem, then we might as well leave it as it is or maybe
> add a comment saying so.

Let me try to explain this another way, in terms of the high-level
intuition that I have about it (Robert can probably skip this part).

GetOldestXmin() returns a value that is inherently a *conservative*
cut-off. In hot standby mode, it's possible for the value it returns
to go backwards from a value previously returned within the same
backend.

Even with serial builds, the exact instant that GetOldestXmin() gets
called could vary based on something like the OS scheduling of the
process that runs CREATE INDEX. It could have a different value based
only on that. It follows that it won't matter if parallel CREATE INDEX
participants have a slightly different value, because the cut-off is
all about the consistency of the index with what the universe of
possible snapshots could see in the heap, not the consistency of
different parts of the index with each other (the parts produced from
heap tuples read from each participant).

Look at how the pg_visibility module calls GetOldestXmin() to recheck
-- it has to call GetOldestXmin() a second time, with a buffer lock
held on a heap page throughout. It does this to conclusively establish
that the visibility map is corrupt (otherwise, it could just be that
the cut-off became stale).

Putting all of this together, it would be safe for the
HEAPTUPLE_RECENTLY_DEAD case within IndexBuildHeapRangeScan() to call
GetOldestXmin() again (a bit like pg_visibility does), to avoid having
to index an actually-fully-dead-by-now tuple (we could call
HeapTupleSatisfiesVacuum() a second time for the heap tuple, hoping to
get HEAPTUPLE_DEAD the second time around). This optimization wouldn't
work out a lot of the time (it would only work out when an old
snapshot went away during the CREATE INDEX), and would add
ProcArrayLock traffic, so we don't do it. But AFAICT it's feasible.
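
To make that concrete, the hypothetical recheck might look about like
this inside IndexBuildHeapRangeScan()'s HEAPTUPLE_RECENTLY_DEAD case
(a sketch only, glossing over the HOT chain handling in the real code,
and assuming the post-af4b1a08 GetOldestXmin() signature):

    case HEAPTUPLE_RECENTLY_DEAD:
        /*
         * Hypothetical: recompute the conservative cut-off, hoping
         * that an old snapshot went away since the build began.  The
         * buffer lock is still held here, as HeapTupleSatisfiesVacuum()
         * requires.
         */
        OldestXmin = GetOldestXmin(heapRelation, PROCARRAY_FLAGS_VACUUM);
        if (HeapTupleSatisfiesVacuum(heapTuple, OldestXmin,
                                     scan->rs_cbuf) == HEAPTUPLE_DEAD)
        {
            indexIt = false;    /* conclusively dead after all */
            break;
        }
        indexIt = true;         /* still only recently dead; index it */
        break;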

> Another point is that the information about broken hot chains
> indexInfo->ii_BrokenHotChain is getting lost.  I think you need to
> coordinate this information among backends that participate in
> parallel create index.  A test to reproduce the problem is below:
>
> create table tbrokenchain(c1 int, c2 varchar);
> insert into tbrokenchain values(3, 'aaa');
>
> begin;
> set force_parallel_mode=on;
> update tbrokenchain set c2 = 'bbb' where c1=3;
> create index idx_tbrokenchain on tbrokenchain(c1);
> commit;
>
> Now, check the value of indcheckxmin in pg_index: it should be true,
> but with the patch it is false.  (You can see the expected value with
> the patch by not changing force_parallel_mode.)

Ugh, you're right. That's a real howler. Will fix.

Note that my stress-testing strategy has had a lot to do with
verifying that serial builds produce relfiles that are physically
identical to those of parallel builds. Obviously that couldn't have
caught this, because this only concerns the state of the pg_index
catalog.

> The patch uses both parallel_leader_participation and
> force_parallel_mode, but it seems their definitions differ from what
> we have in Gather.  Basically, even with force_parallel_mode, the
> leader participates in the parallel build.  I see there is some
> discussion above about both of these parameters, and there is still
> not complete agreement on the best way forward.  I think we should
> have parallel_leader_participation, as that can help in testing if
> nothing else.

I think that you're quite right that parallel_leader_participation
needs to be supported for testing purposes. I had some sympathy for
the idea that we should remove leader participation as a worker from
the patch entirely, but the testing argument seems to clinch it. I'm
fine with killing force_parallel_mode, though, because it will be
possible to force the use of parallelism by using the existing
parallel_workers table storage param in the next version of the patch,
regardless of how small the table is.
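
For example, with the storage param in place, something like this
should force a parallel build even on a tiny table (a sketch; the
number of workers actually launched is still subject to
max_parallel_workers):

alter table tbrokenchain set (parallel_workers = 2);
create index idx_tbrokenchain on tbrokenchain (c1);
alter table tbrokenchain reset (parallel_workers);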

Thanks for the review.
-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-13 Thread Amit Kapila
On Sat, Jan 13, 2018 at 6:02 PM, Amit Kapila  wrote:
>
> The patch uses both parallel_leader_participation and
> force_parallel_mode, but it seems their definitions differ from what
> we have in Gather.  Basically, even with force_parallel_mode, the
> leader participates in the parallel build.  I see there is some
> discussion above about both of these parameters, and there is still
> not complete agreement on the best way forward.  I think we should
> have parallel_leader_participation, as that can help in testing if
> nothing else.
>

Or maybe just have force_parallel_mode.  I think one of these is
required to facilitate easy testing of the parallel code.  As you can
see from my previous email, it was quite easy to demonstrate a test
with force_parallel_mode.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-13 Thread Amit Kapila
On Sat, Jan 13, 2018 at 1:25 AM, Peter Geoghegan  wrote:
> On Fri, Jan 12, 2018 at 6:14 AM, Robert Haas  wrote:
>> On Fri, Jan 12, 2018 at 8:19 AM, Amit Kapila  wrote:
>>> 1.
>>> + if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
>>>   {
>>> - snapshot = RegisterSnapshot(GetTransactionSnapshot());
>>> - OldestXmin = InvalidTransactionId; /* not used */
>>> + OldestXmin = GetOldestXmin(heapRelation, true);
>>>
>>> I think the leader and workers should have the same idea of oldestXmin
>>> for the purpose of deciding the visibility of tuples.  I think this is
>>> ensured in all forms of parallel query, as we share the snapshot;
>>> however, the same doesn't seem to be true for parallel index builds.
>>
>> Hmm.  Does it break anything if they use different snapshots?  In the
>> case of a query that would be disastrous because then you might get
>> inconsistent results, but if the snapshot is only being used to
>> determine what is and is not dead then I'm not sure it makes much
>> difference ... unless the different snapshots will create confusion of
>> some other sort.
>
> I think that this is fine. GetOldestXmin() is only used when we have a
> ShareLock on the heap relation, and the snapshot is SnapshotAny. We're
> only talking about the difference between HEAPTUPLE_DEAD and
> HEAPTUPLE_RECENTLY_DEAD here. Indexing a heap tuple when that wasn't
> strictly necessary by the time you got to it is normal.
>

Yeah, but this would mean that now with parallel create index, it is
possible that some tuples from the transaction would end up in the
index and others won't.  In general, this makes me slightly nervous mainly
because such a case won't be possible without the parallel option for
create index, but if you and Robert are okay with it as there is no
fundamental problem, then we might as well leave it as it is or maybe
add a comment saying so.


Another point is that the information about broken hot chains
indexInfo->ii_BrokenHotChain is getting lost.  I think you need to
coordinate this information among backends that participate in
parallel create index.  A test to reproduce the problem is below:

create table tbrokenchain(c1 int, c2 varchar);
insert into tbrokenchain values(3, 'aaa');

begin;
set force_parallel_mode=on;
update tbrokenchain set c2 = 'bbb' where c1=3;
create index idx_tbrokenchain on tbrokenchain(c1);
commit;

Now, check the value of indcheckxmin in pg_index: it should be true,
but with the patch it is false.  (You can see the expected value with
the patch by not changing force_parallel_mode.)
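
The check itself is a plain catalog query, for example:

select indcheckxmin from pg_index
where indexrelid = 'idx_tbrokenchain'::regclass;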

The patch uses both parallel_leader_participation and
force_parallel_mode, but it seems their definitions differ from what
we have in Gather.  Basically, even with force_parallel_mode, the
leader participates in the parallel build.  I see there is some
discussion above about both of these parameters, and there is still
not complete agreement on the best way forward.  I think we should
have parallel_leader_participation, as that can help in testing if
nothing else.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-12 Thread Peter Geoghegan
On Fri, Jan 12, 2018 at 6:14 AM, Robert Haas  wrote:
> On Fri, Jan 12, 2018 at 8:19 AM, Amit Kapila  wrote:
>> 1.
>> + if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
>>   {
>> - snapshot = RegisterSnapshot(GetTransactionSnapshot());
>> - OldestXmin = InvalidTransactionId; /* not used */
>> + OldestXmin = GetOldestXmin(heapRelation, true);
>>
>> I think the leader and workers should have the same idea of oldestXmin
>> for the purpose of deciding the visibility of tuples.  I think this is
>> ensured in all forms of parallel query, as we share the snapshot;
>> however, the same doesn't seem to be true for parallel index builds.
>
> Hmm.  Does it break anything if they use different snapshots?  In the
> case of a query that would be disastrous because then you might get
> inconsistent results, but if the snapshot is only being used to
> determine what is and is not dead then I'm not sure it makes much
> difference ... unless the different snapshots will create confusion of
> some other sort.

I think that this is fine. GetOldestXmin() is only used when we have a
ShareLock on the heap relation, and the snapshot is SnapshotAny. We're
only talking about the difference between HEAPTUPLE_DEAD and
HEAPTUPLE_RECENTLY_DEAD here. Indexing a heap tuple when that wasn't
strictly necessary by the time you got to it is normal.

However, it's not okay that GetOldestXmin()'s second argument is true
in the patch, rather than PROCARRAY_FLAGS_VACUUM. That's due to bitrot
that was not caught during some previous rebase (commit af4b1a08
changed the signature). Will fix.

You've given me a lot more to work through in your most recent mail,
Robert. I will probably get the next revision to you on Monday.
Doesn't seem like there is much point in posting what I've done so
far.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-12 Thread Robert Haas
On Thu, Jan 11, 2018 at 8:58 PM, Peter Geoghegan  wrote:
> I've caught up with you again. I just need to take a look at what I
> came up with with fresh eyes, and maybe do some more testing.

More comments:

BufFileView() looks fairly pointless.  It basically creates a copy of
the input and, in so doing, destroys the input, which is a lot like
returning the input parameter except that it uses more cycles.  It
does do a few things.  First, it zeroes the offsets array instead of
copying the offsets.  But as used, those offsets would have been 0
anyway.  Second, it sets the fileset parameter to NULL.   But that
doesn't actually seem to be important for anything: the fileset is
only used when creating new files, and the BufFile must already be
marked read-only, so we won't be doing that.  It seems like this
function can just be entirely removed and replaced by Assert()-ing
some things about the target in BufFileViewAppend, which I would just
rename to BufFileAppend.
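
Something like this, perhaps (a sketch; the field names are my
assumption about what buffile.c exposes internally):

    BufFile *
    BufFileAppend(BufFile *target, BufFile *source)
    {
        /* What BufFileView() previously established by copying: */
        Assert(target->fileset != NULL);    /* target opened in a fileset */
        Assert(source->readOnly);           /* source already exported */

        /* ... concatenation logic moved from BufFileViewAppend ... */
        return target;
    }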

In miscadmin.h, I'd put the prototype for the new GUC next to
max_worker_processes, not maintenance_work_mem.

The ereport() in index_build will, I think, confuse people when it
says that there are 0 parallel workers.  I suggest splitting this into
two cases: if (indexInfo->ii_ParallelWorkers == 0) ereport(...
"building index \"%s\" on table \"%s\" serially" ...) else ereport(...
"building index \"%s\" on table \"%s\" in parallel with request for %d
parallel workers" ...).  Might even need three cases to handle
parallel_leader_participation without needing to assemble the message,
unless we drop parallel_leader_participation support.
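
In other words, roughly this (a sketch of the suggested split; I'm
assuming DEBUG1 and the usual relation-name macros):

    if (indexInfo->ii_ParallelWorkers == 0)
        ereport(DEBUG1,
                (errmsg("building index \"%s\" on table \"%s\" serially",
                        RelationGetRelationName(indexRelation),
                        RelationGetRelationName(heapRelation))));
    else
        ereport(DEBUG1,
                (errmsg("building index \"%s\" on table \"%s\" in parallel"
                        " with request for %d parallel workers",
                        RelationGetRelationName(indexRelation),
                        RelationGetRelationName(heapRelation),
                        indexInfo->ii_ParallelWorkers)));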

The logic in IndexBuildHeapRangeScan() around need_register_snapshot
and OldestXmin seems convoluted and not very well-edited to me.  For
example, need_register_snapshot is set to false in a block that is
only entered when it's already false, and the comment that follows is
supposed to be associated with GetOldestXmin() and makes no sense
here.  I suggest that you go back to the original code organization
and then just insert an additional case for a caller-supplied scan, so
that the overall flow looks like this:

if (scan != NULL)
{
   ...
}
else if (IsBootstrapProcessingMode() || indexInfo->ii_Concurrent)
{
   ...
}
else
{
   ...
}

Along with that, I'd change the name of need_register_snapshot to
need_unregister_snapshot (it's doing both jobs right now) and
initialize it to false.  If you enter the second of the above blocks
then change it to true just after snapshot =
RegisterSnapshot(GetTransactionSnapshot()).  Then adjust the comment
that begins "Prepare for scan of the base relation." by inserting an
additional sentence just after that one: "If the caller has supplied a
scan, just use it.  Otherwise, in a normal index build..." and the
rest as it is currently.
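
Putting those pieces together, the flow would look about like this
(sketch, following the pre-parallel code in master):

    Snapshot        snapshot;
    TransactionId   OldestXmin;
    bool            need_unregister_snapshot = false;

    if (scan != NULL)
    {
        /* Caller-supplied scan: just use it (parallel participant) */
    }
    else if (IsBootstrapProcessingMode() || indexInfo->ii_Concurrent)
    {
        snapshot = RegisterSnapshot(GetTransactionSnapshot());
        need_unregister_snapshot = true;
        OldestXmin = InvalidTransactionId;  /* not used */
    }
    else
    {
        snapshot = SnapshotAny;
        /* okay to ignore lazy VACUUMs here */
        OldestXmin = GetOldestXmin(heapRelation, PROCARRAY_FLAGS_VACUUM);
    }

    /* ... scan ... */

    if (need_unregister_snapshot)
        UnregisterSnapshot(snapshot);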

+ * This support code isn't reliable when called from within a parallel
+ * worker process due to the fact that our state isn't propagated.  This is
+ * why parallel index builds are disallowed on catalogs.  It is possible
+ * that we'll fail to catch an attempted use of a user index undergoing
+ * reindexing due the non-propagation of this state to workers, which is not
+ * ideal, but the problem is not particularly likely to go undetected due to
+ * our not doing better there.

I understand the first two sentences, but I have no idea what the
third one means, especially the part that says "not particularly
likely to go undetected due to our not doing better there".  It sounds
scary that something bad is only "not particularly likely to go
undetected"; don't we need to detect bad things reliably? But also,
you used the word "not" three times and also the prefix "un-", meaning
"not", once.  Four negations in 13 words! Perhaps I'm not entirely in
a position to cast aspersions on overly-complex phraseology -- the pot
calling the kettle black and all that -- but I bet that will be a lot
clearer if you reduce the number of negations to either 0 or 1.

The comment change in standard_planner() doesn't look helpful to me;
I'd leave it out.

+ * tableOid is the table that index is to be built on.  indexOid is the OID
+ * of a index to be created or reindexed (which must be a btree index).

I'd rewrite that first sentence to end "the table on which the index
is to be built".  The second sentence should say "an index" rather
than "a index".

+ * leaderWorker indicates whether leader will participate as worker or not.
+ * This needs to be taken into account because leader process is guaranteed to
+ * be idle when not participating as a worker, in contrast with conventional
+ * parallel relation scans, where the leader process typically has plenty of
+ * other work to do relating to coordinating the scan, etc.  For CREATE INDEX,
+ * leader is usually treated as just another participant for our scaling
+ * calculation.

OK, I get the first sentence.  But the rest of this appears to be
partially irrelevant and partially incorrect.  The degree to which 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-12 Thread Amit Kapila
On Sat, Jan 6, 2018 at 3:47 AM, Peter Geoghegan  wrote:
> On Tue, Jan 2, 2018 at 8:43 PM, Rushabh Lathia  wrote:
>> I agree that plan_create_index_workers() needs to count the leader as a
>> normal worker for the CREATE INDEX.  So what you proposing is - when
>> parallel_leader_participation is true launch (return value of
>> compute_parallel_worker() - 1)
>> workers. true ?
>
> Almost. We need to not subtract one when only one worker is indicated
> by compute_parallel_worker(). I also added some new stuff there, to
> consider edge cases with the parallel_leader_participation GUC.
>
>>> I'm working on fixing up what you posted. I'm probably not more than a
>>> week away from posting a patch that I'm going to mark "ready for
>>> committer". I've already made the change above, and once I spend time
>>> on trying to break the few small changes needed within buffile.c I'll
>>> have taken it as far as I can, most likely.
>>>
>>
>> Okay, once you submit the patch with changes - I will do one round of
>> review for the changes.
>
> I've attached my revision. Changes include:
>

Few observations while skimming through the patch:

1.
+ if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
  {
- snapshot = RegisterSnapshot(GetTransactionSnapshot());
- OldestXmin = InvalidTransactionId; /* not used */
+ OldestXmin = GetOldestXmin(heapRelation, true);

I think the leader and workers should have the same idea of oldestXmin
for the purpose of deciding the visibility of tuples.  I think this is
ensured in all forms of parallel query, as we share the snapshot;
however, the same doesn't seem to be true for parallel index builds.

2.
+
+ /* Wait on worker processes to finish (should be almost instant) */
+ reltuples = _bt_leader_wait_for_workers(buildstate);

Can't we use WaitForParallelWorkersToFinish for this purpose?  The
reason is that if we use a different mechanism here then we might need
a different way to solve the problem related to fork failure.  See
thread [1].  Basically, if the postmaster fails to launch workers due
to fork failure, the leader backend might wait indefinitely.



[1] - https://commitfest.postgresql.org/16/1341/
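
In code terms, the suggestion is roughly the following (the btleader
and pcxt field names are hypothetical here, since they depend on how
the patch stores its ParallelContext):

    /*
     * Instead of a bespoke wait loop, use the standard primitive; that
     * is where a fix for the fork-failure problem would live.  (Tuple
     * counts would then have to travel through shared memory.)
     */
    WaitForParallelWorkersToFinish(buildstate->btleader->pcxt);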


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-11 Thread Peter Geoghegan
On Thu, Jan 11, 2018 at 2:26 PM, Robert Haas  wrote:
> While the patch contains, as I said before, an excellent set of how-to
> directions explaining how to use the new parallel sort facilities in
> tuplesort.c, there seems to be no such thing for logtape.c, and as a
> result I find it a bit unclear how the interface is supposed to work.
> I think it would be good to add a similar summary here.

Okay. I came up with something for that.

> It seems like the words "leader" and "worker" here refer to the leader
> of a parallel operation and the associated workers, but do we really
> need to make that assumption?  Couldn't we generally describe this as
> merging a bunch of 1-tape LogicalTapeSets created from a SharedFileSet
> into a single LogicalTapeSet that can thereafter be read by the
> process that does the merging?

It's not so much an assumption as it is the most direct way of
referring to these various objects. logtape.c is very clearly a
submodule of tuplesort.c, so this felt okay to me. There are already
several references to what tuplesort.c expects. I'm not going to argue
about it if you insist on this, though I do think that trying to
describe things in more general terms would be a net loss. It would
kind of come off as feigning ignorance IMV. There is nothing that
logtape.c could know less about other than names/roles, and I find it
hard to imagine those changing, even when we add support for
partitioning/distribution sort (where logtape.c handles
"redistribution", something discussed early in this project's
lifetime).

> +    /* Pass worker BufFile pieces, and a placeholder leader piece */
> +    for (i = 0; i < lts->nTapes; i++)
> +    {
> +        lt = &lts->tapes[i];
> +
> +        /*
> +         * Build concatenated view of all BufFiles, remembering the block
> +         * number where each source file begins.
> +         */
> +        if (i < lts->nTapes - 1)
>
> Unless I'm missing something, the "if" condition just causes the last
> pass through this loop to do nothing.  If so, why not just change the
> loop condition to i < lts->nTapes - 1 and drop the "if" statement
> altogether?

The last "lt" in the loop is in fact used separately, just outside the
loop. But that use turns out to have been subtly wrong, apparently due
to a problem with converting logtape.c to use the shared buffile
stuff. This buglet would only have caused writing to the leader tape
to break (never trace_sort instrumentation), something that isn't
supported anyway due to the restrictions that shared BufFiles have.
But, we should, on general principle, be able to write to the leader
tape if and when shared BufFiles learn to support writing (after
exporting the original BufFile in the worker).

Buglet fixed in my local working copy. I did so in a way that changes
the loop test along the lines you suggest. This should make the whole
design of tape concatenation a bit clearer.

> +    char        filename[MAXPGPATH] = {0};
>
> I don't think you need = {0}, because pg_itoa is about to clobber it anyway.

Okay.

> +/* Alter worker's tape state (generic values okay for leader) */
>
> What do you mean by generic values?

I mean that the leader's tape doesn't need to have
lt->firstBlockNumber set, because it's empty -- it can remain -1. Same
applies to lt->offsetBlockNumber, too.

I'll remove the text within parenthesis, since it seems redundant
given the structure of the loop.

> + * Each tape is initialized in write state.  Serial callers pass ntapes, but
> + * NULL arguments for everything else.  Parallel worker callers pass a
> + * shared handle and worker number, but tapeset should be NULL.  Leader
> + * passes worker -1, a shared handle, and shared tape metadata. These are
> + * used to claim ownership of worker tapes.
>
> This comment doesn't match the actual function definition terribly
> well.  Serial callers don't pass NULL for "everything else", because
> "int worker" is not going to be NULL.  For parallel workers, it's not
> entirely obvious whether "a shared handle" means TapeShare *tapes or
> SharedFileSet *fileset.  "tapeset" sounds like an argument name, but
> there is no such argument.

Okay. I've tweaked things here.

> lt->max_size looks like it might be an optimization separate from the
> overall patch, but maybe I'm wrong about that.

I think that it's pretty much essential. Currently, the MaxAllocSize
restriction is needed in logtape.c for the same reason that it's
needed anywhere else. Not much to talk about there. The new max_size
thing is about more than that, though -- it's really about not
stupidly allocating up to a full MaxAllocSize when you already know
that you're going to use next to no memory.

You don't have this issue with serial sorts because serial sorts that
only sort a tiny number of tuples never end up as external sorts --
when you end up doing a serial external sort, clearly you're never
going to allocate an excessive amount of memory up front in logtape.c,

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-11 Thread Robert Haas
On Wed, Jan 10, 2018 at 1:45 PM, Robert Haas  wrote:
> There's a lot here I haven't grokked yet, but I'm running out of
> mental energy so I think I'll send this for now and work on this some
> more when time permits, hopefully tomorrow.

Looking at the logtape changes:

While the patch contains, as I said before, an excellent set of how-to
directions explaining how to use the new parallel sort facilities in
tuplesort.c, there seems to be no such thing for logtape.c, and as a
result I find it a bit unclear how the interface is supposed to work.
I think it would be good to add a similar summary here.

It seems like the words "leader" and "worker" here refer to the leader
of a parallel operation and the associated workers, but do we really
need to make that assumption?  Couldn't we generally describe this as
merging a bunch of 1-tape LogicalTapeSets created from a SharedFileSet
into a single LogicalTapeSet that can thereafter be read by the
process that does the merging?

+    /* Pass worker BufFile pieces, and a placeholder leader piece */
+    for (i = 0; i < lts->nTapes; i++)
+    {
+        lt = &lts->tapes[i];
+
+        /*
+         * Build concatenated view of all BufFiles, remembering the block
+         * number where each source file begins.
+         */
+        if (i < lts->nTapes - 1)

Unless I'm missing something, the "if" condition just causes the last
pass through this loop to do nothing.  If so, why not just change the
loop condition to i < lts->nTapes - 1 and drop the "if" statement
altogether?

+    char        filename[MAXPGPATH] = {0};

I don't think you need = {0}, because pg_itoa is about to clobber it anyway.

+/* Alter worker's tape state (generic values okay for leader) */

What do you mean by generic values?

+ * Each tape is initialized in write state.  Serial callers pass ntapes, but
+ * NULL arguments for everything else.  Parallel worker callers pass a
+ * shared handle and worker number, but tapeset should be NULL.  Leader
+ * passes worker -1, a shared handle, and shared tape metadata. These are
+ * used to claim ownership of worker tapes.

This comment doesn't match the actual function definition terribly
well.  Serial callers don't pass NULL for "everything else", because
"int worker" is not going to be NULL.  For parallel workers, it's not
entirely obvious whether "a shared handle" means TapeShare *tapes or
SharedFileSet *fileset.  "tapeset" sounds like an argument name, but
there is no such argument.

lt->max_size looks like it might be an optimization separate from the
overall patch, but maybe I'm wrong about that.

+    /* palloc() larger than MaxAllocSize would fail */
     lt->buffer = NULL;
     lt->buffer_size = 0;
+    lt->max_size = MaxAllocSize;

The comment about palloc() should move down to where you assign max_size.

Generally we avoid returning a struct type, so maybe
LogicalTapeFreeze() should instead grow an out parameter of type
TapeShare * which it populates only if not NULL.

Won't LogicalTapeFreeze() fail an assertion in BufFileExportShared()
if the file doesn't belong to a shared fileset?  If you adopt the
previous suggestion, we can probably just make whether to call this
contingent on whether the TapeShare * out parameter is provided.
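
That is, something like (sketch):

    /*
     * Serial callers pass NULL for share, and no export happens;
     * worker callers pass a TapeShare to fill in, which also triggers
     * BufFileExportShared() on the underlying file.
     */
    extern void LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum,
                                  TapeShare *share);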

I'm not confident I completely understand what's going on with the
logtape stuff yet, so I might have more comments (or better ones)
after I study this further.  To your question about whether to go
ahead and post a new version, I'm OK to keep reviewing this version
for a little longer or to switch to a new one, as you prefer.  I have
not made any local changes, just written a blizzard of email text.
:-p

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-11 Thread Peter Geoghegan
On Thu, Jan 11, 2018 at 1:44 PM, Robert Haas  wrote:
>> A third option here is to specifically recognize that
>> compute_parallel_worker() returned a value based on the table storage
>> param max_workers, and for that reason alone no "insufficient memory
>> per participant" decrementing/vetoing should take place. That is, when
>> the max_workers param is set, perhaps it should be completely
>> impossible for CREATE INDEX to ignore it for any reason other than an
>> inability to launch parallel workers (though that could be due to the
>> max_parallel_workers GUC's setting).
>>
>> You could argue that we should do this anyway, I suppose.
>
> Yes, I think this sounds like a good idea.

Cool. I've already implemented this in my local working copy of the
patch. That settles that.

If I'm not mistaken, the only outstanding question at this point is
whether or not we're going to give in and remove parallel leader
participation entirely. I suspect that we won't end up doing
that, because while it's not very useful, it's also not hard to
support. Besides, to some extent that's the expectation that has been
established already.

I am not far from posting a revision that incorporates all of your
feedback. Expect that tomorrow afternoon your time at the latest. Of
course, you may have more feedback for me in the meantime. Let me know
if I should hold off on posting a new version.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-01-11 Thread Robert Haas
On Thu, Jan 11, 2018 at 3:25 PM, Peter Geoghegan  wrote:
> On Thu, Jan 11, 2018 at 12:06 PM, Peter Geoghegan  wrote:
>> It might make sense to have the "minimum memory per participant" value
>> come from a GUC, rather than be hard coded (it's currently hard-coded
>> to 32MB).
>
>> What do you think of that idea?
>
> A third option here is to specifically recognize that
> compute_parallel_worker() returned a value based on the table storage
> param max_workers, and for that reason alone no "insufficient memory
> per participant" decrementing/vetoing should take place. That is, when
> the max_workers param is set, perhaps it should be completely
> impossible for CREATE INDEX to ignore it for any reason other than an
> inability to launch parallel workers (though that could be due to the
> max_parallel_workers GUC's setting).
>
> You could argue that we should do this anyway, I suppose.

Yes, I think this sounds like a good idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


