Re: [HACKERS] Parallel Hash take II

2017-11-13 Thread Thomas Munro
Hi Andres and Peter,

Please see below for inline responses to your feedback.  New patch attached.

On Wed, Nov 8, 2017 at 10:01 AM, Andres Freund  wrote:
> +set min_parallel_table_scan_size = 0;
> +set parallel_setup_cost = 0;
> +-- Make a simple relation with well distributed keys and correctly
> +-- estimated size.
> +create table simple as
> +  select generate_series(1, 2) AS id, 'aa';
> +alter table simple set (parallel_workers = 2);
> +analyze simple;
> +-- Make a relation whose size we will under-estimate.  We want stats
> +-- to say 1000 rows, but actually there are 20,000 rows.
> +create table bigger_than_it_looks as
> +  select generate_series(1, 2) as id, 'aa';
> +alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
> +alter table bigger_than_it_looks set (parallel_workers = 2);
> +delete from bigger_than_it_looks where id <= 19000;
> +vacuum bigger_than_it_looks;
> +analyze bigger_than_it_looks;
> +insert into bigger_than_it_looks
> +  select generate_series(1, 19000) as id, 'aa';
>
> It seems kinda easier to just manipulate ndistinct and reltuples...

Done.

> +set max_parallel_workers_per_gather = 0;
> +set work_mem = '4MB';
>
> I hope there's a fair amount of slop here - with different archs you're
> going to see quite some size differences.

Yeah, this is a problem I wrestled with.  See next.

> +-- The "good" case: batches required, but we plan the right number; we
> +-- plan for 16 batches, and we stick to that number, and peak memory
> +-- usage stays within our work_mem budget
> +-- non-parallel
> +set max_parallel_workers_per_gather = 0;
> +set work_mem = '128kB';
>
> So how do we know that's actually the case we're testing rather than
> something arbitrarily different? There's IIRC tests somewhere that just
> filter the json explain output to the right parts...

Yeah, good idea.  My earlier attempts to dump out the hash join
dimensions ran into problems with architecture sensitivity and then
some run-to-run non-determinism in the parallel case (due to varying
fragmentation depending on how many workers get involved in time).
The attached version tells you about batch growth without reporting
the exact numbers, except in the "ugly" case where we know that there
is only one possible outcome because the extreme skew detector is
guaranteed to go off after the first nbatch increase (I got rid of all
other tuples except ones with the same key to make this true).

This exercise did reveal a bug in
0008-Show-hash-join-per-worker-information-in-EXPLAIN-ANA.patch
though: it is capturing shared instrumentation too soon in the
non-Parallel Hash case so the nbatch reported by EXPLAIN ANALYZE might
be too low if we grew while probing.  Oops.  Will post a fix for that.

> +/*
> + * Build the name for a given segment of a given BufFile.
> + */
> +static void
> +MakeSharedSegmentName(char *name, const char *buffile_name, int segment)
> +{
> +   snprintf(name, MAXPGPATH, "%s.%d", buffile_name, segment);
> +}
>
> Not a fan of this name - you're not "making" a filename here (as in
> allocating or such). I think I'd just remove the Make prefix.

Done.  I also changed some similar code where I'd used GetXXX when
building paths.

> +/*
> + * Open a file that was previously created in another backend with
> + * BufFileCreateShared in the same SharedFileSet using the same name.  The
> + * backend that created the file must have called BufFileClose() or
> + * BufFileExport() to make sure that it is ready to be opened by other
> + * backends and render it read-only.
> + */
>
> Is it actually guaranteed that it's another backend / do we rely on
> that?

No, it could be any backend that is attached to the SharedFileSet,
including the current one.  Wording improved.

> +BufFile *
> +BufFileOpenShared(SharedFileSet *fileset, const char *name)
> +{
>
> +   /*
> +    * If we didn't find any files at all, then no BufFile exists with this
> +    * tag.
> +    */
> +   if (nfiles == 0)
> +   return NULL;
>
> s/tag/name/?

Fixed.

> +/*
> + * Delete a BufFile that was created by BufFileCreateShared in the given
> + * SharedFileSet using the given name.
> + *
> + * It is not necessary to delete files explicitly with this function.  It is
> + * provided only as a way to delete files proactively, rather than waiting for
> + * the SharedFileSet to be cleaned up.
> + *
> + * Only one backend should attempt to delete a given name, and should know
> + * that it exists and has been exported or closed.
> + */
> +void
> +BufFileDeleteShared(SharedFileSet *fileset, const char *name)
> +{
> +   char        segment_name[MAXPGPATH];
> +   int         segment = 0;
> +   bool        found = false;
> +
> +   /*
> +* We don't know how many segments the file has.  We'll keep deleting
> +* until we 

Re: [HACKERS] Parallel Hash take II

2017-11-08 Thread Andres Freund
Hi,

@@ -747,7 +747,7 @@ try_hashjoin_path(PlannerInfo *root,
     * never have any output pathkeys, per comments in create_hashjoin_path.
     */
    initial_cost_hashjoin(root, &workspace, jointype, hashclauses,
-                         outer_path, inner_path, extra);
+                         outer_path, inner_path, extra, false);
 
    if (add_path_precheck(joinrel,
                          workspace.startup_cost, workspace.total_cost,
@@ -761,6 +761,7 @@ try_hashjoin_path(PlannerInfo *root,
                                  extra,
                                  outer_path,
                                  inner_path,
+                                 false,    /* parallel_hash */
                                  extra->restrictlist,
                                  required_outer,
                                  hashclauses));
@@ -776,6 +777,10 @@ try_hashjoin_path(PlannerInfo *root,
  * try_partial_hashjoin_path
  *   Consider a partial hashjoin join path; if it appears useful, push it into
  *   the joinrel's partial_pathlist via add_partial_path().
+ *   The outer side is partial.  If parallel_hash is true, then the inner path
+ *   must be partial and will be run in parallel to create one or more shared
+ *   hash tables; otherwise the inner path must be complete and a copy of it
+ *   is run in every process to create separate identical private hash tables.
  */

When do we have "or more shared hash tables" rather than one? Are you
thinking about subordinate nodes?


@@ -1839,6 +1846,10 @@ hash_inner_and_outer(PlannerInfo *root,
         * able to properly guarantee uniqueness.  Similarly, we can't handle
         * JOIN_FULL and JOIN_RIGHT, because they can produce false null
         * extended rows.  Also, the resulting path must not be parameterized.
+        * We should be able to support JOIN_FULL and JOIN_RIGHT for Parallel
+        * Hash, since in that case we're back to a single hash table with a
+        * single set of match bits for each batch, but that will require
+        * figuring out a deadlock-free way to wait for the probe to finish.
         */

s/should be able/would be able/?



index 6a45b68e5df..2d38a5efae8 100644
--- a/src/backend/storage/ipc/barrier.c
+++ b/src/backend/storage/ipc/barrier.c
@@ -451,7 +451,6 @@ BarrierDetachImpl(Barrier *barrier, bool arrive)
release = true;
barrier->arrived = 0;
++barrier->phase;
-   Assert(barrier->selected);
barrier->selected = false;
}

Uh, what?




diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 35523bd8065..40a076d976f 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -5821,6 +5821,9 @@ analyze extremely_skewed;
 insert into extremely_skewed
   select 42 as id, 'aa'
   from generate_series(1, 19000);
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 32, 'x') as t;
+alter table wide set (parallel_workers = 2);

I'm doubtful this is actually going to be a wide tuple - this'll get
compressed down quite a bit, no?

postgres[26465][1]=# SELECT octet_length(t), pg_column_size(t) FROM wide ;
┌──────────────┬────────────────┐
│ octet_length │ pg_column_size │
├──────────────┼────────────────┤
│           32 │           3671 │
│           32 │           3671 │
└──────────────┴────────────────┘
(2 rows)


(and yes, it's ridiculous that a compressed datum of that size still
takes up 3kb)



+-- parallel with parallel-aware hash join
+set max_parallel_workers_per_gather = 2;
+set work_mem = '128kB';
+set enable_parallel_hash = on;

I think it'd be better if we structured the file so we just set the GUCs
with SET LOCAL inside a transaction.


+-- parallel with parallel-aware hash join
+set max_parallel_workers_per_gather = 2;
+set work_mem = '64kB';
+set enable_parallel_hash = on;
+explain (costs off)
+  select count(*) from simple r join extremely_skewed s using (id);
+  QUERY PLAN   
+---
+ Finalize Aggregate
+   ->  Gather
+ Workers Planned: 2
+ ->  Partial Aggregate
+   ->  Parallel Hash Join
+ Hash Cond: (r.id = s.id)
+ ->  Parallel Seq Scan on simple r
+ ->  Parallel Hash
+   

Re: [HACKERS] Parallel Hash take II

2017-11-07 Thread Andres Freund
Hi,

 * avoids wasting memory on duplicated hash tables
 * avoids wasting disk space on duplicated batch files
 * avoids wasting CPU executing duplicate subplans

What's the last one referring to?





+static void
+MultiExecParallelHash(HashState *node)
+{

+   switch (BarrierPhase(build_barrier))
+   {
+       case PHJ_BUILD_ALLOCATING:
+
+           /*
+            * Either I just allocated the initial hash table in
+            * ExecHashTableCreate(), or someone else is doing that.  Either
+            * way, wait for everyone to arrive here so we can proceed, and
+            * then fall through.
+            */
+           BarrierArriveAndWait(build_barrier, WAIT_EVENT_HASH_BUILD_ALLOCATING);

Can you add a /* fallthrough */ comment here? GCC warns if you don't.
While we currently have lots of other places without the annotation, it
seems reasonable to have it in new code.
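
For example (just to show placement), the annotation would sit between the
wait call and the next case label:

            BarrierArriveAndWait(build_barrier, WAIT_EVENT_HASH_BUILD_ALLOCATING);
            /* fall through */

        case PHJ_BUILD_HASHING_INNER: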


+       case PHJ_BUILD_HASHING_INNER:
+
+           /*
+            * It's time to begin hashing, or if we just arrived here then
+            * hashing is already underway, so join in that effort.  While
+            * hashing we have to be prepared to help increase the number of
+            * batches or buckets at any time, and if we arrived here when
+            * that was already underway we'll have to help complete that work
+            * immediately so that it's safe to access batches and buckets
+            * below.
+            */
+           if (PHJ_GROW_BATCHES_PHASE(BarrierAttach(&hashtable->grow_batches_barrier)) !=
+               PHJ_GROW_BATCHES_ELECTING)
+               ExecParallelHashIncreaseNumBatches(hashtable);
+           if (PHJ_GROW_BUCKETS_PHASE(BarrierAttach(&hashtable->grow_buckets_barrier)) !=
+               PHJ_GROW_BUCKETS_ELECTING)
+               ExecParallelHashIncreaseNumBuckets(hashtable);
+           ExecParallelHashEnsureBatchAccessors(hashtable);

"accessors" sounds a bit weird for a bunch of pointers, but maybe that's
just my ESL senses tingling wrongly.



/* 
@@ -240,12 +427,15 @@ ExecEndHash(HashState *node)
  * 
  */
 HashJoinTable
-ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
+ExecHashTableCreate(HashState *state, List *hashOperators, bool keepNulls)

+   /*
+    * Parallel Hash tries to use the combined work_mem of all workers to
+    * avoid the need to batch.  If that won't work, it falls back to work_mem
+    * per worker and tries to process batches in parallel.
+    */

One day we're going to need a better approach to this. I have no idea
how, but this per-node, and now per_node * max_parallelism, approach has
only implementation simplicity as its benefit.





+static HashJoinTuple
+ExecParallelHashLoadTuple(HashJoinTable hashtable, MinimalTuple tuple,
+ dsa_pointer *shared)
+{

+static void
+ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
+{



+/*
+ * Get the first tuple in a given bucket identified by number.
+ */
+static HashJoinTuple
+ExecHashFirstTupleInBucket(HashJoinTable hashtable, int bucketno)
+{
+   if (hashtable->parallel_state)
+   {
+       dsa_pointer p =
+           dsa_pointer_atomic_read(&hashtable->buckets.shared[bucketno]);

Can you make this, and possibly a few other places, more readable by
introducing a temporary variable?


+/*
+ * Insert a tuple at the front of a chain of tuples in DSA memory atomically.
+ */
+static void
+ExecParallelHashPushTuple(dsa_pointer_atomic *head,
+                          HashJoinTuple tuple,
+                          dsa_pointer tuple_shared)
+{
+   do
+   {
+       tuple->next.shared = dsa_pointer_atomic_read(head);
+   } while (!dsa_pointer_atomic_compare_exchange(head,
+                                                 &tuple->next.shared,
+                                                 tuple_shared));
+}

This is hard to read.
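
Maybe something along these lines (untested, purely to illustrate) would be
easier to follow, with the expected head value held in a local variable:

static void
ExecParallelHashPushTuple(dsa_pointer_atomic *head,
                          HashJoinTuple tuple,
                          dsa_pointer tuple_shared)
{
    for (;;)
    {
        /* read the current head and link our tuple in front of it */
        dsa_pointer current_head = dsa_pointer_atomic_read(head);

        tuple->next.shared = current_head;

        /* swing the head to our tuple, unless someone else got in first */
        if (dsa_pointer_atomic_compare_exchange(head, &current_head,
                                                tuple_shared))
            break;
    }
}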


+ * While in the phase PHJ_BUILD_HASHING_INNER a separate pair of barriers may
+ * be used repeatedly as required to coordinate expansions in the number of
+ * batches or buckets.  Their phases are as follows:
+ *
+ *   PHJ_GROW_BATCHES_ELECTING   -- initial state
+ *   PHJ_GROW_BATCHES_ALLOCATING -- one allocates new batches
+ *   PHJ_GROW_BATCHES_REPARTITIONING -- all rep
s/rep/repartition/?

 #include "access/htup_details.h"
+#include "access/parallel.h"
 #include 

Re: [HACKERS] Parallel Hash take II

2017-11-07 Thread Thomas Munro
Hi Peter,

See responses to a couple of points below.  I'll respond to the other
points separately (ie with code/comment changes).

On Wed, Nov 8, 2017 at 10:32 AM, Peter Geoghegan  wrote:
> On Tue, Nov 7, 2017 at 1:01 PM, Andres Freund  wrote:
>> +/*
>> + * Delete a BufFile that was created by BufFileCreateShared in the given
>> + * SharedFileSet using the given name.
>> + *
>> + * It is not necessary to delete files explicitly with this function.  It is
>> + * provided only as a way to delete files proactively, rather than waiting for
>> + * the SharedFileSet to be cleaned up.
>> + *
>> + * Only one backend should attempt to delete a given name, and should know
>> + * that it exists and has been exported or closed.
>> + */
>
> This part is new to me. We now want one backend to delete a given
> filename. What changed? Please provide a Message-Id reference if that
> will help me to understand.
>
> For now, I'm going to guess that this development had something to do
> with the need to deal with virtual FDs that do a close() on an FD to
> keep under backend limits. Do I have that right?

No -- this is simply an option available to client code that wants to
clean up individual temporary files earlier.  Such client code is
responsible for meeting the synchronisation requirements described in
the comment, but it's entirely optional.  For example:
SharedTuplestore is backed by multiple files and it could take the
opportunity to delete individual files after they've been scanned, if
you told it you were going to scan only once
(SHARED_TUPLESTORE_SINGLE_PASS) and if it could prove that no other
backend could still need to read from it, as mentioned in a comment in
sts_end_parallel_scan().

However, since I changed to the page/chunk based model (spinlock while
advancing block counter, mimicking Parallel Seq Scan) instead of the
earlier Fisher Price version (LWLock while reading each tuple with a
shared read head maintained with tell/seek), I don't actually do that.
Hitting the end of the file no longer means that no one else is
reading from the file (someone else might still be reading from an
earlier chunk even though you've finished reading the final chunk in
the file, and the vfd system means that they must be free to close
and reopen the file at any time).  In the current patch version the
files are cleaned up wholesale at two times: SharedFileSet cleanup
triggered by DSM destruction, and SharedFileSet reset triggered by
rescan.  Practically, it's always the former case.  It's vanishingly
rare that you'd actually want to be rescanning a Parallel Hash that
spills to disk but in that case we delete the old files and recreate,
and that case is tested in the regression tests.

If it bothers you that I have an API there that I'm not actually using
yet, I will remove it.

>> +   if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
>> +   {
>> +   /* Subtract its size from current usage (do first in case of 
>> error) */
>> +   temporary_files_size -= vfdP->fileSize;
>> +   vfdP->fileSize = 0;
>> +   }
>>
>> So, is it right to do so unconditionally and without regard for errors?
>> If the file isn't deleted, it shouldn't be subtracted from fileSize. I
>> guess you're managing that through the flag, but that's not entirely
>> obvious.
>
> I think that the problem here is that the accounting is expected to
> always work. It's not like there is a resowner style error path in
> which temporary_files_size gets reset to 0.

But there is a resowner error path in which File handles get
automatically closed and temporary_files_size gets adjusted.  The
counter goes up when you create, and goes down when you close or
resowner closes for you.  Eventually you either close and the
bookkeeping is consistent or you crash and it doesn't matter.  And
some kind of freak multiple close attempt is guarded against by
setting the files size to 0 so we can't double-subtract.  Do you see a
bug?

None of this has any impact on whether files are leaked: either
SharedFileSet removes the files, or you crash (or take a filesystem
snapshot, etc) and RemovePgTempFiles() mops them up at the next clean
startup.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Parallel Hash take II

2017-11-07 Thread Thomas Munro
On Wed, Nov 8, 2017 at 10:32 AM, Robert Haas  wrote:
> On Tue, Nov 7, 2017 at 4:01 PM, Andres Freund  wrote:
>> diff --git a/src/backend/utils/resowner/resowner.c 
>> b/src/backend/utils/resowner/resowner.c
>> index 4c35ccf65eb..8b91d5a6ebe 100644
>> --- a/src/backend/utils/resowner/resowner.c
>> +++ b/src/backend/utils/resowner/resowner.c
>> @@ -528,16 +528,6 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
>> PrintRelCacheLeakWarning(res);
>> RelationClose(res);
>> }
>> -
>> -   /* Ditto for dynamic shared memory segments */
>> -   while (ResourceArrayGetAny(&(owner->dsmarr), &foundres))
>> -   {
>> -   dsm_segment *res = (dsm_segment *) 
>> DatumGetPointer(foundres);
>> -
>> -   if (isCommit)
>> -   PrintDSMLeakWarning(res);
>> -   dsm_detach(res);
>> -   }
>> }
>> else if (phase == RESOURCE_RELEASE_LOCKS)
>> {
>> @@ -654,6 +644,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
>> PrintFileLeakWarning(res);
>> FileClose(res);
>> }
>> +
>> +   /* Ditto for dynamic shared memory segments */
>> +   while (ResourceArrayGetAny(&(owner->dsmarr), &foundres))
>> +   {
>> +   dsm_segment *res = (dsm_segment *) 
>> DatumGetPointer(foundres);
>> +
>> +   if (isCommit)
>> +   PrintDSMLeakWarning(res);
>> +   dsm_detach(res);
>> +   }
>> }
>>
>> Is that entirely unproblematic? Are there any DSM callbacks that rely on
>> locks still being held? Please split this part into a separate commit
>> with such analysis.
>
> FWIW, I think this change is a really good idea (I recommended it to
> Thomas at some stage, I think).

Yeah, it was Robert's suggestion; I thought I needed *something* like
this but was hesitant for the niggling reason that Andres mentions:
what if someone somewhere (including code outside our source tree)
depends on this ordering because of unlocking etc?

At that time I thought that my clean-up logic wasn't going to work on
Windows without this reordering, because we were potentially closing
file handles after unlinking the files, and I was under the impression
that Windows wouldn't like that.  Since then I've learned that Windows
does actually allow it, but only if all file handles were opened with
the FILE_SHARE_DELETE flag.  We always do that (see src/port/open.c),
so in fact this change is probably not needed for my patch set (theory
not tested).  I will put it in a separate patch as requested by
Andres, because it's generally a good idea anyway for the reasons that
Robert explained (ie you probably always want to clean up memory last,
since it might contain the meta-data/locks/control objects/whatever
you'll need to clean up anything else).

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Parallel Hash take II

2017-11-07 Thread Peter Geoghegan
On Tue, Nov 7, 2017 at 1:01 PM, Andres Freund  wrote:
> +/*
> + * Build the name for a given segment of a given BufFile.
> + */
> +static void
> +MakeSharedSegmentName(char *name, const char *buffile_name, int segment)
> +{
> +   snprintf(name, MAXPGPATH, "%s.%d", buffile_name, segment);
> +}
>
> Not a fan of this name - you're not "making" a filename here (as in
> allocating or such). I think I'd just remove the Make prefix.

+1

Can we document the theory behind file naming here, if that isn't in
the latest version? This is a routine private to parallel hash join
(or shared tuplestore), not BufFile. Maybe BufFile should have some
opinion on this, though. Just as a matter of style.

> +/*
> + * Delete a BufFile that was created by BufFileCreateShared in the given
> + * SharedFileSet using the given name.
> + *
> + * It is not necessary to delete files explicitly with this function.  It is
> + * provided only as a way to delete files proactively, rather than waiting for
> + * the SharedFileSet to be cleaned up.
> + *
> + * Only one backend should attempt to delete a given name, and should know
> + * that it exists and has been exported or closed.
> + */

This part is new to me. We now want one backend to delete a given
filename. What changed? Please provide a Message-Id reference if that
will help me to understand.

For now, I'm going to guess that this development had something to do
with the need to deal with virtual FDs that do a close() on an FD to
keep under backend limits. Do I have that right?

> +   /*
> +* We don't set FD_DELETE_AT_CLOSE for files opened this way, but we 
> still
> +* want to make sure they get closed at end of xact.
> +*/
> +   ResourceOwnerEnlargeFiles(CurrentResourceOwner);
> +   ResourceOwnerRememberFile(CurrentResourceOwner, file);
> +   VfdCache[file].resowner = CurrentResourceOwner;
>
> So maybe I'm being pedantic here, but wouldn't the right order be to do
> ResourceOwnerEnlargeFiles() *before* creating the file? It's a memory
> allocating operation, so it can fail, which'd leak the file.

I remember going to pains to get this right with my own unifiable
BufFile concept. I'm going to wait for an answer to my question about file
deletion + resowners before commenting further, though.

> +   if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
> +   {
> +   /* Subtract its size from current usage (do first in case of 
> error) */
> +   temporary_files_size -= vfdP->fileSize;
> +   vfdP->fileSize = 0;
> +   }
>
> So, is it right to do so unconditionally and without regard for errors?
> If the file isn't deleted, it shouldn't be subtracted from fileSize. I
> guess you're managing that through the flag, but that's not entirely
> obvious.

I think that the problem here is that the accounting is expected to
always work. It's not like there is a resowner style error path in
which temporary_files_size gets reset to 0.

> Is that entirely unproblematic? Are there any DSM callbacks that rely on
> locks still being held? Please split this part into a separate commit
> with such analysis.

I was always confused about the proper use of DSM callbacks myself.
They seemed generally underdocumented.

> +   /* Create the output buffer. */
> +   if (accessor->write_chunk != NULL)
> +   pfree(accessor->write_chunk);
> +   accessor->write_chunk = (SharedTuplestoreChunk *)
> +   palloc0(accessor->write_pages * BLCKSZ);
>
> Are we guaranteed to be in a long-lived memory context here?

I imagine that Thomas looked to tuplestore_begin_heap() + interXact as
a kind of precedent here. See comments above that function.

-- 
Peter Geoghegan




Re: [HACKERS] Parallel Hash take II

2017-11-07 Thread Robert Haas
On Tue, Nov 7, 2017 at 4:01 PM, Andres Freund  wrote:
> +   ResourceOwnerEnlargeFiles(CurrentResourceOwner);
> +   ResourceOwnerRememberFile(CurrentResourceOwner, file);
> +   VfdCache[file].resowner = CurrentResourceOwner;
>
> So maybe I'm being pedantic here, but wouldn't the right order be to do
> ResourceOwnerEnlargeFiles() *before* creating the file? It's a memory
> allocating operation, so it can fail, which'd leak the file.

That's not pedantic ... that's a very sound criticism.  IMHO, anyway.

> diff --git a/src/backend/utils/resowner/resowner.c 
> b/src/backend/utils/resowner/resowner.c
> index 4c35ccf65eb..8b91d5a6ebe 100644
> --- a/src/backend/utils/resowner/resowner.c
> +++ b/src/backend/utils/resowner/resowner.c
> @@ -528,16 +528,6 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
> PrintRelCacheLeakWarning(res);
> RelationClose(res);
> }
> -
> -   /* Ditto for dynamic shared memory segments */
> -   while (ResourceArrayGetAny(&(owner->dsmarr), &foundres))
> -   {
> -   dsm_segment *res = (dsm_segment *) 
> DatumGetPointer(foundres);
> -
> -   if (isCommit)
> -   PrintDSMLeakWarning(res);
> -   dsm_detach(res);
> -   }
> }
> else if (phase == RESOURCE_RELEASE_LOCKS)
> {
> @@ -654,6 +644,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
> PrintFileLeakWarning(res);
> FileClose(res);
> }
> +
> +   /* Ditto for dynamic shared memory segments */
> +   while (ResourceArrayGetAny(&(owner->dsmarr), &foundres))
> +   {
> +   dsm_segment *res = (dsm_segment *) 
> DatumGetPointer(foundres);
> +
> +   if (isCommit)
> +   PrintDSMLeakWarning(res);
> +   dsm_detach(res);
> +   }
> }
>
> Is that entirely unproblematic? Are there any DSM callbacks that rely on
> locks still being held? Please split this part into a separate commit
> with such analysis.

FWIW, I think this change is a really good idea (I recommended it to
Thomas at some stage, I think).  The current positioning was decided
by me at a very early stage of parallel query development where I
reasoned as follows (1) the first thing we're going to implement is
going to be parallel quicksort, (2) that's going to allocate a huge
amount of DSM, (3) therefore we should try to free it as early as
possible.  However, I now think that was wrongheaded, and not just
because parallel quicksort didn't turn out to be the first thing we
developed.  Memory is the very last resource we should release when
aborting a transaction, because any other resource we have is tracked
using data structures that are stored in memory.  Throwing the memory
away before the end therefore makes life very difficult. That's why,
for backend-private memory, we clean up most everything else in
AbortTransaction() and then only destroy memory contexts in
CleanupTransaction().  This change doesn't go as far, but it's in the
same general direction, and I think rightly so.  My error was in
thinking that the primary use of memory would be for storing data, but
really it's about where you put your control structures.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Hash take II

2017-11-07 Thread Andres Freund
Hi,

Here's a review of v24

+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+  select generate_series(1, 2) AS id, 'aa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+-- Make a relation whose size we will under-estimate.  We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+  select generate_series(1, 2) as id, 'aa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+delete from bigger_than_it_looks where id <= 19000;
+vacuum bigger_than_it_looks;
+analyze bigger_than_it_looks;
+insert into bigger_than_it_looks
+  select generate_series(1, 19000) as id, 'aa';

It seems kinda easier to just manipulate ndistinct and reltuples...


+set max_parallel_workers_per_gather = 0;
+set work_mem = '4MB';

I hope there's a fair amount of slop here - with different archs you're
going to see quite some size differences.

+-- The "good" case: batches required, but we plan the right number; we
+-- plan for 16 batches, and we stick to that number, and peak memory
+-- usage stays within our work_mem budget
+-- non-parallel
+set max_parallel_workers_per_gather = 0;
+set work_mem = '128kB';

So how do we know that's actually the case we're testing rather than
something arbitrarily different? There's IIRC tests somewhere that just
filter the json explain output to the right parts...





+/*
+ * Build the name for a given segment of a given BufFile.
+ */
+static void
+MakeSharedSegmentName(char *name, const char *buffile_name, int segment)
+{
+   snprintf(name, MAXPGPATH, "%s.%d", buffile_name, segment);
+}

Not a fan of this name - you're not "making" a filename here (as in
allocating or such). I think I'd just remove the Make prefix.



+/*
+ * Open a file that was previously created in another backend with
+ * BufFileCreateShared in the same SharedFileSet using the same name.  The
+ * backend that created the file must have called BufFileClose() or
+ * BufFileExport() to make sure that it is ready to be opened by other
+ * backends and render it read-only.
+ */

Is it actually guaranteed that it's another backend / do we rely on
that?

+BufFile *
+BufFileOpenShared(SharedFileSet *fileset, const char *name)
+{

+   /*
+* If we didn't find any files at all, then no BufFile exists with this
+* tag.
+*/
+   if (nfiles == 0)
+   return NULL;

s/tag/name/?


+/*
+ * Delete a BufFile that was created by BufFileCreateShared in the given
+ * SharedFileSet using the given name.
+ *
+ * It is not necessary to delete files explicitly with this function.  It is
+ * provided only as a way to delete files proactively, rather than waiting for
+ * the SharedFileSet to be cleaned up.
+ *
+ * Only one backend should attempt to delete a given name, and should know
+ * that it exists and has been exported or closed.
+ */
+void
+BufFileDeleteShared(SharedFileSet *fileset, const char *name)
+{
+   char        segment_name[MAXPGPATH];
+   int         segment = 0;
+   bool        found = false;
+
+   /*
+    * We don't know how many segments the file has.  We'll keep deleting
+    * until we run out.  If we don't manage to find even an initial segment,
+    * raise an error.
+    */
+   for (;;)
+   {
+   MakeSharedSegmentName(segment_name, name, segment);
+   if (!SharedFileSetDelete(fileset, segment_name, true))
+   break;
+   found = true;
+   ++segment;
+   }

Hm. Do we properly delete all the files via the resowner mechanism if
this fails midway? I.e. if there are no leading segments? Also wonder if
this doesn't need a CFI check.
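
i.e. something like this at the top of the loop (sketch):

    for (;;)
    {
        CHECK_FOR_INTERRUPTS();

        MakeSharedSegmentName(segment_name, name, segment);
        if (!SharedFileSetDelete(fileset, segment_name, true))
            break;
        found = true;
        ++segment;
    }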

+void
+PathNameCreateTemporaryDir(const char *basedir, const char *directory)
+{
+   if (mkdir(directory, S_IRWXU) < 0)
+   {
+   if (errno == EEXIST)
+   return;
+
+       /*
+        * Failed.  Try to create basedir first in case it's missing.  Tolerate
+        * ENOENT to close a race against another process following the same
+        * algorithm.
+        */
+       if (mkdir(basedir, S_IRWXU) < 0 && errno != ENOENT)
+           elog(ERROR, "cannot create temporary directory \"%s\": %m",
+                basedir);

ENOENT or EEXIST?



+File
+PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
+{
+   File        file;
+
+   /*
+    * Open the file.  Note: we don't use O_EXCL, in case there is an orphaned
+    * temp file that can be reused.
+    */
+   file = PathNameOpenFile(path, 

Re: [HACKERS] Parallel Hash take II

2017-11-02 Thread Thomas Munro
On Mon, Oct 30, 2017 at 1:49 PM, Thomas Munro
 wrote:
> A couple of stupid things outstanding:
>
> 1.  EXPLAIN ANALYZE for Parallel Hash "actual" shows the complete row
> count, which is interesting to know (or not?  maybe I should show it
> somewhere else?), but the costing shows the partial row estimate used
> for costing purposes.

Fixed.

> 2.  The BufFileSet's temporary directory gets created even if you
> don't need it for batches.  Duh.

Fixed.

I also refactored shared temporary files a bit while looking into
this.  The shared file ownership mechanism is now promoted to its own
translation unit sharedfileset.c and it now works with fd.c files.
buffile.c can still make use of it.  That seems like a better division
of labour.

> 3.  I don't have a good query rescan regression query yet.  I wish I
> could write my own query plans to test the executor.

I found a query that rescans a parallel-aware hash join and added a
couple of variants to the regression tests.

I also decluttered the EXPLAIN ANALYZE output for enable_parallel_hash
= off a bit.

-- 
Thomas Munro
http://www.enterprisedb.com


parallel-hash-v24.patchset.tgz
Description: GNU Zip compressed data



Re: [HACKERS] Parallel Hash take II

2017-10-30 Thread Thomas Munro
On Tue, Aug 1, 2017 at 9:28 AM, Andres Freund  wrote:
> On 2017-07-26 20:12:56 +1200, Thomas Munro wrote:
>> I'll report on performance separately.
>
> Looking forward to that ;)

Here are some experimental results from a Xeon E5-2695 v3 with a ton
of RAM and spinning disks (EDB lab server "scylla").  I used TPC-H
dbgen scale 10 with the additional indexes suggested by Tomas
Vondra[1].  10GB of source data (= 23GB pgdata dir) is obviously quite
small as these things go, and I hope we'll run some of these with a
much larger scale soon (it's a big job), but it's enough to runs
queries for tens of seconds to minutes so it's definitely in parallel
query territory and shows some pretty interesting effects IMHO.

First, here's a stupid self-join as a warm-up.  The query is SELECT
COUNT(*) FROM lineitem r JOIN lineitem s USING (l_orderkey,
l_linenumber), where lineitem is ~60 million rows.

(1) With work_mem set sky-high so no batching is required, how much
speed-up does each worker contribute with this feature off (= same as
unpatched master) and on?  In this table, each cell shows the speed-up
compared to w=0 (no workers):

 parallel_hash |  w=0  |  w=1  |  w=2  |  w=3  |  w=4  |  w=5  |  w=6
---+---+---+---+---+---+---+---
 off   | 1.00x | 1.42x | 1.66x | 1.76x | 1.66x | 1.68x | 1.69x
 on| 1.00x | 1.76x | 2.54x | 3.21x | 3.80x | 4.30x | 4.79x

(2) With work_mem set to 128MB this query needs 32 batches.  Again,
how much speed-up does each worker contribute with the feature off and
on?

 parallel_hash |  w=0  |  w=1  |  w=2  |  w=3  |  w=4  |  w=5  |  w=6
---+---+---+---+---+---+---+---
 off   | 1.00x | 1.37x | 1.49x | 1.32x | 1.67x | 1.63x | 1.64x
 on| 1.00x | 1.94x | 2.72x | 2.82x | 3.02x | 4.64x | 5.98x

I haven't tried to grok the shape of that curve yet.  Interestingly
(not shown here) the 32 batch parallel hash actually starts to beat
the single-batch parallel hash somewhere around 5-6 workers, and at 15
workers it achieves 9.53x speed-up compared to w=0 and is still
gaining as you add more workers, whereas the single-batch version tops
out around 8 workers.  This may be in part due to the trade-offs
discussed in "Design and Evaluation of Main Memory Hash Join
Algorithms for Multi-core CPUs" (short version: partitioning up front
can pay off by reducing cache misses at various levels and some
research databases would consider that), but I would think we're
probably pretty far away from that frontier and there are probably
other more basic problems.  Investigation/profiling required.

Next, here are some numbers from the TPC-H queries.  I included only
queries where a Parallel Hash was selected by the planner.  I stopped
at w=6 because that's the highest number of workers the planner would
pick by default at that scale.  (If I'm getting the maths right, TPC-H
scale 300's lineitem table should inspire about 10 workers to get out
of bed; you get an extra worker each time a table triples in size.)

(3) What is the speed-up with enable_parallel_hash = on vs
enable_parallel_hash = off?  Here is the result table for various
numbers of workers, with work_mem set to 1GB.

 query |  w=0  |  w=1  |  w=2  |  w=3  |  w=4  |  w=5  |  w=6
---+---+---+---+---+---+---+---
 3 | 1.02x | 1.16x | 1.37x | 1.79x | 1.95x | 2.29x | 2.44x
 5 | 1.03x | 1.15x | 1.20x | 1.44x | 1.95x | 2.05x | 1.34x
 7 | 1.02x | 1.26x | 1.54x | 2.18x | 2.57x | 1.25x | 1.32x
 8 | 1.00x | 1.56x | 1.49x | 1.47x | 1.40x | 0.55x | 0.55x
 9 | 1.02x | 1.24x | 1.35x | 1.50x | 1.59x | 1.82x | 1.82x
10 | 1.02x | 1.16x | 1.19x | 1.44x | 1.51x | 1.75x | 1.83x
12 | 1.01x | 1.22x | 1.53x | 0.72x | 0.74x | 0.73x | 0.99x
14 | 1.00x | 1.08x | 1.18x | 1.33x | 1.41x | 1.54x | 1.52x
16 | 1.01x | 1.22x | 1.10x | 1.10x | 1.10x | 1.11x | 1.10x
18 | 0.99x | 1.07x | 1.05x | 0.99x | 0.99x | 0.99x | 1.03x
21 | 1.00x | 1.28x | 1.24x | 1.34x | 0.18x | 0.19x | 0.23x

Some commentary on the cases where the performance was apparently hurt
by the feature: for Q21 with w=3 workers and above with
enable_parallel_hash = off the planner switched from a hash join to a
nested loop and that turned out to be better, but with
enable_parallel_hash = on it didn't give up on hash join until there
were 6 workers.  Something similar happened with Q8 around 5 workers.
Q21 has some major cardinality estimation problems as discussed
elsewhere, and on this run I didn't think to apply the patch that
fixes (?) that.  In other words, as far as I can tell, all of those
are cases where there is possibly room for general planner improvement
outside this project: the point at which we flip from one plan type to
another moves around, not necessarily indicating a problem with
Parallel Hash as an executor node.  That isn't to say I'm not
interested in understanding the causes better and trying to 

Re: [HACKERS] Parallel Hash take II

2017-10-29 Thread Thomas Munro
On Fri, Oct 27, 2017 at 12:24 AM, Rushabh Lathia
 wrote:
> While re-basing the parallel-B-tree-index-build patch on top of the v22
> patch set, I found some cosmetic issues:

Thanks!

> 1) BufFileSetEstimate is removed but it's still in buffile.h
>
> +extern size_t BufFileSetEstimate(int stripes);

Fixed.

> 2) BufFileSetCreate is renamed to BufFileSetInit, but the old name is still
> used in this comment:
>
> * Attach to a set of named BufFiles that was created with BufFileSetCreate.

Fixed.

Other minor tweaks: I fixed a harmless warning from Visual C++ and
added a CHECK_FOR_INTERRUPTS() to
ExecParallelHashJoinPartitionOuter()'s loop.

Two other changes:

1.  Improved concurrency for sharedtuplestore.c.

Last week I investigated some test failures on AppVeyor CI and
discovered a small problem with sharedtuplestore.c on Windows: it
could occasionally get ERROR_ACCESS_DENIED errors when attempting to
open files that were concurrently being unlinked (unlinking is not
atomic on Windows, see pgsql-bugs #14243 for another manifestation).
That code was a bit sloppy (though not incorrect), and was easy to fix
by doing some checks in a different order, but...

While hacking on that I realised that sharedtuplestore.c's parallel
scan efficiency was pretty terrible anyway, so I made an improvement
that I'd earlier threatened to make in a comment.  Instead of holding
a per-file lock while reading individual tuples, it now works in
page-multiple-sized chunks.  Readers only acquire a spinlock when they
need a new chunk, don't hold any locks while doing IO, and never read
overlapping pages.  From a user perspective, this means that EXPLAIN
(BUFFERS) temporary buffer read counts are approximately the same as
for the equivalent non-parallel hash join, because each worker reads a
disjoint set of temporary file pages instead of reading interleaved
tuples from the same pages, and there is no more LWLock "shared
tuplestore" that can show up in wait_event when backends pile up on
the same batch.  It writes very slightly more than it reads because of
unread pages at the end of the final chunk (because it reads back in
tuple-at-a-time and thus page-at-a-time, not whole chunk at a time --
I considered reading whole chunks and then returning pointers to
MinimalTuples in the chunk, but that required MAXALIGNing data in the
files on disk which made the files noticeably bigger).

So, to summarise, there are now three layers of contention avoidance
strategy being used by Parallel Hash Join for scanning batches in
parallel:

i)  Participants in a Parallel Hash Join try to work on different
batches so they avoid scanning the same SharedTuplestore completely.
That's visible with EXPLAIN ANALYZE as "Batches Probed" (that shows
the number of outer batches scanned; it doesn't seem worth the pixels
to show "Batches Loaded" for the number of inner batches scanned which
may be lower).

ii)  When that's not possible, they start out reading from different
backing files by starting with the one they wrote themselves and then
go around the full set.  That means they don't contend on the per-file
read-head lock until a reader drains a whole file and then chooses
another one that's still being read by someone else.

iii)  [New in this version] Since they might still finish up reading
from the same file (and often do towards the end of a join), the
tuples are chopped into multi-page chunks and participants are
allocated chunks using a spinlock-protected counter.  This is quite a
lot like Parallel Sequential Scan, with some extra complications due
to variable sized chunks.
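
To make that concrete, here's a minimal sketch of the chunk hand-out scheme
described above (illustrative names only, not the actual sharedtuplestore.c
interface, and ignoring the variable-sized-chunk complications):

#include "postgres.h"

#include "storage/spin.h"

/* shared state for one backing file being read in parallel (sketch) */
typedef struct SketchParallelScan
{
    slock_t     mutex;          /* initialised elsewhere with SpinLockInit() */
    uint32      next_chunk;     /* next chunk number to hand out */
} SketchParallelScan;

/*
 * Acquire the next chunk to read.  Only this counter bump happens under the
 * spinlock; all of the I/O for the chunk is then done without holding any
 * lock, and no two participants ever read overlapping pages.
 */
static uint32
sketch_next_chunk(SketchParallelScan *scan)
{
    uint32      chunk;

    SpinLockAcquire(&scan->mutex);
    chunk = scan->next_chunk++;
    SpinLockRelease(&scan->mutex);

    return chunk;
}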

2.  Improved oversized tuple handling.

I added a new regression test case to exercise the oversized tuple
support in ExecParallelHashLoadTuple() and sts_puttuple(), as
requested by Andres a while back.  (Thanks to Andrew Gierth for a
pointer on how to get detoasted tuples into a hash join table without
disabling parallelism.)  While testing that I realised that my
defences against some degenerate behaviour with very small work_mem
weren't quite good enough.  Previously, I adjusted space_allowed to
make sure every backend could allocate at least one memory chunk
without triggering an increase in the number of batches.  Now, I leave
space_allowed alone but explicitly allow every backend to allocate at
least one chunk ignoring the memory budget (whether regular chunk size
or oversized tuple size), to avoid futile repeated batch increases
when a single monster tuple is never going to fit in work_mem.
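
In rough C, the rule amounts to something like this (hypothetical names, not
the patch's actual variables):

#include <stdbool.h>
#include <stddef.h>

/*
 * Sketch only: may this participant allocate another chunk_size bytes of
 * shared hash table memory?  Every backend is always allowed its first
 * chunk, even an oversized one, so a single monster tuple can't trigger a
 * futile series of batch increases.
 */
static bool
sketch_chunk_allocation_allowed(size_t total_allocated,   /* all participants */
                                size_t space_allowed,     /* shared budget */
                                int my_chunks,            /* chunks I hold */
                                size_t chunk_size)
{
    if (my_chunks == 0)
        return true;            /* first chunk: ignore the budget */

    return total_allocated + chunk_size <= space_allowed;
}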

A couple of stupid things outstanding:

1.  EXPLAIN ANALYZE for Parallel Hash "actual" shows the complete row
count, which is interesting to know (or not?  maybe I should show it
somewhere else?), but the costing shows the partial row estimate used
for costing purposes.
2.  The BufFileSet's temporary directory gets created even if you
don't need it for batches.  Duh.
3.  I don't have a good query rescan regression query yet.  I wish I
could write my own query plans to test the 

Re: [HACKERS] Parallel Hash take II

2017-10-26 Thread Rushabh Lathia
While re-basing the parallel-B-tree-index-build patch on top of the v22
patch set, I found some cosmetic issues:

1) BufFileSetEstimate is removed but it's still in buffile.h

+extern size_t BufFileSetEstimate(int stripes);


2) BufFileSetCreate is renamed to BufFileSetInit, but the old name is still
used in this comment:

* Attach to a set of named BufFiles that was created with BufFileSetCreate.

Thanks,

On Wed, Oct 25, 2017 at 11:33 AM, Thomas Munro <
thomas.mu...@enterprisedb.com> wrote:

> On Tue, Oct 24, 2017 at 10:10 PM, Thomas Munro
>  wrote:
> > Here is an updated patch set that does that ^.
>
> It's a bit hard to understand what's going on with the v21 patch set I
> posted yesterday because EXPLAIN ANALYZE doesn't tell you anything
> interesting.  Also, if you apply the multiplex_gather patch[1] I
> posted recently and set multiplex_gather to off then it doesn't tell
> you anything at all, because the leader has no hash table (I suppose
> that could happen with unpatched master given sufficiently bad
> timing).  Here's a new version with an extra patch that adds some
> basic information about load balancing to EXPLAIN ANALYZE, inspired by
> what commit bf11e7ee did for Sort.
>
> Example output:
>
> enable_parallel_hash = on, multiplex_gather = on:
>
>  ->  Parallel Hash (actual rows=100 loops=3)
>Buckets: 131072  Batches: 16
>Leader:Shared Memory Usage: 3552kB  Hashed: 396120  Batches
> Probed: 7
>Worker 0:  Shared Memory Usage: 3552kB  Hashed: 276640  Batches
> Probed: 6
>Worker 1:  Shared Memory Usage: 3552kB  Hashed: 327240  Batches
> Probed: 6
>->  Parallel Seq Scan on simple s (actual rows=33 loops=3)
>
>  ->  Parallel Hash (actual rows=1000 loops=8)
>Buckets: 131072  Batches: 256
>Leader:Shared Memory Usage: 2688kB  Hashed: 1347720
> Batches Probed: 36
>Worker 0:  Shared Memory Usage: 2688kB  Hashed: 1131360
> Batches Probed: 33
>Worker 1:  Shared Memory Usage: 2688kB  Hashed: 1123560
> Batches Probed: 38
>Worker 2:  Shared Memory Usage: 2688kB  Hashed: 1231920
> Batches Probed: 38
>Worker 3:  Shared Memory Usage: 2688kB  Hashed: 1272720
> Batches Probed: 34
>Worker 4:  Shared Memory Usage: 2688kB  Hashed: 1234800
> Batches Probed: 33
>Worker 5:  Shared Memory Usage: 2688kB  Hashed: 1294680
> Batches Probed: 37
>Worker 6:  Shared Memory Usage: 2688kB  Hashed: 1363240
> Batches Probed: 35
>->  Parallel Seq Scan on big s2 (actual rows=125 loops=8)
>
> enable_parallel_hash = on, multiplex_gather = off (ie no leader
> participation):
>
>  ->  Parallel Hash (actual rows=100 loops=2)
>Buckets: 131072  Batches: 16
>Worker 0:  Shared Memory Usage: 3520kB  Hashed: 475920  Batches
> Probed: 9
>Worker 1:  Shared Memory Usage: 3520kB  Hashed: 524080  Batches
> Probed: 8
>->  Parallel Seq Scan on simple s (actual rows=50 loops=2)
>
> enable_parallel_hash = off, multiplex_gather = on:
>
>  ->  Hash (actual rows=100 loops=3)
>Buckets: 131072  Batches: 16
>Leader:Memory Usage: 3227kB
>Worker 0:  Memory Usage: 3227kB
>Worker 1:  Memory Usage: 3227kB
>->  Seq Scan on simple s (actual rows=100 loops=3)
>
> enable_parallel_hash = off, multiplex_gather = off:
>
>  ->  Hash (actual rows=100 loops=2)
>Buckets: 131072  Batches: 16
>Worker 0:  Memory Usage: 3227kB
>Worker 1:  Memory Usage: 3227kB
>->  Seq Scan on simple s (actual rows=100 loops=2)
>
> parallelism disabled (traditional single-line output, unchanged):
>
>  ->  Hash (actual rows=100 loops=1)
>Buckets: 131072  Batches: 16  Memory Usage: 3227kB
>->  Seq Scan on simple s (actual rows=100 loops=1)
>
> (It actually says "Tuples Hashed", not "Hashed" but I edited the above
> to fit on a standard punchcard.)  Thoughts?
>
> [1] https://www.postgresql.org/message-id/CAEepm%3D2U%2B%
> 2BLp3bNTv2Bv_kkr5NE2pOyHhxU%3DG0YTa4ZhSYhHiw%40mail.gmail.com
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>



-- 
Rushabh Lathia


Re: [HACKERS] Parallel Hash take II

2017-10-25 Thread Thomas Munro
On Tue, Oct 24, 2017 at 10:10 PM, Thomas Munro
 wrote:
> Here is an updated patch set that does that ^.

It's a bit hard to understand what's going on with the v21 patch set I
posted yesterday because EXPLAIN ANALYZE doesn't tell you anything
interesting.  Also, if you apply the multiplex_gather patch[1] I
posted recently and set multiplex_gather to off then it doesn't tell
you anything at all, because the leader has no hash table (I suppose
that could happen with unpatched master given sufficiently bad
timing).  Here's a new version with an extra patch that adds some
basic information about load balancing to EXPLAIN ANALYZE, inspired by
what commit bf11e7ee did for Sort.

Example output:

enable_parallel_hash = on, multiplex_gather = on:

 ->  Parallel Hash (actual rows=100 loops=3)
   Buckets: 131072  Batches: 16
   Leader:Shared Memory Usage: 3552kB  Hashed: 396120  Batches Probed: 7
   Worker 0:  Shared Memory Usage: 3552kB  Hashed: 276640  Batches Probed: 6
   Worker 1:  Shared Memory Usage: 3552kB  Hashed: 327240  Batches Probed: 6
   ->  Parallel Seq Scan on simple s (actual rows=33 loops=3)

 ->  Parallel Hash (actual rows=1000 loops=8)
   Buckets: 131072  Batches: 256
   Leader:Shared Memory Usage: 2688kB  Hashed: 1347720
Batches Probed: 36
   Worker 0:  Shared Memory Usage: 2688kB  Hashed: 1131360
Batches Probed: 33
   Worker 1:  Shared Memory Usage: 2688kB  Hashed: 1123560
Batches Probed: 38
   Worker 2:  Shared Memory Usage: 2688kB  Hashed: 1231920
Batches Probed: 38
   Worker 3:  Shared Memory Usage: 2688kB  Hashed: 1272720
Batches Probed: 34
   Worker 4:  Shared Memory Usage: 2688kB  Hashed: 1234800
Batches Probed: 33
   Worker 5:  Shared Memory Usage: 2688kB  Hashed: 1294680
Batches Probed: 37
   Worker 6:  Shared Memory Usage: 2688kB  Hashed: 1363240
Batches Probed: 35
   ->  Parallel Seq Scan on big s2 (actual rows=125 loops=8)

enable_parallel_hash = on, multiplex_gather = off (ie no leader participation):

 ->  Parallel Hash (actual rows=100 loops=2)
   Buckets: 131072  Batches: 16
   Worker 0:  Shared Memory Usage: 3520kB  Hashed: 475920  Batches Probed: 9
   Worker 1:  Shared Memory Usage: 3520kB  Hashed: 524080  Batches Probed: 8
   ->  Parallel Seq Scan on simple s (actual rows=50 loops=2)

enable_parallel_hash = off, multiplex_gather = on:

 ->  Hash (actual rows=100 loops=3)
   Buckets: 131072  Batches: 16
   Leader:Memory Usage: 3227kB
   Worker 0:  Memory Usage: 3227kB
   Worker 1:  Memory Usage: 3227kB
   ->  Seq Scan on simple s (actual rows=100 loops=3)

enable_parallel_hash = off, multiplex_gather = off:

 ->  Hash (actual rows=100 loops=2)
   Buckets: 131072  Batches: 16
   Worker 0:  Memory Usage: 3227kB
   Worker 1:  Memory Usage: 3227kB
   ->  Seq Scan on simple s (actual rows=100 loops=2)

parallelism disabled (traditional single-line output, unchanged):

 ->  Hash (actual rows=100 loops=1)
   Buckets: 131072  Batches: 16  Memory Usage: 3227kB
   ->  Seq Scan on simple s (actual rows=100 loops=1)

(It actually says "Tuples Hashed", not "Hashed" but I edited the above
to fit on a standard punchcard.)  Thoughts?

[1] 
https://www.postgresql.org/message-id/CAEepm%3D2U%2B%2BLp3bNTv2Bv_kkr5NE2pOyHhxU%3DG0YTa4ZhSYhHiw%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com


parallel-hash-v22.patchset.tgz
Description: GNU Zip compressed data



Re: [HACKERS] Parallel Hash take II

2017-10-24 Thread Thomas Munro
On Tue, Sep 19, 2017 at 8:06 AM, Robert Haas  wrote:
> On Thu, Sep 14, 2017 at 10:01 AM, Thomas Munro
>  wrote:
>> 3.  Gather Merge and Parallel Hash Join may have a deadlock problem.
>
> [...]
>
> Thomas and I spent about an hour and a half brainstorming about this
> just now.
>
> [...]
>
> First, as long as nbatches == 1, we can use a hash
> table of up to size (participants * work_mem); if we have to switch to
> multiple batches, then just increase the number of batches enough that
> the current memory usage drops below work_mem.  Second, following an
> idea originally by Ashutosh Bapat whose relevance to this issue Thomas
> Munro realized during our discussion, we can make all the batches
> small enough to fit in work_mem (rather than participants * work_mem
> as the current patch does) and spread them across the workers (in the
> style of Parallel Append, including potentially deploying multiple
> workers against the same batch if there are fewer batches than
> workers).

Here is an updated patch set that does that ^.

Assorted details:

1.  To avoid deadlocks, we can only wait at barriers when we know that
all other attached backends have either arrived already or are
actively executing the code preceding the barrier wait, so that
progress is guaranteed.  The rule is that executor nodes can remain
attached to a barrier after they've emitted a tuple, which is useful
for resource management (ie avoids inventing a separate reference
counting scheme), but must never again wait for it.  With that
programming rule there can be no deadlock between executor nodes.

2.  Multiple batches are processed at the same time in parallel,
rather than being processed serially.  Participants try to spread
themselves out over different batches to reduce contention.

3.  I no longer try to handle outer joins.  I have an idea for how to
do that while adhering to the above deadlock-avoidance programming
rule, but I would like to consider that for a later patch.

4.  I moved most of the parallel-aware code into ExecParallelHash*()
functions that exist alongside the private hash table versions.  This
avoids uglifying the existing hash join code and introducing
conditional branches all over the place, as requested by Andres.  This
made some of the earlier refactoring patches unnecessary.

5.  Inner batch repartitioning, if required, is now completed up front
for Parallel Hash.  Rather than waiting until we try to load hash
tables back into memory to discover that they don't fit, this version
tracks the size of hash table contents while writing the batches out.
That change has various pros and cons, but its main purpose is to
remove dependencies between batches.

6.  Outer batch repartitioning is now done up front too, if it's
necessary.  This removes the dependencies that exist between batch 0
and later batches, but means that outer batch 0 is now written to disk
for multi-batch joins.  I don't see any way to avoid that while
adhering to the deadlock avoidance rule stated above.  If we've
already started emitting tuples for batch 0 (by returning control to
higher nodes) then we have no deadlock-free way to wait for the scan
of the outer relation to finish, so we can't safely process later
batches.  Therefore we must buffer batch 0's tuples.  This renders the
skew optimisation useless.

7.  There is now some book-keeping state for each batch.  For typical
cases the total space used is negligible but obviously you can
contrive degenerate cases where it eats a lot of memory (say by
creating a million batches, which is unlikely to work well for other
reasons).  I have some ideas on reducing their size, but on the other
hand I also have some ideas on increasing it profitably... (this is
the perfect place to put the Bloom filters discussed elsewhere that
would  make up for the loss of the skew optimisation, for selective
joins; a subject for another day).

8.  Barrier API extended slightly.  (1) BarrierWait() is renamed to
BarrierArriveAndWait().  (2) BarrierArriveAndDetach() is new.  The new
function is the mechanism required to get from PHJ_BATCH_PROBING to
PHJ_BATCH_DONE without waiting, and corresponds to the operation known
as Phaser.arriveAndDeregister() in Java (and maybe also
barrier::arrive_and_drop() in the C++ concurrency TS and Clock.drop()
in X10, not sure).

9.  I got rid of PARALLEL_KEY_EXECUTOR_NODE_NTH().  Previously I
wanted more than one reserved shm_toc key per executor node.  Robert
didn't like that.

10.  I got rid of "LeaderGate".  That earlier attempt at deadlock
avoidance clearly missed the mark.  (I think it probably defended
against the Gather + full TupleQueue deadlock but not the
GatherMerge-induced deadlock so it was a useless non-general
solution.)

11.  The previous incarnation of SharedTuplestore had a built-in
concept of partitions, which allowed the number of partitions to be
expanded any time but only allowed one partition to be read back 

Re: [HACKERS] Parallel Hash take II

2017-10-11 Thread Prabhat Sahu
Hi Thomas,

I was testing this feature with v20 patch, and I got a crash while doing
large joins with small work_mem, and lots of workers as below.

-- Machine Configuration: (d1.xlarge) CPUs: 8, RAM: 16GB, Size: 640GB

-- postgres.conf setting as below:
work_mem = *64kB*
max_parallel_workers_per_gather = *128*
max_parallel_workers = *64*
enable_mergejoin = off
enable_nestloop = off
enable_hashjoin = on
force_parallel_mode = on
seq_page_cost = 0.1
random_page_cost = 0.1
effective_cache_size = 128MB
parallel_tuple_cost = 0
parallel_setup_cost = 0
parallel_synchronization_cost = 0

-- created only one table "lineitem" of size 93GB.
postgres=# select pg_size_pretty(pg_total_relation_size('lineitem'));
 pg_size_pretty
----------------
 93 GB
(1 row)

[centos@centos-prabhat bin]$ vi test10.sql
explain (analyze, costs off)
select  count(*) from lineitem t1 join lineitem t2 using(l_suppkey) join
lineitem t3 using(l_suppkey);
select  count(*) from lineitem t1 join lineitem t2 using(l_suppkey) join
lineitem t3 using(l_suppkey);

[centos@centos-prabhat bin]$ ./psql postgres -a -f test10.sql > test10.out

[centos@centos-prabhat bin]$ vi test10.out
explain (analyze, costs off)
select  count(*) from lineitem t1 join lineitem t2 using(l_suppkey) join
lineitem t3 using(l_suppkey);
psql:test10.sql:2: WARNING:  terminating connection because of crash of
another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
psql:test10.sql:2: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
psql:test10.sql:2: connection to server was lost


Kindly check if you can reproduce the same at your end.



Thanks & Regards,

Prabhat Kumar Sahu
Mob: 7758988455
Skype ID: prabhat.sahu1984

www.enterprisedb.com


On Thu, Oct 5, 2017 at 1:15 PM, Thomas Munro 
wrote:

> On Thu, Oct 5, 2017 at 7:07 PM, Rushabh Lathia 
> wrote:
> > The v20 patch set (I was trying the 0008 and 0009 patches) does not apply
> > cleanly on the latest commit, and I also get a compilation error due to
> > refactoring in the commit below.
> >
> > commit 0c5803b450e0cc29b3527df3f352e6f18a038cc6
>
> Hi Rushabh
>
> I am about to post a new patch set that fixes the deadlock problem,
> but in the meantime here is a rebase of those two patches (numbers
> changed to 0006 + 0007).  In the next version I think I should
> probably remove that 'stripe' concept.  The idea was to spread
> temporary files over the available temporary tablespaces, but it's a
> terrible API, since you have to promise to use the same stripe number
> when opening the same name later... Maybe I should use a hash of the
> name for that instead.  Thoughts?
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>


Re: [HACKERS] Parallel Hash take II

2017-10-05 Thread Thomas Munro
On Thu, Oct 5, 2017 at 7:07 PM, Rushabh Lathia  wrote:
> The v20 patch set (I was trying the 0008 and 0009 patches) does not apply
> cleanly on the latest commit, and I also get a compilation error due to
> refactoring in the commit below.
>
> commit 0c5803b450e0cc29b3527df3f352e6f18a038cc6

Hi Rushabh

I am about to post a new patch set that fixes the deadlock problem,
but in the meantime here is a rebase of those two patches (numbers
changed to 0006 + 0007).  In the next version I think I should
probably remove that 'stripe' concept.  The idea was to spread
temporary files over the available temporary tablespaces, but it's a
terrible API, since you have to promise to use the same stripe number
when opening the same name later... Maybe I should use a hash of the
name for that instead.  Thoughts?

-- 
Thomas Munro
http://www.enterprisedb.com


0006-Remove-BufFile-s-isTemp-flag.patch
Description: Binary data


0007-Add-BufFileSet-for-sharing-temporary-files-between-b.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Hash take II

2017-10-05 Thread Rushabh Lathia
The v20 patch set (I was trying the 0008 and 0009 patches) does not apply
cleanly on the latest commit, and I also get a compilation error due to
refactoring in the commit below.

commit 0c5803b450e0cc29b3527df3f352e6f18a038cc6
Author: Peter Eisentraut 
Date:   Sat Sep 23 09:49:22 2017 -0400

Refactor new file permission handling



On Mon, Sep 25, 2017 at 11:38 AM, Thomas Munro <
thomas.mu...@enterprisedb.com> wrote:

> On Thu, Sep 21, 2017 at 5:49 AM, Peter Geoghegan  wrote:
> > Graefe's "Query Evaluation Techniques for Large Databases" has several
> > pages on deadlock avoidance strategies. It was written almost 25 years
> > ago, but still has some good insights IMV (you'll recall that Graefe
> > is the author of the Volcano paper; this reference paper seems like
> > his follow-up). Apparently, deadlock avoidance strategy becomes
> > important for parallel sort with partitioning. You may be able to get
> > some ideas from there. And even if you don't, his handling of the
> > topic is very deliberate and high level, which suggests that ours
> > should be, too.
>
> Very interesting and certainly relevant (the parts I've read so far),
> though we don't have multiple consumers.  Multiplexing one thread so
> that it is both a consumer and a producer is an extra twist though.
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>



-- 
Rushabh Lathia


Re: [HACKERS] Parallel Hash take II

2017-09-25 Thread Thomas Munro
On Thu, Sep 21, 2017 at 5:49 AM, Peter Geoghegan  wrote:
> Graefe's "Query Evaluation Techniques for Large Databases" has several
> pages on deadlock avoidance strategies. It was written almost 25 years
> ago, but still has some good insights IMV (you'll recall that Graefe
> is the author of the Volcano paper; this reference paper seems like
> his follow-up). Apparently, deadlock avoidance strategy becomes
> important for parallel sort with partitioning. You may be able to get
> some ideas from there. And even if you don't, his handling of the
> topic is very deliberate and high level, which suggests that ours
> should be, too.

Very interesting and certainly relevant (the parts I've read so far),
though we don't have multiple consumers.  Multiplexing one thread so
that it is both a consumer and a producer is an extra twist though.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Hash take II

2017-09-20 Thread Peter Geoghegan
On Mon, Sep 18, 2017 at 1:06 PM, Robert Haas  wrote:
> If we don't adopt some approach along these lines, then I think we've
> got to articulate some alternative deadlock-avoidance rule and make
> sure every parallel query facility follows it.  I welcome ideas on
> that front, but I don't think the rule mentioned above is a bad one,
> and I'd obviously like to minimize the amount of rework that we need
> to do.  Assuming we do settle on the above rule, it clearly needs to
> be documented someplace -- not sure of the place.  I think that it
> doesn't belong in README.parallel because it's an executor-specific
> rule, not necessarily a general rule to which other users of
> parallelism must adhere; they can choose their own strategies.

+1

Graefe's "Query Evaluation Techniques for Large Databases" has several
pages on deadlock avoidance strategies. It was written almost 25 years
ago, but still has some good insights IMV (you'll recall that Graefe
is the author of the Volcano paper; this reference paper seems like
his follow-up). Apparently, deadlock avoidance strategy becomes
important for parallel sort with partitioning. You may be able to get
some ideas from there. And even if you don't, his handling of the
topic is very deliberate and high level, which suggests that ours
should be, too.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Hash take II

2017-09-18 Thread Robert Haas
On Thu, Sep 14, 2017 at 10:01 AM, Thomas Munro
 wrote:
> 3.  Gather Merge and Parallel Hash Join may have a deadlock problem.
> Gather Merge needs to block waiting for tuples, but workers wait
> for all participants (including the leader) to reach barriers.  TPCH
> Q18 (with a certain set of indexes and settings, YMMV) has Gather
> Merge over Sort over Parallel Hash Join, and although it usually runs
> successfully I have observed one deadlock.  Ouch.  This seems to be a
> more fundamental problem than the blocked TupleQueue scenario.  Not
> sure what to do about that.

Thomas and I spent about an hour and a half brainstorming about this
just now.  Parallel query doesn't really have a documented deadlock
avoidance strategy, yet all committed and proposed patches other than
this one manage to avoid deadlock.  This one has had a number of
problems crop up in this area, so it struck me that it might be
violating a rule which every other patch was following.  I struggled
for a bit and finally managed to articulate what I think the
deadlock-avoidance rule is that is generally followed by other
committed and proposed patches:


Once you enter a state in which other participants might wait for you,
you must exit that state before doing anything that might wait for
another participant.


From this, it's easy to see that the waits-for graph can't contain any
cycles: if every parallel query node obeys the above rule, then a
given node can have in-arcs or out-arcs, but not both.  I also believe
it to be the case that every existing node follows this rule.  For
instance, Gather and Gather Merge wait for workers, but they aren't at
that point doing anything that can make the workers wait for them.
Parallel Bitmap Heap Scan waits for the leader to finish building the
bitmap, but that leader never waits for anyone else while building the
bitmap.  Parallel Index(-Only) Scan waits for the process advancing
the scan to reach the next page, but that process never waits for any
other while so doing.  Other types of parallel nodes -- including the
proposed Parallel Append node, which is an interesting case because
like Parallel Hash it appears in the "middle" of the parallel portion
of the plan tree rather than the root like Gather or the leaves like a
parallel scan -- don't wait at all, except for short
spinlock-protected or LWLock-protected critical sections during which
they surely don't go into any sort of long-term wait (which would be
unacceptable for other reasons anyway).

Parallel hash violates this rule only in the case of a multi-batch
hash join, and for only one reason: to avoid blowing out work_mem.
Since, consistent with resource management decisions elsewhere, each
participant is entitled to an amount of memory equal to work_mem, the
shared hash table can and does use up to (participants * work_mem),
which means that we must wait for everybody to be done with the hash
table for batch N before building the hash table for batch N+1.  More
properly, if the hash table for the current batch happens to be
smaller than the absolute maximum amount of memory we can use, we can
build the hash table for the next batch up to the point where all the
memory is used, but must then pause and wait for the old hash table to
go away before continuing.  But that means that the process for which
we are waiting violated the rule mentioned above: by not being done
with the memory, it's making other processes wait, and by returning a
tuple, it's allowing other parts of the executor to do arbitrary
computations which can themselves wait.  So, kaboom.

One simple and stupid way to avoid this deadlock is to reduce the
memory budget for the shared hash table to work_mem and remove the
barriers that prevent more than one such hash table from existing at a
time.  In the worst case, we still use (participants * work_mem),
frequently we'll use less, but there are no longer any waits for
processes that might not even be running the parallel hash node
(ignoring for the moment the problem of right and full parallel hash
joins, which might need more thought).  So no deadlock.

We can do better.  First, as long as nbatches == 1, we can use a hash
table of up to size (participants * work_mem); if we have to switch to
multiple batches, then just increase the number of batches enough that
the current memory usage drops below work_mem.  Second, following an
idea originally by Ashutosh Bapat whose relevance to this issue Thomas
Munro realized during our discussion, we can make all the batches
small enough to fit in work_mem (rather than participants * work_mem
as the current patch does) and spread them across the workers (in the
style of Parallel Append, including potentially deploying multiple
workers against the same batch if there are fewer batches than
workers).  Then, single-batch parallel hash joins use the maximum
allowable memory always, and multi-batch parallel hash joins use the
maximum allowable memory after 
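
As a rough illustration of the arithmetic described a couple of
paragraphs back (this is not the patch's actual sizing code; the
function and parameter names are invented for the example):

/*
 * Hypothetical sketch: with a single batch the budget is
 * participants * work_mem, but once more than one batch is needed, keep
 * doubling nbatch until the in-memory portion of a batch fits within a
 * single work_mem.
 */
static int
choose_nbatch(double inner_bytes, int participants, double work_mem_bytes)
{
    int     nbatch;

    if (inner_bytes <= participants * work_mem_bytes)
        return 1;               /* single batch, shared budget */

    nbatch = 2;
    while (inner_bytes / nbatch > work_mem_bytes)
        nbatch *= 2;            /* each batch must fit within work_mem */

    return nbatch;
}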

Re: [HACKERS] Parallel Hash take II

2017-09-14 Thread Thomas Munro
On Thu, Sep 14, 2017 at 11:57 AM, Thomas Munro
 wrote:
> On Thu, Sep 14, 2017 at 12:51 AM, Prabhat Sahu
>  wrote:
>> With the lower "shared_buffers" and "work_mem" settings below, the query
>> crashes, but I am still able to see the explain plan.
>
> Thanks Prabhat.  A small thinko in the batch reset code means that it
> sometimes thinks the shared skew hash table is present and tries to
> probe it after batch 1.  I have a fix for that and I will post a new
> patch set just as soon as I have a good regression test figured out.

Fixed in the attached version, by adding a missing
"hashtable->shared->num_skew_buckets = 0;" to ExecHashFreeSkewTable().
I did some incidental tidying of the regression tests, but didn't
manage to find a version of your example small enough to put in a
regression tests.  I also discovered some other things:

1.  Multi-batch Parallel Hash Join could occasionally produce a
resowner warning about a leaked temporary File associated with
SharedTupleStore objects.  Fixed by making sure we call routines that
close all files handles in ExecHashTableDetach().

2.  Since last time I tested, a lot fewer TPCH queries choose a
Parallel Hash plan.  Not sure why yet.  Possibly because Gather Merge
and other things got better.  Will investigate.

3.  Gather Merge and Parallel Hash Join may have a deadlock problem.
Gather Merge needs to block waiting for tuples, but workers wait
for all participants (including the leader) to reach barriers.  TPCH
Q18 (with a certain set of indexes and settings, YMMV) has Gather
Merge over Sort over Parallel Hash Join, and although it usually runs
successfully I have observed one deadlock.  Ouch.  This seems to be a
more fundamental problem than the blocked TupleQueue scenario.  Not
sure what to do about that.

-- 
Thomas Munro
http://www.enterprisedb.com


parallel-hash-v20.patchset.tgz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Hash take II

2017-09-13 Thread Thomas Munro
On Thu, Sep 14, 2017 at 12:51 AM, Prabhat Sahu
 wrote:
> With the lower "shared_buffers" and "work_mem" settings below, the query
> crashes, but I am still able to see the explain plan.

Thanks Prabhat.  A small thinko in the batch reset code means that it
sometimes thinks the shared skew hash table is present and tries to
probe it after batch 1.  I have a fix for that and I will post a new
patch set just as soon as I have a good regression test figured out.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Hash take II

2017-09-13 Thread Prabhat Sahu
Hi Thomas,

With the lower "shared_buffers" and "work_mem" settings below, the query
crashes, but I am still able to see the explain plan.

shared_buffers = 1MB
work_mem = 1MB
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
enable_mergejoin = off
enable_nestloop = off
enable_hashjoin = on
force_parallel_mode = on
seq_page_cost = 0.1
random_page_cost = 0.1
effective_cache_size = 128MB
parallel_tuple_cost = 0
parallel_setup_cost = 0
parallel_synchronization_cost = 0

CREATE TABLE t1 (a int, b text);
INSERT INTO t1 (SELECT x%2, x%2||'_b' FROM
generate_series(1,20) x);
ANALYZE;

postgres=# explain select * from t1, t1 t2 where t1.a = t2.a;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Gather  (cost=2852.86..16362.74 rows=2069147 width=22)
   Workers Planned: 1
   ->  Parallel Hash Join  (cost=2852.86..16362.74 rows=1217145 width=22)
         Hash Cond: (t1.a = t2.a)
         ->  Parallel Seq Scan on t1  (cost=0.00..1284.57 rows=117647 width=11)
         ->  Parallel Hash  (cost=1284.57..1284.57 rows=117647 width=11)
               ->  Parallel Seq Scan on t1 t2  (cost=0.00..1284.57 rows=117647 width=11)
(7 rows)

postgres=# select * from t1, t1 t2 where t1.a = t2.a;
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>


-- After assigning more "shared_buffers" (10MB) and "work_mem" (10MB), the
query executes successfully.

Kindly check if you can reproduce this at your end.

Thanks & Regards,

Prabhat Kumar Sahu
Mob: 7758988455
Skype ID: prabhat.sahu1984

www.enterprisedb.com


On Wed, Sep 13, 2017 at 12:34 PM, Prabhat Sahu <
prabhat.s...@enterprisedb.com> wrote:

>
> On Thu, Aug 31, 2017 at 6:23 PM, Thomas Munro <
> thomas.mu...@enterprisedb.com> wrote:
>
>> Here's a new rebased and debugged patch set.
>
>
> Hi Thomas,
>
> I have applied the recent patch (v19) and started testing this feature,
> and I got a crash with the test case below.
>
> with default settings in the "postgresql.conf" file
>
> create table tab1 (a int, b text);
> create table tab2 (a int, b text);
> insert into tab1 (select x, x||'_b' from generate_series(1,20) x);
> insert into tab2 (select x%2, x%2||'_b' from
> generate_series(1,20) x);
> ANALYZE;
> select * from tab1 t1, tab2 t2, tab1 t3 where t1.a = t2.a and  t2.b = t3.b
> order by 1;
>
> WARNING:  terminating connection because of crash of another server process
> DETAIL:  The postmaster has commanded this server process to roll back the
> current transaction and exit, because another server process exited
> abnormally and possibly corrupted shared memory.
> HINT:  In a moment you should be able to reconnect to the database and
> repeat your command.
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> !>
>
> Kindly check if you can reproduce this at your end.
>
>
> Thanks & Regards,
>
> Prabhat Kumar Sahu
> Mob: 7758988455
> Skype ID: prabhat.sahu1984
>
> www.enterprisedb.com
> 
>
>


Re: [HACKERS] Parallel Hash take II

2017-09-13 Thread Prabhat Sahu
On Thu, Aug 31, 2017 at 6:23 PM, Thomas Munro  wrote:

> Here's a new rebased and debugged patch set.


Hi Thomas,

I have applied the recent patch (v19) and started testing this feature,
and I got a crash with the test case below.

with default settings in the "postgresql.conf" file

create table tab1 (a int, b text);
create table tab2 (a int, b text);
insert into tab1 (select x, x||'_b' from generate_series(1,20) x);
insert into tab2 (select x%2, x%2||'_b' from
generate_series(1,20) x);
ANALYZE;
select * from tab1 t1, tab2 t2, tab1 t3 where t1.a = t2.a and  t2.b = t3.b
order by 1;

WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

Kindly check if you can reproduce this at your end.


Thanks & Regards,

Prabhat Kumar Sahu
Mob: 7758988455
Skype ID: prabhat.sahu1984

www.enterprisedb.com



Re: [HACKERS] Parallel Hash take II

2017-09-01 Thread Robert Haas
On Fri, Sep 1, 2017 at 7:42 PM, Thomas Munro
 wrote:
>> I'm thinking about something like this:
>>
>> Gather
>> -> Nested Loop
>>   -> Parallel Seq Scan
>>   -> Hash Join
>> -> Seq Scan
>> -> Parallel Hash
>>   -> Parallel Seq Scan
>>
>> The hash join has to be rescanned for every iteration of the nested loop.
>
> I think you mean:
>
>  Gather
>  -> Nested Loop
>-> Parallel Seq Scan
>-> Parallel Hash Join
>  -> Parallel Seq Scan
>  -> Parallel Hash
>-> Parallel Seq Scan

I don't, though, because that's nonsense.  Maybe what I wrote is also
nonsense, but it is at least different nonsense.

Let's try it again with some table names:

Gather
-> Nested Loop
  -> Parallel Seq Scan on a
  -> (Parallel?) Hash Join
-> Seq Scan on b (NOT A PARALLEL SEQ SCAN)
-> Parallel Hash
  -> Parallel Seq Scan on c

I argue that this is a potentially valid plan.  b, of course, has to
be scanned in its entirety by every worker every time through, which
is why it's not a Parallel Seq Scan, but that requirement does not
apply to c.  If we take all the rows in c and stick them into a
DSM-based hash table, we can reuse them every time the hash join is
rescanned and, AFAICS, that should work just fine, and it's probably a
win over letting each worker build a separate copy of the hash table
on c, too.

Of course, there's the "small" problem that I have no idea what to do
if the b-c join is (or becomes) multi-batch.  When I was thinking
about this before, I was imagining that this case might Just Work with
your patch provided that you could generate a plan shaped like this,
but now I see that that's not actually true, because of multiple
batches.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Hash take II

2017-09-01 Thread Thomas Munro
On Sat, Sep 2, 2017 at 10:45 AM, Robert Haas  wrote:
> On Fri, Sep 1, 2017 at 6:32 PM, Thomas Munro
>  wrote:
>> On Sat, Sep 2, 2017 at 5:13 AM, Robert Haas  wrote:
>>> On Thu, Aug 31, 2017 at 8:53 AM, Thomas Munro
>>>  wrote:
 Check out ExecReScanGather(): it shuts down and waits for all workers
 to complete, which makes the assumptions in ExecReScanHashJoin() true.
 If a node below Gather but above Hash Join could initiate a rescan
 then the assumptions would not hold.  I am not sure what it would mean
 though and we don't generate any such plans today to my knowledge.  It
 doesn't seem to make sense for the inner side of Nested Loop to be
 partial.  Have I missed something here?
>>>
>>> I bet this could happen, although recent commits have demonstrated
>>> that my knowledge of how PostgreSQL handles rescans is less than
>>> compendious.  Suppose there's a Nested Loop below the Gather and above
>>> the Hash Join, implementing a join condition that can't give rise to a
>>> parameterized path, like a.x + b.x = 0.
>>
>> Hmm.  I still don't see how that could produce a rescan of a partial
>> path without an intervening Gather, and I would really like to get to
>> the bottom of this.
>
> I'm thinking about something like this:
>
> Gather
> -> Nested Loop
>   -> Parallel Seq Scan
>   -> Hash Join
> -> Seq Scan
> -> Parallel Hash
>   -> Parallel Seq Scan
>
> The hash join has to be rescanned for every iteration of the nested loop.

I think you mean:

 Gather
 -> Nested Loop
   -> Parallel Seq Scan
   -> Parallel Hash Join
 -> Parallel Seq Scan
 -> Parallel Hash
   -> Parallel Seq Scan

... but we can't make plans like that and they would produce nonsense
output.  The Nested Loop's inner plan is partial, but
consider_parallel_nestloop only makes plans with parallel-safe but
non-partial ("complete") inner paths.

/*
 * consider_parallel_nestloop
 *    Try to build partial paths for a joinrel by joining a partial path
 *    for the outer relation to a complete path for the inner relation.
 *

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Hash take II

2017-09-01 Thread Robert Haas
On Fri, Sep 1, 2017 at 6:32 PM, Thomas Munro
 wrote:
> On Sat, Sep 2, 2017 at 5:13 AM, Robert Haas  wrote:
>> On Thu, Aug 31, 2017 at 8:53 AM, Thomas Munro
>>  wrote:
>>> Check out ExecReScanGather(): it shuts down and waits for all workers
>>> to complete, which makes the assumptions in ExecReScanHashJoin() true.
>>> If a node below Gather but above Hash Join could initiate a rescan
>>> then the assumptions would not hold.  I am not sure what it would mean
>>> though and we don't generate any such plans today to my knowledge.  It
>>> doesn't seem to make sense for the inner side of Nested Loop to be
>>> partial.  Have I missed something here?
>>
>> I bet this could happen, although recent commits have demonstrated
>> that my knowledge of how PostgreSQL handles rescans is less than
>> compendious.  Suppose there's a Nested Loop below the Gather and above
>> the Hash Join, implementing a join condition that can't give rise to a
>> parameterized path, like a.x + b.x = 0.
>
> Hmm.  I still don't see how that could produce a rescan of a partial
> path without an intervening Gather, and I would really like to get to
> the bottom of this.

I'm thinking about something like this:

Gather
-> Nested Loop
  -> Parallel Seq Scan
  -> Hash Join
-> Seq Scan
-> Parallel Hash
  -> Parallel Seq Scan

The hash join has to be rescanned for every iteration of the nested loop.

Maybe I'm confused.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Hash take II

2017-09-01 Thread Thomas Munro
On Sat, Sep 2, 2017 at 5:13 AM, Robert Haas  wrote:
> On Thu, Aug 31, 2017 at 8:53 AM, Thomas Munro
>  wrote:
>> Check out ExecReScanGather(): it shuts down and waits for all workers
>> to complete, which makes the assumptions in ExecReScanHashJoin() true.
>> If a node below Gather but above Hash Join could initiate a rescan
>> then the assumptions would not hold.  I am not sure what it would mean
>> though and we don't generate any such plans today to my knowledge.  It
>> doesn't seem to make sense for the inner side of Nested Loop to be
>> partial.  Have I missed something here?
>
> I bet this could happen, although recent commits have demonstrated
> that my knowledge of how PostgreSQL handles rescans is less than
> compendious.  Suppose there's a Nested Loop below the Gather and above
> the Hash Join, implementing a join condition that can't give rise to a
> parameterized path, like a.x + b.x = 0.

Hmm.  I still don't see how that could produce a rescan of a partial
path without an intervening Gather, and I would really like to get to
the bottom of this.

At the risk of mansplaining the code that you wrote and turning out to
be wrong:  A Nested Loop can't ever have a partial path on the inner
side.  Under certain circumstances it can have a partial path on the
outer side, because its own results are partial, but for each outer
row it needs to do a total (non-partial) scan of the inner side so
that it can reliably find or not find matches.  Therefore we'll never
rescan partial paths directly, we'll only ever rescan partial paths
indirectly via a Gatheroid node that will synchronise the rescan of
all children to produce a non-partial result.

There may be more reasons to rescan that I'm not thinking of.  But the
whole idea of a rescan seems to make sense only for non-partial paths.
What would it even mean for a worker process to decide to rescan (say)
a Seq Scan without any kind of consensus?

Thought experiment: I suppose we could consider replacing Gather's
clunky shut-down-and-relaunch-workers synchronisation technique with a
new protocol where the Gather node sends a 'rescan!' message to each
worker and then discards their tuples until it receives 'OK, rescan
starts here', and then each parallel-aware node type supplies its own
rescan synchronisation logic as appropriate.  For example, Seq Scan
would somehow need to elect one participant to run
heap_parallelscan_reinitialize and others would wait until it has
done.  This might not be worth the effort, but thinking about this
problem helped me see that rescan of a partial plan without a Gather
node to coordinate doesn't make any sense.

Am I wrong?
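
To make the thought experiment a bit more concrete, here is one possible
shape for the per-node synchronisation step, written against the Barrier
API from this patch set.  It is purely speculative and not part of any
patch; the function name is invented, and the zero passed as the
wait_event_info argument is a placeholder:

#include "access/heapam.h"
#include "storage/barrier.h"

/*
 * Speculative sketch: elect exactly one participant to reset the shared
 * scan state during a coordinated rescan, then make everyone wait until
 * the reset is done before continuing.  BarrierArriveAndWait() returns
 * true in exactly one participant per phase, which serves as the
 * election.
 */
static void
parallel_seqscan_rescan(ParallelHeapScanDesc pscan, Barrier *rescan_barrier)
{
    if (BarrierArriveAndWait(rescan_barrier, 0))
        heap_parallelscan_reinitialize(pscan);

    /* Wait for the elected participant to finish before rescanning. */
    BarrierArriveAndWait(rescan_barrier, 0);
}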

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Hash take II

2017-09-01 Thread Robert Haas
On Thu, Aug 31, 2017 at 8:53 AM, Thomas Munro
 wrote:
> Solution 2:  Teach tuple queues to spill to disk instead of blocking
> when full.  I think this behaviour should probably only be activated
> while the leader is running the plan rather than draining tuple
> queues; the current block-when-full behaviour would still be
> appropriate if the leader is simply unable to drain queues fast
> enough.  Then the deadlock risk would go away.
>
> When I wrote it, I figured that leader_gate.c was cheap and would do
> for now, but I have to admit that it's quite confusing and it sucks
> that later batches lose a core.  I'm now thinking that 2 may be a
> better idea.  My first thought is that Gather needs a way to advertise
> that it's busy while running the plan, shm_mq needs a slightly
> different all-or-nothing nowait mode, and TupleQueue needs to write to
> a shared tuplestore or other temp file-backed mechanism when
> appropriate.  Thoughts?

The problem with solution 2 is that it might lead to either (a)
unbounded amounts of stuff getting spooled to files or (b) small spool
files being repeatedly created and deleted depending on how the leader
is spending its time.  If you could spill only when the leader is
actually waiting for the worker, that might be OK.

> Check out ExecReScanGather(): it shuts down and waits for all workers
> to complete, which makes the assumptions in ExecReScanHashJoin() true.
> If a node below Gather but above Hash Join could initiate a rescan
> then the assumptions would not hold.  I am not sure what it would mean
> though and we don't generate any such plans today to my knowledge.  It
> doesn't seem to make sense for the inner side of Nested Loop to be
> partial.  Have I missed something here?

I bet this could happen, although recent commits have demonstrated
that my knowledge of how PostgreSQL handles rescans is less than
compendious.  Suppose there's a Nested Loop below the Gather and above
the Hash Join, implementing a join condition that can't give rise to a
parameterized path, like a.x + b.x = 0.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Hash take II

2017-08-31 Thread Thomas Munro
Here's a new rebased and debugged patch set.

On Tue, Aug 1, 2017 at 1:11 PM, Andres Freund  wrote:
> - Echoing concerns from other threads (Robert: ping): I'm doubtful that
>   it makes sense to size the number of parallel workers solely based on
>   the parallel scan node's size.  I don't think it's this patch's job to
>   change that, but to me it seriously amplifies that - I'd bet there's a
>   lot of cases with nontrivial joins where the benefit from parallelism
>   on the join level is bigger than on the scan level itself.  And the
>   number of rows in the upper nodes might also be bigger than on the
>   scan node level, making it more important to have higher number of
>   nodes.

Agreed that this is bogus.  The number of workers is really determined
by the outer path (the probe side), except that if the inner path (the
build side) is not big enough to warrant parallel workers at all then
parallelism is inhibited on that side.  That prevents small tables
from being loaded by Parallel Hash.  That is something we want, but
it's probably not doing it for the right reasons with the right
threshold -- about which more below.

> - If I understand the code in initial_cost_hashjoin() correctly, we
>   count the synchronization overhead once, independent of the number of
>   workers.  But on the other hand we calculate the throughput by
>   dividing by the number of workers.  Do you think that's right?

It's how long you think the average participant will have to wait for
the last participant to arrive, and I think that's mainly determined
by the parallel grain, not the number of workers.  If you're a worker
that has reached the end of a scan, the best case is that every other
worker has already reached the end too, and the worst case is that
another worker read the last granule (currently a page) just before you
hit the end, so you'll have to wait for it to process a granule's
worth of work.

To show this I used dtrace to measure the number of microseconds spent
waiting at the barrier before probing while running a 5 million row
self-join 100 times, and got the following histograms:

1 worker:

   value  ------------- Distribution ------------- count
 < 0 | 0
   0 |@@   110
  20 | 1
  40 |@5
  60 |@24
  80 |@@@  14
 100 |@5
 120 |@@@  16
 140 |@@@  17
 160 |@@   8
 180 | 0

2 workers:

   value  ------------- Distribution ------------- count
 < 0 | 0
   0 |@@   107
  20 | 1
  40 |@@@  21
  60 |@@@  25
  80 |@@   16
 100 |@@   14
 120 |@38
 140 |@@@  51
 160 |@@@  20
 180 | 3
 200 | 1
 220 | 3
 240 | 0

3 workers:

   value  ------------- Distribution ------------- count
 < 0 | 0
   0 |@@@  113
  20 |@@   15
  40 |@@@  29
  60 | 35
  80 | 37
 100 |@@   56
 120 |@51
 140 |@@@  31
 160 |@@   21
 180 |@6

4 workers:

   value  ------------- Distribution ------------- count
 < 0 | 0
   0 |@@   121
  20 | 4
  40 |@@@  39
  60 |@@   29
  80 |@@

Re: [HACKERS] Parallel Hash take II

2017-08-18 Thread Thomas Munro
On Wed, Aug 2, 2017 at 10:06 PM, Thomas Munro
 wrote:
> On Tue, Aug 1, 2017 at 1:11 PM, Andres Freund  wrote:
>> WRT the main patch:
>
> Thanks for the review.  I will respond soon, but for now I just wanted
> to post a rebased version (no changes) because v16 no longer applies.

Rebased with better commit messages.  Sorry for the changed patch
names, I switched to using git-format properly...  (I'll be posting a
new version with some bigger changes to the 0010 patch and some
answers to good questions you've asked soon.)

-- 
Thomas Munro
http://www.enterprisedb.com


parallel-hash-v18.patchset.tgz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Hash take II

2017-08-02 Thread Thomas Munro
On Tue, Aug 1, 2017 at 1:11 PM, Andres Freund  wrote:
> WRT the main patch:

Thanks for the review.  I will respond soon, but for now I just wanted
to post a rebased version (no changes) because v16 no longer applies.

-- 
Thomas Munro
http://www.enterprisedb.com


parallel-hash-v17.patchset.tgz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Hash take II

2017-07-31 Thread Robert Haas
On Mon, Jul 31, 2017 at 9:11 PM, Andres Freund  wrote:
> - Echoing concerns from other threads (Robert: ping): I'm doubtful that
>   it makes sense to size the number of parallel workers solely based on
>   the parallel scan node's size.  I don't think it's this patch's job to
>   change that, but to me it seriously amplifies that - I'd bet there's a
>   lot of cases with nontrivial joins where the benefit from parallelism
>   on the join level is bigger than on the scan level itself.  And the
>   number of rows in the upper nodes might also be bigger than on the
>   scan node level, making it more important to have higher number of
>   nodes.

Well, I feel like a broken record here but ... yeah, I agree we need
to improve that.  It's probably generally true that the more parallel
operators we add, the more potential benefit there is in doing
something about that problem.  But, like you say, not in this patch.

http://postgr.es/m/CA+TgmoYL-SQZ2gRL2DpenAzOBd5+SW30QB=a4csewtogejz...@mail.gmail.com

I think we could improve things significantly by generating multiple
partial paths with different number of parallel workers, instead of
just picking a number of workers based on the table size and going
with it.  For that to work, though, you'd need something built into
the costing to discourage picking paths with too many workers.  And
you'd need to be OK with planning taking a lot longer when parallelism
is involved, because you'd be carrying around more paths for longer.
There are other problems to solve, too.

I still think, though, that it's highly worthwhile to get at least a
few more parallel operators - and this one in particular - done before
we attack that problem in earnest.  Even with a dumb calculation of
the number of workers, this helps a lot.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Hash take II

2017-07-31 Thread Andres Freund
From: Thomas Munro 
Date: Wed 26 Jul 2017 19:58:20 NZST
Subject: [PATCH] Add support for parallel-aware hash joins.

Hi,

WRT the main patch:

- Echoing concerns from other threads (Robert: ping): I'm doubtful that
  it makes sense to size the number of parallel workers solely based on
  the parallel scan node's size.  I don't think it's this patch's job to
  change that, but to me it seriously amplifies that - I'd bet there's a
  lot of cases with nontrivial joins where the benefit from parallelism
  on the join level is bigger than on the scan level itself.  And the
  number of rows in the upper nodes might also be bigger than on the
  scan node level, making it more important to have higher number of
  nodes.

- If I understand the code in initial_cost_hashjoin() correctly, we
  count the synchronization overhead once, independent of the number of
  workers.  But on the other hand we calculate the throughput by
  dividing by the number of workers.  Do you think that's right?

- I haven't really grokked the deadlock issue you address. Could you
  expand the comments on that? Possibly somewhere central referenced by
  the various parts.

- maybe I'm overly paranoid, but it might not be bad to add some extra
  checks for ExecReScanHashJoin ensuring that it doesn't get called when
  workers are still doing something.

- seems like you're dereffing tuple unnecessarily here:

+   /*
+    * If we detached a chain of tuples, transfer them to the main hash table
+    * or batch storage.
+    */
+   if (regainable_space > 0)
+   {
+       HashJoinTuple tuple;
+
+       tuple = (HashJoinTuple)
+           dsa_get_address(hashtable->area, detached_chain_shared);
+       ExecHashTransferSkewTuples(hashtable, detached_chain,
+                                  detached_chain_shared);
+
+       /* Remove from the total space used. */
+       LWLockAcquire(&hashtable->shared->chunk_lock, LW_EXCLUSIVE);
+       Assert(hashtable->shared->size >= regainable_space);
+       hashtable->shared->size -= regainable_space;
+       LWLockRelease(&hashtable->shared->chunk_lock);
+
+       /*
+        * If the bucket we removed is the same as the bucket the caller just
+        * overflowed, then we can forget about the overflowing part of the
+        * tuple.  It's been moved out of the skew hash table.  Otherwise, the
+        * caller will call again; eventually we'll either succeed in
+        * allocating space for the overflow or reach this case.
+        */
+       if (bucket_to_remove == bucketno)
+       {
+           hashtable->spaceUsedSkew = 0;
+           hashtable->spaceAllowedSkew = 0;
+       }
+   }


- The names here could probably improved some:
+   case WAIT_EVENT_HASH_SHRINKING1:
+   event_name = "Hash/Shrinking1";
+   break;
+   case WAIT_EVENT_HASH_SHRINKING2:
+   event_name = "Hash/Shrinking2";
+   break;
+   case WAIT_EVENT_HASH_SHRINKING3:
+   event_name = "Hash/Shrinking3";
+   break;
+   case WAIT_EVENT_HASH_SHRINKING4:
+   event_name = "Hash/Shrinking4";

- why are we restricting rows_total bit to parallel aware?

+   /*
+    * If parallel-aware, the executor will also need an estimate of the total
+    * number of rows expected from all participants so that it can size the
+    * shared hash table.
+    */
+   if (best_path->jpath.path.parallel_aware)
+   {
+       hash_plan->plan.parallel_aware = true;
+       hash_plan->rows_total = best_path->inner_rows_total;
+   }
+

- seems we need a few more test - I don't think the existing tests are
  properly going to exercise the skew stuff, multiple batches, etc?
  This is nontrivial code, I'd really like to see a high test coverage
  of the new code.

- might not hurt to reindent before the final submission

- Unsurprisingly, please implement the FIXME ;)


Regards,

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Hash take II

2017-07-31 Thread Thomas Munro
On Tue, Aug 1, 2017 at 9:28 AM, Andres Freund  wrote:
> On 2017-07-26 20:12:56 +1200, Thomas Munro wrote:
>> 2.  Simplified costing.  There is now just one control knob
>> "parallel_synchronization_cost", which I charge for each time the
>> participants will wait for each other at a barrier, to be set high
>> enough to dissuade the planner from using Parallel Hash for tiny hash
>> tables that would be faster in a parallel-oblivious hash join.
>> Earlier ideas about modelling the cost of shared memory access didn't
>> work out.
>
> Hm. You say, "didn't work out" - could you expand a bit on that? I'm
> quite doubtful that just accounting for barriers will be good enough.

The earlier approach and some variants I played with were based on the
idea that we should try to estimate the cost of using shared memory.
But there's no precedent for costing the cache hierarchy beyond disk
vs memory, and it depends so much on your hardware (NUMA vs UMA) and
the data distribution.  I have no doubt that variations in memory
access costs are important (for example, it was data distribution that
determined whether big-cache-oblivious-shared-hash-table or
MonetDB-style cache-aware approach won in that paper I've mentioned
here before[1]), but it seems like a hard problem and I didn't feel
like it was necessary.  Or do you have a different idea here?

Another point is that in the earlier versions I was trying to teach
the planner how to choose among Hash, Shared Hash and Parallel Shared
Hash.  The difference in costing between Hash and Shared Hash (one
worker builds, all workers probe) was important and sensitive, because
the only difference between them would be the cost of memory sharing.
When I dropped Shared Hash from the patch set, it no longer seemed
necessary to try to deal with such subtle costing, because Hash and
Parallel Hash (now without the word 'Shared') already have wildly
different costs: the latter is divided over N CPUs.  So I felt I could
get away with a much blunter instrument: just something to avoid
parallel build overheads for tiny tables like TPC-H "nation".

I still wanted something that makes intuitive sense and that could be
set using experimental evidence though.  Parallel_synchronization_cost
is an estimate of how long the average backend will have to wait for
the last backend to complete the phase and arrive at each barrier.
The most interesting case is the build phase: how long will the
last backend make us wait before probing can begin?  Well, that
depends on the parallel grain.  Currently, the ultimate source of all
parallelism in our executor is Parallel Seq Scan and Parallel Index
Scan, and they hand out a page at a time.  Of course, any number of
nodes may sit between the hash join and the scan, and one of them
might include a function that sleeps for 100 years in one backend or
performs a join that generates wildly different numbers of tuples in
each backend.  I don't know what to do about that, other than to
assume we have perfectly spherical cows and reason on the basis of an
expected parallel grain reaching us from the scans.

One thing to note about parallel_synchronization_cost is that the cost
units, where 1 is traditionally the cost of scanning a page, actually
make *some* kind of sense here, though it's a bit tenuous: the last
worker to complete is the one that scans the final pages, while the
others see the scan finished.  What's really wanted here is not simply
page scanning cost but rather a percentage of the total cost that
represents how much extra work the lanterne rouge of backends has to
do.
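
For what it's worth, the charge itself is simple.  A sketch of the shape
(not the actual initial_cost_hashjoin() code; the function name, variable
names and the count of barrier waits are assumptions for the example):

/*
 * Illustration only: charge parallel_synchronization_cost once for every
 * point at which participants are expected to wait for each other.
 */
static void
add_synchronization_charge(double *cost, double parallel_synchronization_cost)
{
    int     expected_barrier_waits = 2;    /* e.g. end of build, end of probe */

    *cost += parallel_synchronization_cost * expected_barrier_waits;
}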

Two relevant projects here are:

1.  David Rowley proposes changing the seq scan grain[2], perhaps
adaptively.  I suppose as this number increases the time at which two
workers finish can vary more greatly.
2.  The parallel-append project introduces a completely different type
of granularity based on unrelated and separately costed subplans
rather than pages.  Perhaps there are things that could be done here
to model the fact that some workers might finish a long time before
others, but I don't know.

Perhaps what parallel hash really needs is not a user-controlled
parallel_synchronization_cost, but some number produced by the planner
to describe the expected distribution of tuple counts over workers.
Armed with something like that and the cost per tuple you might be
able to estimate how long we expect hash join barriers to make you
wait without introducing any new GUCs at all.  I thought about some of
these things a bit but it seemed like a big research project of its
own and I was persuaded in an off-list discussion by Robert to try to
find the simplest thing that would avoid parallel-aware hash for
little tables that are already built very cheaply.

[1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.225.3495
[2] https://www.postgresql.org/message-id/CAKJS1f-XhfQ2-%3D85wgYo5b3WtEs%3Dys%3D2Rsq%3DNuvnmaV4ZsM1XQ%40mail.gmail.com

-- 
Thomas Munro

Re: [HACKERS] Parallel Hash take II

2017-07-31 Thread Andres Freund
Hi,

On 2017-07-26 20:12:56 +1200, Thomas Munro wrote:
> Here is a new version of my parallel-aware hash join patchset.

Yay!

Working on reviewing this. Will send separate emails for individual
patch reviews.


> 2.  Simplified costing.  There is now just one control knob
> "parallel_synchronization_cost", which I charge for each time the
> participants will wait for each other at a barrier, to be set high
> enough to dissuade the planner from using Parallel Hash for tiny hash
> tables that would be faster in a parallel-oblivious hash join.
> Earlier ideas about modelling the cost of shared memory access didn't
> work out.

Hm. You say, "didn't work out" - could you expand a bit on that? I'm
quite doubtful that just accounting for barriers will be good enough.


> I'll report on performance separately.

Looking forward to that ;)

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers