Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-11-02 Thread Thomas Munro
On Fri, Nov 3, 2017 at 2:24 PM, Peter Geoghegan  wrote:
> Thomas Munro  wrote:
>> That way you don't have to opt in to BufFile's
>> double buffering and segmentation schemes just to get shared file
>> clean-up, if for some reason you want direct file handles.
>
> Is that something that you really think is possible?

It's pretty far-fetched, but maybe shared temporary relation files
accessed via smgr.c/md.c?  Or maybe future things that don't want to
read/write through a buffer but instead want to mmap it.
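
To illustrate the mmap case, here's a minimal standalone sketch -- plain
POSIX, nothing from any patch, and the path is made up -- of a consumer
that maps a shared temp file directly instead of pulling it through
BufFile's buffer:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(void)
{
	const char *path = "/tmp/pgsql_tmp_example";	/* hypothetical file */
	int			fd = open(path, O_RDONLY);
	struct stat st;
	char	   *base;

	if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
	{
		fprintf(stderr, "could not open (or empty): %s\n", path);
		return 1;
	}

	/* No read()/write() buffering at all; the kernel page cache is it. */
	base = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (base == MAP_FAILED)
	{
		perror("mmap");
		return 1;
	}

	/* ... consume base[0 .. st.st_size - 1] directly ... */

	munmap(base, (size_t) st.st_size);
	close(fd);
	return 0;
}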

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-11-02 Thread Peter Geoghegan

Thomas Munro  wrote:

I'm going to make an item on my personal TODO list for that. No useful
insights on that right now, though.


I decided to try that, but it didn't really work: fd.h gets included
by front-end code, so I can't very well define a struct and declare
functions that deal in dsm_segment and slock_t.  On the other hand it
does seem a bit better for these shared file sets to work in terms
of File, not BufFile.


Realistically, fd.h has a number of functions that are really owned by
buffile.c already. This sounds fine.


That way you don't have to opt in to BufFile's
double buffering and segmentation schemes just to get shared file
clean-up, if for some reason you want direct file handles.


Is that something that you really think is possible?

--
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-11-02 Thread Thomas Munro
On Wed, Nov 1, 2017 at 2:11 PM, Peter Geoghegan  wrote:
> On Tue, Oct 31, 2017 at 5:07 PM, Thomas Munro
>  wrote:
>> Another complaint is that perhaps fd.c
>> knows too much about buffile.c's business.  For example,
>> RemovePgTempFilesInDir() knows about the ".set" directories created by
>> buffile.c, which might be called a layering violation.  Perhaps the
>> set/directory logic should move entirely into fd.c, so you'd call
>> FileSetInit(FileSet *), not BufFileSetInit(BufFileSet *), and then
>> BufFileOpenShared() would take a FileSet *, not a BufFileSet *.
>> Thoughts?
>
> I'm going to make an item on my personal TODO list for that. No useful
> insights on that right now, though.

I decided to try that, but it didn't really work: fd.h gets included
by front-end code, so I can't very well define a struct and declare
functions that deal in dsm_segment and slock_t.  On the other hand it
does seem a bit better for these shared file sets to work in terms
of File, not BufFile.  That way you don't have to opt in to BufFile's
double buffering and segmentation schemes just to get shared file
clean-up, if for some reason you want direct file handles.  So in
the v24 parallel hash patch set I just posted over in the other thread,
I have moved it into its own translation unit, sharedfileset.c, and made
it work with File objects.  buffile.c knows how to use it as a source
of segment files.  I think that's better.
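
For anyone following along, the rough shape of that interface is below.
These declarations are only a paraphrase of what's described above --
treat the exact names and signatures as assumptions, not the real
sharedfileset.h:

/* A flat, pointer-free struct that lives in a DSM segment. */
typedef struct SharedFileSet SharedFileSet;

/* Creator sets it up; cleanup is tied to the segment's last detach. */
extern void SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg);

/* Other backends attach, so files live until everyone has detached. */
extern void SharedFileSetAttach(SharedFileSet *fileset, dsm_segment *seg);

/* Raw Files, named by the caller, created under the set's directories. */
extern File SharedFileSetCreate(SharedFileSet *fileset, const char *name);
extern File SharedFileSetOpen(SharedFileSet *fileset, const char *name);

buffile.c can then build a shared BufFile by asking the set for one such
File per 1GB segment, which is what "source of segment files" means here.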

> If the new standard is that you have temp file names that suggest the
> purpose of each temp file, then that may be something that parallel
> CREATE INDEX should buy into.

Yeah, I guess that could be useful.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-10-31 Thread Peter Geoghegan
On Tue, Oct 31, 2017 at 5:07 PM, Thomas Munro
 wrote:
> So that's this bit:
>
> + pg_itoa(worker, filename);
> + lts->pfile = BufFileCreateShared(fileset, filename);
>
> ... and:
>
> + pg_itoa(i, filename);
> + file = BufFileOpenShared(fileset, filename);

Right.

> What's wrong with using a worker number like this?

I guess nothing, though there is the question of discoverability for
DBAs, etc. You do address this separately, by having (potentially)
descriptive filenames, as you go into below.

> It's not random choice: buffile.c creates a uniquely named directory
> (or directories, if you have more than one location configured in the
> temp_tablespaces GUC) to hold all the backing files involved in each
> BufFileSet.  Naming of BufFiles within the BufFileSet is the caller's
> problem, and a worker number seems like a reasonable choice to me.  It
> won't collide with a concurrent parallel CREATE INDEX because that'll
> be using its own BufFileSet.

Oh, I see. I may have jumped the gun on that one.

> One complaint about the current coding that someone might object to:
> MakeSharedSegmentPath() just dumps the caller's BufFile name into a
> path without sanitisation: I should fix that so that we only accept
> fairly limited strings here.  Another complaint is that perhaps fd.c
> knows too much about buffile.c's business.  For example,
> RemovePgTempFilesInDir() knows about the ".set" directories created by
> buffile.c, which might be called a layering violation.  Perhaps the
> set/directory logic should move entirely into fd.c, so you'd call
> FileSetInit(FileSet *), not BufFileSetInit(BufFileSet *), and then
> BufFileOpenShared() would take a FileSet *, not a BufFileSet *.
> Thoughts?

I'm going to make an item on my personal TODO list for that. No useful
insights on that right now, though.

> 3.  sharedtuplestore.c takes a caller-supplied BufFileSet and creates
> its shared BufFiles in there.  Earlier versions created and owned a
> BufFileSet, but in the current Parallel Hash patch I create loads of
> separate SharedTuplestore objects but I didn't want to create loads of
> directories to back them, so you can give them all the same
> BufFileSet.  That works because SharedTuplestores are also given a
> name, and it's the caller's job (in my case nodeHash.c) to make sure
> the SharedTuplestores are given unique names within the same
> BufFileSet.  For Parallel Hash you'll see names like 'i3of8' (inner
> batch 3 of 8).  There is no need for there to be any sort of
> central registry for that, though, because it rides on top of the
> guarantees from 2 above: buffile.c will put those files into a
> uniquely named directory, and that works as long as no one else is
> allowed to create files or directories in the temp directory that
> collide with its reserved pattern /^pgsql_tmp.+\.set$/.  For the same
> reason, parallel CREATE INDEX is free to use worker numbers as BufFile
> names, since it has its own BufFileSet to work within.

If the new standard is that you have temp file names that suggest the
purpose of each temp file, then that may be something that parallel
CREATE INDEX should buy into.

> In an earlier version, BufFileSet was one of those annoying data
> structures with a FLEXIBLE_ARRAY_MEMBER that you'd use as an
> incomplete type (declared but not defined in the includable header),
> and here it was being used "inside" (or rather after) SharedSort,
> which *itself* had a FLEXIBLE_ARRAY_MEMBER.  The reason for the
> variable sized object was that I needed all backends to agree on the
> set of temporary tablespace OIDs, of which there could be any number,
> but I also needed a 'flat' (pointer-free) object I could stick in
> relocatable shared memory.  In the newest version I changed that
> flexible array to tablespaces[8], because 8 should be enough
> tablespaces for anyone (TM).

I guess that that's something that you'll need to take up with Andres,
if you haven't already. I have a hard time imagining a single query
needing to use more than that many tablespaces at once, so maybe this
is fine.
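
To make the trade-off concrete, here's a tiny standalone illustration
(the structs are simplified stand-ins, not the patch's real types): a
flexible array member makes the struct's size depend on the tablespace
count and forces the type to stay incomplete in the header, while a
fixed tablespaces[8] keeps it a plain flat struct you can embed anywhere
in shared memory.

#include <stddef.h>
#include <stdio.h>

typedef unsigned int Oid;

/* Variable-size flavour: allocate offsetof(...) + n * sizeof(Oid) bytes. */
typedef struct FlexibleSet
{
	int			ntablespaces;
	Oid			tablespaces[];	/* C99 flexible array member */
} FlexibleSet;

/* Fixed-size flavour: sizeof() is known at compile time, easy to embed. */
typedef struct FixedSet
{
	int			ntablespaces;
	Oid			tablespaces[8];	/* "8 should be enough for anyone" */
} FixedSet;

int
main(void)
{
	int			n = 3;

	printf("flexible, %d tablespaces: %zu bytes\n",
		   n, offsetof(FlexibleSet, tablespaces) + n * sizeof(Oid));
	printf("fixed:                    %zu bytes\n", sizeof(FixedSet));
	return 0;
}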

> I don't really believe anyone uses
> temp_tablespaces for IO load balancing anymore and I hate code like
> the above.  So I think Rushabh should now remove the above-quoted code
> and just use a BufFileSet directly as a member of SharedSort.

FWIW, I agree with you that nobody uses temp_tablespaces this way
these days. This seems like a discussion for your hash join patch,
though. I'm happy to buy into that.

-- 
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-10-31 Thread Thomas Munro
On Wed, Nov 1, 2017 at 11:29 AM, Peter Geoghegan  wrote:
> On Thu, Oct 26, 2017 at 4:22 AM, Rushabh Lathia
>  wrote:
> Attaching the rebased patch according to the v22 parallel-hash patch set
>
> I took a quick look at this today, and noticed a few issues:
>
> * make_name() is used to name files in sharedtuplestore.c, which is
> what is passed to BufFileOpenShared() for parallel hash join. You're
> using your own logic for that within the equivalent logtape.c call to
> BufFileOpenShared(), presumably because make_name() wants to identify
> participants by PID rather than by an ordinal identifier number.

So that's this bit:

+ pg_itoa(worker, filename);
+ lts->pfile = BufFileCreateShared(fileset, filename);

... and:

+ pg_itoa(i, filename);
+ file = BufFileOpenShared(fileset, filename);

What's wrong with using a worker number like this?

> I think that we need some kind of central registry for things that use
> shared buffiles. It could be that sharedtuplestore.c is further
> generalized to support this, or it could be that they both call
> something else that takes care of naming. It's not okay to have this
> left to random chance.

It's not random choice: buffile.c creates a uniquely named directory
(or directories, if you have more than one location configured in the
temp_tablespaces GUC) to hold all the backing files involved in each
BufFileSet.  Naming of BufFiles within the BufFileSet is the caller's
problem, and a worker number seems like a reasonable choice to me.  It
won't collide with a concurrent parallel CREATE INDEX because that'll
be using its own BufFileSet.
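
Spelled out with comments, this is just the snippets quoted above
restated (plus a MAXPGPATH-sized buffer), not new code:

/* Worker side: create a shared BufFile named after the worker number. */
char		filename[MAXPGPATH];

pg_itoa(worker, filename);		/* "0", "1", "2", ... */
lts->pfile = BufFileCreateShared(fileset, filename);

/* Leader side: open worker i's file by the same name, same BufFileSet. */
pg_itoa(i, filename);
file = BufFileOpenShared(fileset, filename);

The names only need to be unique within this sort's own BufFileSet
directory, so "0" here can never collide with another sort's "0" or with
sharedtuplestore.c's "i3of8"-style names.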

> You're going to have to ask Thomas about this.  You should also use
> MAXPGPATH for the char buffer on the stack.

Here's a summary of the namespace management scheme I currently have at
the three layers fd.c, buffile.c, and sharedtuplestore.c:

1.  fd.c provides new lower-level functions
PathNameCreateTemporaryFile(const char *path) and
PathNameOpenTemporaryFile(const char *path).  It also provides
PathNameCreateTemporaryDir().  Clearly callers of these interfaces
will need to be very careful about managing the names they use.
Callers also own the problem of cleaning up files, since there is no
automatic cleanup of files created this way.  My intention was that
these facilities would *only* be used by BufFileSet, since it has
machinery to manage those things.

2.  buffile.c introduces BufFileSet, which is conceptually a set of
BufFiles that can be shared by multiple backends with DSM
segment-scoped cleanup.  It is implemented as a set of directories:
one for each tablespace in temp_tablespaces.  It controls the naming
of those directories.  The BufFileSet directories are named similarly
to fd.c's traditional temporary file names using the usual recipe
"pgsql_tmp" + PID + per-process counter but have an additional ".set"
suffix.  RemovePgTempFilesInDir() recognises directories with that
prefix and suffix as junk left over from a crash when cleaning up.  I
suppose it's that knowledge about reserved name patterns and cleanup
that you are thinking of as a central registry?  As for the BufFiles
that are in a BufFileSet, buffile.c has no opinion on that: the
calling code (parallel CREATE INDEX, sharedtuplestore.c, ...) is
responsible for coming up with its own scheme.  If parallel CREATE
INDEX wants to name shared BufFiles "walrus" and "banana", that's OK
by me, and those files won't collide with anything in another
BufFileSet because each BufFileSet has its own directory (-ies).
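
Concretely, the per-set directory name is built along these lines; the
exact format string below is a guess based on the recipe just described,
so don't take it as the patch's literal output:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define PG_TEMP_FILE_PREFIX "pgsql_tmp"

int
main(void)
{
	static unsigned long tempCounter = 0;	/* per-process counter */
	char		setdirname[64];

	/* "pgsql_tmp" + PID + counter + ".set" */
	snprintf(setdirname, sizeof(setdirname), "%s%d.%lu.set",
			 PG_TEMP_FILE_PREFIX, (int) getpid(), tempCounter++);

	/* Crash cleanup needs only the reserved pattern, not a registry. */
	if (strncmp(setdirname, PG_TEMP_FILE_PREFIX,
				strlen(PG_TEMP_FILE_PREFIX)) == 0 &&
		strcmp(setdirname + strlen(setdirname) - 4, ".set") == 0)
		printf("%s matches the reserved pattern /^pgsql_tmp.+\\.set$/\n",
			   setdirname);
	return 0;
}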

One complaint about the current coding that someone might object to:
MakeSharedSegmentPath() just dumps the caller's BufFile name into a
path without sanitisation: I should fix that so that we only accept
fairly limited strings here.  Another complaint is that perhaps fd.c
knows too much about buffile.c's business.  For example,
RemovePgTempFilesInDir() knows about the ".set" directories created by
buffile.c, which might be called a layering violation.  Perhaps the
set/directory logic should move entirely into fd.c, so you'd call
FileSetInit(FileSet *), not BufFileSetInit(BufFileSet *), and then
BufFileOpenShared() would take a FileSet *, not a BufFileSet *.
Thoughts?

3.  sharedtuplestore.c takes a caller-supplied BufFileSet and creates
its shared BufFiles in there.  Earlier versions created and owned a
BufFileSet, but in the current Parallel Hash patch I create loads of
separate SharedTuplestore objects but I didn't want to create loads of
directories to back them, so you can give them all the same
BufFileSet.  That works because SharedTuplestores are also given a
name, and it's the caller's job (in my case nodeHash.c) to make sure
the SharedTuplestores are given unique names within the same
BufFileSet.  For Parallel Hash you'll see names like 'i3of8' (inner
batch 3 of 8).  There is no need for there to be any sort of
central registry for that, though, because it rides on top of the
guarantees from 2 above: buffile.c 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-10-31 Thread Peter Geoghegan
On Thu, Oct 26, 2017 at 4:22 AM, Rushabh Lathia
 wrote:
> Attaching the rebased patch according to the v22 parallel-hash patch set

I took a quick look at this today, and noticed a few issues:

* make_name() is used to name files in sharedtuplestore.c, which is
what is passed to BufFileOpenShared() for parallel hash join. You're
using your own logic for that within the equivalent logtape.c call to
BufFileOpenShared(), presumably because make_name() wants to identify
participants by PID rather than by an ordinal identifier number.

I think that we need some kind of central registry for things that use
shared buffiles. It could be that sharedtuplestore.c is further
generalized to support this, or it could be that they both call
something else that takes care of naming. It's not okay to have this
left to random chance.

You're going to have to ask Thomas about this. You should also use
MAXPGPATH for the char buffer on the stack.

* This logtape.c comment needs to be updated, as it's no longer true:

 * successfully.  In general, workers can take it that the leader will
 * reclaim space in files under their ownership, and so should not
 * reread from tape.

* Robert hated the comment changes in the header of nbtsort.c. You
might want to change it back, because he is likely to be the one that
commits this.

* You should look for similar comments in tuplesort.c (IIRC a couple
of places will need to be revised).

* tuplesort_begin_common() should actively reject a randomAccess
parallel case using elog(ERROR).

* tuplesort.h should note that randomAccess isn't supported, too.

* What's this all about?:

+ /* Accessor for the SharedBufFileSet that is at the end of Sharedsort. */
+ #define GetSharedBufFileSet(shared)\
+   ((BufFileSet *) (&(shared)->tapes[(shared)->nTapes]))

You can't just cast from one type to the other without regard for the
underlying size of the shared memory buffer, which is what this looks
like to me. This only fails to crash because you're only abusing the
last member in the tapes array for this purpose, and there happens to
be enough shared memory slop that you get away with it. I'm pretty
sure that ltsUnify() ends up clobbering the last/leader tape, which is
a place where BufFileSet is also used, so this is just wrong. You
should rethink the shmem structure a little bit.
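
For what it's worth, the usual fix is to size the shared memory request
explicitly and give the BufFileSet its own aligned spot after the tapes
array, instead of borrowing the array's last element. A simplified
standalone sketch of the arithmetic (the structs and MAXALIGN below are
stand-ins for the real ones):

#include <stddef.h>
#include <stdio.h>

#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(LEN) \
	(((size_t) (LEN) + (MAXIMUM_ALIGNOF - 1)) & ~((size_t) (MAXIMUM_ALIGNOF - 1)))

typedef struct TapeShare { long firstblocknumber; } TapeShare;
typedef struct BufFileSet { int ntablespaces; unsigned int tablespaces[8]; } BufFileSet;

typedef struct Sharedsort
{
	int			nTapes;
	TapeShare	tapes[];		/* flexible array member */
} Sharedsort;

int
main(void)
{
	int			nTapes = 9;		/* e.g. 8 workers + a leader tape */
	size_t		sortsize;
	size_t		total;

	/* Size the tapes array for real... */
	sortsize = MAXALIGN(offsetof(Sharedsort, tapes) + nTapes * sizeof(TapeShare));

	/* ...then reserve a separate, aligned region for the BufFileSet. */
	total = sortsize + MAXALIGN(sizeof(BufFileSet));

	printf("request %zu bytes; BufFileSet at offset %zu\n", total, sortsize);
	return 0;
}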

* There is still that FIXME comment within leader_takeover_tapes(). I
believe that you should still have a leader tape (at least in local
memory in the leader), even though you'll never be able to do anything
with it, since randomAccess is no longer supported. You can remove the
FIXME, and just note that you have a leader tape to be consistent with
the serial case, though recognize that it's not useful. Note that even
with randomAccess, we always had the leader tape, so it's not that
different, really.

I suppose it might make sense to make shared->tapes not have a leader
tape. It hardly matters -- perhaps you should leave it there in order
to keep the code simple, as you'll be keeping the leader tape in local
memory, too. (But it still won't fly to continue to clobber it, of
course -- you still need to find a dedicated place for BufFileSet in
shared memory.)

That's all I have right now.
-- 
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-09-29 Thread Peter Geoghegan
On Wed, Sep 20, 2017 at 2:32 AM, Rushabh Lathia
 wrote:
> Yes, I haven't touched the randomAccess part yet. My initial goal was
> to incorporate the BufFileSet APIs here.

This is going to need a rebase, due to the commit today to remove
replacement selection sort. That much should be easy.

> Sorry, I didn't get this part. Are you talking about your patch changes
> to OpenTemporaryFileInTablespace(), BufFileUnify(), and other changes
> related to ltsUnify()?  If that's the case, I don't think it's required
> with the BufFileSet. Correct me if I am wrong here.

I thought that you'd have multiple BufFiles, which would be
multiplexed (much like a single BufFile itself multiplexes 1GB
segments), so that logtape.c could still recycle space in the
randomAccess case. I guess that that's not a goal now.
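
(For anyone who hasn't looked at buffile.c: a BufFile presents one
logical byte stream but stores it as a series of 1GB physical files, so
a logical offset maps to a (segment, offset) pair. The sketch below is
just that arithmetic, standalone and illustrative, using the 1GB segment
size mentioned above; multiplexing whole BufFiles in logtape.c would
have been the same idea one level up.)

#include <stdio.h>

#define MAX_PHYSICAL_FILESIZE	0x40000000ULL	/* 1GB per underlying file */

int
main(void)
{
	unsigned long long logical_offset = 5368709123ULL;	/* ~5GB into the BufFile */
	unsigned long long segno = logical_offset / MAX_PHYSICAL_FILESIZE;
	unsigned long long seg_offset = logical_offset % MAX_PHYSICAL_FILESIZE;

	printf("logical offset %llu -> segment file %llu, offset %llu\n",
		   logical_offset, segno, seg_offset);
	return 0;
}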

> To be frank, it's too early for me to comment on anything in this area.  I
> need to study this more closely. As an initial goal I was just focused on
> understanding the current implementation of the patch and incorporating
> the BufFileSet APIs.

Fair enough.

-- 
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-09-20 Thread Robert Haas
On Wed, Sep 20, 2017 at 5:32 AM, Rushabh Lathia
 wrote:
> The first application for the tuplesort here is CREATE INDEX, and that
> doesn't need randomAccess. But as you said, and as has been discussed in
> the thread, randomAccess is important and we should certainly put in the
> effort to support it.

There's no direct benefit of working on randomAccess support unless we
have some code that wants to use that support for something.  Indeed,
it would just leave us with code we couldn't test.

While I do agree that there are probably use cases for randomAccess, I
think what we should do right now is try to get this patch reviewed
and committed so that we have parallel CREATE INDEX for btree indexes.
And in so doing, let's keep it as simple as possible.  Parallel CREATE
INDEX for btree indexes is a great feature without adding any more
complexity.

Later, anybody who wants to work on randomAccess support -- and
whatever planner and executor changes are needed to make effective use
of it -- can do so.  For example, one can imagine a plan like this:

Gather
-> Merge Join
  -> Parallel Index Scan
  -> Parallel Sort
    -> Parallel Seq Scan

If the parallel sort reads out all of the output in every worker, then
it becomes legal to do this kind of thing -- it would end up, I think,
being quite similar to Parallel Hash.  However, there's some question
in my mind as to whether we want to do this or, say, hash-partition both
relations and then perform separate joins on each partition.  The
above plan is clearly better than what we can do today, where every
worker would have to repeat the sort, ugh, but I don't know if it's
the best plan.  Fortunately, to get this patch committed, we don't
have to figure that out.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-09-20 Thread Rushabh Lathia
On Wed, Sep 20, 2017 at 5:17 AM, Peter Geoghegan  wrote:

> On Tue, Sep 19, 2017 at 3:21 AM, Rushabh Lathia
>  wrote:
> > As per the earlier discussion in the thread, I did experiment using the
> > BufFileSet interface from parallel-hash-v18.patchset.  I took the other
> > parallel-hash patches as a reference to understand the BufFileSet APIs,
> > and incorporated the changes into parallel CREATE INDEX.
> >
> > In order to achieve the same:
> >
> > - Applied 0007-Remove-BufFile-s-isTemp-flag.patch and
> > 0008-Add-BufFileSet-for-sharing-temporary-files-between-b.patch from the
> > parallel-hash-v18.patchset.
> > - Removed the buffile.c/logtape.c/fd.c changes from the parallel CREATE
> > INDEX v10 patch.
> > - Incorporated the BufFileSet API into the parallel tuple sort for CREATE
> > INDEX.
> > - Changed a few existing functions and added a few new ones to support the
> > BufFileSet changes.
>
> I'm glad that somebody is working on this. (Someone closer to the more
> general work on shared/parallel BufFile infrastructure than I am.)
>
> I do have some quick feedback, and I hope to be able to provide that
> to both you and Thomas, as needed to see this one through. I'm not
> going to get into the tricky details around resource management just
> yet. I'll start with some simpler questions, to get a general sense of
> the plan here.
>
>
Thanks Peter.


> I gather that you're at least aware that your v11 of the patch doesn't
> preserve randomAccess support for parallel sorts, because you didn't
> include my 0002-* testing GUCs patch, which was specifically designed
> to make various randomAccess stuff testable. I also figured this to be
> true because I noticed this FIXME among (otherwise unchanged)
> tuplesort code:
>
>
Yes, I haven't touched the randomAccess part yet. My initial goal was
to incorporate the BufFileSet APIs here.


> > +static void
> > +leader_takeover_tapes(Tuplesortstate *state)
> > +{
> > +   Sharedsort *shared = state->shared;
> > +   int nLaunched = state->nLaunched;
> > +   int j;
> > +
> > +   Assert(LEADER(state));
> > +   Assert(nLaunched >= 1);
> > +   Assert(nLaunched == shared->workersFinished);
> > +
> > +   /*
> > +* Create the tapeset from worker tapes, including a leader-owned
> tape at
> > +* the end.  Parallel workers are far more expensive than logical
> tapes,
> > +* so the number of tapes allocated here should never be excessive.
> FIXME
> > +*/
> > +   inittapestate(state, nLaunched + 1);
> > +   state->tapeset = LogicalTapeSetCreate(nLaunched + 1, shared->tapes,
> > + state->fileset, state->worker);
>
> It's not surprising to me that you do not yet have this part working,
> because much of my design was about changing as little as possible
> above the BufFile interface, in order for tuplesort.c (and logtape.c)
> code like this to "just work" as if it was the serial case.


Right. I just followed your design in your earlier patches.


> It doesn't
> look like you've added the kind of BufFile multiplexing code that I
> expected to see in logtape.c. This is needed to compensate for the
> code removed from fd.c and buffile.c. Perhaps it would help me to go
> look at Thomas' latest parallel hash join patch -- did it gain some
> kind of transparent multiplexing ability that you actually (want to)
> use here?
>
>
Sorry, I didn't get this part. Are you talking about your patch changes
to OpenTemporaryFileInTablespace(), BufFileUnify(), and other changes
related to ltsUnify()?  If that's the case, I don't think it's required
with the BufFileSet. Correct me if I am wrong here.

> Though randomAccess isn't used by CREATE INDEX in general, and so not
> supporting randomAccess within tuplesort.c for parallel callers
> doesn't matter as far as this CREATE INDEX user-visible feature is
> concerned, I still believe that randomAccess is important (IIRC,
> Robert thought so too). Specifically, it seems like a good idea to
> have randomAccess support, both on general principle (why should the
> parallel case be different?), and because having it now will probably
> enable future enhancements to logtape.c. Enhancements that have it
> manage parallel sorts based on partitioning/distribution/bucketing
> [1]. I'm pretty sure that partitioning-based parallel sort is going to
> become very important in the future, especially for parallel
> GroupAggregate. The leader needs to truly own the tapes it reclaims
> from workers in order for all of this to work.
>
>
The first application for the tuplesort here is CREATE INDEX, and that
doesn't need randomAccess. But as you said, and as has been discussed in
the thread, randomAccess is important and we should certainly put in the
effort to support it.

> Questions on where you're going with randomAccess support:
>
> 1. Is randomAccess support a goal for you here at all?
>
> 2. If so, is preserving eager recycling of temp file space during
> randomAccess 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-09-19 Thread Peter Geoghegan
On Tue, Sep 19, 2017 at 3:21 AM, Rushabh Lathia
 wrote:
> As per the earlier discussion in the thread, I did experiment using the
> BufFileSet interface from parallel-hash-v18.patchset.  I took the other
> parallel-hash patches as a reference to understand the BufFileSet APIs,
> and incorporated the changes into parallel CREATE INDEX.
>
> In order to achieve the same:
>
> - Applied 0007-Remove-BufFile-s-isTemp-flag.patch and
> 0008-Add-BufFileSet-for-sharing-temporary-files-between-b.patch from the
> parallel-hash-v18.patchset.
> - Removed the buffile.c/logtape.c/fd.c changes from the parallel CREATE
> INDEX v10 patch.
> - Incorporated the BufFileSet API into the parallel tuple sort for CREATE
> INDEX.
> - Changed a few existing functions and added a few new ones to support the
> BufFileSet changes.

I'm glad that somebody is working on this. (Someone closer to the more
general work on shared/parallel BufFile infrastructure than I am.)

I do have some quick feedback, and I hope to be able to provide that
to both you and Thomas, as needed to see this one through. I'm not
going to get into the tricky details around resource management just
yet. I'll start with some simpler questions, to get a general sense of
the plan here.

I gather that you're at least aware that your v11 of the patch doesn't
preserve randomAccess support for parallel sorts, because you didn't
include my 0002-* testing GUCs patch, which was specifically designed
to make various randomAccess stuff testable. I also figured this to be
true because I noticed this FIXME among (otherwise unchanged)
tuplesort code:

> +static void
> +leader_takeover_tapes(Tuplesortstate *state)
> +{
> +   Sharedsort *shared = state->shared;
> +   int nLaunched = state->nLaunched;
> +   int j;
> +
> +   Assert(LEADER(state));
> +   Assert(nLaunched >= 1);
> +   Assert(nLaunched == shared->workersFinished);
> +
> +   /*
> +* Create the tapeset from worker tapes, including a leader-owned tape at
> +* the end.  Parallel workers are far more expensive than logical tapes,
> +* so the number of tapes allocated here should never be excessive. FIXME
> +*/
> +   inittapestate(state, nLaunched + 1);
> +   state->tapeset = LogicalTapeSetCreate(nLaunched + 1, shared->tapes,
> + state->fileset, state->worker);

It's not surprising to me that you do not yet have this part working,
because much of my design was about changing as little as possible
above the BufFile interface, in order for tuplesort.c (and logtape.c)
code like this to "just work" as if it was the serial case. It doesn't
look like you've added the kind of BufFile multiplexing code that I
expected to see in logtape.c. This is needed to compensate for the
code removed from fd.c and buffile.c. Perhaps it would help me to go
look at Thomas' latest parallel hash join patch -- did it gain some
kind of transparent multiplexing ability that you actually (want to)
use here?

Though randomAccess isn't used by CREATE INDEX in general, and so not
supporting randomAccess within tuplesort.c for parallel callers
doesn't matter as far as this CREATE INDEX user-visible feature is
concerned, I still believe that randomAccess is important (IIRC,
Robert thought so too). Specifically, it seems like a good idea to
have randomAccess support, both on general principle (why should the
parallel case be different?), and because having it now will probably
enable future enhancements to logtape.c. Enhancements that have it
manage parallel sorts based on partitioning/distribution/bucketing
[1]. I'm pretty sure that partitioning-based parallel sort is going to
become very important in the future, especially for parallel
GroupAggregate. The leader needs to truly own the tapes it reclaims
from workers in order for all of this to work.

Questions on where you're going with randomAccess support:

1. Is randomAccess support a goal for you here at all?

2. If so, is preserving eager recycling of temp file space during
randomAccess (materializing a final output tape within the leader)
another goal for you here? Do we need to preserve that property of
serial external sorts, too, so that it remains true that logtape.c
ensures that "the total space usage is essentially just the actual
data volume, plus insignificant bookkeeping and start/stop overhead"?
(I'm quoting from master's logtape.c header comments.)

3. Any ideas on next steps in support of those 2 goals? What problems
do you foresee, if any?

> CREATE INDEX serial_idx ON parallel_sort_test (randint);
>
> Without patch:
>
> Time: 3430054.220 ms (57:10.054)
>
> With patch (max_parallel_workers_maintenance  = 8):
>
> Time: 1163445.271 ms (19:23.445)

This looks very similar to my v10. While I will need to follow up on
this, to make sure, it seems likely that this patch has exactly the
same performance characteristics as v10.

Thanks

[1] 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-08-02 Thread Andres Freund
On 2017-02-10 07:52:57 -0500, Robert Haas wrote:
> On Thu, Feb 9, 2017 at 6:38 PM, Thomas Munro
> > Up until two minutes ago I assumed that policy would leave only two
> > possibilities: you attach to the DSM segment and attach to the
> > SharedBufFileManager successfully or you attach to the DSM segment and
> > then die horribly (but not throw) and the postmaster restarts the
> > whole cluster and blows all temp files away with RemovePgTempFiles().
> > But I see now in the comment of that function that crash-induced
> > restarts don't call that because "someone might want to examine the
> > temp files for debugging purposes".  Given that policy for regular
> > private BufFiles, I don't see why that shouldn't apply equally to
> > shared files: after a crash restart, you may have some junk files that
> > won't be cleaned up until your next clean restart, whether they were
> > private or shared BufFiles.
> 
> I think most people (other than Tom) would agree that that policy
> isn't really sensible any more; it probably made sense when the
> PostgreSQL user community was much smaller and consisted mostly of the
> people developing PostgreSQL, but these days it's much more likely to
> cause operational headaches than to help a developer debug.

FWIW, we have restart_after_crash = false. If you need to debug things,
you can enable that. Hence the whole RemovePgTempFiles() crash-restart
exemption isn't required anymore; we have a much more targeted solution.

- Andres




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-03-22 Thread Peter Geoghegan
On Wed, Mar 22, 2017 at 5:44 AM, Robert Haas  wrote:
>> Actually, that's quite possible with the design I came up with.
>
> I don't think it is.  What sequence of calls to the APIs you've
> proposed would accomplish that goal?  I don't see anything in this
> patch set that would permit anything other than a handoff from the
> worker to the leader.  There seems to be no way for the ref count to
> be more than 1 (or 2?).

See my remarks on this below.

>> The
>> restriction that Thomas can't live with as I've left things is that
>> you have to know the number of BufFiles ahead of time. I'm pretty sure
>> that that's all it is. (I do sympathize with the fact that that isn't
>> very helpful to him, though.)
>
> I feel like there's some cognitive dissonance here.  On the one hand,
> you're saying we should use your design.

No, I'm not. I'm saying that my design is complete on its own terms,
and has some important properties that a mechanism like this ought to
have. I think I've been pretty clear on my general uncertainty about
the broader question.

> On the other hand, you are
> admitting that in at least one key respect, it won't meet Thomas's
> requirements.  On the third hand, you just said that you weren't
> arguing for two mechanisms for sharing a BufFile across cooperating
> parallel processes.  I don't see how you can hold all three of those
> positions simultaneously.

I respect your position as the person that completely owns parallelism
here. You are correct when you say that there has to be some overlap
between the requirements for the mechanisms used by each patch --
there just *has* to be. As I said, I only know very approximately how
much overlap that is or should be, even at this late date, and I am
unfortunately not in a position to spend more time on it to find out.
C'est la vie.

I know that I have no chance of convincing you to adopt my design
here, and you are right not to accept the design, because there is a
bigger picture. And, because it's just too late now. My efforts to get
ahead of that, and anticipate and provide for Thomas' requirements
have failed. I admit that. But, you are asserting that my patch has
specific technical defects that it does not have.

I structured things this way for a reason. You are not required to
agree with me in full to see that I might have had a point. I've
described it as a trade-off already. I think that it will be of
practical value to you to see that trade-off. This insight is what
allowed me to immediately zero in on resource leak bugs in Thomas'
revision of the patch from yesterday.

>> It is true that a worker resowner can unlink() the files
>> mid-unification, in the same manner as with conventional temp files,
>> and not decrement its refcount in shared memory, or care at all in any
>> special way. This is okay because the leader (in the case of parallel
>> tuplesort) will realize that it should not "turn out the lights",
>> finding that remaining reference when it calls BufFileClose() in
>> registered callback, as it alone must. It doesn't matter that the
>> unlink() may have already occurred, or may be just about to occur,
>> because we are only operating on already-opened files, and never on
>> the link itself (we don't have to stat() the file link for example,
>> which is naturally only a task for the unlink()'ing backend anyway).
>> You might say that the worker only blows away the link itself, not the
>> file proper, since it may still be open in leader (say).
>
> Well, that sounds like it's counting on fd.c not to close the file
> descriptor at an inconvenient point in time and reopen it later, which
> is not guaranteed.

It's true that in an error path, if the FD of the file we just opened
gets swapped out, that could happen. That seems virtually impossible,
and in any case the consequence is no worse than a confusing LOG
message. But, yes, that's a weakness.

>> This is the only way to avoid the unlink()-ENOENT-ignore hack, AFAICT,
>> since only the worker itself can reliably know how many segments it
>> has opened at every single instant in time. Because it's the owner!
>
> Above, you said that your design would allow for a group of processes
> to share access to a file, with the last one that abandons it "turning
> out the lights".  But here, you are referring to it as having one
> owner - the "only the worker itself" can know the number of segments.
> Those things are exact opposites of each other.

You misunderstood.

Under your analogy, the worker needs to wait for someone else to enter
the room before leaving, because otherwise, as an "environmentally
conscious" worker, it would be compelled to turn the lights out before
anyone else ever got to do anything with its files. But once someone
else is in the room, the worker is free to leave without turning out
the lights. I could provide a mechanism for the leader, or whatever
the other backend is, to do another hand off. You're right that that
is left 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-03-22 Thread Robert Haas
On Tue, Mar 21, 2017 at 7:37 PM, Peter Geoghegan  wrote:
>>> Isn't that an essential part of having a refcount, in general? You
>>> were the one that suggested refcounting.
>>
>> No, quite the opposite.  My point in suggesting adding a refcount was
>> to avoid needing to have a single owner.  Instead, the process that
>> decrements the reference count to zero becomes responsible for doing
>> the cleanup.  What you've done with the ref count is use it as some
>> kind of medium for transferring responsibility from backend A to
>> backend B; what I want is to allow backends A, B, C, D, E, and F to
>> attach to the same shared resource, and whichever one of them happens
>> to be the last one out of the room shuts off the lights.
>
> Actually, that's quite possible with the design I came up with.

I don't think it is.  What sequence of calls to the APIs you've
proposed would accomplish that goal?  I don't see anything in this
patch set that would permit anything other than a handoff from the
worker to the leader.  There seems to be no way for the ref count to
be more than 1 (or 2?).

> The
> restriction that Thomas can't live with as I've left things is that
> you have to know the number of BufFiles ahead of time. I'm pretty sure
> that that's all it is. (I do sympathize with the fact that that isn't
> very helpful to him, though.)

I feel like there's some cognitive dissonance here.  On the one hand,
you're saying we should use your design.  On the other hand, you are
admitting that in at least one key respect, it won't meet Thomas's
requirements.  On the third hand, you just said that you weren't
arguing for two mechanisms for sharing a BufFile across cooperating
parallel processes.  I don't see how you can hold all three of those
positions simultaneously.

>> As I've said before, I think that's an anti-goal.  This is a different
>> problem, and trying to reuse the solution we chose for the
>> non-parallel case doesn't really work.  resowner.c could end up owning
>> a shared reference count which it's responsible for decrementing --
>> and then decrementing it removes the file if the result is zero.  But
>> it can't own performing the actual unlink(), because then we can't
>> support cases where the file may have multiple readers, since whoever
>> owns the unlink() might try to zap the file out from under one of the
>> others.
>
> Define "zap the file". I think, based on your remarks here, that
> you've misunderstood my design. I think you should at least understand
> it fully if you're going to dismiss it.

zap was a colloquialism for unlink().  I concede that I don't fully
understand your design, and am trying to understand those things I do
not yet understand.

> It is true that a worker resowner can unlink() the files
> mid-unification, in the same manner as with conventional temp files,
> and not decrement its refcount in shared memory, or care at all in any
> special way. This is okay because the leader (in the case of parallel
> tuplesort) will realize that it should not "turn out the lights",
> finding that remaining reference when it calls BufFileClose() in
> registered callback, as it alone must. It doesn't matter that the
> unlink() may have already occurred, or may be just about to occur,
> because we are only operating on already-opened files, and never on
> the link itself (we don't have to stat() the file link for example,
> which is naturally only a task for the unlink()'ing backend anyway).
> You might say that the worker only blows away the link itself, not the
> file proper, since it may still be open in leader (say).

Well, that sounds like it's counting on fd.c not to close the file
descriptor at an inconvenient point in time and reopen it later, which
is not guaranteed.

> Thomas' design cannot reliably know how many segments there are in
> workers in error paths, which necessitates his unlink()-ENOENT-ignore
> hack. My solution is that workers/owners look after their own temp
> segments in the conventional way, until they reach BufFileClose(),
> which may never come if there is an error. The only way that clean-up
> won't happen in conventional resowner.c-in-worker fashion is if
> BufFileClose() is reached in owner/worker. BufFileClose() must be
> reached when there is no error, which has to happen anyway when using
> temp files. (Else there is a temp file leak warning from resowner.c.)
>
> This is the only way to avoid the unlink()-ENOENT-ignore hack, AFAICT,
> since only the worker itself can reliably know how many segments it
> has opened at every single instant in time. Because it's the owner!

Above, you said that your design would allow for a group of processes
to share access to a file, with the last one that abandons it "turning
out the lights".  But here, you are referring to it as having one
owner - the "only the worker itself" can know the number of segments.
Those things are exact opposites of each other.

I don't think there's any problem with ignoring 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-03-21 Thread Peter Geoghegan
On Tue, Mar 21, 2017 at 2:49 PM, Thomas Munro
 wrote:
> I'm going to experiment with refactoring the v10 parallel CREATE INDEX
> patch to use the SharedBufFileSet interface from
> hj-shared-buf-file-v8.patch today and see what problems I run into.

I would be happy if you took over parallel CREATE INDEX completely. It
makes a certain amount of sense, and not just because I am no longer
able to work on it.

You're the one doing things with shared BufFiles that are of
significant complexity. Certainly more complicated than what parallel
CREATE INDEX needs in every way, and necessarily so. I will still have
some more feedback on your shared BufFile design, though, while it's
fresh in my mind.

-- 
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-03-21 Thread Peter Geoghegan
On Tue, Mar 21, 2017 at 2:03 PM, Robert Haas  wrote:
> I agree that the extent to which code reuse is possible here is
> somewhat unclear, but I am 100% confident that the answer is non-zero.
> You and Thomas both need BufFiles that can be shared across multiple
> backends associated with the same ParallelContext.  I don't understand
> how you can argue that it's reasonable to have two different ways of
> sharing the same kind of object across the same set of processes.

I didn't argue that. Rather, I argued that there are going to be
significant additional requirements for PHJ, because it has to support
arbitrarily many BufFiles, rather than either 1 or 2 (one per
tuplesort/logtapeset). Just how "significant" that would be I cannot
say, regrettably. (Or, we're going to have to make logtape.c multiplex
BufFiles, which risks breaking other logtape.c routines that aren't
even used just yet.)

>> Isn't that an essential part of having a refcount, in general? You
>> were the one that suggested refcounting.
>
> No, quite the opposite.  My point in suggesting adding a refcount was
> to avoid needing to have a single owner.  Instead, the process that
> decrements the reference count to zero becomes responsible for doing
> the cleanup.  What you've done with the ref count is use it as some
> kind of medium for transferring responsibility from backend A to
> backend B; what I want is to allow backends A, B, C, D, E, and F to
> attach to the same shared resource, and whichever one of them happens
> to be the last one out of the room shuts off the lights.

Actually, that's quite possible with the design I came up with. The
restriction that Thomas can't live with as I've left things is that
you have to know the number of BufFiles ahead of time. I'm pretty sure
that that's all it is. (I do sympathize with the fact that that isn't
very helpful to him, though.)

> As I've said before, I think that's an anti-goal.  This is a different
> problem, and trying to reuse the solution we chose for the
> non-parallel case doesn't really work.  resowner.c could end up owning
> a shared reference count which it's responsible for decrementing --
> and then decrementing it removes the file if the result is zero.  But
> it can't own performing the actual unlink(), because then we can't
> support cases where the file may have multiple readers, since whoever
> owns the unlink() might try to zap the file out from under one of the
> others.

Define "zap the file". I think, based on your remarks here, that
you've misunderstood my design. I think you should at least understand
it fully if you're going to dismiss it.

It is true that a worker resowner can unlink() the files
mid-unification, in the same manner as with conventional temp files,
and not decrement its refcount in shared memory, or care at all in any
special way. This is okay because the leader (in the case of parallel
tuplesort) will realize that it should not "turn out the lights",
finding that remaining reference when it calls BufFileClose() in
registered callback, as it alone must. It doesn't matter that the
unlink() may have already occurred, or may be just about to occur,
because we are only operating on already-opened files, and never on
the link itself (we don't have to stat() the file link for example,
which is naturally only a task for the unlink()'ing backend anyway).
You might say that the worker only blows away the link itself, not the
file proper, since it may still be open in leader (say).

** We rely on the fact that files are themselves a kind of reference
counted thing, in general; they have an independent existence from the
link originally used to open() them. **
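
That kernel-level behaviour is easy to demonstrate in isolation. The
snippet below is plain POSIX, not patch code: an unlinked-but-open file
stays readable until the last descriptor is closed.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	const char *path = "/tmp/refcount_demo";
	char		buf[6] = {0};
	int			fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return 1;
	(void) write(fd, "hello", 5);

	/* "Blow away the link itself, not the file proper." */
	unlink(path);

	/* The still-open descriptor keeps the file's data alive. */
	lseek(fd, 0, SEEK_SET);
	(void) read(fd, buf, 5);
	printf("still readable after unlink: %s\n", buf);

	/* Only when the last reference goes away is the storage reclaimed. */
	close(fd);
	return 0;
}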

The reason that there is a brief wait in workers for parallel
tuplesort is that it gives us the opportunity to have the
immediately subsequent worker BufFileClose() not turn out the lights
in worker, because leader must have a reference on the BufFile when
workers are released. So, there is a kind of interlock that makes sure
that there is always at least 1 owner.

** There would be no need for an additional wait but for the fact the
leader wants to unify multiple worker BufFiles as one, and must open
them all at once for the sake of simplicity. But that's just how
parallel tuplesort in particular happens to work, since it has only
one BufFile in the leader, which it wants to operate on with
everything set up up-front. **

Thomas' design cannot reliably know how many segments there are in
workers in error paths, which necessitates his unlink()-ENOENT-ignore
hack. My solution is that workers/owners look after their own temp
segments in the conventional way, until they reach BufFileClose(),
which may never come if there is an error. The only way that clean-up
won't happen in conventional resowner.c-in-worker fashion is if
BufFileClose() is reached in owner/worker. BufFileClose() must be
reached when there is no error, which has to happen anyway when using
temp files. (Else there is 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-03-21 Thread Thomas Munro
On Wed, Mar 22, 2017 at 10:03 AM, Robert Haas  wrote:
> On Tue, Mar 21, 2017 at 3:50 PM, Peter Geoghegan  wrote:
>> I disagree with that. It is a
>> trade-off, I suppose. I have now run out of time to work through it
>> with you or Thomas, though.
>
> Bummer.

I'm going to experiment with refactoring the v10 parallel CREATE INDEX
patch to use the SharedBufFileSet interface from
hj-shared-buf-file-v8.patch today and see what problems I run into.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-03-21 Thread Robert Haas
On Tue, Mar 21, 2017 at 3:50 PM, Peter Geoghegan  wrote:
> On Tue, Mar 21, 2017 at 12:06 PM, Robert Haas  wrote:
>> From my point of view, the main point is that having two completely
>> separate mechanisms for managing temporary files that need to be
>> shared across cooperating workers is not a good decision.  That's a
>> need that's going to come up over and over again, and it's not
>> reasonable for everybody who needs it to add a separate mechanism for
>> doing it.  We need to have ONE mechanism for it.
>
> Obviously I understand that there is value in code reuse in general.
> The exact extent to which code reuse is possible here has been unclear
> throughout, because it's complicated for all kinds of reasons. That's
> why Thomas and I had 2 multi-hour Skype calls all about it.

I agree that the extent to which code reuse is possible here is
somewhat unclear, but I am 100% confident that the answer is non-zero.
You and Thomas both need BufFiles that can be shared across multiple
backends associated with the same ParallelContext.  I don't understand
how you can argue that it's reasonable to have two different ways of
sharing the same kind of object across the same set of processes.  And
if that's not reasonable, then somehow we need to come up with a
single mechanism that can meet both your requirements and Thomas's
requirements.

>> It's just not OK in my book for a worker to create something that it
>> initially owns and then later transfer it to the leader.
>
> Isn't that an essential part of having a refcount, in general? You
> were the one that suggested refcounting.

No, quite the opposite.  My point in suggesting adding a refcount was
to avoid needing to have a single owner.  Instead, the process that
decrements the reference count to zero becomes responsible for doing
the cleanup.  What you've done with the ref count is use it as some
kind of medium for transferring responsibility from backend A to
backend B; what I want is to allow backends A, B, C, D, E, and F to
attach to the same shared resource, and whichever one of them happens
to be the last one out of the room shuts off the lights.
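
A stripped-down illustration of that protocol, with POSIX threads and
C11 atomics standing in for backends and shared memory (the real thing
would use a spinlock-protected counter in a DSM segment): every
participant is counted before anyone can detach, and whichever one
decrements the count to zero does the cleanup -- no handoff, no
designated owner.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define NBACKENDS 4

static atomic_int refcnt;		/* stand-in for a counter in shared memory */

static void
detach(int id)
{
	/* Whoever drops the count to zero is the last one out and cleans up. */
	if (atomic_fetch_sub(&refcnt, 1) == 1)
		printf("backend %d turns out the lights (unlink the files here)\n", id);
	else
		printf("backend %d detaches; someone else will clean up\n", id);
}

static void *
backend(void *arg)
{
	int			id = (int) (long) arg;

	usleep((useconds_t) ((NBACKENDS - id) * 1000));	/* finish in varying order */
	detach(id);
	return NULL;
}

int
main(void)
{
	pthread_t	tids[NBACKENDS];

	/*
	 * All participants are accounted for up front; in PostgreSQL each
	 * backend would bump the count when it attaches to the DSM segment.
	 */
	atomic_store(&refcnt, NBACKENDS);

	for (long i = 0; i < NBACKENDS; i++)
		pthread_create(&tids[i], NULL, backend, (void *) i);
	for (int i = 0; i < NBACKENDS; i++)
		pthread_join(tids[i], NULL);
	return 0;
}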

>> The cooperating backends should have joint ownership of the objects from
>> the beginning, and the last process to exit the set should clean up
>> those resources.
>
> That seems like a facile summary of the situation. There is a sense in
> which there is always joint ownership of files with my design. But
> there is also a sense in which there isn't, because it's impossible to
> do that while not completely reinventing resource management of temp
> files. I wanted to preserve resowner.c ownership of fd.c segments.

As I've said before, I think that's an anti-goal.  This is a different
problem, and trying to reuse the solution we chose for the
non-parallel case doesn't really work.  resowner.c could end up owning
a shared reference count which it's responsible for decrementing --
and then decrementing it removes the file if the result is zero.  But
it can't own performing the actual unlink(), because then we can't
support cases where the file may have multiple readers, since whoever
owns the unlink() might try to zap the file out from under one of the
others.

> You maintain that it's better to have the leader unlink() everything
> at the end, and suppress the errors when that doesn't work, so that
> that path always just plows through.

I don't want the leader to be responsible for anything.  I want the
last process to detach to be responsible for cleanup, regardless of
which process that ends up being.  I want that for lots of good
reasons which I have articulated including (1) it's how all other
resource management for parallel query already works, e.g. DSM, DSA,
and group locking; (2) it avoids the need for one process to sit and
wait until another process assumes ownership, which isn't a feature
even if (as you contend, and I'm not convinced) it doesn't hurt much;
and (3) it allows for use cases where multiple processes are reading
from the same shared BufFile without the risk that some other process
will try to unlink() the file while it's still in use.  The point for
me isn't so much whether unlink() ever ignores errors as whether
cleanup (however defined) is an operation guaranteed to happen exactly
once.
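
In PostgreSQL terms, that exactly-once guarantee is what a DSM detach
callback plus a shared counter gives you. A rough sketch follows:
on_dsm_detach, the spinlock macros, and the dsm/Datum types are real,
but the struct and function names here are invented for illustration.

typedef struct SharedCleanupState
{
	slock_t		mutex;
	int			refcnt;			/* number of attached backends */
} SharedCleanupState;

static void
cleanup_on_detach(dsm_segment *seg, Datum arg)
{
	SharedCleanupState *state = (SharedCleanupState *) DatumGetPointer(arg);
	bool		last;

	SpinLockAcquire(&state->mutex);
	last = (--state->refcnt == 0);
	SpinLockRelease(&state->mutex);

	if (last)
	{
		/* exactly-once cleanup: remove the temp files/directories here */
	}
}

static void
cleanup_attach(dsm_segment *seg, SharedCleanupState *state)
{
	SpinLockAcquire(&state->mutex);
	state->refcnt++;
	SpinLockRelease(&state->mutex);

	/* runs when this backend detaches, whether by error or normal exit */
	on_dsm_detach(seg, cleanup_on_detach, PointerGetDatum(state));
}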

> I disagree with that. It is a
> trade-off, I suppose. I have now run out of time to work through it
> with you or Thomas, though.

Bummer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-03-21 Thread Peter Geoghegan
On Tue, Mar 21, 2017 at 12:06 PM, Robert Haas  wrote:
> From my point of view, the main point is that having two completely
> separate mechanisms for managing temporary files that need to be
> shared across cooperating workers is not a good decision.  That's a
> need that's going to come up over and over again, and it's not
> reasonable for everybody who needs it to add a separate mechanism for
> doing it.  We need to have ONE mechanism for it.

Obviously I understand that there is value in code reuse in general.
The exact extent to which code reuse is possible here has been unclear
throughout, because it's complicated for all kinds of reasons. That's
why Thomas and I had 2 multi-hour Skype calls all about it.

> It's just not OK in my book for a worker to create something that it
> initially owns and then later transfer it to the leader.

Isn't that an essential part of having a refcount, in general? You
were the one that suggested refcounting.

> The cooperating backends should have joint ownership of the objects from
> the beginning, and the last process to exit the set should clean up
> those resources.

That seems like a facile summary of the situation. There is a sense in
which there is always joint ownership of files with my design. But
there is also a sense in which there isn't, because it's impossible to
do that while not completely reinventing resource management of temp
files. I wanted to preserve resowner.c ownership of fd.c segments.

You maintain that it's better to have the leader unlink() everything
at the end, and suppress the errors when that doesn't work, so that
that path always just plows through. I disagree with that. It is a
trade-off, I suppose. I have now run out of time to work through it
with you or Thomas, though.

> But even if were true that the waits will always be brief, I still
> think the way you've done it is a bad idea, because now tuplesort.c
> has to know that it needs to wait because of some detail of
> lower-level resource management about which it should not have to
> care.  That alone is a sufficient reason to want a better approach.

There is already a point at which the leader needs to wait, so that it
can accumulate stats that nbtsort.c cares about. So we already need a
leader wait point within nbtsort.c (that one is called directly by
nbtsort.c). Doesn't seem like too bad of a wart to have the same thing
for workers.

>> I believe that the main reason that you like the design I came up with
>> on the whole is that it's minimally divergent from the serial case.
>
> That's part of it, I guess, but it's more that the code you've added
> to do parallelism here looks an awful lot like what's gotten added to
> do parallelism in other cases, like parallel query.  That's probably a
> good sign.

It's also a good sign that it makes CREATE INDEX approximately 3 times faster.

-- 
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-03-21 Thread Robert Haas
On Tue, Mar 21, 2017 at 2:03 PM, Peter Geoghegan  wrote:
> I think that since the comment refers to code from before 1999, it can
> go. Any separate patch to remove it would have an entirely negative
> linediff.

It's a good general principle that a patch should do one thing well
and not make unrelated changes.  I try hard to adhere to that
principle in my commits, and I think other committers generally do
(and should), too.  Of course, different people draw the line in
different places.  If you can convince another committer to include
that change in their commit of this patch, well, that's not my cup of
tea, but so be it.  If you want me to consider committing this, you're
going to have to submit that part separately, preferably on a separate
thread with a suitably descriptive subject line.

> Obviously iff you write to the file in the leader, there is little
> that the worker can do afterwards, but it's not a given that you'd
> want to do that, and this patch actually never does. You could equally
> well say that PHJ fails to provide for my requirement for having the
> leader write to the files sensibly in order to recycle blocks, a
> requirement that its shared BufFile mechanism expressly does not
> support.

From my point of view, the main point is that having two completely
separate mechanisms for managing temporary files that need to be
shared across cooperating workers is not a good decision.  That's a
need that's going to come up over and over again, and it's not
reasonable for everybody who needs it to add a separate mechanism for
doing it.  We need to have ONE mechanism for it.

The second point is that I'm pretty convinced that the design you've
chosen is fundamentally wrong.  I've attempt to explain that multiple
times, starting about three months ago with
http://postgr.es/m/ca+tgmoyp0vzpw64dfmqt1jhy6szyavjoglkj3ermzzzn2f9...@mail.gmail.com
and continuing across many subsequent emails on multiple threads.
It's just not OK in my book for a worker to create something that it
initially owns and then later transfer it to the leader.  The
cooperating backends should have joint ownership of the objects from
the beginning, and the last process to exit the set should clean up
those resources.

>> That would cut hundreds of
>> lines from this patch with no real disadvantage that I can see --
>> including things like worker_wait(), which are only needed because of
>> the shortcomings of the underlying mechanism.
>
> I think it would definitely be a significant net gain in LOC. And,
> worker_wait() will probably be replaced by the use of the barrier
> abstraction anyway.

No, because if you do it Thomas's way, the worker can exit right away,
without waiting.  You don't have to wait via a different method; you
escape waiting altogether.  I understand that your point is that the
wait will always be brief, but I think that's probably an optimistic
assumption and definitely an unnecessary assumption.  It's optimistic
because there is absolutely no guarantee that all workers will take
the same amount of time to sort the data they read.  It is absolutely
not the case that all data sets sort at the same speed.  Because of
the way parallel sequential scan works, we're somewhat insulated from
that; workers that sort faster will get a larger chunk of the table.
However, that only means that workers will finish generating their
sorted runs at about the same time, not that they will finish merging
at the same time.  And, indeed, if some workers end up with more data
than others (so that they finish building runs at about the same time)
then some will probably take longer to complete the merging than
others.

But even if it were true that the waits will always be brief, I still
think the way you've done it is a bad idea, because now tuplesort.c
has to know that it needs to wait because of some detail of
lower-level resource management about which it should not have to
care.  That alone is a sufficient reason to want a better approach.

I completely accept that whatever abstraction we use at the BufFile
level has to be something that can be plumbed into logtape.c, and if
Thomas's mechanism can't be bolted in there in a sensible way then
that's a problem.  But I feel quite strongly that the solution to that
problem isn't to adopt the approach you've taken here.

>> + * run.  Parallel workers always use quicksort, however.
>>
>> Comment fails to mention a reason.
>
> Well, I don't think that there is any reason to use replacement
> selection at all, what with the additional merge heap work last year.
> But, the theory there remains that RS is good when you can get one big
> run and no merge. You're not going to get that with parallel sort in
> any case, since the leader must merge. Besides, merging in the workers
> happens in the workers. And, the backspace requirement of 32MB of
> workMem per participant pretty much eliminates any use of RS that
> you'd get otherwise.
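
(Illustrative sketch only, not code from the patch: roughly the kind of
per-participant workMem floor described above.  The 32MB figure comes from
the discussion; the function and variable names are made up.)

/*
 * Scale the worker count back until every participant (workers plus the
 * leader) is left with at least 32MB of the maintenance_work_mem budget.
 */
static int
clamp_workers_for_sort_mem(int nworkers, long maintenance_work_mem_kb)
{
    const long  min_sort_mem_kb = 32L * 1024;   /* 32MB per participant */

    while (nworkers > 0 &&
           maintenance_work_mem_kb / (nworkers + 1) < min_sort_mem_kb)
        nworkers--;

    return nworkers;
}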

So, please mention that briefly in 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-03-21 Thread Peter Geoghegan
On Tue, Mar 21, 2017 at 9:10 AM, Robert Haas  wrote:
> - * This code is moderately slow (~10% slower) compared to the regular
> - * btree (insertion) build code on sorted or well-clustered data.  On
> - * random data, however, the insertion build code is unusable -- the
> - * difference on a 60MB heap is a factor of 15 because the random
> - * probes into the btree thrash the buffer pool.  (NOTE: the above
> - * "10%" estimate is probably obsolete, since it refers to an old and
> - * not very good external sort implementation that used to exist in
> - * this module.  tuplesort.c is almost certainly faster.)
>
> While I agree that the old comment is probably inaccurate, I don't
> think dropping it without comment in a patch to implement parallel
> sorting is the way to go. How about updating it to be more current as
> a separate patch?

I think that since the comment refers to code from before 1999, it can
go. Any separate patch to remove it would have an entirely negative
linediff.

> +/* Magic numbers for parallel state sharing */

> 1, 2, and 3 would probably work just as well.

Okay.

> Why is this being put in planner.c rather than something specific to
> creating indexes?  Not sure that's a good idea.

The idea is that it's the planner's domain, but this is a utility
statement, so it makes sense to put it next to the CLUSTER function
that determines whether CLUSTER sorts rather than does an index scan. I
don't have strong feelings on how appropriate that is.

> + * This should be called when workers have flushed out temp file buffers and
> + * yielded control to caller's process.  Workers should hold open their
> + * BufFiles at least until the caller's process is able to call here and
> + * assume ownership of BufFile.  The general pattern is that workers make
> + * available data from their temp files to one nominated process; there is
> + * no support for workers that want to read back data from their original
> + * BufFiles following writes performed by the caller, or any other
> + * synchronization beyond what is implied by caller contract.  All
> + * communication occurs in one direction.  All output is made available to
> + * caller's process exactly once by workers, following call made here at the
> + * tail end of processing.
>
> Thomas has designed a system for sharing files among cooperating
> processes that lacks several of these restrictions.  With his system,
> it's still necessary for all data to be written and flushed by the
> writer before anybody tries to read it.  But the restriction that the
> worker has to hold its BufFile open until the leader can assume
> ownership goes away.  That's a good thing; it avoids the need for
> workers to sit around waiting for the leader to assume ownership of a
> resource instead of going away faster and freeing up worker slots for
> some other query, or moving on to some other computation.   The
> restriction that the worker can't reread the data after handing off
> the file also goes away.

There is no restriction about workers not being able to reread data.
That comment makes it clear that that's only when the leader writes to
the file. It alludes to rereading within a worker following the leader
writing to their files in order to recycle blocks within logtape.c,
which the patch never has to do, unless you enable one of the 0002-*
testing GUCs to force randomAccess.

Obviously iff you write to the file in the leader, there is little
that the worker can do afterwards, but it's not a given that you'd
want to do that, and this patch actually never does. You could equally
well say that PHJ fails to provide for my requirement for having the
leader write to the files sensibly in order to recycle blocks, a
requirement that its shared BufFile mechanism expressly does not
support.

> That would cut hundreds of
> lines from this patch with no real disadvantage that I can see --
> including things like worker_wait(), which are only needed because of
> the shortcomings of the underlying mechanism.

I think it would definitely be a significant net gain in LOC. And,
worker_wait() will probably be replaced by the use of the barrier
abstraction anyway. It didn't seem worth creating a dependency on that
early, given my simple requirements. PHJ uses barriers instead,
presumably because there is much more of this stuff. The workers
generally won't have to wait at all. It's expected to be pretty much
instantaneous.
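
(Sketch only: what a barrier-based wait point might look like, assuming
the Barrier API being proposed around this time -- BarrierInit() and
BarrierArriveAndWait() -- in place of an ad-hoc worker_wait().  The struct
and function names here are illustrative.)

#include "postgres.h"
#include "storage/barrier.h"

/* A Barrier placed in the shared state; lives in the DSM segment. */
typedef struct SortCoordinateSketch
{
    Barrier     build_barrier;
} SortCoordinateSketch;

/* leader, at setup time */
static void
coordinate_init(SortCoordinateSketch *coord, int nparticipants)
{
    BarrierInit(&coord->build_barrier, nparticipants);
}

/* each participant, once its sorted run is flushed to its temp file */
static void
participant_done(SortCoordinateSketch *coord)
{
    /* returns once every participant has arrived; no hand-rolled wait loop */
    BarrierArriveAndWait(&coord->build_barrier, 0 /* wait event info */);
}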

> + * run.  Parallel workers always use quicksort, however.
>
> Comment fails to mention a reason.

Well, I don't think that there is any reason to use replacement
selection at all, what with the additional merge heap work last year.
But, the theory there remains that RS is good when you can get one big
run and no merge. You're not going to get that with parallel sort in
any case, since the leader must merge. Besides, merging in the workers
happens in the workers. And, the backspace requirement of 32MB of
workMem per participant pretty 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-03-21 Thread Robert Haas
On Sun, Mar 19, 2017 at 9:03 PM, Peter Geoghegan  wrote:
> On Sun, Mar 12, 2017 at 3:05 PM, Peter Geoghegan  wrote:
>> I attach my V9 of the patch. I came up with some stuff for the design of
>> resource management that I think meets every design goal that we have
>> for shared/unified BufFiles:
>
> Commit 2609e91fc broke the parallel CREATE INDEX cost model. I should
> now pass -1 as the index block argument to compute_parallel_worker(),
> just as all callers that aren't parallel index scan do after that
> commit. This issue caused V9 to never choose parallel CREATE INDEX
> within nbtsort.c. There was also a small amount of bitrot.
>
> Attached V10 fixes this regression. I also couldn't resist adding a
> few new assertions that I thought were worth having to buffile.c, plus
> dedicated wait events for parallel tuplesort. And, I fixed a silly bug
> added in V9 around where worker_wait() should occur.

Some initial review comments:

- * This code is moderately slow (~10% slower) compared to the regular
- * btree (insertion) build code on sorted or well-clustered data.  On
- * random data, however, the insertion build code is unusable -- the
- * difference on a 60MB heap is a factor of 15 because the random
- * probes into the btree thrash the buffer pool.  (NOTE: the above
- * "10%" estimate is probably obsolete, since it refers to an old and
- * not very good external sort implementation that used to exist in
- * this module.  tuplesort.c is almost certainly faster.)

While I agree that the old comment is probably inaccurate, I don't
think dropping it without comment in a patch to implement parallel
sorting is the way to go. How about updating it to be more current as
a separate patch?

+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_BTREE_SHARED  UINT64CONST(0xA001)
+#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xA002)
+#define PARALLEL_KEY_TUPLESORT_SPOOL2  UINT64CONST(0xA003)

1, 2, and 3 would probably work just as well.  The parallel
infrastructure uses high-numbered values to avoid conflict with
plan_node_id values, but this is a utility statement so there's no
such problem.  But it doesn't matter very much.
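
(Sketch of how such keys are used with the shm_toc machinery, not code
from the patch.  The struct and key name are hypothetical, and the bool
argument to shm_toc_lookup() is an assumption about the then-current
signature.)

#include "postgres.h"
#include "access/parallel.h"
#include "storage/shm_toc.h"

/* Keys only need to be unique within this utility statement's toc. */
#define PARALLEL_KEY_BTREE_SHARED_SKETCH    UINT64CONST(0xA000000000000001)

/* Hypothetical shared state; the patch's real struct carries much more. */
typedef struct BTSharedSketch
{
    int         nparticipants;
} BTSharedSketch;

/* Leader: carve space out of the toc (accounted for at estimate time). */
static void
publish_shared_state(ParallelContext *pcxt)
{
    BTSharedSketch *shared;

    shared = (BTSharedSketch *) shm_toc_allocate(pcxt->toc, sizeof(BTSharedSketch));
    shared->nparticipants = pcxt->nworkers + 1;
    shm_toc_insert(pcxt->toc, PARALLEL_KEY_BTREE_SHARED_SKETCH, shared);
}

/* Worker: find the same chunk again by key. */
static BTSharedSketch *
attach_shared_state(shm_toc *toc)
{
    return (BTSharedSketch *) shm_toc_lookup(toc, PARALLEL_KEY_BTREE_SHARED_SKETCH, false);
}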

+ * Note: caller had better already hold some type of lock on the table and
+ * index.
+ */
+int
+plan_create_index_workers(Oid tableOid, Oid indexOid)

Caller should pass down the Relation rather than the Oid.  That is
better both because it avoids unnecessary work and because it more or
less automatically avoids the problem mentioned in the note.
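
(Just to make the suggested interface change concrete; illustrative
declarations only, with a made-up name for the Relation-taking variant.)

#include "postgres.h"
#include "utils/rel.h"          /* Relation */

/* As submitted (quoted above): callers pass bare OIDs, forcing fresh lookups. */
extern int plan_create_index_workers(Oid tableOid, Oid indexOid);

/*
 * The suggested shape: pass down the relations the caller already has open
 * and locked, avoiding redundant work and making the locking note above
 * largely self-enforcing.
 */
extern int plan_create_index_workers_rel(Relation heapRel, Relation indexRel);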

Why is this being put in planner.c rather than something specific to
creating indexes?  Not sure that's a good idea.

+ * This should be called when workers have flushed out temp file buffers and
+ * yielded control to caller's process.  Workers should hold open their
+ * BufFiles at least until the caller's process is able to call here and
+ * assume ownership of BufFile.  The general pattern is that workers make
+ * available data from their temp files to one nominated process; there is
+ * no support for workers that want to read back data from their original
+ * BufFiles following writes performed by the caller, or any other
+ * synchronization beyond what is implied by caller contract.  All
+ * communication occurs in one direction.  All output is made available to
+ * caller's process exactly once by workers, following call made here at the
+ * tail end of processing.

Thomas has designed a system for sharing files among cooperating
processes that lacks several of these restrictions.  With his system,
it's still necessary for all data to be written and flushed by the
writer before anybody tries to read it.  But the restriction that the
worker has to hold its BufFile open until the leader can assume
ownership goes away.  That's a good thing; it avoids the need for
workers to sit around waiting for the leader to assume ownership of a
resource instead of going away faster and freeing up worker slots for
some other query, or moving on to some other computation.   The
restriction that the worker can't reread the data after handing off
the file also goes away.  The files can be read and written by any
participant in any order, as many times as you like, with only the
restriction that the caller must guarantee that data will be written
and flushed from private buffers before it can be read.  I don't see
any reason to commit both his system and your system, and his is more
general so I think you should use it.  That would cut hundreds of
lines from this patch with no real disadvantage that I can see --
including things like worker_wait(), which are only needed because of
the shortcomings of the underlying mechanism.

+ * run.  Parallel workers always use quicksort, however.

Comment fails to mention a reason.

+elog(LOG, "%d using " INT64_FORMAT " KB of memory for read
buffers among %d input tapes",
+ state->worker, 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-03-19 Thread Peter Geoghegan
On Sun, Mar 12, 2017 at 3:05 PM, Peter Geoghegan  wrote:
> I attach my V9 of the patch. I came up with some stuff for the design of
> resource management that I think meets every design goal that we have
> for shared/unified BufFiles:

Commit 2609e91fc broke the parallel CREATE INDEX cost model. I should
now pass -1 as the index block argument to compute_parallel_worker(),
just as all callers that aren't parallel index scan do after that
commit. This issue caused V9 to never choose parallel CREATE INDEX
within nbtsort.c. There was also a small amount of bitrot.

Attached V10 fixes this regression. I also couldn't resist adding a
few new assertions that I thought were worth having to buffile.c, plus
dedicated wait events for parallel tuplesort. And, I fixed a silly bug
added in V9 around where worker_wait() should occur.

-- 
Peter Geoghegan


0001-Add-parallel-B-tree-index-build-sorting.patch.gz
Description: GNU Zip compressed data


0002-Add-temporary-testing-tools.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-03-12 Thread Peter Geoghegan
On Sun, Mar 12, 2017 at 3:05 PM, Peter Geoghegan  wrote:
> There is still an open item here, though: The leader-as-worker
> Tuplesortstate, a special case, can still leak files.

I phrased this badly. What I mean is that there can be instances where
temp files are left on disk following a failure such as palloc() OOM;
no backend ends up doing an unlink() iff a leader-as-worker
Tuplesortstate was used and we get unlucky. I did not mean a leak of
virtual or real file descriptors, which would see Postgres print a
refcount leak warning from resowner.c. Naturally, these "leaked" files
will eventually be deleted by the next restart of the server at the
latest, within RemovePgTempFiles(). Note also that a duplicate
unlink() (with annoying LOG message) is impossible under any
circumstances with V9, regardless of whether or not a leader-as-worker
Tuplesort state is involved.

Anyway, I was sure that I needed to completely nail this down in order
to be consistent with existing guarantees, but another look at
OpenTemporaryFile() makes me doubt that. ResourceOwnerEnlargeFiles()
is called, which itself uses palloc(), which can of course fail. There
are remarks over that function within resowner.c about OOM:

/*
 * Make sure there is room for at least one more entry in a ResourceOwner's
 * files reference array.
 *
 * This is separate from actually inserting an entry because if we run out
 * of memory, it's critical to do so *before* acquiring the resource.
 */
void
ResourceOwnerEnlargeFiles(ResourceOwner owner)
{
...
}

But this happens after OpenTemporaryFileInTablespace() has already
returned. Taking care to allocate memory up-front here is motivated by
keeping the vFD cache entry and current resource owner in perfect
agreement about the FD_XACT_TEMPORARY-ness of a file, and that's it.
It's *not* true that there is a broader sense in which
OpenTemporaryFile() is atomic, which for some reason I previously
believed to be the case.
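
(A generic sketch of the discipline that comment prescribes -- do anything
that can throw before acquiring the resource -- with hypothetical acquire
and remember steps; it is not a claim about how OpenTemporaryFile() is
actually structured, which is exactly the question above.)

#include "postgres.h"
#include "storage/fd.h"                 /* File */
#include "utils/resowner_private.h"     /* ResourceOwnerEnlargeFiles() */

extern File acquire_raw_resource(void);             /* hypothetical */
extern void remember_in_current_owner(File fd);     /* hypothetical */

/*
 * Enlarge-before-acquire: the only step that can fail (the palloc() hidden
 * inside the Enlarge call) happens while we still hold nothing, so an OOM
 * can never strand an untracked resource.
 */
static File
acquire_tracked_resource(void)
{
    File        fd;

    ResourceOwnerEnlargeFiles(CurrentResourceOwner);    /* may throw */

    fd = acquire_raw_resource();            /* acquire only once a slot exists */
    remember_in_current_owner(fd);          /* cannot fail now */

    return fd;
}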

So, I haven't failed to prevent an outcome that wasn't already
possible. It doesn't seem like it would be that hard to fix this, and
then have the parallel tuplesort patch live up to that new higher
standard. But, it's possible that Tom or maybe someone else would
consider that a bad idea, for roughly the same reason that we don't
call RemovePgTempFiles() for *crash* induced restarts, as mentioned by
Thomas up-thead:

 * NOTE: we could, but don't, call this during a post-backend-crash restart
 * cycle.  The argument for not doing it is that someone might want to examine
 * the temp files for debugging purposes.  This does however mean that
 * OpenTemporaryFile had better allow for collision with an existing temp
 * file name.
 */
void
RemovePgTempFiles(void)
{
...
}

Note that I did put some thought into making sure OpenTemporaryFile()
does the right thing with collisions with existing temp files. So,
maybe the right thing is to do nothing at all. I don't have strong
feelings either way on this question.

-- 
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-03-12 Thread Peter Geoghegan
On Thu, Feb 16, 2017 at 8:45 AM, Peter Geoghegan  wrote:
>> I do not think there should be any reason why we can't get the
>> resource accounting exactly correct here.  If a single backend manages
>> to remove every temporary file that it creates exactly once (and
>> that's currently true, modulo system crashes), a group of cooperating
>> backends ought to be able to manage to remove every temporary file
>> that any of them create exactly once (again, modulo system crashes).
>
> I believe that we are fully in agreement here. In particular, I think
> it's bad that there is an API that says "caller shouldn't throw an
> elog error between these two points", and that will be fixed before
> too long. I just think that it's worth acknowledging a certain nuance.

I attach my V9 of the patch. I came up with some stuff for the design of
resource management that I think meets every design goal that we have
for shared/unified BufFiles:

* Avoids both resource leaks, and spurious double-freeing of resources
(e.g., a second unlink() for a file from a different process) when
there are errors. The latter problem was possible before, a known
issue with V8 of the patch. I believe that this revision avoids these
problems in a way that is *absolutely bulletproof* in the face of
arbitrary failures (e.g., palloc() failure) in any process at any
time. Although, be warned that there is a remaining open item
concerning resource management in the leader-as-worker case, which I
go into below.

There are now what you might call "critical sections" in one function.
That is, there are points where we cannot throw an error (without a
BEGIN_CRIT_SECTION()!), but those are entirely confined to unification
code within the leader, where we can be completely sure that no error
can be raised. The leader can even fail before some but not all of a
particular worker's segments are in its local resource manager, and we
still do the right thing. I've been testing this by adding code that
randomly throws errors at points interspersed throughout worker and
leader unification hand-off points. I then leave this stress-test
build to run for a few hours, while monitoring for leaked files and
spurious fd.c reports of double-unlink() and similar issues. Test
builds change LOG to PANIC within several places in fd.c, while
MAX_PHYSICAL_FILESIZE was reduced from 1GiB to BLCKSZ.

All of these guarantees are made without any special care from caller
to buffile.c. The only V9 change to tuplesort.c or logtape.c in this
general area is that they have to pass a dynamic shared memory segment
to buffile.c, so that it can register a new callback. That's it. This
may be of particular interest to Thomas. All complexity is confined to
buffile.c.
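
(Sketch of the hand-off just described: the caller only has to pass its
dsm_segment down, and buffile.c hangs its cleanup off the segment itself.
Apart from on_dsm_detach(), the names here are illustrative.)

#include "postgres.h"
#include "storage/dsm.h"

/* leader-side callback: release/unlink whatever segments remain */
static void
shared_buffile_cleanup(dsm_segment *seg, Datum arg)
{
    /* elided in this sketch */
}

void
BufFileRegisterCleanupSketch(dsm_segment *seg, Datum shared_state)
{
    /* runs when the segment is detached, including on error paths */
    on_dsm_detach(seg, shared_buffile_cleanup, shared_state);
}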

* No expansion in the use of shared memory to manage resources.
BufFile refcount is still per-worker. The role of local resource
managers is unchanged.

* Additional complexity over and above ordinary BufFile resource
management is confined to the leader process and its on_dsm_detach()
callback. Only the leader registers a callback. Of course, refcount
management within BufFileClose() can still take place in workers, but
that isn't something that we rely on (that's only for non-error
paths). In general, worker processes mostly have resource managers
managing their temp file segments as a thing that has nothing to do
with BufFiles (BufFiles are still not owned by resowner.c/fd.c --
they're blissfully unaware of all of this stuff).

* In general, unified BufFiles can still be treated in exactly the
same way as conventional BufFiles, and things just work, without any
special cases being exercised internally.

There is still an open item here, though: The leader-as-worker
Tuplesortstate, a special case, can still leak files. So,
stress-testing will only show the patch to be completely robust
against resource leaks when nbtsort.c is modified to enable
FORCE_SINGLE_WORKER testing. Despite the name FORCE_SINGLE_WORKER, you
can also modify that file to force there to be arbitrary-many workers
requested (just change "requested = 1" to something else). The
leader-as-worker problem is avoided because we don't have the leader
participating as a worker this way, which would otherwise present
issues for resowner.c that I haven't got around to fixing just yet. It
isn't hard to imagine why this is -- one backend with two FDs for
certain fd.c temp segments is just going to cause problems for
resowner.c without additional special care. Didn't seem worth blocking
on that. I want to prove that my general approach is workable. That
problem is confined to one backend's resource manager when it is the
leader participating as a worker. It is not a refcount problem. The
simplest solution here would be to ban the leader-as-worker case by
contract. Alternatively, we could pass fd.c segments from the
leader-as-worker Tuplesortstate's BufFile to the leader
Tuplesortstate's BufFile without opening or closing anything. This
way, there will be no 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-03-06 Thread Thomas Munro
On Wed, Mar 1, 2017 at 10:29 PM, Thomas Munro
 wrote:
> I'm testing a patch that lets you set up a fixed sized
> SharedBufFileSet object in a DSM segment, with its own refcount for
> the reason you explained.  It supports a dynamically expandable set of
> numbered files, so each participant gets to export file 0, file 1,
> file 2 and so on as required, in any order.  I think this should suit
> both Parallel Tuplesort which needs to export just one file from each
> participant, and Parallel Shared Hash which doesn't know up front how
> many batches it will produce.  Not quite ready but I will post a
> version tomorrow to get Peter's reaction.

See 0007-hj-shared-buf-file-v6.patch in the v6 tarball in the parallel
shared hash thread.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-03-01 Thread Thomas Munro
On Sat, Feb 11, 2017 at 1:52 AM, Robert Haas  wrote:
> On Thu, Feb 9, 2017 at 6:38 PM, Thomas Munro
>  wrote:
>> Yes, potentially unbounded in rare cases.  If we plan for N batches,
>> and then run out of work_mem because our estimates were just wrong or
>> the distribution of keys is sufficiently skewed, we'll run
>> HashIncreaseNumBatches, and that could happen more than once.  I have
>> a suite of contrived test queries that hits all the various modes and
>> code paths of hash join, and it includes a query that plans for one
>> batch but finishes up creating many, and then the leader exits.  I'll
>> post that to the other thread along with my latest patch series soon.
>
> Hmm, OK.  So that's going to probably require something where a fixed
> amount of DSM can describe an arbitrary number of temp file series.
> But that also means this is an even-more-special-purpose tool that
> shouldn't be deeply tied into parallel.c so that it can run before any
> errors happen.
>
> Basically, I think the "let's write the code between here and here so
> it throws no errors" technique is, for 99% of PostgreSQL programming,
> difficult and fragile.  We shouldn't rely on it if there is some other
> reasonable option.

I'm testing a patch that lets you set up a fixed sized
SharedBufFileSet object in a DSM segment, with its own refcount for
the reason you explained.  It supports a dynamically expandable set of
numbered files, so each participant gets to export file 0, file 1,
file 2 and so on as required, in any order.  I think this should suit
both Parallel Tuplesort which needs to export just one file from each
participant, and Parallel Shared Hash which doesn't know up front how
many batches it will produce.  Not quite ready but I will post a
version tomorrow to get Peter's reaction.
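
(Not the patch itself -- just a sketch of the interface shape described
above: a fixed-size, refcounted object in the DSM segment, with each
participant exporting numbered files on demand.  Every name here is
hypothetical.)

#include "postgres.h"
#include "storage/buffile.h"
#include "storage/spin.h"

typedef struct SharedBufFileSetSketch
{
    slock_t     mutex;
    int         refcount;       /* last participant to detach deletes all files */
    int         nparticipants;  /* fixed when the DSM segment is sized */
} SharedBufFileSetSketch;

/* Each participant can export file 0, file 1, file 2, ... as it goes. */
extern void SharedBufFileExportSketch(SharedBufFileSetSketch *set,
                                      int participant, int filenum,
                                      BufFile *file);

/* Any participant can later import another's file by (participant, number). */
extern BufFile *SharedBufFileImportSketch(SharedBufFileSetSketch *set,
                                          int participant, int filenum);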

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-02-16 Thread Robert Haas
On Thu, Feb 16, 2017 at 11:45 AM, Peter Geoghegan  wrote:
> Maybe I'm just being pedantic here, since we both actually want the
> code to do the same thing.

Pedantry from either of us?  Nah...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-02-16 Thread Peter Geoghegan
On Wed, Feb 15, 2017 at 6:05 PM, Thomas Munro
 wrote:
> Here are some results with your latest patch, using the same test as
> before but this time with SCALE=100 (= 100,000,000 rows).

Cool.

> To reduce the number of combinations I did only unique data and built
> only non-unique indexes with only 'wide' tuples (= key plus a text
> column that holds a 151-character wide string, rather than just the
> key), and also didn't bother with the 1MB memory size as suggested.
> Here are the results up to 4 workers (a results table going up to 8
> workers is attached, since it wouldn't format nicely if I pasted it
> here).

I think that you are still I/O bound in a way that is addressable by
adding more disks. The exception is the text cases, where the patch
does best. (I don't place too much emphasis on that because I know
that in the long term, we'll have abbreviated keys, which will take
some of the sheen off of that.)

> Again, the w = 0 time is seconds, the rest show relative
> speed-up.

I think it's worth pointing out that while there are cases where we
see no benefit from going from 4 to 8 workers, it tends to hardly hurt
at all, or hardly help at all. It's almost irrelevant that the number
of workers used is excessive, at least up until the point when all
cores have their own worker. That's a nice quality for this to have --
the only danger is that we use parallelism when we shouldn't have at
all, because the serial case could manage an internal sort, and the
sort was small enough that that could be a notable factor.

> "textwide" "asc" is nearly an order of magnitude faster than other
> initial orders without parallelism, but then parallelism doesn't seem
> to help it much.  Also, using more than 64MB doesn't ever seem to help
> very much; in the "desc" case it hinders.

Maybe it's CPU cache efficiency? There are edge cases where multiple
passes are faster than one pass. That's the only explanation I can
think of.

> I was curious to understand how performance changes if we become just
> a bit less correlated (rather than completely uncorrelated or
> perfectly inversely correlated), so I tried out a 'banana skin' case:
> I took the contents of the textwide asc table and copied it to a new
> table, and then moved the 900 words matching 'banana%' to the physical
> end of the heap by deleting and reinserting them in one transaction.

A likely problem with that is that most runs will actually not have
their own banana skin, so to speak. You only see a big drop in
performance when every quicksort operation has presorted input, but
with one or more out-of-order tuples at the end. In order to see a
really unfortunate case with parallel CREATE INDEX, you'd probably
have to have enough memory that workers don't need to do their own
merge (and so a worker's work almost entirely consists of one big
quicksort operation), with enough "banana skin heap pages" that the
parallel heap scan is pretty much guaranteed to end up giving "banana
skin" (out of order) tuples to every worker, making all of them "have
a slip" (throw away a huge amount of work as the presorted
optimization is defeated right at the end of its sequential read
through).

A better approach would be to have several small localized areas
across the input where tuples are a little out of order. That would
probably show that the performance is pretty much in line with the
random cases.

> It's hard to speculate about this, but I guess that a significant
> number of indexes in real world databases might be uncorrelated to
> insert order.

That would certainly be true with text, where we see a risk of (small)
regressions.

-- 
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-02-16 Thread Peter Geoghegan
On Thu, Feb 16, 2017 at 6:28 AM, Robert Haas  wrote:
> On Thu, Feb 9, 2017 at 7:10 PM, Peter Geoghegan  wrote:
>> At the risk of stating the obvious, ISTM that the right way to do
>> this, at a high level, is to err on the side of unneeded extra
>> unlink() calls, not leaking files. And, to make the window for problem
>> ("remaining hole that you haven't quite managed to plug") practically
>> indistinguishable from no hole at all, in a way that's kind of baked
>> into the API.
>
> I do not think there should be any reason why we can't get the
> resource accounting exactly correct here.  If a single backend manages
> to remove every temporary file that it creates exactly once (and
> that's currently true, modulo system crashes), a group of cooperating
> backends ought to be able to manage to remove every temporary file
> that any of them create exactly once (again, modulo system crashes).

I believe that we are fully in agreement here. In particular, I think
it's bad that there is an API that says "caller shouldn't throw an
elog error between these two points", and that will be fixed before
too long. I just think that it's worth acknowledging a certain nuance.

> I do agree that a duplicate unlink() call isn't as bad as a missing
> unlink() call, at least if there's no possibility that the filename
> could have been reused by some other process, or some other part of
> our own process, which doesn't want that new file unlinked.  But it's
> messy.  If the seatbelts in your car were to randomly unbuckle, that
> would be a safety hazard.  If they were to randomly refuse to
> unbuckle, you wouldn't say "that's OK because it's not a safety
> hazard", you'd say "these seatbelts are badly designed".  And I think
> the same is true of this mechanism.

If it happened in the lifetime of only one out of a million seatbelts
manufactured, and they were manufactured at a competitive price (not
over-engineered), I probably wouldn't say that. The fact that the
existing resource manger code only LOGs most temp file related
failures suggests to me that that's a "can't happen" condition, but we
still hedge. I would still like to hedge against even (theoretically)
impossible risks.

Maybe I'm just being pedantic here, since we both actually want the
code to do the same thing.

-- 
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-02-16 Thread Robert Haas
On Thu, Feb 9, 2017 at 7:10 PM, Peter Geoghegan  wrote:
> At the risk of stating the obvious, ISTM that the right way to do
> this, at a high level, is to err on the side of unneeded extra
> unlink() calls, not leaking files. And, to make the window for problem
> ("remaining hole that you haven't quite managed to plug") practically
> indistinguishable from no hole at all, in a way that's kind of baked
> into the API.

I do not think there should be any reason why we can't get the
resource accounting exactly correct here.  If a single backend manages
to remove every temporary file that it creates exactly once (and
that's currently true, modulo system crashes), a group of cooperating
backends ought to be able to manage to remove every temporary file
that any of them create exactly once (again, modulo system crashes).

I do agree that a duplicate unlink() call isn't as bad as a missing
unlink() call, at least if there's no possibility that the filename
could have been reused by some other process, or some other part of
our own process, which doesn't want that new file unlinked.  But it's
messy.  If the seatbelts in your car were to randomly unbuckle, that
would be a safety hazard.  If they were to randomly refuse to
unbuckle, you wouldn't say "that's OK because it's not a safety
hazard", you'd say "these seatbelts are badly designed".  And I think
the same is true of this mechanism.

The way to make this 100% reliable is to set things up so that there
is joint ownership from the beginning and shared state that lets you
know whether the work has already been done.
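
(A minimal sketch of that arrangement, assuming a refcount plus a done
flag in shared memory, so that exactly one process -- the last to detach --
performs the deletion.  Names are illustrative.)

#include "postgres.h"
#include "storage/spin.h"

/* Hypothetical shared state, created along with the files it protects. */
typedef struct SharedCleanupStateSketch
{
    slock_t     mutex;
    int         refcount;       /* attached processes, including the creator */
    bool        cleaned_up;     /* has the deletion pass already run? */
} SharedCleanupStateSketch;

static void
delete_all_shared_files(void)
{
    /* unlink every file in the set; elided in this sketch */
}

/* Called from each process's detach callback. */
static void
shared_files_detach(SharedCleanupStateSketch *state)
{
    bool        do_cleanup = false;

    SpinLockAcquire(&state->mutex);
    Assert(state->refcount > 0);
    if (--state->refcount == 0 && !state->cleaned_up)
    {
        state->cleaned_up = true;
        do_cleanup = true;
    }
    SpinLockRelease(&state->mutex);

    if (do_cleanup)
        delete_all_shared_files();      /* happens exactly once */
}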

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-02-15 Thread Thomas Munro
On Sat, Feb 4, 2017 at 2:45 PM, Peter Geoghegan  wrote:
> It might just have been that the table was too small to be an
> effective target for parallel sequential scan with so many workers,
> and so a presorted best case CREATE INDEX, which isn't that different,
> also fails to see much benefit (compared to what you'd see with a
> similar case involving a larger table). In other words, I might have
> jumped the gun in emphasizing issues with hardware and I/O bandwidth
> over issues around data volume (that I/O parallelism is inherently not
> very helpful with these relatively small tables).
>
> As I've pointed out a couple of times before, bigger sorts will be
> more CPU bound because sorting itself has costs that grow
> linearithmically, whereas writing out runs has costs that grow
> linearly. The relative cost of the I/O can be expected to go down as
> input goes up for this reason. At the same time, a larger input might
> make better use of I/O parallelism, which reduces the cost paid in
> latency to write out runs in absolute terms.

Here are some results with your latest patch, using the same test as
before but this time with SCALE=100 (= 100,000,000 rows).  The table
sizes are:

                           List of relations
 Schema |         Name         | Type  |    Owner     | Size  | Description
--------+----------------------+-------+--------------+-------+-------------
 public | million_words        | table | thomas.munro | 42 MB |
 public | some_words           | table | thomas.munro | 19 MB |
 public | test_intwide_u_asc   | table | thomas.munro | 18 GB |
 public | test_intwide_u_desc  | table | thomas.munro | 18 GB |
 public | test_intwide_u_rand  | table | thomas.munro | 18 GB |
 public | test_textwide_u_asc  | table | thomas.munro | 19 GB |
 public | test_textwide_u_desc | table | thomas.munro | 19 GB |
 public | test_textwide_u_rand | table | thomas.munro | 19 GB |

To reduce the number of combinations I did only unique data and built
only non-unique indexes with only 'wide' tuples (= key plus a text
column that holds a 151-character wide string, rather than just the
key), and also didn't bother with the 1MB memory size as suggested.
Here are the results up to 4 workers (a results table going up to 8
workers is attached, since it wouldn't format nicely if I pasted it
here).  Again, the w = 0 time is seconds, the rest show relative
speed-up.  This data was all in the OS page cache because of a dummy
run done first, and I verified with 'sar' that there was exactly 0
reading from the block device.  The CPU was pegged on leader + workers
during sort runs, and then the leader's CPU hovered around 93-98%
during the merge/btree build.  I had some technical problems getting a
cold-cache read-from-actual-disk-each-time test run to work properly,
but can go back and do that again if anyone thinks that would be
interesting data to see.

   tab    | ord  | mem |  w = 0  | w = 1 | w = 2 | w = 3 | w = 4
----------+------+-----+---------+-------+-------+-------+-------
 intwide  | asc  |  64 |   67.91 | 1.26x | 1.46x | 1.62x | 1.73x
 intwide  | asc  | 256 |   67.84 | 1.23x | 1.48x | 1.63x | 1.79x
 intwide  | asc  | 512 |   69.01 | 1.25x | 1.50x | 1.63x | 1.80x
 intwide  | desc |  64 |   98.08 | 1.48x | 1.83x | 2.03x | 2.25x
 intwide  | desc | 256 |   99.87 | 1.43x | 1.80x | 2.03x | 2.29x
 intwide  | desc | 512 |  104.09 | 1.44x | 1.85x | 2.09x | 2.33x
 intwide  | rand |  64 |  138.03 | 1.56x | 2.04x | 2.42x | 2.58x
 intwide  | rand | 256 |  139.44 | 1.61x | 2.04x | 2.38x | 2.56x
 intwide  | rand | 512 |  138.96 | 1.52x | 2.03x | 2.28x | 2.57x
 textwide | asc  |  64 |  207.10 | 1.20x | 1.07x | 1.09x | 1.11x
 textwide | asc  | 256 |  200.62 | 1.19x | 1.06x | 1.04x | 0.99x
 textwide | asc  | 512 |  191.42 | 1.16x | 0.97x | 1.01x | 0.94x
 textwide | desc |  64 | 1382.48 | 1.89x | 2.37x | 3.18x | 3.87x
 textwide | desc | 256 | 1427.99 | 1.89x | 2.42x | 3.24x | 4.00x
 textwide | desc | 512 | 1453.21 | 1.86x | 2.39x | 3.23x | 3.75x
 textwide | rand |  64 | 1587.28 | 1.89x | 2.37x | 2.66x | 2.75x
 textwide | rand | 256 | 1557.90 | 1.85x | 2.34x | 2.64x | 2.73x
 textwide | rand | 512 | 1547.97 | 1.87x | 2.32x | 2.64x | 2.71x

"textwide" "asc" is nearly an order of magnitude faster than other
initial orders without parallelism, but then parallelism doesn't seem
to help it much.  Also, using more than 64MB doesn't ever seem to help
very much; in the "desc" case it hinders.

I was curious to understand how performance changes if we become just
a bit less correlated (rather than completely uncorrelated or
perfectly inversely correlated), so I tried out a 'banana skin' case:
I took the contents of the textwide asc table and copied it to a new
table, and then moved the 900 words matching 'banana%' to the physical
end of the heap by deleting and reinserting them in one transaction.
I guess if we were to use this technology for CLUSTER, this might be
representative of a situation where you regularly 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-02-10 Thread Robert Haas
On Thu, Feb 9, 2017 at 6:38 PM, Thomas Munro
 wrote:
> Here's my thought process... please tell me where I'm going wrong:
>
> I have been assuming that it's not enough to just deal with this when
> the leader detaches on the theory that other participants will always
> detach first: that probably isn't true in some error cases, and could
> contribute to spurious racy errors where other workers complain about
> disappearing files if the leader somehow shuts down and cleans up
> while a worker is still running.  Therefore we need *some* kind of
> refcounting, whether it's a new kind or a new mechanism based on the
> existing kind.

+1.

> I have also been assuming that we don't want to teach dsm.c directly
> about this stuff; it shouldn't need to know about other modules, so we
> don't want it talking to buffile.c directly and managing a special
> table of files; instead we want a system of callbacks.  Therefore
> client code needs to do something after attaching to the segment in
> each backend.

+1.

> It doesn't matter whether we use an on_dsm_detach() callback and
> manage our own refcount to infer that destruction is imminent, or a
> new on_dsm_destroy() callback which tells us so explicitly: both ways
> we'll need to make sure that anyone who attaches to the segment also
> "attaches" to this shared BufFile manager object inside it, because
> any backend might turn out to be the one that is last to detach.

Not entirely.  In the first case, you don't need the requirement that
everyone who attaches to the segment must attach to the shared BufFile
manager.  In the second case, you do.

> That brings us to the race you mentioned.  Isn't it sufficient to say
> that you aren't allowed to do anything that might throw in between
> attaching to the segment and attaching to the SharedBufFileManager
> that it contains?

That would be sufficient, but I think it's not a very good design.  It
means, for example, that nothing between the time you attach to the
segment and the time you attach to this manager can palloc() anything.
So, for example, it would have to happen before ParallelWorkerMain
reaches the call to shm_mq_attach, which kinda sucks because we want
to do that as soon as possible after attaching to the DSM segment so
that errors are reported properly thereafter.  Note that's the very
first thing we do now, except for working out what the arguments to
that call need to be.

Also, while it's currently safe to assume that shm_toc_attach() and
shm_toc_lookup() don't throw errors, I've thought about the
possibility of installing some sort of cache in shm_toc_lookup() to
amortize the cost of lookups, if the number of keys ever got too
large.  And that would then require a palloc().  Generally, backend
code should be free to throw errors.  When it's absolutely necessary
for a short segment of code to avoid that, then we do, but you can't
really rely on any substantial amount of code to be that way, or stay
that way.

And in this case, even if we didn't mind those problems or had some
solution to them, I think that the shared buffer manager shouldn't
have to be something that is whacked directly into parallel.c all the
way at the beginning of the initialization sequence so that nothing
can fail before it happens.  I think it should be an optional data
structure that clients of the parallel infrastructure can decide to
use, or to not use.  It should be at arm's length from the core code,
just like the way ParallelQueryMain() is distinct from
ParallelWorkerMain() and sets up its own set of data structures with
their own set of keys.  All that stuff is happy to happen after
whatever ParallelWorkerMain() feels that it needs to do, even if
ParallelWorkerMain might throw errors for any number of unknown
reasons.  Similarly, I think this new thing should be something that
an executor node can decide to create inside its own per-node space --
reserved via ExecParallelEstimate, initialized via
ExecParallelInitializeDSM, etc.  There's no need for it to be deeply
coupled to parallel.c itself unless we force that choice by sticking a
no-fail requirement in there.
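
(Sketch of the per-node arrangement being described: the node accounts for
its space in the estimate hook and publishes it in the initialize-DSM hook,
with parallel.c none the wiser.  Real executor nodes key their chunk by
plan_node_id; the struct and key value here are made up.)

#include "postgres.h"
#include "access/parallel.h"
#include "storage/shm_toc.h"

#define MYNODE_SHARED_KEY   UINT64CONST(0xB000000000000001)     /* made up */

typedef struct MyNodeSharedSketch       /* hypothetical per-node shared state */
{
    int         nparticipants;
} MyNodeSharedSketch;

/* From the node's ExecXxxEstimate() hook: just account for the space. */
static void
mynode_estimate(ParallelContext *pcxt)
{
    shm_toc_estimate_chunk(&pcxt->estimator, sizeof(MyNodeSharedSketch));
    shm_toc_estimate_keys(&pcxt->estimator, 1);
}

/* From the node's ExecXxxInitializeDSM() hook: carve it out and publish it. */
static MyNodeSharedSketch *
mynode_initialize_dsm(ParallelContext *pcxt)
{
    MyNodeSharedSketch *shared;

    shared = (MyNodeSharedSketch *) shm_toc_allocate(pcxt->toc, sizeof(MyNodeSharedSketch));
    shared->nparticipants = pcxt->nworkers + 1;
    shm_toc_insert(pcxt->toc, MYNODE_SHARED_KEY, shared);
    return shared;
}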

> Up until two minutes ago I assumed that policy would leave only two
> possibilities: you attach to the DSM segment and attach to the
> SharedBufFileManager successfully or you attach to the DSM segment and
> then die horribly (but not throw) and the postmaster restarts the
> whole cluster and blows all temp files away with RemovePgTempFiles().
> But I see now in the comment of that function that crash-induced
> restarts don't call that because "someone might want to examine the
> temp files for debugging purposes".  Given that policy for regular
> private BufFiles, I don't see why that shouldn't apply equally to
> shared files: after a crash restart, you may have some junk files that
> won't be cleaned up until your next clean restart, whether they were
> private or shared BufFiles.

I think most people (other than Tom) would agree that 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-02-09 Thread Peter Geoghegan
On Thu, Feb 9, 2017 at 2:31 PM, Robert Haas  wrote:
> You might think about plugging that hole by moving the registry of
> on-destroy functions into the segment itself and making it a shared
> resource.  But ASLR breaks that, especially for loadable modules.  You
> could try to fix that problem, in turn, by storing arguments that can
> later be passed to load_external_function() instead of a function
> pointer per se.  But that sounds pretty fragile because some other
> backend might not try to load the module until after it's attached the
> DSM segment and it might then fail because loading the module runs
> _PG_init() which can throw errors.   Maybe you can think of a way to
> plug that hole too but you're way over your $8 budget by this
> point.

At the risk of stating the obvious, ISTM that the right way to do
this, at a high level, is to err on the side of unneeded extra
unlink() calls, not leaking files. And, to make the window for problem
("remaining hole that you haven't quite managed to plug") practically
indistinguishable from no hole at all, in a way that's kind of baked
into the API.

It's not like we currently throw an error when there is a problem with
deleting temp files that are no longer needed on resource manager
cleanup. We simply log the fact that it happened, and limp on.
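
(The behaviour being referred to, in sketch form: a failure to delete a
no-longer-needed temp file is logged, not raised as an error.  Simplified;
fd.c's real cleanup does more.)

#include "postgres.h"
#include <errno.h>
#include <unistd.h>

static void
unlink_temp_file_best_effort(const char *path)
{
    /* complain at LOG level and carry on; the file is already logically dead */
    if (unlink(path) < 0 && errno != ENOENT)
        elog(LOG, "could not unlink temporary file \"%s\": %m", path);
}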

I attach my V8. This does not yet do anything with on_dsm_detach().
I've run out of time to work on it this week, and am starting a new
job next week at VMware, which I'll need time to settle into. So I'm
posting this now, since you can still very much see the direction I'm
going in, and can give me any feedback that you have. If anyone wants
to show me how its done by building on this, and finishing what I have
off, be my guest. The new stuff probably isn't quite as polished as I
would prefer, but time grows short, so I won't withhold it.

Changes:

* Implements the refcount thing, albeit in a way that leaves a small
window for double unlink() calls if there is an error while there is
worker/leader co-ownership of a BufFile (just
add an "elog(ERROR)" just before leader-as-worker Tuplesort state is
ended within _bt_leafbuild() to see what I mean). This implies that
background workers can be reclaimed once the leader needs to start its
final on-the-fly merge, which is nice. As an example of how that's
nice, this change makes maintenance_work_mem a budget that we more
strictly adhere to.

* Fixes bitrot caused by recent logtape.c bugfix in master branch.

* No local segment is created during unification unless and until one
is required. (In practice, for current use of BufFile infrastructure,
no "local" segment is ever created, even if we force a randomAccess
case using one of the testing GUCs from 0002-* -- we'd have to use
another GUC to *also* force there to be no reclamation.)

* Better testing. As I just mentioned, we can now force logtape.c to
not reclaim blocks, so you make new local segments as part of a
unified BufFile, which have different considerations from a resource
management point of view. Despite being part of the same "unified"
BufFile from the leader's perspective, it behaves like a local
segment, so it definitely seems like a good idea to have test coverage
for this, at least during development. (I have a pretty rough test
suite that I'm using; development of this patch has been somewhat test
driven.)

* Better encapsulation of BufFile stuff. I am even closer to the ideal
of this whole sharing mechanism being a fairly generic BufFile thing
that logtape.c piggy-backs on without having special knowledge of the
mechanism. It's still true that the mechanism (sharing/unification) is
written principally with logtape.c in mind, but that's just because of
its performance characteristics. Nothing to do with the interface.

* Worked through items raised by Thomas in his 2017-01-30 mail to this thread.

>>>> Secondly, I might not want to be constrained by a
>>>> fixed-sized DSM segment to hold my SharedBufFile objects... there are
>>>> cases where I need to share a number of batch files that is unknown
>>>> at the start of execution time when the DSM segment is sized (I'll
>>>> write about that shortly on the Parallel Shared Hash thread).  Maybe I
>>>> can find a way to get rid of that requirement.  Or maybe it could
>>>> support DSA memory too, but I don't think it's possible to use
>>>> on_dsm_detach-based cleanup routines that refer to DSA memory because
>>>> by the time any given DSM segment's detach hook runs, there's no
>>>> telling which other DSM segments have been detached already, so the
>>>> DSA area may already have partially vanished; some other kind of hook
>>>> that runs earlier would be needed...
>>>
>>> Again, wrench.

I like the wrench analogy too, FWIW.

>> My problem here is that I don't know how many batches I'll finish up
>> creating.  In general that's OK because I can hold onto them as
>> private BufFiles owned by participants with 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-02-09 Thread Thomas Munro
On Fri, Feb 10, 2017 at 11:31 AM, Robert Haas  wrote:
> On Thu, Feb 9, 2017 at 5:09 PM, Thomas Munro
>  wrote:
>> I agree that the pinned segment case doesn't matter right now, I just
>> wanted to point it out.  I like your $10 wrench analogy, but maybe it
>> could be argued that adding a dsm_on_destroy() callback mechanism is
>> not only better than adding another refcount to track that other
>> refcount, but also a steal at only $8.
>
> If it's that simple, it might be worth doing, but I bet it's not.  One
> problem is that there's a race condition: there will inevitably be a
> period of time after you've called dsm_attach() and before you've
> attached to the specific data structure that we're talking about here.
> So suppose the last guy who actually knows about this data structure
> dies horribly and doesn't clean up because the DSM isn't being
> destroyed; moments later, you die horribly before reaching the code
> where you attach to this data structure.  Oops.

Right, I mentioned this problem earlier ("and also register the
on_dsm_detach callback, before any chance that an error might be
thrown (is that difficult?); failure to do so could result in file
leaks").

Here's my thought process... please tell me where I'm going wrong:

I have been assuming that it's not enough to just deal with this when
the leader detaches on the theory that other participants will always
detach first: that probably isn't true in some error cases, and could
contribute to spurious racy errors where other workers complain about
disappearing files if the leader somehow shuts down and cleans up
while a worker is still running.  Therefore we need *some* kind of
refcounting, whether it's a new kind or a new mechanism based on the
existing kind.

I have also been assuming that we don't want to teach dsm.c directly
about this stuff; it shouldn't need to know about other modules, so we
don't want it talking to buffile.c directly and managing a special
table of files; instead we want a system of callbacks.  Therefore
client code needs to do something after attaching to the segment in
each backend.

It doesn't matter whether we use an on_dsm_detach() callback and
manage our own refcount to infer that destruction is imminent, or a
new on_dsm_destroy() callback which tells us so explicitly: both ways
we'll need to make sure that anyone who attaches to the segment also
"attaches" to this shared BufFile manager object inside it, because
any backend might turn out to be the one that is last to detach.

That brings us to the race you mentioned.  Isn't it sufficient to say
that you aren't allowed to do anything that might throw in between
attaching to the segment and attaching to the SharedBufFileManager
that it contains?
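
(Sketch of the narrow window in question.  SharedBufFileManagerAttach() is
hypothetical, the bool argument to shm_toc_lookup() is an assumption about
the signature, and error checks on the attach calls are omitted; the point
is simply that nothing between dsm_attach() and the manager attach may
throw.)

#include "postgres.h"
#include "storage/dsm.h"
#include "storage/shm_toc.h"

/* hypothetical: bumps the shared refcount and registers the detach callback */
extern void SharedBufFileManagerAttach(dsm_segment *seg, void *manager);

static void
worker_attach_shared_files(dsm_handle handle, uint64 magic, uint64 key)
{
    dsm_segment *seg = dsm_attach(handle);
    shm_toc    *toc = shm_toc_attach(magic, dsm_segment_address(seg));
    void       *manager = shm_toc_lookup(toc, key, false);

    /* nothing above this line is allowed to throw */
    SharedBufFileManagerAttach(seg, manager);

    /* from here on, ERROR is fine again: cleanup is guaranteed */
}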

Up until two minutes ago I assumed that policy would leave only two
possibilities: you attach to the DSM segment and attach to the
SharedBufFileManager successfully or you attach to the DSM segment and
then die horribly (but not throw) and the postmaster restarts the
whole cluster and blows all temp files away with RemovePgTempFiles().
But I see now in the comment of that function that crash-induced
restarts don't call that because "someone might want to examine the
temp files for debugging purposes".  Given that policy for regular
private BufFiles, I don't see why that shouldn't apply equally to
shared files: after a crash restart, you may have some junk files that
won't be cleaned up until your next clean restart, whether they were
private or shared BufFiles.

> You might think about plugging that hole by moving the registry of
> on-destroy functions into the segment itself and making it a shared
> resource.  But ASLR breaks that, especially for loadable modules.  You
> could try to fix that problem, in turn, by storing arguments that can
> later be passed to load_external_function() instead of a function
> pointer per se.  But that sounds pretty fragile because some other
> backend might not try to load the module until after it's attached the
> DSM segment and it might then fail because loading the module runs
> _PG_init() which can throw errors.   Maybe you can think of a way to
> plug that hole too but you're way over your $8 budget by this
> point.

Agreed, those approaches seem like a non-starters.

>> My problem here is that I don't know how many batches I'll finish up
>> creating.  [...]
>
> I thought the idea was that the structure we're talking about here
> owns all the files, up to 2 from a leader that wandered off plus up to
> 2 for each worker.  Last process standing removes them.  Or are you
> saying each worker only needs 2 files but the leader needs a
> potentially unbounded number?

Yes, potentially unbounded in rare cases.  If we plan for N batches,
and then run out of work_mem because our estimates were just wrong or
the distribution of keys is sufficiently skewed, we'll run
HashIncreaseNumBatches, and that could happen more than once.

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-02-09 Thread Robert Haas
On Thu, Feb 9, 2017 at 5:09 PM, Thomas Munro
 wrote:
> I agree that the pinned segment case doesn't matter right now, I just
> wanted to point it out.  I like your $10 wrench analogy, but maybe it
> could be argued that adding a dsm_on_destroy() callback mechanism is
> not only better than adding another refcount to track that other
> refcount, but also a steal at only $8.

If it's that simple, it might be worth doing, but I bet it's not.  One
problem is that there's a race condition: there will inevitably be a
period of time after you've called dsm_attach() and before you've
attached to the specific data structure that we're talking about here.
So suppose the last guy who actually knows about this data structure
dies horribly and doesn't clean up because the DSM isn't being
destroyed; moments later, you die horribly before reaching the code
where you attach to this data structure.  Oops.

You might think about plugging that hole by moving the registry of
on-destroy functions into the segment itself and making it a shared
resource.  But ASLR breaks that, especially for loadable modules.  You
could try to fix that problem, in turn, by storing arguments that can
later be passed to load_external_function() instead of a function
pointer per se.  But that sounds pretty fragile because some other
backend might not try to load the module until after it's attached the
DSM segment and it might then fail because loading the module runs
_PG_init() which can throw errors.   Maybe you can think of a way to
plug that hole too but you're way over your $8 budget by this
point.

>>> Secondly, I might not want to be constrained by a
>>> fixed-sized DSM segment to hold my SharedBufFile objects... there are
>>> cases where I need to share a number of batch files that is unknown
>>> at the start of execution time when the DSM segment is sized (I'll
>>> write about that shortly on the Parallel Shared Hash thread).  Maybe I
>>> can find a way to get rid of that requirement.  Or maybe it could
>>> support DSA memory too, but I don't think it's possible to use
>>> on_dsm_detach-based cleanup routines that refer to DSA memory because
>>> by the time any given DSM segment's detach hook runs, there's no
>>> telling which other DSM segments have been detached already, so the
>>> DSA area may already have partially vanished; some other kind of hook
>>> that runs earlier would be needed...
>>
>> Again, wrench.
>
> My problem here is that I don't know how many batches I'll finish up
> creating.  In general that's OK because I can hold onto them as
> private BufFiles owned by participants with the existing cleanup
> mechanism, and then share them just before they need to be shared (ie
> when we switch to processing the next batch so they need to be
> readable by all).  Now I only ever share one inner and one outer batch
> file per participant at a time, and then I explicitly delete them at a
> time that I know to be safe and before I need to share a new file that
> would involve recycling the slot, and I'm relying on DSM segment scope
> cleanup only to handle error paths.  That means that in general I
> only need space for 2 * P shared BufFiles at a time.  But there is a
> problem case: when the leader needs to exit early, it needs to be able
> to transfer ownership of any files it has created, which could be more
> than we planned for, and then not participate any further in the hash
> join, so it can't participate in the on-demand sharing scheme.

I thought the idea was that the structure we're talking about here
owns all the files, up to 2 from a leader that wandered off plus up to
2 for each worker.  Last process standing removes them.  Or are you
saying each worker only needs 2 files but the leader needs a
potentially unbounded number?

> Perhaps we can find a way to describe a variable number of BufFiles
> (ie batches) in a fixed space by making sure the filenames are
> constructed in a way that lets us just have to say how many there are.

That could be done.

> Then the next problem is that for each BufFile we have to know how
> many 1GB segments there are to unlink (files named foo, foo.1, foo.2,
> ...), which Peter's code currently captures by publishing the file
> size in the descriptor... but if a fixed size object must describe N
> BufFiles, where can I put the size of each one?  Maybe I could put it
> in a header of the file itself (yuck!), or maybe I could decide that I
> don't care what the size is, I'll simply unlink "foo", then "foo.1",
> then "foo.2", ... until I get ENOENT.

There's nothing wrong with that algorithm as far as I'm concerned.
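
Just to make it concrete, something along these lines would be fine
(names invented here; error handling in the usual best-effort cleanup
style):

#include "postgres.h"

#include <errno.h>
#include <unistd.h>

/*
 * Best-effort removal of a shared BufFile's segment files: "base",
 * "base.1", "base.2", ..., stopping at the first missing segment.
 */
static void
remove_shared_buffile_segments(const char *base)
{
    char        path[MAXPGPATH];
    int         segno;

    for (segno = 0;; segno++)
    {
        if (segno == 0)
            snprintf(path, sizeof(path), "%s", base);
        else
            snprintf(path, sizeof(path), "%s.%d", base, segno);

        if (unlink(path) < 0)
        {
            if (errno != ENOENT)
                elog(LOG, "could not unlink file \"%s\": %m", path);
            break;
        }
    }
}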

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-02-09 Thread Thomas Munro
On Fri, Feb 10, 2017 at 9:51 AM, Robert Haas  wrote:
> On Wed, Feb 8, 2017 at 5:36 AM, Thomas Munro
>  wrote:
>> Thinking a bit harder about this, I suppose there could be a kind of
>> object called a SharedBufFileManager [... description of that ...].
>
> I think this is approximately reasonable, but I think it could be made
> simpler by having fewer separate objects.  Let's assume the leader can
> put an upper bound on the number of shared BufFiles at the time it's
> sizing the DSM segment (i.e. before InitializeParallelDSM).  Then it
> can allocate a big ol' array with a header indicating the array size
> and each element containing enough space to identify the relevant
> details of 1 shared BufFile.  Now you don't need to do any allocations
> later on, and you don't need a linked list.  You just loop over the
> array and do what needs doing.

Makes sense.

>> There are a couple of problems with the above though.  Firstly, doing
>> reference counting in DSM segment on-detach hooks is really a way to
>> figure out when the DSM segment is about to be destroyed by keeping a
>> separate refcount in sync with the DSM segment's refcount, but it
>> doesn't account for pinned DSM segments.  It's not your use-case or
>> mine currently, but someone might want a DSM segment to live even when
>> it's not attached anywhere, to be reattached later.  If we're trying
>> to use DSM segment lifetime as a scope, we'd be ignoring this detail.
>> Perhaps instead of adding our own refcount we need a new kind of hook
>> on_dsm_destroy.
>
> I think it's good enough to plan for current needs now.  It's not
> impossible to change this stuff later, but we need something that
> works robustly right now without being too invasive.  Inventing whole
> new system concepts because of stuff we might someday want to do isn't
> a good idea because we may easily guess wrong about what direction
> we'll want to go in the future.  This is more like building a wrench
> than a 747: a 747 needs to be extensible and reconfigurable and
> upgradable because it costs $350 million.   A wrench costs $10 at
> Walmart and if it turns out we bought the wrong one, we can just throw
> it out and get a different one later.

I agree that the pinned segment case doesn't matter right now, I just
wanted to point it out.  I like your $10 wrench analogy, but maybe it
could be argued that adding a dsm_on_destroy() callback mechanism is
not only better than adding another refcount to track that other
refcount, but also a steal at only $8.

>> Secondly, I might not want to be constrained by a
>> fixed-sized DSM segment to hold my SharedBufFile objects... there are
>> cases where I need to share a number of batch files that is unknown
>> at the start of execution time when the DSM segment is sized (I'll
>> write about that shortly on the Parallel Shared Hash thread).  Maybe I
>> can find a way to get rid of that requirement.  Or maybe it could
>> support DSA memory too, but I don't think it's possible to use
>> on_dsm_detach-based cleanup routines that refer to DSA memory because
>> by the time any given DSM segment's detach hook runs, there's no
>> telling which other DSM segments have been detached already, so the
>> DSA area may already have partially vanished; some other kind of hook
>> that runs earlier would be needed...
>
> Again, wrench.

My problem here is that I don't know how many batches I'll finish up
creating.  In general that's OK because I can hold onto them as
private BufFiles owned by participants with the existing cleanup
mechanism, and then share them just before they need to be shared (ie
when we switch to processing the next batch so they need to be
readable by all).  Now I only ever share one inner and one outer batch
file per participant at a time, and then I explicitly delete them at a
time that I know to be safe and before I need to share a new file that
would involve recycling the slot, and I'm relying on DSM segment scope
cleanup only to handle error paths.  That means that in general I
only need space for 2 * P shared BufFiles at a time.  But there is a
problem case: when the leader needs to exit early, it needs to be able
to transfer ownership of any files it has created, which could be more
than we planned for, and then not participate any further in the hash
join, so it can't participate in the on-demand sharing scheme.

Perhaps we can find a way to describe a variable number of BufFiles
(ie batches) in a fixed space by making sure the filenames are
constructed in a way that lets us just have to say how many there are.
Then the next problem is that for each BufFile we have to know how
many 1GB segments there are to unlink (files named foo, foo.1, foo.2,
...), which Peter's code currently captures by publishing the file
size in the descriptor... but if a fixed size object must describe N
BufFiles, where can I put the size of each one?  Maybe I could put it
in a header of the file itself (yuck!), or maybe I could decide that I
don't care what the size is, I'll simply unlink "foo", then "foo.1",
then "foo.2", ... until I get ENOENT.
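
One way the name-based approach could look, with an entirely made-up
name pattern:

#include <stdio.h>

/*
 * Invented name pattern: batch file 'batchno' belonging to
 * 'participant', within a shared fileset identified by 'setid'
 * (e.g. the leader's PID).
 */
static void
shared_batch_file_name(char *name, size_t len,
                       int setid, int participant, int batchno)
{
    snprintf(name, len, "hjset%d.p%d.b%d", setid, participant, batchno);
}

Then the fixed-size shared state only needs to record the number of
participants and the number of batches, and deleting everything is two
nested loops plus the unlink-until-ENOENT trick for the per-file
segment suffixes.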

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-02-09 Thread Robert Haas
On Wed, Feb 8, 2017 at 5:36 AM, Thomas Munro
 wrote:
> Thinking a bit harder about this, I suppose there could be a kind of
> object called a SharedBufFileManager (insert better name) which you
> can store in a DSM segment.  The leader backend that initialises a DSM
> segment containing one of these would then call a constructor function
> that sets an internal refcount to 1 and registers an on_dsm_detach
> callback for its on-detach function.  All worker backends that attach
> to the DSM segment would need to call an attach function for the
> SharedBufFileManager to increment a refcount and also register the
> on_dsm_detach callback, before any chance that an error might be
> thrown (is that difficult?); failure to do so could result in file
> leaks.  Then, when a BufFile is to be shared (AKA exported, made
> unifiable), a SharedBufFile object can be initialised somewhere in the
> same DSM segment and registered with the SharedBufFileManager.
> Internally all registered SharedBufFile objects would be linked
> together using offsets from the start of the DSM segment for link
> pointers.  Now when SharedBufFileManager's on-detach function runs, it
> decrements the refcount in the SharedBufFileManager, and if that
> reaches zero then it runs a destructor that spins through the list of
> SharedBufFile objects deleting files that haven't already been deleted
> explicitly.

I think this is approximately reasonable, but I think it could be made
simpler by having fewer separate objects.  Let's assume the leader can
put an upper bound on the number of shared BufFiles at the time it's
sizing the DSM segment (i.e. before InitializeParallelDSM).  Then it
can allocate a big ol' array with a header indicating the array size
and each element containing enough space to identify the relevant
details of 1 shared BufFile.  Now you don't need to do any allocations
later on, and you don't need a linked list.  You just loop over the
array and do what needs doing.
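
Roughly, and with all field names invented on the spot, I'm imagining a
layout like this, sized before InitializeParallelDSM() via
shm_toc_estimate_chunk():

#include "postgres.h"
#include "storage/spin.h"

typedef struct SharedBufFileSlot
{
    bool        in_use;
    int         participant;    /* which backend created it */
    pid_t       creator_pid;    /* enough to reconstruct the on-disk name */
    long        nsegments;      /* how many 1GB segment files to unlink */
} SharedBufFileSlot;

typedef struct SharedBufFileArray
{
    slock_t     mutex;
    int         nslots;         /* fixed when the DSM segment is sized */
    SharedBufFileSlot slots[FLEXIBLE_ARRAY_MEMBER];
} SharedBufFileArray;

/* Space to reserve in the segment for nslots entries. */
#define SHARED_BUFFILE_ARRAY_SIZE(nslots) \
    (offsetof(SharedBufFileArray, slots) + \
     (nslots) * sizeof(SharedBufFileSlot))

Cleanup is then just a loop from 0 to nslots - 1, unlinking whatever is
marked in_use.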

> There are a couple of problems with the above though.  Firstly, doing
> reference counting in DSM segment on-detach hooks is really a way to
> figure out when the DSM segment is about to be destroyed by keeping a
> separate refcount in sync with the DSM segment's refcount, but it
> doesn't account for pinned DSM segments.  It's not your use-case or
> mine currently, but someone might want a DSM segment to live even when
> it's not attached anywhere, to be reattached later.  If we're trying
> to use DSM segment lifetime as a scope, we'd be ignoring this detail.
> Perhaps instead of adding our own refcount we need a new kind of hook
> on_dsm_destroy.

I think it's good enough to plan for current needs now.  It's not
impossible to change this stuff later, but we need something that
works robustly right now without being too invasive.  Inventing whole
new system concepts because of stuff we might someday want to do isn't
a good idea because we may easily guess wrong about what direction
we'll want to go in the future.  This is more like building a wrench
than a 747: a 747 needs to be extensible and reconfigurable and
upgradable because it costs $350 million.   A wrench costs $10 at
Walmart and if it turns out we bought the wrong one, we can just throw
it out and get a different one later.

> Secondly, I might not want to be constrained by a
> fixed-sized DSM segment to hold my SharedBufFile objects... there are
> cases where I need to share a number of batch files that is unknown
> at the start of execution time when the DSM segment is sized (I'll
> write about that shortly on the Parallel Shared Hash thread).  Maybe I
> can find a way to get rid of that requirement.  Or maybe it could
> support DSA memory too, but I don't think it's possible to use
> on_dsm_detach-based cleanup routines that refer to DSA memory because
> by the time any given DSM segment's detach hook runs, there's no
> telling which other DSM segments have been detached already, so the
> DSA area may already have partially vanished; some other kind of hook
> that runs earlier would be needed...

Again, wrench.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-02-08 Thread Thomas Munro
On Wed, Feb 8, 2017 at 8:40 PM, Thomas Munro
 wrote:
> On Tue, Feb 7, 2017 at 5:43 PM, Peter Geoghegan  wrote:
>> Does anyone have any suggestions on how to tackle this?
>
> Hmm.  One approach might be like this:
>
> [hand-wavy stuff]

Thinking a bit harder about this, I suppose there could be a kind of
object called a SharedBufFileManager (insert better name) which you
can store in a DSM segment.  The leader backend that initialises a DSM
segment containing one of these would then call a constructor function
that sets an internal refcount to 1 and registers an on_dsm_detach
callback for its on-detach function.  All worker backends that attach
to the DSM segment would need to call an attach function for the
SharedBufFileManager to increment a refcount and also register the
on_dsm_detach callback, before any chance that an error might be
thrown (is that difficult?); failure to do so could result in file
leaks.  Then, when a BufFile is to be shared (AKA exported, made
unifiable), a SharedBufFile object can be initialised somewhere in the
same DSM segment and registered with the SharedBufFileManager.
Internally all registered SharedBufFile objects would be linked
together using offsets from the start of the DSM segment for link
pointers.  Now when SharedBufFileManager's on-detach function runs, it
decrements the refcount in the SharedBufFileManager, and if that
reaches zero then it runs a destructor that spins through the list of
SharedBufFile objects deleting files that haven't already been deleted
explicitly.
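
The offset-based linking could be as simple as something like this
(sketch only, invented names):

#include "postgres.h"
#include "storage/dsm.h"

/*
 * Link SharedBufFile objects by offset from the DSM segment's base
 * address, since each backend maps the segment at a different address.
 */
typedef struct SharedBufFile
{
    Size        next_offset;    /* offset of next object, or 0 for end */
    /* ... identity of the underlying temp file(s) ... */
} SharedBufFile;

static SharedBufFile *
shared_buffile_next(dsm_segment *seg, SharedBufFile *file)
{
    if (file->next_offset == 0)
        return NULL;
    return (SharedBufFile *)
        ((char *) dsm_segment_address(seg) + file->next_offset);
}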

I retract the pin/unpin and per-file refcounting stuff I mentioned
earlier.  You could make the default that all files registered with a
SharedBufFileManager survive until the containing DSM segment is
detached everywhere using that single refcount in the
SharedBufFileManager object, but also provide a 'no really delete this
particular shared file now' operation for client code that knows it's
safe to do that sooner (which would be the case for me, I think).  I
don't think per-file refcounts are needed.

There are a couple of problems with the above though.  Firstly, doing
reference counting in DSM segment on-detach hooks is really a way to
figure out when the DSM segment is about to be destroyed by keeping a
separate refcount in sync with the DSM segment's refcount, but it
doesn't account for pinned DSM segments.  It's not your use-case or
mine currently, but someone might want a DSM segment to live even when
it's not attached anywhere, to be reattached later.  If we're trying
to use DSM segment lifetime as a scope, we'd be ignoring this detail.
Perhaps instead of adding our own refcount we need a new kind of hook
on_dsm_destroy.  Secondly, I might not want to be constrained by a
fixed-sized DSM segment to hold my SharedBufFile objects... there are
cases where I need to share a number of batch files that is unknown
at the start of execution time when the DSM segment is sized (I'll
write about that shortly on the Parallel Shared Hash thread).  Maybe I
can find a way to get rid of that requirement.  Or maybe it could
support DSA memory too, but I don't think it's possible to use
on_dsm_detach-based cleanup routines that refer to DSA memory because
by the time any given DSM segment's detach hook runs, there's no
telling which other DSM segments have been detached already, so the
DSA area may already have partially vanished; some other kind of hook
that runs earlier would be needed...

Hmm.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-02-07 Thread Thomas Munro
On Tue, Feb 7, 2017 at 5:43 PM, Peter Geoghegan  wrote:
> However, there are some specific implementation issues with this that
> I didn't quite anticipate. I would like to get feedback on these
> issues now, from both Thomas and Robert. The issues relate to how much
> the patch can or should "buy into resource management". You might
> guess that this new resource management code is something that should
> live in fd.c, alongside the guts of temp file resource management,
> within the function FileClose(). That way, it would be called by every
> possible path that might delete a temp file, including
> ResourceOwnerReleaseInternal(). That's not what I've done, though.
> Instead, refcount management is limited to a few higher level routines
> in buffile.c. Initially, resource management in FileClose() is made to
> assume that it must delete the file. Then, if and when directed to by
> BufFileClose()/refcount, a backend may determine that it is not its
> job to do the deletion -- it will not be the one that must "turn out
> the lights", and so indicates to FileClose() that it should not delete
> the file after all (it should just release vFDs, close(), and so on).
> Otherwise, when refcount reaches zero, temp files are deleted by
> FileClose() in more or less the conventional manner.
>
> The fact that there could, in general, be any error that causes us to
> attempt a double-deletion (deletion of a temp file from more than one
> backend) for a time is less of a problem than you might think. This is
> because there is a risk of this only for as long as two backends hold
> open the file at the same time. In the case of parallel CREATE INDEX,
> this is now the shortest possible period of time, since workers close
> their files using BufFileClose() immediately after the leader wakes
> them up from a quiescent state. And, if that were to actually happen,
> say due to some random OOM error during that small window, the
> consequence is no worse than an annoying log message: "could not
> unlink file..." (this would come from the second backend that
> attempted an unlink()). You would not see this when a worker raised an
> error due to a duplicate violation, or any other routine problem, so
> it should really be almost impossible.
>
> That having been said, this probably *is* a problematic restriction in
> cases where a temp file's ownership is not immediately handed over
> without concurrent sharing. What happens to be a small window for the
> parallel CREATE INDEX patch probably wouldn't be a small window for
> parallel hash join.   :-(
>
> It's not hard to see why I would like to do things this way. Just look
> at ResourceOwnerReleaseInternal(). Any release of a file happens
> during RESOURCE_RELEASE_AFTER_LOCKS, whereas the release of dynamic
> shared memory segments happens earlier, during
> RESOURCE_RELEASE_BEFORE_LOCKS. ISTM that the only sensible way to
> implement a refcount is using dynamic shared memory, and that seems
> hard. There are additional reasons why I suggest we go this way, such
> as the fact that all the relevant state belongs to BufFile, which is
> implemented a layer above all of the guts of resource management of
> temp files within fd.c. I'd have to replicate almost all state in fd.c
> to make it all work, which seems like a big modularity violation.
>
> Does anyone have any suggestions on how to tackle this?

Hmm.  One approach might be like this:

1.  There is a shared refcount which is incremented when you open a
shared file and decremented if you optionally explicitly 'release' it.
(Not when you close it, because we can't allow code that may be run
during RESOURCE_RELEASE_AFTER_LOCKS to try to access the DSM segment
after it has been unmapped; more generally, creating destruction order
dependencies between different kinds of resource-manager-cleaned-up
objects seems like a bad idea.  Of course the close code still looks
after closing the vfds in the local backend.)

2.  If you want to hand the file over to some other process and exit,
you probably want to avoid race conditions or extra IPC burden.  To
achieve that you could 'pin' the file, so that it survives even while
not open in any backend.

3.  If the refcount reaches zero when you 'release' and the file isn't
'pinned', then you must delete the underlying files.

4.  When the DSM segment is detached, we spin through all associated
shared files that we're still 'attached' to (ie opened but didn't
release) and decrement the refcount.  If any shared file's refcount
reaches zero its files should be deleted, even if was 'pinned'.

In other words, the associated DSM segment's lifetime is the maximum
lifetime of shared files, but it can be shorter if you 'release' in
all backends and don't 'pin'.  It's up to client code to come up with
some scheme to make that work, if it doesn't take the easy route of
pinning until DSM segment destruction.
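
In code, the 'release' step (3) might look something like this, with
the pin flag taken into account (all names invented):

#include "postgres.h"
#include "storage/spin.h"

#include <unistd.h>

typedef struct SharedFile
{
    slock_t     mutex;
    int         refcnt;         /* ++ on open, -- on release (point 1) */
    bool        pinned;         /* survives refcnt == 0 (point 2) */
} SharedFile;

static void
shared_file_release(SharedFile *sf, const char *path)
{
    bool        delete_now;

    SpinLockAcquire(&sf->mutex);
    delete_now = (--sf->refcnt == 0 && !sf->pinned);    /* point 3 */
    SpinLockRelease(&sf->mutex);

    if (delete_now)
        unlink(path);           /* plus any extra 1GB segment files */
}

The on_dsm_detach path (point 4) would do the same decrement for any
file still 'attached', but delete at zero regardless of the pin.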

I think in your case you'd simply pin all the BufFiles allowing
workers to exit 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-02-06 Thread Peter Geoghegan
On Mon, Jan 30, 2017 at 9:15 PM, Peter Geoghegan  wrote:
>> IIUC worker_wait() is only being used to keep the worker around so its
>> files aren't deleted.  Once buffile cleanup is changed to be
>> ref-counted (in an on_dsm_detach hook?) then workers might as well
>> exit sooner, freeing up a worker slot... do I have that right?
>
> Yes. Or at least I think it's very likely that that will end up happening.

I've looked into this, and have a version of the patch where clean-up
occurs when the last backend with a reference to the BufFile goes
away. It seems robust; all of my private tests pass, which includes
things that parallel CREATE INDEX won't use, and yet is added as
infrastructure (e.g., randomAccess recycling of blocks by leader from
workers).

As Thomas anticipated, worker_wait() now only makes workers wait until
the leader comes along to take a reference to their files, at which
point the worker processes can go away. In effect, the worker
processes go away as soon as possible, just as the leader begins its
final on-the-fly merge. At that point, they could be reused by some
other process, of course.

However, there are some specific implementation issues with this that
I didn't quite anticipate. I would like to get feedback on these
issues now, from both Thomas and Robert. The issues relate to how much
the patch can or should "buy into resource management". You might
guess that this new resource management code is something that should
live in fd.c, alongside the guts of temp file resource management,
within the function FileClose(). That way, it would be called by every
possible path that might delete a temp file, including
ResourceOwnerReleaseInternal(). That's not what I've done, though.
Instead, refcount management is limited to a few higher level routines
in buffile.c. Initially, resource management in FileClose() is made to
assume that it must delete the file. Then, if and when directed to by
BufFileClose()/refcount, a backend may determine that it is not its
job to do the deletion -- it will not be the one that must "turn out
the lights", and so indicates to FileClose() that it should not delete
the file after all (it should just release vFDs, close(), and so on).
Otherwise, when refcount reaches zero, temp files are deleted by
FileClose() in more or less the conventional manner.

The fact that there could, in general, be any error that causes us to
attempt a double-deletion (deletion of a temp file from more than one
backend) for a time is less of a problem than you might think. This is
because there is a risk of this only for as long as two backends hold
open the file at the same time. In the case of parallel CREATE INDEX,
this is now the shortest possible period of time, since workers close
their files using BufFileClose() immediately after the leader wakes
them up from a quiescent state. And, if that were to actually happen,
say due to some random OOM error during that small window, the
consequence is no worse than an annoying log message: "could not
unlink file..." (this would come from the second backend that
attempted an unlink()). You would not see this when a worker raised an
error due to a duplicate violation, or any other routine problem, so
it should really be almost impossible.

That having been said, this probably *is* a problematic restriction in
cases where a temp file's ownership is not immediately handed over
without concurrent sharing. What happens to be a small window for the
parallel CREATE INDEX patch probably wouldn't be a small window for
parallel hash join.   :-(

It's not hard to see why I would like to do things this way. Just look
at ResourceOwnerReleaseInternal(). Any release of a file happens
during RESOURCE_RELEASE_AFTER_LOCKS, whereas the release of dynamic
shared memory segments happens earlier, during
RESOURCE_RELEASE_BEFORE_LOCKS. ISTM that the only sensible way to
implement a refcount is using dynamic shared memory, and that seems
hard. There are additional reasons why I suggest we go this way, such
as the fact that all the relevant state belongs to BufFile, which is
implemented a layer above all of the guts of resource management of
temp files within fd.c. I'd have to replicate almost all state in fd.c
to make it all work, which seems like a big modularity violation.

Does anyone have any suggestions on how to tackle this?

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-02-03 Thread Peter Geoghegan
On Fri, Feb 3, 2017 at 4:15 PM, Thomas Munro
 wrote:
>> I suspect that this system isn't particularly well balanced for the
>> task of benchmarking the patch. You would probably see notably better
>> scalability than any you've shown in any test if you could add
>> additional sequential I/O bandwidth, which is probably an economical,
>> practical choice for many users. I suspect that you aren't actually
>> saturating available CPUs to the greatest extent that the
>> implementation makes possible.
>
> I will look into what IO options I can access before running larger
> tests.  Also I will look into running the test with both cold and warm
> caches (ie "echo 1 > /proc/sys/vm/drop_caches") so that read bandwidth
> enters the picture.

It might just have been that the table was too small to be an
effective target for parallel sequential scan with so many workers,
and so a presorted best case CREATE INDEX, which isn't that different,
also fails to see much benefit (compared to what you'd see with a
similar case involving a larger table). In other words, I might have
jumped the gun in emphasizing issues with hardware and I/O bandwidth
over issues around data volume (that I/O parallelism is inherently not
very helpful with these relatively small tables).

As I've pointed out a couple of times before, bigger sorts will be
more CPU bound because sorting itself has costs that grow
linearithmically, whereas writing out runs has costs that grow
linearly. The relative cost of the I/O can be expected to go down as
input goes up for this reason. At the same time, a larger input might
make better use of I/O parallelism, which reduces the cost paid in
latency to write out runs in absolute terms.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-02-03 Thread Thomas Munro
On Sat, Feb 4, 2017 at 11:58 AM, Peter Geoghegan  wrote:
> On Fri, Feb 3, 2017 at 5:04 AM, Thomas Munro
>  wrote:
>> 1. 'asc' = pre-sorted data (w = 0 shows time in seconds, other columns
>> show speed-up relative to that time):
>>
>>  mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
>> -----+--------+-------+-------+-------+-------+-------+-------+-------+-------
>>1 | 119.97 | 4.61x | 4.83x | 5.32x | 5.61x | 5.88x | 6.10x | 6.18x | 6.09x
>>   64 |  19.42 | 1.18x | 1.10x | 1.23x | 1.23x | 1.16x | 1.19x | 1.20x | 1.21x
>>  256 |  18.35 | 1.02x | 0.92x | 0.98x | 1.02x | 1.06x | 1.07x | 1.08x | 1.10x
>>  512 |  17.75 | 1.01x | 0.89x | 0.95x | 0.99x | 1.02x | 1.05x | 1.06x | 1.07x
>
> I think that this presorted case doesn't improve much because the
> sorting itself is so cheap, as explained in my last mail. However, the
improvement as workers are added is still smaller than expected. I
> think that this indicates that there isn't enough I/O capacity
> available here to truly show the full potential of the patch -- I've
> certainly seen better scalability for cases like this when there is a
> lot of I/O bandwidth available, and I/O parallelism is there to be
> taken advantage of. Say, when using a system with a large RAID array
> (I used a RAID0 array with 12 HDDs for my own tests). Another issue is
> that you probably don't have enough data here to really show off the
> patch. I don't want to dismiss the benchmark, which is still quite
> informative, but it's worth pointing out that the feature is going to
> be most compelling for very large indexes, that will take at least
> several minutes to build under any circumstances. (Having a
> reproducible case is also important, which is something that what you
> have here has going for it, on the other hand.)

Right.  My main reason for starting smallish was to allow me to search
a space with several variables without waiting eons.  Next I would
like to run a small subset of those tests with, say, 10, 20 or even
100 times more data loaded, so the tables would be ~20GB, ~40GB or
~200GB.

About read bandwidth:  It shouldn't have been touching the disk at all
for reads: I did a dummy run of the index build before the measured
runs, so that a 2GB table being sorted in ~2 minutes would certainly
have come entirely from the OS page cache since the machine has oodles
of RAM.

About write bandwidth:  The WAL, the index and the temp files all went
to an SSD array, though I don't have the characteristics of that to
hand.  I should also be able to test on multi-spindle HDD array.  I
doubt either can touch your 12 way RAID0 array, but will look into
that.

> I suspect that this system isn't particularly well balanced for the
> task of benchmarking the patch. You would probably see notably better
> scalability than any you've shown in any test if you could add
> additional sequential I/O bandwidth, which is probably an economical,
> practical choice for many users. I suspect that you aren't actually
> saturating available CPUs to the greatest extent that the
> implementation makes possible.

I will look into what IO options I can access before running larger
tests.  Also I will look into running the test with both cold and warm
caches (ie "echo 1 > /proc/sys/vm/drop_caches") so that read bandwidth
enters the picture.

> Another thing I want to point out is that with 1MB of
> maintenance_work_mem, the patch appears to do very well, but that
> isn't terribly meaningful. I would suggest that we avoid testing this
> patch with such a low amount of memory -- it doesn't seem important.
> This is skewed by the fact that you're using replacement selection in
> the serial case only. I think what this actually demonstrates is that
> replacement selection is very slow, even with its putative best case.
> I believe that commit 2459833 was the final nail in the coffin of
> replacement selection. I certainly don't want to relitigate the
> discussion on replacement_sort_tuples, and am not going to push too
> hard, but ISTM that we should fully remove replacement selection from
> tuplesort.c and be done with it.

Interesting.  I haven't grokked this but will go and read about it.

Based on your earlier comments about banana skin effects, I'm
wondering if it would be interesting to add a couple more heap
distributions to the test set that are almost completely sorted except
for a few entries out of order.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-02-03 Thread Peter Geoghegan
On Fri, Feb 3, 2017 at 5:04 AM, Thomas Munro
 wrote:
> 1. 'asc' = pre-sorted data (w = 0 shows time in seconds, other columns
> show speed-up relative to that time):
>
>  mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
> -----+--------+-------+-------+-------+-------+-------+-------+-------+-------
>1 | 119.97 | 4.61x | 4.83x | 5.32x | 5.61x | 5.88x | 6.10x | 6.18x | 6.09x
>   64 |  19.42 | 1.18x | 1.10x | 1.23x | 1.23x | 1.16x | 1.19x | 1.20x | 1.21x
>  256 |  18.35 | 1.02x | 0.92x | 0.98x | 1.02x | 1.06x | 1.07x | 1.08x | 1.10x
>  512 |  17.75 | 1.01x | 0.89x | 0.95x | 0.99x | 1.02x | 1.05x | 1.06x | 1.07x

I think that this presorted case doesn't improve much because the
sorting itself is so cheap, as explained in my last mail. However, the
improvement as workers are added is still smaller than expected. I
think that this indicates that there isn't enough I/O capacity
available here to truly show the full potential of the patch -- I've
certainly seen better scalability for cases like this when there is a
lot of I/O bandwidth available, and I/O parallelism is there to be
taken advantage of. Say, when using a system with a large RAID array
(I used a RAID0 array with 12 HDDs for my own tests). Another issue is
that you probably don't have enough data here to really show off the
patch. I don't want to dismiss the benchmark, which is still quite
informative, but it's worth pointing out that the feature is going to
be most compelling for very large indexes, that will take at least
several minutes to build under any circumstances. (Having a
reproducible case is also important, which is something that what you
have here has going for it, on the other hand.)

I suspect that this system isn't particularly well balanced for the
task of benchmarking the patch. You would probably see notably better
scalability than any you've shown in any test if you could add
additional sequential I/O bandwidth, which is probably an economical,
practical choice for many users. I suspect that you aren't actually
saturating available CPUs to the greatest extent that the
implementation makes possible.

Another thing I want to point out is that with 1MB of
maintenance_work_mem, the patch appears to do very well, but that
isn't terribly meaningful. I would suggest that we avoid testing this
patch with such a low amount of memory -- it doesn't seem important.
This is skewed by the fact that you're using replacement selection in
the serial case only. I think what this actually demonstrates is that
replacement selection is very slow, even with its putative best case.
I believe that commit 2459833 was the final nail in the coffin of
replacement selection. I certainly don't want to relitigate the
discussion on replacement_sort_tuples, and am not going to push too
hard, but ISTM that we should fully remove replacement selection from
tuplesort.c and be done with it.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-02-03 Thread Peter Geoghegan
On Fri, Feb 3, 2017 at 5:04 AM, Thomas Munro
 wrote:
> I applied your v2 patch on top of
> 7ac4a389a7dbddaa8b19deb228f0a988e79c5795^ to avoid a conflict.  It
> still had a couple of harmless conflicts that I was able to deal with
> (not code, just some header stuff moving around).

 You must mean my V7 patch. FWIW, I've resolved the conflicts with
7ac4a389a7dbddaa8b19deb228f0a988e79c5795 in my own private branch, and
have worked through some of the open items that you raised.

> See full results from all permutations attached, but I wanted to
> highlight the measurements from 'textwide', 'u', 'nonu' which show
> interesting 'asc' numbers (data already sorted).  The 'mem' column is
> maintenance_work_mem in megabytes.  The 'w = 0' column shows the time
> in seconds for parallel_workers = 0.  The other 'w = N' columns show
> times with higher parallel_workers settings, represented as speed-up
> relative to the 'w = 0' time.

The thing to keep in mind about testing presorted cases in tuplesort
in general is that we have this weird precheck for presorted input in
our qsort. This is something added by us to the original Bentley &
McIlroy algorithm in 2006. I am very skeptical of this addition, in
general. It tends to have the effect of highly distorting how
effective most optimizations are for presorted cases, which comes up
again and again. It only works when the input is *perfectly*
presorted, but can throw away an enormous amount of work when the last
tuple of input is out of order -- that will throw away all work before
that point (not so bad when you think your main cost is comparisons
rather than memory accesses, but that isn't the case).
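
For reference, the precheck amounts to a single linear scan before the
real sort begins -- roughly this, paraphrasing src/port/qsort.c rather
than quoting it (the function name is mine, not the code's):

#include <stdbool.h>
#include <stddef.h>

/* Returns true only if the input is *perfectly* sorted already. */
static bool
input_is_presorted(const char *a, size_t n, size_t es,
                   int (*cmp) (const void *, const void *))
{
    const char *pm;

    for (pm = a + es; pm < a + n * es; pm += es)
    {
        /* One inversion anywhere throws away the entire scan's work. */
        if (cmp(pm - es, pm) > 0)
            return false;
    }
    return true;
}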

Your baseline case can either be made unrealistically fast due to the
fact that you get a perfectly sympathetic case for this optimization,
or unrealistically slow (very CPU bound) due to the fact that you have
that one last tuple out of place. I once said that this last tuple can
act like a discarded banana skin.

There is nothing wrong with the idea of exploiting presortedness, and
to some extent the original algorithm does that (by using insertion
sort), but an optimization along the lines of Timsort's "galloping
mode" (which is what this modification of ours attempts) requires
non-trivial bookkeeping to do right.

-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-02-03 Thread Thomas Munro
On Wed, Feb 1, 2017 at 8:46 PM, Peter Geoghegan  wrote:
> On Tue, Jan 31, 2017 at 11:23 PM, Thomas Munro
>  wrote:
>> So I'm in favour of this patch, which is relatively simple and gives us
>> faster index builds soon.  Eventually we might also be able to have
>> approach 1.  From what I gather, it's entirely possible that we might
>> still need 2 to fall back on in some cases.
>
> Right. And it can form the basis of an implementation of 1, which in
> any case seems to be much more compelling for parallel query, when a
> great deal more can be pushed down, and we are not particularly likely
> to be I/O bound (usually not much writing to the heap, or WAL
> logging).

I ran some tests today.  First I created test tables representing the
permutations of these choices:

Table structure:

  int = Integer key only
  intwide = Integer key + wide row
  text = Text key only (using dictionary words)
  textwide = Text key + wide row

Uniqueness:

  u = each value unique
  d = 10 duplicates of each value

Heap physical order:

  rand = Random
  asc = Ascending order (already sorted)
  desc = Descending order (sorted backwards)

I used 10 million rows for this test run, so that gave me 24 tables of
the following sizes as reported in "\d+":

  int tables = 346MB each
  intwide tables = 1817MB each
  text tables = 441MB each
  textwide tables = 1953MB each

It'd be interesting to test larger tables of course but I had a lot of
permutations to get through.

For each of those tables I ran tests corresponding to the permutations
of these three variables:

Index type:

  uniq = CREATE UNIQUE INDEX ("u" tables only, ie no duplicates)
  nonu = CREATE INDEX ("u" and "d" tables)

Maintenance memory: 1M, 64MB, 256MB, 512MB

Workers: from 0 up to 8

Environment:  EDB test machine "cthulhu", Intel(R) Xeon(R) CPU E7-
8830  @ 2.13GHz, 8 socket, 8 cores (16 threads) per socket, CentOS
7.2, Linux kernel 3.10.0-229.7.2.el7.x86_64, 512GB RAM, pgdata on SSD.
Database initialised with en_US.utf-8 collation, all defaults except
max_wal_size increased to 4GB (otherwise warnings about too frequent
checkpoints) and max_parallel_workers_maintenance = 8.  Testing done
with warm OS cache.

I applied your v2 patch on top of
7ac4a389a7dbddaa8b19deb228f0a988e79c5795^ to avoid a conflict.  It
still had a couple of harmless conflicts that I was able to deal with
(not code, just some header stuff moving around).

See full results from all permutations attached, but I wanted to
highlight the measurements from 'textwide', 'u', 'nonu' which show
interesting 'asc' numbers (data already sorted).  The 'mem' column is
maintenance_work_mem in megabytes.  The 'w = 0' column shows the time
in seconds for parallel_workers = 0.  The other 'w = N' columns show
times with higher parallel_workers settings, represented as speed-up
relative to the 'w = 0' time.

1. 'asc' = pre-sorted data (w = 0 shows time in seconds, other columns
show speed-up relative to that time):

 mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+--------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 119.97 | 4.61x | 4.83x | 5.32x | 5.61x | 5.88x | 6.10x | 6.18x | 6.09x
  64 |  19.42 | 1.18x | 1.10x | 1.23x | 1.23x | 1.16x | 1.19x | 1.20x | 1.21x
 256 |  18.35 | 1.02x | 0.92x | 0.98x | 1.02x | 1.06x | 1.07x | 1.08x | 1.10x
 512 |  17.75 | 1.01x | 0.89x | 0.95x | 0.99x | 1.02x | 1.05x | 1.06x | 1.07x

2. 'rand' = randomised data:

 mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+--------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 130.25 | 1.82x | 2.19x | 2.52x | 2.58x | 2.72x | 2.72x | 2.83x | 2.89x
  64 | 117.36 | 1.80x | 2.20x | 2.43x | 2.47x | 2.55x | 2.51x | 2.59x | 2.69x
 256 | 124.68 | 1.87x | 2.20x | 2.49x | 2.52x | 2.64x | 2.70x | 2.72x | 2.75x
 512 | 115.77 | 1.51x | 1.72x | 2.14x | 2.08x | 2.19x | 2.31x | 2.44x | 2.48x

3. 'desc' = reverse-sorted data:

 mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+--------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 115.19 | 1.88x | 2.39x | 2.78x | 3.50x | 3.62x | 4.20x | 4.19x | 4.39x
  64 | 112.17 | 1.85x | 2.25x | 2.99x | 3.63x | 3.65x | 4.01x | 4.31x | 4.62x
 256 | 119.55 | 1.76x | 2.21x | 2.85x | 3.43x | 3.37x | 3.77x | 4.24x | 4.28x
 512 | 119.50 | 1.85x | 2.19x | 2.87x | 3.26x | 3.28x | 3.74x | 4.24x | 3.93x

The 'asc' effects are much less pronounced when the key is an int.
Here is the equivalent data for 'intwide', 'u', 'nonu':

1.  'asc'

 mem | w = 0 | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+-------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 12.19 | 1.55x | 1.93x | 2.21x | 2.44x | 2.64x | 2.76x | 2.91x | 2.83x
  64 |  7.35 | 1.29x | 1.53x | 1.69x | 1.86x | 1.98x | 2.04x | 2.07x | 2.09x
 256 |  7.34 | 1.26x | 1.47x | 1.64x | 1.79x | 1.92x | 1.96x | 1.98x | 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-01-31 Thread Peter Geoghegan
On Tue, Jan 31, 2017 at 11:23 PM, Thomas Munro
 wrote:
> 2.  All participants: parallel sequential scan, sort, spool to disk;
> barrier; leader: merge spooled tuples and build btree.
>
> This patch is doing the 2nd thing.  My understanding is that some
> systems might choose to do that if they don't have or don't like the
> table's statistics, since repartitioning for balanced load requires
> carefully chosen ranges and is highly sensitive to distribution
> problems.

The second thing here seems to offer comparable scalability to other
systems' implementations of the first thing. They seem to have reused
"partitioning to sort in parallel" for B-Tree builds, at least in some
cases, despite this. WAL logging is the biggest serial bottleneck here
for other systems, I've heard -- that's still going to be pretty much
serial.

I think that the fact that some systems do partitioning for parallel
B-Tree builds might have as much to do with their ability to create
B-Tree indexes in place as anything else. Apparently, some systems
don't use temp files, instead writing out what is for all intents and
purposes part of a finished B-Tree as runs (no use of
temp_tablespaces). That may be a big part of what makes it worthwhile
to try to use partitioning. I understand that only the highest client
counts will see much direct performance benefit relative to the first
approach.

> It's pretty clear that approach 1 is a difficult project.  From my
> research into dynamic repartitioning in the context of hash joins, I
> can see that that infrastructure is a significant project in its own
> right: subproblems include super efficient tuple exchange, buffering,
> statistics/planning and dealing with/adapting to bad outcomes.  I also
> suspect that repartitioning operators might need to be specialised for
> different purposes like sorting vs hash joins, which may have
> differing goals.  I think it's probably easy to build a slow dynamic
> repartitioning mechanism that frequently results in terrible worst
> case scenarios where you paid a fortune in IPC overheads and still
> finished up with one worker pulling most of the whole load.  Without
> range partitioning, I don't believe you can merge the resulting
> non-disjoint btrees efficiently so you'd probably finish up writing a
> complete new btree to mash them together.  As for merging disjoint
> btrees, I assume there are ways to do a structure-preserving merge
> that just rebuilds some internal pages and incorporates the existing
> leaf pages directly, a bit like tree manipulation in functional
> programming languages; that'll take some doing.

I agree with all that. "Stitching together" disjoint B-Trees does seem
to have some particular risks, which users of other systems are
cautioned against in their documentation. You can end up with an
unbalanced B-Tree.

> So I'm in favour of this patch, which is relatively simple and gives us
> faster index builds soon.  Eventually we might also be able to have
> approach 1.  From what I gather, it's entirely possible that we might
> still need 2 to fall back on in some cases.

Right. And it can form the basis of an implementation of 1, which in
any case seems to be much more compelling for parallel query, when a
great deal more can be pushed down, and we are not particularly likely
to be I/O bound (usually not much writing to the heap, or WAL
logging).

> Will you move the BufFile changes to a separate patch in the next revision?

That is the plan. I need to get set up with a new machine here, having
given back my work laptop to Heroku, but it shouldn't take too long.

Thanks for the review.
-- 
Peter Geoghegan


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-01-31 Thread Thomas Munro
On Wed, Feb 1, 2017 at 5:37 PM, Michael Paquier
 wrote:
> On Tue, Jan 31, 2017 at 2:15 PM, Peter Geoghegan  wrote:
>> On Mon, Jan 30, 2017 at 8:46 PM, Thomas Munro
>>  wrote:
>>> On Wed, Jan 4, 2017 at 12:53 PM, Peter Geoghegan  wrote:
>>>> Attached is V7 of the patch.
>>>
>>> I am doing some testing.  First, some superficial things from first pass:
>>>
>>> [Various minor cosmetic issues]
>>
>> Oops.
>
> As this review is very recent, I have moved the patch to CF 2017-03.

 ParallelContext *
-CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers)
+CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers,
+                      bool serializable_okay)
 {
     MemoryContext oldcontext;
     ParallelContext *pcxt;
@@ -143,7 +144,7 @@ CreateParallelContext(parallel_worker_main_type entrypoint, int nworkers)
      * workers, at least not until somebody enhances that mechanism to be
      * parallel-aware.
      */
-    if (IsolationIsSerializable())
+    if (IsolationIsSerializable() && !serializable_okay)
         nworkers = 0;

That's a bit weird but I can't think of a problem with it.  Workers
run with MySerializableXact == InvalidSerializableXact, even though
they may have the snapshot of a SERIALIZABLE leader.  Hopefully soon
the restriction on SERIALIZABLE in parallel queries can be lifted
anyway, and then this could be removed.

Here are some thoughts on the overall approach.  Disclaimer:  I
haven't researched the state of the art in parallel sort or btree
builds.  But I gather from general reading that there are a couple of
well known approaches, and I'm sure you'll correct me if I'm off base
here.

1.  All participants: parallel sequential scan, repartition on the fly
so each worker has tuples in a non-overlapping range, sort, build
disjoint btrees; barrier; leader: merge disjoint btrees into one.

2.  All participants: parallel sequential scan, sort, spool to disk;
barrier; leader: merge spooled tuples and build btree.

This patch is doing the 2nd thing.  My understanding is that some
systems might choose to do that if they don't have or don't like the
table's statistics, since repartitioning for balanced load requires
carefully chosen ranges and is highly sensitive to distribution
problems.

It's pretty clear that approach 1 is a difficult project.  From my
research into dynamic repartitioning in the context of hash joins, I
can see that that infrastructure is a significant project in its own
right: subproblems include super efficient tuple exchange, buffering,
statistics/planning and dealing with/adapting to bad outcomes.  I also
suspect that repartitioning operators might need to be specialised for
different purposes like sorting vs hash joins, which may have
differing goals.  I think it's probably easy to build a slow dynamic
repartitioning mechanism that frequently results in terrible worst
case scenarios where you paid a fortune in IPC overheads and still
finished up with one worker pulling most of the whole load.  Without
range partitioning, I don't believe you can merge the resulting
non-disjoint btrees efficiently so you'd probably finish up writing a
complete new btree to mash them together.  As for merging disjoint
btrees, I assume there are ways to do a structure-preserving merge
that just rebuilds some internal pages and incorporates the existing
leaf pages directly, a bit like tree manipulation in functional
programming languages; that'll take some doing.

So I'm in favour of this patch, which is relatively simple and gives us
faster index builds soon.  Eventually we might also be able to have
approach 1.  From what I gather, it's entirely possible that we might
still need 2 to fall back on in some cases.

Will you move the BufFile changes to a separate patch in the next revision?

Still testing and reviewing, more soon.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-01-31 Thread Michael Paquier
On Tue, Jan 31, 2017 at 2:15 PM, Peter Geoghegan  wrote:
> On Mon, Jan 30, 2017 at 8:46 PM, Thomas Munro
>  wrote:
>> On Wed, Jan 4, 2017 at 12:53 PM, Peter Geoghegan  wrote:
>>> Attached is V7 of the patch.
>>
>> I am doing some testing.  First, some superficial things from first pass:
>>
>> [Various minor cosmetic issues]
>
> Oops.

As this review is very recent, I have moved the patch to CF 2017-03.
-- 
Michael


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-01-31 Thread Robert Haas
On Tue, Jan 31, 2017 at 12:15 AM, Peter Geoghegan  wrote:
>> Should this 64KB minimum be mentioned in the documentation?
>
> You mean user-visible documentation, and not just tuplesort.h? I don't
> think that that's necessary. That's a ludicrously low amount of memory
> for a worker to be limited to anyway. It will never come up with
> remotely sensible use of the feature.

I agree.

>> +   if (!btspool->isunique)
>> +   {
>> +   shm_toc_estimate_keys(&pcxt->estimator, 2);
>> +   }
>>
>> Project style: people always tell me to drop the curlies in cases like
>> that.  There are a few more examples in the patch.
>
> I only do this when there is an "else" that must have curly braces,
> too. There are plenty of examples of this from existing code, so I
> think it's fine.

But I disagree on this one.  I think

if (blah)
stuff();
else
{
thing();
gargle();
}

...is much better than

if (blah)
{
stuff();
}
else
{
thing();
gargle();
}

But if there were a comment on a separate line before the call to
stuff(), then I would do it the second way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-01-30 Thread Peter Geoghegan
On Mon, Jan 30, 2017 at 8:46 PM, Thomas Munro
 wrote:
> On Wed, Jan 4, 2017 at 12:53 PM, Peter Geoghegan  wrote:
>> Attached is V7 of the patch.
>
> I am doing some testing.  First, some superficial things from first pass:
>
> [Various minor cosmetic issues]

Oops.

> Just an observation:  if you ask for a large number of workers, but
> only one can be launched, it will be constrained to a small fraction
> of maintenance_work_mem, but use only one worker.  That's probably OK,
> and I don't see how to do anything about it unless you are prepared to
> make workers wait for an initial message from the leader to inform
> them how many were launched.

Actually, the leader-owned worker Tuplesort state will have the
appropriate amount, so you'd still need to have 2 participants (1
worker + leader-as-worker). And, sorting is much less sensitive to
having a bit less memory than hashing (at least when there isn't
dozens of runs to merge in the end, or multiple passes). So, I agree
that this isn't worth worrying about for a DDL statement.

> Should this 64KB minimum be mentioned in the documentation?

You mean user-visible documentation, and not just tuplesort.h? I don't
think that that's necessary. That's a ludicrously low amount of memory
for a worker to be limited to anyway. It will never come up with
remotely sensible use of the feature.

> +   if (!btspool->isunique)
> +   {
> +   shm_toc_estimate_keys(>estimator, 2);
> +   }
>
> Project style: people always tell me to drop the curlies in cases like
> that.  There are a few more examples in the patch.

I only do this when there is an "else" that must have curly braces,
too. There are plenty of examples of this from existing code, so I
think it's fine.

> + /* Wait for workers */
> + ConditionVariableSleep(>workersFinishedCv,
> +   WAIT_EVENT_PARALLEL_FINISH);
>
> I don't think we should reuse WAIT_EVENT_PARALLEL_FINISH in
> tuplesort_leader_wait and worker_wait.  That belongs to
> WaitForParallelWorkersToFinish, so someone who sees that in
> pg_stat_activity won't know which it is.

Noted.

> IIUC worker_wait() is only being used to keep the worker around so its
> files aren't deleted.  Once buffile cleanup is changed to be
> ref-counted (in an on_dsm_detach hook?) then workers might as well
> exit sooner, freeing up a worker slot... do I have that right?

Yes. Or at least I think it's very likely that that will end up happening.

> Incidentally, barrier.c could probably be used for this
> synchronisation instead of these functions.  I think
> _bt_begin_parallel would call BarrierInit(>barrier,
> scantuplesortstates) and then after LaunchParallelWorkers() it'd call
> a new interface BarrierDetachN(>barrier, scantuplesortstates -
> pcxt->nworkers_launched) to forget about workers that failed to
> launch.  Then you could use BarrierWait where the leader waits for the
> workers to finish, and BarrierDetach where the workers are finished
> and want to exit.

I thought about doing that, actually, but I don't like creating
dependencies on some other uncommited patch, which is a moving target
(barrier stuff isn't committed yet). It makes life difficult for
reviewers. I put off adopting condition variables until they were
committed for the same reason -- it's was easy to do without them for
a time. I'll probably get around to it before too long, but feel no
urgency about it. Barriers will only allow me to make a modest net
removal of code, AFAIK.
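
For illustration only, a rough usage sketch of that barrier-based approach
follows. BarrierInit/BarrierWait/BarrierDetach are the names from the
then-uncommitted barrier proposal, BarrierDetachN is the new interface
suggested above, and btshared/pcxt and the exact signatures are assumptions
rather than code from the actual patch:

/* Leader, in _bt_begin_parallel(), before and after launching workers: */
BarrierInit(&btshared->barrier, scantuplesortstates);
LaunchParallelWorkers(pcxt);
/* Forget the participants that never launched. */
BarrierDetachN(&btshared->barrier,
               scantuplesortstates - pcxt->nworkers_launched);

/* Each worker, once its sorted runs are flushed to its tape: */
BarrierDetach(&btshared->barrier);

/*
 * Leader, before beginning its final on-the-fly merge (the proposal may
 * also take a wait-event argument here, per the discussion above):
 */
BarrierWait(&btshared->barrier);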

Thanks
-- 
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-01-30 Thread Thomas Munro
On Wed, Jan 4, 2017 at 12:53 PM, Peter Geoghegan  wrote:
> Attached is V7 of the patch.

I am doing some testing.  First, some superficial things from first pass:

Still applies with some offsets and one easy-to-fix rejected hunk in
nbtree.c (removing some #include directives and a struct definition).

+/* Sort parallel code from state for sort__start probes */
+#define PARALLEL_SORT(state)   ((state)->shared == NULL ? 0 : \
+(state)->workerNum >= 0 : 1 : 2)

Typo: ':' instead of '?', --enable-dtrace build fails.

+ the entire utlity command, regardless of the number of

Typo: s/utlity/utility/

+   /* Perform sorting of spool, and possibly a spool2 */
+   sortmem = Max(maintenance_work_mem / btshared->scantuplesortstates, 64);

Just an observation:  if you ask for a large number of workers, but
only one can be launched, it will be constrained to a small fraction
of maintenance_work_mem, but use only one worker.  That's probably OK,
and I don't see how to do anything about it unless you are prepared to
make workers wait for an initial message from the leader to inform
them how many were launched.

Should this 64KB minimum be mentioned in the documentation?

+   if (!btspool->isunique)
+   {
+   shm_toc_estimate_keys(>estimator, 2);
+   }

Project style: people always tell me to drop the curlies in cases like
that.  There are a few more examples in the patch.

+ /* Wait for workers */
+ ConditionVariableSleep(>workersFinishedCv,
+   WAIT_EVENT_PARALLEL_FINISH);

I don't think we should reuse WAIT_EVENT_PARALLEL_FINISH in
tuplesort_leader_wait and worker_wait.  That belongs to
WaitForParallelWorkersToFinish, so someone who sees that in
pg_stat_activity won't know which it is.

IIUC worker_wait() is only being used to keep the worker around so its
files aren't deleted.  Once buffile cleanup is changed to be
ref-counted (in an on_dsm_detach hook?) then workers might as well
exit sooner, freeing up a worker slot... do I have that right?

Incidentally, barrier.c could probably be used for this
synchronisation instead of these functions.  I think
_bt_begin_parallel would call BarrierInit(>barrier,
scantuplesortstates) and then after LaunchParallelWorkers() it'd call
a new interface BarrierDetachN(>barrier, scantuplesortstates -
pcxt->nworkers_launched) to forget about workers that failed to
launch.  Then you could use BarrierWait where the leader waits for the
workers to finish, and BarrierDetach where the workers are finished
and want to exit.

+ /* Prepare state to create unified tapeset */
+ leaderTapes = palloc(sizeof(TapeShare) * state->maxTapes);

Missing cast (TapeShare *) here?  Project style judging by code I've
seen, and avoids gratuitous C++ incompatibility.

+_bt_parallel_shared_estimate(Snapshot snapshot)
...
+tuplesort_estimate_shared(int nWorkers)

Inconsistent naming?

More soon.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-01-03 Thread Peter Geoghegan
On Tue, Dec 20, 2016 at 5:14 PM, Peter Geoghegan  wrote:
>> Imagine a data structure that is stored in dynamic shared memory and
>> contains space for a filename, a reference count, and a mutex.  Let's
>> call this thing a SharedTemporaryFile or something like that.  It
>> offers these APIs:
>>
>> extern void SharedTemporaryFileInitialize(SharedTemporaryFile *);
>> extern void SharedTemporaryFileAttach(SharedTemporary File *, dsm_segment 
>> *seg);
>> extern void SharedTemporaryFileAssign(SharedTemporyFile *, char *pathname);
>> extern File SharedTemporaryFileGetFile(SharedTemporaryFile *);
>
> I'm a little bit tired right now, and I have yet to look at Thomas'
> parallel hash join patch in any detail. I'm interested in what you
> have to say here, but I think that I need to learn more about its
> requirements in order to have an informed opinion.

Attached is V7 of the patch. The overall emphasis with this revision
is on bringing clarity on how much can be accomplished using
generalized infrastructure, explaining the unification mechanism
coherently, and related issues.

Notable changes
---

* Rebased to work with the newly simplified logtape.c representation
(the recent removal of "indirect blocks" by Heikki). Heikki's work was
something that helped with simplifying the whole unification
mechanism, to a significant degree. I think that there was over a 50%
reduction in logtape.c lines of code in this revision.

* randomAccess cases are now able to reclaim disk space from blocks
originally written by workers. This further simplifies logtape.c
changes significantly. I don't think that this is important because
some future randomAccess caller might otherwise have double the
storage overhead for their parallel sort, or even because of the
disproportionate performance penalty such a caller would experience;
rather, it's important because it removes previous special cases (that
were internal to logtape.c).

For example, aside from the fact that worker tapes within a unified
tapeset will often have a non-zero offset, there is no state that
actually remembers that this is a unified tapeset, because that isn't
needed anymore. And, even though we reclaim blocks from workers, we
only have one central chokepoint for applying worker offsets in the
leader (that chokepoint is ltsReadFillBuffer()). Routines tasked with
things like positional seeking for mark/restore for certain tuplesort
clients (which are, in general, poorly tested) now need to have no
knowledge of unification while still working just the same. This is a
consequence of the fact that ltsWriteBlock() callers (and
ltsWriteBlock() itself) never have to think about offsets. I'm pretty
happy about that.

* pg_restore now prevents the planner from deciding that parallelism
should be used, in order to make restoration behavior more consistent
and predictable. Iff a dump being restored happens to have a CREATE
INDEX with the new index storage parameter parallel_workers set, then
pg_restore will use parallel CREATE INDEX. This is accomplished with a
new GUC, enable_parallelddl (since "max_parallel_workers_maintenance =
0" will disable parallel CREATE INDEX across the board, ISTM that a
second new GUC is required). I think that this behavior the right
trade-off for pg_restore goes, although I still don't feel
particularly strongly about it. There is now a concrete proposal on
what to do about pg_restore, if nothing else. To recap, the general
concern address here is that there are typically no ANALYZE stats
available for the planner to decide with when pg_restore runs CREATE
INDEX, although that isn't always true, which was both surprising and
inconsistent.

* Addresses the problem of anonymous record types and their need for
"remapping" across parallel workers. I've simply pushed the
responsibility on callers within tuplesort.h contract; parallel CREATE
INDEX callers don't need to care about this, as explained there.
(CLUSTER tuplesorts would also be safe.)

* Puts the whole rationale for unification into one large comment
above the function BufFileUnify(), and removes traces of the same kind
of discussion from everywhere else. I think that buffile.c is the
right central place to discuss the unification mechanism, now that
logtape.c has been greatly simplified. All the fd.c changes are in
routines that are only ever called by buffile.c anyway, and are not
too complicated (in general, temp fd.c files are only ever owned
transitively, through BufFiles). So, morally, the unification
mechanism is something that wholly belongs to buffile.c, since
unification is all about temp files, and buffile.h is the interface
through which temp files are owned and accessed in general, without
exception.

Unification remains specialized
---

On the one hand, BufFileUnify() now describes the whole idea of
unification in detail, in its own general terms, including its
performance characteristics, but on the other hand it 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-12-21 Thread Peter Geoghegan
On Wed, Dec 21, 2016 at 10:21 AM, Peter Geoghegan  wrote:
> On Wed, Dec 21, 2016 at 6:00 AM, Robert Haas  wrote:
>> 3. Just live with the waste of space.
>
> I am loath to create a special case for the parallel interface too,
> but I think it's possible that *no* caller will ever actually need to
> live with this restriction at any time in the future.

I just realized that you were actually talking about the waste of
space in workers here, as opposed to the theoretical waste of space
that would occur in the leader should there ever be a parallel
randomAccess tuplesort caller.

To be clear, I am totally against allowing a waste of logtape.c temp
file space in *workers*, because that implies a cost that will most
certainly be felt by users all the time.

-- 
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-12-21 Thread Peter Geoghegan
On Wed, Dec 21, 2016 at 6:00 AM, Robert Haas  wrote:
> 3. Just live with the waste of space.

I am loath to create a special case for the parallel interface too,
but I think it's possible that *no* caller will ever actually need to
live with this restriction at any time in the future. I am strongly
convinced that adopting tuplesort.c for parallelism should involve
partitioning [1]. With that approach, even randomAccess callers will
not want to read at random for one big materialized tape, since that's
at odds with the whole point of partitioning, which is to remove any
dependencies between workers quickly and early, so that as much work
as possible is pushed down into workers. If a merge join were
performed in a world where we have this kind of partitioning, we
definitely wouldn't require one big materialized tape that is
accessible within each worker.

What are the chances of any real user actually having to live with the
waste of space at some point in the future?

> Another tangentially-related problem I just realized is that we need
> to somehow handle the issues that tqueue.c does when transferring
> tuples between backends -- most of the time there's no problem, but if
> anonymous record types are involved then tuples require "remapping".
> It's probably harder to provoke a failure in the tuplesort case than
> with parallel query per se, but it's probably not impossible.

Thanks for pointing that out. I'll look into it.

BTW, I discovered a bug where there is very low memory available
within each worker -- tuplesort.c throws an error within workers
immediately. It's just a matter of making sure that they at least have
64KB of workMem, which is a pretty straightforward fix. Obviously it
makes no sense to use so little memory in the first place; this is a
corner case.

[1] 
https://www.postgresql.org/message-id/CAM3SWZR+ATYAzyMT+hm-Bo=1l1smtjbndtibwbtktyqs0dy...@mail.gmail.com
-- 
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-12-21 Thread Robert Haas
On Wed, Dec 21, 2016 at 7:04 AM, Heikki Linnakangas  wrote:
>> If the worker is always completely finished with the tape before the
>> leader touches it, couldn't the leader's LogicalTapeSet just "adopt"
>> the tape and overwrite it like any other?
>
> Currently, the logical tape code assumes that all tapes in a single
> LogicalTapeSet are allocated from the same BufFile. The logical tape's
> on-disk format contains block numbers, to point to the next/prev block of
> the tape [1], and they're assumed to refer to the same file. That allows
> reusing space efficiently during the merge. After you have read the first
> block from tapes A, B and C, you can immediately reuse those three blocks
> for output tape D.

I see.  Hmm.

> Now, if you read multiple tapes, from different LogicalTapeSet, hence backed
> by different BufFiles, you cannot reuse the space from those different tapes
> for a single output tape, because the on-disk format doesn't allow referring
> to blocks in other files. You could reuse the space of *one* of the input
> tapes, by placing the output tape in the same LogicalTapeSet, but not all of
> them.
>
> We could enhance that, by using "filename + block number" instead of just
> block number, in the pointers in the logical tapes. Then you could spread
> one logical tape across multiple files. Probably not worth it in practice,
> though.

OK, so the options as I understand them are:

1. Enhance the logical tape set infrastructure in the manner you
mention, to support filename (or more likely a proxy for filename) +
block number in the logical tape pointers.  Then, tapes can be
transferred from one LogicalTapeSet to another.

2. Enhance the BufFile infrastructure to support some notion of a
shared BufFile so that multiple processes can be reading and writing
blocks in the same BufFile.  Then, extend the logical tape
infrastruture so that we also have the notion of a shared LogicalTape.
This means that things like ltsGetFreeBlock() need to be re-engineered
to handle concurrency with other backends.

3. Just live with the waste of space.

I would guess that (1) is easier than (2).  Also, (2) might provoke
contention while writing tapes that is otherwise completely
unnecessary.  It seems silly to have multiple backends fighting over
the same end-of-file pointer for the same file when they could just
write to different files instead.

Another tangentially-related problem I just realized is that we need
to somehow handle the issues that tqueue.c does when transferring
tuples between backends -- most of the time there's no problem, but if
anonymous record types are involved then tuples require "remapping".
It's probably harder to provoke a failure in the tuplesort case than
with parallel query per se, but it's probably not impossible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-12-21 Thread Heikki Linnakangas

On 12/21/2016 12:53 AM, Robert Haas wrote:

That leaves one problem, though: reusing space in the final merge phase. If
the tapes being merged belong to different LogicalTapeSets, and create one
new tape to hold the result, the new tape cannot easily reuse the space of
the input tapes because they are on different tape sets.


If the worker is always completely finished with the tape before the
leader touches it, couldn't the leader's LogicalTapeSet just "adopt"
the tape and overwrite it like any other?


Currently, the logical tape code assumes that all tapes in a single 
LogicalTapeSet are allocated from the same BufFile. The logical tape's 
on-disk format contains block numbers, to point to the next/prev block 
of the tape [1], and they're assumed to refer to the same file. That 
allows reusing space efficiently during the merge. After you have read 
the first block from tapes A, B and C, you can immediately reuse those 
three blocks for output tape D.


Now, if you read multiple tapes, from different LogicalTapeSet, hence 
backed by different BufFiles, you cannot reuse the space from those 
different tapes for a single output tape, because the on-disk format 
doesn't allow referring to blocks in other files. You could reuse the 
space of *one* of the input tapes, by placing the output tape in the 
same LogicalTapeSet, but not all of them.


We could enhance that, by using "filename + block number" instead of 
just block number, in the pointers in the logical tapes. Then you could 
spread one logical tape across multiple files. Probably not worth it in 
practice, though.



[1] As the code stands, there are no next/prev pointers, but a tree of 
"indirect" blocks. But I'm planning to change that to simpler next/prev 
pointers, in 
https://www.postgresql.org/message-id/flat/55b3b7ae-8dec-b188-b8eb-e07604052351%40iki.fi
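
For concreteness, a sketch of what such a per-block header could look like
follows; the struct and field names are illustrative assumptions, not the
actual logtape.c layout:

/*
 * Illustrative sketch only -- names are assumptions, not the real
 * logtape.c structures.  Each BLCKSZ-sized block of a logical tape
 * carries pointers to its neighbours, expressed as block numbers
 * within the single underlying BufFile.
 */
typedef struct LogicalTapeBlockHeaderSketch
{
    long        prevblock;      /* block number of previous block, or -1L */
    long        nextblock;      /* block number of next block, or -1L */
    int         nbytes;         /* valid payload bytes in this block */
} LogicalTapeBlockHeaderSketch;

Because these pointers are bare block numbers, they can only refer to blocks
within the same underlying BufFile; spreading one logical tape across several
files would require something like "file identifier + block number", which is
the enhancement considered, and set aside, above.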


- Heikki





Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-12-20 Thread Robert Haas
On Tue, Dec 20, 2016 at 8:14 PM, Peter Geoghegan  wrote:
> Without meaning to sound glib, unification is the process by which
> parallel CREATE INDEX has the leader read temp files from workers
> sufficient to complete its final on-the-fly merge.

That's not glib, but you can't in the end define BufFile unification
in terms of what parallel CREATE INDEX needs.  Whatever changes we
make to lower-level abstractions in the service of some higher-level
goal need to be explainable on their own terms.

>>It's fine to make up new words -- indeed, in some sense that is the essence
>> of writing any complex problem -- but you have to define them.
>
> I invite you to help me define this new word.

If at some point I'm able to understand what it means, I'll try to do
that.  I think you're loosely using "unification" to mean combining
stuff from different backends in some way that depends on the
particular context, so that "BufFile unification" can be different
from "LogicalTape unification".  But that's just punting the question
of what each of those things actually are.

> Maybe it's inferior to that, but I think what Heikki proposes is more
> or less complementary to what I've proposed, and has nothing to do
> with resource management and plenty to do with making the logtape.c
> interface look nice, AFAICT. It's also about refactoring/simplifying
> logtape.c itself, while we're at it. I believe that Heikki has yet to
> comment either way on my approach to resource management, one aspect
> of the patch that I was particularly keen on your looking into.

My reading of Heikki's point was that there's not much point in
touching the BufFile level of things if we can do all of the necessary
stuff at the LogicalTape level, and I agree with him about that.  If a
shared BufFile had a shared read-write pointer, that would be a good
justification for having it.  But it seems like unification at the
BufFile level is just concatenation, and that can be done just as well
at the LogicalTape level, so why tinker with BufFile?  As I've said, I
think there's some low-level hacking needed here to make sure files
get removed at the correct time in all cases, but apart from that I
see no good reason to push the concatenation operation all the way
down into BufFile.

> The theory of operation here is that workers own their own BufFiles,
> and are responsible for deleting them when they die. The assumption,
> rightly or wrongly, is that it's sufficient that workers flush
> everything out (write out temp files), and yield control to the
> leader, which will open their temp files for the duration of the
> leader final on-the-fly merge. The resource manager in the leader
> knows it isn't supposed to ever delete worker-owned files (just
> close() the FDs), and the leader errors if it cannot find temp files
> that match what it expects. If there is an error in the leader, it
> shuts down workers, and they clean up, more than likely. If there is
> an error in the worker, or if the files cannot be deleted (e.g., if
> there is a classic hard crash scenario), we should also be okay,
> because nobody will trip up on some old temp file from some worker,
> since fd.c has some gumption about what workers need to do (and what
> the leader needs to avoid) in the event of a hard crash. I don't see a
> risk of file descriptor leaks, which may or may not have been part of
> your concern (please clarify).

I don't think there's any new issue with file descriptor leaks here,
but I think there is a risk of calling unlink() too early or too late
with your design.  My proposal was an effort to nail that down real
tight.

>> If the worker is always completely finished with the tape before the
>> leader touches it, couldn't the leader's LogicalTapeSet just "adopt"
>> the tape and overwrite it like any other?
>
> I'll remind you that parallel CREATE INDEX doesn't actually ever need
> to be randomAccess, and so we are not actually going to ever need to
> do this as things stand. I wrote the code that way in order to not
> break the existing interface, which seemed like a blocker to posting
> the patch. I am open to the idea of such an "adoption" occurring, even
> though it actually wouldn't help any case that exists in the patch as
> proposed. I didn't go that far in part because it seemed premature,
> given that nobody had looked at my work to date at the time, and given
> the fact that there'd be no initial user-visible benefit, and given
> how the exact meaning of "unification" was (and is) somewhat in flux.
>
> I see no good reason to not do that, although that might change if I
> actually seriously undertook to teach the leader about this kind of
> "adoption". I suspect that the interface specification would make for
> confusing reading, which isn't terribly appealing, but I'm sure I
> could manage to make it work given time.

I think the interface is pretty clear: the worker's logical tapes get
incorporated into the leader's 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-12-20 Thread Peter Geoghegan
On Tue, Dec 20, 2016 at 2:53 PM, Robert Haas  wrote:
>> What I have in mind is something like the attached patch. It refactors
>> LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape
>> as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet
>> doesn't have the concept of a tape number anymore, it can contain any number
>> of tapes, and you can create more on the fly. With that, it'd be fairly easy
>> to make tuplesort.c merge LogicalTapes that came from different tape sets,
>> backed by different BufFiles. I think that'd avoid much of the unification
>> code.
>
> I just looked at the buffile.c/buffile.h changes in the latest version
> of the patch and I agree with this criticism, though maybe not with
> the proposed solution.  I actually don't understand what "unification"
> is supposed to mean.  The patch really doesn't explain that anywhere
> that I can see.  It says stuff like:
>
> + * Parallel operations can use an interface to unify multiple worker-owned
> + * BufFiles and a leader-owned BufFile within a leader process.  This relies
> + * on various fd.c conventions about the naming of temporary files.

Without meaning to sound glib, unification is the process by which
parallel CREATE INDEX has the leader read temp files from workers
sufficient to complete its final on-the-fly merge. So, it's a term
that's a bit like "speculative insertion" was up until
UPSERT was committed: a concept that is somewhat in flux, and
describes a new low-level mechanism built to support a higher level
operation, which must accord with a higher level set of requirements
(so, for speculative insertion, that would be avoiding "unprincipled
deadlocks" and so on). That being the case, maybe "unification" isn't
useful as a precise piece of terminology at this point, but that will
change.

While I'm fairly confident that I basically have the right idea with
this patch, I think that you are better at judging the ins and outs of
resource management than I am, not least because of the experience of
working on parallel query itself. Also, I'm signed up to review
parallel hash join in large part because I think there might be some
convergence concerning the sharing of BufFiles among parallel workers.
I don't think I'm qualified to judge what a general abstraction like
this should look like, but I'm trying to get there.

> That comment tells you that unification is a thing you can do -- via
> an unspecified interface for unspecified reasons using unspecified
> conventions -- but it doesn't tell you what the semantics of it are
> supposed to be.  For example, if we "unify" several BufFiles, do they
> then have a shared seek pointer?

No.

> Do the existing contents effectively
> get concatenated in an unpredictable order, or are they all expected
> to be empty at the time unification happens?  Or something else?

The order is the same order as ordinal identifiers are assigned to
workers within tuplesort.c, which is undefined, with the notable
exception of the leader's own identifier (-1) and area of the unified
BufFile space (this is only relevant in randomAccess cases, where
leader may write stuff out to its own reserved part of the BufFile
space). It only matters that the bit of metadata in shared memory is
in that same order, which it clearly will be. So, it's unpredictable,
but in the same way that ordinal identifiers are assigned in a
not-well-defined order; it doesn't, or at least shouldn't, matter. We
can imagine a case where it does matter, and we probably should, but
that case isn't parallel CREATE INDEX.
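
To make that concrete, here is a minimal sketch, under assumed names and
layout, of how a leader could derive each participant's starting block within
the unified space from per-participant block counts published in shared
memory, in ordinal-identifier order:

/*
 * Sketch only: names and layout are assumptions for illustration.
 * blocks_written[] is the per-participant block count published in
 * shared memory, in ordinal-identifier order; firstblock[] receives
 * each participant's starting block number in the unified space.
 */
static void
compute_unified_offsets(const long *blocks_written, int nparticipants,
                        long *firstblock)
{
    long        next = 0;
    int         i;

    for (i = 0; i < nparticipants; i++)
    {
        firstblock[i] = next;
        next += blocks_written[i];
    }
}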

>It's fine to make up new words -- indeed, in some sense that is the essence
> of writing any complex problem -- but you have to define them.

I invite you to help me define this new word.

> It's hard to understand how something like this doesn't leak
> resources.  Maybe that's been thought about here, but it isn't very
> clear to me how it's supposed to work.

I agree that it would be useful to centrally document what all this
unification stuff is about. Suggestions on where that should live are
welcome.

> In Heikki's proposal, if
> process A is trying to read a file owned by process B, and process B
> dies and removes the file before process A gets around to reading it,
> we have got trouble, especially on Windows which apparently has low
> tolerance for such things.  Peter's proposal avoids that - I *think* -
> by making the leader responsible for all resource cleanup, but that's
> inferior to the design we've used for other sorts of shared resource
> cleanup (DSM, DSA, shm_mq, lock groups) where the last process to
> detach always takes responsibility.

Maybe it's inferior to that, but I think what Heikki proposes is more
or less complementary to what I've proposed, and has nothing to do
with resource management and plenty to do with making the logtape.c
interface look nice, AFAICT. It's also about refactoring/simplifying
logtape.c itself, while we're at 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-12-20 Thread Robert Haas
On Wed, Sep 21, 2016 at 12:52 PM, Heikki Linnakangas  wrote:
> I find this unification business really complicated. I think it'd be simpler
> to keep the BufFiles and LogicalTapeSets separate, and instead teach
> tuplesort.c how to merge tapes that live on different
> LogicalTapeSets/BufFiles. Or refactor LogicalTapeSet so that a single
> LogicalTapeSet can contain tapes from different underlying BufFiles.
>
> What I have in mind is something like the attached patch. It refactors
> LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape
> as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet
> doesn't have the concept of a tape number anymore, it can contain any number
> of tapes, and you can create more on the fly. With that, it'd be fairly easy
> to make tuplesort.c merge LogicalTapes that came from different tape sets,
> backed by different BufFiles. I think that'd avoid much of the unification
> code.

I just looked at the buffile.c/buffile.h changes in the latest version
of the patch and I agree with this criticism, though maybe not with
the proposed solution.  I actually don't understand what "unification"
is supposed to mean.  The patch really doesn't explain that anywhere
that I can see.  It says stuff like:

+ * Parallel operations can use an interface to unify multiple worker-owned
+ * BufFiles and a leader-owned BufFile within a leader process.  This relies
+ * on various fd.c conventions about the naming of temporary files.

That comment tells you that unification is a thing you can do -- via
an unspecified interface for unspecified reasons using unspecified
conventions -- but it doesn't tell you what the semantics of it are
supposed to be.  For example, if we "unify" several BufFiles, do they
then have a shared seek pointer?  Do the existing contents effectively
get concatenated in an unpredictable order, or are they all expected
to be empty at the time unification happens?  Or something else?  It's
fine to make up new words -- indeed, in some sense that is the essence
of writing any complex problem -- but you have to define them.  As far
as I can tell, the idea is that we're somehow magically concatenating
the BufFiles into one big super-BufFile, but I'm fuzzy on exactly
what's supposed to be going on there.

It's hard to understand how something like this doesn't leak
resources.  Maybe that's been thought about here, but it isn't very
clear to me how it's supposed to work.  In Heikki's proposal, if
process A is trying to read a file owned by process B, and process B
dies and removes the file before process A gets around to reading it,
we have got trouble, especially on Windows which apparently has low
tolerance for such things.  Peter's proposal avoids that - I *think* -
by making the leader responsible for all resource cleanup, but that's
inferior to the design we've used for other sorts of shared resource
cleanup (DSM, DSA, shm_mq, lock groups) where the last process to
detach always takes responsibility.  That avoids assuming that we're
always dealing with a leader-follower situation, it doesn't
categorically require the leader to be the one who creates the shared
resource, and it doesn't require the leader to be the last process to
die.

Imagine a data structure that is stored in dynamic shared memory and
contains space for a filename, a reference count, and a mutex.  Let's
call this thing a SharedTemporaryFile or something like that.  It
offers these APIs:

extern void SharedTemporaryFileInitialize(SharedTemporaryFile *);
extern void SharedTemporaryFileAttach(SharedTemporary File *, dsm_segment *seg);
extern void SharedTemporaryFileAssign(SharedTemporyFile *, char *pathname);
extern File SharedTemporaryFileGetFile(SharedTemporaryFile *);

After setting aside sizeof(SharedTemporaryFile) bytes in your shared
DSM sgement, you call SharedTemporaryFileInitialize() to initialize
them.  Then, every process that cares about the file does
SharedTemporaryFileAttach(), which bumps the reference count and sets
an on_dsm_detach hook to decrement the reference count and unlink the
file if the reference count thereby reaches 0.  One of those processes
does SharedTemporaryFileAssign(), which fills in the pathname and
clears FD_TEMPORARY.  Then, any process that has attached can call
SharedTemporaryFileGetFile() to get a File which can then be accessed
normally.  So, the pattern for parallel sort would be:

- Leader sets aside space and calls SharedTemporaryFileInitialize()
and SharedTemporaryFileAttach().
- The cooperating worker calls SharedTemporaryFileAttach() and then
SharedTemporaryFileAssign().
- The leader then calls SharedTemporaryFileGetFile().

Since the leader can attach to the file before the path name is filled
in, there's no window where the file is at risk of being leaked.
Before SharedTemporaryFileAssign(), the worker is solely responsible
for removing the file; after that call, whichever of the leader and
the worker exits last will 
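
For illustration, a minimal sketch of that reference-counting scheme follows,
assuming the struct layout, the field names, and the unlink-on-last-detach
policy described above; it is not the actual proposed implementation:

#include "postgres.h"
#include "storage/dsm.h"
#include "storage/spin.h"
#include <unistd.h>

/* Sketch only: layout and names are assumptions based on the description above. */
typedef struct SharedTemporaryFile
{
    slock_t     mutex;                  /* protects refcnt and pathname */
    int         refcnt;                 /* number of attached backends */
    char        pathname[MAXPGPATH];    /* empty until "assigned" */
} SharedTemporaryFile;

/* on_dsm_detach callback: last process to detach unlinks the file. */
static void
shared_temp_file_detach(dsm_segment *seg, Datum arg)
{
    SharedTemporaryFile *stf = (SharedTemporaryFile *) DatumGetPointer(arg);
    bool        unlink_it = false;

    SpinLockAcquire(&stf->mutex);
    if (--stf->refcnt == 0 && stf->pathname[0] != '\0')
        unlink_it = true;
    SpinLockRelease(&stf->mutex);

    if (unlink_it)
        unlink(stf->pathname);
}

void
SharedTemporaryFileAttach(SharedTemporaryFile *stf, dsm_segment *seg)
{
    SpinLockAcquire(&stf->mutex);
    stf->refcnt++;
    SpinLockRelease(&stf->mutex);

    /* Ensure the refcount drops (and the file is removed) even on error exit. */
    on_dsm_detach(seg, shared_temp_file_detach, PointerGetDatum(stf));
}

SharedTemporaryFileAssign() would then fill in pathname under the mutex, after
which whichever process detaches last removes the file, matching the
last-to-detach cleanup pattern used for DSM, DSA and shm_mq.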

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-12-04 Thread Haribabu Kommi
On Mon, Dec 5, 2016 at 7:44 AM, Peter Geoghegan  wrote:

> On Sat, Dec 3, 2016 at 7:23 PM, Tomas Vondra
>  wrote:
> > I do share your concerns about unpredictable behavior - that's
> > particularly worrying for pg_restore, which may be used for time-
> > sensitive use cases (DR, migrations between versions), so unpredictable
> > changes in behavior / duration are unwelcome.
>
> Right.
>
> > But isn't this more a deficiency in pg_restore, than in CREATE INDEX?
> > The issue seems to be that the reltuples value may or may not get
> > updated, so maybe forcing ANALYZE (even very low statistics_target
> > values would do the trick, I think) would be a more appropriate solution?
> > Or maybe it's time to add at least some rudimentary statistics into the
> > dumps (the reltuples field seems like a good candidate).
>
> I think that there is a number of reasonable ways of looking at it. It
> might also be worthwhile to have a minimal ANALYZE performed by CREATE
> INDEX directly, iff there are no preexisting statistics (there is
> definitely going to be something pg_restore-like that we cannot fix --
> some ETL tool, for example). Perhaps, as an additional condition to
> proceeding with such an ANALYZE, it should also only happen when there
> is any chance at all of parallelism being used (but then you get into
> having to establish the relation size reliably in the absence of any
> pg_class.relpages, which isn't very appealing when there are many tiny
> indexes).
>
> In summary, I would really like it if a consensus emerged on how
> parallel CREATE INDEX should handle the ecosystem of tools like
> pg_restore, reindexdb, and so on. Personally, I'm neutral on which
> general approach should be taken. Proposals from other hackers about
> what to do here are particularly welcome.
>
>
Moved to next CF with "needs review" status.


Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-12-04 Thread Peter Geoghegan
On Sat, Dec 3, 2016 at 7:23 PM, Tomas Vondra
 wrote:
> I do share your concerns about unpredictable behavior - that's
> particularly worrying for pg_restore, which may be used for time-
> sensitive use cases (DR, migrations between versions), so unpredictable
> changes in behavior / duration are unwelcome.

Right.

> But isn't this more a deficiency in pg_restore, than in CREATE INDEX?
> The issue seems to be that the reltuples value may or may not get
> updated, so maybe forcing ANALYZE (even very low statistics_target
> values would do the trick, I think) would be a more appropriate solution?
> Or maybe it's time to add at least some rudimentary statistics into the
> dumps (the reltuples field seems like a good candidate).

I think that there is a number of reasonable ways of looking at it. It
might also be worthwhile to have a minimal ANALYZE performed by CREATE
INDEX directly, iff there are no preexisting statistics (there is
definitely going to be something pg_restore-like that we cannot fix --
some ETL tool, for example). Perhaps, as an additional condition to
proceeding with such an ANALYZE, it should also only happen when there
is any chance at all of parallelism being used (but then you get into
having to establish the relation size reliably in the absence of any
pg_class.relpages, which isn't very appealing when there are many tiny
indexes).

In summary, I would really like it if a consensus emerged on how
parallel CREATE INDEX should handle the ecosystem of tools like
pg_restore, reindexdb, and so on. Personally, I'm neutral on which
general approach should be taken. Proposals from other hackers about
what to do here are particularly welcome.

-- 
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-12-03 Thread Tomas Vondra
On Sat, 2016-12-03 at 18:37 -0800, Peter Geoghegan wrote:
> On Sat, Dec 3, 2016 at 5:45 PM, Alvaro Herrera  com> wrote:
> > 
> > I don't think a patch must necessarily consider all possible uses
> > that
> > the new feature may have.  If we introduce parallel index creation,
> > that's great; if pg_restore doesn't start using it right away,
> > that's
> > okay.  You, or somebody else, can still patch it later.  The patch
> > is
> > still a step forward.
> While I agree, right now pg_restore will tend to use or not use
> parallelism for CREATE INDEX more or less by accident, based on
> whether or not pg_class.reltuples has already been set by something
> else (e.g., an earlier CREATE INDEX against the same table in the
> restoration). That seems unacceptable. I haven't just suppressed the
> use of parallel CREATE INDEX within pg_restore because that would be
> taking a position on something I have a hard time defending any
> particular position on. And so, I am slightly concerned about the
> entire ecosystem of tools that could implicitly use parallel CREATE
> INDEX, with undesirable consequences. Especially pg_restore.
> 
> It's not so much a hard question as it is an awkward one. I want to
> handle any possible objection about there being future compatibility
> issues with going one way or the other ("This paints us into a corner
> with..."). And, there is no existing, simple way for pg_restore and
> other tools to disable the use of parallelism due to the cost model
> automatically kicking in, while still allowing the proposed new index
> storage parameter ("parallel_workers") to force the use of
> parallelism, which seems like something that should happen. (I might
> have to add a new GUC like "enable_maintenance_parallelism", since
> "max_parallel_workers_maintenance = 0" disables parallelism no matter
> how it might be invoked).

I do share your concerns about unpredictable behavior - that's
particularly worrying for pg_restore, which may be used for time-
sensitive use cases (DR, migrations between versions), so unpredictable
changes in behavior / duration are unwelcome.

But isn't this more a deficiency in pg_restore, than in CREATE INDEX?
The issue seems to be that the reltuples value may or may not get
updated, so maybe forcing ANALYZE (even very low statistics_target
values would do the trick, I think) would be a more appropriate solution?
Or maybe it's time to add at least some rudimentary statistics into the
dumps (the reltuples field seems like a good candidate).

Trying to fix this by adding more GUCs seems a bit strange to me.

> 
> In general, I have a positive outlook on this patch, since it appears
> to compete well with similar implementations in other systems
> scalability-wise. It does what it's supposed to do.
> 

+1 to that

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-12-03 Thread Peter Geoghegan
On Sat, Dec 3, 2016 at 5:45 PM, Alvaro Herrera  wrote:
> I don't think a patch must necessarily consider all possible uses that
> the new feature may have.  If we introduce parallel index creation,
> that's great; if pg_restore doesn't start using it right away, that's
> okay.  You, or somebody else, can still patch it later.  The patch is
> still a step forward.

While I agree, right now pg_restore will tend to use or not use
parallelism for CREATE INDEX more or less by accident, based on
whether or not pg_class.reltuples has already been set by something
else (e.g., an earlier CREATE INDEX against the same table in the
restoration). That seems unacceptable. I haven't just suppressed the
use of parallel CREATE INDEX within pg_restore because that would be
taking a position on something I have a hard time defending any
particular position on. And so, I am slightly concerned about the
entire ecosystem of tools that could implicitly use parallel CREATE
INDEX, with undesirable consequences. Especially pg_restore.

It's not so much a hard question as it is an awkward one. I want to
handle any possible objection about there being future compatibility
issues with going one way or the other ("This paints us into a corner
with..."). And, there is no existing, simple way for pg_restore and
other tools to disable the use of parallelism due to the cost model
automatically kicking in, while still allowing the proposed new index
storage parameter ("parallel_workers") to force the use of
parallelism, which seems like something that should happen. (I might
have to add a new GUC like "enable_maintenance_parallelism", since
"max_parallel_workers_maintenance = 0" disables parallelism no matter
how it might be invoked).

In general, I have a positive outlook on this patch, since it appears
to compete well with similar implementations in other systems
scalability-wise. It does what it's supposed to do.

-- 
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-12-03 Thread Alvaro Herrera
Peter Geoghegan wrote:
> On Mon, Nov 7, 2016 at 8:28 PM, Peter Geoghegan  wrote:
> > What do we need to teach pg_restore about parallel CREATE INDEX, if
> > anything at all? Could this be as simple as a blanket disabling of
> > parallelism for CREATE INDEX from pg_restore? Or, does it need to be
> > more sophisticated than that? I suppose that tools like reindexdb and
> > pgbench must be considered in a similar way.
> 
> I still haven't resolved this question, which seems like the most
> important outstanding question,

I don't think a patch must necessarily consider all possible uses that
the new feature may have.  If we introduce parallel index creation,
that's great; if pg_restore doesn't start using it right away, that's
okay.  You, or somebody else, can still patch it later.  The patch is
still a step forward.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-12-03 Thread Peter Geoghegan
On Mon, Nov 7, 2016 at 8:28 PM, Peter Geoghegan  wrote:
> What do we need to teach pg_restore about parallel CREATE INDEX, if
> anything at all? Could this be as simple as a blanket disabling of
> parallelism for CREATE INDEX from pg_restore? Or, does it need to be
> more sophisticated than that? I suppose that tools like reindexdb and
> pgbench must be considered in a similar way.

I still haven't resolved this question, which seems like the most
important outstanding question, but I attach V6. Changes:

* tuplesort.c was adapted to use the recently committed condition
variables stuff. This made things cleaner. No more ad-hoc WaitLatch()
looping.

* Adapted docs to mention the newly committed max_parallel_workers GUC
in the context of discussing proposed max_parallel_workers_maintenance
GUC.

* Fixed trivial assertion failure bug that could be tripped when a
conventional sort uses very little memory.

-- 
Peter Geoghegan


0002-Add-temporary-testing-tools.patch.gz
Description: GNU Zip compressed data


0001-Add-parallel-B-tree-index-build-sorting.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-11-15 Thread Robert Haas
On Wed, Nov 9, 2016 at 10:18 PM, Peter Geoghegan  wrote:
>> Maybe another way of putting this is that, while there's clearly a
>> benefit to having some kind of a cap, it's appropriate to pick a large
>> value, such as 500.  Having no cap at all risks creating many extra
>> tapes that just waste memory, and also risks an unduly
>> cache-inefficient final merge.  Reining that in makes sense.
>> However, we can't rein it in too far or we'll create slow polyphase
>> merges in cases that are reasonably likely to occur in real life.
>
> I completely agree with your analysis.

Cool.  BTW, my run with 10 tapes completed in 10696528.377 ms
(02:58:16.528) - i.e. almost 3 minutes slower than with no tape limit.
Building runs took 4260.16 s, and the final merge pass began at
8239.12 s.  That's certainly better than I expected, and it seems to
show that even if the number of tapes is grossly inadequate for the
number of runs, you can still make up most of the time that you lose
to I/O with improved cache efficiency -- at least under favorable
circumstances.  Of course, on many systems I/O bandwidth will be a
scarce resource, so that argument can be overdone -- and even if not,
a 10-tape sort version takes FAR longer to deliver the first tuple.

I also tried this out with work_mem = 512MB.  Doubling work_mem
reduces the number of runs enough that we don't get a polyphase merge
in any case.  With no limit on tapes:

2016-11-10 11:24:45 UTC [54042] LOG:  switching to external sort with
1873 tapes: CPU: user: 11.34 s, system: 0.48 s, elapsed: 12.13 s
2016-11-10 12:36:22 UTC [54042] LOG:  finished writing run 308 to tape
307: CPU: user: 4096.63 s, system: 156.88 s, elapsed: 4309.66 s
2016-11-10 12:36:22 UTC [54042] LOG:  using 516563 KB of memory for
read buffers among 308 input tapes
2016-11-10 12:36:30 UTC [54042] LOG:  performsort done (except 308-way
final merge): CPU: user: 4097.75 s, system: 157.24 s, elapsed: 4317.76
s
2016-11-10 13:54:07 UTC [54042] LOG:  external sort ended, 6255577
disk blocks used: CPU: user: 8638.72 s, system: 177.42 s, elapsed:
8974.44 s

With a max_sort_tapes = 501:

2016-11-10 14:23:50 UTC [54042] LOG:  switching to external sort with
502 tapes: CPU: user: 10.99 s, system: 0.54 s, elapsed: 11.57 s
2016-11-10 15:36:47 UTC [54042] LOG:  finished writing run 278 to tape
277: CPU: user: 4190.31 s, system: 155.33 s, elapsed: 4388.86 s
2016-11-10 15:36:47 UTC [54042] LOG:  using 517313 KB of memory for
read buffers among 278 input tapes
2016-11-10 15:36:54 UTC [54042] LOG:  performsort done (except 278-way
final merge): CPU: user: 4191.36 s, system: 155.68 s, elapsed: 4395.66
s
2016-11-10 16:53:39 UTC [54042] LOG:  external sort ended, 6255699
disk blocks used: CPU: user: 8673.07 s, system: 175.93 s, elapsed:
9000.80 s

0.3% slower with the tape limit, but that might be noise.  Even if
not, it seems pretty silly to create 1873 tapes when we only need
~300.

At work_mem = 2GB:

2016-11-10 18:08:00 UTC [54042] LOG:  switching to external sort with
7490 tapes: CPU: user: 44.28 s, system: 1.99 s, elapsed: 46.33 s
2016-11-10 19:23:06 UTC [54042] LOG:  finished writing run 77 to tape
76: CPU: user: 4342.10 s, system: 156.21 s, elapsed: 4551.95 s
2016-11-10 19:23:06 UTC [54042] LOG:  using 2095202 KB of memory for
read buffers among 77 input tapes
2016-11-10 19:23:12 UTC [54042] LOG:  performsort done (except 77-way
final merge): CPU: user: 4343.36 s, system: 157.07 s, elapsed: 4558.79
s
2016-11-10 20:24:24 UTC [54042] LOG:  external sort ended, 6255946
disk blocks used: CPU: user: 7894.71 s, system: 176.36 s, elapsed:
8230.13 s

At work_mem = 2GB, max_sort_tapes = 501:

2016-11-10 21:28:23 UTC [54042] LOG:  switching to external sort with
502 tapes: CPU: user: 44.09 s, system: 1.94 s, elapsed: 46.07 s
2016-11-10 22:42:28 UTC [54042] LOG:  finished writing run 68 to tape
67: CPU: user: 4278.49 s, system: 154.39 s, elapsed: 4490.25 s
2016-11-10 22:42:28 UTC [54042] LOG:  using 2095427 KB of memory for
read buffers among 68 input tapes
2016-11-10 22:42:34 UTC [54042] LOG:  performsort done (except 68-way
final merge): CPU: user: 4279.60 s, system: 155.21 s, elapsed: 4496.83
s
2016-11-10 23:42:10 UTC [54042] LOG:  external sort ended, 6255983
disk blocks used: CPU: user: 7733.98 s, system: 173.85 s, elapsed:
8072.55 s

Roughly 2% faster.  Maybe still noise, but less likely.  7490 tapes
certainly seems over the top.

At work_mem = 8GB:

2016-11-14 19:17:28 UTC [54042] LOG:  switching to external sort with
29960 tapes: CPU: user: 183.80 s, system: 7.71 s, elapsed: 191.61 s
2016-11-14 20:32:02 UTC [54042] LOG:  finished writing run 20 to tape
19: CPU: user: 4431.44 s, system: 176.82 s, elapsed: 4665.16 s
2016-11-14 20:32:02 UTC [54042] LOG:  using 8388083 KB of memory for
read buffers among 20 input tapes
2016-11-14 20:32:26 UTC [54042] LOG:  performsort done (except 20-way
final merge): CPU: user: 4432.99 s, system: 181.29 s, elapsed: 4689.52
s
2016-11-14 21:30:56 UTC [54042] LOG:  

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-11-09 Thread Peter Geoghegan
On Wed, Nov 9, 2016 at 6:57 PM, Robert Haas  wrote:
> I guess that's possible, but the problem with polyphase merge is that
> the increased I/O becomes a pretty significant cost in a hurry.

Not if you have a huge RAID array. :-)

Obviously I'm not seriously suggesting that we revise the cap from 500
to 7. We're only concerned about the constant factors here. There is a
clearly a need to make some simplifying assumptions. I think that you
understand this very well, though.

> Maybe another way of putting this is that, while there's clearly a
> benefit to having some kind of a cap, it's appropriate to pick a large
> value, such as 500.  Having no cap at all risks creating many extra
> tapes that just waste memory, and also risks an unduly
> cache-inefficient final merge.  Reigning that in makes sense.
> However, we can't reign it in too far or we'll create slow polyphase
> merges in case that are reasonably likely to occur in real life.

I completely agree with your analysis.
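
As a worked illustration of that kind of cap (a sketch under assumptions; the
constants and helper name are not from the actual patch), the merge order
derived from workMem can simply be clamped between a small minimum and a
generous maximum such as 500:

/*
 * Sketch only: constants and the helper name are illustrative
 * assumptions, not the actual tuplesort.c code.
 */
#define MIN_MERGE_ORDER     6               /* roughly Knuth's "sweet spot" of ~7 */
#define MAX_MERGE_ORDER     500             /* the cap discussed in this thread */
#define TAPE_BUFFER_BYTES   (256 * 1024)    /* assumed per-tape buffer size */

static int
merge_order_for_workmem(long allowed_mem_bytes)
{
    long        order = allowed_mem_bytes / TAPE_BUFFER_BYTES;

    if (order < MIN_MERGE_ORDER)
        order = MIN_MERGE_ORDER;
    if (order > MAX_MERGE_ORDER)
        order = MAX_MERGE_ORDER;
    return (int) order;
}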

-- 
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-11-09 Thread Robert Haas
On Wed, Nov 9, 2016 at 7:54 PM, Peter Geoghegan  wrote:
>> Now, on the other hand, as far as I can see, the actual amount of
>> evidence that 0001 is a good idea which has been presented in this
>> forum is pretty near zero.  You've argued for it on theoretical
>> grounds several times, but theoretical arguments are not a substitute
>> for test results.
>
> See the illustration in TAOCP, vol III, page 273 in the second edition
> -- "Fig. 70. Efficiency of Polyphase merge using Algorithm D". I think
> that it's actually a real-world benchmark.

I don't have that publication, and I'm guessing that's not based on
PostgreSQL's implementation.  There's no substitute for tests using
the code we've actually got.

>> So, I'm now feeling pretty bullish about this patch, except for one
>> thing, which is that I think the comments are way off-base. Peter
>> writes: $When allowedMem is significantly lower than what is required
>> for an internal sort, it is unlikely that there are benefits to
>> increasing the number of tapes beyond Knuth's "sweet spot" of 7.$
>> I'm pretty sure that's totally wrong, first of all because commit
>> df700e6b40195d28dc764e0c694ac8cef90d4638 improved performance by doing
>> precisely the thing which this comment says we shouldn't
>
> It's more complicated than that. As I said, I think that Knuth
> basically had it right with his sweet spot of 7. I think that commit
> df700e6b40195d28dc764e0c694ac8cef90d4638 was effective in large part
> because a one-pass merge avoided certain overheads not inherent to
> polyphase merge, like all that memory accounting stuff, extra palloc()
> traffic, etc. The expanded use of per tape buffering we have even in
> multi-pass cases likely makes that much less true for us these days.
>
> The reason I haven't actually gone right back down to 7 with this cap
> is that it's possible that the added I/O costs outweigh the CPU costs
> in extreme cases, even though I think that polyphase merge doesn't
> have all that much to do with I/O costs, even with its 1970s
> perspective. Knuth doesn't say much about I/O costs -- it's more about
> using an extremely small amount of memory effectively (minimizing CPU
> costs with very little available main memory).
>
> Furthermore, not limiting ourselves to 7 tapes and seeing a benefit
> (benefitting from a few dozen or hundred instead) seems more possible
> with the improved merge heap maintenance logic added recently, where
> there could be perhaps hundreds of runs merged with very low CPU cost
> in the event of presorted input (or, input that is inversely
> logically/physically correlated). That would be true because we'd only
> examine the top of the heap throughout, and so I/O costs may matter much
> more.
>
> Depending on the exact details, I bet you could see a benefit with
> only 7 tapes due to CPU cache efficiency in a case like the one you
> describe. Perhaps when sorting integers, but not when sorting collated
> text. There are many competing considerations, which I've tried my
> best to balance here with a merge order of 500.

I guess that's possible, but the problem with polyphase merge is that
the increased I/O becomes a pretty significant cost in a hurry.
Here's the same test with max_sort_tapes = 100:

2016-11-09 23:02:49 UTC [48551] LOG:  begin tuple sort: nkeys = 1,
workMem = 262144, randomAccess = f
2016-11-09 23:02:55 UTC [48551] LOG:  switching to external sort with
101 tapes: CPU: user: 5.72 s, system: 0.25 s, elapsed: 6.04 s
2016-11-10 00:13:00 UTC [48551] LOG:  finished writing run 544 to tape
49: CPU: user: 4003.00 s, system: 156.89 s, elapsed: 4211.33 s
2016-11-10 00:16:52 UTC [48551] LOG:  finished 51-way merge step: CPU:
user: 4214.84 s, system: 161.94 s, elapsed: 4442.98 s
2016-11-10 00:25:41 UTC [48551] LOG:  finished 100-way merge step:
CPU: user: 4704.14 s, system: 170.83 s, elapsed: 4972.47 s
2016-11-10 00:36:47 UTC [48551] LOG:  finished 99-way merge step: CPU:
user: 5333.12 s, system: 179.94 s, elapsed: 5638.52 s
2016-11-10 00:45:32 UTC [48551] LOG:  finished 99-way merge step: CPU:
user: 5821.13 s, system: 189.00 s, elapsed: 6163.53 s
2016-11-10 01:01:29 UTC [48551] LOG:  finished 100-way merge step:
CPU: user: 6691.10 s, system: 210.60 s, elapsed: 7120.58 s
2016-11-10 01:01:29 UTC [48551] LOG:  performsort done (except 100-way
final merge): CPU: user: 6691.10 s, system: 210.60 s, elapsed: 7120.58
s
2016-11-10 01:45:40 UTC [48551] LOG:  external sort ended, 6255949
disk blocks used: CPU: user: 9271.07 s, system: 232.26 s, elapsed:
9771.49 s

This is already worse than max_sort_tapes = 501, though the total
runtime is still better than no cap (the time-to-first-tuple is way
worse, though).   I'm going to try max_sort_tapes = 10 next, but I
think the basic pattern is already fairly clear.  As you reduce the
cap on the number of tapes, (a) the time to build the initial runs
doesn't change very much, (b) the time to perform the final merge
decreases significantly, and (c) the time 
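
To put rough numbers on that tradeoff, here is a simplified,
self-contained model of my own (it assumes a balanced k-way merge;
polyphase distributes runs differently, so real pass counts will not
match exactly) of how the merge order translates into passes over the
data:

/*
 * Back-of-the-envelope model only: a balanced k-way merge needs about
 * ceil(log_k(nruns)) passes over the data, so capping the merge order
 * trades extra sequential I/O (more passes) against cheaper, more
 * cache-friendly individual merge steps.
 */
#include <math.h>
#include <stdio.h>

static int
merge_passes(int nruns, int merge_order)
{
    return (int) ceil(log((double) nruns) / log((double) merge_order));
}

int
main(void)
{
    const int nruns = 577;              /* runs written in the capped test */
    const int orders[] = {7, 100, 501};

    for (int i = 0; i < 3; i++)
        printf("merge order %3d -> %d pass(es) over the data\n",
               orders[i], merge_passes(nruns, orders[i]));
    return 0;
}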

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-11-09 Thread Peter Geoghegan
On Wed, Nov 9, 2016 at 4:54 PM, Peter Geoghegan  wrote:
> It's more complicated than that. As I said, I think that Knuth
> basically had it right with his sweet spot of 7. I think that commit
> df700e6b40195d28dc764e0c694ac8cef90d4638 was effective in large part
> because a one-pass merge avoided certain overheads not inherent to
> polyphase merge, like all that memory accounting stuff, extra palloc()
> traffic, etc. The expanded use of per tape buffering we have even in
> multi-pass cases likely makes that much less true for us these days.

Also, logtape.c fragmentation made multiple merge pass cases
experience increased random I/O in a way that was only an accident of
our implementation. We've fixed that now, but that problem must have
added further cost that df700e6b40195d28dc764e0c694ac8cef90d4638
*masked* when it was committed in 2006. (I do think that the problem
with the merge heap maintenance fixed recently in
24598337c8d214ba8dcf354130b72c49636bba69 was the biggest problem that
the 2006 work masked, though).

-- 
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-11-09 Thread Peter Geoghegan
On Wed, Nov 9, 2016 at 4:01 PM, Robert Haas  wrote:
> I gather that 0001, which puts a cap on the number of tapes, is not
> actually related to the subject of this thread; it's an independent
> change that you think is a good idea.  I reviewed the previous
> discussion on this topic upthread, between you and Heikki, which seems
> to me to contain more heat than light.

FWIW, I don't remember it that way. Heikki seemed to be uncomfortable
with the quasi-arbitrary choice of constant, rather than disagreeing
with the general idea of a cap. Or, maybe he thought I didn't go far
enough, by completely removing polyphase merge. I think that removing
polyphase merge would be an orthogonal change to this, though.

> Now, on the other hand, as far as I can see, the actual amount of
> evidence that 0001 is a good idea which has been presented in this
> forum is pretty near zero.  You've argued for it on theoretical
> grounds several times, but theoretical arguments are not a substitute
> for test results.

See the illustration in TAOCP, vol III, page 273 in the second edition
-- "Fig. 70. Efficiency of Polyphase merge using Algorithm D". I think
that it's actually a real-world benchmark.

I guess I felt that no one ever argued that as many tapes as possible
was sound on any grounds, even theoretical, and so didn't feel
obligated to test it until asked to do so. I think that the reason
that a cap like this didn't go in around the time that the growth
logic went in (2006) was because nobody followed up on it. If you look
at the archives, there is plenty of discussion of a cap like this at
the time.

> That looks very good.  On a test that runs for almost 3 hours, we
> saved more than 20 minutes.  The overall runtime improvement is 23% in
> a case where we would not expect this patch to do particularly well;
> after all, without limiting the number of runs, we are able to
> complete the sort with a single merge pass, whereas when we reduce the
> number of runs, we now require a polyphase merge.  Nevertheless, we
> come out way ahead, because the final merge pass gets way faster,
> presumably because there are fewer tapes involved.  The first test
> does a 616-way final merge and takes 6184.34 seconds to do it.  The
> second test does a 501-way final merge and takes 4519.31 seconds to
> do.  This increased final merge speed accounts for practically all of
> the speedup, and the reason it's faster pretty much has to be that
> it's merging fewer tapes.

It's CPU cache efficiency -- has to be.
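
A rough way to quantify that intuition (a standalone sketch of mine,
with an assumed heap entry size rather than a measured one): each tuple
returned by a k-way merge costs about ceil(log2(k)) heap comparisons,
and the heap's working set grows with k, so a 7-way merge touches far
less hot data per tuple than a 616-way merge.

/*
 * Rough sketch of the cache argument: a binary merge heap has one entry
 * per input tape, so each tuple returned by the merge costs about
 * ceil(log2(ntapes)) comparisons, and the heap's working set grows
 * linearly with the tape count.  The entry size is an assumption for
 * illustration, not a measurement.
 */
#include <math.h>
#include <stdio.h>

int
main(void)
{
    const int    tapes[] = {7, 100, 501, 616};
    const size_t entry_size = 24;       /* assumed bytes per heap entry */

    for (int i = 0; i < 4; i++)
    {
        int comparisons = (int) ceil(log2((double) tapes[i]));

        printf("%3d tapes: ~%2d comparisons/tuple, heap ~%zu bytes\n",
               tapes[i], comparisons, tapes[i] * entry_size);
    }
    return 0;
}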

> That, in turn, happens for two reasons.  First, because limiting the
> number of tapes increases slightly the memory available for storing
> the tuples belonging to each run, we end up with fewer runs in the
> first place.  The number of runs drops from from 616 to 577, about a
> 7% reduction.  Second, because we have more runs than tapes in the
> second case, it does a 77-way merge prior to the final merge.  Because
> of that 77-way merge, the time at which the second run starts
> producing tuples is slightly later.  Instead of producing the first
> tuple at 70:47.71, we have to wait until 75:72.22.  That's a small
> disadvantage in this case, because it's hypothetically possible that a
> query like this could have a LIMIT and we'd end up worse off overall.
> However, that's pretty unlikely, for three reasons.  Number one, LIMIT
> isn't likely to be used on queries of this type in the first place.
> Number two, if it were used, we'd probably end up with a bounded sort
> plan which would be way faster anyway.  Number three, if somehow we
> still sorted the data set we'd still win in this case if the limit
> were more than about 20% of the total number of tuples.  The much
> faster run time to produce the whole data set is a small price to pay
> for possibly needing to wait a little longer for the first tuple.

Cool.

> So, I'm now feeling pretty bullish about this patch, except for one
> thing, which is that I think the comments are way off-base. Peter
> writes: $When allowedMem is significantly lower than what is required
> for an internal sort, it is unlikely that there are benefits to
> increasing the number of tapes beyond Knuth's "sweet spot" of 7.$
> I'm pretty sure that's totally wrong, first of all because commit
> df700e6b40195d28dc764e0c694ac8cef90d4638 improved performance by doing
> precisely the thing which this comment says we shouldn't

It's more complicated than that. As I said, I think that Knuth
basically had it right with his sweet spot of 7. I think that commit
df700e6b40195d28dc764e0c694ac8cef90d4638 was effective in large part
because a one-pass merge avoided certain overheads not inherent to
polyphase merge, like all that memory accounting stuff, extra palloc()
traffic, etc. The expanded use of per tape buffering we have even in
multi-pass cases likely makes that much less true for us these days.

The reason I haven't actually gone right back down to 7 with this cap
is that it's 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-11-09 Thread Robert Haas
On Mon, Nov 7, 2016 at 11:28 PM, Peter Geoghegan  wrote:
> I attach V5.

I gather that 0001, which puts a cap on the number of tapes, is not
actually related to the subject of this thread; it's an independent
change that you think is a good idea.  I reviewed the previous
discussion on this topic upthread, between you and Heikki, which seems
to me to contain more heat than light.  At least in my opinion, the
question is not whether a limit on the number of tapes is the best
possible system, but rather whether it's better than the status quo.
It's silly to refuse to make a simple change on the grounds that some
much more complex change might be better, because if somebody writes
that patch and it is better we can always revert 0001 then.  If 0001
involved hundreds of lines of invasive code changes, that argument
wouldn't apply, but it doesn't; it's almost a one-liner.

Now, on the other hand, as far as I can see, the actual amount of
evidence that 0001 is a good idea which has been presented in this
forum is pretty near zero.  You've argued for it on theoretical
grounds several times, but theoretical arguments are not a substitute
for test results.  Therefore, I decided that the best thing to do was
test it myself.  I wrote a little patch to add a GUC for
max_sort_tapes, which actually turns out not to work as I thought:
setting max_sort_tapes = 501 seems to limit the highest tape number to
501 rather than the number of tapes to 501, so there's a sort of
off-by-one error.  But that doesn't really matter.  The patch is
attached here for the convenience of anyone else who may want to
fiddle with this.
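
The off-by-one is easy to picture; the following is a hypothetical
illustration (invented names, not the actual test patch) of how
clamping the highest tape number rather than the tape count produces
the 502-tape behavior seen in the second run below:

/*
 * Hypothetical illustration of the off-by-one (names invented, not the
 * actual test patch): clamping the highest tape *number* rather than
 * the tape *count* yields one extra tape.
 */
#include <stdio.h>

#define Min(a, b) ((a) < (b) ? (a) : (b))

int
main(void)
{
    int max_sort_tapes = 501;   /* GUC setting used in the test */
    int want_tapes = 937;       /* what workMem alone would allow */

    /* intended behavior: cap the number of tapes */
    int capped_count = Min(want_tapes, max_sort_tapes);

    /* observed behavior: cap the highest tape number, so one extra tape */
    int capped_number = Min(want_tapes - 1, max_sort_tapes) + 1;

    printf("intended cap: %d tapes, observed: %d tapes\n",
           capped_count, capped_number);
    return 0;
}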

Next, I tried to set things up so that I'd get a large enough number
of tapes for the cap to matter.  To do that, I initialized with
"pgbench -i --unlogged-tables -s 2" so that I had 2 billion
tuples.  Then I used this SQL query: "select sum(w+abalance) from
(select (aid::numeric * 7123000217)%10 w, * from
pgbench_accounts order by 1) x".  The point of the math is to perturb
the ordering of the tuples so that they actually need to be sorted
instead of just passed through unchanged. The use of abalance in the
outer sum prevents an index-only-scan from being used, which makes the
sort wider; perhaps I should have tried to make it wider still, but
this is what I did.  I wanted to have more than 501 tapes because,
obviously, a concern with a change like this is that things might get
slower in the case where it forces a polyphase merge rather than a
single merge pass. And, of course, I set trace_sort = on.

Here's what my initial run looked like, in brief:

2016-11-09 15:37:52 UTC [44026] LOG:  begin tuple sort: nkeys = 1,
workMem = 262144, randomAccess = f
2016-11-09 15:37:59 UTC [44026] LOG:  switching to external sort with
937 tapes: CPU: user: 5.51 s, system: 0.27 s, elapsed: 6.56 s
2016-11-09 16:48:31 UTC [44026] LOG:  finished writing run 616 to tape
615: CPU: user: 4029.17 s, system: 152.72 s, elapsed: 4238.54 s
2016-11-09 16:48:31 UTC [44026] LOG:  using 246719 KB of memory for
read buffers among 616 input tapes
2016-11-09 16:48:39 UTC [44026] LOG:  performsort done (except 616-way
final merge): CPU: user: 4030.30 s, system: 152.98 s, elapsed: 4247.41
s
2016-11-09 18:33:30 UTC [44026] LOG:  external sort ended, 6255145
disk blocks used: CPU: user: 10214.64 s, system: 175.24 s, elapsed:
10538.06 s

And according to psql: Time: 10538068.225 ms (02:55:38.068)

Then I set max_sort_tapes = 501 and ran it again.  This time:

2016-11-09 19:05:22 UTC [44026] LOG:  begin tuple sort: nkeys = 1,
workMem = 262144, randomAccess = f
2016-11-09 19:05:28 UTC [44026] LOG:  switching to external sort with
502 tapes: CPU: user: 5.69 s, system: 0.26 s, elapsed: 6.13 s
2016-11-09 20:15:20 UTC [44026] LOG:  finished writing run 577 to tape
75: CPU: user: 3993.81 s, system: 153.42 s, elapsed: 4198.52 s
2016-11-09 20:15:20 UTC [44026] LOG:  using 249594 KB of memory for
read buffers among 501 input tapes
2016-11-09 20:21:19 UTC [44026] LOG:  finished 77-way merge step: CPU:
user: 4329.50 s, system: 160.67 s, elapsed: 4557.22 s
2016-11-09 20:21:19 UTC [44026] LOG:  performsort done (except 501-way
final merge): CPU: user: 4329.50 s, system: 160.67 s, elapsed: 4557.22
s
2016-11-09 21:38:12 UTC [44026] LOG:  external sort ended, 6255484
disk blocks used: CPU: user: 8848.81 s, system: 182.64 s, elapsed:
9170.62 s

And this one, according to psql: Time: 9170629.597 ms (02:32:50.630)

That looks very good.  On a test that runs for almost 3 hours, we
saved more than 20 minutes.  The overall runtime improvement is 23% in
a case where we would not expect this patch to do particularly well;
after all, without limiting the number of runs, we are able to
complete the sort with a single merge pass, whereas when we reduce the
number of runs, we now require a polyphase merge.  Nevertheless, we
come out way ahead, because the final merge pass gets way faster,
presumably because there are fewer tapes involved.  The 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-11-07 Thread Peter Geoghegan
On Mon, Oct 24, 2016 at 6:17 PM, Peter Geoghegan  wrote:
>> * Cost model. Should probably attempt to guess final index size, and
>> derive calculation of number of workers from that. Also, I'm concerned
>> that I haven't given enough thought to the low end, where with default
>> settings most CREATE INDEX statements will use at least one parallel
>> worker.

> While I haven't made progress on any of these open items, I should
> still get a version out that applies cleanly on top of git tip --
> commit b75f467b6eec0678452fd8d7f8d306e6df3a1076 caused the patch to
> bitrot. I attach V4, which is a fairly mechanical rebase of V3, with
> no notable behavioral changes or bug fixes.

I attach V5. Changes:

* A big cost model overhaul. Workers are logarithmically scaled based
on projected final *index* size, not current heap size, as was the
case in V4. A new nbtpage.c routine is added to estimate a
not-yet-built B-Tree index's size, now called by the optimizer. This
involves getting average item width for indexed attributes from
pg_attribute for the heap relation. There are some subtleties here
with partial indexes, null_frac, etc. I also refined the cap applied
on the number of workers that limits too many workers being launched
when there isn't so much maintenance_work_mem.
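
To give a feel for the shape of that scaling, here is an illustrative
sketch (invented thresholds and names, not the patch's actual cost
model) of workers growing with the log of the projected index size and
then being capped by the per-worker memory slice:

/*
 * Illustrative sketch only (invented thresholds, not the patch's actual
 * cost model): scale workers with the log of the projected index size,
 * then cap so that each participant still gets a useful slice of
 * maintenance_work_mem.
 */
#include <stdio.h>

static int
suggest_workers(double index_mb, double maintenance_work_mem_mb,
                int max_workers)
{
    const double threshold_mb = 512.0;  /* assumed: below this, stay serial */
    const double min_slice_mb = 32.0;   /* assumed minimum per-worker budget */
    int     nworkers = 0;

    for (double sz = index_mb; sz >= threshold_mb; sz /= 2.0)
        nworkers++;                     /* one more worker per doubling */

    /* don't launch workers that would get a uselessly small memory slice */
    while (nworkers > 0 &&
           maintenance_work_mem_mb / (nworkers + 1) < min_slice_mb)
        nworkers--;

    return nworkers > max_workers ? max_workers : nworkers;
}

int
main(void)
{
    printf("73GB index, 8GB maintenance_work_mem: %d workers\n",
           suggest_workers(73 * 1024.0, 8 * 1024.0, 8));
    printf("200MB partial index, 64MB maintenance_work_mem: %d workers\n",
           suggest_workers(200.0, 64.0, 8));
    return 0;
}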

The cost model is much improved now -- it is now more than just a
placeholder, at least. It doesn't do things like launch a totally
inappropriate number of workers to build a very small partial index.
Granted, those workers would still have something to do -- scan the
heap -- but not enough to justify launching so many (that is,
launching as many as would be launched for an equivalent non-partial
index).

That having been said, things are still quite fudged here, and I
struggle to find any guiding principle around doing better on average.
I think that that's because of the inherent difficulty of modeling
what's going on, but I'd be happy to be proven wrong on that. In any
case, I think it's going to be fairly common for DBAs to want to use
the storage parameter to force the use of a particular number of
parallel workers.

(See also: my remarks below on how the new bt_estimate_nblocks()
SQL-callable function can give insight into the new cost model's
decisions.)

* Overhauled leader_mergeruns() further, to make it closer to
mergeruns(). We now always rewind input tapes. This simplification
involved refining some of the assertions within logtape.c, which is
also slightly simplified.

* 2 new testing tools are added to the final commit in the patch
series (not actually proposed for commit). I've added 2 new
SQL-callable functions to contrib/pageinspect.

The 2 new testing functions are:

bt_estimate_nblocks
-------------------

bt_estimate_nblocks() provides an easy way to see the optimizer's
projection of how large the final index will be. It returns an
estimate in blocks. Example:

mgd=# analyze;
ANALYZE
mgd=# select oid::regclass as rel,
bt_estimated_nblocks(oid),
relpages,
to_char(bt_estimated_nblocks(oid)::numeric / relpages, 'FM990.990') as
estimate_actual
from pg_class
where relkind = 'i'
order by relpages desc limit 20;

                         rel                         │ bt_estimated_nblocks │ relpages │ estimate_actual
─────────────────────────────────────────────────────┼──────────────────────┼──────────┼─────────────────
 mgd.acc_accession_idx_accid                         │              107,091 │  106,274 │ 1.008
 mgd.acc_accession_0                                 │              169,024 │  106,274 │ 1.590
 mgd.acc_accession_1                                 │              169,024 │   80,382 │ 2.103
 mgd.acc_accession_idx_prefixpart                    │               76,661 │   80,382 │ 0.954
 mgd.acc_accession_idx_mgitype_key                   │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_clustered                     │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_createdby_key                 │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_numericpart                   │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_logicaldb_key                 │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_idx_modifiedby_key                │               76,661 │   76,928 │ 0.997
 mgd.acc_accession_pkey                              │               76,661 │   76,928 │ 0.997
 mgd.mgi_relationship_property_idx_propertyname_key  │               74,197 │   74,462 │ 0.996
 mgd.mgi_relationship_property_idx_modifiedby_key    │               74,197 │   74,462 │ 0.996
 mgd.mgi_relationship_property_pkey                  │               74,197 │   74,462 │ 0.996
 mgd.mgi_relationship_property_idx_clustered         │               74,197 │   74,462 │ 0.996
 mgd.mgi_relationship_property_idx_createdby_key     │               74,197 │   74,462 │ 0.996
 mgd.seq_sequence_idx_clustered                      │               50,051 │   50,486 │ 0.991
 mgd.seq_sequence_raw_pkey                           │               35,826 │   35,952 │ 0.996
 mgd.seq_sequence_raw_idx_modifiedby_key             │               35,826 │   35,952 │ 0.996
 mgd.seq_source_assoc_idx_clustered                  │               35,822 │   35,952 │ 0.996
(20 rows)

I haven't 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-10-27 Thread Peter Geoghegan
On Wed, Oct 19, 2016 at 11:33 AM, Peter Geoghegan  wrote:
> I don't think that eager merging will prove all that effective,
> however it's implemented. I see a very I/O bound system when parallel
> CREATE INDEX merges serially. There is no obvious reason why you'd
> have a straggler worker process with CREATE INDEX, really.

In an effort to head off any misunderstanding around this patch
series, I started a new Wiki page for it:

https://wiki.postgresql.org/wiki/Parallel_External_Sort

This talks about parallel CREATE INDEX in particular, and uses of
parallel external sort more generally, including future uses beyond
CREATE INDEX.

This approach worked very well for me during the UPSERT project, where
a detailed overview really helped. With UPSERT, it was particularly
difficult to keep the *current* state of things straight, such as
current open items for the patch, areas of disagreement, and areas
where there was no longer any disagreement or controversy. I don't
think that this patch is even remotely as complicated as UPSERT was,
but it's still something that has had several concurrently active
mailing list threads (threads that are at least loosely related to the
project), so I think that this will be useful.

I welcome anyone with an interest in this project to review the Wiki
page, add their own concerns to it with -hackers citation, and add
their own content around related work. There is a kind of unresolved
question around where the Gather Merge work might fit in to what I've
come up with aleady. There may be other unresolved questions like
that, that I'm not even aware of.

I commit to maintaining the new Wiki page as a useful starting
reference for understanding the current state of this patch. I hope
this makes looking into the patch series less intimidating for
potential reviewers.

-- 
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-10-26 Thread Peter Geoghegan
On Mon, Aug 1, 2016 at 3:18 PM, Peter Geoghegan  wrote:
> Setup:
>
> CREATE TABLE parallel_sort_test AS
> SELECT hashint8(i) randint,
> md5(i::text) collate "C" padding1,
> md5(i::text || '2') collate "C" padding2
> FROM generate_series(0, 1e9::bigint) i;
>
> CHECKPOINT;
>
> This leaves us with a parallel_sort_test table that is 94 GB in size.
>
> SET maintenance_work_mem = '8GB';
>
> -- Serial case (external sort, should closely match master branch):
> CREATE INDEX serial_idx ON parallel_sort_test (randint) WITH
> (parallel_workers = 0);
>
> Total time: 00:15:42.15
>
> -- Patch with 8 tuplesort "sort-and-scan" workers (leader process
> participates as a worker here):
> CREATE INDEX patch_8_idx ON parallel_sort_test (randint) WITH
> (parallel_workers = 7);
>
> Total time: 00:06:03.86
>
> As you can see, the parallel case is 2.58x faster

I decided to revisit this exact benchmark, using the same AWS instance
type (the one with 16 HDDs, again configured in software RAID0) to see
how things had changed for both parallel and serial cases. I am now
testing V4. A lot changed in the last 3 months, with most of the
changes that help here now already committed to the master branch.

Relevant changes


* Heikki's major overhaul of preload memory makes CREATE INDEX merging
have more sequential access patterns. It also effectively allows us to
use more memory. It's possible that the biggest benefit it brings to
parallel CREATE INDEX is that it eliminates almost any random I/O
penalty from logtape.c fragmentation that an extra merge pass has;
parallel workers now usually do their own merge to produce one big run
for the leader to merge. It also improves CPU cache efficiency quite
directly, I think.

This is the patch that helps most. Many thanks to Heikki for driving
this forward.

* My patch to simplify and optimize how the K-way merge heap is
maintained (as tuples fill leaf pages of the final index structure)
makes the merge phase significantly less CPU bound overall.

(These first two items particularly help parallel CREATE INDEX, which
spends proportionally much more wall clock time merging than would be
expected for similar serial cases. Serial cases do of course also
benefit.)

* V2 of the patch (and all subsequent versions) apportioned slices of
maintenance_work_mem to workers. maintenance_work_mem became a
per-utility-operation budget, regardless of number of workers
launched. This means that workers have less memory than the original
V1 benchmark (they simply don't make use of it now), but this seems
unlikely to hurt. Possibly, it even helps.

* Andres' work on md.c scalability may have helped (seems unlikely
with these CREATE INDEX cases that produce indexes not in the hundreds
of gigabytes, though). It would help with *extremely* large index
creation, which we won't really look at here.

Things now look better than ever for the parallel CREATE INDEX patch.
While it's typical for about 75% of wall clock time to be spent on
sorting runs with serial CREATE INDEX, with the remaining 25% going on
merging/writing index, with parallel CREATE INDEX I now generally see
about a 50/50 split between parallel sorting of runs (including any
worker merging to produce final runs) and serial merging for final
on-the-fly merge where we actually write new index out as input is
merged. This is a *significant* improvement over what we saw here back
in August, where it was not uncommon for parallel CREATE INDEX to
spend *twice* as much time in the serial final on-the-fly merge step.

All improvements to the code that we've seen since August have
targeted this final on-the-fly merge bottleneck. (The final on-the-fly
merge is now *consistently* able to write out the index at a rate of
150MB/sec+ in my tests, which is pretty good.)

New results
===========

Same setup as one quoted above -- once again, we "SET
maintenance_work_mem = '8GB'".

-- Patch with 8 tuplesort "sort-and-scan" workers:
CREATE INDEX patch_8_idx ON parallel_sort_test (randint) WITH
(parallel_workers = 7);

Total time: 00:04:24.93

-- Serial case:
CREATE INDEX serial_idx ON parallel_sort_test (randint) WITH
(parallel_workers = 0);

Total time: 00:14:25.19

3.27x faster. Not bad. As you see in the quoted text, that was 2.58x
back in August, even though the implementation now uses a lot less
memory in parallel workers. And, that's without even considering the
general question of how much faster index creation can be compared to
Postgres 9.6 -- it's roughly 3.5x faster at times.

New case


Separately, using my gensort tool [1], I came up with a new test case.
The tool generated a 2.5 billion row table, sized at 159GB. This is
how long it takes to produce a 73GB index on the "sortkey" column of
the resulting table:

-- gensort "C" locale text parallel case:
CREATE INDEX test8 on sort_test(sortkey) WITH (parallel_workers = 7);

Total time: 00:16:19.63

-- gensort "C" locale text serial case:
CREATE INDEX test0 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-10-24 Thread Peter Geoghegan
On Fri, Oct 7, 2016 at 5:47 PM, Peter Geoghegan  wrote:
> Work is still needed on:
>
> * Cost model. Should probably attempt to guess final index size, and
> derive calculation of number of workers from that. Also, I'm concerned
> that I haven't given enough thought to the low end, where with default
> settings most CREATE INDEX statements will use at least one parallel
> worker.
>
> * The whole way that I teach nbtsort.c to disallow catalog tables for
> parallel CREATE INDEX due to concerns about parallel safety is in need
> of expert review, preferably from Robert. It's complicated in a way
> that relies on things happening or not happening from a distance.
>
> * Heikki seems to want to change more about logtape.c, and its use of
> indirection blocks. That may evolve, but for now I can only target the
> master branch.
>
> * More extensive performance testing. I think that this V3 is probably
> the fastest version yet, what with Heikki's improvements, but I
> haven't really verified that.

While I haven't made progress on any of these open items, I should
still get a version out that applies cleanly on top of git tip --
commit b75f467b6eec0678452fd8d7f8d306e6df3a1076 caused the patch to
bitrot. I attach V4, which is a fairly mechanical rebase of V3, with
no notable behavioral changes or bug fixes.

-- 
Peter Geoghegan


0003-Add-force_btree_randomaccess-GUC-for-testing.patch.gz
Description: GNU Zip compressed data


0002-Add-parallel-B-tree-index-build-sorting.patch.gz
Description: GNU Zip compressed data


0001-Cap-the-number-of-tapes-used-by-external-sorts.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-10-21 Thread Amit Kapila
On Fri, Oct 21, 2016 at 4:25 PM, Amit Kapila  wrote:
> On Tue, Oct 18, 2016 at 3:48 AM, Peter Geoghegan  wrote:
>> On Mon, Oct 17, 2016 at 5:36 AM, Amit Kapila  wrote:
>>
>> I read the following paragraph from the Volcano paper just now:
>>
>> """
>> During implementation and benchmarking of parallel sorting, we added
>> two more features to exchange. First, we wanted to implement a merge
>> network in which some processors produce sorted streams merge
>> concurrently by other processors. Volcano’s sort iterator can be used
>> to generate a sorted stream. A merge iterator was easily derived from
>> the sort module. It uses a single level merge, instead of the cascaded
>> merge of runs used in sort. The input of a merge iterator is an
>> exchange. Differently from other operators, the merge iterator
>> requires to distinguish the input records by their producer. As an
>> example, for a join operation it does not matter where the input
>> records were created, and all inputs can be accumulated in a single
>> input stream. For a merge operation, it is crucial to distinguish the
>> input records by their producer in order to merge multiple sorted
>> streams correctly.
>> """
>>
>> I don't really understand this paragraph, but thought I'd ask: why the
>> need to "distinguish the input records by their producer in order to
>> merge multiple sorted streams correctly"? Isn't that talking about
>> partitioning, where each worker's *ownership* of a range matters?
>>
>
> I think so, but it seems from the above text that it is mainly
> required for the merge iterator, which will probably be used in merge
> join.
>
>> My
>> patch doesn't care which values belong to which workers. And, it
>> focuses quite a lot on dealing well with the memory bandwidth bound,
>> I/O bound part of the sort where we write out the index itself, just
>> by piggy-backing on tuplesort.c. I don't think that that's useful for
>> a general-purpose executor node -- tuple-at-a-time processing when
>> fetching from workers would kill performance.
>>
>
> Right, but what is written in the text you quoted seems to be doable
> with tuple-at-a-time processing.
>

To be clear, in saying the above, I don't mean that we should try that
approach instead of what you are proposing, but it is worth some
discussion to see if it has any significant merits.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-10-21 Thread Amit Kapila
On Tue, Oct 18, 2016 at 3:48 AM, Peter Geoghegan  wrote:
> On Mon, Oct 17, 2016 at 5:36 AM, Amit Kapila  wrote:
>
> I read the following paragraph from the Volcano paper just now:
>
> """
> During implementation and benchmarking of parallel sorting, we added
> two more features to exchange. First, we wanted to implement a merge
> network in which some processors produce sorted streams merge
> concurrently by other processors. Volcano’s sort iterator can be used
> to generate a sorted stream. A merge iterator was easily derived from
> the sort module. It uses a single level merge, instead of the cascaded
> merge of runs used in sort. The input of a merge iterator is an
> exchange. Differently from other operators, the merge iterator
> requires to distinguish the input records by their producer. As an
> example, for a join operation it does not matter where the input
> records were created, and all inputs can be accumulated in a single
> input stream. For a merge operation, it is crucial to distinguish the
> input records by their producer in order to merge multiple sorted
> streams correctly.
> """
>
> I don't really understand this paragraph, but thought I'd ask: why the
> need to "distinguish the input records by their producer in order to
> merge multiple sorted streams correctly"? Isn't that talking about
> partitioning, where each worker's *ownership* of a range matters?
>

I think so, but it seems from the above text that it is mainly required
for the merge iterator, which will probably be used in merge join.

> My
> patch doesn't care which values belong to which workers. And, it
> focuses quite a lot on dealing well with the memory bandwidth bound,
> I/O bound part of the sort where we write out the index itself, just
> by piggy-backing on tuplesort.c. I don't think that that's useful for
> a general-purpose executor node -- tuple-at-a-time processing when
> fetching from workers would kill performance.
>

Right, but what is written in the text you quoted seems to be doable
with tuple-at-a-time processing.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-10-21 Thread Amit Kapila
On Thu, Oct 20, 2016 at 12:03 AM, Peter Geoghegan  wrote:
> On Wed, Oct 19, 2016 at 7:39 AM, Robert Haas  wrote:
>> Gather Merge can't emit a tuple unless it has buffered at least one
>> tuple from every producer; otherwise, the next tuple it receives from
>> one of those producers might precede whichever tuple it chooses to
>> emit.

Right. Now, after looking again at the Gather Merge patch, I think I can
better understand how it performs merging.

>>  However, it doesn't need to wait until all of the workers are
>> completely done.  The leader only needs to be at least slightly ahead
>> of the slowest worker.  I'm not sure how that compares to Peter's
>> approach.
>
> I don't think that eager merging will prove all that effective,
> however it's implemented. I see a very I/O bound system when parallel
> CREATE INDEX merges serially. There is no obvious reason why you'd
> have a straggler worker process with CREATE INDEX, really.
>
>> What I'm worried about is that we're implementing two separate systems
>> to do the same thing, and that the parallel sort approach is actually
>> a lot less general.  I think it's possible to imagine a Parallel Sort
>> implementation which does things Gather Merge can't.  If all of the
>> workers collaborate to sort all of the data rather than each worker
>> sorting its own data, then you've got something which Gather Merge
>> can't match.  But this is not that.
>
> It's not that yet, certainly. I think I've sketched a path forward for
> making partitioning a part of logtape.c that is promising. The sharing
> of ranges within tapes and so on will probably have a significant
> amount in common with what I've come up with.
>
> I don't think that any executor infrastructure is a particularly good
> model when *batch output* is needed -- the tuple queue mechanism will
> be a significant bottleneck, particularly because it does not
> integrate read-ahead, etc.
>

The tuple queue mechanism might not be super-efficient for *batch
output* (cases where many tuples need to be read and written), but I
see no reason why it will be slower than the disk I/O which I think you
are using in the patch.  IIUC, in the patch each worker, including the
leader, does a tape sort for its share of tuples, and then finally the
leader merges and populates the index.  I am not sure if the mechanism
used in the patch can be useful as compared to using a tuple queue, if
the workers can finish their part of the sorting in-memory.

>
> The bottom line is that it's inherently difficult for me to refute the
> idea that Gather Merge could do just as well as what I have here,
> because proving that involves adding a significant amount of new
> infrastructure (e.g., to teach the executor about IndexTuples).
>

I think there could be a simpler way: we can force the Gather Merge
node when all the tuples need to be sorted and measure the time until
it has merged all tuples.  Similarly, with your patch, we can wait
until the final merge is completed.  However, after doing an initial
study of both patches, I feel one can construct cases where Gather
Merge can win and also cases where your patch can win.  In particular,
Gather Merge can win where workers need to perform the sort mostly
in-memory.  I am not sure if it's easy to get the best of both worlds.

Your patch needs a rebase, and I noticed one warning:
sort\logtape.c(1422): warning C4700: uninitialized local variable 'lt' used

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-10-19 Thread Peter Geoghegan
On Wed, Oct 19, 2016 at 7:39 AM, Robert Haas  wrote:
> Gather Merge can't emit a tuple unless it has buffered at least one
> tuple from every producer; otherwise, the next tuple it receives from
> one of those producers might precede whichever tuple it chooses to
> emit.  However, it doesn't need to wait until all of the workers are
> completely done.  The leader only needs to be at least slightly ahead
> of the slowest worker.  I'm not sure how that compares to Peter's
> approach.

I don't think that eager merging will prove all that effective,
however it's implemented. I see a very I/O bound system when parallel
CREATE INDEX merges serially. There is no obvious reason why you'd
have a straggler worker process with CREATE INDEX, really.

> What I'm worried about is that we're implementing two separate systems
> to do the same thing, and that the parallel sort approach is actually
> a lot less general.  I think it's possible to imagine a Parallel Sort
> implementation which does things Gather Merge can't.  If all of the
> workers collaborate to sort all of the data rather than each worker
> sorting its own data, then you've got something which Gather Merge
> can't match.  But this is not that.

It's not that yet, certainly. I think I've sketched a path forward for
making partitioning a part of logtape.c that is promising. The sharing
of ranges within tapes and so on will probably have a significant
amount in common with what I've come up with.

I don't think that any executor infrastructure is a particularly good
model when *batch output* is needed -- the tuple queue mechanism will
be a significant bottleneck, particularly because it does not
integrate read-ahead, etc. The best case that I saw advertised for
Gather Merge was TPC-H query 9 [1]. That doesn't look like a good
proxy for how Gather Merge adapted to parallel CREATE INDEX would do,
since it benefits from the GroupAggregate merge having many equal
values, possibly with a clustering in the original tables that can
naturally be exploited (no TID tiebreaker needed, since IndexTuples
are not being merged). Also, it looks like Gather Merge may do that
well by enabling things, rather than parallelizing the sort
effectively per se. Besides, the query 9 case is significantly less
scalable than good cases for this parallel CREATE INDEX patch have
already been shown to be.

I think I've been pretty modest about what this parallel CREATE INDEX
patch gets us from the beginning. It is a generalization of
tuplesort.c to work in parallel; we need a lot more for that to make
things like GroupAggregate as scalable as possible, and I don't
pretend that this helps much with that. There are actually more
changes to nbtsort.c to coordinate all of this than there are to
tuplesort.c in the latest version, so I think that this simpler
approach for parallel CREATE INDEX and CLUSTER is worthwhile.

The bottom line is that it's inherently difficult for me to refute the
idea that Gather Merge could do just as well as what I have here,
because proving that involves adding a significant amount of new
infrastructure (e.g., to teach the executor about IndexTuples). I
think that the argument for this basic approach is sound (it appears
to offer comparable scalability to the parallel CREATE INDEX
implementations of other systems), but it's simply impractical for me
to offer much reassurance beyond that.

[1] https://github.com/tvondra/pg_tpch/blob/master/dss/templates/9.sql
-- 
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-10-19 Thread Robert Haas
On Mon, Oct 17, 2016 at 8:36 AM, Amit Kapila  wrote:
>> This project of mine is about parallelizing tuplesort.c, which isn't
>> really what you want for parallel query -- you shouldn't try to scope
>> the problem as "make the sort more scalable using parallelism" there.
>> Rather, you want to scope it at "make the execution of the entire
>> query more scalable using parallelism", which is really quite a
>> different thing, which necessarily involves the executor having direct
>> knowledge of partition boundaries.
>
> Okay, but what is the proof, or why do you think the second is going
> to be better than the first?  One thing which strikes me as a major
> difference between your approach and Gather Merge is that in your
> approach the leader has to wait till all the workers are done with
> their work on sorting, whereas with Gather Merge, as soon as the first
> one is done, the leader starts merging.  I could be wrong here, but if
> I understood it correctly, then there is an argument that a Gather
> Merge kind of approach can win in cases where some of the workers can
> produce sorted outputs ahead of others, and I am not sure if we can
> dismiss such cases.

Gather Merge can't emit a tuple unless it has buffered at least one
tuple from every producer; otherwise, the next tuple it receives from
one of those producers might precede whichever tuple it chooses to
emit.  However, it doesn't need to wait until all of the workers are
completely done.  The leader only needs to be at least slightly ahead
of the slowest worker.  I'm not sure how that compares to Peter's
approach.
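
To make that invariant concrete, here is a minimal standalone sketch
(my own illustration, not the Gather Merge code): the consumer may emit
only while it holds a buffered tuple from every producer that is not
yet exhausted, since an unheard-from producer could still supply a
smaller value.

/*
 * Minimal standalone sketch of the invariant (not the Gather Merge
 * implementation): the merging consumer may emit a value only when it
 * holds something buffered from every producer that is not yet
 * exhausted, because an unseen producer could still supply a smaller
 * value.
 */
#include <stdbool.h>
#include <stdio.h>

#define NPRODUCERS 3

typedef struct
{
    const int  *vals;           /* this producer's sorted output */
    int         len;
    int         next;           /* index of next value to buffer */
    bool        have_buffered;  /* consumer holds one value from us */
    int         buffered;
} Producer;

/* pull one value from every non-exhausted producer we lack a value from */
static void
refill(Producer *p, int n)
{
    for (int i = 0; i < n; i++)
        if (!p[i].have_buffered && p[i].next < p[i].len)
        {
            p[i].buffered = p[i].vals[p[i].next++];
            p[i].have_buffered = true;
        }
}

int
main(void)
{
    const int   a[] = {1, 4, 9},
                b[] = {2, 3, 10},
                c[] = {5, 6, 7};
    Producer    p[NPRODUCERS] = {
        {a, 3, 0, false, 0}, {b, 3, 0, false, 0}, {c, 3, 0, false, 0}
    };

    for (;;)
    {
        int best = -1;

        /*
         * In the real thing the consumer would block here until every
         * live producer's queue has something in it.
         */
        refill(p, NPRODUCERS);

        /* emit the smallest buffered value */
        for (int i = 0; i < NPRODUCERS; i++)
            if (p[i].have_buffered &&
                (best == -1 || p[i].buffered < p[best].buffered))
                best = i;

        if (best == -1)
            break;              /* every producer exhausted */

        printf("%d ", p[best].buffered);
        p[best].have_buffered = false;  /* must re-buffer before next emit */
    }
    printf("\n");
    return 0;
}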

What I'm worried about is that we're implementing two separate systems
to do the same thing, and that the parallel sort approach is actually
a lot less general.  I think it's possible to imagine a Parallel Sort
implementation which does things Gather Merge can't.  If all of the
workers collaborate to sort all of the data rather than each worker
sorting its own data, then you've got something which Gather Merge
can't match.  But this is not that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-10-17 Thread Peter Geoghegan
On Mon, Oct 17, 2016 at 5:36 AM, Amit Kapila  wrote:
> Okay, but what is the proof, or why do you think the second is going
> to be better than the first?

I don't have proof. It's my opinion that it probably would be, based
on partial information, and my intuition. It's hard to prove something
like that, because it's not really clear what that alternative would
look like. Also, finding all of that out would take a long time --
it's hard to prototype. Do tuple table slots need to care about
IndexTuples now? What does that even look like? What existing executor
code needs to be taught about this new requirement?

> One thing which strikes me as a major difference
> between your approach and Gather Merge is that in your approach the
> leader has to wait till all the workers are done with their work on
> sorting, whereas with Gather Merge, as soon as the first one is done,
> the leader starts merging.  I could be wrong here, but if I understood
> it correctly, then there is an argument that a Gather Merge kind of
> approach can win in cases where some of the workers can produce sorted
> outputs ahead of others, and I am not sure if we can dismiss such
> cases.

How can it? You need to have at least one tuple from every worker
(before the worker has exhausted its supply of output tuples) in order
for the merge to return the next tuple to the top level consumer (the thing
above the Gather Merge). If you're talking about "eager vs. lazy
merging", please see my previous remarks on that, on this thread. (In
any case, whether we merge more eagerly seems like an orthogonal
question to the one you ask.)

The first thing to note about my approach is that I openly acknowledge
that this parallel CREATE INDEX patch is not much use for parallel
query. I have only generalized tuplesort.c to support parallelizing a
sort operation. I think that parallel query needs partitioning to push
down parts of a sort to workers, with little or no need for them to be
funneled together at the end, since most tuples are eliminated before
being passed to the Gather/Gather Merge node. The partitioning part is
really hard.

I guess that Gather Merge nodes have value because they allow us to
preserve the sorted-ness of a parallel path, which might be most
useful because it enables things elsewhere. But, that doesn't really
recommend making Gather Merge nodes good at batch processing a large
number of tuples, I suspect. (One problem with the tuple queue
mechanism is that it can be a big bottleneck -- that's why we want to
eliminate most tuples before they're passed up to the leader, in the
case of parallel sequential scan in 9.6.)

I read the following paragraph from the Volcano paper just now:

"""
During implementation and benchmarking of parallel sorting, we added
two more features to exchange. First, we wanted to implement a merge
network in which some processors produce sorted streams merge
concurrently by other processors. Volcano’s sort iterator can be used
to generate a sorted stream. A merge iterator was easily derived from
the sort module. It uses a single level merge, instead of the cascaded
merge of runs used in sort. The input of a merge iterator is an
exchange. Differently from other operators, the merge iterator
requires to distinguish the input records by their producer. As an
example, for a join operation it does not matter where the input
records were created, and all inputs can be accumulated in a single
input stream. For a merge operation, it is crucial to distinguish the
input records by their producer in order to merge multiple sorted
streams correctly.
"""

I don't really understand this paragraph, but thought I'd ask: why the
need to "distinguish the input records by their producer in order to
merge multiple sorted streams correctly"? Isn't that talking about
partitioning, where each worker's *ownership* of a range matters? My
patch doesn't care which values belong to which workers. And, it
focuses quite a lot on dealing well with the memory bandwidth bound,
I/O bound part of the sort where we write out the index itself, just
by piggy-backing on tuplesort.c. I don't think that that's useful for
a general-purpose executor node -- tuple-at-a-time processing when
fetching from workers would kill performance.

> By looking at 'workersFinished' usage, it looks like you have devised
> a new way for the leader to know when workers have finished, which
> might be required for this patch.  However, have you looked into
> whether existing infrastructure that serves the same purpose could be
> used for it?

Yes, I have. I think that Robert's "condition variables" patch would
offer a general solution to what I've devised.

What I have there is, as you say, fairly ad-hoc, even though my
requirements are actually fairly general. I was actually annoyed that
there wasn't an easier way to do that myself. Robert has said that he
won't commit his "condition variables" work until it's clear that
there will be some use for the facility. Well, I'd 

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-10-17 Thread Amit Kapila
On Thu, Oct 13, 2016 at 12:35 AM, Peter Geoghegan  wrote:
> On Wed, Oct 12, 2016 at 11:09 AM, Robert Haas  wrote:
>
>> On the flip side, what if anything can queries hope to get out of
>> parallel sort that they can't get out of Gather Merge?  One
>> possibility is that a parallel sort might end up being substantially
>> faster than Gather Merge-over-non-parallel sort.  In that case, we
>> obviously should prefer it.
>
> I must admit that I don't know enough about it to comment just yet.
> Offhand, it occurs to me that the Gather Merge sorted input could come
> from a number of different types of paths/nodes, whereas adopting what
> I've done here could only work more or less equivalently to "Gather
> Merge -> Sort -> Seq Scan" -- a special case, really.
>
>> For example, it's possible that you might want to have all
>> workers participate in sorting some data and then divide the result of
>> the sort into equal ranges that are again divided among the workers,
>> or that you might want all workers to sort and then each worker to
>> read a complete copy of the output data.  But these don't seem like
>> particularly mainstream needs, nor do they necessarily seem like
>> problems that parallel sort itself should be trying to solve.
>
> This project of mine is about parallelizing tuplesort.c, which isn't
> really what you want for parallel query -- you shouldn't try to scope
> the problem as "make the sort more scalable using parallelism" there.
> Rather, you want to scope it at "make the execution of the entire
> query more scalable using parallelism", which is really quite a
> different thing, which necessarily involves the executor having direct
> knowledge of partition boundaries.
>

Okay, but what is the proof, or why do you think the second is going to
be better than the first?  One thing which strikes me as a major
difference between your approach and Gather Merge is that in your
approach the leader has to wait till all the workers are done with
their work on sorting, whereas with Gather Merge, as soon as the first
one is done, the leader starts merging.  I could be wrong here, but if
I understood it correctly, then there is an argument that a Gather
Merge kind of approach can win in cases where some of the workers can
produce sorted outputs ahead of others, and I am not sure if we can
dismiss such cases.

+struct Sharedsort
+{
..
+ * Workers increment workersFinished to indicate having finished.  If
+ * this is equal to state.launched within the leader, leader is ready
+ * to merge runs.
+ *
+ * leaderDone indicates if leader is completely done (i.e., was
+ * tuplesort_end called against the state through which parallel output
+ * was consumed?)
+ */
+ int currentWorker;
+ int workersFinished;
..
}

By looking at 'workersFinished' usage, it looks like you have devised
a new way for the leader to know when workers have finished, which
might be required for this patch.  However, have you looked into
whether existing infrastructure that serves the same purpose could be
used for it?
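
For concreteness, the coordination those fields imply looks roughly
like the following standalone sketch (plain pthreads, purely for
illustration; the patch itself works in shared memory with PostgreSQL
primitives): workers bump a shared counter when their sort is done, and
the leader sleeps until the counter reaches the number of launched
workers.

/*
 * Standalone sketch (plain pthreads, purely for illustration) of the
 * coordination that workersFinished implies: each worker bumps a shared
 * counter when its sort is done, and the leader sleeps until the
 * counter reaches the number of launched workers.
 */
#include <pthread.h>
#include <stdio.h>

#define NWORKERS 4

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
static int workersFinished = 0;

static void *
worker(void *arg)
{
    (void) arg;
    /* ... sort this worker's share of the input here ... */
    pthread_mutex_lock(&lock);
    workersFinished++;
    pthread_cond_signal(&done);         /* wake the leader */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int
main(void)
{
    pthread_t tid[NWORKERS];

    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);

    pthread_mutex_lock(&lock);
    while (workersFinished < NWORKERS)  /* leader waits for all runs */
        pthread_cond_wait(&done, &lock);
    pthread_mutex_unlock(&lock);

    printf("all %d workers finished; leader can begin its merge\n", NWORKERS);

    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}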

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-10-12 Thread Peter Geoghegan
On Wed, Oct 12, 2016 at 11:09 AM, Robert Haas  wrote:
> I realize that you are primarily targeting utility commands here, and
> that is obviously great, because making index builds faster is very
> desirable. However, I'd just like to talk for a minute about how this
> relates to parallel query.  With Rushabh's Gather Merge patch, you can
> now have a plan that looks like Gather Merge -> Sort -> whatever.
> That patch also allows other patterns that are useful completely
> independently of this patch, like Finalize GroupAggregate -> Gather
> Merge -> Partial GroupAggregate -> Sort -> whatever, but the Gather
> Merge -> Sort -> whatever path is very related to what this patch
> does.  For example, instead of committing this patch at all, we could
> try to funnel index creation through the executor, building a plan of
> that shape, and using the results to populate the index.  I'm not
> saying that's a good idea, but it could be done.

Right, but that would be essentially the same approach as mine, but, I
suspect, less efficient and more complicated. More importantly, it
wouldn't be partitioning, and partitioning is what we really need
within the executor.

> On the flip side, what if anything can queries hope to get out of
> parallel sort that they can't get out of Gather Merge?  One
> possibility is that a parallel sort might end up being substantially
> faster than Gather Merge-over-non-parallel sort.  In that case, we
> obviously should prefer it.

I must admit that I don't know enough about it to comment just yet.
Offhand, it occurs to me that the Gather Merge sorted input could come
from a number of different types of paths/nodes, whereas adopting what
I've done here could only work more or less equivalently to "Gather
Merge -> Sort -> Seq Scan" -- a special case, really.

> For example, it's possible that you might want to have all
> workers participate in sorting some data and then divide the result of
> the sort into equal ranges that are again divided among the workers,
> or that you might want all workers to sort and then each worker to
> read a complete copy of the output data.  But these don't seem like
> particularly mainstream needs, nor do they necessarily seem like
> problems that parallel sort itself should be trying to solve.

This project of mine is about parallelizing tuplesort.c, which isn't
really what you want for parallel query -- you shouldn't try to scope
the problem as "make the sort more scalable using parallelism" there.
Rather, you want to scope it at "make the execution of the entire
query more scalable using parallelism", which is really quite a
different thing, which necessarily involves the executor having direct
knowledge of partition boundaries. Maybe the executor enlists
tuplesort.c to help with those boundaries to some degree, but that
whole thing is basically something which treats tuplesort.c as a low
level primitive.

> The
> Volcano paper[1], one of the oldest and most-cited sources I can find
> for research into parallel execution and with a design fairly similar
> to our own executor, describes various variants of what they call
> Exchange, of which what we now call Gather is one.

I greatly respect the work of Goetz Graefe, including his work on the
Volcano paper. Graefe has been the single biggest external influence on
my work on Postgres.

> They describe
> another variant called Interchange, which acts like a Gather node
> without terminating parallelism: every worker process reads the
> complete output of an Interchange, which is the union of all rows
> produced by all workers running the Interchange's input plan.  That
> seems like a better design than coupling such data flows specifically
> to parallel sort.
>
> I'd like to think that parallel sort will help lots of queries, as
> well as helping utility commands, but I'm not sure it will.  Thoughts?

You are right that I'm targeting the cases where we can get real
benefits without really changing the tuplesort.h contract too much.
This is literally the parallel tuplesort.c patch, which probably isn't
very useful for parallel query, because the final output is always
consumed serially here (this doesn't matter all that much for CREATE
INDEX, I believe). This approach of mine seems like the simplest way
of getting a large benefit to users involving parallelizing sorting,
but I certainly don't imagine it to be the be all and end all.

I have at least tried to anticipate how tuplesort.c will eventually
serve the needs of partitioning for the benefit of parallel query. My
intuition is that you'll have to teach it about partitioning
boundaries fairly directly -- it won't do to add something generic to
the executor. And, it probably won't be the only thing that needs to
be taught about them.

-- 
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-10-12 Thread Robert Haas
On Fri, Oct 7, 2016 at 5:47 PM, Peter Geoghegan  wrote:
> Work is still needed on:
>
> * Cost model. Should probably attempt to guess final index size, and
> derive calculation of number of workers from that. Also, I'm concerned
> that I haven't given enough thought to the low end, where with default
> settings most CREATE INDEX statements will use at least one parallel
> worker.
>
> * The whole way that I teach nbtsort.c to disallow catalog tables for
> parallel CREATE INDEX due to concerns about parallel safety is in need
> of expert review, preferably from Robert. It's complicated in a way
> that relies on things happening or not happening from a distance.
>
> * Heikki seems to want to change more about logtape.c, and its use of
> indirection blocks. That may evolve, but for now I can only target the
> master branch.
>
> * More extensive performance testing. I think that this V3 is probably
> the fastest version yet, what with Heikki's improvements, but I
> haven't really verified that.

I realize that you are primarily targeting utility commands here, and
that is obviously great, because making index builds faster is very
desirable. However, I'd just like to talk for a minute about how this
relates to parallel query.  With Rushabh's Gather Merge patch, you can
now have a plan that looks like Gather Merge -> Sort -> whatever.
That patch also allows other patterns that are useful completely
independently of this patch, like Finalize GroupAggregate -> Gather
Merge -> Partial GroupAggregate -> Sort -> whatever, but the Gather
Merge -> Sort -> whatever path is very related to what this patch
does.  For example, instead of committing this patch at all, we could
try to funnel index creation through the executor, building a plan of
that shape, and using the results to populate the index.  I'm not
saying that's a good idea, but it could be done.

On the flip side, what if anything can queries hope to get out of
parallel sort that they can't get out of Gather Merge?  One
possibility is that a parallel sort might end up being substantially
faster than Gather Merge-over-non-parallel sort.  In that case, we
obviously should prefer it.  Other possibilities seem a little
obscure.  For example, it's possible that you might want to have all
workers participate in sorting some data and then divide the result of
the sort into equal ranges that are again divided among the workers,
or that you might want all workers to sort and then each worker to
read a complete copy of the output data.  But these don't seem like
particularly mainstream needs, nor do they necessarily seem like
problems that parallel sort itself should be trying to solve.  The
Volcano paper[1], one of the oldest and most-cited sources I can find
for research into parallel execution and with a design fairly similar
to our own executor, describes various variants of what they call
Exchange, of which what we now call Gather is one.  They describe
another variant called Interchange, which acts like a Gather node
without terminating parallelism: every worker process reads the
complete output of an Interchange, which is the union of all rows
produced by all workers running the Interchange's input plan.  That
seems like a better design than coupling such data flows specifically
to parallel sort.

I'd like to think that parallel sort will help lots of queries, as
well as helping utility commands, but I'm not sure it will.  Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] "Volcano - an Extensible and Parallel Query Evaluation System",
https://pdfs.semanticscholar.org/865b/5f228f08ebac0b68d3a4bfd97929ee85e4b6.pdf
[2] See "C. Variants of the Exchange Operator" on p. 13 of [1]




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-10-07 Thread Peter Geoghegan
On Sun, Sep 11, 2016 at 11:05 AM, Peter Geoghegan  wrote:
> So, while there are still a few loose ends with this revision (it
> should still certainly be considered WIP), I wanted to get a revision
> out quickly because V1 has been left to bitrot for too long now, and
> my schedule is very full for the next week, ahead of my leaving to go
> on vacation (which is long overdue). Hopefully, I'll be able to get
> out a third revision next Saturday, on top of the
> by-then-presumably-committed new tape batch memory patch from Heikki,
> just before I leave. I'd rather leave with a patch available that can
> be cleanly applied, to make review as easy as possible, since it
> wouldn't be great to have this V2 with bitrot for 10 days or more.

Heikki committed his preload memory patch a little later than
originally expected, 4 days ago. I attach V3 of my own parallel CREATE
INDEX patch, which should be applied on top of today's git master
(there is a bugfix that reviewers won't want to miss -- commit
b56fb691). I have my own test suite, and have to some extent used TDD
for this patch, so rebasing was not so bad. My tests are rather rough
and ready, so I'm not going to post them here. (Changes in the
WaitLatch() API also caused bitrot, which is now fixed.)

Changes from V2:

* Since Heikki eliminated the need for any extra memtuple "slots"
(memtuples is now only exactly big enough for the initial merge heap),
an awful lot of code could be thrown out that managed sizing memtuples
in the context of the parallel leader (based on trends seen in
parallel workers). I was able to follow Heikki's example by
eliminating code for parallel sorting memtuples sizing. Throwing this
code out let me streamline a lot of stuff within tuplesort.c, which is
cleaned up quite a bit.

* Since this revision was mostly focused on fixing up logtape.c
(rebasing on top of Heikki's work), I also took the time to clarify
some things about how a block-based offset might need to be applied
within the leader. Essentially, outlining how and where that happens,
and where it doesn't and shouldn't happen. (An offset must sometimes
be applied to compensate for differences in logical BufFile positioning
(leader/worker differences) following the leader's unification of worker
tapesets into one big tapeset of its own. A toy illustration of that
offset appears at the end of this list of changes.)

* max_parallel_workers_maintenance now supersedes the use of the new
parallel_workers index storage parameter. This matches existing heap
storage parameter behavior, and allows the implementation to add
practically no cycles as compared to master branch when the use of
parallelism is disabled by setting max_parallel_workers_maintenance to
0.

* New additions to the chapter in the documentation that Robert added
a little while back, "Chapter 15. Parallel Query". It's perhaps a bit
of a stretch to call this feature part of parallel query, but I think
that it works reasonably well. The optimizer does determine the number
of workers needed here, so while it doesn't formally produce a query
plan, I think the implication that it does is acceptable for
user-facing documentation. (Actually, it would be nice if you really
could EXPLAIN utility commands -- that would be a handy place to show
information about how they were executed.)

Maybe this new documentation describes things in what some would
consider to be excessive detail for users. The relatively detailed
information added on parallel sorting seemed to be in the pragmatic
spirit of the new chapter 15, so I thought I'd see what people
thought.
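
To make the block-based offset idea concrete, here is a toy, standalone
illustration (hypothetical names, not the patch's actual code): the leader
views each worker's temp file as occupying a distinct range of one big
logical block space, so a worker-local block number is shifted by that
worker's starting offset before the leader uses it.

/*
 * Toy sketch only: map a worker-local block number into the leader's
 * unified block space.  worker_block_offset[] would be derived from the
 * sizes of the preceding workers' BufFiles.
 */
static long
leader_block_number(const long *worker_block_offset, int worker,
                    long local_block)
{
    return worker_block_offset[worker] + local_block;
}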

Work is still needed on:

* Cost model. Should probably attempt to guess the final index size, and
derive the number of workers from that (a toy sketch of one possible
heuristic follows this list). Also, I'm concerned that I haven't given
enough thought to the low end, where with default settings most CREATE
INDEX statements will use at least one parallel worker.

* The whole way that I teach nbtsort.c to disallow catalog tables for
parallel CREATE INDEX due to concerns about parallel safety is in need
of expert review, preferably from Robert. It's complicated in a way
that relies on things happening or not happening from a distance.

* Heikki seems to want to change more about logtape.c, and its use of
indirection blocks. That may evolve, but for now I can only target the
master branch.

* More extensive performance testing. I think that this V3 is probably
the fastest version yet, what with Heikki's improvements, but I
haven't really verified that.
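
As a strawman for the cost model item above, a heuristic in the spirit of
the existing parallel scan sizing might look something like this (purely
illustrative, with made-up names and thresholds, not what the patch does):

/*
 * Toy sketch: add one worker per "size step", where each additional
 * worker requires the estimated index to be several times larger, and
 * cap the result at max_parallel_workers_maintenance.
 */
static int
guess_index_build_workers(double estimated_index_bytes, int max_workers)
{
    double  threshold = 64.0 * 1024.0 * 1024.0; /* first worker at ~64MB */
    int     nworkers = 0;

    while (estimated_index_bytes >= threshold && nworkers < max_workers)
    {
        nworkers++;
        threshold *= 3.0;   /* each extra worker must "pay" 3x more */
    }
    return nworkers;
}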

-- 
Peter Geoghegan


0001-Cap-the-number-of-tapes-used-by-external-sorts.patch.gz
Description: GNU Zip compressed data


0002-Add-parallel-B-tree-index-build-sorting.patch.gz
Description: GNU Zip compressed data


0003-Add-force_btree_randomaccess-GUC-for-testing.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-27 Thread Robert Haas
On Mon, Sep 26, 2016 at 3:40 PM, Peter Geoghegan  wrote:
> On Mon, Sep 26, 2016 at 6:58 PM, Robert Haas  wrote:
>>> That requires some kind of mutual exclusion mechanism, like an LWLock.
>>
>> No, it doesn't.  Shared memory queues are single-reader, single-writer.
>
> The point is that there is a natural dependency when merging is
> performed eagerly within the leader. One thing needs to be in lockstep
> with the others. That's all.

I don't know what any of that means.  You said we need something like
an LWLock, but I think we don't.  The workers just write the results
of their own final merges into shm_mqs.  The leader can read from any
given shm_mq until no more tuples can be read without blocking, just
like nodeGather.c does, or at least it can do that unless its own
queue fills up first.  No mutual exclusion mechanism is required for
any of that, as far as I can see - not an LWLock, and not anything
similar.
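
To see why no lock is needed for a single-reader, single-writer queue,
here is a minimal standalone model (this is not shm_mq itself, just the
idea): the producer only ever advances the tail, the consumer only ever
advances the head, and one acquire/release pair per side is enough to
hand data across safely.

#include <stdatomic.h>
#include <stdbool.h>

#define QUEUE_CAPACITY 256      /* power of two, so wrapping counters index safely */

typedef struct TupleQueue
{
    _Atomic unsigned    head;   /* advanced only by the consumer */
    _Atomic unsigned    tail;   /* advanced only by the producer */
    int                 items[QUEUE_CAPACITY];
} TupleQueue;

/* Producer (worker) side: returns false if the queue is currently full. */
static bool
queue_try_push(TupleQueue *q, int item)
{
    unsigned    tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned    head = atomic_load_explicit(&q->head, memory_order_acquire);

    if (tail - head == QUEUE_CAPACITY)
        return false;
    q->items[tail % QUEUE_CAPACITY] = item;
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer (leader) side: returns false if nothing is available yet. */
static bool
queue_try_pop(TupleQueue *q, int *item)
{
    unsigned    head = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned    tail = atomic_load_explicit(&q->tail, memory_order_acquire);

    if (head == tail)
        return false;           /* "would block", like an empty shm_mq */
    *item = q->items[head % QUEUE_CAPACITY];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}

In the actual design the queues would be shm_mq objects drained in a
nodeGather.c-like round-robin, but the shape of the argument is the same:
each queue has exactly one writer (a worker) and one reader (the leader),
so no LWLock is ever taken.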

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-26 Thread Peter Geoghegan
On Mon, Sep 26, 2016 at 6:58 PM, Robert Haas  wrote:
>> That requires some kind of mutual exclusion mechanism, like an LWLock.
>
> No, it doesn't.  Shared memory queues are single-reader, single-writer.

The point is that there is a natural dependency when merging is
performed eagerly within the leader. One thing needs to be in lockstep
with the others. That's all.


-- 
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-26 Thread Robert Haas
On Sat, Sep 24, 2016 at 9:07 AM, Peter Geoghegan  wrote:
> On Thu, Sep 22, 2016 at 8:57 PM, Robert Haas  wrote:
>> On Thu, Sep 22, 2016 at 3:51 PM, Heikki Linnakangas  wrote:
>>> It'd be good if you could overlap the final merges in the workers with the
>>> merge in the leader. ISTM it would be quite straightforward to replace the
>>> final tape of each worker with a shared memory queue, so that the leader
>>> could start merging and returning tuples as soon as it gets the first tuple
>>> from each worker. Instead of having to wait for all the workers to complete
>>> first.
>>
>> If you do that, make sure to have the leader read multiple tuples at a
>> time from each worker whenever possible.  It makes a huge difference
>> to performance.  See bc7fcab5e36b9597857fa7e3fa6d9ba54aaea167.
>
> That requires some kind of mutual exclusion mechanism, like an LWLock.

No, it doesn't.  Shared memory queues are single-reader, single-writer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-24 Thread Peter Geoghegan
On Thu, Sep 22, 2016 at 8:57 PM, Robert Haas  wrote:
> On Thu, Sep 22, 2016 at 3:51 PM, Heikki Linnakangas  wrote:
>> It'd be good if you could overlap the final merges in the workers with the
>> merge in the leader. ISTM it would be quite straightforward to replace the
>> final tape of each worker with a shared memory queue, so that the leader
>> could start merging and returning tuples as soon as it gets the first tuple
>> from each worker. Instead of having to wait for all the workers to complete
>> first.
>
> If you do that, make sure to have the leader read multiple tuples at a
> time from each worker whenever possible.  It makes a huge difference
> to performance.  See bc7fcab5e36b9597857fa7e3fa6d9ba54aaea167.

That requires some kind of mutual exclusion mechanism, like an LWLock.
It's possible that merging everything lazily is actually the faster
approach, given this, and given the likely bottleneck on I/O at this
stage. It's also certainly simpler to not overlap things. This is
something I've read about before [1], with "eager evaluation" sorting
not necessarily coming out ahead IIRC.

[1] 
http://digitalcommons.ohsu.edu/cgi/viewcontent.cgi?article=1193=csetech
-- 
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-24 Thread Peter Geoghegan
On Wed, Sep 21, 2016 at 5:52 PM, Heikki Linnakangas  wrote:
> I find this unification business really complicated.

I can certainly understand why you would. As I said, it's the most
complicated part of the patch, which overall is one of the most
ambitious patches I've ever written.

> I think it'd be simpler
> to keep the BufFiles and LogicalTapeSets separate, and instead teach
> tuplesort.c how to merge tapes that live on different
> LogicalTapeSets/BufFiles. Or refactor LogicalTapeSet so that a single
> LogicalTapeSet can contain tapes from different underlying BufFiles.
>
> What I have in mind is something like the attached patch. It refactors
> LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape
> as argument, instead of LogicalTapeSet and tape number. LogicalTapeSet
> doesn't have the concept of a tape number anymore, it can contain any number
> of tapes, and you can create more on the fly. With that, it'd be fairly easy
> to make tuplesort.c merge LogicalTapes that came from different tape sets,
> backed by different BufFiles. I think that'd avoid much of the unification
> code.

I think that it won't be possible to make a LogicalTapeSet ever use
more than one BufFile without regressing the ability to eagerly reuse
space, which is almost the entire reason for logtape.c existing. The
whole indirect block thing is an idea borrowed from the FS world, of
course, and so logtape.c needs one block-device-like BufFile, with
blocks that can be reclaimed eagerly, but consumed for recycling in
*contiguous* order (which is why they're sorted using qsort() within
ltsGetFreeBlock()). You're going to increase the amount of random I/O
by using more than one BufFile for an entire tapeset, I think.
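
As a rough standalone model of the free-space behaviour being described
(not logtape.c itself): freed blocks go into a pool, the pool is sorted
before it is consumed, and the lowest-numbered free block is always handed
out next, so recycled space is reused in contiguous order; only when the
pool is empty does the underlying file grow.

#include <stdlib.h>

typedef struct FreeBlockPool
{
    long   *blocks;         /* recycled block numbers awaiting reuse */
    int     nblocks;        /* number of valid entries in blocks[] */
    long    nfileblocks;    /* current size of the underlying file, in blocks */
} FreeBlockPool;

static int
freeblock_cmp_desc(const void *a, const void *b)
{
    long    ba = *(const long *) a;
    long    bb = *(const long *) b;

    return (ba < bb) - (ba > bb);   /* descending: lowest block number last */
}

/* Return a block to the pool for later reuse (blocks[] assumed big enough). */
static void
release_block(FreeBlockPool *pool, long blocknum)
{
    pool->blocks[pool->nblocks++] = blocknum;
}

/* Hand out the lowest-numbered free block, or extend the file if none left. */
static long
get_free_block(FreeBlockPool *pool)
{
    if (pool->nblocks > 0)
    {
        /* Sketch only: re-sorts every call; real code would sort lazily. */
        qsort(pool->blocks, pool->nblocks, sizeof(long), freeblock_cmp_desc);
        return pool->blocks[--pool->nblocks];
    }
    return pool->nfileblocks++;
}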

This patch you posted
("0001-Refactor-LogicalTapeSet-LogicalTape-interface.patch") just
keeps one BufFile, and only changes the interface to expose the tapes
themselves to tuplesort.c, without actually making tuplesort.c do
anything with that capability. I see what you're getting at, I think,
but I don't see how that accomplishes all that much for parallel
CREATE INDEX. I mean, the special case of having multiple tapesets
from workers (not one "unified" tapeset created from worker temp files
from their tapesets to begin with) now needs special treatment.
Haven't you just moved the complexity around (once your patch is made
to care about parallelism)? Having multiple entire tapesets explicitly
from workers, with their own BufFiles, is not clearly less complicated
than managing ranges from BufFile fd.c files with delineated ranges of
"logical tapeset space". Seems almost equivalent, except that my way
doesn't bother tuplesort.c with any of this.

>> +* As a consequence of only being permitted to write to the leader
>> +* controlled range, parallel sorts that require a final materialized tape
>> +* will use approximately twice the disk space for temp files compared to
>> +* a more or less equivalent serial sort.

> I'm slightly worried about that. Maybe it's OK for a first version, but it'd
> be annoying in a query where a sort is below a merge join, for example, so
> that you can't do the final merge on the fly because mark/restore support is
> needed.

My intuition is that we'll *never* end up using this for merge joins.
I think that I could do better here (why should workers really care at
this point?), but just haven't bothered to.

This parallel sort implementation is something written with CREATE
INDEX and CLUSTER in mind only (maybe one or two other things, too). I
believe that for query execution, partitioning is the future [1]. With
merge joins, partitioning is desirable because it lets you push down
*everything* to workers, not just sorting (e.g., by aligning
partitioning boundaries on each side of each merge join sort in the
worker, and having the worker also "synchronize" each side of the
join, all independently and without a dependency on a final merge).

That's why I think it's okay that I use twice as much space for
randomAccess tuplesort.c callers. No real world caller will ever end
up needing to do this. It just seems like a good idea to support
randomAccess when using this new infrastructure, on general principle.
Forcing myself to support that case during initial development
actually resulted in much cleaner, less invasive changes to
tuplesort.c in general.

[1] 
https://www.postgresql.org/message-id/flat/CAM3SWZR+ATYAzyMT+hm-Bo=1l1smtjbndtibwbtktyqs0dy...@mail.gmail.com#CAM3SWZR+ATYAzyMT+hm-Bo=1l1smtjbndtibwbtktyqs0dy...@mail.gmail.com
-- 
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-22 Thread Robert Haas
On Thu, Sep 22, 2016 at 3:51 PM, Heikki Linnakangas  wrote:
> It'd be good if you could overlap the final merges in the workers with the
> merge in the leader. ISTM it would be quite straightforward to replace the
> final tape of each worker with a shared memory queue, so that the leader
> could start merging and returning tuples as soon as it gets the first tuple
> from each worker. Instead of having to wait for all the workers to complete
> first.

If you do that, make sure to have the leader read multiple tuples at a
time from each worker whenever possible.  It makes a huge difference
to performance.  See bc7fcab5e36b9597857fa7e3fa6d9ba54aaea167.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-22 Thread Heikki Linnakangas

On 08/02/2016 01:18 AM, Peter Geoghegan wrote:

No merging in parallel
--

Currently, merging worker *output* runs may only occur in the leader
process. In other words, we always keep n worker processes busy with
scanning-and-sorting (and maybe some merging), but then all processes
but the leader process grind to a halt (note that the leader process
can participate as a scan-and-sort tuplesort worker, just as it will
everywhere else, which is why I specified "parallel_workers = 7" but
talked about 8 workers).

One leader process is kept busy with merging these n output runs on
the fly, so things will bottleneck on that, which you saw in the
example above. As already described, workers will sometimes merge in
parallel, but only their own runs -- never another worker's runs. I
did attempt to address the leader merge bottleneck by implementing
cross-worker run merging in workers. I got as far as implementing a
very rough version of this, but initial results were disappointing,
and so that was not pursued further than the experimentation stage.

Parallel merging is a possible future improvement that could be added
to what I've come up with, but I don't think that it will move the
needle in a really noticeable way.


It'd be good if you could overlap the final merges in the workers with 
the merge in the leader. ISTM it would be quite straightforward to 
replace the final tape of each worker with a shared memory queue, so 
that the leader could start merging and returning tuples as soon as it 
gets the first tuple from each worker. Instead of having to wait for all 
the workers to complete first.


- Heikki





Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-21 Thread Heikki Linnakangas

On 08/02/2016 01:18 AM, Peter Geoghegan wrote:

Tape unification


Sort operations have a unique identifier, generated before any workers
are launched, using a scheme based on the leader's PID, and a unique
temp file number. This makes all on-disk state (temp files managed by
logtape.c) discoverable by the leader process. State in shared memory
is sized in proportion to the number of workers, so the only thing
about the data being sorted that gets passed around in shared memory
is a little logtape.c metadata for tapes, describing for example how
large each constituent BufFile is (a BufFile associated with one
particular worker's tapeset).

(See below also for notes on buffile.c's role in all of this, fd.c and
resource management, etc.)
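
(To make the discoverability point above concrete: the scheme amounts to
deriving each worker temp file's name deterministically from the leader's
PID plus a per-sort counter, so the leader can recompute the names rather
than passing paths through shared memory. A purely illustrative sketch,
with made-up naming, not the actual fd.c/buffile.c scheme:

#include <stdio.h>

/*
 * Illustrative only: any backend that knows (leader_pid, sort_ident,
 * worker_number) can recompute the same temp file name.
 */
static void
shared_temp_file_name(char *buf, size_t buflen,
                      int leader_pid, long sort_ident, int worker_number)
{
    snprintf(buf, buflen, "pgsql_tmp.%d.%ld.%d",
             leader_pid, sort_ident, worker_number);
}
)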


> ...


buffile.c, and "unification"


There has been significant new infrastructure added to make logtape.c
aware of workers. buffile.c has in turn been taught about unification
as a first class part of the abstraction, with low-level management of
certain details occurring within fd.c. So, "tape unification" within
processes to open other backend's logical tapes to generate a unified
logical tapeset for the leader to merge is added. This is probably the
single biggest source of complexity for the patch, since I must
consider:

* Creating a general, reusable abstraction for other possible BufFile
users (logtape.c only has to serve tuplesort.c, though).

* Logical tape free space management.

* Resource management, file lifetime, etc. fd.c resource management
can now close a file at xact end for temp files, while not deleting it
in the leader backend (only the "owning" worker backend deletes the
temp file it owns).

* Crash safety (e.g., when to truncate existing temp files, and when not to).


I find this unification business really complicated. I think it'd be 
simpler to keep the BufFiles and LogicalTapeSets separate, and instead 
teach tuplesort.c how to merge tapes that live on different 
LogicalTapeSets/BufFiles. Or refactor LogicalTapeSet so that a single 
LogicalTapeSet can contain tapes from different underlying BufFiles.


What I have in mind is something like the attached patch. It refactors 
LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a 
LogicalTape as argument, instead of LogicalTapeSet and tape number. 
LogicalTapeSet doesn't have the concept of a tape number anymore, it can 
contain any number of tapes, and you can create more on the fly. With 
that, it'd be fairly easy to make tuplesort.c merge LogicalTapes that 
came from different tape sets, backed by different BufFiles. I think 
that'd avoid much of the unification code.


That leaves one problem, though: reusing space in the final merge phase. 
If the tapes being merged belong to different LogicalTapeSets, and you 
create one new tape to hold the result, the new tape cannot easily reuse 
the space of the input tapes because they are on different tape sets. 
But looking at your patch, ISTM you actually dodged that problem as well:



+* As a consequence of only being permitted to write to the leader
+* controlled range, parallel sorts that require a final materialized tape
+* will use approximately twice the disk space for temp files compared to
+* a more or less equivalent serial sort.  This is deemed acceptable,
+* since it is far rarer in practice for parallel sort operations to
+* require a final materialized output tape.  Note that this does not
+* apply to any merge process required by workers, which may reuse space
+* eagerly, just like conventional serial external sorts, and so
+* typically, parallel sorts consume approximately the same amount of disk
+* blocks as a more or less equivalent serial sort, even when workers must
+* perform some merging to produce input to the leader.


I'm slightly worried about that. Maybe it's OK for a first version, but 
it'd be annoying in a query where a sort is below a merge join, for 
example, so that you can't do the final merge on the fly because 
mark/restore support is needed.


One way to fix that would be to have all the parallel workers share the 
work files to begin with, and keep the "nFileBlocks" value in shared 
memory so that the workers won't overlap each other. Then all the blocks 
from different workers would be mixed together, though, which would hurt 
the sequential pattern of the tapes, so each worker would need to allocate 
larger chunks to avoid that.
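
A minimal standalone sketch of that alternative (hypothetical, just to
show the shape of it): the shared block counter is bumped atomically, and
each worker claims a whole chunk of blocks at a time so its writes stay
mostly sequential despite the interleaving.

#include <stdatomic.h>

#define CHUNK_BLOCKS 64     /* bigger chunks => more sequential I/O per worker */

/*
 * Illustrative only: nfileblocks would live in shared memory; the return
 * value is the first block of a CHUNK_BLOCKS range now owned by the caller.
 */
static long
allocate_block_chunk(_Atomic long *nfileblocks)
{
    return atomic_fetch_add(nfileblocks, CHUNK_BLOCKS);
}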


- Heikki

>From a1aa45c22cd13a2059880154e30f48d884a849ef Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 21 Sep 2016 19:31:33 +0300
Subject: [PATCH 1/1] Refactor LogicalTapeSet/LogicalTape interface.

A LogicalTape is now visible to callers, not just an internal object
within logtape.c. All the tape functions, like LogicalTapeRead and
LogicalTapeWrite, take a LogicalTape as argument, instead of a
LogicalTapeSet and tape number.

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-13 Thread Robert Haas
On Sun, Sep 11, 2016 at 2:05 PM, Peter Geoghegan  wrote:
> On Sun, Sep 11, 2016 at 6:28 AM, Heikki Linnakangas  wrote:
>> Pushed this "displace root" patch, with some changes:
>
> Attached is rebased version of the entire patch series, which should
> be applied on top of what you pushed to the master branch today.

0003 looks like a sensible cleanup of our #include structure
regardless of anything this patch series is trying to accomplish, so
I've committed it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-11 Thread Peter Geoghegan
On Sun, Sep 11, 2016 at 6:28 AM, Heikki Linnakangas  wrote:
> Pushed this "displace root" patch, with some changes:

Attached is rebased version of the entire patch series, which should
be applied on top of what you pushed to the master branch today.

This features a new scheme for managing workMem --
maintenance_work_mem is now treated as a high watermark/budget for the
entire CREATE INDEX operation, regardless of the number of workers.
This seems to work much better, so Robert was right to suggest it.
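
Concretely, the budgeting is just an even split of the operation-wide
maintenance_work_mem budget across the participants, along these lines
(a simplified sketch with illustrative names, not the patch's code):

/*
 * Simplified sketch: divide the operation-wide budget (in KB) evenly
 * among the leader and workers, but never hand a participant less than
 * some minimal useful amount.
 */
static int
per_participant_sort_mem(int maintenance_work_mem_kb, int nparticipants)
{
    int     share = maintenance_work_mem_kb / nparticipants;
    int     min_useful_kb = 1024;       /* 1MB floor, purely illustrative */

    return (share > min_useful_kb) ? share : min_useful_kb;
}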

There were also improvements to the cost model, to weigh available
maintenance_work_mem under this new system. And, the cost model was
moved inside planner.c (next to plan_cluster_use_sort()), which is
really where it belongs. The cost model is still WIP, though, and I
didn't address some concerns of my own about how tuplesort.c
coordinates workers. I think that Robert's "condition variables" will
end up superseding that stuff anyway. And, I think that this v2 will
bitrot fairly soon, when Heikki commits what is in effect his version
of my 0002-* patch (that's unchanged, if only because it refactors
some things that the parallel CREATE INDEX patch is reliant on).

So, while there are still a few loose ends with this revision (it
should still certainly be considered WIP), I wanted to get a revision
out quickly because V1 has been left to bitrot for too long now, and
my schedule is very full for the next week, ahead of my leaving to go
on vacation (which is long overdue). Hopefully, I'll be able to get
out a third revision next Saturday, on top of the
by-then-presumably-committed new tape batch memory patch from Heikki,
just before I leave. I'd rather leave with a patch available that can
be cleanly applied, to make review as easy as possible, since it
wouldn't be great to have this V2 with bitrot for 10 days or more.

-- 
Peter Geoghegan


0003-Rearrange-header-file-include-directives.patch.gz
Description: GNU Zip compressed data


0005-Add-force_btree_randomaccess-GUC-for-testing.patch.gz
Description: GNU Zip compressed data


0001-Cap-the-number-of-tapes-used-by-external-sorts.patch.gz
Description: GNU Zip compressed data


0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch.gz
Description: GNU Zip compressed data


0004-Add-parallel-B-tree-index-build-sorting.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-11 Thread Peter Geoghegan
On Sun, Sep 11, 2016 at 6:28 AM, Heikki Linnakangas  wrote:
> * I renamed "tuplesort_heap_siftup()" to "tuplesort_delete_top()". I realize
> that this is controversial, per the discussion on the "Is
> tuplesort_heap_siftup() a misnomer?" thread. However, now that we have a new
> function, "tuplesort_heap_replace_top()", which is exactly the same
> algorithm as the "delete_top()" algorithm, calling one of them "siftup"
> became just too confusing.

I feel pretty strongly that this was the correct decision. I would
have gone further, and removed any mention of "Sift up", but you can't
win them all.

> * Instead of "root_displace", I used the name "replace_top", and
> "delete_top" for the old siftup function. Because we use "top" to refer to
> memtuples[0] more commonly than "root", in the existing comments.

Fine by me.

> * I shared the code between the delete-top and replace-top. Delete-top now
> calls the replace-top function, with the last element of the heap. Both
> functions have the same signature, i.e. they both take the checkIndex
> argument. Peter's patch left that out for the "replace" function, on
> performance grounds, but if that's worthwhile, that seems like a separate
> optimization. Might be worth benchmarking that separately, but I didn't want
> to conflate that with this patch.

Okay.

> * I replaced a few more siftup+insert calls with the new combined
> replace-top operation. Because why not.

I suppose that the consistency has value, from a code clarity standpoint.

> Thanks for the patch, Peter, and thanks for the review, Claudio!

Thanks Heikki!

-- 
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-11 Thread Heikki Linnakangas

On 09/10/2016 03:22 AM, Claudio Freire wrote:

Overall, however, I believe the patch is in good shape. Only minor
form issues need to be changed, the functionality seems both desirable
and ready.


Pushed this "displace root" patch, with some changes:

* I renamed "tuplesort_heap_siftup()" to "tuplesort_delete_top()". I 
realize that this is controversial, per the discussion on the "Is 
tuplesort_heap_siftup() a misnomer?" thread. However, now that we have a 
new function, "tuplesort_heap_replace_top()", which is exactly the same 
algorithm as the "delete_top()" algorithm, calling one of them "siftup" 
became just too confusing. If anything, the new "replace_top" 
corresponds more closely to Knuth's siftup algorithm; delete-top is a 
special case of it. I added a comment on that to replace_top. I hope 
everyone can live with this. (A toy sketch of the two operations follows, 
after this list of changes.)


* Instead of "root_displace", I used the name "replace_top", and 
"delete_top" for the old siftup function. Because we use "top" to refer 
to memtuples[0] more commonly than "root", in the existing comments.


* I shared the code between the delete-top and replace-top. Delete-top 
now calls the replace-top function, with the last element of the heap. 
Both functions have the same signature, i.e. they both take the 
checkIndex argument. Peter's patch left that out for the "replace" 
function, on performance grounds, but if that's worthwhile, that seems 
like a separate optimization. Might be worth benchmarking that 
separately, but I didn't want to conflate that with this patch.


* I replaced a few more siftup+insert calls with the new combined 
replace-top operation. Because why not.
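
For readers following along, the relationship between the two operations 
boils down to this kind of shape (a toy min-heap of ints, not the actual 
tuplesort.c code): replace-top sifts the incoming value down from the root 
in one pass, and delete-top is just replace-top called with the heap's own 
last element.

#include <stddef.h>

/* Replace the top of a binary min-heap with newval and restore heap order. */
static void
heap_replace_top(int *heap, size_t n, int newval)
{
    size_t  i = 0;

    for (;;)
    {
        size_t  child = 2 * i + 1;

        if (child >= n)
            break;
        if (child + 1 < n && heap[child + 1] < heap[child])
            child++;                    /* pick the smaller child */
        if (newval <= heap[child])
            break;
        heap[i] = heap[child];          /* pull the child up */
        i = child;
    }
    heap[i] = newval;
}

/* Delete the top: replace it with the last element and shrink the heap. */
static size_t
heap_delete_top(int *heap, size_t n)
{
    if (n > 1)
        heap_replace_top(heap, n - 1, heap[n - 1]);
    return n - 1;
}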


Thanks for the patch, Peter, and thanks for the review, Claudio!

- Heikki





Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-10 Thread Peter Geoghegan
On Fri, Sep 9, 2016 at 5:22 PM, Claudio Freire  wrote:
> Since it is true that doing so would make it impossible to keep the
> asserts about tupindex in tuplesort_heap_root_displace, I guess it
> depends on how useful those asserts are (ie: how likely it is that
> those conditions could be violated, and how damaging it could be if
> they were). If it is decided the refactor is desirable, I'd suggest
> making the common siftup procedure static inline, to allow
> tuplesort_heap_root_displace to inline and specialize it, since it
> will be called with checkIndex=False and that simplifies the resulting
> code considerably.

Right. I want to keep it as a separate function for all these reasons.
I also think that I'll end up further optimizing what I've called
tuplesort_heap_root_displace in the future, to adapt to clustered
input. I'm thinking of something like Timsort's "galloping mode". What
I've come up with here still needs 2 comparisons and a swap per call
for presorted input. There is still a missed opportunity for clustered
or (inverse) correlated input -- we can make merging opportunistically
skip ahead to determine that the root tape's 100th tuple (say) would
still fit in the root position of the merge minheap. So, immediately
return 100 tuples from the root's tape without bothering to compare
them to anything. Do a binary search to find the best candidate
minheap root before the 100th tuple if a guess of 100 doesn't work
out. Adapt to trends. Stuff like that.
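
To sketch the galloping idea (standalone toy code, not a patch): given
the sorted tuples remaining on the current run and the best key held by
any other run, a binary search finds how many leading tuples can be
emitted without touching the heap at all.

#include <stddef.h>

/*
 * Toy sketch: count how many leading elements of a sorted run are <= the
 * best key of any competing run; all of them can be returned without
 * further heap comparisons.
 */
static size_t
gallop_count(const int *run, size_t nrun, int best_other_key)
{
    size_t  lo = 0;
    size_t  hi = nrun;

    while (lo < hi)
    {
        size_t  mid = lo + (hi - lo) / 2;

        if (run[mid] <= best_other_key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;          /* run[0 .. lo-1] win against every other run */
}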

-- 
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-09 Thread Claudio Freire
...
On Fri, Sep 9, 2016 at 9:22 PM, Claudio Freire  wrote:
> Since it is true that doing so would make it impossible to keep the
> asserts about tupindex in tuplesort_heap_root_displace, I guess it
> depends on how useful those asserts are (ie: how likely it is that
> those conditions could be violated, and how damaging it could be if
> they were). If it is decided the refactor is desirable, I'd suggest
> making the common siftup procedure static inline, to allow
> tuplesort_heap_root_displace to inline and specialize it, since it
> will be called with checkIndex=False and that simplifies the resulting
> code considerably.
>
> Peter also mentioned that there were some other changes going on in
> the surrounding code that could impact this patch, so I'm marking the
> patch Waiting on Author.
>
> Overall, however, I believe the patch is in good shape. Only minor
> form issues need to be changed, the functionality seems both desirable
> and ready.


Sorry, forgot to specify, that was all about patch 3, the one about
tuplesort_heap_root_displace.




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-09 Thread Peter Geoghegan
On Wed, Sep 7, 2016 at 2:36 PM, Heikki Linnakangas  wrote:
> 3. If we do that, we'll still have to reserve the tape buffers for all the
> tapes that we use during merge. So after we've built the initial runs, we'll
> need to reserve memory for those buffers. That might require shrinking
> memtuples. But that's OK: after building the initial runs, memtuples is
> empty, so we can shrink it.

Do you really think all this is worth the effort? Given how things are
going to improve for merging anyway, I tend to doubt it. I'd rather
just apply the cap (not necessarily 501 tapes, but something), and be
done with it. As you know, Knuth never advocated more than 7 tapes at
once, which I don't think had anything to do with the economics of
tape drives in the 1970s (or problems with tape operators getting
repetitive strain injuries). There is a chart in volume 3 about this.
Senior hackers talked about a cap like this from day one, back in
2006, when Simon and Tom initially worked on scaling the number of
tapes. Alternatively, we could make MERGE_BUFFER_SIZE much larger,
which I think would be a good idea independent of whatever waste
logically allocation of never-used tapes presents us with. It's
currently 1/4 of 1MiB, which is hardly anything these days, and
doesn't seem to have much to do with OS read ahead trigger sizes.

If we were going to do something like you describe here, I'd prefer it
to be driven by an observable benefit in performance, rather than a
theoretical benefit. Not doing everything in one pass isn't
necessarily worse than having a less cache efficient heap -- it might
be quite a bit better, in fact. You've seen how hard it can be to get
a sort that is I/O bound. (Sorting will tend to not be completely I/O
bound, unless perhaps parallelism is used).

Anyway, this patch (patch 0001-*) is by far the least important of the
3 that you and Claudio are signed up to review. I don't think it's
worth bending over backwards to do better. If you're not comfortable
with a simple cap like this, then I'd suggest that we leave it at
that, since our time is better spent elsewhere. We can just shelve it
for now -- "returned with feedback". I wouldn't make any noise about
it (although, I actually don't think that the cap idea is at all
controversial).

-- 
Peter Geoghegan



