Re: [HACKERS] Tuplesort merge pre-reading

2017-04-18 Thread Heikki Linnakangas

On 04/14/2017 08:19 AM, Peter Geoghegan wrote:

BTW, I'm skeptical of Heikki's idea of killing polyphase
merge itself at this point. I think that keeping most tapes active per
pass is useful now that our memory accounting involves handing over an
even share to each maybe-active tape for every merge pass, something
established by Heikki's work on external sorting.


The pre-read buffers are only needed for input tapes; the total number 
of tapes doesn't matter.


For comparison, imagine that you want to perform a merge, such that you 
always merge 7 runs into one. With polyphase merge, you would need 8 
tapes, so that you always read from 7 of them, and write onto one. With 
balanced merge, you would need 14 tapes: you always read from 7 tapes, 
and you would need up to 7 output tapes, of which one would be active at 
any given time.


Those extra idle output tapes are practically free in our 
implementation. The "pre-read buffers" are only needed for input tapes, 
the number of output tapes doesn't matter. Likewise, maintaining the 
heap is cheaper if you only merge a small number of tapes at a time, but 
that's also dependent on the number of *input* tapes, not the total 
number of tapes.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tuplesort merge pre-reading

2017-04-14 Thread Peter Geoghegan
On Fri, Apr 14, 2017 at 5:57 AM, Robert Haas  wrote:
> I don't think there's any one fixed answer, because increasing the
> number of tapes reduces I/O by adding CPU cost, and vice versa.

Sort of, but if you have to merge hundreds of runs (a situation that
should be quite rare), then you should be concerned about being CPU
bound first, as Knuth was. Besides, on modern hardware, read-ahead can
be more effective if you have more merge passes, to a point, which
might also make it worth it -- using hundreds of tapes results in
plenty of *random* I/O. Plus, most of the time you only do a second
pass over a subset of initial quicksorted runs -- not all of them.

Probably the main complicating factor that Knuth doesn't care about is
time to return the first tuple -- startup cost. That was a big
advantage for commit df700e6 that I should have mentioned.

I'm not seriously suggesting that we should prefer multiple passes in
the vast majority of real world cases, nor am I suggesting that we
should go out of our way to help cases that need to do that. I just
find all this interesting.

-- 
Peter Geoghegan





Re: [HACKERS] Tuplesort merge pre-reading

2017-04-14 Thread Robert Haas
On Fri, Apr 14, 2017 at 1:19 AM, Peter Geoghegan  wrote:
> Interestingly enough, I think that Knuth was pretty much spot on with
> his "sweet spot" of 7 tapes, even if you have modern hardware. Commit
> df700e6 (where the sweet spot of merge order 7 was no longer always
> used) was effective because it masked certain overheads that we
> experience when doing multiple passes, overheads that Heikki and I
> mostly removed. This was confirmed by Robert's testing of my merge
> order cap work for commit fc19c18, where he found that using 7 tapes
> was only slightly worse than using many hundreds of tapes. If we could
> somehow be completely effective in making access to logical tapes
> perfectly sequential, then 7 tapes would probably be noticeably
> *faster*, due to CPU caching effects.

I don't think there's any one fixed answer, because increasing the
number of tapes reduces I/O by adding CPU cost, and vice versa.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Tuplesort merge pre-reading

2017-04-13 Thread Peter Geoghegan
On Thu, Apr 13, 2017 at 10:19 PM, Peter Geoghegan  wrote:
> I actually think Heikki's work here would particularly help on
> spinning rust, especially when less memory is available. He
> specifically justified it on the basis of it resulting in a more
> sequential read pattern, particularly when multiple passes are
> required.

BTW, what you might have missed is that Heikki did end up using a
significant amount of memory in the committed version. It just ended
up being managed by logtape.c, which now does the prereading instead
of tuplesort.c, but at a lower level. There is only one tuple in the
merge heap, but there is still up to 1GB of memory per tape,
containing raw preread tuples mixed with integers that demarcate tape
contents.

-- 
Peter Geoghegan





Re: [HACKERS] Tuplesort merge pre-reading

2017-04-13 Thread Peter Geoghegan
On Thu, Apr 13, 2017 at 9:51 PM, Tom Lane  wrote:
> I'm fairly sure that the point was exactly what it said, ie improve
> locality of access within the temp file by sequentially reading as many
> tuples in a row as we could, rather than grabbing one here and one there.
>
> It may be that the work you and Peter G. have been doing has rendered
> that question moot.  But I'm a bit worried that the reason you're not
> seeing any effect is that you're only testing situations with zero seek
> penalty (ie your laptop's disk is an SSD).  Back then I would certainly
> have been testing with temp files on spinning rust, and I fear that this
> may still be an issue in that sort of environment.

I actually think Heikki's work here would particularly help on
spinning rust, especially when less memory is available. He
specifically justified it on the basis of it resulting in a more
sequential read pattern, particularly when multiple passes are
required.

> The larger picture to be drawn from that thread is that we were seeing
> very different performance characteristics on different platforms.
> The specific issue that Tatsuo-san reported seemed like it might be
> down to weird read-ahead behavior in a 90s-vintage Linux kernel ...
> but the point that this stuff can be environment-dependent is still
> something to take to heart.

BTW, I'm skeptical of Heikki's idea of killing polyphase
merge itself at this point. I think that keeping most tapes active per
pass is useful now that our memory accounting involves handing over an
even share to each maybe-active tape for every merge pass, something
established by Heikki's work on external sorting.

Interestingly enough, I think that Knuth was pretty much spot on with
his "sweet spot" of 7 tapes, even if you have modern hardware. Commit
df700e6 (where the sweet spot of merge order 7 was no longer always
used) was effective because it masked certain overheads that we
experience when doing multiple passes, overheads that Heikki and I
mostly removed. This was confirmed by Robert's testing of my merge
order cap work for commit fc19c18, where he found that using 7 tapes
was only slightly worse than using many hundreds of tapes. If we could
somehow be completely effective in making access to logical tapes
perfectly sequential, then 7 tapes would probably be noticeably
*faster*, due to CPU caching effects.

Knuth was completely correct to say that it basically made no
difference once more than 7 tapes are used to merge, because he didn't
have logtape.c fragmentation to worry about.

-- 
Peter Geoghegan





Re: [HACKERS] Tuplesort merge pre-reading

2017-04-13 Thread Tom Lane
I wrote:
> Heikki Linnakangas  writes:
>> I'm talking about the code that reads a bunch of tuples from each tape, loading 
>> them into the memtuples array. That code was added by Tom Lane, back in 
>> 1999:
>> So apparently there was a benefit back then, but is it still worthwhile? 

> I'm fairly sure that the point was exactly what it said, ie improve
> locality of access within the temp file by sequentially reading as many
> tuples in a row as we could, rather than grabbing one here and one there.

[ blink... ]  Somehow, my mail reader popped up a message from 2016
as current, and I spent some time researching and answering it without
noticing the message date.

Never mind, nothing to see here ...

regards, tom lane




Re: [HACKERS] Tuplesort merge pre-reading

2017-04-13 Thread Tom Lane
Heikki Linnakangas  writes:
> I'm talking about the code that reads a bunch of tuples from each tape, loading 
> them into the memtuples array. That code was added by Tom Lane, back in 
> 1999:

> commit cf627ab41ab9f6038a29ddd04dd0ff0ccdca714e
> Author: Tom Lane 
> Date:   Sat Oct 30 17:27:15 1999 +

>  Further performance improvements in sorting: reduce number of comparisons
>  during initial run formation by keeping both current run and next-run
>  tuples in the same heap (yup, Knuth is smarter than I am).  And, during
>  merge passes, make use of available sort memory to load multiple tuples
>  from any one input 'tape' at a time, thereby improving locality of
>  access to the temp file.

> So apparently there was a benefit back then, but is it still worthwhile? 

I'm fairly sure that the point was exactly what it said, ie improve
locality of access within the temp file by sequentially reading as many
tuples in a row as we could, rather than grabbing one here and one there.

It may be that the work you and Peter G. have been doing has rendered
that question moot.  But I'm a bit worried that the reason you're not
seeing any effect is that you're only testing situations with zero seek
penalty (ie your laptop's disk is an SSD).  Back then I would certainly
have been testing with temp files on spinning rust, and I fear that this
may still be an issue in that sort of environment.

The relevant mailing list thread seems to be "sort on huge table" in
pgsql-hackers in October/November 1999.  The archives don't seem to have
threaded that too successfully, but here's a message specifically
describing the commit you mention:

https://www.postgresql.org/message-id/2726.941493808%40sss.pgh.pa.us

and you can find the rest by looking through the archive summary pages
for that interval.

The larger picture to be drawn from that thread is that we were seeing
very different performance characteristics on different platforms.
The specific issue that Tatsuo-san reported seemed like it might be
down to weird read-ahead behavior in a 90s-vintage Linux kernel ...
but the point that this stuff can be environment-dependent is still
something to take to heart.

regards, tom lane




Re: [HACKERS] Tuplesort merge pre-reading

2016-10-03 Thread Peter Geoghegan
On Mon, Oct 3, 2016 at 3:39 AM, Heikki Linnakangas  wrote:
>> Can't you just use state->tapeRange, and remove the "continue"? I
>> recommend referring to "now-exhausted input tapes" here, too.
>
>
> Don't think so. result_tape == tapeRange only when the merge was done in a
> single pass (or you're otherwise lucky).

Ah, yes. Logical tape assignment/physical tape number confusion on my part here.

>> * I'm not completely prepared to give up on using
>> MemoryContextAllocHuge() within logtape.c just yet. Maybe you're right
>> that it couldn't possibly matter that we impose a MaxAllocSize cap
>> within logtape.c (per tape), but I have slight reservations that I
>> need to address. Maybe a better way of putting it would be that I have
>> some reservations about possible regressions at the very high end,
>> with very large workMem. Any thoughts on that?
>
>
> Meh, I can't imagine that using more than 1 GB for a read-ahead buffer could
> make any difference in practice. If you have a very large work_mem, you'll
> surely get away with a single merge pass, and fragmentation won't become an
> issue. And 1GB should be more than enough to trigger OS read-ahead.

I had a non-specific concern, rather than a concrete suspicion, about
this. I think that I'll figure it out when I rebase the parallel
CREATE INDEX patch on top of this and test that.

> Committed with some final kibitzing. Thanks for the review!

Thanks for working on this!

> PS. This patch didn't fix bug #14344, the premature reuse of memory with
> tuplesort_gettupleslot. We'll still need to come up with 1. a backportable
> fix for that, and 2. perhaps a different fix for master.

Agreed. It seemed like you favor not changing memory ownership
semantics for 9.6. I'm not sure that that's the easiest approach for
9.6, but let's discuss that over on the dedicated thread soon.

-- 
Peter Geoghegan




Re: [HACKERS] Tuplesort merge pre-reading

2016-10-03 Thread Heikki Linnakangas

On 09/30/2016 04:08 PM, Peter Geoghegan wrote:

On Thu, Sep 29, 2016 at 4:10 PM, Heikki Linnakangas  wrote:

Bah, I fumbled the initSlabAllocator() function, attached is a fixed
version.


This looks much better. It's definitely getting close. Thanks for
being considerate of my more marginal concerns. More feedback:


Fixed most of the things you pointed out, thanks.


* Minor issues with initSlabAllocator():

You call the new function initSlabAllocator() as follows:


+   if (state->tuples)
+   initSlabAllocator(state, numInputTapes + 1);
+   else
+   initSlabAllocator(state, 0);


Isn't the number of slots (the second argument to initSlabAllocator())
actually just numInputTapes when we're "state->tuples"? And so,
shouldn't the "+ 1" bit happen within initSlabAllocator() itself? It
can just inspect "state->tuples" itself. In short, maybe push a bit
more into initSlabAllocator(). Making the arguments match those passed
to initTapeBuffers() a bit later would be nice, perhaps.


The comment above that call explains the "+ 1". init_slab_allocator() 
allocates the number of slots that was requested, and the caller is 
responsible for deciding how many slots are needed. Yeah, we could remove 
the argument and move the logic altogether into init_slab_allocator(), but 
I think it's clearer this way. Matter of taste, I guess.



* This could be simpler, I think:


+   /* Release the read buffers on all the other tapes, by rewinding them. */
+   for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
+   {
+   if (tapenum == state->result_tape)
+   continue;
+   LogicalTapeRewind(state->tapeset, tapenum, true);
+   }


Can't you just use state->tapeRange, and remove the "continue"? I
recommend referring to "now-exhausted input tapes" here, too.


Don't think so. result_tape == tapeRange only when the merge was done in 
a single pass (or you're otherwise lucky).



* I'm not completely prepared to give up on using
MemoryContextAllocHuge() within logtape.c just yet. Maybe you're right
that it couldn't possibly matter that we impose a MaxAllocSize cap
within logtape.c (per tape), but I have slight reservations that I
need to address. Maybe a better way of putting it would be that I have
some reservations about possible regressions at the very high end,
with very large workMem. Any thoughts on that?


Meh, I can't imagine that using more than 1 GB for a read-ahead buffer 
could make any difference in practice. If you have a very large 
work_mem, you'll surely get away with a single merge pass, and 
fragmentation won't become an issue. And 1GB should be more than enough 
to trigger OS read-ahead.


Committed with some final kibitzing. Thanks for the review!

PS. This patch didn't fix bug #14344, the premature reuse of memory with 
tuplesort_gettupleslot. We'll still need to come up with 1. a 
backportable fix for that, and 2. perhaps a different fix for master.


- Heikki




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-30 Thread Peter Geoghegan
On Thu, Sep 29, 2016 at 4:10 PM, Heikki Linnakangas  wrote:
> Bah, I fumbled the initSlabAllocator() function, attached is a fixed
> version.

This looks much better. It's definitely getting close. Thanks for
being considerate of my more marginal concerns. More feedback:

* Should say "fixed number of...":

> +* we start merging.  Merging only needs to keep a small, fixed number 
> tuples

* Minor concern about these new macros:

> +#define IS_SLAB_SLOT(state, tuple) \
> +   ((char *) tuple >= state->slabMemoryBegin && \
> +(char *) tuple < state->slabMemoryEnd)
> +
> +/*
> + * Return the given tuple to the slab memory free list, or free it
> + * if it was palloc'd.
> + */
> +#define RELEASE_SLAB_SLOT(state, tuple) \
> +   do { \
> +   SlabSlot *buf = (SlabSlot *) tuple; \
> +   \
> +   if (IS_SLAB_SLOT(state, tuple)) \
> +   { \
> +   buf->nextfree = state->slabFreeHead; \
> +   state->slabFreeHead = buf; \
> +   } else \
> +   pfree(tuple); \
> +   } while(0)

I suggest duplicating the paranoia seen elsewhere around what "state"
macro argument could expand to. You know, by surrounding "state" with
parenthesis each time it is used. This is what we see with existing,
similar macros.

* Should cast to int64 here (for the benefit of win64):

> +   elog(LOG, "using " INT64_FORMAT " KB of memory for read buffers among 
> %d input tapes",
> +(long) (availBlocks * BLCKSZ) / 1024, numInputTapes);

* FWIW, I still don't love this bit:

> +* numTapes and numInputTapes reflect the actual number of tapes we will
> +* use.  Note that the output tape's tape number is maxTapes - 1, so the
> +* tape numbers of the used tapes are not consecutive, so you cannot
> +* just loop from 0 to numTapes to visit all used tapes!
> +*/
> +   if (state->Level == 1)
> +   {
> +   numInputTapes = state->currentRun;
> +   numTapes = numInputTapes + 1;
> +   FREEMEM(state, (state->maxTapes - numTapes) * TAPE_BUFFER_OVERHEAD);
> +   }

But I can see how the verbosity of almost-duplicating the activeTapes
stuff seems unappealing. That said, I think that you should point out
in comments that you're calculating the number of
maybe-active-in-some-merge tapes. They're maybe-active in that they
have some number of real runs. Not going to insist on that, but
something to think about.

* Shouldn't this use state->tapeRange?:

> +   else
> +   {
> +   numInputTapes = state->maxTapes - 1;
> +   numTapes = state->maxTapes;
> +   }

* Doesn't it also set numTapes without it being used? Maybe that
variable can be declared within "if (state->Level == 1)" block.

* Minor issues with initSlabAllocator():

You call the new function initSlabAllocator() as follows:

> +   if (state->tuples)
> +   initSlabAllocator(state, numInputTapes + 1);
> +   else
> +   initSlabAllocator(state, 0);

Isn't the number of slots (the second argument to initSlabAllocator())
actually just numInputTapes when we're "state->tuples"? And so,
shouldn't the "+ 1" bit happen within initSlabAllocator() itself? It
can just inspect "state->tuples" itself. In short, maybe push a bit
more into initSlabAllocator(). Making the arguments match those passed
to initTapeBuffers() a bit later would be nice, perhaps.

* This could be simpler, I think:

> +   /* Release the read buffers on all the other tapes, by rewinding them. */
> +   for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
> +   {
> +   if (tapenum == state->result_tape)
> +   continue;
> +   LogicalTapeRewind(state->tapeset, tapenum, true);
> +   }

Can't you just use state->tapeRange, and remove the "continue"? I
recommend referring to "now-exhausted input tapes" here, too.

* I'm not completely prepared to give up on using
MemoryContextAllocHuge() within logtape.c just yet. Maybe you're right
that it couldn't possibly matter that we impose a MaxAllocSize cap
within logtape.c (per tape), but I have slight reservations that I
need to address. Maybe a better way of putting it would be that I have
some reservations about possible regressions at the very high end,
with very large workMem. Any thoughts on that?

-- 
Peter Geoghegan




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-29 Thread Robert Haas
On Thu, Sep 29, 2016 at 11:38 AM, Peter Geoghegan  wrote:
> On Thu, Sep 29, 2016 at 2:59 PM, Robert Haas  wrote:
>>> Maybe that was the wrong choice of words. What I mean is that it seems
>>> somewhat unprincipled to give over an equal share of memory to each
>>> active-at-least-once tape, ...
>>
>> I don't get it.  If the memory is being used for prereading, then the
>> point is just to avoid doing many small I/Os instead of one big I/O,
>> and that goal will be accomplished whether the memory is equally
>> distributed or not; indeed, it's likely to be accomplished BETTER if
>> the memory is equally distributed than if it isn't.
>
> I think it could hurt performance if preloading loads runs on a tape
> that won't be needed until some subsequent merge pass, in preference
> to using that memory proportionately, giving more to larger input runs
> for *each* merge pass (giving memory proportionate to the size of each
> run to be merged from each tape).  For tapes with a dummy run, the
> appropriate amount of memory for their next merge pass is zero.

OK, true.  But I still suspect that unless the amount of data you need
to read from a tape is zero or very small, the size of the buffer
doesn't matter.  For example, if you have a 1GB tape and a 10GB tape,
I doubt there's any benefit in making the buffer for the 10GB tape 10x
larger.  They can probably be the same.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-29 Thread Peter Geoghegan
On Thu, Sep 29, 2016 at 2:59 PM, Robert Haas  wrote:
>> Maybe that was the wrong choice of words. What I mean is that it seems
>> somewhat unprincipled to give over an equal share of memory to each
>> active-at-least-once tape, ...
>
> I don't get it.  If the memory is being used for prereading, then the
> point is just to avoid doing many small I/Os instead of one big I/O,
> and that goal will be accomplished whether the memory is equally
> distributed or not; indeed, it's likely to be accomplished BETTER if
> the memory is equally distributed than if it isn't.

I think it could hurt performance if preloading loads runs on a tape
that won't be needed until some subsequent merge pass, in preference
to using that memory proportionately, giving more to larger input runs
for *each* merge pass (giving memory proportionate to the size of each
run to be merged from each tape).  For tapes with a dummy run, the
appropriate amount of memory for their next merge pass is zero.

I'm not arguing that it would be worth it to do that, but I do think
that that's the sensible way of framing the idea of using a uniform
amount of memory to every maybe-active tape up front. I'm not too
concerned about this because I'm never too concerned about multiple
merge pass cases, which are relatively rare and relatively
unimportant. Let's just get our story straight.

-- 
Peter Geoghegan




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-29 Thread Heikki Linnakangas

On 09/29/2016 05:41 PM, Heikki Linnakangas wrote:

Here's a new patch version, addressing the points you made. Please have
a look!


Bah, I fumbled the initSlabAllocator() function, attached is a fixed 
version.


- Heikki

From bd74cb9c32b3073637d6932f3b4552598fcdc92a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 14 Sep 2016 17:29:11 +0300
Subject: [PATCH 1/1] Change the way pre-reading in external sort's merge phase
 works.

Don't pre-read tuples into SortTuple slots during merge. Instead, use the
memory for larger read buffers in logtape.c. We're doing the same number
of READTUP() calls either way, but managing the pre-read SortTuple slots
is much more complicated. Also, the on-tape representation is more compact
than SortTuples, so we can fit more pre-read tuples into the same amount
of memory this way. And we have better cache-locality, when we use just a
small number of SortTuple slots.

Now that we only hold one tuple from each tape in the SortTuple slots, we
can greatly simplify the "batch memory" management. We now maintain a
small set of fixed-sized slots, to hold the tuples, and fall back to
palloc() for larger tuples. We use this method during all merge phases,
not just the final merge, and also when randomAccess is requested, and
also in the TSS_SORTEDONTAPE case. In other words, it's used whenever we do
an external sort.

Reviewed by Peter Geoghegan and Claudio Freire.
---
 src/backend/utils/sort/logtape.c   |  153 -
 src/backend/utils/sort/tuplesort.c | 1208 +---
 src/include/utils/logtape.h|1 +
 3 files changed, 565 insertions(+), 797 deletions(-)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 7745207..4152da1 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -52,12 +52,17 @@
  * not clear this helps much, but it can't hurt.  (XXX perhaps a LIFO
  * policy for free blocks would be better?)
  *
+ * To further make the I/Os more sequential, we can use a larger buffer
+ * when reading, and read multiple blocks from the same tape in one go,
+ * whenever the buffer becomes empty. LogicalTapeAssignReadBufferSize()
+ * can be used to set the size of the read buffer.
+ *
  * To support the above policy of writing to the lowest free block,
  * ltsGetFreeBlock sorts the list of free block numbers into decreasing
  * order each time it is asked for a block and the list isn't currently
  * sorted.  This is an efficient way to handle it because we expect cycles
  * of releasing many blocks followed by re-using many blocks, due to
- * tuplesort.c's "preread" behavior.
+ * the larger read buffer.
  *
  * Since all the bookkeeping and buffer memory is allocated with palloc(),
  * and the underlying file(s) are made with OpenTemporaryFile, all resources
@@ -79,6 +84,7 @@
 
 #include "storage/buffile.h"
 #include "utils/logtape.h"
+#include "utils/memutils.h"
 
 /*
  * Block indexes are "long"s, so we can fit this many per indirect block.
@@ -131,9 +137,18 @@ typedef struct LogicalTape
 	 * reading.
 	 */
 	char	   *buffer;			/* physical buffer (separately palloc'd) */
+	int			buffer_size;	/* allocated size of the buffer */
 	long		curBlockNumber; /* this block's logical blk# within tape */
 	int			pos;			/* next read/write position in buffer */
 	int			nbytes;			/* total # of valid bytes in buffer */
+
+	/*
+	 * Desired buffer size to use when reading.  To keep things simple, we
+	 * use a single-block buffer when writing, or when reading a frozen
+	 * tape.  But when we are reading and will only read forwards, we
+	 * allocate a larger buffer, determined by read_buffer_size.
+	 */
+	int			read_buffer_size;
 } LogicalTape;
 
 /*
@@ -228,6 +243,53 @@ ltsReadBlock(LogicalTapeSet *lts, long blocknum, void *buffer)
 }
 
 /*
+ * Read as many blocks as we can into the per-tape buffer.
+ *
+ * The caller can specify the next physical block number to read, in
+ * datablocknum, or -1 to fetch the next block number from the internal block.
+ * If datablocknum == -1, the caller must've already set curBlockNumber.
+ *
+ * Returns true if anything was read, 'false' on EOF.
+ */
+static bool
+ltsReadFillBuffer(LogicalTapeSet *lts, LogicalTape *lt, long datablocknum)
+{
+	lt->pos = 0;
+	lt->nbytes = 0;
+
+	do
+	{
+		/* Fetch next block number (unless provided by caller) */
+		if (datablocknum == -1)
+		{
+			datablocknum = ltsRecallNextBlockNum(lts, lt->indirect, lt->frozen);
+			if (datablocknum == -1L)
+break;			/* EOF */
+			lt->curBlockNumber++;
+		}
+
+		/* Read the block */
+		ltsReadBlock(lts, datablocknum, (void *) (lt->buffer + lt->nbytes));
+		if (!lt->frozen)
+			ltsReleaseBlock(lts, datablocknum);
+
+		if (lt->curBlockNumber < lt->numFullBlocks)
+			lt->nbytes += BLCKSZ;
+		else
+		{
+			/* EOF */
+			lt->nbytes += lt->lastBlockBytes;
+			break;
+		}
+
+		/* Advance to next block, if we have buffer space left */
+		datablocknum = -1;
+	} 

Re: [HACKERS] Tuplesort merge pre-reading

2016-09-29 Thread Heikki Linnakangas

On 09/29/2016 01:52 PM, Peter Geoghegan wrote:

* Variables like maxTapes have a meaning that is directly traceable
back to Knuth's description of polyphase merge. I don't think that you
should do anything to them, on general principle.


Ok. I still think that changing maxTapes would make sense, when there 
are fewer runs than tapes, but that is actually orthogonal to the rest 
of the patch, so let's discuss that separately. I've changed the patch 
to not do that.



* Everything or almost everything that you've added to mergeruns()
should probably be in its own dedicated function. This function can
have a comment where you acknowledge that it's not perfectly okay that
you claim memory per-tape, but it's simpler and faster overall.


Ok.


* I think you should be looking at the number of active tapes, and not
state->Level or state->currentRun in this new function. Actually,
maybe this wouldn't be the exact definition of an active tape that we
establish at the beginning of beginmerge() (this considers tapes with
dummy runs to be inactive for that merge), but it would look similar.
The general concern I have here is that you shouldn't rely on
state->Level or state->currentRun from a distance for the purposes of
determining which tapes need some batch memory. This is also where you
might say something like: "we don't bother with shifting memory around
tapes for each merge step, even though that would be fairer. That's
why we don't use the beginmerge() definition of activeTapes --
instead, we use our own broader definition of the number of active
tapes that doesn't exclude dummy-run-tapes with real runs, making the
allocation reasonably sensible for every merge pass".


I'm not sure I understood what your concern was, but please have a look 
at this new version, and see whether the comments in initTapeBuffers() 
explain it well enough.



* The "batchUsed" terminology really isn't working here, AFAICT. For
one thing, you have two separate areas where caller tuples might
reside: The small per-tape buffers (sized MERGETUPLEBUFFER_SIZE per
tape), and the logtape.c controlled preread buffers (sized up to
MaxAllocSize per tape). Which of these two things is batch memory? I
think it might just be the first one, but KiBs of memory do not
suggest "batch" to me. Isn't that really more like what you might call
double buffer memory, used not to save overhead from palloc (having
many palloc headers in memory), but rather to recycle memory
efficiently? So, these two things should have two new names of their
own, I think, and neither should be called "batch memory" IMV. I see
several assertions remain here and there that were written with my
original definition of batch memory in mind -- assertions like:


Ok. I replaced "batch" terminology with "slab allocator" and "slab 
slots", I hope this is less confusing. This isn't exactly like e.g. the 
slab allocator in the Linux kernel, as it has a fixed number of slots, 
so perhaps an "object pool" might be more accurate. But I like "slab" 
because it's not used elsewhere in the system. I also didn't use the 
term "cache" for the "slots", as might be typical for slab allocators, 
because "cache" is such an overloaded term.
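
To illustrate the idea, a fixed-slot slab of this kind can be sketched 
roughly like this (purely an illustration -- the names and the slot size 
here are made up, and the real patch integrates with tuplesort's memory 
accounting rather than calling malloc() directly):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SLAB_SLOT_SIZE 1024		/* hypothetical fixed slot size */

typedef struct SlabSlot
{
	struct SlabSlot *nextfree;	/* freelist link while the slot is unused */
	char		payload[SLAB_SLOT_SIZE];
} SlabSlot;

typedef struct
{
	SlabSlot   *arena;			/* one big allocation holding all slots */
	SlabSlot   *freelist;		/* slots currently available for reuse */
} SlabAllocator;

static void
slab_init(SlabAllocator *slab, int nslots)
{
	slab->arena = malloc(sizeof(SlabSlot) * nslots);
	slab->freelist = NULL;
	for (int i = 0; i < nslots; i++)
	{
		slab->arena[i].nextfree = slab->freelist;
		slab->freelist = &slab->arena[i];
	}
}

/*
 * Pop a slot off the freelist.  Returns NULL when all slots are in use;
 * a real caller would then fall back to an ordinary allocation.
 */
static void *
slab_alloc(SlabAllocator *slab)
{
	SlabSlot   *slot = slab->freelist;

	if (slot == NULL)
		return NULL;
	slab->freelist = slot->nextfree;
	return slot->payload;
}

static void
slab_free(SlabAllocator *slab, void *p)
{
	SlabSlot   *slot = (SlabSlot *) ((char *) p - offsetof(SlabSlot, payload));

	slot->nextfree = slab->freelist;
	slab->freelist = slot;
}
```

Allocation and freeing are O(1) freelist pushes, with no per-allocation 
palloc header overhead, which is the point of the fixed number of slots.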



case TSS_SORTEDONTAPE:
Assert(forward || state->randomAccess);
Assert(!state->batchUsed);

(Isn't state->batchUsed always true now?)


Good catch. It wasn't, because mergeruns() set batchUsed only after 
checking for the TSS_SORTEDONTAPE case, even though it set up the batch 
memory arena before it. So if replacement selection was able to produce 
a single run, batchUsed was false. Fixed; the slab allocator (as it's 
now called) is now always used in the TSS_SORTEDONTAPE case.



And:

case TSS_FINALMERGE:
Assert(forward);
Assert(state->batchUsed || !state->tuples);

(Isn't state->tuples only really of interest to datum-tuple-case
routines, now that you've made everything use large logtape.c preread
buffers?)


Yeah, fixed.


* Is it really necessary for !state->tuples cases to do that
MERGETUPLEBUFFER_SIZE-based allocation? In other words, what need is
there for pass-by-value datum cases to have this scratch/double buffer
memory at all, since the value is returned to caller by-value, not
by-reference? This is related to the problem of it not being entirely
clear what batch memory now is, I think.


True, fixed. I still set slabAllocatorUsed (was batchUsed), but it's 
initialized as a dummy 0-slot arena when !state->tuples.


Here's a new patch version, addressing the points you made. Please have 
a look!


- Heikki

From a958cea32550825aa0ea487f58ac87c2c3620cda Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 14 Sep 2016 17:29:11 +0300
Subject: [PATCH 1/1] Change the way pre-reading in external sort's merge phase
 works.

Don't pre-read tuples into SortTuple slots during merge. Instead, use the
memory for larger read buffers in logtape.c. We're doing the same number

Re: [HACKERS] Tuplesort merge pre-reading

2016-09-29 Thread Robert Haas
On Thu, Sep 29, 2016 at 6:52 AM, Peter Geoghegan  wrote:
>> How is it awkward?
>
> Maybe that was the wrong choice of words. What I mean is that it seems
> somewhat unprincipled to give over an equal share of memory to each
> active-at-least-once tape, ...

I don't get it.  If the memory is being used for prereading, then the
point is just to avoid doing many small I/Os instead of one big I/O,
and that goal will be accomplished whether the memory is equally
distributed or not; indeed, it's likely to be accomplished BETTER if
the memory is equally distributed than if it isn't.

I can imagine that there might be a situation in which it makes sense
to give a bigger tape more resources than a smaller one; for example,
if one were going to divide N tapes across K worker processess and
make individual workers or groups of workers responsible for sorting
particular tapes, one would of course want to divide up the data
across the available processes relatively evenly, rather than having
some workers responsible for only a small amount of data and others
for a very large amount of data.  But such considerations are
irrelevant here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tuplesort merge pre-reading

2016-09-29 Thread Peter Geoghegan
On Thu, Sep 29, 2016 at 10:49 AM, Heikki Linnakangas  wrote:
>> Do I have that right? If so, this seems rather awkward. Hmm.
>
>
> How is it awkward?

Maybe that was the wrong choice of words. What I mean is that it seems
somewhat unprincipled to give over an equal share of memory to each
active-at-least-once tape, regardless of how much that matters in
practice. One tape could have several runs in memory at once, while
another only has a fraction of a single much larger run. Maybe this is
just the first time that I find myself on the *other* side of a
discussion about an algorithm that seems brute-force compared to what
it might replace, but is actually better overall.   :-)

Now, that could be something that I just need to get over. In any
case, I still think:

* Variables like maxTapes have a meaning that is directly traceable
back to Knuth's description of polyphase merge. I don't think that you
should do anything to them, on general principle.

* Everything or almost everything that you've added to mergeruns()
should probably be in its own dedicated function. This function can
have a comment where you acknowledge that it's not perfectly okay that
you claim memory per-tape, but it's simpler and faster overall.

* I think you should be looking at the number of active tapes, and not
state->Level or state->currentRun in this new function. Actually,
maybe this wouldn't be the exact definition of an active tape that we
establish at the beginning of beginmerge() (this considers tapes with
dummy runs to be inactive for that merge), but it would look similar.
The general concern I have here is that you shouldn't rely on
state->Level or state->currentRun from a distance for the purposes of
determining which tapes need some batch memory. This is also where you
might say something like: "we don't bother with shifting memory around
tapes for each merge step, even though that would be fairer. That's
why we don't use the beginmerge() definition of activeTapes --
instead, we use our own broader definition of the number of active
tapes that doesn't exclude dummy-run-tapes with real runs, making the
allocation reasonably sensible for every merge pass".

* The "batchUsed" terminology really isn't working here, AFAICT. For
one thing, you have two separate areas where caller tuples might
reside: The small per-tape buffers (sized MERGETUPLEBUFFER_SIZE per
tape), and the logtape.c controlled preread buffers (sized up to
MaxAllocSize per tape). Which of these two things is batch memory? I
think it might just be the first one, but KiBs of memory do not
suggest "batch" to me. Isn't that really more like what you might call
double buffer memory, used not to save overhead from palloc (having
many palloc headers in memory), but rather to recycle memory
efficiently? So, these two things should have two new names of their
own, I think, and neither should be called "batch memory" IMV. I see
several assertions remain here and there that were written with my
original definition of batch memory in mind -- assertions like:

case TSS_SORTEDONTAPE:
Assert(forward || state->randomAccess);
Assert(!state->batchUsed);

(Isn't state->batchUsed always true now?)

And:

case TSS_FINALMERGE:
Assert(forward);
Assert(state->batchUsed || !state->tuples);

(Isn't state->tuples only really of interest to datum-tuple-case
routines, now that you've made everything use large logtape.c preread
buffers?)

* Is it really necessary for !state->tuples cases to do that
MERGETUPLEBUFFER_SIZE-based allocation? In other words, what need is
there for pass-by-value datum cases to have this scratch/double buffer
memory at all, since the value is returned to caller by-value, not
by-reference? This is related to the problem of it not being entirely
clear what batch memory now is, I think.

-- 
Peter Geoghegan




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-29 Thread Heikki Linnakangas

On 09/28/2016 07:20 PM, Peter Geoghegan wrote:

On Wed, Sep 28, 2016 at 5:11 PM, Peter Geoghegan  wrote:

This is why I never pursued batch memory for non-final merges. Isn't
that what you're doing here? You're pretty much always setting
"state->batchUsed = true".


Wait. I guess you feel you have to, since it wouldn't be okay to use
almost no memory per tape on non-final merges.

You're able to throw out so much code here in large part because you
give almost all memory over to logtape.c (e.g., you don't manage each
tape's share of "slots" anymore -- better to give everything to
logtape.c). So, with your patch, you would actually only have one
caller tuple in memory at once per tape for a merge that doesn't use
batch memory (if you made it so that a merge *could* avoid the use of
batch memory, as I suggest).


Correct.


In summary, under your scheme, the "batchUsed" variable contains a
tautological value, since you cannot sensibly not use batch memory.
(This is even true with !state->tuples callers).


I suppose.


Do I have that right? If so, this seems rather awkward. Hmm.


How is it awkward?

- Heikki





Re: [HACKERS] Tuplesort merge pre-reading

2016-09-28 Thread Heikki Linnakangas

On 09/28/2016 07:11 PM, Peter Geoghegan wrote:

On Wed, Sep 28, 2016 at 5:04 PM, Heikki Linnakangas  wrote:

Not sure that I understand. I agree that each merge pass tends to use
roughly the same number of tapes, but the distribution of real runs on
tapes is quite unbalanced in earlier merge passes (due to dummy runs).
It looks like you're always using batch memory, even for non-final
merges. Won't that fail to be in balance much of the time because of
the lopsided distribution of runs? Tapes have an uneven amount of real
data in earlier merge passes.



How does the distribution of the runs on the tapes matter?


The exact details are not really relevant to this discussion (I think
it's confusing that we simply say "Target Fibonacci run counts",
FWIW), but the simple fact that it can be quite uneven is.


Well, I claim that the fact that the distribution of runs is uneven 
does not matter. Can you explain why you think it does?



This is why I never pursued batch memory for non-final merges. Isn't
that what you're doing here? You're pretty much always setting
"state->batchUsed = true".


Yep. As the patch stands, we wouldn't really need batchUsed, as we know 
that it's always true when merging, and false otherwise. But I kept it, 
as it seems like that might not always be true - we might use batch 
memory when building the initial runs, for example - and because it 
seems nice to have an explicit flag for it, for readability and 
debugging purposes.



I'm basically repeating myself here, but: I think it's incorrect that
LogicalTapeAssignReadBufferSize() is called so indiscriminately (more
generally, it is questionable that it is called in such a high level
routine, rather than the start of a specific merge pass -- I said so a
couple of times already).



You can't release the tape buffer at the end of a pass, because the buffer
of a tape will already be filled with data from the next run on the same
tape.


Okay, but can't you just not use batch memory for non-final merges,
per my initial approach? That seems far cleaner.


Why? I don't see why the final merge should behave differently from the 
non-final ones.


- Heikki





Re: [HACKERS] Tuplesort merge pre-reading

2016-09-28 Thread Peter Geoghegan
On Wed, Sep 28, 2016 at 5:11 PM, Peter Geoghegan  wrote:
> This is why I never pursued batch memory for non-final merges. Isn't
> that what you're doing here? You're pretty much always setting
> "state->batchUsed = true".

Wait. I guess you feel you have to, since it wouldn't be okay to use
almost no memory per tape on non-final merges.

You're able to throw out so much code here in large part because you
give almost all memory over to logtape.c (e.g., you don't manage each
tape's share of "slots" anymore -- better to give everything to
logtape.c). So, with your patch, you would actually only have one
caller tuple in memory at once per tape for a merge that doesn't use
batch memory (if you made it so that a merge *could* avoid the use of
batch memory, as I suggest).

In summary, under your scheme, the "batchUsed" variable contains a
tautological value, since you cannot sensibly not use batch memory.
(This is even true with !state->tuples callers).

Do I have that right? If so, this seems rather awkward. Hmm.

-- 
Peter Geoghegan




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-28 Thread Peter Geoghegan
On Wed, Sep 28, 2016 at 5:04 PM, Heikki Linnakangas  wrote:
>> Not sure that I understand. I agree that each merge pass tends to use
>> roughly the same number of tapes, but the distribution of real runs on
>> tapes is quite unbalanced in earlier merge passes (due to dummy runs).
>> It looks like you're always using batch memory, even for non-final
>> merges. Won't that fail to be in balance much of the time because of
>> the lopsided distribution of runs? Tapes have an uneven amount of real
>> data in earlier merge passes.
>
>
> How does the distribution of the runs on the tapes matter?

The exact details are not really relevant to this discussion (I think
it's confusing that we simply say "Target Fibonacci run counts",
FWIW), but the simple fact that it can be quite uneven is.

This is why I never pursued batch memory for non-final merges. Isn't
that what you're doing here? You're pretty much always setting
"state->batchUsed = true".

>> I'm basically repeating myself here, but: I think it's incorrect that
>> LogicalTapeAssignReadBufferSize() is called so indiscriminately (more
>> generally, it is questionable that it is called in such a high level
>> routine, rather than the start of a specific merge pass -- I said so a
>> couple of times already).
>
>
> You can't release the tape buffer at the end of a pass, because the buffer
> of a tape will already be filled with data from the next run on the same
> tape.

Okay, but can't you just not use batch memory for non-final merges,
per my initial approach? That seems far cleaner.

-- 
Peter Geoghegan




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-28 Thread Heikki Linnakangas

On 09/28/2016 06:05 PM, Peter Geoghegan wrote:

On Thu, Sep 15, 2016 at 9:51 PM, Heikki Linnakangas  wrote:

I don't think it makes much difference in practice, because most merge
passes use all, or almost all, of the available tapes. BTW, I think the
polyphase algorithm prefers to do all the merges that don't use all tapes
upfront, so that the last final merge always uses all the tapes. I'm not
100% sure about that, but that's my understanding of the algorithm, and
that's what I've seen in my testing.


Not sure that I understand. I agree that each merge pass tends to use
roughly the same number of tapes, but the distribution of real runs on
tapes is quite unbalanced in earlier merge passes (due to dummy runs).
It looks like you're always using batch memory, even for non-final
merges. Won't that fail to be in balance much of the time because of
the lopsided distribution of runs? Tapes have an uneven amount of real
data in earlier merge passes.


How does the distribution of the runs on the tapes matter?


+   usedBlocks = 0;
+   for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
+   {
+   int64   numBlocks = blocksPerTape + (tapenum < remainder ? 1 : 0);
+
+   if (numBlocks > MaxAllocSize / BLCKSZ)
+   numBlocks = MaxAllocSize / BLCKSZ;
+   LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
+   numBlocks * BLCKSZ);
+   usedBlocks += numBlocks;
+   }
+   USEMEM(state, usedBlocks * BLCKSZ);


I'm basically repeating myself here, but: I think it's incorrect that
LogicalTapeAssignReadBufferSize() is called so indiscriminately (more
generally, it is questionable that it is called in such a high level
routine, rather than the start of a specific merge pass -- I said so a
couple of times already).


You can't release the tape buffer at the end of a pass, because the 
buffer of a tape will already be filled with data from the next run on 
the same tape.


- Heikki





Re: [HACKERS] Tuplesort merge pre-reading

2016-09-28 Thread Peter Geoghegan
On Thu, Sep 15, 2016 at 9:51 PM, Heikki Linnakangas  wrote:
>> I still don't get why you're doing all of this within mergeruns() (the
>> beginning of when we start merging -- we merge all quicksorted runs),
>> rather than within beginmerge() (the beginning of one particular merge
>> pass, of which there are potentially more than one). As runs are
>> merged in a non-final merge pass, fewer tapes will remain active for
>> the next merge pass. It doesn't do to do all that up-front when we
>> have multiple merge passes, which will happen from time to time.
>
>
> Now that the pre-reading is done in logtape.c, it doesn't stop at a run
> boundary. For example, when we read the last 1 MB of the first run on a
> tape, and we're using a 10 MB read buffer, we will merrily also read the
> first 9 MB from the next run. You cannot un-read that data, even if the tape
> is inactive in the next merge pass.

I've had a chance to think about this some more. I did a flying review
of the same revision that I talk about here, but realized some more
things. First, I will do a recap.

> I don't think it makes much difference in practice, because most merge
> passes use all, or almost all, of the available tapes. BTW, I think the
> polyphase algorithm prefers to do all the merges that don't use all tapes
> upfront, so that the last final merge always uses all the tapes. I'm not
> 100% sure about that, but that's my understanding of the algorithm, and
> that's what I've seen in my testing.

Not sure that I understand. I agree that each merge pass tends to use
roughly the same number of tapes, but the distribution of real runs on
tapes is quite unbalanced in earlier merge passes (due to dummy runs).
It looks like you're always using batch memory, even for non-final
merges. Won't that fail to be in balance much of the time because of
the lopsided distribution of runs? Tapes have an uneven amount of real
data in earlier merge passes.

FWIW, polyphase merge actually doesn't distribute runs based on the
Fibonacci sequence (at least in Postgres). It uses a generalization of
the Fibonacci numbers [1] -- *which* generalization varies according
to merge order/maxTapes. IIRC, with Knuth's P == 6 (i.e. merge order
== 6), it's the "hexanacci" sequence.
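
For what it's worth, the generalization is easy to see in code. Here's a 
standalone sketch (illustrative only -- kfib() is a made-up helper, and 
tuplesort.c builds the distribution incrementally in selectnewtape() 
rather than computing terms like this):

```c
#include <assert.h>

/*
 * k-generalized Fibonacci numbers: each term is the sum of the
 * preceding k terms, seeded with k-1 zeros followed by a single 1.
 * k == 2 gives the ordinary Fibonacci sequence; with Knuth's P == 6
 * (merge order 6), the run counts follow k == 6, the "hexanacci"
 * sequence.  Assumes 2 <= k and n < 64, enough for this illustration.
 */
static long
kfib(int k, int n)
{
	long		terms[64] = {0};

	terms[k - 1] = 1;
	for (int i = k; i <= n; i++)
	{
		long		sum = 0;

		for (int j = 1; j <= k; j++)
			sum += terms[i - j];
		terms[i] = sum;
	}
	return terms[n];
}
```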

The following code is from your latest version, posted on 2016-09-14,
within the high-level mergeruns() function (the function that takes
quicksorted runs, and produces final output following 1 or more merge
passes):

> +   usedBlocks = 0;
> +   for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
> +   {
> +   int64   numBlocks = blocksPerTape + (tapenum < remainder ? 1 : 0);
> +
> +   if (numBlocks > MaxAllocSize / BLCKSZ)
> +   numBlocks = MaxAllocSize / BLCKSZ;
> +   LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
> +   numBlocks * BLCKSZ);
> +   usedBlocks += numBlocks;
> +   }
> +   USEMEM(state, usedBlocks * BLCKSZ);

I'm basically repeating myself here, but: I think it's incorrect that
LogicalTapeAssignReadBufferSize() is called so indiscriminately (more
generally, it is questionable that it is called in such a high level
routine, rather than the start of a specific merge pass -- I said so a
couple of times already).

It's weird that you change the meaning of maxTapes by reassigning in
advance of iterating through maxTapes like this. I should point out
again that the master branch mergebatch() function does roughly the
same thing as you're doing here as follows:

static void
mergebatch(Tuplesortstate *state, int64 spacePerTape)
{
int srcTape;

Assert(state->activeTapes > 0);
Assert(state->tuples);

/*
 * For the purposes of tuplesort's memory accounting, the batch allocation
 * is special, and regular memory accounting through USEMEM() calls is
 * abandoned (see mergeprereadone()).
 */
for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
{
char   *mergetuples;

if (!state->mergeactive[srcTape])
continue;

/* Allocate buffer for each active tape */
mergetuples = MemoryContextAllocHuge(state->tuplecontext,
 spacePerTape);

/* Initialize state for tape */
state->mergetuples[srcTape] = mergetuples;
state->mergecurrent[srcTape] = mergetuples;
state->mergetail[srcTape] = mergetuples;
state->mergeoverflow[srcTape] = NULL;
}

state->batchUsed = true;
state->spacePerTape = spacePerTape;
}

Notably, this function:

* Works without altering the meaning of maxTapes. state->maxTapes is
Knuth's T, which is established very early and doesn't change with
polyphase merge (same applies to state->tapeRange). What does change
across merge passes is the number of *active* tapes. I don't think
it's necessary to change that in any way. I find it very confusing.
(Also, that you're using state->currentRun here at all seems bad, 

Re: [HACKERS] Tuplesort merge pre-reading

2016-09-22 Thread Claudio Freire
On Thu, Sep 22, 2016 at 4:17 AM, Heikki Linnakangas  wrote:
> On 09/22/2016 03:40 AM, Claudio Freire wrote:
>>
>> On Tue, Sep 20, 2016 at 3:34 PM, Claudio Freire 
>> wrote:
>>>
>>> The results seem all over the map. Some regressions seem significant
>>> (both in the amount of performance lost and their significance, since
>>> all 4 runs show a similar regression). The worst being "CREATE INDEX
>>> ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" with 4GB
>>> work_mem, which should be an in-memory sort, which makes it odd.
>>>
>>> I will re-run it overnight just in case to confirm the outcome.
>>
>>
>> A new run for "patched" gives better results, it seems it was some
>> kind of glitch in the run (maybe some cron decided to do something
>> while running those queries).
>>
>> Attached
>>
>> In essence, it doesn't look like it's harmfully affecting CPU
>> efficiency. Results seem neutral on the CPU front.
>
>
> Looking at the spreadsheet, there is a 40% slowdown in the "slow" "CREATE
> INDEX ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" test with 4GB
> of work_mem. I can't reproduce that on my laptop, though. Got any clue
> what's going on there?

It's not present in other runs, so I think it's a fluke the
spreadsheet isn't filtering out. Especially considering that one
should be a fully in-memory fast sort and thus unaffected by the
current patch (z and z2 being integers, IIRC, most comparisons should
be about comparing the first columns and thus rarely involve the big
strings).

I'll try to confirm that's the case though.




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-22 Thread Heikki Linnakangas

On 09/22/2016 03:40 AM, Claudio Freire wrote:

On Tue, Sep 20, 2016 at 3:34 PM, Claudio Freire  wrote:

The results seem all over the map. Some regressions seem significant
(both in the amount of performance lost and their significance, since
all 4 runs show a similar regression). The worst being "CREATE INDEX
ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" with 4GB
work_mem, which should be an in-memory sort, which makes it odd.

I will re-run it overnight just in case to confirm the outcome.


A new run for "patched" gives better results, it seems it was some
kind of glitch in the run (maybe some cron decided to do something
while running those queries).

Attached

In essence, it doesn't look like it's harmfully affecting CPU
efficiency. Results seem neutral on the CPU front.


Looking at the spreadsheet, there is a 40% slowdown in the "slow" 
"CREATE INDEX ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" 
test with 4GB of work_mem. I can't reproduce that on my laptop, though. 
Got any clue what's going on there?


- Heikki





Re: [HACKERS] Tuplesort merge pre-reading

2016-09-21 Thread Claudio Freire
On Tue, Sep 20, 2016 at 3:34 PM, Claudio Freire  wrote:
> On Fri, Sep 9, 2016 at 9:51 PM, Claudio Freire  wrote:
>> On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas  wrote:
>>>
>>> Claudio, if you could also repeat the tests you ran on Peter's patch set on
>>> the other thread, with these patches, that'd be nice. These patches are
>>> effectively a replacement for
>>> 0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
>>> would be much appreciated too, of course.
>>>
>>> Attached are new versions. Compared to last set, they contain a few comment
>>> fixes, and a change to the 2nd patch to not allocate tape buffers for tapes
>>> that were completely unused.
>>
>>
>> Will do so
>
> Well, here they are, the results.
>
> ODS format only (unless you've got issues opening the ODS).
>
> The results seem all over the map. Some regressions seem significant
> (both in the amount of performance lost and their significance, since
> all 4 runs show a similar regression). The worst being "CREATE INDEX
> ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" with 4GB
> work_mem, which should be an in-memory sort, which makes it odd.
>
> I will re-run it overnight just in case to confirm the outcome.

A new run for "patched" gives better results, it seems it was some
kind of glitch in the run (maybe some cron decided to do something
while running those queries).

Attached

In essence, it doesn't look like it's harmfully affecting CPU
efficiency. Results seem neutral on the CPU front.


logtape_preload_timings.ods
Description: application/vnd.oasis.opendocument.spreadsheet



Re: [HACKERS] Tuplesort merge pre-reading

2016-09-20 Thread Claudio Freire
On Fri, Sep 9, 2016 at 9:51 PM, Claudio Freire  wrote:
> On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas  wrote:
>>
>> Claudio, if you could also repeat the tests you ran on Peter's patch set on
>> the other thread, with these patches, that'd be nice. These patches are
>> effectively a replacement for
>> 0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
>> would be much appreciated too, of course.
>>
>> Attached are new versions. Compared to last set, they contain a few comment
>> fixes, and a change to the 2nd patch to not allocate tape buffers for tapes
>> that were completely unused.
>
>
> Will do so

Well, here they are, the results.

ODS format only (unless you've got issues opening the ODS).

The results seem all over the map. Some regressions seem significant
(both in the amount of performance lost and their significance, since
all 4 runs show a similar regression). The worst being "CREATE INDEX
ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" with 4GB
work_mem, which should be an in-memory sort, which makes it odd.

I will re-run it overnight just in case to confirm the outcome.


logtape_preload_timings.ods
Description: application/vnd.oasis.opendocument.spreadsheet



Re: [HACKERS] Tuplesort merge pre-reading

2016-09-17 Thread Peter Geoghegan
On Sat, Sep 17, 2016 at 9:41 AM, Heikki Linnakangas  wrote:
> Ok. I'll probably read through it myself once more some time next week, and
> also have a first look at your actual parallel sorting patch. Have a good
> trip!

Thanks! It will be good to get away for a while.

I'd be delighted to recruit you to work on the parallel CREATE INDEX
patch. I've already explained how I think this preread patch of yours
works well with parallel tuplesort (as proposed) in particular. To
reiterate: while what you've come up with here is technically an
independent improvement to merging, it's much more valuable in the
overall context of parallel sort, where disproportionate wall clock
time is spent merging, and where multiple passes are the norm (one
pass in each worker, plus one big final pass in the leader process
alone -- logtape.c fragmentation becomes a real issue). The situation
is actually similar to the original batch memory preloading patch that
went into 9.6 (which your new patch supersedes), and the subsequently
committed quicksort for external sort patch (which my new patch
extends to work in parallel).

Because I think of your preload patch as a part of the overall
parallel CREATE INDEX effort, if that was the limit of your
involvement then I'd still think it fair to credit you as my
co-author. I hope it isn't the limit of your involvement, though,
because it seems likely that the final result will be better still if
you get involved with the big patch that formally introduces parallel
CREATE INDEX.

-- 
Peter Geoghegan




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-17 Thread Heikki Linnakangas

On 09/17/2016 07:27 PM, Peter Geoghegan wrote:

On Wed, Sep 14, 2016 at 10:43 AM, Heikki Linnakangas  wrote:

Addressed all your comments one way or another, new patch attached.


So, it's clear that this isn't ready today. As I mentioned, I'm going
away for a week now. I ask that you hold off on committing this until
I return, and have a better opportunity to review the performance
characteristics of this latest revision, for one thing.


Ok. I'll probably read through it myself once more some time next week, 
and also have a first look at your actual parallel sorting patch. Have a 
good trip!


- Heikki





Re: [HACKERS] Tuplesort merge pre-reading

2016-09-17 Thread Peter Geoghegan
On Wed, Sep 14, 2016 at 10:43 AM, Heikki Linnakangas  wrote:
> Addressed all your comments one way or another, new patch attached.

So, it's clear that this isn't ready today. As I mentioned, I'm going
away for a week now. I ask that you hold off on committing this until
I return, and have a better opportunity to review the performance
characteristics of this latest revision, for one thing.

Thanks
-- 
Peter Geoghegan




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-15 Thread Peter Geoghegan
On Thu, Sep 15, 2016 at 1:51 PM, Heikki Linnakangas  wrote:
> BTW, does a 1-way merge make any sense? I was surprised to see this in the
> log, even without this patch:
>
> LOG:  finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
> STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY
> i) t;
> LOG:  finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
> STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY
> i) t;
> LOG:  finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
> STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY
> i) t;
> LOG:  finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
> STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY
> i) t;
> LOG:  finished 3-way merge step: CPU 0.62s/7.23u sec elapsed 8.44 sec
> STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY
> i) t;
> LOG:  finished 6-way merge step: CPU 0.62s/7.24u sec elapsed 8.44 sec
> STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY
> i) t;
> LOG:  finished 6-way merge step: CPU 0.62s/7.24u sec elapsed 8.45 sec
> STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY
> i) t;

Another thing that I think it worth pointing out here is that the
number of merge passes shown is excessive, practically speaking. I
suggested that we have something like checkpoint_warning for this last
year, which Greg Stark eventually got behind, but Robert Haas didn't
seem to like. Maybe this idea should be revisited. What do you think?

There is no neat explanation for why it's considered excessive to
checkpoint every 10 seconds, but not every 2 minutes. But, we warn
about the former case by default, and not the latter. It's hard to
know exactly where to draw the line, but that isn't a great reason to
not do it (maybe one extra merge pass is a good threshold -- that's
what I suggested once). I think that other database systems similarly
surface multiple merge passes. It's just inefficient to ever do
multiple merge passes, even if you're very frugal with memory.
Certainly, it's almost impossible to defend doing 3+ passes these
days.

-- 
Peter Geoghegan




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-15 Thread Peter Geoghegan
On Thu, Sep 15, 2016 at 1:51 PM, Heikki Linnakangas  wrote:
>> I still don't get why you're doing all of this within mergeruns() (the
>> beginning of when we start merging -- we merge all quicksorted runs),
>> rather than within beginmerge() (the beginning of one particular merge
>> pass, of which there are potentially more than one). As runs are
>> merged in a non-final merge pass, fewer tapes will remain active for
>> the next merge pass. It doesn't do to do all that up-front when we
>> have multiple merge passes, which will happen from time to time.
>
>
> Now that the pre-reading is done in logtape.c, it doesn't stop at a run
> boundary. For example, when we read the last 1 MB of the first run on a
> tape, and we're using a 10 MB read buffer, we will merrily also read the
> first 9 MB from the next run. You cannot un-read that data, even if the tape
> is inactive in the next merge pass.

I'm not sure that I like that approach. At the very least, it seems to
not be a good fit with the existing structure of things. I need to
think about it some more, and study how that plays out in practice.

> BTW, does a 1-way merge make any sense?

Not really, no, but it's something that I've seen plenty of times.

This is seen when runs are distributed such that mergeonerun() only
finds one real run on all active tapes, with all other active tapes
having only dummy runs. In general, dummy runs are "logically merged"
(decremented) in preference to any real runs on the same tape (that's
the reason why they exist), so you end up with what we call a "1-way
merge" when only one active tape has a real run to contribute. You never
see something like "0-way merge" within trace_sort output, though,
because that case is optimized to be almost a no-op.

It's almost a no-op because when it happens then mergeruns() knows to
itself directly decrement the number of dummy runs once for each
active tape, making that "logical merge" completed with only that
simple change in metadata (that is, the "merge" completes by just
decrementing dummy run counts -- no actual call to mergeonerun()).

We could optimize away "1-way merge" cases, perhaps, so that tuples
don't have to be spilled out one at a time (there could perhaps instead
be just some localized change to metadata, a bit like the all-dummy
case). That doesn't seem worth bothering with, especially with this
new approach of yours. I prefer to avoid special cases like that.

-- 
Peter Geoghegan




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-15 Thread Heikki Linnakangas

On 09/15/2016 10:12 PM, Peter Geoghegan wrote:

On Wed, Sep 14, 2016 at 10:43 AM, Heikki Linnakangas  wrote:

Spotted another issue with this code just now. Shouldn't it be based
on state->tapeRange? You don't want the destTape to get memory, since
you don't use batch memory for tapes that are written to (and final
on-the-fly merges don't use their destTape at all).



Wait, you're using a local variable maxTapes here, which potentially
differs from state->maxTapes:



I changed that so that it does actually change state->maxTapes. I considered
having a separate numTapes field, that can be smaller than maxTapes, but we
don't need the original maxTapes value after that point anymore, so it
would've been just pro forma to track them separately. I hope the comment
now explains that better.


I still don't get why you're doing all of this within mergeruns() (the
beginning of when we start merging -- we merge all quicksorted runs),
rather than within beginmerge() (the beginning of one particular merge
pass, of which there are potentially more than one). As runs are
merged in a non-final merge pass, fewer tapes will remain active for
the next merge pass. It doesn't do to do all that up-front when we
have multiple merge passes, which will happen from time to time.


Now that the pre-reading is done in logtape.c, it doesn't stop at a run 
boundary. For example, when we read the last 1 MB of the first run on a 
tape, and we're using a 10 MB read buffer, we will merrily also read the 
first 9 MB from the next run. You cannot un-read that data, even if the 
tape is inactive in the next merge pass.


I don't think it makes much difference in practice, because most merge 
passes use all, or almost all, of the available tapes. BTW, I think the 
polyphase algorithm prefers to do all the merges that don't use all 
tapes upfront, so that the last final merge always uses all the tapes. 
I'm not 100% sure about that, but that's my understanding of the 
algorithm, and that's what I've seen in my testing.



Correct me if I'm wrong, but I think that you're more skeptical of the
need for polyphase merge than I am. I at least see no reason to not
keep it around. I also think it has some value. It doesn't make this
optimization any harder, really.


We certainly still need multi-pass merges.

BTW, does a 1-way merge make any sense? I was surprised to see this in 
the log, even without this patch:


LOG:  finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER 
BY i) t;

LOG:  finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER 
BY i) t;

LOG:  finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER 
BY i) t;

LOG:  finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER 
BY i) t;

LOG:  finished 3-way merge step: CPU 0.62s/7.23u sec elapsed 8.44 sec
STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER 
BY i) t;

LOG:  finished 6-way merge step: CPU 0.62s/7.24u sec elapsed 8.44 sec
STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER 
BY i) t;

LOG:  finished 6-way merge step: CPU 0.62s/7.24u sec elapsed 8.45 sec
STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER 
BY i) t;


- Heikki




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-15 Thread Peter Geoghegan
On Wed, Sep 14, 2016 at 10:43 AM, Heikki Linnakangas  wrote:
> Addressed all your comments one way or another, new patch attached. Comments
> on some specific points below:

Cool. My response here is written under time pressure, which is not
ideal. I think it's still useful, though.

>> * You should probably point out that typically, access to batch memory
>> will still be sequential, despite your block-based scheme.

> That's not true, the "buffers" in batch memory are not accessed
> sequentially. When we pull the next tuple from a tape, we store it in the
> next free buffer. Usually, that buffer was used to hold the previous tuple
> that was returned from gettuple(), and was just added to the free list.
>
> It's still quite cache-friendly, though, because we only need a small number
> of slots (one for each tape).

That's kind of what I meant, I think -- it's more or less sequential.
Especially in the common case where there is only one merge pass.

> True. I fixed that by putting a MaxAllocSize cap on the buffer size instead.
> I doubt that doing > 1 GB of read-ahead of a single tape will do any good.

You may well be right about that, but ideally that could be verified.
I think that the tuplesort is entitled to have whatever memory the
user makes available, unless that's almost certainly useless. It
doesn't seem like our job to judge that it's always wrong to use extra
memory with only a small expected benefit. If it's actually a
microscopic expected benefit, or just as often negative to the sort
operation's performance, then I'd say it's okay to cap it at
MaxAllocSize. But it's not obvious to me that this is true; not yet,
anyway.

> Hmm. We don't really need the availMem accounting at all, after we have
> started merging. There is nothing we can do to free memory if we run out,
> and we use fairly little memory anyway. But yes, better safe than sorry. I
> tried to clarify the comments on that.

It is true that we don't really care about the accounting at all. But,
the same applies to the existing grow_memtuples() case at the
beginning of merging. The point is, we *do* care about availMem, this
one last time. We must at least produce a sane (i.e. >= 0) number in
any calculation. (I think you understand this already -- just saying.)

> OK. I solved that by calling LogicalTapeRewind(), when we're done reading a
> tape. Rewinding a tape has the side-effect of freeing the buffer. I was
> going to put that into mergereadnext(), but it turns out that it's tricky to
> figure out if there are any more runs on the same tape, because we have the
> "actual" tape number there, but the tp_runs is indexed by "logical" tape
> number. So I put the rewind calls at the end of mergeruns(), and in
> TSS_FINALMERGE processing, instead. It means that we don't free the buffers
> quite as early as we could, but I think this is good enough.

That seems adequate.

>> Spotted another issue with this code just now. Shouldn't it be based
>> on state->tapeRange? You don't want the destTape to get memory, since
>> you don't use batch memory for tapes that are written to (and final
>> on-the-fly merges don't use their destTape at all).

>> Wait, you're using a local variable maxTapes here, which potentially
>> differs from state->maxTapes:

> I changed that so that it does actually change state->maxTapes. I considered
> having a separate numTapes field, that can be smaller than maxTapes, but we
> don't need the original maxTapes value after that point anymore, so it
> would've been just pro forma to track them separately. I hope the comment
> now explains that better.

I still don't get why you're doing all of this within mergeruns() (the
beginning of when we start merging -- we merge all quicksorted runs),
rather than within beginmerge() (the beginning of one particular merge
pass, of which there are potentially more than one). As runs are
merged in a non-final merge pass, fewer tapes will remain active for
the next merge pass. It doesn't do to do all that up-front when we
have multiple merge passes, which will happen from time to time.

Correct me if I'm wrong, but I think that you're more skeptical of the
need for polyphase merge than I am. I at least see no reason to not
keep it around. I also think it has some value. It doesn't make this
optimization any harder, really.

> Hmm, yes, using currentRun here is wrong. It needs to be "currentRun + 1",
> because we need one more tape than there are runs, to hold the output.

As I said, I think it should be the number of active tapes, as you see
today within beginmerge() + mergebatch(). Why not do it that way? If
you don't get the distinction, see my remarks below on final merges
always using batch memory, even when there are to be multiple merge
passes (no reason to add that restriction here). More generally, I
really don't want to mess with the long standing definition of
maxTapes and things like that, because I see no advantage.

> Ah, no, the "+ 1" comes from the 

Re: [HACKERS] Tuplesort merge pre-reading

2016-09-14 Thread Heikki Linnakangas
Addressed all your comments one way or another, new patch attached. 
Comments on some specific points below:


On 09/12/2016 01:13 AM, Peter Geoghegan wrote:

Other things I noticed:

* You should probably point out that typically, access to batch memory
will still be sequential, despite your block-based scheme. The
preloading will now more or less make that the normal case. Any
fragmentation will now be essentially in memory, not on disk, which is
a big win.


That's not true, the "buffers" in batch memory are not accessed 
sequentially. When we pull the next tuple from a tape, we store it in 
the next free buffer. Usually, that buffer was used to hold the previous 
tuple that was returned from gettuple(), and was just added to the free 
list.


It's still quite cache-friendly, though, because we only need a small 
number of slots (one for each tape).



* i think you should move "bool   *mergeactive; /* active input run
source? */" within Tuplesortstate to be next to the other batch memory
stuff. No point in having separate merge and batch "sections" there
anymore.


Hmm. I think I prefer to keep the memory management stuff in a separate 
section. While it's true that it's currently only used during merging, 
it's not hard to imagine using it when building the initial runs, for 
example. Except for replacement selection, the pattern for building the 
runs is: add a bunch of tuples, sort, flush them out. It would be 
straightforward to use one large chunk of memory to hold all the tuples. 
I'm not going to do that now, but I think keeping the memory management 
stuff separate from merge-related state makes sense.



* I think that you need to comment on why state->tuplecontext is not
used for batch memory now. It is still useful, for multiple merge
passes, but the situation has notably changed for it.


Hmm. We don't actually use it after the initial runs at all anymore. I 
added a call to destroy it in mergeruns().


Now that we use the batch memory buffers for allocations < 1 kB (I 
pulled that number out of a hat, BTW), and we only need one allocation 
per tape (plus one), there's not much risk of fragmentation.



On Sun, Sep 11, 2016 at 3:13 PM, Peter Geoghegan  wrote:

* Doesn't this code need to call MemoryContextAllocHuge() rather than palloc()?:


@@ -709,18 +765,19 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool 
forWrite)
Assert(lt->frozen);
datablocknum = ltsRewindFrozenIndirectBlock(lts, lt->indirect);
}
+
+   /* Allocate a read buffer */
+   if (lt->buffer)
+   pfree(lt->buffer);
+   lt->buffer = palloc(lt->read_buffer_size);
+   lt->buffer_size = lt->read_buffer_size;


Of course, when you do that you're going to have to make the new
"buffer_size" and "read_buffer_size" fields of type "Size" (or,
possibly, "int64", to match tuplesort.c's own buffer sizing variables
ever since Noah added MaxAllocSizeHuge). Ditto for the existing "pos"
and "nbytes" fields next to "buffer_size" within the struct
LogicalTape, I think. ISTM that logtape.c blocknums can remain of type
"long", though, since that reflects an existing hardly-relevant
limitation that you're not making any worse.


True. I fixed that by putting a MaxAllocSize cap on the buffer size 
instead. I doubt that doing > 1 GB of read-ahead of a single tape will 
do any good.


I wonder if we should actually have a smaller cap there. Even 1 GB seems 
excessive. Might be better to start the merging sooner, rather than wait 
for the read of 1 GB to complete. The point of the OS readahead is that 
the OS will do that for us, in the background. And other processes might 
have better use for the memory anyway.



* It couldn't hurt to make this code paranoid about LACKMEM() becoming
true, which will cause havoc (we saw this recently in 9.6; a patch of
mine to fix that just went in):


+   /*
+* Use all the spare memory we have available for read buffers. Divide the
+* memory evenly among all the tapes.
+*/
+   avail_blocks = state->availMem / BLCKSZ;
+   per_tape = avail_blocks / maxTapes;
+   cutoff = avail_blocks % maxTapes;
+   if (per_tape == 0)
+   {
+   per_tape = 1;
+   cutoff = 0;
+   }
+   for (tapenum = 0; tapenum < maxTapes; tapenum++)
+   {
+   LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
+   (per_tape + (tapenum < cutoff ? 1 : 0)) 
* BLCKSZ);
+   }


In other words, we really don't want availMem to become < 0, since
it's int64, but a derived value is passed to
LogicalTapeAssignReadBufferSize() as an argument of type "Size". Now,
if LACKMEM() did happen it would be a bug anyway, but I recommend
defensive code also be added. Per grow_memtuples(), "We need to be
sure that we do not cause LACKMEM to become true, else the space
management algorithm will go nuts". Let's be sure that we get that
right, since, as we saw recently, especially since grow_memtuples()
will not actually have the 

Re: [HACKERS] Tuplesort merge pre-reading

2016-09-12 Thread Peter Geoghegan
On Mon, Sep 12, 2016 at 12:07 PM, Heikki Linnakangas  wrote:
> Here's a fixed version. I'll go through Peter's comments and address those,
> but I don't think there was anything there that should affect performance
> much, so I think you can proceed with your benchmarking with this version.
> (You'll also need to turn off assertions for that!)

I agree that it's unlikely that addressing any of my feedback will
result in any major change to performance.


-- 
Peter Geoghegan




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-12 Thread Heikki Linnakangas

On 09/12/2016 06:47 PM, Claudio Freire wrote:

On Mon, Sep 12, 2016 at 12:02 PM, Claudio Freire  wrote:

On Sun, Sep 11, 2016 at 12:47 PM, Heikki Linnakangas  wrote:

Here's a new version of these patches, rebased over current master. I
squashed the two patches into one, there's not much point to keep them
separate.


I don't know what was up with the other ones, but this one works fine.

Benchmarking now.


I spoke too soon, git AM had failed and I didn't notice.

regression.diffs attached

Built with

./configure --enable-debug --enable-cassert && make clean && make -j7
&& make check


Ah, of course! I had been building without assertions, as I was doing 
performance testing. With --enable-cassert, it failed for me as well 
(and there was even a compiler warning pointing out one of the issues). 
Sorry about that.


Here's a fixed version. I'll go through Peter's comments and address 
those, but I don't think there was anything there that should affect 
performance much, so I think you can proceed with your benchmarking with 
this version. (You'll also need to turn off assertions for that!)


- Heikki

>From 6101a4b91f537bf483059b0b6e8ff13d6e7be9fa Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 12 Sep 2016 22:02:34 +0300
Subject: [PATCH 1/1] Change the way pre-reading in external sort's merge phase
 works.

Don't pre-read tuples into SortTuple slots during merge. Instead, use the
memory for larger read buffers in logtape.c. We're doing the same number
of READTUP() calls either way, but managing the pre-read SortTuple slots
is much more complicated. Also, the on-tape representation is more compact
than SortTuples, so we can fit more pre-read tuples into the same amount
of memory this way. And we have better cache-locality, when we use just a
small number of SortTuple slots.

Now that we only hold one tuple from each tape in the SortTuple slots, we
can greatly simplify the "batch memory" management. We now maintain a
small set of fixed-sized buffers, to hold the tuples, and fall back to
palloc() for larger tuples. We use this method during all merge phases,
not just the final merge.
---
 src/backend/utils/sort/logtape.c   | 140 +-
 src/backend/utils/sort/tuplesort.c | 996 +++--
 src/include/utils/logtape.h|   1 +
 3 files changed, 399 insertions(+), 738 deletions(-)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 7745207..3377cef 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -79,6 +79,7 @@
 
 #include "storage/buffile.h"
 #include "utils/logtape.h"
+#include "utils/memutils.h"
 
 /*
  * Block indexes are "long"s, so we can fit this many per indirect block.
@@ -131,9 +132,12 @@ typedef struct LogicalTape
 	 * reading.
 	 */
 	char	   *buffer;			/* physical buffer (separately palloc'd) */
+	int			buffer_size;	/* allocated size of the buffer */
 	long		curBlockNumber; /* this block's logical blk# within tape */
 	int			pos;			/* next read/write position in buffer */
 	int			nbytes;			/* total # of valid bytes in buffer */
+
+	int			read_buffer_size;	/* buffer size to use when reading */
 } LogicalTape;
 
 /*
@@ -228,6 +232,53 @@ ltsReadBlock(LogicalTapeSet *lts, long blocknum, void *buffer)
 }
 
 /*
+ * Read as many blocks as we can into the per-tape buffer.
+ *
+ * The caller can specify the next physical block number to read, in
+ * datablocknum, or -1 to fetch the next block number from the internal block.
+ * If datablocknum == -1, the caller must've already set curBlockNumber.
+ *
+ * Returns true if anything was read, 'false' on EOF.
+ */
+static bool
+ltsReadFillBuffer(LogicalTapeSet *lts, LogicalTape *lt, long datablocknum)
+{
+	lt->pos = 0;
+	lt->nbytes = 0;
+
+	do
+	{
+		/* Fetch next block number (unless provided by caller) */
+		if (datablocknum == -1)
+		{
+			datablocknum = ltsRecallNextBlockNum(lts, lt->indirect, lt->frozen);
+			if (datablocknum == -1L)
+break;			/* EOF */
+			lt->curBlockNumber++;
+		}
+
+		/* Read the block */
+		ltsReadBlock(lts, datablocknum, (void *) (lt->buffer + lt->nbytes));
+		if (!lt->frozen)
+			ltsReleaseBlock(lts, datablocknum);
+
+		if (lt->curBlockNumber < lt->numFullBlocks)
+			lt->nbytes += BLCKSZ;
+		else
+		{
+			/* EOF */
+			lt->nbytes += lt->lastBlockBytes;
+			break;
+		}
+
+		/* Advance to next block, if we have buffer space left */
+		datablocknum = -1;
+	} while (lt->nbytes < lt->buffer_size);
+
+	return (lt->nbytes > 0);
+}
+
+/*
  * qsort comparator for sorting freeBlocks[] into decreasing order.
  */
 static int
@@ -546,6 +597,8 @@ LogicalTapeSetCreate(int ntapes)
 		lt->numFullBlocks = 0L;
 		lt->lastBlockBytes = 0;
 		lt->buffer = NULL;
+		lt->buffer_size = 0;
+		lt->read_buffer_size = BLCKSZ;
 		lt->curBlockNumber = 0L;
 		lt->pos = 0;
 		lt->nbytes = 0;
@@ -628,7 +681,10 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum,
 
 	/* 

Re: [HACKERS] Tuplesort merge pre-reading

2016-09-12 Thread Peter Geoghegan
On Mon, Sep 12, 2016 at 8:47 AM, Claudio Freire  wrote:
> I spoke too soon, git AM had failed and I didn't notice.

I wrote the regression test that causes Postgres to crash with the
patch applied. It tests, among other things, that CLUSTER tuples are
fixed-up by a routine like the current MOVETUP(), which is removed in
Heikki's patch. (There was a 9.6 bug where CLUSTER was broken due to
that.)

It shouldn't be too difficult for Heikki to fix this.
-- 
Peter Geoghegan




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-12 Thread Claudio Freire
On Sun, Sep 11, 2016 at 12:47 PM, Heikki Linnakangas  wrote:
> Here's a new version of these patches, rebased over current master. I
> squashed the two patches into one, there's not much point to keep them
> separate.


I don't know what was up with the other ones, but this one works fine.

Benchmarking now.




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-11 Thread Gavin Flower

On 12/09/16 12:16, Gavin Flower wrote:
[...]
 two blocks would be logically adjacent (which means they are likely 
to be physically close together, but not guaranteed!).



[...]

Actual disk layouts are quite complicated; the above is an 
oversimplification, but the message is still valid.


There are various tricks of disc layout (and low-level handling) that can 
be used to minimise the time taken to read 2 blocks that are logically 
adjacent.  I had to know this stuff for the discs that mainframe computers 
used in the 1980s - modern disk systems are way more complicated, but 
the conclusions are still valid.


I am extremely glad that I no longer have to concern myself with 
understanding the precise low-level details of discs these days - there is 
no longer a one-to-one correspondence between what the O/S thinks is a 
disk block and how the data is physically recorded on the disc.



Cheers,
Gavin






Re: [HACKERS] Tuplesort merge pre-reading

2016-09-11 Thread Gavin Flower

On 12/09/16 10:13, Peter Geoghegan wrote:

On Sun, Sep 11, 2016 at 8:47 AM, Heikki Linnakangas  wrote:

[...]

I don't know what the difference is between accessing 10 pages
randomly, and accessing a random set of 10 single pages sequentially,
in close succession. As Tom would say, that's above my pay grade. I
suppose it comes down to how close "close" actually is (but in any
case, it's all very fudged).


If you select ten pages at random and sort them, then consecutive reads 
of the sorted list are more likely to access pages in the same block or 
closely adjacent (is my understanding).


eg

blocks:  [----][----][----][----][----]
pages:    0 1   2 3   4 5   6 7   8 9

if the ten 'random pages' were selected in the random order:
6 1 7 8 4 2 9 3 0
Consecutive reads would always read new blocks, but the sorted list 
would have blocks read sequentially.


In practice, it would rarely be this simple.  However, if any of the 
random pages were in the same block, then that block would only need to 
be fetched once - similarly, if 2 of the random pages were in 
consecutive blocks, then the two blocks would be logically adjacent 
(which means they are likely to be physically close together, but not 
guaranteed!).


[...]


Cheers,
Gavin




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-11 Thread Peter Geoghegan
On Sun, Sep 11, 2016 at 3:13 PM, Peter Geoghegan  wrote:
>> +   for (tapenum = 0; tapenum < maxTapes; tapenum++)
>> +   {
>> +   LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
>> +   (per_tape + (tapenum < cutoff ? 1 : 
>> 0)) * BLCKSZ);
>> +   }

Spotted another issue with this code just now. Shouldn't it be based
on state->tapeRange? You don't want the destTape to get memory, since
you don't use batch memory for tapes that are written to (and final
on-the-fly merges don't use their destTape at all).

(Looks again...)

Wait, you're using a local variable maxTapes here, which potentially
differs from state->maxTapes:

> +   /*
> +* If we had fewer runs than tapes, refund buffers for tapes that were 
> never
> +* allocated.
> +*/
> +   maxTapes = state->maxTapes;
> +   if (state->currentRun < maxTapes)
> +   {
> +   FREEMEM(state, (maxTapes - state->currentRun) * TAPE_BUFFER_OVERHEAD);
> +   maxTapes = state->currentRun;
> +   }

I find this confusing, and think it's probably buggy. I don't think
you should have a local variable called maxTapes that you modify at
all, since state->maxTapes is supposed to not change once established.
You can't use state->currentRun like that, either, I think, because
it's the high watermark number of runs (quicksorted runs), not runs
after any particular merge phase, where we end up with fewer runs as
they're merged (we must also consider dummy runs to get this) -- we
want something like activeTapes. cf. the code you removed for the
beginmerge() finalMergeBatch case. Of course, activeTapes will vary if
there are multiple merge passes, which suggests all this code really
has no business being in mergeruns() (it should instead be in
beginmerge(), or code that beginmerge() reliably calls).

Immediately afterwards, you do this:

> +   /* Initialize the merge tuple buffer arena.  */
> +   state->batchMemoryBegin = palloc((maxTapes + 1) * MERGETUPLEBUFFER_SIZE);
> +   state->batchMemoryEnd = state->batchMemoryBegin + (maxTapes + 1) * 
> MERGETUPLEBUFFER_SIZE;
> +   state->freeBufferHead = (MergeTupleBuffer *) state->batchMemoryBegin;
> +   USEMEM(state, (maxTapes + 1) * MERGETUPLEBUFFER_SIZE);

The fact that you size the buffer based on "maxTapes + 1" also
suggests a problem. I think that the code looks like this because it
must instruct logtape.c that the destTape tape requires some buffer
(iff there is to be a non-final merge). Is that right? I hope that you
don't give the typically unused destTape tape a full share of batch
memory all the time (the same applies to any other
inactive-at-final-merge tapes).

-- 
Peter Geoghegan




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-11 Thread Peter Geoghegan
On Sun, Sep 11, 2016 at 3:13 PM, Peter Geoghegan  wrote:
> * Doesn't this code need to call MemoryContextAllocHuge() rather than 
> palloc()?:
>
>> @@ -709,18 +765,19 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, 
>> bool forWrite)
>> Assert(lt->frozen);
>> datablocknum = ltsRewindFrozenIndirectBlock(lts, lt->indirect);
>> }
>> +
>> +   /* Allocate a read buffer */
>> +   if (lt->buffer)
>> +   pfree(lt->buffer);
>> +   lt->buffer = palloc(lt->read_buffer_size);
>> +   lt->buffer_size = lt->read_buffer_size;

Of course, when you do that you're going to have to make the new
"buffer_size" and "read_buffer_size" fields of type "Size" (or,
possibly, "int64", to match tuplesort.c's own buffer sizing variables
ever since Noah added MaxAllocSizeHuge). Ditto for the existing "pos"
and "nbytes" fields next to "buffer_size" within the struct
LogicalTape, I think. ISTM that logtape.c blocknums can remain of type
"long", though, since that reflects an existing hardly-relevant
limitation that you're not making any worse.

More generally, I think you also need to explain in comments why there
is a "read_buffer_size" field in addition to the "buffer_size" field.
-- 
Peter Geoghegan




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-11 Thread Peter Geoghegan
On Sun, Sep 11, 2016 at 3:13 PM, Peter Geoghegan  wrote:
> * Please make this use the "... + 1023" natural rounding trick that is
> used in the similar traces that are removed:
>
>> +#ifdef TRACE_SORT
>> +   if (trace_sort)
>> +   elog(LOG, "using %d kB of memory for read buffers in %d tapes, %d kB 
>> per tape",
>> +(int) (state->availMem / 1024), maxTapes, (int) (per_tape * 
>> BLCKSZ) / 1024);
>> +#endif

Also, please remove the int cast, and use INT64_FORMAT. Again, this
should match existing trace_sort traces concerning batch memory.


-- 
Peter Geoghegan




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-11 Thread Peter Geoghegan
On Sun, Sep 11, 2016 at 8:47 AM, Heikki Linnakangas  wrote:
> Here's a new version of these patches, rebased over current master. I
> squashed the two patches into one, there's not much point to keep them
> separate.

I think I have my head fully around this now. For some reason, I
initially thought that this patch was a great deal more radical than
it actually is. (Like Greg, I somehow initially thought that you were
rejecting the idea of batch memory in general, and somehow (over)
leveraging the filesystem cache. I think I misunderstood your remarks
when we talked on IM about it early on.)

I don't know what the difference is between accessing 10 pages
randomly, and accessing a random set of 10 single pages sequentially,
in close succession. As Tom would say, that's above my pay grade. I
suppose it comes down to how close "close" actually is (but in any
case, it's all very fudged).

I mention this because I think that cost_sort() should be updated to
consider sequential I/O the norm, alongside this patch of yours (your
patch strengthens the argument [1] for that general idea). The reason
that this new approach produces more sequential I/O, apart from the
simple fact that you effectively have much more memory available and
so fewer rounds of preloading, is that the indirection block stuff can
make I/O less sequential in order to support eager reclamation of
space. For example, maybe there is interleaving of blocks as logtape.c
manages to reclaim blocks in the event of multiple merge steps. I care
about that second factor a lot more now than I would have a year ago,
when a final on-the-fly merge generally avoids multiple passes (and
associated logtape.c block fragmentation), because parallel CREATE
INDEX is usually affected by that (workers will often want to do their
own merge ahead of the leader's final merge), and because I want to
cap the number of tapes used, which will make multiple passes a bit
more common in practice.

I was always suspicious of the fact that memtuples is so large during
merging, and had a vague plan to fix that (although I was the one
responsible for growing memtuples even more for the merge in 9.6, that
was just because under the status quo of having many memtuples during
the merge, the size of memtuples should at least be in balance with
remaining memory for caller tuples -- it wasn't an endorsement of the
status quo). However, it never occurred to me to do that by pushing
batch memory into the head of logtape.c, which now seems like an
excellent idea.

To summarize my understanding of this patch: If it wasn't for my work
on parallel CREATE INDEX, I would consider this patch to give us only
a moderate improvement to user-visible performance, since it doesn't
really help memory rich cases very much (cases that are not likely to
have much random I/O anyway). In that universe, I'd be more
appreciative of the patch as a simplifying refactoring exercise. While
it's nice to get a boost for more memory-constrained cases, it's not a
huge win. However, that's not the universe we live
in -- I *am* working on parallel CREATE INDEX, of course. And so, I
really am very excited about this patch, because it really helps with
the particular challenges I have there, even though it's fair to
assume that we have a reasonable amount of memory available when
parallelism is used. If workers are going to do their own merges, as
they often will, then multiple merge pass cases become far more
important, and any additional I/O is a real concern, *especially*
additional random I/O (say from logtape.c fragmentation). The patch
directly addresses that, which is great. Your patch, alongside my
just-committed patch concerning how we maintain the heap invariant,
together attack the merge bottleneck from two different directions:
they address I/O costs, and CPU costs, respectively.

Other things I noticed:

* You should probably point out that typically, access to batch memory
will still be sequential, despite your block-based scheme. The
preloading will now more or less make that the normal case. Any
fragmentation will now be essentially in memory, not on disk, which is
a big win.

* I think that logtape.c header comments are needed for this. Maybe
that's where you should point out that memory access is largely
sequential. But it's surely true that logtape.c needs to take
responsibility for being the place where the vast majority of memory
is allocated during merging.

* I think you should move "bool   *mergeactive; /* active input run
source? */" within Tuplesortstate to be next to the other batch memory
stuff. No point in having separate merge and batch "sections" there
anymore.

* You didn't carry over my 0002-* batch memory patch modifications to
comments, even though you should have in a few cases. There remain
some references in comments to batch memory, as a thing exclusively
usable by final on-the-fly merges. That's not true anymore -- it's
usable by final merges, too. For 

Re: [HACKERS] Tuplesort merge pre-reading

2016-09-11 Thread Heikki Linnakangas
Here's a new version of these patches, rebased over current master. I 
squashed the two patches into one, there's not much point to keep them 
separate.


- Heikki

From 6e3813d876cf3efbe5f1b80c45f44ed5494304ab Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Sun, 11 Sep 2016 18:41:44 +0300
Subject: [PATCH 1/1] Change the way pre-reading in external sort's merge phase
 works.

Don't pre-read tuples into SortTuple slots during merge. Instead, use the
memory for larger read buffers in logtape.c. We're doing the same number
of READTUP() calls either way, but managing the pre-read SortTuple slots
is much more complicated. Also, the on-tape representation is more compact
than SortTuples, so we can fit more pre-read tuples into the same amount
of memory this way. And we have better cache-locality, when we use just a
small number of SortTuple slots.

Now that we only hold one tuple from each tape in the SortTuple slots, we
can greatly simplify the "batch memory" management. We now maintain a
small set of fixed-sized buffers, to hold the tuples, and fall back to
palloc() for larger tuples. We use this method during all merge phases,
not just the final merge.
---
 src/backend/utils/sort/logtape.c   | 134 -
 src/backend/utils/sort/tuplesort.c | 984 +++--
 src/include/utils/logtape.h|   1 +
 3 files changed, 389 insertions(+), 730 deletions(-)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 7745207..05d7697 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -131,9 +131,12 @@ typedef struct LogicalTape
 	 * reading.
 	 */
 	char	   *buffer;			/* physical buffer (separately palloc'd) */
+	int			buffer_size;	/* allocated size of the buffer */
 	long		curBlockNumber; /* this block's logical blk# within tape */
 	int			pos;			/* next read/write position in buffer */
 	int			nbytes;			/* total # of valid bytes in buffer */
+
+	int			read_buffer_size;	/* buffer size to use when reading */
 } LogicalTape;
 
 /*
@@ -228,6 +231,53 @@ ltsReadBlock(LogicalTapeSet *lts, long blocknum, void *buffer)
 }
 
 /*
+ * Read as many blocks as we can into the per-tape buffer.
+ *
+ * The caller can specify the next physical block number to read, in
+ * datablocknum, or -1 to fetch the next block number from the internal block.
+ * If datablocknum == -1, the caller must've already set curBlockNumber.
+ *
+ * Returns true if anything was read, 'false' on EOF.
+ */
+static bool
+ltsReadFillBuffer(LogicalTapeSet *lts, LogicalTape *lt, long datablocknum)
+{
+	lt->pos = 0;
+	lt->nbytes = 0;
+
+	do
+	{
+		/* Fetch next block number (unless provided by caller) */
+		if (datablocknum == -1)
+		{
+			datablocknum = ltsRecallNextBlockNum(lts, lt->indirect, lt->frozen);
+			if (datablocknum == -1L)
+break;			/* EOF */
+			lt->curBlockNumber++;
+		}
+
+		/* Read the block */
+		ltsReadBlock(lts, datablocknum, (void *) (lt->buffer + lt->nbytes));
+		if (!lt->frozen)
+			ltsReleaseBlock(lts, datablocknum);
+
+		if (lt->curBlockNumber < lt->numFullBlocks)
+			lt->nbytes += BLCKSZ;
+		else
+		{
+			/* EOF */
+			lt->nbytes += lt->lastBlockBytes;
+			break;
+		}
+
+		/* Advance to next block, if we have buffer space left */
+		datablocknum = -1;
+	} while (lt->nbytes < lt->buffer_size);
+
+	return (lt->nbytes > 0);
+}
+
+/*
  * qsort comparator for sorting freeBlocks[] into decreasing order.
  */
 static int
@@ -546,6 +596,8 @@ LogicalTapeSetCreate(int ntapes)
 		lt->numFullBlocks = 0L;
 		lt->lastBlockBytes = 0;
 		lt->buffer = NULL;
+		lt->buffer_size = 0;
+		lt->read_buffer_size = BLCKSZ;
 		lt->curBlockNumber = 0L;
 		lt->pos = 0;
 		lt->nbytes = 0;
@@ -628,7 +680,10 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum,
 
 	/* Allocate data buffer and first indirect block on first write */
 	if (lt->buffer == NULL)
+	{
 		lt->buffer = (char *) palloc(BLCKSZ);
+		lt->buffer_size = BLCKSZ;
+	}
 	if (lt->indirect == NULL)
 	{
 		lt->indirect = (IndirectBlock *) palloc(sizeof(IndirectBlock));
@@ -636,6 +691,7 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum,
 		lt->indirect->nextup = NULL;
 	}
 
+	Assert(lt->buffer_size == BLCKSZ);
 	while (size > 0)
 	{
 		if (lt->pos >= BLCKSZ)
@@ -709,18 +765,19 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
 			Assert(lt->frozen);
 			datablocknum = ltsRewindFrozenIndirectBlock(lts, lt->indirect);
 		}
+
+		/* Allocate a read buffer */
+		if (lt->buffer)
+			pfree(lt->buffer);
+		lt->buffer = palloc(lt->read_buffer_size);
+		lt->buffer_size = lt->read_buffer_size;
+
 		/* Read the first block, or reset if tape is empty */
 		lt->curBlockNumber = 0L;
 		lt->pos = 0;
 		lt->nbytes = 0;
 		if (datablocknum != -1L)
-		{
-			ltsReadBlock(lts, datablocknum, (void *) lt->buffer);
-			if (!lt->frozen)
-ltsReleaseBlock(lts, datablocknum);
-			lt->nbytes = (lt->curBlockNumber < lt->numFullBlocks) ?
-BLCKSZ : lt->lastBlockBytes;
-		}
+			

Re: [HACKERS] Tuplesort merge pre-reading

2016-09-10 Thread Peter Geoghegan
On Sat, Sep 10, 2016 at 12:04 AM, Heikki Linnakangas  wrote:
>> Did I misunderstand something? I'm applying the first patch of Peter's
>> series (cap number of tapes), then your first one (remove prefetch)
>> and second one (use larger read buffers).
>
>
> Yes. I have been testing without Peter's first patch, with just the two
> patches I posted. But it should work together (and does, I just tested) with
> that one as well.

You're going to need to rebase, since the root displace patch is based
on top of my patch 0002-*, not Heikki's alternative. But, that should
be very straightforward.

-- 
Peter Geoghegan




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-10 Thread Heikki Linnakangas

On 09/10/2016 04:21 AM, Claudio Freire wrote:

On Fri, Sep 9, 2016 at 9:51 PM, Claudio Freire  wrote:

On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas  wrote:


Claudio, if you could also repeat the tests you ran on Peter's patch set on
the other thread, with these patches, that'd be nice. These patches are
effectively a replacement for
0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
would be much appreciated too, of course.

Attached are new versions. Compared to last set, they contain a few comment
fixes, and a change to the 2nd patch to not allocate tape buffers for tapes
that were completely unused.


Will do so


Thanks!


It seems both 1 and 1+2 break make check.


Oh. Works for me. What's the failure you're getting?


Did I misunderstand something? I'm applying the first patch of Peter's
series (cap number of tapes), then your first one (remove prefetch)
and second one (use larger read buffers).


Yes. I have been testing without Peter's first patch, with just the two 
patches I posted. But it should work together (and does, I just tested) 
with that one as well.


- Heikki





Re: [HACKERS] Tuplesort merge pre-reading

2016-09-09 Thread Claudio Freire
On Fri, Sep 9, 2016 at 9:51 PM, Claudio Freire  wrote:
> On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas  wrote:
>>
>> Claudio, if you could also repeat the tests you ran on Peter's patch set on
>> the other thread, with these patches, that'd be nice. These patches are
>> effectively a replacement for
>> 0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
>> would be much appreciated too, of course.
>>
>> Attached are new versions. Compared to last set, they contain a few comment
>> fixes, and a change to the 2nd patch to not allocate tape buffers for tapes
>> that were completely unused.
>
>
> Will do so

It seems both 1 and 1+2 break make check.

Did I misunderstand something? I'm applying the first patch of Peter's
series (cap number of tapes), then your first one (remove prefetch)
and second one (use larger read buffers).

Peter's patch needs some rebasing on top of those but nothing major.




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-09 Thread Claudio Freire
On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas  wrote:
>
> Claudio, if you could also repeat the tests you ran on Peter's patch set on
> the other thread, with these patches, that'd be nice. These patches are
> effectively a replacement for
> 0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
> would be much appreciated too, of course.
>
> Attached are new versions. Compared to last set, they contain a few comment
> fixes, and a change to the 2nd patch to not allocate tape buffers for tapes
> that were completely unused.


Will do so




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-09 Thread Peter Geoghegan
On Fri, Sep 9, 2016 at 5:25 AM, Greg Stark  wrote:
> Wow, this is really cool. We should do something like this for query
> execution too.

We should certainly do this for tuplestore.c, too. I've been meaning
to adapt it to use batch memory. I did look at it briefly, and recall
that it was surprisingly awkward because a large number
of callers want to have memory that they can manage independently of
the lifetime of their tuplestore.


-- 
Peter Geoghegan




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-09 Thread Peter Geoghegan
On Fri, Sep 9, 2016 at 4:55 AM, Heikki Linnakangas  wrote:
> I'm happy with the amount of testing I've done now, and the results. Does
> anyone want to throw out any more test cases where there might be a
> regression? If not, let's get these reviewed and committed.

I'll try to look at this properly tomorrow. Currently still working
away at creating a new revision of my sorting patchset. Obviously this
is interesting, but it raises certain questions for the parallel
CREATE INDEX patch in particular that I'd like to get straight, aside
from everything else.

I've been using an AWS d2.4xlarge instance for testing all my recent
sort patches, with 16 vCPUs, 122 GiB RAM, 12 x 2 TB disks. It worked
well to emphasize I/O throughput and parallelism over latency. I'd
like to investigate how this pre-reading stuff does there. I recall
that for one very large case, it took a full minute to do just the
first round of preloading during the leader's final merge (this was
with something like 50GB of maintenance_work_mem). So, it will be
interesting.

BTW, noticed a typo here:

> + * track memory usage of indivitual tuples.

-- 
Peter Geoghegan




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-09 Thread Heikki Linnakangas

On 09/09/2016 03:25 PM, Greg Stark wrote:

On Fri, Sep 9, 2016 at 1:01 PM, Heikki Linnakangas  wrote:

I'm happy with what it looks like. We are in fact getting a more sequential
access pattern with these patches, because we're not expanding the pre-read
tuples into SortTuples. Keeping densely-packed blocks in memory, instead of
SortTuples, allows caching more data overall.



Wow, this is really cool. We should do something like this for query
execution too.

I still didn't follow exactly why removing the prefetching allows more
sequential i/o. I thought the whole point of prefetching was to reduce
the random i/o from switching tapes.


The first patch removed prefetching, but the second patch re-introduced 
it, in a different form. The prefetching is now done in logtape.c, by 
reading multiple pages at a time. The on-tape representation of tuples 
is more compact than having them in memory as SortTuples, so you can fit 
more data in memory overall, which makes the access pattern more sequential.


There's one difference between these approaches that I didn't point out 
earlier: We used to prefetch tuples from each *run*, and stopped 
pre-reading when we reached the end of the run. Now that we're doing the 
prefetching as raw tape blocks, we don't stop at run boundaries. I don't 
think that makes any big difference one way or another, but I thought 
I'd mention it.


- Heikki





Re: [HACKERS] Tuplesort merge pre-reading

2016-09-09 Thread Greg Stark
On Fri, Sep 9, 2016 at 1:01 PM, Heikki Linnakangas  wrote:
> I'm happy with what it looks like. We are in fact getting a more sequential
> access pattern with these patches, because we're not expanding the pre-read
> tuples into SortTuples. Keeping densely-packed blocks in memory, instead of
> SortTuples, allows caching more data overall.


Wow, this is really cool. We should do something like this for query
execution too.

I still didn't follow exactly why removing the prefetching allows more
sequential i/o. I thought the whole point of prefetching was to reduce
the random i/o from switching tapes.

-- 
greg




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-09 Thread Heikki Linnakangas

On 09/08/2016 09:59 PM, Heikki Linnakangas wrote:

On 09/06/2016 10:26 PM, Peter Geoghegan wrote:

On Tue, Sep 6, 2016 at 12:08 PM, Peter Geoghegan  wrote:

Offhand, I would think that taken together this is very important. I'd
certainly want to see cases in the hundreds of megabytes or gigabytes
of work_mem alongside your 4MB case, even just to be able to talk
informally about this. As you know, the default work_mem value is very
conservative.


I spent some more time polishing this up, and also added some code to
logtape.c, to use larger read buffers, to compensate for the fact that
we don't do pre-reading from tuplesort.c anymore. That should trigger
the OS read-ahead, and make the I/O more sequential, like was the
purpose of the old pre-reading code. But simpler. I haven't tested that
part much yet, but I plan to run some tests on larger data sets that
don't fit in RAM, to make the I/O effects visible.


Ok, I ran a few tests with 20 GB tables. I thought this would show any 
differences in I/O behaviour, but in fact it was still completely CPU 
bound, like the tests on smaller tables I posted yesterday. I guess I 
need to point temp_tablespaces to a USB drive or something. But here we go.


It looks like there was a regression when sorting random text, with 256 
MB work_mem. I suspect that was a fluke - I only ran these tests once 
because they took so long. But I don't know for sure.


Claudio, if you could also repeat the tests you ran on Peter's patch set 
on the other thread, with these patches, that'd be nice. These patches 
are effectively a replacement for 
0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review 
would be much appreciated too, of course.


Attached are new versions. Compared to last set, they contain a few 
comment fixes, and a change to the 2nd patch to not allocate tape 
buffers for tapes that were completely unused.


- Heikki

~/git-sandbox/postgresql/src/test/sort (sort-speedups)$ PGDATABASE=sorttest 
LD_LIBRARY_PATH=~/pgsql.master/lib ./speed
# Tests on large-sized tables (20 GB), 16MB work_mem
-
random_ints: 1106152 ms
random_text: 2941289 ms

# Tests on large-sized tables (20 GB), 256MB work_mem
-
random_ints: 624327 ms
random_text: 1551049 ms

# Tests on large-sized tables (20 GB), 512MB work_mem
-
random_ints: 537217 ms
random_text: 1552546 ms

# Tests on large-sized tables (20 GB), 2GB work_mem
-
random_ints: 525460 ms
random_text: 1619040 ms
~/git-sandbox/postgresql/src/test/sort (sort-speedups)$ PGDATABASE=sorttest 
LD_LIBRARY_PATH=~/pgsql.master/lib ./speed
# Tests on large-sized tables (20 GB), 16MB work_mem
-
random_ints: 755071 ms
random_text: 3054989 ms

# Tests on large-sized tables (20 GB), 256MB work_mem
-
random_ints: 500955 ms
random_text: 1797228 ms

# Tests on large-sized tables (20 GB), 512MB work_mem
-
random_ints: 413773 ms
random_text: 1520891 ms

# Tests on large-sized tables (20 GB), 2GB work_mem
-
random_ints: 400744 ms
random_text: 1667530 ms

From 90137ebfac0d5f2e80e2fb24cd12bfb664367f5d Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Fri, 9 Sep 2016 14:10:05 +0300
Subject: [PATCH 1/2] Don't bother to pre-read tuples into SortTuple slots
 during merge.

That only seems to add overhead. We're doing the same number of READTUP()
calls either way, but we're spreading the memory usage over a larger area
if we try to pre-read, so it doesn't seem worth it.

The pre-reading can be helpful, to trigger the OS readahead of the
underlying tape, because it will make the read pattern appear more
sequential. But we'll fix that in the next patch, by teaching logtape.c to
read in larger chunks.
---
 src/backend/utils/sort/tuplesort.c | 903 ++---
 1 file changed, 226 insertions(+), 677 deletions(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index c8fbcf8..a6d167a 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -162,7 +162,7 @@ bool		optimize_bounded_sort = true;
  * The objects we actually sort are SortTuple structs.  These contain
  * a pointer to the tuple proper (might be a MinimalTuple or IndexTuple),
  * which is a separate palloc chunk --- we assume it is just one chunk and
- * can be freed by a simple pfree() (except during final on-the-fly merge,
+ * can be freed by a simple pfree() (except during merge,
  * when memory is used in batch).  SortTuples also contain the tuple's
  * first key column in Datum/nullflag format, and an index integer.
  *
@@ -191,9 +191,8 @@ bool		optimize_bounded_sort = true;
  * it now only distinguishes RUN_FIRST and HEAP_RUN_NEXT, since replacement
  * selection is always abandoned after the first run; no other run number
  * should be represented here.  During merge passes, we re-use it to hold the
- * input tape number that each tuple in the heap was read from, or to hold the
- * index of the next tuple 

Re: [HACKERS] Tuplesort merge pre-reading

2016-09-08 Thread Heikki Linnakangas

On 09/06/2016 10:26 PM, Peter Geoghegan wrote:

On Tue, Sep 6, 2016 at 12:08 PM, Peter Geoghegan  wrote:

Offhand, I would think that taken together this is very important. I'd
certainly want to see cases in the hundreds of megabytes or gigabytes
of work_mem alongside your 4MB case, even just to be able to talk
informally about this. As you know, the default work_mem value is very
conservative.


I spent some more time polishing this up, and also added some code to 
logtape.c, to use larger read buffers, to compensate for the fact that 
we don't do pre-reading from tuplesort.c anymore. That should trigger 
the OS read-ahead, and make the I/O more sequential, like was the 
purpose of the old pre-reading code. But simpler. I haven't tested that 
part much yet, but I plan to run some tests on larger data sets that 
don't fit in RAM, to make the I/O effects visible.


I wrote a little testing toolkit, see third patch. I'm not proposing to 
commit that, but that's what I used for testing. It creates four tables, 
about 1GB in size each (it also creates smaller and larger tables, but I 
used the "medium" sized ones for these tests). Two of the tables contain 
integers, and two contain text strings. Two of the tables are completely 
ordered, two are in random order. To measure, it runs ORDER BY queries 
on the tables, with different work_mem settings.


Attached are the full results. In summary, these patches improve 
performance in some of the tests, and are a wash on others. The patches 
help in particular in the randomly ordered cases, with up to 512 MB of 
work_mem.


For example, with work_mem=256MB, which is enough to get a single merge 
pass:


with patches:

ordered_ints: 7078 ms, 6910 ms, 6849 ms
random_ints: 15639 ms, 15575 ms, 15625 ms
ordered_text: 11121 ms, 12318 ms, 10824 ms
random_text: 53462 ms, 53420 ms, 52949 ms

unpatched master:

ordered_ints: 6961 ms, 7040 ms, 7044 ms
random_ints: 19205 ms, 18797 ms, 18955 ms
ordered_text: 11045 ms, 11377 ms, 11203 ms
random_text: 57117 ms, 54489 ms, 54806 ms

(The same queries were run three times in a row, that's what the three 
numbers on each row mean. I.e. the differences between the numbers on
the same row are noise.)



It looks like your benchmark relies on multiple passes, which can be
misleading. I bet it suffers some amount of problems from palloc()
fragmentation. When very memory constrained, that can get really bad.

Non-final merge passes (merges with more than one run -- real or dummy
-- on any given tape) can have uneven numbers of runs on each tape.
So, tuplesort.c needs to be prepared to dole out memory among tapes
*unevenly* there (same applies to memtuples "slots"). This is why
batch memory support is so hard for those cases (the fact that they're
so rare anyway also puts me off it). As you know, I wrote a patch that
adds batch memory support to cases that require randomAccess (final
output on a materialized tape), for their final merge. These final
merges happen to not be a final on-the-fly merge only due to this
randomAccess requirement from caller. It's possible to support these
cases in the future, with that patch, only because I am safe to assume
that each run/tape is the same size there (well, the assumption is
exactly as safe as it was for the 9.6 final on-the-fly merge, at
least).

My point about non-final merges is that you have to be very careful
that you're comparing apples to apples, memory accounting wise, when
looking into something like this. I'm not saying that you didn't, but
it's worth considering.


I'm not 100% sure I'm accounting for all the memory correctly. But I 
didn't touch the way the initial quicksort works, nor the way the runs 
are built. And the merge passes don't actually need or benefit from a 
lot of memory, so I doubt it's very sensitive to that.


In this patch, the memory available for the read buffers is just divided 
evenly across maxTapes. The buffers for the tapes that are not currently 
active are wasted. It could be made smarter, by freeing all the 
currently-unused buffers for tapes that are not active at the moment. 
Might do that later, but this is what I'm going to benchmark for now. I 
don't think adding buffers is helpful beyond a certain point, so this is 
probably good enough in practice. Although it would be nice to free the 
memory we don't need earlier, in case there are other processes that 
could make use of it.



FWIW, I did try an increase in the buffer size in LogicalTape at one
time several months back, and so no benefit there (at least, with no
other changes).


Yeah, unless you get rid of the pre-reading in tuplesort.c, you're just 
double-buffering.


- Heikki

From d4d89c88c5e26be70c976a756e874af65ad6ec55 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Thu, 8 Sep 2016 14:31:31 +0300
Subject: [PATCH 1/3] Don't bother to pre-read tuples into SortTuple slots
 during merge.

That only seems to add overhead. We're doing the same 

Re: [HACKERS] Tuplesort merge pre-reading

2016-09-06 Thread Peter Geoghegan
On Tue, Sep 6, 2016 at 12:08 PM, Peter Geoghegan  wrote:
> Offhand, I would think that taken together this is very important. I'd
> certainly want to see cases in the hundreds of megabytes or gigabytes
> of work_mem alongside your 4MB case, even just to be able to talk
> informally about this. As you know, the default work_mem value is very
> conservative.

It looks like your benchmark relies on multiple passes, which can be
misleading. I bet it suffers some amount of problems from palloc()
fragmentation. When very memory constrained, that can get really bad.

Non-final merge passes (merges with more than one run -- real or dummy
-- on any given tape) can have uneven numbers of runs on each tape.
So, tuplesort.c needs to be prepared to dole out memory among tapes
*unevenly* there (same applies to memtuples "slots"). This is why
batch memory support is so hard for those cases (the fact that they're
so rare anyway also puts me off it). As you know, I wrote a patch that
adds batch memory support to cases that require randomAccess (final
output on a materialized tape), for their final merge. These final
merges happen to not be a final on-the-fly merge only due to this
randomAccess requirement from caller. It's possible to support these
cases in the future, with that patch, only because I am safe to assume
that each run/tape is the same size there (well, the assumption is
exactly as safe as it was for the 9.6 final on-the-fly merge, at
least).

My point about non-final merges is that you have to be very careful
that you're comparing apples to apples, memory accounting wise, when
looking into something like this. I'm not saying that you didn't, but
it's worth considering.

FWIW, I did try an increase in the buffer size in LogicalTape at one
time several months back, and saw no benefit there (at least, with no
other changes).

-- 
Peter Geoghegan




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-06 Thread Peter Geoghegan
On Tue, Sep 6, 2016 at 5:20 AM, Heikki Linnakangas  wrote:
> I wrote a quick patch to test that, attached. It seems to improve
> performance, at least in this small test case:
>
> create table lotsofints(i integer);
> insert into lotsofints select random() * 10.0 from
> generate_series(1, 1000);
> vacuum freeze;
>
> select count(*) FROM (select * from lotsofints order by i) t;
>
> On my laptop, with default work_mem=4MB, that select takes 7.8 s on
> unpatched master, and 6.2 s with the attached patch.

The benefits have a lot to do with OS read-ahead, and efficient use of
memory bandwidth during the merge, where we want to access the caller
tuples sequentially per tape (i.e. that's what the batch memory stuff
added -- it also made much better use of available memory). Note that
I've been benchmarking the parallel CREATE INDEX patch on a server
with many HDDs, since sequential performance is mostly all that
matters. I think that in 1999, the preloading had a lot more to do
with logtape.c's ability to aggressively recycle blocks during merges,
such that the total storage overhead does not exceed the original size
of the caller tuples (plus what it calls "trivial bookkeeping
overhead" IIRC). That's less important these days, but still matters
some (it's more of an issue when you can't complete the sort in one
pass, which is rare these days).

Offhand, I would think that taken together this is very important. I'd
certainly want to see cases in the hundreds of megabytes or gigabytes
of work_mem alongside your 4MB case, even just to be able to talk
informally about this. As you know, the default work_mem value is very
conservative.

-- 
Peter Geoghegan

