Re: [HACKERS] memory leak in e94568ecc10 (pre-reading in external sort)

2016-10-11 Thread Heikki Linnakangas

On 10/11/2016 12:56 AM, Peter Geoghegan wrote:

Also, something about trace_sort here:


+#ifdef TRACE_SORT
+   if (trace_sort)
+   elog(LOG, "using " INT64_FORMAT " KB of memory for read buffers among %d 
input tapes",
+(state->availMem) / 1024, numInputTapes);
+#endif
+
+   state->read_buffer_size = state->availMem / numInputTapes;
+   USEMEM(state, state->availMem);


Maybe you should just make the trace_sort output talk about blocks at
this point?


With # of blocks, you then have to mentally divide by 8 to get the 
actual memory used. I think kB is nicer in messages that are meant to be 
read by humans.


The bigger problem with this message is that it's not very accurate 
anymore. The actual amount of memory used is rounded down, and capped by 
MaxAllocSize*numInputTapes. And would it be better to print the per-tape 
buffer size, rather than the total?


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] memory leak in e94568ecc10 (pre-reading in external sort)

2016-10-10 Thread Peter Geoghegan
On Mon, Oct 10, 2016 at 5:21 AM, Heikki Linnakangas  wrote:
> Admittedly that's confusing. Thinking about this some more, I came up with
> the attached. I removed the separate LogicalTapeAssignReadBufferSize() call
> altogether - the read buffer size is now passed as argument to the
> LogicalTapeRewindForRead() call.
>
> I split LogicalTapeRewind() into two: LogicalTapeRewindForRead() and
> LogicalTapeRewindForWrite(). A boolean argument like 'forWrite' is mentally
> challenging, because when reading a call site, you have to remember which
> way round the boolean is. And now you also pass the extra buffer_size
> argument when rewinding for reading, but not when writing.

I like the general idea here.

> I gave up on the logic to calculate the quotient and reminder of the
> available memory, and assigning one extra buffer to some of the tapes. I'm
> now just rounding it down to the nearest BLCKSZ boundary. That simplifies
> the code further, although we're now not using all the memory we could. I'm
> pretty sure that's OK, although I haven't done any performance testing of
> this.

The only thing I notice is that this new struct field could use a comment:

> --- a/src/backend/utils/sort/tuplesort.c
> +++ b/src/backend/utils/sort/tuplesort.c
> @@ -366,6 +366,8 @@ struct Tuplesortstate
> char   *slabMemoryEnd;  /* end of slab memory arena */
> SlabSlot   *slabFreeHead;   /* head of free list */
>
> +   size_t  read_buffer_size;
> +

Also, something about trace_sort here:

> +#ifdef TRACE_SORT
> +   if (trace_sort)
> +   elog(LOG, "using " INT64_FORMAT " KB of memory for read buffers among 
> %d input tapes",
> +(state->availMem) / 1024, numInputTapes);
> +#endif
> +
> +   state->read_buffer_size = state->availMem / numInputTapes;
> +   USEMEM(state, state->availMem);

Maybe you should just make the trace_sort output talk about blocks at
this point?

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] memory leak in e94568ecc10 (pre-reading in external sort)

2016-10-10 Thread Heikki Linnakangas

On 10/06/2016 06:44 PM, Peter Geoghegan wrote:

While the fix you pushed was probably a good idea anyway, I still
think you should not use swhichtate->maxTapes to exhaustively call
LogicalTapeAssignReadBufferSize() on every tape, even non-active
tapes. That's the confusing part.


Admittedly that's confusing. Thinking about this some more, I came up 
with the attached. I removed the separate 
LogicalTapeAssignReadBufferSize() call altogether - the read buffer size 
is now passed as argument to the LogicalTapeRewindForRead() call.


I split LogicalTapeRewind() into two: LogicalTapeRewindForRead() and 
LogicalTapeRewindForWrite(). A boolean argument like 'forWrite' is 
mentally challenging, because when reading a call site, you have to 
remember which way round the boolean is. And now you also pass the extra 
buffer_size argument when rewinding for reading, but not when writing.


I gave up on the logic to calculate the quotient and reminder of the 
available memory, and assigning one extra buffer to some of the tapes. 
I'm now just rounding it down to the nearest BLCKSZ boundary. That 
simplifies the code further, although we're now not using all the memory 
we could. I'm pretty sure that's OK, although I haven't done any 
performance testing of this.


- Heikki

>From 9862ff0cd184afc50515854b267e3a6456547ac6 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 10 Oct 2016 15:20:05 +0300
Subject: [PATCH 1/1] Simplify the code for logical tape read buffers.

Pass the buffer size as argument to LogicalTapeRewindForRead, rather than
setting it earlier with the separate LogicTapeAssignReadBufferSize call.
This way, the buffer size is set closer to where it's actually used, which
makes the code easier to understand.

This makes the calculation for how much memory to use for the buffers less
precise. We now use the same amount of memory for every tape, rounded down
to the nearest BLCKSZ boundary, instead of using one more block for some
tapes, to get the total up to exact amount of memory available. That should
be OK, merging isn't too sensitive to the exact amount of memory used.
---
 src/backend/utils/sort/logtape.c   | 214 ++---
 src/backend/utils/sort/tuplesort.c | 119 ++---
 src/include/utils/logtape.h|   5 +-
 3 files changed, 137 insertions(+), 201 deletions(-)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 6cc06b2..25090c7 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -141,14 +141,6 @@ typedef struct LogicalTape
 	long		curBlockNumber; /* this block's logical blk# within tape */
 	int			pos;			/* next read/write position in buffer */
 	int			nbytes;			/* total # of valid bytes in buffer */
-
-	/*
-	 * Desired buffer size to use when reading.  To keep things simple, we use
-	 * a single-block buffer when writing, or when reading a frozen tape.  But
-	 * when we are reading and will only read forwards, we allocate a larger
-	 * buffer, determined by read_buffer_size.
-	 */
-	int			read_buffer_size;
 } LogicalTape;
 
 /*
@@ -609,7 +601,6 @@ LogicalTapeSetCreate(int ntapes)
 		lt->lastBlockBytes = 0;
 		lt->buffer = NULL;
 		lt->buffer_size = 0;
-		lt->read_buffer_size = BLCKSZ;
 		lt->curBlockNumber = 0L;
 		lt->pos = 0;
 		lt->nbytes = 0;
@@ -739,13 +730,20 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum,
 }
 
 /*
- * Rewind logical tape and switch from writing to reading or vice versa.
+ * Rewind logical tape and switch from writing to reading.
  *
- * Unless the tape has been "frozen" in read state, forWrite must be the
- * opposite of the previous tape state.
+ * The tape must currently be in writing state, or "frozen" in read state.
+ *
+ * 'buffer_size' specifies how much memory to use for the read buffer.  It
+ * does not include the memory needed for the indirect blocks.  Regardless
+ * of the argument, the actual amount of memory used is between BLCKSZ and
+ * MaxAllocSize, and is a multiple of BLCKSZ.  The given value is rounded
+ * down and truncated to fit those constraints, if necessary.  If the tape
+ * is frozen, the 'buffer_size' argument is ignored, and a small BLCKSZ byte
+ * buffer is used.
  */
 void
-LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
+LogicalTapeRewindForRead(LogicalTapeSet *lts, int tapenum, size_t buffer_size)
 {
 	LogicalTape *lt;
 	long		datablocknum;
@@ -753,88 +751,111 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
 	Assert(tapenum >= 0 && tapenum < lts->nTapes);
 	lt = >tapes[tapenum];
 
-	if (!forWrite)
+	/*
+	 * Round and cap the buffer_size if needed.
+	 */
+	if (lt->frozen)
+		buffer_size = BLCKSZ;
+	else
 	{
-		if (lt->writing)
-		{
-			/*
-			 * Completion of a write phase.  Flush last partial data block,
-			 * flush any partial indirect blocks, rewind for normal
-			 * (destructive) read.
-			 */
-			if (lt->dirty)
-ltsDumpBuffer(lts, 

Re: [HACKERS] memory leak in e94568ecc10 (pre-reading in external sort)

2016-10-06 Thread Peter Geoghegan
On Thu, Oct 6, 2016 at 8:44 AM, Peter Geoghegan  wrote:
> Besides, what I propose to do is really exactly the same as what you
> also want to do, except it avoids actually changing state->maxTapes.
> We'd just pass down what you propose to assign to state->maxTapes
> directly, which differs (and not just in the common case where there
> are inactive tapes -- it's always at least off-by-one). Right?

What I mean is that you should pass down numTapes alongside
numInputTapes. The function init_tape_buffers() could either have an
additional argument (numTapes), or derive numTapes itself.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] memory leak in e94568ecc10 (pre-reading in external sort)

2016-10-06 Thread Peter Geoghegan
On Thu, Oct 6, 2016 at 12:00 AM, Heikki Linnakangas  wrote:
> This is related to earlier the discussion with Peter G, on whether we should
> change state->maxTapes to reflect the actual number of tape that were used,
> when that's less than maxTapes. I think his confusion about the loop in
> init_tape_buffers(), in
> CAM3SWZQ7=fcy1iuz6jnzuunnktag6uitc1i-donxscp-9zs...@mail.gmail.com would
> also have been avoided, if we had done that. So I think we should reconsider
> that.

-1 on that from me. I don't think that you should modify a variable
that is directly linkable to Knuth's description of polyphase merge --
doing so seems like a bad idea. state->maxTapes (Knuth's T) really is
something that is established pretty early on, and doesn't change.

While the fix you pushed was probably a good idea anyway, I still
think you should not use state->maxTapes to exhaustively call
LogicalTapeAssignReadBufferSize() on every tape, even non-active
tapes. That's the confusing part. It's not as if your need for the
number of input tapes (the number of maybe-active tapes) is long
lived; you just need to instruct logtape.c on buffer sizing once, at
the start of mergeruns().

Besides, what I propose to do is really exactly the same as what you
also want to do, except it avoids actually changing state->maxTapes.
We'd just pass down what you propose to assign to state->maxTapes
directly, which differs (and not just in the common case where there
are inactive tapes -- it's always at least off-by-one). Right?

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] memory leak in e94568ecc10 (pre-reading in external sort)

2016-10-06 Thread Heikki Linnakangas

On 10/06/2016 07:50 AM, Tomas Vondra wrote:

it seems e94568ecc10 has a pretty bad memory leak. A simple


Oops, fixed, thanks for the report!

To be precise, this wasn't a memory leak, just a gross overallocation of 
memory. The new code in tuplesort.c assumes that it's harmless to  call 
LogicalTapeRewind() on never-used tapes, but in fact, it allocated the 
read buffer for the tape. I fixed that by changing LogicalTapeRewind() 
to not allocate it, if the tape was completely empty.


This is related to earlier the discussion with Peter G, on whether we 
should change state->maxTapes to reflect the actual number of tape that 
were used, when that's less than maxTapes. I think his confusion about 
the loop in init_tape_buffers(), in 
CAM3SWZQ7=fcy1iuz6jnzuunnktag6uitc1i-donxscp-9zs...@mail.gmail.com would 
also have been avoided, if we had done that. So I think we should 
reconsider that.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers