Re: [HACKERS] memory leak in e94568ecc10 (pre-reading in external sort)
On 10/11/2016 12:56 AM, Peter Geoghegan wrote: Also, something about trace_sort here: +#ifdef TRACE_SORT + if (trace_sort) + elog(LOG, "using " INT64_FORMAT " KB of memory for read buffers among %d input tapes", +(state->availMem) / 1024, numInputTapes); +#endif + + state->read_buffer_size = state->availMem / numInputTapes; + USEMEM(state, state->availMem); Maybe you should just make the trace_sort output talk about blocks at this point? With # of blocks, you then have to mentally divide by 8 to get the actual memory used. I think kB is nicer in messages that are meant to be read by humans. The bigger problem with this message is that it's not very accurate anymore. The actual amount of memory used is rounded down, and capped by MaxAllocSize*numInputTapes. And would it be better to print the per-tape buffer size, rather than the total? - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] memory leak in e94568ecc10 (pre-reading in external sort)
On Mon, Oct 10, 2016 at 5:21 AM, Heikki Linnakangaswrote: > Admittedly that's confusing. Thinking about this some more, I came up with > the attached. I removed the separate LogicalTapeAssignReadBufferSize() call > altogether - the read buffer size is now passed as argument to the > LogicalTapeRewindForRead() call. > > I split LogicalTapeRewind() into two: LogicalTapeRewindForRead() and > LogicalTapeRewindForWrite(). A boolean argument like 'forWrite' is mentally > challenging, because when reading a call site, you have to remember which > way round the boolean is. And now you also pass the extra buffer_size > argument when rewinding for reading, but not when writing. I like the general idea here. > I gave up on the logic to calculate the quotient and reminder of the > available memory, and assigning one extra buffer to some of the tapes. I'm > now just rounding it down to the nearest BLCKSZ boundary. That simplifies > the code further, although we're now not using all the memory we could. I'm > pretty sure that's OK, although I haven't done any performance testing of > this. The only thing I notice is that this new struct field could use a comment: > --- a/src/backend/utils/sort/tuplesort.c > +++ b/src/backend/utils/sort/tuplesort.c > @@ -366,6 +366,8 @@ struct Tuplesortstate > char *slabMemoryEnd; /* end of slab memory arena */ > SlabSlot *slabFreeHead; /* head of free list */ > > + size_t read_buffer_size; > + Also, something about trace_sort here: > +#ifdef TRACE_SORT > + if (trace_sort) > + elog(LOG, "using " INT64_FORMAT " KB of memory for read buffers among > %d input tapes", > +(state->availMem) / 1024, numInputTapes); > +#endif > + > + state->read_buffer_size = state->availMem / numInputTapes; > + USEMEM(state, state->availMem); Maybe you should just make the trace_sort output talk about blocks at this point? -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] memory leak in e94568ecc10 (pre-reading in external sort)
On 10/06/2016 06:44 PM, Peter Geoghegan wrote: While the fix you pushed was probably a good idea anyway, I still think you should not use swhichtate->maxTapes to exhaustively call LogicalTapeAssignReadBufferSize() on every tape, even non-active tapes. That's the confusing part. Admittedly that's confusing. Thinking about this some more, I came up with the attached. I removed the separate LogicalTapeAssignReadBufferSize() call altogether - the read buffer size is now passed as argument to the LogicalTapeRewindForRead() call. I split LogicalTapeRewind() into two: LogicalTapeRewindForRead() and LogicalTapeRewindForWrite(). A boolean argument like 'forWrite' is mentally challenging, because when reading a call site, you have to remember which way round the boolean is. And now you also pass the extra buffer_size argument when rewinding for reading, but not when writing. I gave up on the logic to calculate the quotient and reminder of the available memory, and assigning one extra buffer to some of the tapes. I'm now just rounding it down to the nearest BLCKSZ boundary. That simplifies the code further, although we're now not using all the memory we could. I'm pretty sure that's OK, although I haven't done any performance testing of this. - Heikki >From 9862ff0cd184afc50515854b267e3a6456547ac6 Mon Sep 17 00:00:00 2001 From: Heikki LinnakangasDate: Mon, 10 Oct 2016 15:20:05 +0300 Subject: [PATCH 1/1] Simplify the code for logical tape read buffers. Pass the buffer size as argument to LogicalTapeRewindForRead, rather than setting it earlier with the separate LogicTapeAssignReadBufferSize call. This way, the buffer size is set closer to where it's actually used, which makes the code easier to understand. This makes the calculation for how much memory to use for the buffers less precise. We now use the same amount of memory for every tape, rounded down to the nearest BLCKSZ boundary, instead of using one more block for some tapes, to get the total up to exact amount of memory available. That should be OK, merging isn't too sensitive to the exact amount of memory used. --- src/backend/utils/sort/logtape.c | 214 ++--- src/backend/utils/sort/tuplesort.c | 119 ++--- src/include/utils/logtape.h| 5 +- 3 files changed, 137 insertions(+), 201 deletions(-) diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c index 6cc06b2..25090c7 100644 --- a/src/backend/utils/sort/logtape.c +++ b/src/backend/utils/sort/logtape.c @@ -141,14 +141,6 @@ typedef struct LogicalTape long curBlockNumber; /* this block's logical blk# within tape */ int pos; /* next read/write position in buffer */ int nbytes; /* total # of valid bytes in buffer */ - - /* - * Desired buffer size to use when reading. To keep things simple, we use - * a single-block buffer when writing, or when reading a frozen tape. But - * when we are reading and will only read forwards, we allocate a larger - * buffer, determined by read_buffer_size. - */ - int read_buffer_size; } LogicalTape; /* @@ -609,7 +601,6 @@ LogicalTapeSetCreate(int ntapes) lt->lastBlockBytes = 0; lt->buffer = NULL; lt->buffer_size = 0; - lt->read_buffer_size = BLCKSZ; lt->curBlockNumber = 0L; lt->pos = 0; lt->nbytes = 0; @@ -739,13 +730,20 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum, } /* - * Rewind logical tape and switch from writing to reading or vice versa. + * Rewind logical tape and switch from writing to reading. * - * Unless the tape has been "frozen" in read state, forWrite must be the - * opposite of the previous tape state. + * The tape must currently be in writing state, or "frozen" in read state. + * + * 'buffer_size' specifies how much memory to use for the read buffer. It + * does not include the memory needed for the indirect blocks. Regardless + * of the argument, the actual amount of memory used is between BLCKSZ and + * MaxAllocSize, and is a multiple of BLCKSZ. The given value is rounded + * down and truncated to fit those constraints, if necessary. If the tape + * is frozen, the 'buffer_size' argument is ignored, and a small BLCKSZ byte + * buffer is used. */ void -LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite) +LogicalTapeRewindForRead(LogicalTapeSet *lts, int tapenum, size_t buffer_size) { LogicalTape *lt; long datablocknum; @@ -753,88 +751,111 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite) Assert(tapenum >= 0 && tapenum < lts->nTapes); lt = >tapes[tapenum]; - if (!forWrite) + /* + * Round and cap the buffer_size if needed. + */ + if (lt->frozen) + buffer_size = BLCKSZ; + else { - if (lt->writing) - { - /* - * Completion of a write phase. Flush last partial data block, - * flush any partial indirect blocks, rewind for normal - * (destructive) read. - */ - if (lt->dirty) -ltsDumpBuffer(lts,
Re: [HACKERS] memory leak in e94568ecc10 (pre-reading in external sort)
On Thu, Oct 6, 2016 at 8:44 AM, Peter Geogheganwrote: > Besides, what I propose to do is really exactly the same as what you > also want to do, except it avoids actually changing state->maxTapes. > We'd just pass down what you propose to assign to state->maxTapes > directly, which differs (and not just in the common case where there > are inactive tapes -- it's always at least off-by-one). Right? What I mean is that you should pass down numTapes alongside numInputTapes. The function init_tape_buffers() could either have an additional argument (numTapes), or derive numTapes itself. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] memory leak in e94568ecc10 (pre-reading in external sort)
On Thu, Oct 6, 2016 at 12:00 AM, Heikki Linnakangaswrote: > This is related to earlier the discussion with Peter G, on whether we should > change state->maxTapes to reflect the actual number of tape that were used, > when that's less than maxTapes. I think his confusion about the loop in > init_tape_buffers(), in > CAM3SWZQ7=fcy1iuz6jnzuunnktag6uitc1i-donxscp-9zs...@mail.gmail.com would > also have been avoided, if we had done that. So I think we should reconsider > that. -1 on that from me. I don't think that you should modify a variable that is directly linkable to Knuth's description of polyphase merge -- doing so seems like a bad idea. state->maxTapes (Knuth's T) really is something that is established pretty early on, and doesn't change. While the fix you pushed was probably a good idea anyway, I still think you should not use state->maxTapes to exhaustively call LogicalTapeAssignReadBufferSize() on every tape, even non-active tapes. That's the confusing part. It's not as if your need for the number of input tapes (the number of maybe-active tapes) is long lived; you just need to instruct logtape.c on buffer sizing once, at the start of mergeruns(). Besides, what I propose to do is really exactly the same as what you also want to do, except it avoids actually changing state->maxTapes. We'd just pass down what you propose to assign to state->maxTapes directly, which differs (and not just in the common case where there are inactive tapes -- it's always at least off-by-one). Right? -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] memory leak in e94568ecc10 (pre-reading in external sort)
On 10/06/2016 07:50 AM, Tomas Vondra wrote: it seems e94568ecc10 has a pretty bad memory leak. A simple Oops, fixed, thanks for the report! To be precise, this wasn't a memory leak, just a gross overallocation of memory. The new code in tuplesort.c assumes that it's harmless to call LogicalTapeRewind() on never-used tapes, but in fact, it allocated the read buffer for the tape. I fixed that by changing LogicalTapeRewind() to not allocate it, if the tape was completely empty. This is related to earlier the discussion with Peter G, on whether we should change state->maxTapes to reflect the actual number of tape that were used, when that's less than maxTapes. I think his confusion about the loop in init_tape_buffers(), in CAM3SWZQ7=fcy1iuz6jnzuunnktag6uitc1i-donxscp-9zs...@mail.gmail.com would also have been avoided, if we had done that. So I think we should reconsider that. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] memory leak in e94568ecc10 (pre-reading in external sort)
Hi, it seems e94568ecc10 has a pretty bad memory leak. A simple pgbench -i -s 300 allocates ~32GB of memory before it fails vacuum... set primary keys... ERROR: out of memory DETAIL: Failed on request of size 134184960. The relevant bit from the memory context stats: TupleSort main: 33278738504 total in 263 blocks; 78848 free (23 chunks); 33278659656 used regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers