Re: [HACKERS] Logical tape pause/resume
On Wed, Dec 21, 2016 at 4:25 AM, Heikki Linnakangas wrote:
> In the meanwhile, Robert committed the cap on the number of tapes. Since
> that's in, I'm not sure if the pause/resume part of this is worthwhile.
> Maybe it is.

I rebased my parallel tuplesort patch on top of what you committed a few days back (your 0001-* patch). It definitely made my changes to logtape.c significantly simpler, which was a big improvement.

I would be inclined to not go forward with 0002-* though, because I think it's cleaner for the parallel tuplesort patch to have workers rely on the tape freezing code to flush the last block out to make state available in temp files for the leader to process/merge. The memory savings that remain on the table are probably not measurable if we were to fix them, given the work we've already done, palloc() fragmentation, and so on.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logical tape pause/resume
On 12/21/2016 03:22 PM, Robert Haas wrote:
> On Wed, Dec 21, 2016 at 7:25 AM, Heikki Linnakangas wrote:
>> For now, barring objections, I'm going to commit the first patch. It
>> seems like a worthwhile simplification in any case, especially for
>> Peter's parallel tuplesort patch set.
>
> Well, it removes more code than it adds. That's definitely something.
> And saving memory per-empty-tape is good, too. A few random comments:
>
> "seeked" is kind of a lame variable name. How about "seekpos" or
> "newpos" or something like that?

Ok.

>  /*
>   * Even in minimum memory, use at least a MINORDER merge.  On the other
>   * hand, even when we have lots of memory, do not use more than a MAXORDER
> - * merge.  Tapes are pretty cheap, but they're not entirely free.  Each
> - * additional tape reduces the amount of memory available to build runs,
> - * which in turn can cause the same sort to need more runs, which makes
> - * merging slower even if it can still be done in a single pass.  Also,
> - * high order merges are quite slow due to CPU cache effects; it can be
> - * faster to pay the I/O cost of a polyphase merge than to perform a single
> - * merge pass across many hundreds of tapes.
> + * merge.  Tapes are pretty cheap, but they're not entirely free.  High order
> + * merges are quite slow due to CPU cache effects; it can be faster to pay
> + * the I/O cost of a polyphase merge than to perform a single merge pass
> + * across many hundreds of tapes.
>   */
>
> I think you could leave this comment as-is. You haven't zeroed out the
> overhead of a tape, and I like those additional bits I crammed in
> there. :-)

Yes, quite right. That was a mishap in rebasing; the change removing that comment really belongs to the pause/resume patch rather than the first one.

Thanks for the review!

- Heikki
Re: [HACKERS] Logical tape pause/resume
On Wed, Dec 21, 2016 at 7:25 AM, Heikki Linnakangas wrote:
> For now, barring objections, I'm going to commit the first patch. It seems
> like a worthwhile simplification in any case, especially for Peter's
> parallel tuplesort patch set.

Well, it removes more code than it adds. That's definitely something. And saving memory per-empty-tape is good, too. A few random comments:

"seeked" is kind of a lame variable name. How about "seekpos" or "newpos" or something like that?

 /*
  * Even in minimum memory, use at least a MINORDER merge.  On the other
  * hand, even when we have lots of memory, do not use more than a MAXORDER
- * merge.  Tapes are pretty cheap, but they're not entirely free.  Each
- * additional tape reduces the amount of memory available to build runs,
- * which in turn can cause the same sort to need more runs, which makes
- * merging slower even if it can still be done in a single pass.  Also,
- * high order merges are quite slow due to CPU cache effects; it can be
- * faster to pay the I/O cost of a polyphase merge than to perform a single
- * merge pass across many hundreds of tapes.
+ * merge.  Tapes are pretty cheap, but they're not entirely free.  High order
+ * merges are quite slow due to CPU cache effects; it can be faster to pay
+ * the I/O cost of a polyphase merge than to perform a single merge pass
+ * across many hundreds of tapes.
  */

I think you could leave this comment as-is. You haven't zeroed out the overhead of a tape, and I like those additional bits I crammed in there. :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
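[Editorial note: the MINORDER/MAXORDER clamp discussed in the comment above can be sketched in isolation. This is a simplified stand-in, not the real tuplesort_merge_order(), which derives the tape budget from allowedMem; the MINORDER value of 6 matches tuplesort.c, and the cap of 500 is the MAXORDER value mentioned elsewhere in this thread.]

```c
#include <assert.h>

#define MINORDER 6				/* minimum merge order */
#define MAXORDER 500			/* maximum merge order (the cap under discussion) */

/*
 * Sketch of the clamp: however many tapes the memory budget would allow,
 * never merge fewer than MINORDER or more than MAXORDER runs at once.
 * 'budget_tapes' stands in for the value the real code computes from
 * allowedMem.
 */
static int
merge_order(int budget_tapes)
{
	int			mOrder = budget_tapes;

	if (mOrder < MINORDER)
		mOrder = MINORDER;
	if (mOrder > MAXORDER)
		mOrder = MAXORDER;
	return mOrder;
}
```

The point of the cap is visible at the extremes: a tiny budget is still forced up to a usable merge order, while a 1GB-plus work_mem no longer translates into thousands of tapes.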
Re: [HACKERS] Logical tape pause/resume
On 10/12/2016 05:30 PM, Heikki Linnakangas wrote:
> On 10/10/2016 10:31 PM, Peter Geoghegan wrote:
>> On Sun, Oct 9, 2016 at 11:52 PM, Heikki Linnakangas wrote:
>>> Ok, good. I think we're in agreement on doing this, then, even if we
>>> don't agree on the justification :-).
>>
>> Agreed. :-)
>
> Attached are new patch versions. Rebased them over current head, fixed a
> number of bugs in the seek and backspace code, and did a bunch of
> stylistic cleanups and comment fixes.

A rebased set of patches attached. In the meanwhile, Robert committed the cap on the number of tapes. Since that's in, I'm not sure if the pause/resume part of this is worthwhile. Maybe it is.

For now, barring objections, I'm going to commit the first patch. It seems like a worthwhile simplification in any case, especially for Peter's parallel tuplesort patch set.

- Heikki

From 70470e8903c5e9db8252463e1f0b063d209727dd Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Wed, 21 Dec 2016 14:12:37 +0200
Subject: [PATCH 1/2] Simplify tape block format.

No more indirect blocks. The blocks form a linked list instead.

This saves some memory, because we don't need to have a buffer in memory
to hold the indirect block (or blocks). To reflect that,
TAPE_BUFFER_OVERHEAD is reduced from 3 to 1 buffer, which allows using
more memory for building the initial runs.
---
 src/backend/utils/sort/logtape.c   | 627 +++--
 src/backend/utils/sort/tuplesort.c |  69 ++--
 src/include/utils/logtape.h        |   4 +-
 3 files changed, 218 insertions(+), 482 deletions(-)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 316301061d..07d147dc47 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -31,15 +31,8 @@
  * in BLCKSZ-size blocks.  Space allocation boils down to keeping track
  * of which blocks in the underlying file belong to which logical tape,
  * plus any blocks that are free (recycled and not yet reused).
- * The blocks in each logical tape are remembered using a method borrowed
- * from the Unix HFS filesystem: we store data block numbers in an
- * "indirect block".  If an indirect block fills up, we write it out to
- * the underlying file and remember its location in a second-level indirect
- * block.  In the same way second-level blocks are remembered in third-
- * level blocks, and so on if necessary (of course we're talking huge
- * amounts of data here).  The topmost indirect block of a given logical
- * tape is never actually written out to the physical file, but all lower-
- * level indirect blocks will be.
+ * The blocks in each logical tape form a chain, with a prev- and next-
+ * pointer in each block.
  *
  * The initial write pass is guaranteed to fill the underlying file
  * perfectly sequentially, no matter how data is divided into logical tapes.
@@ -87,58 +80,65 @@
 #include "utils/memutils.h"
 
 /*
- * Block indexes are "long"s, so we can fit this many per indirect block.
- * NB: we assume this is an exact fit!
- */
-#define BLOCKS_PER_INDIR_BLOCK ((int) (BLCKSZ / sizeof(long)))
-
-/*
- * We use a struct like this for each active indirection level of each
- * logical tape.  If the indirect block is not the highest level of its
- * tape, the "nextup" link points to the next higher level.  Only the
- * "ptrs" array is written out if we have to dump the indirect block to
- * disk.  If "ptrs" is not completely full, we store -1L in the first
- * unused slot at completion of the write phase for the logical tape.
+ * A TapeBlockTrailer is stored at the end of each BLCKSZ block.
+ *
+ * The first block of a tape has prev == -1.  The last block of a tape
+ * stores the number of valid bytes on the block, inverted, in 'next'.
+ * Therefore next < 0 indicates the last block.
  */
-typedef struct IndirectBlock
+typedef struct TapeBlockTrailer
 {
-	int			nextSlot;		/* next pointer slot to write or read */
-	struct IndirectBlock *nextup;	/* parent indirect level, or NULL if
-									 * top */
-	long		ptrs[BLOCKS_PER_INDIR_BLOCK];	/* indexes of contained blocks */
-} IndirectBlock;
+	long		prev;			/* previous block on this tape, or -1 on first
+								 * block */
+	long		next;			/* next block on this tape, or # of valid
+								 * bytes on last block (if < 0) */
+} TapeBlockTrailer;
+
+#define TapeBlockPayloadSize  (BLCKSZ - sizeof(TapeBlockTrailer))
+#define TapeBlockGetTrailer(buf) \
+	((TapeBlockTrailer *) ((char *) buf + TapeBlockPayloadSize))
+
+#define TapeBlockIsLast(buf) (TapeBlockGetTrailer(buf)->next < 0)
+#define TapeBlockGetNBytes(buf) \
+	(TapeBlockIsLast(buf) ? \
+	 (- TapeBlockGetTrailer(buf)->next) : TapeBlockPayloadSize)
+#define TapeBlockSetNBytes(buf, nbytes) \
+	(TapeBlockGetTrailer(buf)->next = -(nbytes))
+
 /*
  * This data structure represents a single "logical tape" within the set
- * of logical tapes stored in the same file.  We must keep track of the
- * current partially-read-or-written data block as well as th
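[Editorial note: the trailer format in the patch above can be exercised on its own. The sketch below uses the patch's trailer struct and macros verbatim, with BLCKSZ hard-coded to 8192 for illustration (in PostgreSQL it comes from the build configuration); the union exists only to keep the demo buffer aligned for the long fields in the trailer.]

```c
#include <assert.h>
#include <string.h>

#define BLCKSZ 8192

typedef struct TapeBlockTrailer
{
	long		prev;			/* previous block on this tape, or -1 on first
								 * block */
	long		next;			/* next block on this tape, or # of valid
								 * bytes on last block (if < 0) */
} TapeBlockTrailer;

#define TapeBlockPayloadSize  (BLCKSZ - sizeof(TapeBlockTrailer))
#define TapeBlockGetTrailer(buf) \
	((TapeBlockTrailer *) ((char *) buf + TapeBlockPayloadSize))

#define TapeBlockIsLast(buf) (TapeBlockGetTrailer(buf)->next < 0)
#define TapeBlockGetNBytes(buf) \
	(TapeBlockIsLast(buf) ? \
	 (- TapeBlockGetTrailer(buf)->next) : TapeBlockPayloadSize)
#define TapeBlockSetNBytes(buf, nbytes) \
	(TapeBlockGetTrailer(buf)->next = -(nbytes))

/* an illustrative block buffer, aligned for the long fields in the trailer */
static union
{
	char		data[BLCKSZ];
	long		align;
} blk;
```

The encoding trick is that 'next' does double duty: a non-negative value is a block number pointing onward along the chain, while a negative value marks the tape's last block and encodes how many payload bytes it holds.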
Re: [HACKERS] Logical tape pause/resume
On 10/10/2016 10:31 PM, Peter Geoghegan wrote:
> On Sun, Oct 9, 2016 at 11:52 PM, Heikki Linnakangas wrote:
>> Ok, good. I think we're in agreement on doing this, then, even if we
>> don't agree on the justification :-).
>
> Agreed. :-)

Attached are new patch versions. Rebased them over current head, fixed a number of bugs in the seek and backspace code, and did a bunch of stylistic cleanups and comment fixes.

I changed the API of LogicalTapeBackspace() slightly. If you try to back up beyond the beginning of the tape, it used to return FALSE and do nothing. Now it backspaces as much as it can, to the very beginning of the tape, and returns the amount backspaced. That was more convenient than the old behaviour with this new implementation.

- Heikki

From 588d74efe33f380221391b1d8520dc4b4cfa09be Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Wed, 12 Oct 2016 12:26:25 +0300
Subject: [PATCH 1/2] Simplify tape block format.

No more indirect blocks. The blocks form a linked list instead.

This saves some memory, because we don't need to have a buffer in memory
to hold the indirect block (or blocks). To reflect that,
TAPE_BUFFER_OVERHEAD is reduced from 3 to 1 buffer, which allows using
more memory for building the initial runs.
---
 src/backend/utils/sort/logtape.c   | 627 +++--
 src/backend/utils/sort/tuplesort.c |  58 ++--
 src/include/utils/logtape.h        |   4 +-
 3 files changed, 214 insertions(+), 475 deletions(-)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 3163010..07d147d 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -31,15 +31,8 @@
  * in BLCKSZ-size blocks.  Space allocation boils down to keeping track
  * of which blocks in the underlying file belong to which logical tape,
  * plus any blocks that are free (recycled and not yet reused).
- * The blocks in each logical tape are remembered using a method borrowed
- * from the Unix HFS filesystem: we store data block numbers in an
- * "indirect block".  If an indirect block fills up, we write it out to
- * the underlying file and remember its location in a second-level indirect
- * block.  In the same way second-level blocks are remembered in third-
- * level blocks, and so on if necessary (of course we're talking huge
- * amounts of data here).  The topmost indirect block of a given logical
- * tape is never actually written out to the physical file, but all lower-
- * level indirect blocks will be.
+ * The blocks in each logical tape form a chain, with a prev- and next-
+ * pointer in each block.
 *
 * The initial write pass is guaranteed to fill the underlying file
 * perfectly sequentially, no matter how data is divided into logical tapes.
@@ -87,58 +80,65 @@
 #include "utils/memutils.h"
 
 /*
- * Block indexes are "long"s, so we can fit this many per indirect block.
- * NB: we assume this is an exact fit!
- */
-#define BLOCKS_PER_INDIR_BLOCK ((int) (BLCKSZ / sizeof(long)))
-
-/*
- * We use a struct like this for each active indirection level of each
- * logical tape.  If the indirect block is not the highest level of its
- * tape, the "nextup" link points to the next higher level.  Only the
- * "ptrs" array is written out if we have to dump the indirect block to
- * disk.  If "ptrs" is not completely full, we store -1L in the first
- * unused slot at completion of the write phase for the logical tape.
+ * A TapeBlockTrailer is stored at the end of each BLCKSZ block.
+ *
+ * The first block of a tape has prev == -1.  The last block of a tape
+ * stores the number of valid bytes on the block, inverted, in 'next'.
+ * Therefore next < 0 indicates the last block.
  */
-typedef struct IndirectBlock
+typedef struct TapeBlockTrailer
 {
-	int			nextSlot;		/* next pointer slot to write or read */
-	struct IndirectBlock *nextup;	/* parent indirect level, or NULL if
-									 * top */
-	long		ptrs[BLOCKS_PER_INDIR_BLOCK];	/* indexes of contained blocks */
-} IndirectBlock;
+	long		prev;			/* previous block on this tape, or -1 on first
								 * block */
+	long		next;			/* next block on this tape, or # of valid
								 * bytes on last block (if < 0) */
+} TapeBlockTrailer;
+
+#define TapeBlockPayloadSize  (BLCKSZ - sizeof(TapeBlockTrailer))
+#define TapeBlockGetTrailer(buf) \
+	((TapeBlockTrailer *) ((char *) buf + TapeBlockPayloadSize))
+
+#define TapeBlockIsLast(buf) (TapeBlockGetTrailer(buf)->next < 0)
+#define TapeBlockGetNBytes(buf) \
+	(TapeBlockIsLast(buf) ? \
+	 (- TapeBlockGetTrailer(buf)->next) : TapeBlockPayloadSize)
+#define TapeBlockSetNBytes(buf, nbytes) \
+	(TapeBlockGetTrailer(buf)->next = -(nbytes))
+
 /*
  * This data structure represents a single "logical tape" within the set
- * of logical tapes stored in the same file.  We must keep track of the
- * current partially-read-or-written data block as well as the active
- * indirect block level(s).
+ * of logical tapes stored in the same file.
+ *
+ * Wh
Re: [HACKERS] Logical tape pause/resume
On Sun, Oct 9, 2016 at 11:52 PM, Heikki Linnakangas wrote:
> Regardless of the number of tapes, the memory used for the tape buffers,
> while building the initial runs, is wasted. It's not entirely wasted when
> you have multiple merge passes, because without the buffer you need to read
> the last partial page back into memory, when you start to output the next
> run on it. But even in that case, I believe that memory would be better used
> for the quicksort, to create larger runs.

Okay. Somehow, I was confused on that point.

> Yeah, there might be value in having a cap, because operating the merge heap
> becomes more expensive when it becomes larger. Mainly because of cache
> misses. We should perhaps have a limit on the size of the merge heap, and
> use the same limit for the replacement selection. Although, the heap behaves
> slightly differently during replacement selection: During replacement
> selection, new values added to the heap tend to go to the bottom of the
> heap, but during merge, new values tend to go close to the top. The latter
> pattern incurs fewer cache misses.

I don't think it makes sense to bring replacement selection into it. I'd vote for completely killing replacement selection at this point. The merge heap work committed a few weeks back really weakened the case for it, a case that the 9.6 work left hanging by a thread to begin with.

In general, having a replacement selection heap that is larger can only be helpful when that means that we have enough "juggling capacity" to reorder things into one (or relatively few) long runs. For RS, the optimal size of the heap is good enough to do any juggling that is possible to make runs as long as possible, but no larger. The default replacement_sort_tuples setting (150,000) seems very high to me, considered from that point of view, actually.

Multiple passes just don't seem to be that bad on modern hardware, in the cases where they still happen. You don't see a big drop when tuning down work_mem to just below the threshold at which the sort can complete in one pass. And multi-pass sorts still won't happen very often in practice. Now, this work of yours risks slowing down multi-pass sorts when that does happen, but I think it could still be worth it.

> But that's orthogonal to the wasting-memory aspect of this. Even if we
> have a cap of 100, it's good to not spend memory for the tape buffers of
> those 100 tapes, when building the initial runs.

You're right about that.

>> Okay. But, you haven't actually addressed the problem with non-trivial
>> amounts of memory being logically allocated ahead of time for no good
>> reason -- you've only addressed the constant factor (the overhead
>> per-tape).
>
> I thought I did. Can you elaborate?
>
> Are you referring to the fact that the array of LogicalTapes is still
> allocated ahead of time, with size maxTapes? True, it'd be nice to allocate
> the LogicalTape structs only when needed. But that's peanuts, compared to
> the tape buffers.

Is that all that is left to remove, in terms of wasteful USEMEM() overhead? Then, yeah, I guess I am just talking about "peanuts", plus the more significant issue of merge heap CPU cache efficiency.

>> I can definitely see value in refactoring, to make that code less
>> complicated -- I just don't think it's justified by the amount of
>> memory that is wasted on tapes.
>
> Ok, good. I think we're in agreement on doing this, then, even if we don't
> agree on the justification :-).

Agreed. :-)

> Yeah. I'm not sure how partitioning and all that would be done here. But I'm
> pretty sure this simpler logtape.c code is easier to work with, for
> partitioning too.

That's my intuition, too. Obviously my thoughts on partitioning are still very hand-wavy, but I'm pretty sure that partitioning is the future for the parallelization of sorting in the executor. Pushing *everything* down is more important than actually having the sort be as fast as possible there.

-- 
Peter Geoghegan
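[Editorial note: the claim above that multi-pass sorts "just don't seem to be that bad" can be made concrete with a back-of-the-envelope pass count. The sketch below is an upper-bound estimate for a balanced merge (a polyphase merge does somewhat better), not tuplesort's actual cost model; with a merge order in the hundreds, even hundreds of thousands of initial runs merge in two passes.]

```c
#include <assert.h>

/*
 * Rough estimate of merge passes needed for 'nruns' initial runs at a
 * given merge order: each pass over the data reduces the run count by
 * a factor of 'order'.  Treat this as an upper-bound sketch.
 */
static int
merge_passes(long nruns, int order)
{
	int			passes = 0;

	while (nruns > 1)
	{
		nruns = (nruns + order - 1) / order;	/* ceiling division */
		passes++;
	}
	return passes;
}
```

With a cap of 500 tapes, 500 runs merge in a single final pass, and anything up to 250,000 runs needs only two passes over the data, which is consistent with the point that dropping just below the one-pass threshold costs only one extra sequential read/write of the input.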
Re: [HACKERS] Logical tape pause/resume
Here are freshly rebased versions of these patches. No big changes, but I edited some comments slightly.

- Heikki

From 324706fb2e15facbc5686e98ee26bbdc5017366a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Mon, 10 Oct 2016 10:59:38 +0300
Subject: [PATCH 1/2] Simplify tape block format.

No more indirect blocks. The blocks form a linked list instead.

This saves some memory, because we don't need to have a buffer in memory
to hold the indirect block (or blocks). To reflect that,
TAPE_BUFFER_OVERHEAD is reduced from 3 to 1 buffer, which allows using
more memory for building the initial runs.
---
 src/backend/utils/sort/logtape.c   | 581 ++---
 src/backend/utils/sort/tuplesort.c |  17 +-
 src/include/utils/logtape.h        |   2 +-
 3 files changed, 156 insertions(+), 444 deletions(-)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 6cc06b2..7e1f0d8 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -31,15 +31,8 @@
  * in BLCKSZ-size blocks.  Space allocation boils down to keeping track
  * of which blocks in the underlying file belong to which logical tape,
  * plus any blocks that are free (recycled and not yet reused).
- * The blocks in each logical tape are remembered using a method borrowed
- * from the Unix HFS filesystem: we store data block numbers in an
- * "indirect block".  If an indirect block fills up, we write it out to
- * the underlying file and remember its location in a second-level indirect
- * block.  In the same way second-level blocks are remembered in third-
- * level blocks, and so on if necessary (of course we're talking huge
- * amounts of data here).  The topmost indirect block of a given logical
- * tape is never actually written out to the physical file, but all lower-
- * level indirect blocks will be.
+ * The blocks in each logical tape form a chain, with a prev- and next-
+ * pointer in each block.
 *
 * The initial write pass is guaranteed to fill the underlying file
 * perfectly sequentially, no matter how data is divided into logical tapes.
@@ -87,58 +80,58 @@
 #include "utils/memutils.h"
 
 /*
- * Block indexes are "long"s, so we can fit this many per indirect block.
- * NB: we assume this is an exact fit!
- */
-#define BLOCKS_PER_INDIR_BLOCK ((int) (BLCKSZ / sizeof(long)))
-
-/*
- * We use a struct like this for each active indirection level of each
- * logical tape.  If the indirect block is not the highest level of its
- * tape, the "nextup" link points to the next higher level.  Only the
- * "ptrs" array is written out if we have to dump the indirect block to
- * disk.  If "ptrs" is not completely full, we store -1L in the first
- * unused slot at completion of the write phase for the logical tape.
+ * A TapeBlockTrailer is stored at the end of each BLCKSZ block.
+ *
+ * The first block of a tape has prev == -1.  The last block of a tape
+ * stores the number of valid bytes on the block, subtracted by BLCKSZ.
+ * That's always negative, so next < 0 indicates the last block.
 */
-typedef struct IndirectBlock
+typedef struct TapeBlockTrailer
 {
-	int			nextSlot;		/* next pointer slot to write or read */
-	struct IndirectBlock *nextup;	/* parent indirect level, or NULL if
-									 * top */
-	long		ptrs[BLOCKS_PER_INDIR_BLOCK];	/* indexes of contained blocks */
-} IndirectBlock;
+	long		prev;			/* previous block on this tape, or -1 on first block */
+	long		next;			/* next block on this tape, or # of valid bytes on last block */
+} TapeBlockTrailer;
+
+#define TapeBlockGetTrailer(buf) ((TapeBlockTrailer *) ((char *) buf + BLCKSZ - sizeof(TapeBlockTrailer)))
+#define TapeBlockPayloadSize  (BLCKSZ - sizeof(TapeBlockTrailer))
+
+#define TapeBlockGetNBytes(buf) (TapeBlockGetTrailer(buf)->next < 0 ? (TapeBlockGetTrailer(buf)->next + BLCKSZ) : TapeBlockPayloadSize)
+
+#define TapeBlockIsLast(buf) (TapeBlockGetTrailer(buf)->next < 0)
 
 /*
  * This data structure represents a single "logical tape" within the set
- * of logical tapes stored in the same file.  We must keep track of the
- * current partially-read-or-written data block as well as the active
- * indirect block level(s).
+ * of logical tapes stored in the same file.
+ *
+ * While writing, we hold the current partially-written data block in the
+ * buffer.  While reading, we can hold multiple blocks in the buffer.  Note
+ * that we don't retain the trailers of a block when it's read into the
+ * buffer.  The buffer therefore contains one large contiguous chunk of data
+ * from the tape.
 */
typedef struct LogicalTape
{
-	IndirectBlock *indirect;	/* bottom of my indirect-block hierarchy */
	bool		writing;		/* T while in write phase */
	bool		frozen;			/* T if blocks should not be freed when read */
	bool		dirty;			/* does buffer need to be written? */

	/*
-	 * The total data volume in the logical tape is numFullBlocks * BLCKSZ +
-	 * lastBlockBytes.  BUT: we do not update lastBlockBytes du
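[Editorial note: worth flagging when comparing the patch versions in this thread — this earlier patch encodes the last block's byte count in 'next' as nbytes - BLCKSZ, while the later rebased version stores -nbytes instead. Both conventions rely only on the stored value being negative. A standalone sketch of the two round-trips, with illustrative helper names and BLCKSZ hard-coded:]

```c
#include <assert.h>

#define BLCKSZ 8192

/*
 * Earlier patch: the last block's 'next' holds nbytes - BLCKSZ, which is
 * always negative because nbytes < BLCKSZ on a partially filled block.
 */
static long encode_v1(long nbytes) { return nbytes - BLCKSZ; }
static long decode_v1(long next)   { return next + BLCKSZ; }

/*
 * Later patch: the last block's 'next' holds the byte count negated
 * ("inverted"), which is simpler to read and decode.
 */
static long encode_v2(long nbytes) { return -nbytes; }
static long decode_v2(long next)   { return -next; }
```

Both encodings decode to the same byte count and keep the "next < 0 means last block" invariant intact, so the change between versions is purely cosmetic.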
Re: [HACKERS] Logical tape pause/resume
On 10/09/2016 03:27 AM, Peter Geoghegan wrote:
> You shouldn't really waste 8% of the budget with low work_mem settings
> with my cap patch applied, because the new cap never limits the number of
> tapes. IIRC, the cap of 500 tapes doesn't start to matter until you have
> about 1GB of work_mem. So, if there is still any waste at the low end,
> that can only be solved by tweaking the main calculation within
> tuplesort_merge_order(). (Also, to be clear to others following along:
> that memory is never actually allocated, so it's only "wasted" from the
> perspective of one sort operation alone.)

Regardless of the number of tapes, the memory used for the tape buffers, while building the initial runs, is wasted. It's not entirely wasted when you have multiple merge passes, because without the buffer you need to read the last partial page back into memory, when you start to output the next run on it. But even in that case, I believe that memory would be better used for the quicksort, to create larger runs.

> The cost of multiple passes is paid in sequential I/O of tape temp files,
> which is now clearly a relatively low cost. OTOH, the cost of a larger
> merge heap is paid in more CPU cache misses, which is a cost we can feel
> quite severely. While it's really hard to come up with a generic
> break-even point, I do think that there is value in capping the number of
> tapes somewhere in the hundreds. It's not like a cap on the number of
> tapes along the lines I've proposed was not thought about from day one,
> by both Tom and Simon. Noah's relatively recent MaxAllocSize work has
> given the issue new importance, though (the same might have been said
> during the 9.6 replacement selection vs. quicksort discussions, actually).

Yeah, there might be value in having a cap, because operating the merge heap becomes more expensive when it becomes larger. Mainly because of cache misses. We should perhaps have a limit on the size of the merge heap, and use the same limit for the replacement selection. Although, the heap behaves slightly differently during replacement selection: During replacement selection, new values added to the heap tend to go to the bottom of the heap, but during merge, new values tend to go close to the top. The latter pattern incurs fewer cache misses.

But that's orthogonal to the wasting-memory aspect of this. Even if we have a cap of 100, it's good to not spend memory for the tape buffers of those 100 tapes, when building the initial runs.

>> When we finish writing an initial run to a tape, we keep the tape buffers
>> around. That's the reason we need to reserve that memory. But there's no
>> fundamental reason we have to do that. As I suggested in [2], we could
>> flush the buffers to disk, and only keep in memory the location of the
>> last block we wrote. If we need to write another run to the tape, we can
>> reload the last incomplete block from the disk, and continue writing.
>
> Okay. But, you haven't actually addressed the problem with non-trivial
> amounts of memory being logically allocated ahead of time for no good
> reason -- you've only addressed the constant factor (the overhead
> per-tape).

I thought I did. Can you elaborate?

Are you referring to the fact that the array of LogicalTapes is still allocated ahead of time, with size maxTapes? True, it'd be nice to allocate the LogicalTape structs only when needed. But that's peanuts, compared to the tape buffers.

>> In addition to saving a little bit of memory, I'd like to do this
>> refactoring because it simplifies the code. It's code that has stayed
>> largely unchanged for the past 15 years, so I'm not too eager to touch
>> it, but looking at the changes coming with Peter's parallel tuplesort
>> patch set, I think this makes sense.
>
> I can definitely see value in refactoring, to make that code less
> complicated -- I just don't think it's justified by the amount of
> memory that is wasted on tapes.

Ok, good. I think we're in agreement on doing this, then, even if we don't agree on the justification :-).

> That said, I'm not especially worried about the complexity within the
> logtape.c changes of the parallel CREATE INDEX patch alone. I'm much more
> interested in having a logtape.c that could be more easily made to support
> binary searching, etc, to find *partition* boundaries, which my CREATE
> INDEX patch doesn't use or care about. This is needed for tuplesort.c
> partition-based sorting. When parallel sort isn't just used by CREATE
> INDEX, partitioning becomes important. And, with partitioning, dynamic
> sampling is essential (I think that this will end up living in logtape.c).

Yeah. I'm not sure how partitioning and all that would be done here. But I'm pretty sure this simpler logtape.c code is easier to work with, for partitioning too. We might want to have each partition on a separate tape, for example, and therefore have a lot more tapes than currently. Not wasting memory on the tape buffers becomes important in that case.

- Heikki
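[Editorial note: the flush-and-remember scheme Heikki describes — pause a tape by writing out its partial last block and keeping only its location, then reload that block if the tape has to accept another run — can be sketched standalone. Everything below (PausedTape, tape_pause(), tape_resume(), and the in-memory 'disk' array standing in for the temp file) is a hypothetical illustration of the idea, not the patch's actual API.]

```c
#include <assert.h>
#include <string.h>

#define BLCKSZ 8192
#define NDISKBLOCKS 4

/* stand-in for the temp file; the real code goes through buffile.c */
static char disk[NDISKBLOCKS][BLCKSZ];

/*
 * Hypothetical per-tape state kept while a tape is paused: just the
 * location and fill level of the last (partially written) block.  The
 * block buffer itself can be freed, which is the whole point.
 */
typedef struct PausedTape
{
	long		lastBlock;		/* block number of the last block written */
	int			lastNBytes;		/* valid bytes in that block */
} PausedTape;

/* Pause: flush the partial block to "disk" and remember only its location. */
static void
tape_pause(PausedTape *tape, long curBlock, const char *buffer, int nbytes)
{
	memcpy(disk[curBlock], buffer, BLCKSZ); /* flush, even though not full */
	tape->lastBlock = curBlock;				/* buffer memory can now be freed */
	tape->lastNBytes = nbytes;
}

/* Resume: one extra read reloads the partial block; append from there. */
static int
tape_resume(const PausedTape *tape, char *buffer)
{
	memcpy(buffer, disk[tape->lastBlock], BLCKSZ);
	return tape->lastNBytes;	/* continue writing at this offset */
}
```

The single extra read on resume matches the cost analysis earlier in the thread: it is only paid when another run actually lands on the same tape, i.e. in the increasingly rare multi-pass case.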
Re: [HACKERS] Logical tape pause/resume
Apologies for the delayed response to this. On Tue, Oct 4, 2016 at 3:47 AM, Heikki Linnakangas wrote: > One of the patches in Peter Geoghegan's Parallel tuplesort patch set [1] is > to put a cap on the number of tapes to use for the sort. > That amounts to about 8% of the available memory. That's quite wasteful. > Peter's approach of putting an arbitrary cap on the max number of tapes > works, but I find it a bit hackish. And you still waste that 8% with smaller > work_mem settings. I don't think it's hackish -- there was never a theoretical justification for using as many tapes as possible from Knuth, or anyone else. I think that Simon's 2006 work on allowing for a number of tapes much greater than Knuth's "sweet spot" of 7 was useful only because it sometimes enabled final on-the-fly merging where we are not merging a huge number of runs (i.e. when merging only somewhat more than 7 runs). Your recent work on tape preloading has probably greatly diminished the value of not doing all merging on-the-fly in a single, final merge, though. Knuth's sweet spot of 7 had little or nothing to do with the economics of buying many tape drives. You shouldn't really waste 8% of the budget with low work_mem settings with my cap patch applied, because the new cap never limits the number of tapes. IIRC, the cap of 500 tapes doesn't start to matter until you have about 1GB of work_mem. So, if there is still any waste at the low end, that can only be solved by tweaking the main calculation within tuplesort_merge_order(). (Also, to be clear to others following along: that memory is never actually allocated, so it's only "wasted" from the perspective of one sort operation alone). The cost of multiple passes is paid in sequential I/O of tape temp files, which is now clearly a relatively low cost. OTOH, the cost of a larger merge heap is paid in more CPU cache misses, which is a cost we can feel quite severely. 
While it's really hard to come up with a generic break-even point, I do think that there is value in capping the number of tapes somewhere in the hundreds. It's not like a cap on the number of tapes along the lines I've proposed was not thought about from day one, by both Tom and Simon. Noah's relatively recent MaxAllocSize work has given the issue new importance, though (the same might have been said during the 9.6 replacement selection vs. quicksort discussions, actually). > When we finish writing an initial run to a tape, we keep the tape buffers > around. That's the reason we need to reserve that memory. But there's no > fundamental reason we have to do that. As I suggested in [2], we could flush > the buffers to disk, and only keep in memory the location of the last block > we wrote. If we need to write another run to the tape, we can reload the > last incomplete block from the disk, and continue writing. Okay. But, you haven't actually addressed the problem with non-trivial amounts of memory being logically allocated ahead of time for no good reason -- you've only address the constant factor (the overhead per-tape). Can't we also fix the general problem, by applying a cap? Better to have a cap that is approximately right (in the hundreds or so) than one that is exactly wrong (infinity -- no cap). > Reloading the last block, requires an extra I/O. That's OK. It's quite rare > to have a multi-pass sort these days, so it's a good bet that you don't need > to do it. And if you have a very small work_mem, so that you need to do a > multi-pass sort, having some more memory available for building the initial > runs is probably worth it, and there's also a good chance that the block is > still in the OS cache. That analysis does seem sound to me. > In addition to saving a little bit of memory, I'd like to do this > refactoring because it simplifies the code. 
> It's code that has stayed largely unchanged for the past 15 years, so I'm
> not too eager to touch it, but looking at the changes coming with Peter's
> parallel tuplesort patch set, I think this makes sense.

I can definitely see value in refactoring, to make that code less complicated -- I just don't think it's justified by the amount of memory that is wasted on tapes. That said, I'm not especially worried about the complexity within the logtape.c changes of the parallel CREATE INDEX patch alone. I'm much more interested in having a logtape.c that could be more easily made to support binary searching, etc., to find *partition* boundaries, which my CREATE INDEX patch doesn't use or care about. This is needed for tuplesort.c partition-based sorting. When parallel sort isn't just used by CREATE INDEX, partitioning becomes important. And, with partitioning, dynamic sampling is essential (I think that this will end up living in logtape.c).

To recap on what I went into in the partitioning-to-parallel-tuplesort thread [1], I think that partitioning will come in a future release, and will be of more value to parallel queries, where much more can be pushed down wi
Re: [HACKERS] Logical tape pause/resume
On 10/04/2016 07:06 PM, Claudio Freire wrote:
> On Tue, Oct 4, 2016 at 7:47 AM, Heikki Linnakangas wrote:
>> 1. The first patch changes the way we store the logical tapes on disk.
>> Instead of using indirect blocks, HFS style, the blocks form a linked list.
>> Each block has a trailer, with the block numbers of the previous and next
>> block of the tape. That eliminates the indirect blocks, which simplifies
>> the code quite a bit, and makes it simpler to implement the pause/resume
>> functionality in the second patch. It also immediately reduces the memory
>> needed for the buffers, from 3 to 1 block per tape, as we don't need to
>> hold the indirect blocks in memory.
>
> Doesn't that make prefetching more than a buffer ahead harder?

Yes, indeed. That's a potential downside, if we wanted to do that. We rely wholly on OS readahead for that currently, so it doesn't matter. I don't think we want to start issuing explicit pre-fetching calls here any time soon. But if we did, we could presume that the pages are in sequential order for prefetching purposes, and that would work pretty well in practice (that's what the OS readahead is doing for us now). Or we could try to enforce that by allocating blocks in larger chunks.

In short, I'm OK with that.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logical tape pause/resume
On Tue, Oct 4, 2016 at 7:47 AM, Heikki Linnakangas wrote:
> 1. The first patch changes the way we store the logical tapes on disk.
> Instead of using indirect blocks, HFS style, the blocks form a linked list.
> Each block has a trailer, with the block numbers of the previous and next
> block of the tape. That eliminates the indirect blocks, which simplifies the
> code quite a bit, and makes it simpler to implement the pause/resume
> functionality in the second patch. It also immediately reduces the memory
> needed for the buffers, from 3 to 1 block per tape, as we don't need to hold
> the indirect blocks in memory.

Doesn't that make prefetching more than a buffer ahead harder?
Re: [HACKERS] Logical tape pause/resume
On 10/04/2016 05:58 PM, Simon Riggs wrote:
> On 4 October 2016 at 12:47, Heikki Linnakangas wrote:
>>> Why not just make each new run start at a block boundary?
>>> That way we waste on average BLCKSZ/2 disk space per run, which is
>>> negligible but we avoid any need to have code to read back in the last
>>> block.
>>
>> Hmm. You'd still have to read back the last block, so that you can update
>> its next-pointer.
>
> If each run is in its own file, then you can skip that bit.

Then you need to remember the names of the files, in memory. That's closer to my idea of having only one run on each tape, but you're taking it further by also saying that each tape is a separate file.

That might be a good idea for building the initial runs. Each file would then be written and read sequentially, so we could perhaps get rid of the whole pre-reading code, and rely completely on OS read-ahead. However, we can't really do that for multi-pass merges, because you want to reuse the space of the input tapes for the output tape, as you go.

> And we do want the sort to disk to use multiple files so we can
> parallelize I/O as well as CPU.

Huh? Why can't you parallelize I/O on a single file?

- Heikki
Re: [HACKERS] Logical tape pause/resume
On 4 October 2016 at 12:47, Heikki Linnakangas wrote:
>> Why not just make each new run start at a block boundary?
>> That way we waste on average BLCKSZ/2 disk space per run, which is
>> negligible but we avoid any need to have code to read back in the last
>> block.
>
> Hmm. You'd still have to read back the last block, so that you can update
> its next-pointer.

If each run is in its own file, then you can skip that bit.

And we do want the sort to disk to use multiple files so we can parallelize I/O as well as CPU. So since we know we'll want multiple files, we should be thinking about how to split things up between files.

--
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Logical tape pause/resume
On 10/04/2016 02:09 PM, Simon Riggs wrote:
> On 4 October 2016 at 11:47, Heikki Linnakangas wrote:
>> When we finish writing an initial run to a tape, we keep the tape buffers
>> around. That's the reason we need to reserve that memory. But there's no
>> fundamental reason we have to do that. As I suggested in [2], we could
>> flush the buffers to disk, and only keep in memory the location of the
>> last block we wrote. If we need to write another run to the tape, we can
>> reload the last incomplete block from the disk, and continue writing.
>>
>> Reloading the last block, requires an extra I/O. That's OK. It's quite
>> rare to have a multi-pass sort these days, so it's a good bet that you
>> don't need to do it. And if you have a very small work_mem, so that you
>> need to do a multi-pass sort, having some more memory available for
>> building the initial runs is probably worth it, and there's also a good
>> chance that the block is still in the OS cache.
>
> Sounds like a good idea.
>
> Why not just make each new run start at a block boundary? That way we
> waste on average BLCKSZ/2 disk space per run, which is negligible but we
> avoid any need to have code to read back in the last block.

Hmm. You'd still have to read back the last block, so that you can update its next-pointer.

What you could do, though, is to always store only one run on each tape. If we can make the in-memory representation of a tape small enough, that might be acceptable. You only need 4 bytes to store the starting block number. Or we could dump the starting block numbers of the tapes, to yet another tape. The number of tapes you merge in one pass would still be limited by the amount of memory you have, but the number of tapes would be unlimited. I don't want to do that right now, it gets more invasive. Something to keep in mind for the future, though.
This is also related to your other suggestion below:

> That also makes it easier to put each run in its own file, which will
> surely help with parallelism and compression since we can spread multiple
> run files across multiple temp tablespaces.
>
> Can we also please read in the value from the last tuple in a run, so we
> have min and max values in memory for each run? That can be used during
> the merge step to avoid merging runs unless the value ranges overlap.

This gets more advanced... Yeah, stuff like that would make sense. You could also split the tapes into smaller parts, and store the min/max for each part, to make the merging even more efficient and easier to parallelize. Or to take that to the extreme, you could make each tape a little B-tree. But that's for another day and another patch :-).

What I have now is a drop-in replacement for the current logical tape implementation. These more advanced schemes would build on top of that.

- Heikki
Re: [HACKERS] Logical tape pause/resume
On 4 October 2016 at 11:47, Heikki Linnakangas wrote:
> When we finish writing an initial run to a tape, we keep the tape buffers
> around. That's the reason we need to reserve that memory. But there's no
> fundamental reason we have to do that. As I suggested in [2], we could flush
> the buffers to disk, and only keep in memory the location of the last block
> we wrote. If we need to write another run to the tape, we can reload the
> last incomplete block from the disk, and continue writing.
>
> Reloading the last block, requires an extra I/O. That's OK. It's quite rare
> to have a multi-pass sort these days, so it's a good bet that you don't need
> to do it. And if you have a very small work_mem, so that you need to do a
> multi-pass sort, having some more memory available for building the initial
> runs is probably worth it, and there's also a good chance that the block is
> still in the OS cache.

Sounds like a good idea.

Why not just make each new run start at a block boundary? That way we waste on average BLCKSZ/2 disk space per run, which is negligible but we avoid any need to have code to read back in the last block.

That also makes it easier to put each run in its own file, which will surely help with parallelism and compression since we can spread multiple run files across multiple temp tablespaces.

Can we also please read in the value from the last tuple in a run, so we have min and max values in memory for each run? That can be used during the merge step to avoid merging runs unless the value ranges overlap.

--
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services