Re: [HACKERS] Logical tape pause/resume

2016-12-28 Thread Peter Geoghegan
On Wed, Dec 21, 2016 at 4:25 AM, Heikki Linnakangas  wrote:
> In the meanwhile, Robert committed the cap on the number of tapes. Since
> that's in, I'm not sure if the pause/resume part of this is worthwhile.
> Maybe it is.

I rebased my parallel tuplesort patch on top of what you committed a
few days back (your 0001-* patch). It definitely made my changes to
logtape.c significantly simpler, which was a big improvement.

I would be inclined to not go forward with 0002-* though, because I
think it's cleaner for the parallel tuplesort patch to have workers
rely on the tape freezing code to flush the last block out to make
state available in temp files for the leader to process/merge. The
memory savings that remain on the table would probably not be measurable
even if we went after them, given the work we've already done, palloc()
fragmentation, and so on.
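
For anyone following along, here is a minimal, self-contained toy model in C
of the flush-the-last-block idea (emphatically not logtape.c or the patch
itself, just the concept): the writer buffers only the current partial block,
and a "freeze" step pushes it to disk so the temp file's contents are complete
for someone else -- the leader, in the parallel case -- to read.

#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192

/* Toy "tape": an open temp file plus a buffer for the last, partial block. */
typedef struct
{
	FILE	   *file;
	char		buf[BLCKSZ];	/* current partially filled block */
	size_t		nbytes;			/* valid bytes in buf */
} ToyTape;

/* Append data, writing out full blocks as they fill up. */
static void
toy_write(ToyTape *tape, const char *data, size_t len)
{
	while (len > 0)
	{
		size_t		n = BLCKSZ - tape->nbytes;

		if (n > len)
			n = len;
		memcpy(tape->buf + tape->nbytes, data, n);
		tape->nbytes += n;
		data += n;
		len -= n;
		if (tape->nbytes == BLCKSZ)
		{
			fwrite(tape->buf, 1, BLCKSZ, tape->file);
			tape->nbytes = 0;	/* block full: flushed, start a new one */
		}
	}
}

/*
 * "Freeze": flush the final partial block, so that everything written so
 * far is physically in the temp file and readable by another process.
 */
static void
toy_freeze(ToyTape *tape)
{
	if (tape->nbytes > 0)
		fwrite(tape->buf, 1, tape->nbytes, tape->file);
	fflush(tape->file);
}

int
main(void)
{
	ToyTape		tape = {0};
	const char *run = "some sorted run data";

	tape.file = tmpfile();
	toy_write(&tape, run, strlen(run));
	toy_freeze(&tape);			/* the file now holds the complete run */
	fclose(tape.file);
	return 0;
}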

-- 
Peter Geoghegan




Re: [HACKERS] Logical tape pause/resume

2016-12-21 Thread Heikki Linnakangas

On 12/21/2016 03:22 PM, Robert Haas wrote:

On Wed, Dec 21, 2016 at 7:25 AM, Heikki Linnakangas  wrote:

For now, barring objections, I'm going to commit the first patch. It seems
like a worthwhile simplification in any case, especially for Peter's
Parallel tuplesort patch set.


Well, it removes more code than it adds.  That's definitely something.
And saving memory per-empty-tape is good, too.  A few random comments:

"seeked" is kind of a lame variable name.  How about "seekpos" or
"newpos" or something like that?


Ok.


 /*
  * Even in minimum memory, use at least a MINORDER merge.  On the other
  * hand, even when we have lots of memory, do not use more than a MAXORDER
- * merge.  Tapes are pretty cheap, but they're not entirely free.  Each
- * additional tape reduces the amount of memory available to build runs,
- * which in turn can cause the same sort to need more runs, which makes
- * merging slower even if it can still be done in a single pass.  Also,
- * high order merges are quite slow due to CPU cache effects; it can be
- * faster to pay the I/O cost of a polyphase merge than to perform a single
- * merge pass across many hundreds of tapes.
+ * merge.  Tapes are pretty cheap, but they're not entirely free.  High order
+ * merges are quite slow due to CPU cache effects; it can be faster to pay
+ * the I/O cost of a polyphase merge than to perform a single merge pass
+ * across many hundreds of tapes.
  */

I think you could leave this comment as-is.  You haven't zeroed out
the overhead of a tape, and I like those additional bits I crammed in
there.  :-)


Yes, quite right. That was a mishap in rebasing; that change to remove 
the comment really belongs to the pause/resume patch rather than the first 
one.


Thanks for the review!

- Heikki





Re: [HACKERS] Logical tape pause/resume

2016-12-21 Thread Robert Haas
On Wed, Dec 21, 2016 at 7:25 AM, Heikki Linnakangas  wrote:
> For now, barring objections, I'm going to commit the first patch. It seems
> like a worthwhile simplification in any case, especially for Peter's
> Parallel tuplesort patch set.

Well, it removes more code than it adds.  That's definitely something.
And saving memory per-empty-tape is good, too.  A few random comments:

"seeked" is kind of a lame variable name.  How about "seekpos" or
"newpos" or something like that?

 /*
  * Even in minimum memory, use at least a MINORDER merge.  On the other
  * hand, even when we have lots of memory, do not use more than a MAXORDER
- * merge.  Tapes are pretty cheap, but they're not entirely free.  Each
- * additional tape reduces the amount of memory available to build runs,
- * which in turn can cause the same sort to need more runs, which makes
- * merging slower even if it can still be done in a single pass.  Also,
- * high order merges are quite slow due to CPU cache effects; it can be
- * faster to pay the I/O cost of a polyphase merge than to perform a single
- * merge pass across many hundreds of tapes.
+ * merge.  Tapes are pretty cheap, but they're not entirely free.  High order
+ * merges are quite slow due to CPU cache effects; it can be faster to pay
+ * the I/O cost of a polyphase merge than to perform a single merge pass
+ * across many hundreds of tapes.
  */

I think you could leave this comment as-is.  You haven't zeroed out
the overhead of a tape, and I like those additional bits I crammed in
there.  :-)
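
(For anyone following along: the calculation these constants feed into boils
down to roughly the following. This is a from-memory paraphrase of the 9.6-era
tuplesort_merge_order() with the new cap applied, written as a standalone
snippet; the exact constants and accounting in the tree may differ.)

#include <stdint.h>

#define BLCKSZ					8192
#define MINORDER				6	/* minimum merge order */
#define MAXORDER				500 /* maximum merge order (the new cap) */
#define TAPE_BUFFER_OVERHEAD	(BLCKSZ * 3)	/* drops to one block with Heikki's patch */
#define MERGE_BUFFER_SIZE		(BLCKSZ * 32)

#define Max(x, y)	((x) > (y) ? (x) : (y))
#define Min(x, y)	((x) < (y) ? (x) : (y))

/* How many input tapes can we afford to merge at once, given allowedMem? */
static int
merge_order(int64_t allowedMem)
{
	int			mOrder;

	/* one buffer per tape, plus MERGE_BUFFER_SIZE workspace per input tape */
	mOrder = (int) ((allowedMem - TAPE_BUFFER_OVERHEAD) /
					(MERGE_BUFFER_SIZE + TAPE_BUFFER_OVERHEAD));

	mOrder = Max(mOrder, MINORDER);		/* even in minimum memory, MINORDER */
	mOrder = Min(mOrder, MAXORDER);		/* but never more than MAXORDER */

	return mOrder;
}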

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Logical tape pause/resume

2016-12-21 Thread Heikki Linnakangas

On 10/12/2016 05:30 PM, Heikki Linnakangas wrote:

On 10/10/2016 10:31 PM, Peter Geoghegan wrote:

On Sun, Oct 9, 2016 at 11:52 PM, Heikki Linnakangas  wrote:

Ok, good. I think we're in agreement on doing this, then, even if we don't
agree on the justification :-).


Agreed. :-)


Attached are new patch versions. Rebased them over current head, fixed a
number of bugs in the seek and backspace code, and did a bunch of
stylistic cleanups and comment fixes.


A rebased set of patches attached.

In the meanwhile, Robert committed the cap on the number of tapes. Since 
that's in, I'm not sure if the pause/resume part of this is worthwhile. 
Maybe it is.


For now, barring objections, I'm going to commit the first patch. It 
seems like a worthwhile simplification in any case, especially for 
Peter's Parallel tuplesort patch set.


- Heikki

From 70470e8903c5e9db8252463e1f0b063d209727dd Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 21 Dec 2016 14:12:37 +0200
Subject: [PATCH 1/2] Simplify tape block format.

No more indirect blocks. The blocks form a linked list instead.

This saves some memory, because we don't need to have a buffer in memory to
hold the indirect block (or blocks). To reflect that, TAPE_BUFFER_OVERHEAD
is reduced from 3 to 1 buffer, which allows using more memory for building
the initial runs.
---
 src/backend/utils/sort/logtape.c   | 627 +++--
 src/backend/utils/sort/tuplesort.c |  69 ++--
 src/include/utils/logtape.h|   4 +-
 3 files changed, 218 insertions(+), 482 deletions(-)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 316301061d..07d147dc47 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -31,15 +31,8 @@
  * in BLCKSZ-size blocks.  Space allocation boils down to keeping track
  * of which blocks in the underlying file belong to which logical tape,
  * plus any blocks that are free (recycled and not yet reused).
- * The blocks in each logical tape are remembered using a method borrowed
- * from the Unix HFS filesystem: we store data block numbers in an
- * "indirect block".  If an indirect block fills up, we write it out to
- * the underlying file and remember its location in a second-level indirect
- * block.  In the same way second-level blocks are remembered in third-
- * level blocks, and so on if necessary (of course we're talking huge
- * amounts of data here).  The topmost indirect block of a given logical
- * tape is never actually written out to the physical file, but all lower-
- * level indirect blocks will be.
+ * The blocks in each logical tape form a chain, with a prev- and next-
+ * pointer in each block.
  *
  * The initial write pass is guaranteed to fill the underlying file
  * perfectly sequentially, no matter how data is divided into logical tapes.
@@ -87,58 +80,65 @@
 #include "utils/memutils.h"
 
 /*
- * Block indexes are "long"s, so we can fit this many per indirect block.
- * NB: we assume this is an exact fit!
- */
-#define BLOCKS_PER_INDIR_BLOCK	((int) (BLCKSZ / sizeof(long)))
-
-/*
- * We use a struct like this for each active indirection level of each
- * logical tape.  If the indirect block is not the highest level of its
- * tape, the "nextup" link points to the next higher level.  Only the
- * "ptrs" array is written out if we have to dump the indirect block to
- * disk.  If "ptrs" is not completely full, we store -1L in the first
- * unused slot at completion of the write phase for the logical tape.
+ * A TapeBlockTrailer is stored at the end of each BLCKSZ block.
+ *
+ * The first block of a tape has prev == -1.  The last block of a tape
+ * stores the number of valid bytes on the block, inverted, in 'next'.
+ * Therefore next < 0 indicates the last block.
  */
-typedef struct IndirectBlock
+typedef struct TapeBlockTrailer
 {
-	int			nextSlot;		/* next pointer slot to write or read */
-	struct IndirectBlock *nextup;		/* parent indirect level, or NULL if
-		 * top */
-	long		ptrs[BLOCKS_PER_INDIR_BLOCK];	/* indexes of contained blocks */
-} IndirectBlock;
+	long		prev;			/* previous block on this tape, or -1 on first
+ * block */
+	long		next;			/* next block on this tape, or # of valid
+ * bytes on last block (if < 0) */
+}	TapeBlockTrailer;
+
+#define TapeBlockPayloadSize  (BLCKSZ - sizeof(TapeBlockTrailer))
+#define TapeBlockGetTrailer(buf) \
+	((TapeBlockTrailer *) ((char *) buf + TapeBlockPayloadSize))
+
+#define TapeBlockIsLast(buf) (TapeBlockGetTrailer(buf)->next < 0)
+#define TapeBlockGetNBytes(buf) \
+	(TapeBlockIsLast(buf) ? \
+	 (- TapeBlockGetTrailer(buf)->next) : TapeBlockPayloadSize)
+#define TapeBlockSetNBytes(buf, nbytes) \
+	(TapeBlockGetTrailer(buf)->next = -(nbytes))
+
 
 /*
  * This data structure represents a single "logical tape" within the set
- * of logical tapes stored in the same file.  We must keep track of the
- * current partially-read-or-written data block as well as th
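
To make the trailer arithmetic above concrete, here is a self-contained toy
model (not the patch code) of the same idea: each BLCKSZ block ends with
prev/next block numbers, and the last block stores its valid-byte count
negated in 'next'.

#include <stdio.h>

#define BLCKSZ 8192

typedef struct
{
	long		prev;			/* previous block of this tape, or -1 on first */
	long		next;			/* next block, or -(valid bytes) on the last */
} ToyTrailer;

#define PAYLOAD_SIZE	(BLCKSZ - sizeof(ToyTrailer))
#define TRAILER(buf)	((ToyTrailer *) ((buf) + PAYLOAD_SIZE))
#define IS_LAST(buf)	(TRAILER(buf)->next < 0)
#define NBYTES(buf)		(IS_LAST(buf) ? (size_t) -TRAILER(buf)->next : PAYLOAD_SIZE)

/*
 * Walk one tape's chain of blocks within "the file" (modelled as a flat
 * array of BLCKSZ blocks), starting from its first block, and add up the
 * payload bytes it holds.
 */
static size_t
toy_tape_size(char *file, long first_block)
{
	size_t		total = 0;
	long		blkno = first_block;

	for (;;)
	{
		char	   *buf = file + blkno * BLCKSZ;

		total += NBYTES(buf);
		if (IS_LAST(buf))
			break;
		blkno = TRAILER(buf)->next;
	}
	return total;
}

int
main(void)
{
	static char file[2 * BLCKSZ];	/* two blocks forming a single tape */

	/* block 0: full, chained to block 1 */
	TRAILER(file)->prev = -1;
	TRAILER(file)->next = 1;
	/* block 1: last block, holding 100 valid bytes */
	TRAILER(file + BLCKSZ)->prev = 0;
	TRAILER(file + BLCKSZ)->next = -100;

	printf("tape holds %zu bytes\n", toy_tape_size(file, 0));
	return 0;
}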

Re: [HACKERS] Logical tape pause/resume

2016-10-12 Thread Heikki Linnakangas

On 10/10/2016 10:31 PM, Peter Geoghegan wrote:

On Sun, Oct 9, 2016 at 11:52 PM, Heikki Linnakangas  wrote:

Ok, good. I think we're in agreement on doing this, then, even if we don't
agree on the justification :-).


Agreed. :-)


Attached are new patch versions. Rebased them over current head, fixed a 
number of bugs in the seek and backspace code, and did a bunch of 
stylistic cleanups and comment fixes.


I changed the API of LogicalTapeBackspace() slightly. If you try to back 
up beyond the beginning of the tape, it used to return FALSE and do nothing. 
Now it backspaces as much as it can, to the very beginning of the tape, 
and returns the amount backspaced. That was more convenient than the old 
behaviour with this new implementation.
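
In other words, callers now check how far the backspace actually went rather
than a boolean. A toy stand-in for the new semantics (just to illustrate the
contract; the real function of course operates on the tape set, and the exact
caller-side code in tuplesort.c may differ):

#include <stdio.h>

/*
 * Back up by 'size' bytes, but clamp at the beginning of the tape and
 * return how far we actually moved -- instead of returning false and
 * doing nothing, as the old API did.
 */
static size_t
toy_backspace(size_t *pos, size_t size)
{
	size_t		moved = (size <= *pos) ? size : *pos;

	*pos -= moved;
	return moved;
}

int
main(void)
{
	size_t		pos = 10;
	size_t		moved = toy_backspace(&pos, 25);	/* ask for more than there is */

	if (moved != 25)
		printf("only backed up %zu of 25 bytes; now at offset %zu\n", moved, pos);
	return 0;
}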


- Heikki

From 588d74efe33f380221391b1d8520dc4b4cfa09be Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 12 Oct 2016 12:26:25 +0300
Subject: [PATCH 1/2] Simplify tape block format.

No more indirect blocks. The blocks form a linked list instead.

This saves some memory, because we don't need to have a buffer in memory to
hold the indirect block (or blocks). To reflect that, TAPE_BUFFER_OVERHEAD
is reduced from 3 to 1 buffer, which allows using more memory for building
the initial runs.
---
 src/backend/utils/sort/logtape.c   | 627 +++--
 src/backend/utils/sort/tuplesort.c |  58 ++--
 src/include/utils/logtape.h|   4 +-
 3 files changed, 214 insertions(+), 475 deletions(-)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 3163010..07d147d 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -31,15 +31,8 @@
  * in BLCKSZ-size blocks.  Space allocation boils down to keeping track
  * of which blocks in the underlying file belong to which logical tape,
  * plus any blocks that are free (recycled and not yet reused).
- * The blocks in each logical tape are remembered using a method borrowed
- * from the Unix HFS filesystem: we store data block numbers in an
- * "indirect block".  If an indirect block fills up, we write it out to
- * the underlying file and remember its location in a second-level indirect
- * block.  In the same way second-level blocks are remembered in third-
- * level blocks, and so on if necessary (of course we're talking huge
- * amounts of data here).  The topmost indirect block of a given logical
- * tape is never actually written out to the physical file, but all lower-
- * level indirect blocks will be.
+ * The blocks in each logical tape form a chain, with a prev- and next-
+ * pointer in each block.
  *
  * The initial write pass is guaranteed to fill the underlying file
  * perfectly sequentially, no matter how data is divided into logical tapes.
@@ -87,58 +80,65 @@
 #include "utils/memutils.h"
 
 /*
- * Block indexes are "long"s, so we can fit this many per indirect block.
- * NB: we assume this is an exact fit!
- */
-#define BLOCKS_PER_INDIR_BLOCK	((int) (BLCKSZ / sizeof(long)))
-
-/*
- * We use a struct like this for each active indirection level of each
- * logical tape.  If the indirect block is not the highest level of its
- * tape, the "nextup" link points to the next higher level.  Only the
- * "ptrs" array is written out if we have to dump the indirect block to
- * disk.  If "ptrs" is not completely full, we store -1L in the first
- * unused slot at completion of the write phase for the logical tape.
+ * A TapeBlockTrailer is stored at the end of each BLCKSZ block.
+ *
+ * The first block of a tape has prev == -1.  The last block of a tape
+ * stores the number of valid bytes on the block, inverted, in 'next'.
+ * Therefore next < 0 indicates the last block.
  */
-typedef struct IndirectBlock
+typedef struct TapeBlockTrailer
 {
-	int			nextSlot;		/* next pointer slot to write or read */
-	struct IndirectBlock *nextup;		/* parent indirect level, or NULL if
-		 * top */
-	long		ptrs[BLOCKS_PER_INDIR_BLOCK];	/* indexes of contained blocks */
-} IndirectBlock;
+	long		prev;			/* previous block on this tape, or -1 on first
+ * block */
+	long		next;			/* next block on this tape, or # of valid
+ * bytes on last block (if < 0) */
+}	TapeBlockTrailer;
+
+#define TapeBlockPayloadSize  (BLCKSZ - sizeof(TapeBlockTrailer))
+#define TapeBlockGetTrailer(buf) \
+	((TapeBlockTrailer *) ((char *) buf + TapeBlockPayloadSize))
+
+#define TapeBlockIsLast(buf) (TapeBlockGetTrailer(buf)->next < 0)
+#define TapeBlockGetNBytes(buf) \
+	(TapeBlockIsLast(buf) ? \
+	 (- TapeBlockGetTrailer(buf)->next) : TapeBlockPayloadSize)
+#define TapeBlockSetNBytes(buf, nbytes) \
+	(TapeBlockGetTrailer(buf)->next = -(nbytes))
+
 
 /*
  * This data structure represents a single "logical tape" within the set
- * of logical tapes stored in the same file.  We must keep track of the
- * current partially-read-or-written data block as well as the active
- * indirect block level(s).
+ * of logical tapes stored in the same file.
+ *
+ * Wh

Re: [HACKERS] Logical tape pause/resume

2016-10-10 Thread Peter Geoghegan
On Sun, Oct 9, 2016 at 11:52 PM, Heikki Linnakangas  wrote:
> Regardless of the number of tapes, the memory used for the tape buffers,
> while building the initial runs, is wasted. It's not entirely wasted when
> you have multiple merge passes, because without the buffer you need to read
> the last partial page back into memory, when you start to output the next
> run on it. But even in that case, I believe that memory would be better used
> for the quicksort, to create larger runs.

Okay. Somehow, I was confused on that point.

> Yeah, there might be value in having a cap, because operating the merge heap
> becomes more expensive when it becomes larger. Mainly because of cache
> misses. We should perhaps have a limit on the size of the merge heap, and
> use the same limit for the replacement selection. Although, the heap behaves
> slightly differently during replacement selection: During replacement
> selection, new values added to the heap tend to go to the bottom of the
> heap, but during merge, new values tend to go close to the top. The latter
> pattern incurs fewer cache misses.

I don't think it makes sense to bring replacement selection into it.
I'd vote for completely killing replacement selection at this point.
The merge heap work committed a few weeks back really weakened the
case for it, a case that the 9.6 work left hanging by a thread to
begin with. In general, having a replacement selection heap that is
larger can only be helpful when that means that we have enough
"juggling capacity" to reorder things into one (or relatively few)
long runs. For RS, the optimal heap size is just large enough to do
whatever juggling is possible to make runs as long as possible, but no
larger. The default replacement_sort_tuples setting (150,000) seems
very high to me, considered from that point of view, actually.

Multiple passes just don't seem to be that bad on modern hardware, in
the cases where they still happen. You don't see a big drop when
tuning down work_mem to just below the threshold at which the sort can
complete in one pass. And, multi-pass sorts still won't happen very
often in practice. Now, this work of yours risks slowing down multiple
pass sorts when that does happen, but I think it could still be worth
it.

> But that's orthogonal to the wasting-memory aspect of this. Even if we
> have a cap of 100, it's good to not spend memory for the tape buffers of
> those 100 tapes, when building the initial runs.

You're right about that.

>> Okay. But, you haven't actually addressed the problem with non-trivial
>> amounts of memory being logically allocated ahead of time for no good
>> reason -- you've only addressed the constant factor (the overhead
>> per-tape).
>
>
> I thought I did. Can you elaborate?
>
> Are you referring to the fact that the array of LogicalTapes is still
> allocated ahead of time, with size maxTapes? True, it'd be nice to allocate
> the LogicalTape structs only when needed. But that's peanuts, compared to
> the tape buffers.

Is that all that is left to remove, in terms of wasteful USEMEM()
overhead? Then, yeah, I guess I am just talking about "peanuts", plus
the more significant issue of merge heap CPU cache efficiency.

>> I can definitely see value in refactoring, to make that code less
>> complicated -- I just don't think it's justified by the amount of
>> memory that is wasted on tapes.
>
>
> Ok, good. I think we're in agreement on doing this, then, even if we don't
> agree on the justification :-).

Agreed. :-)

> Yeah. I'm not sure how partitioning and all that would be done here. But I'm
> pretty sure this simpler logtape.c code is easier to work with, for
> partitioning too.

That's my intuition, too. Obviously my thoughts on partitioning are
still very hand-wavy, but I'm pretty sure that partitioning is the
future for the parallelization of sorting in the executor. Pushing
*everything* down is more important than actually having the sort be
as fast as possible there.

-- 
Peter Geoghegan




Re: [HACKERS] Logical tape pause/resume

2016-10-10 Thread Heikki Linnakangas
Here are freshly rebased versions of these patches. No big changes, but 
edited some comments slightly.


- Heikki

From 324706fb2e15facbc5686e98ee26bbdc5017366a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 10 Oct 2016 10:59:38 +0300
Subject: [PATCH 1/2] Simplify tape block format.

No more indirect blocks. The blocks form a linked list instead.

This saves some memory, because we don't need to have a buffer in memory to
hold the indirect block (or blocks). To reflect that, TAPE_BUFFER_OVERHEAD
is reduced from 3 to 1 buffer, which allows using more memory for building
the initial runs.
---
 src/backend/utils/sort/logtape.c   | 581 ++---
 src/backend/utils/sort/tuplesort.c |  17 +-
 src/include/utils/logtape.h|   2 +-
 3 files changed, 156 insertions(+), 444 deletions(-)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 6cc06b2..7e1f0d8 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -31,15 +31,8 @@
  * in BLCKSZ-size blocks.  Space allocation boils down to keeping track
  * of which blocks in the underlying file belong to which logical tape,
  * plus any blocks that are free (recycled and not yet reused).
- * The blocks in each logical tape are remembered using a method borrowed
- * from the Unix HFS filesystem: we store data block numbers in an
- * "indirect block".  If an indirect block fills up, we write it out to
- * the underlying file and remember its location in a second-level indirect
- * block.  In the same way second-level blocks are remembered in third-
- * level blocks, and so on if necessary (of course we're talking huge
- * amounts of data here).  The topmost indirect block of a given logical
- * tape is never actually written out to the physical file, but all lower-
- * level indirect blocks will be.
+ * The blocks in each logical tape form a chain, with a prev- and next-
+ * pointer in each block.
  *
  * The initial write pass is guaranteed to fill the underlying file
  * perfectly sequentially, no matter how data is divided into logical tapes.
@@ -87,58 +80,58 @@
 #include "utils/memutils.h"
 
 /*
- * Block indexes are "long"s, so we can fit this many per indirect block.
- * NB: we assume this is an exact fit!
- */
-#define BLOCKS_PER_INDIR_BLOCK	((int) (BLCKSZ / sizeof(long)))
-
-/*
- * We use a struct like this for each active indirection level of each
- * logical tape.  If the indirect block is not the highest level of its
- * tape, the "nextup" link points to the next higher level.  Only the
- * "ptrs" array is written out if we have to dump the indirect block to
- * disk.  If "ptrs" is not completely full, we store -1L in the first
- * unused slot at completion of the write phase for the logical tape.
+ * A TapeBlockTrailer is stored at the end of each BLCKSZ block.
+ *
+ * The first block of a tape has prev == -1.  The last block of a tape
+ * stores the number of valid bytes on the block minus BLCKSZ, in 'next'.
+ * That's always negative, so next < 0 indicates the last block.
  */
-typedef struct IndirectBlock
+typedef struct TapeBlockTrailer
 {
-	int			nextSlot;		/* next pointer slot to write or read */
-	struct IndirectBlock *nextup;		/* parent indirect level, or NULL if
-		 * top */
-	long		ptrs[BLOCKS_PER_INDIR_BLOCK];	/* indexes of contained blocks */
-} IndirectBlock;
+	long		prev;	/* previous block on this tape, or -1 on first block */
+	long		next;	/* next block on this tape, or # of valid bytes on last block */
+} TapeBlockTrailer;
+
+#define TapeBlockGetTrailer(buf) ((TapeBlockTrailer *) ((char *) buf + BLCKSZ - sizeof(TapeBlockTrailer)))
+#define TapeBlockPayloadSize  (BLCKSZ - sizeof(TapeBlockTrailer))
+
+#define TapeBlockGetNBytes(buf) (TapeBlockGetTrailer(buf)->next < 0 ? (TapeBlockGetTrailer(buf)->next + BLCKSZ) : TapeBlockPayloadSize)
+
+#define TapeBlockIsLast(buf) (TapeBlockGetTrailer(buf)->next < 0)
 
 /*
  * This data structure represents a single "logical tape" within the set
- * of logical tapes stored in the same file.  We must keep track of the
- * current partially-read-or-written data block as well as the active
- * indirect block level(s).
+ * of logical tapes stored in the same file.
+ *
+ * While writing, we hold the current partially-written data block in the
+ * buffer.  While reading, we can hold multiple blocks in the buffer.  Note
+ * that we don't retain the trailers of a block when it's read into the
+ * buffer.  The buffer therefore contains one large contiguous chunk of data
+ * from the tape.
  */
 typedef struct LogicalTape
 {
-	IndirectBlock *indirect;	/* bottom of my indirect-block hierarchy */
 	bool		writing;		/* T while in write phase */
 	bool		frozen;			/* T if blocks should not be freed when read */
 	bool		dirty;			/* does buffer need to be written? */
 
 	/*
-	 * The total data volume in the logical tape is numFullBlocks * BLCKSZ +
-	 * lastBlockBytes.  BUT: we do not update lastBlockBytes du

Re: [HACKERS] Logical tape pause/resume

2016-10-09 Thread Heikki Linnakangas

On 10/09/2016 03:27 AM, Peter Geoghegan wrote:

You shouldn't really waste 8% of the budget with low work_mem settings
with my cap patch applied, because the new cap never limits the number
of tapes. IIRC, the cap of 500 tapes doesn't start to matter until you
have about 1GB of work_mem. So, if there is still any waste at the low
end, that can only be solved by tweaking the main calculation within
tuplesort_merge_order(). (Also, to be clear to others following along:
that memory is never actually allocated, so it's only "wasted" from
the perspective of one sort operation alone).


Regardless of the number of tapes, the memory used for the tape buffers, 
while building the initial runs, is wasted. It's not entirely wasted 
when you have multiple merge passes, because without the buffer you need 
to read the last partial page back into memory, when you start to output 
the next run on it. But even in that case, I believe that memory would 
be better used for the quicksort, to create larger runs.



The cost of multiple passes is paid in sequential I/O of tape temp
files, which is now clearly a relatively low cost. OTOH, the cost of a
larger merge heap is paid in more CPU cache misses, which is a cost we
can feel quite severely. While it's really hard to come up with a
generic break-even point, I do think that there is value in capping
the number of tapes somewhere in the hundreds. It's not like a cap on
the number of tapes along the lines I've proposed was not thought
about from day one, by both Tom and Simon. Noah's relatively recent
MaxAllocSize work has given the issue new importance, though (the same
might have been said during the 9.6 replacement selection vs.
quicksort discussions, actually).


Yeah, there might be value in having a cap, because operating the merge 
heap becomes more expensive when it becomes larger. Mainly because of 
cache misses. We should perhaps have a limit on the size of the merge 
heap, and use the same limit for the replacement selection. Although, 
the heap behaves slightly differently during replacement selection: 
During replacement selection, new values added to the heap tend to go to 
the bottom of the heap, but during merge, new values tend to go close to 
the top. The latter pattern incurs fewer cache misses.


But that's orthogonal to the wasting-memory aspect of this. Even if we 
have a cap of 100, it's good to not spend memory for the tape buffers of 
those 100 tapes, when building the initial runs.



When we finish writing an initial run to a tape, we keep the tape buffers
around. That's the reason we need to reserve that memory. But there's no
fundamental reason we have to do that. As I suggested in [2], we could flush
the buffers to disk, and only keep in memory the location of the last block
we wrote. If we need to write another run to the tape, we can reload the
last incomplete block from the disk, and continue writing.


Okay. But, you haven't actually addressed the problem with non-trivial
amounts of memory being logically allocated ahead of time for no good
reason -- you've only addressed the constant factor (the overhead
per-tape).


I thought I did. Can you elaborate?

Are you referring to the fact that the array of LogicalTapes is still 
allocated ahead of time, with size maxTapes? True, it'd be nice to 
allocate the LogicalTape structs only when needed. But that's peanuts, 
compared to the tape buffers.



In addition to saving a little bit of memory, I'd like to do this
refactoring because it simplifies the code. It's code that has stayed
largely unchanged for the past 15 years, so I'm not too eager to touch it,
but looking at the changes coming with Peter's parallel tuplesort patch set,
I think this makes sense.


I can definitely see value in refactoring, to make that code less
complicated -- I just don't think it's justified by the amount of
memory that is wasted on tapes.


Ok, good. I think we're in agreement on doing this, then, even if we 
don't agree on the justification :-).



That said, I'm not especially worried about the complexity within the
logtape.c changes of the parallel CREATE INDEX patch alone. I'm much
more interested in having a logtape.c that could be more easily made
to support binary searching, etc, to find *partition* boundaries,
which my CREATE INDEX patch doesn't use or care about. This is needed
for tuplesort.c partition-based sorting. When parallel sort isn't just
used by CREATE INDEX, partitioning becomes important. And, with
partitioning, dynamic sampling is essential (I think that this will
end up living in logtape.c).


Yeah. I'm not sure how partitioning and all that would be done here. But 
I'm pretty sure this simpler logtape.c code is easier to work with, for 
partitioning too.


We might want to have each partition on a separate tape, for example, 
and therefore have a lot more tapes than currently. Not wasting memory 
on the tape buffers becomes important in that case.
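
As a rough back-of-the-envelope illustration (assuming the default 8 kB
BLCKSZ and, say, 1000 tapes -- made-up numbers, not a measurement):

    1000 tapes * 3 blocks * 8 kB = ~24 MB reserved with the old indirect-block format
    1000 tapes * 1 block  * 8 kB =  ~8 MB with the simplified linked-block format
    paused tapes (buffers flushed) = little more than the per-tape bookkeeping struct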


- Heikki




Re: [HACKERS] Logical tape pause/resume

2016-10-08 Thread Peter Geoghegan
Apologies for the delayed response to this.

On Tue, Oct 4, 2016 at 3:47 AM, Heikki Linnakangas  wrote:
> One of the patches in Peter Geoghegan's Parallel tuplesort patch set [1] is
> to put a cap on the number of tapes to use for the sort.

> That amounts to about 8% of the available memory. That's quite wasteful.
> Peter's approach of putting an arbitrary cap on the max number of tapes
> works, but I find it a bit hackish. And you still waste that 8% with smaller
> work_mem settings.

I don't think it's hackish -- there was never a theoretical
justification for using as many tapes as possible from Knuth, or
anyone else. I think that Simon's 2006 work on allowing for a number
of tapes much greater than Knuth's "sweet spot" of 7 was useful only
because it sometimes enabled final on-the-fly merging where we are not
merging a huge number of runs (i.e. when merging only somewhat more
than 7 runs). Your recent work on tape preloading has probably greatly
diminished the value of not doing all merging on-the-fly in a single,
final merge, though. Knuth's sweet spot of 7 had little or nothing to
do with the economics of buying many tape drives.

You shouldn't really waste 8% of the budget with low work_mem settings
with my cap patch applied, because the new cap never limits the number
of tapes. IIRC, the cap of 500 tapes doesn't start to matter until you
have about 1GB of work_mem. So, if there is still any waste at the low
end, that can only be solved by tweaking the main calculation within
tuplesort_merge_order(). (Also, to be clear to others following along:
that memory is never actually allocated, so it's only "wasted" from
the perspective of one sort operation alone).

The cost of multiple passes is paid in sequential I/O of tape temp
files, which is now clearly a relatively low cost. OTOH, the cost of a
larger merge heap is paid in more CPU cache misses, which is a cost we
can feel quite severely. While it's really hard to come up with a
generic break-even point, I do think that there is value in capping
the number of tapes somewhere in the hundreds. It's not like a cap on
the number of tapes along the lines I've proposed was not thought
about from day one, by both Tom and Simon. Noah's relatively recent
MaxAllocSize work has given the issue new importance, though (the same
might have been said during the 9.6 replacement selection vs.
quicksort discussions, actually).

> When we finish writing an initial run to a tape, we keep the tape buffers
> around. That's the reason we need to reserve that memory. But there's no
> fundamental reason we have to do that. As I suggested in [2], we could flush
> the buffers to disk, and only keep in memory the location of the last block
> we wrote. If we need to write another run to the tape, we can reload the
> last incomplete block from the disk, and continue writing.

Okay. But, you haven't actually addressed the problem with non-trivial
amounts of memory being logically allocated ahead of time for no good
reason -- you've only addressed the constant factor (the overhead
per-tape). Can't we also fix the general problem, by applying a cap?
Better to have a cap that is approximately right (in the hundreds or
so) than one that is exactly wrong (infinity -- no cap).

> Reloading the last block, requires an extra I/O. That's OK. It's quite rare
> to have a multi-pass sort these days, so it's a good bet that you don't need
> to do it. And if you have a very small work_mem, so that you need to do a
> multi-pass sort, having some more memory available for building the initial
> runs is probably worth it, and there's also a good chance that the block is
> still in the OS cache.

That analysis does seem sound to me.

> In addition to saving a little bit of memory, I'd like to do this
> refactoring because it simplifies the code. It's code that has stayed
> largely unchanged for the past 15 years, so I'm not too eager to touch it,
> but looking at the changes coming with Peter's parallel tuplesort patch set,
> I think this makes sense.

I can definitely see value in refactoring, to make that code less
complicated -- I just don't think it's justified by the amount of
memory that is wasted on tapes.

That said, I'm not especially worried about the complexity within the
logtape.c changes of the parallel CREATE INDEX patch alone. I'm much
more interested in having a logtape.c that could be more easily made
to support binary searching, etc, to find *partition* boundaries,
which my CREATE INDEX patch doesn't use or care about. This is needed
for tuplesort.c partition-based sorting. When parallel sort isn't just
used by CREATE INDEX, partitioning becomes important. And, with
partitioning, dynamic sampling is essential (I think that this will
end up living in logtape.c).

To recap what I went into in the partitioning-to-parallel-tuplesort
thread [1], I think that partitioning will come in a future release,
and will be of more value to parallel queries, where much more can be
pushed down wi

Re: [HACKERS] Logical tape pause/resume

2016-10-04 Thread Heikki Linnakangas

On 10/04/2016 07:06 PM, Claudio Freire wrote:

On Tue, Oct 4, 2016 at 7:47 AM, Heikki Linnakangas  wrote:

1. The first patch changes the way we store the logical tapes on disk.
Instead of using indirect blocks, HFS style, the blocks form a linked list.
Each block has a trailer, with the block numbers of the previous and next
block of the tape. That eliminates the indirect blocks, which simplifies the
code quite a bit, and makes it simpler to implement the pause/resume
functionality in the second patch. It also immediately reduces the memory
needed for the buffers, from 3 to 1 block per tape, as we don't need to hold
the indirect blocks in memory.


Doesn't that make prefetching more than a buffer ahead harder?


Yes, indeed. That's a potential downside, if we wanted to do that. We 
rely wholly on OS readahead for that currently, so it doesn't matter. I 
don't think we want to start issuing explicit pre-fetching calls here 
any time soon. But if we did, we could presume that the pages are in 
sequential order for prefetching purposes, and that would work pretty 
well in practice (that's what the OS readahead is doing for us now). Or 
we could try to enforce that by allocating blocks in larger chunks.


In short, I'm OK with that.
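
(If we ever did want explicit prefetching, and block allocation were kept or
made mostly sequential, the hint itself could be as small as this hypothetical
helper -- a sketch only, not something I'm proposing:)

#define _XOPEN_SOURCE 600
#include <fcntl.h>

#define BLCKSZ 8192

/*
 * Hypothetical helper: ask the kernel to read ahead the next few blocks,
 * on the assumption that this tape's blocks are laid out sequentially in
 * the temp file.
 */
static void
prefetch_blocks(int fd, long next_blocknum, int nblocks)
{
#ifdef POSIX_FADV_WILLNEED
	(void) posix_fadvise(fd,
						 (off_t) next_blocknum * BLCKSZ,
						 (off_t) nblocks * BLCKSZ,
						 POSIX_FADV_WILLNEED);
#else
	(void) fd;
	(void) next_blocknum;
	(void) nblocks;
#endif
}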

- Heikki





Re: [HACKERS] Logical tape pause/resume

2016-10-04 Thread Claudio Freire
On Tue, Oct 4, 2016 at 7:47 AM, Heikki Linnakangas  wrote:
> 1. The first patch changes the way we store the logical tapes on disk.
> Instead of using indirect blocks, HFS style, the blocks form a linked list.
> Each block has a trailer, with the block numbers of the previous and next
> block of the tape. That eliminates the indirect blocks, which simplifies the
> code quite a bit, and makes it simpler to implement the pause/resume
> functionality in the second patch. It also immediately reduces the memory
> needed for the buffers, from 3 to 1 block per tape, as we don't need to hold
> the indirect blocks in memory.

Doesn't that make prefetching more than a buffer ahead harder?




Re: [HACKERS] Logical tape pause/resume

2016-10-04 Thread Heikki Linnakangas

On 10/04/2016 05:58 PM, Simon Riggs wrote:

On 4 October 2016 at 12:47, Heikki Linnakangas  wrote:


Why not just make each new run start at a block boundary?
That way we waste on average BLCKSZ/2 disk space per run, which is
negligible but we avoid any need to have code to read back in the last
block.


Hmm. You'd still have to read back the last block, so that you can update
its next-pointer.


If each run is in its own file, then you can skip that bit.


Then you need to remember the names of the files, in memory. That's 
closer to my idea of having only one run on each tape, but you're taking 
it further by also saying that each tape is a separate file.


That might be a good idea for building the initial runs. Each file would 
then be written and read sequentially, so we could perhaps get rid of 
the whole pre-reading code, and rely completely on OS read-ahead.


However, we can't really do that for multi-pass merges, because you want to 
reuse the space of the input tapes for the output tape, as you go.



And we do want the sort to disk to use multiple files so we can
parallelize I/O as well as CPU.


Huh? Why can't you parallelize I/O on a single file?

- Heikki





Re: [HACKERS] Logical tape pause/resume

2016-10-04 Thread Simon Riggs
On 4 October 2016 at 12:47, Heikki Linnakangas  wrote:

>> Why not just make each new run start at a block boundary?
>> That way we waste on average BLCKSZ/2 disk space per run, which is
>> negligible but we avoid any need to have code to read back in the last
>> block.
>
>
> Hmm. You'd still have to read back the last block, so that you can update
> its next-pointer.

If each run is in its own file, then you can skip that bit.

And we do want the sort to disk to use multiple files so we can
parallelize I/O as well as CPU.

So since we know we'll want multiple files, we should be thinking
about how to split things up between files.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Logical tape pause/resume

2016-10-04 Thread Heikki Linnakangas

On 10/04/2016 02:09 PM, Simon Riggs wrote:

On 4 October 2016 at 11:47, Heikki Linnakangas  wrote:


When we finish writing an initial run to a tape, we keep the tape buffers
around. That's the reason we need to reserve that memory. But there's no
fundamental reason we have to do that. As I suggested in [2], we could flush
the buffers to disk, and only keep in memory the location of the last block
we wrote. If we need to write another run to the tape, we can reload the
last incomplete block from the disk, and continue writing.

Reloading the last block, requires an extra I/O. That's OK. It's quite rare
to have a multi-pass sort these days, so it's a good bet that you don't need
to do it. And if you have a very small work_mem, so that you need to do a
multi-pass sort, having some more memory available for building the initial
runs is probably worth it, and there's also a good chance that the block is
still in the OS cache.


Sounds like a good idea.

Why not just make each new run start at a block boundary?
That way we waste on average BLCKSZ/2 disk space per run, which is
negligible but we avoid any need to have code to read back in the last
block.


Hmm. You'd still have to read back the last block, so that you can 
update its next-pointer.


What you could do, though, is to always store only one run on each tape. 
If we can make the in-memory representation of a tape small enough, that 
might be acceptable. You only need 4 bytes to store the starting block 
number. Or we could dump the starting block numbers of the tapes, to yet 
another tape. The number of tapes you merge in one pass would still be 
limited by the amount of memory you have, but the number of tapes would 
be unlimited.


I don't want to do that right now; it gets more invasive. Something to 
keep in mind for the future, though. This is also related to your other 
suggestion below:



That also makes it easier to put each run in its own file, which will
surely help with parallelism and compression since we can spread
multiple run files across multiple temp tablespaces.

Can we also please read in the value from the last tuple in a run, so
we have min and max values in memory for each run? That can be used
during the merge step to avoid merging runs unless the value ranges
overlap.


This gets more advanced... Yeah, stuff like that would make sense. You 
could also split the tapes into smaller parts, and store the min/max for 
each part, to make the merging even more efficient and easier to 
parallelize. Or to take that to the extreme, you could make each tape a 
little B-tree.


But that's for another day and another patch :-). What I have now is a 
drop-in replacement for the current logical tape implementation. These 
more advanced schemes would build on top of that.


- Heikki




Re: [HACKERS] Logical tape pause/resume

2016-10-04 Thread Simon Riggs
On 4 October 2016 at 11:47, Heikki Linnakangas  wrote:

> When we finish writing an initial run to a tape, we keep the tape buffers
> around. That's the reason we need to reserve that memory. But there's no
> fundamental reason we have to do that. As I suggested in [2], we could flush
> the buffers to disk, and only keep in memory the location of the last block
> we wrote. If we need to write another run to the tape, we can reload the
> last incomplete block from the disk, and continue writing.
>
> Reloading the last block, requires an extra I/O. That's OK. It's quite rare
> to have a multi-pass sort these days, so it's a good bet that you don't need
> to do it. And if you have a very small work_mem, so that you need to do a
> multi-pass sort, having some more memory available for building the initial
> runs is probably worth it, and there's also a good chance that the block is
> still in the OS cache.

Sounds like a good idea.

Why not just make each new run start at a block boundary?
That way we waste on average BLCKSZ/2 disk space per run, which is
negligible but we avoid any need to have code to read back in the last
block.

That also makes it easier to put each run in its own file, which will
surely help with parallelism and compression since we can spread
multiple run files across multiple temp tablespaces.

Can we also please read in the value from the last tuple in a run, so
we have min and max values in memory for each run? That can be used
during the merge step to avoid merging runs unless the value ranges
overlap.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

