Re: [HACKERS] Polyphase merge is obsolete

2017-09-11 Thread Tomas Vondra
Hi,

I planned to do some benchmarking on this patch, but apparently the
patch no longer applies. Rebase please?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Polyphase merge is obsolete

2017-08-30 Thread Peter Geoghegan
On Wed, Aug 30, 2017 at 12:48 PM, Robert Haas  wrote:
> These are separate topics.  They should each be discussed on their own
> thread.  Please don't hijack this thread to talk about something else.

I don't think that that is a fair summary.

Heikki has done a number of things in this area, of which this is only
the latest. I'm saying "hey, have you thought about RS too?". Whether
or not I'm "hijacking" this thread is, at best, ambiguous.

-- 
Peter Geoghegan




Re: [HACKERS] Polyphase merge is obsolete

2017-08-30 Thread Robert Haas
On Wed, Aug 30, 2017 at 2:54 PM, Peter Geoghegan  wrote:
> I noticed that this is in the upcoming CF 1 for v11. I'm signed up to review.
>
> I'd like to point out that replacement selection is also obsolete,
> which is something I brought up recently [1]. I don't actually have
> any feature-driven reason to want to kill replacement selection - it's
> just an annoyance at this point. I do think that RS is more deserving
> of being killed than Polyphase merge, because it actually costs users
> something to continue to support it. The replacement_sort_tuples GUC
> particularly deserves to be removed.
>
> It would be nice if killing RS was put in scope here. I'd appreciate
> it, at least, since it would simplify the heap routines noticeably.
> The original analysis that led to adding replacement_sort_tuples was
> based on certain performance characteristics of merging that have
> since changed by quite a bit, due to our work for v10.

These are separate topics.  They should each be discussed on their own
thread.  Please don't hijack this thread to talk about something else.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Polyphase merge is obsolete

2017-08-30 Thread Peter Geoghegan
On Mon, Feb 27, 2017 at 2:45 PM, Peter Geoghegan  wrote:
> Since we have an awful lot of stuff in the last CF, and this patch
> doesn't seem particularly strategic, I've marked it "Returned with
> Feedback".

I noticed that this is in the upcoming CF 1 for v11. I'm signed up to review.

I'd like to point out that replacement selection is also obsolete,
which is something I brought up recently [1]. I don't actually have
any feature-driven reason to want to kill replacement selection - it's
just an annoyance at this point. I do think that RS is more deserving
of being killed than Polyphase merge, because it actually costs users
something to continue to support it. The replacement_sort_tuples GUC
particularly deserves to be removed.

It would be nice if killing RS was put in scope here. I'd appreciate
it, at least, since it would simplify the heap routines noticeably.
The original analysis that led to adding replacement_sort_tuples was
based on certain performance characteristics of merging that have
since changed by quite a bit, due to our work for v10.

[1] 
postgr.es/m/cah2-wzmmnjg_k0r9nqywmq3zjyjjk+hcbizynghay-zyjs6...@mail.gmail.com
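
For readers unfamiliar with replacement selection: it forms initial runs using a heap, deferring any tuple that sorts below the last value written to the next run, which on random input yields runs roughly twice the size of available memory. A minimal Python sketch of the idea (illustrative only; the tuplesort.c heap routines under discussion are considerably more involved):

```python
import heapq

def replacement_selection(values, memory_slots):
    """Form sorted runs via replacement selection (Knuth 5.4.1).

    A value smaller than the last one written cannot join the current
    run, so it is set aside for the next run. On random input the
    runs come out roughly 2 * memory_slots long.
    """
    it = iter(values)
    heap = []                      # values eligible for the current run
    for v in it:                   # fill available memory
        heap.append(v)
        if len(heap) == memory_slots:
            break
    heapq.heapify(heap)

    runs, current, pending = [], [], []
    for v in it:
        smallest = heapq.heappop(heap)
        current.append(smallest)
        if v >= smallest:
            heapq.heappush(heap, v)    # still fits in the current run
        else:
            pending.append(v)          # deferred to the next run
        if not heap:                   # current run is complete
            runs.append(current)
            current, heap, pending = [], pending, []
            heapq.heapify(heap)
    current.extend(sorted(heap))       # drain the leftovers
    if current:
        runs.append(current)
    if pending:
        runs.append(sorted(pending))
    return runs
```

On already-sorted input the heap never empties, so the whole input becomes a single run (the case where replacement selection historically shined); on reverse-sorted input, runs degrade to roughly memory_slots values each.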
-- 
Peter Geoghegan




Re: [HACKERS] Polyphase merge is obsolete

2017-02-27 Thread Peter Geoghegan
On Mon, Jan 16, 2017 at 5:56 PM, Peter Geoghegan  wrote:
> On Wed, Oct 12, 2016 at 10:16 AM, Heikki Linnakangas  wrote:
>> The number of *input* tapes we can use in each merge pass is still limited,
>> by the memory needed for the tape buffers and the merge heap, but only one
>> output tape is active at any time. The inactive output tapes consume very
>> few resources. So the whole idea of trying to efficiently reuse input tapes
>> as output tapes is pointless
>
> I picked this up again. The patch won't apply cleanly. Can you rebase?

Since we have an awful lot of stuff in the last CF, and this patch
doesn't seem particularly strategic, I've marked it "Returned with
Feedback".

Thanks
-- 
Peter Geoghegan




Re: [HACKERS] Polyphase merge is obsolete

2017-01-16 Thread Peter Geoghegan
On Wed, Oct 12, 2016 at 10:16 AM, Heikki Linnakangas  wrote:
> The number of *input* tapes we can use in each merge pass is still limited,
> by the memory needed for the tape buffers and the merge heap, but only one
> output tape is active at any time. The inactive output tapes consume very
> few resources. So the whole idea of trying to efficiently reuse input tapes
> as output tapes is pointless

I picked this up again. The patch won't apply cleanly. Can you rebase?
Also, please look at my bugfix for logtape.c free block management [1]
before doing so, as that might be prerequisite. Finally, I don't think
that the Logical tape pause/resume idea is compelling. Is it hard to
not do that, but still do everything else that you propose here?
That's what I lean towards doing right now.

Anyway, efficient use of tapes certainly mattered a lot more when
rewinding meant sitting around for a magnetic tape deck to physically
rewind. There is another algorithm in AoCP Vol III that lets us write
to tapes backwards, actually, which is motivated by similar obsolete
considerations about hardware. Why not write while we rewind, to avoid
doing nothing else while rewinding?!

Perhaps this patch should make a clean break from the "rewinding"
terminology. Perhaps you should rename LogicalTapeRewindForRead() to
LogicalTapePrepareForRead(), and so on. It's already a bit awkward
that that routine is called LogicalTapeRewindForRead(), because it
behaves significantly differently when a tape is frozen, and because
the whole point of logtape.c is space reuse that is completely
dissimilar to rewinding. (Space reuse is thus totally unlike how
polyphase merge is supposed to reuse space, which is all about
rewinding, and isn't nearly as eager. Same applies to K-way balanced
merge, of course.)

I think that the "rewinding" terminology does more harm than good, now
that it doesn't even help the Knuth reader to match Knuth's remarks to
what's going on in tuplesort.c. Just a thought.

> Let's switch over to a simple k-way balanced merge. Because it's simpler. If
> you're severely limited on memory, like when sorting 1GB of data with
> work_mem='1MB' or less, it's also slightly faster. I'm not too excited about
> the performance aspect, because in the typical case of a single-pass merge,
> there's no difference. But it would be worth changing on simplicity grounds,
> since we're mucking around in tuplesort.c anyway.

I actually think that the discontinuities in the merge scheduling are
worse than you suggest here. There doesn't have to be as extreme a
difference between work_mem and the size of input as you describe
here. As an example:

create table seq_tab(t int);
insert into seq_tab select generate_series(1, 1000);
set work_mem = '4MB';
select count(distinct t) from seq_tab;

The trace_sort output ends like this:

30119/2017-01-16 17:17:05 PST LOG:  begin datum sort: workMem = 4096,
randomAccess = f
30119/2017-01-16 17:17:05 PST LOG:  switching to external sort with 16
tapes: CPU: user: 0.07 s, system: 0.00 s, elapsed: 0.06 s
30119/2017-01-16 17:17:05 PST LOG:  starting quicksort of run 1: CPU:
user: 0.07 s, system: 0.00 s, elapsed: 0.06 s
30119/2017-01-16 17:17:05 PST LOG:  finished quicksort of run 1: CPU:
user: 0.07 s, system: 0.00 s, elapsed: 0.07 s
*** SNIP ***
30119/2017-01-16 17:17:08 PST LOG:  finished writing run 58 to tape 0:
CPU: user: 2.50 s, system: 0.27 s, elapsed: 2.78 s
30119/2017-01-16 17:17:08 PST LOG:  using 4095 KB of memory for read
buffers among 15 input tapes
30119/2017-01-16 17:17:08 PST LOG:  finished 1-way merge step: CPU:
user: 2.52 s, system: 0.28 s, elapsed: 2.80 s
30119/2017-01-16 17:17:08 PST LOG:  finished 4-way merge step: CPU:
user: 2.58 s, system: 0.30 s, elapsed: 2.89 s
30119/2017-01-16 17:17:08 PST LOG:  finished 14-way merge step: CPU:
user: 2.86 s, system: 0.34 s, elapsed: 3.20 s
30119/2017-01-16 17:17:08 PST LOG:  finished 14-way merge step: CPU:
user: 3.09 s, system: 0.41 s, elapsed: 3.51 s
30119/2017-01-16 17:17:09 PST LOG:  finished 15-way merge step: CPU:
user: 3.61 s, system: 0.52 s, elapsed: 4.14 s
30119/2017-01-16 17:17:09 PST LOG:  performsort done (except 15-way
final merge): CPU: user: 3.61 s, system: 0.52 s, elapsed: 4.14 s
30119/2017-01-16 17:17:10 PST LOG:  external sort ended, 14678 disk
blocks used: CPU: user: 4.93 s, system: 0.57 s, elapsed: 5.51 s

(This is the test case that Cy posted earlier today, for the bug that
was just fixed in master.)
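
For context on the trace above: with a balanced k-way merge, each pass cuts the run count by a factor of up to k, so the pass count is fixed by the run and tape counts. A quick illustrative calculation (my own arithmetic on the numbers from the trace, not trace_sort output):

```python
import math

def balanced_merge_passes(nruns, k):
    """Full passes a balanced k-way merge needs to reach a single run."""
    passes = 0
    while nruns > 1:
        nruns = math.ceil(nruns / k)   # each pass merges groups of k runs
        passes += 1
    return passes

# 58 runs merged 15 at a time: pass 1 leaves ceil(58/15) = 4 runs,
# pass 2 merges those 4 into the final output.
print(balanced_merge_passes(58, 15))  # -> 2
```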

The first 1-way merge step is clearly kind of a waste of time. We
incur no actual comparisons during this "merge", since there is only
one real run from each input tape (all other active tapes contain only
dummy runs). We are, in effect, just shoveling the tuples from that
single run from one tape to another (from one range in the underlying
logtape.c BufFile space to another). I've seen this quite a lot
before, over the years, while working on sorting. It's not that big of
a deal, but it's a degenerate case that 

Re: [HACKERS] Polyphase merge is obsolete

2016-10-14 Thread Peter Geoghegan
On Wed, Oct 12, 2016 at 10:16 AM, Heikki Linnakangas  wrote:
> Let's switch over to a simple k-way balanced merge. Because it's simpler. If
> you're severely limited on memory, like when sorting 1GB of data with
> work_mem='1MB' or less, it's also slightly faster. I'm not too excited about
> the performance aspect, because in the typical case of a single-pass merge,
> there's no difference. But it would be worth changing on simplicity grounds,
> since we're mucking around in tuplesort.c anyway.

This analysis seems sound. I suppose we might as well simplify things
while we're at it.


-- 
Peter Geoghegan




Re: [HACKERS] Polyphase merge is obsolete

2016-10-12 Thread Heikki Linnakangas

On 10/12/2016 08:27 PM, Tom Lane wrote:
> Heikki Linnakangas  writes:
>> The beauty of the polyphase merge algorithm is that it allows reusing
>> input tapes as output tapes efficiently ... So the whole idea of trying to
>> efficiently reuse input tapes as output tapes is pointless.
>
> It's been awhile since I looked at that code, but I'm quite certain that
> it *never* thought it was dealing with actual tapes.  Rather, the point of
> sticking with polyphase merge was that it allowed efficient incremental
> re-use of temporary disk files, so that the maximum on-disk footprint was
> only about equal to the volume of data to be sorted, rather than being a
> multiple of that.  Have we thrown that property away?

No, there's no difference to that behavior. logtape.c takes care of 
incremental re-use of disk space, regardless of the merging pattern.


- Heikki





Re: [HACKERS] Polyphase merge is obsolete

2016-10-12 Thread Tom Lane
Heikki Linnakangas  writes:
> The beauty of the polyphase merge algorithm is that it allows reusing 
> input tapes as output tapes efficiently ... So the whole idea of trying to 
> efficiently reuse input tapes as output tapes is pointless.

It's been awhile since I looked at that code, but I'm quite certain that
it *never* thought it was dealing with actual tapes.  Rather, the point of
sticking with polyphase merge was that it allowed efficient incremental
re-use of temporary disk files, so that the maximum on-disk footprint was
only about equal to the volume of data to be sorted, rather than being a
multiple of that.  Have we thrown that property away?

regards, tom lane




[HACKERS] Polyphase merge is obsolete

2016-10-12 Thread Heikki Linnakangas
The beauty of the polyphase merge algorithm is that it allows reusing 
input tapes as output tapes efficiently. So if you have N tape drives, 
you can keep them all busy throughout the merge.


That doesn't matter when we can easily have as many "tape drives" as we 
want. In PostgreSQL, a tape drive consumes just a few kB of memory for 
the buffers. With the patch being discussed to allow a tape to be 
"paused" between write passes [1], we don't even keep the tape buffers 
around when a tape is not actively being read or written to, so all it 
consumes is the memory needed for the LogicalTape struct.


The number of *input* tapes we can use in each merge pass is still 
limited, by the memory needed for the tape buffers and the merge heap, 
but only one output tape is active at any time. The inactive output 
tapes consume very few resources. So the whole idea of trying to 
efficiently reuse input tapes as output tapes is pointless.


Let's switch over to a simple k-way balanced merge. Because it's 
simpler. If you're severely limited on memory, like when sorting 1GB of 
data with work_mem='1MB' or less, it's also slightly faster. I'm not too 
excited about the performance aspect, because in the typical case of a 
single-pass merge, there's no difference. But it would be worth changing 
on simplicity grounds, since we're mucking around in tuplesort.c anyway.
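
The balanced k-way merge being proposed can be sketched in a few lines: merge runs in groups of up to k, repeating until a single run remains. An illustrative Python sketch (heapq.merge stands in for the tuplesort merge heap; this is not the attached patch):

```python
import heapq

def balanced_kway_sort(runs, k):
    """Merge sorted runs in groups of up to k until one run remains."""
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), k):
            # heapq.merge lazily interleaves up to k sorted inputs,
            # playing the role of the merge heap in tuplesort.c.
            merged.append(list(heapq.merge(*runs[i:i + k])))
        runs = merged
    return runs[0] if runs else []

runs = [[1, 4, 9], [2, 3], [5, 8], [6, 7], [0, 10]]
print(balanced_kway_sort(runs, 3))  # -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

With 5 runs and k=3, the first pass leaves 2 runs and the second pass produces the final output, with no tape-assignment bookkeeping at all.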


I came up with the attached patch to do that. This patch applies on top 
of my latest "Logical tape pause/resume" patches [1]. It includes 
changes to the logtape.c interface that make it possible to create and 
destroy LogicalTapes in a tapeset on the fly. I believe this will also 
come in handy for Peter's parallel tuplesort patch set.


[1] Logical tape pause/resume, 
https://www.postgresql.org/message-id/55b3b7ae-8dec-b188-b8eb-e07604052351%40iki.fi


PS. I finally bit the bullet and got myself a copy of The Art of Computer 
Programming, Vol 3, 2nd edition. In section 5.4 on External Sorting, 
Knuth says:


"
When this book was first written, magnetic tapes were abundant and disk 
drives were expensive. But disks became enormously better during the 
1980s, and by the late 1990s they had almost completely replaced 
magnetic tape units on most of the world's computer systems. Therefore 
the once-crucial topic of tape merging has become of limited relevance 
to current needs.


Yet many of the patterns are quite beautiful, and the associated 
algorithms reflect some of the best research done in computer science 
during its early days; the techniques are just too nice to be discarded 
abruptly onto the rubbish heap of history. Indeed, the ways in which 
these methods blend theory with practice are especially instructive. 
Therefore merging patterns are discussed carefully and completely below, 
in what may be their last grand appearance before they accept the final 
curtain call.

"

Yep, the polyphase and other merge patterns are beautiful. I enjoyed 
reading through those sections. Now let's put them to rest in PostgreSQL.


- Heikki
>From 15169f32f99a69401d3565c8d0ff0d532d6b6638 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 12 Oct 2016 20:04:17 +0300
Subject: [PATCH 1/1] Replace polyphase merge algorithm with a simple balanced
 k-way merge.

The advantage of polyphase merge is that it can reuse the input tapes as
output tapes efficiently, but that is irrelevant on modern hardware, when
we can easily emulate any number of tape drives. The number of input tapes
we can/should use during merging is limited by work_mem, but output tapes
that we are not currently writing to only cost a little bit of memory, so
there is no need to skimp on them.

Refactor LogicalTapeSet/LogicalTape interface. All the tape functions,
like LogicalTapeRead and LogicalTapeWrite, take a LogicalTape as argument,
instead of LogicalTapeSet+tape number. You can create any number of
LogicalTapes in a single LogicalTapeSet, and you don't need to decide the
number upfront, when you create the tape set.
---
 src/backend/utils/sort/logtape.c   | 223 +
 src/backend/utils/sort/tuplesort.c | 665 +++--
 src/include/utils/logtape.h|  37 ++-
 3 files changed, 366 insertions(+), 559 deletions(-)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 1f540f0..2370fa5 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -118,6 +118,8 @@ typedef struct TapeBlockTrailer
  */
 typedef struct LogicalTape
 {
+	LogicalTapeSet *tapeSet;	/* tape set this tape is part of */
+
 	bool		writing;		/* T while in write phase */
 	bool		paused;			/* T if the tape is paused */
 	bool		frozen;			/* T if blocks should not be freed when read */
@@ -173,10 +175,6 @@ struct LogicalTapeSet
 	long	   *freeBlocks;		/* resizable array */
 	int			nFreeBlocks;	/* # of currently free blocks */
 	int			freeBlocksLen;	/* current allocated length of freeBlocks[] */
-
-	/* The 

Re: [HACKERS] polyphase merge?

2009-02-04 Thread Simon Riggs

On Wed, 2009-02-04 at 14:42 +0800, Don Marvick wrote:
> 4. any other issue needs consideration?

Most attempts to improve sorting further have fallen to nothing because
of the lack of repeatable test results. In particular, coming up with
test cases *after* writing the code is a good way to get lost in
discussions and have your work passed over.

The best starting point is to collect a number of test cases, both
typical and extreme cases. Do some research to show that typical cases
really are that and be prepared for some expressions of reasonable
doubt. Publish that data so test results can be objectively verified and
then measure those cases on existing code, with various settings of
tunable parameters.

Then it will be a simple matter to prove your changes are effective in
target cases without damaging other cases. We would also want the
changes to work automatically without additional tunable parameters.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] polyphase merge?

2009-02-04 Thread Simon Riggs

On Wed, 2009-02-04 at 10:55 +, Greg Stark wrote:
> Is this basically the same as our current algorithm but without
> multiplexing the tapes onto single files? I have been wondering
> whether we multiplex the tapes any better than filesystems can lay out
> separate files actually.

I don't think you'll be able to do that more efficiently than we already
do. Read the top of tuplesort.c

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [HACKERS] polyphase merge?

2009-02-04 Thread Tom Lane
Greg Stark st...@enterprisedb.com writes:
> Is this basically the same as our current algorithm but without
> multiplexing the tapes onto single files? I have been wondering
> whether we multiplex the tapes any better than filesystems can lay out
> separate files actually.

The reason for the multiplexing is so that space can get re-used
quickly.  If each tape were represented as a separate file, there would
be no way to release blocks as they're read; you could only give back
the whole file after reaching end of tape.  Which would at least double
the amount of disk space needed to sort X amount of data.  (It's
actually even worse, more like 4X, though the multiplier might depend on
the number of tapes --- I don't recall the details anymore.)

The penalty we pay is that in the later merge passes, the blocks
representing a single tape aren't very well ordered.

It might be interesting to think about some compromise that wastes a
little more space in order to get better sequentiality of disk access.
It'd be easy to do if we were willing to accept a 2X space penalty,
but I'm not sure if that would fly or not.  It definitely *wasn't*
acceptable to the community a few years ago when the current code was
written.  Disks have gotten bigger since then, but so have the problems
people want to solve.

regards, tom lane



Re: [HACKERS] polyphase merge?

2009-02-04 Thread Tom Lane
Don Marvick donmarv...@gmail.com writes:
> Since nobody replied, I would give it a try. I am going to implement the
> merge pattern described in Knuth Page 365 (5.4.9), essentially it is as
> follow:
> - create initial runs using replacement selection (basically this is as in
> the current implementation)
> - add enough dummy runs of size 0 so that the number of sorted run minus one
> can be divided by k-1 (k is merge fan in)
> - repetitively merge k smallest runs until only 1 run left

How is this better than, or even different from, what is there now?

regards, tom lane



Re: [HACKERS] polyphase merge?

2009-02-04 Thread Gregory Stark
Simon Riggs si...@2ndquadrant.com writes:

> On Wed, 2009-02-04 at 10:55 +, Greg Stark wrote:
>> Is this basically the same as our current algorithm but without
>> multiplexing the tapes onto single files? I have been wondering
>> whether we multiplex the tapes any better than filesystems can lay out
>> separate files actually.
>
> I don't think you'll be able to do that more efficiently than we already
> do. Read the top of tuplesort.c

Huh?

The question I posed was whether we do it any better than filesystems do, not
whether we could do a better job. If filesystems can do as good a job then we
might as well create separate files for each tape and let the filesystem
decide where to allocate space. That would mean we could massively simplify
logtape.c and just create a separate file for every tape.

Possible reasons that filesystems could do better than us are that they have
access to more information about actual storage layout on disk, that they have
more information about hardware characteristics, and that it would give the
filesystem cache a better idea of the access pattern which logtape.c might be
confusing the picture of currently.

On the other hand possible reasons why filesystems would suck at this include
creating and deleting new files being a slow or locking operation on many
filesystems, and dealing with directories of large numbers of files being
poorly optimized on others.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!



Re: [HACKERS] polyphase merge?

2009-02-04 Thread Greg Stark
Is this basically the same as our current algorithm but without
multiplexing the tapes onto single files? I have been wondering
whether we multiplex the tapes any better than filesystems can lay out
separate files actually.

Or is there something else that's different here?

-- 
greg



Re: [HACKERS] polyphase merge?

2009-02-04 Thread Don Marvick
Good point. I would note this issue. Thanks for the advice :).

Regards,

Don

On Wed, Feb 4, 2009 at 6:09 PM, Simon Riggs si...@2ndquadrant.com wrote:

> On Wed, 2009-02-04 at 14:42 +0800, Don Marvick wrote:
>
>> 4. any other issue needs consideration?
>
> Most attempts to improve sorting further have fallen to nothing because
> of the lack of repeatable test results. In particular, coming up with
> test cases *after* writing the code is a good way to get lost in
> discussions and have your work passed over.
>
> The best starting point is to collect a number of test cases, both
> typical and extreme cases. Do some research to show that typical cases
> really are that and be prepared for some expressions of reasonable
> doubt. Publish that data so test results can be objectively verified and
> then measure those cases on existing code, with various settings of
> tunable parameters.
>
> Then it will be a simple matter to prove your changes are effective in
> target cases without damaging other cases. We would also want the
> changes to work automatically without additional tunable parameters.
>
> --
>  Simon Riggs   www.2ndQuadrant.com
>  PostgreSQL Training, Services and Support




Re: [HACKERS] polyphase merge?

2009-02-04 Thread Don Marvick
This is what I understand from the existing implementation:
- fix the merge fan-in F, and the number of tapes = F+1
- for each sorted run generated, a formula determines which tape it
belongs to
- the sorted run is added to the end of that tape
- once all sorted runs have been distributed, there is always one empty
tape, called the output tape
- add dummy runs to each tape, if necessary
- merge the input tapes and write the result to the output tape, until one
of the input tapes is empty
- repeat this process until 1 sorted run remains


I think the main difference is that at each step, the current implementation
does not merge the smallest k runs; it merges from the beginning of each
tape. For 1 pass, this makes no difference. From 2 passes onwards there are
some differences. In some complex queries, where memory is consumed by other
operators, more passes are needed and the differences become larger. This is
the only improvement that can be achieved. Let me know if this does not make
sense.

Regarding the disk layout, I am open to suggestions on whether we should
reuse the tape module or not.



On Wed, Feb 4, 2009 at 11:20 PM, Tom Lane t...@sss.pgh.pa.us wrote:

> Don Marvick donmarv...@gmail.com writes:
>> Since nobody replied, I would give it a try. I am going to implement the
>> merge pattern described in Knuth Page 365 (5.4.9), essentially it is as
>> follow:
>> - create initial runs using replacement selection (basically this is as in
>> the current implementation)
>> - add enough dummy runs of size 0 so that the number of sorted run minus one
>> can be divided by k-1 (k is merge fan in)
>> - repetitively merge k smallest runs until only 1 run left
>
> How is this better than, or even different from, what is there now?
>
> regards, tom lane



Re: [HACKERS] polyphase merge?

2009-02-04 Thread Don Marvick
Hmm, it's probably worth trying this. Has anybody tried this before? If not,
I am interested in giving it a try.

Regards,

Don

On Wed, Feb 4, 2009 at 11:25 PM, Gregory Stark st...@enterprisedb.com wrote:

> Simon Riggs si...@2ndquadrant.com writes:
>
>> On Wed, 2009-02-04 at 10:55 +, Greg Stark wrote:
>>> Is this basically the same as our current algorithm but without
>>> multiplexing the tapes onto single files? I have been wondering
>>> whether we multiplex the tapes any better than filesystems can lay out
>>> separate files actually.
>>
>> I don't think you'll be able to do that more efficiently than we already
>> do. Read the top of tuplesort.c
>
> Huh?
>
> The question I posed was whether we do it any better than filesystems do,
> not whether we could do a better job. If filesystems can do as good a job
> then we might as well create separate files for each tape and let the
> filesystem decide where to allocate space. That would mean we could
> massively simplify logtape.c and just create a separate file for every tape.
>
> Possible reasons that filesystems could do better than us are that they
> have access to more information about actual storage layout on disk, that
> they have more information about hardware characteristics, and that it
> would give the filesystem cache a better idea of the access pattern which
> logtape.c might be confusing the picture of currently.
>
> On the other hand possible reasons why filesystems would suck at this
> include creating and deleting new files being a slow or locking operation
> on many filesystems, and dealing with directories of large numbers of files
> being poorly optimized on others.
>
> --
>  Gregory Stark
>  EnterpriseDB  http://www.enterprisedb.com
>  Ask me about EnterpriseDB's RemoteDBA services!



Re: [HACKERS] polyphase merge?

2009-02-03 Thread Don Marvick
Dear All,

Since nobody replied, I would give it a try. I am going to implement the
merge pattern described in Knuth Page 365 (5.4.9), essentially it is as
follow:
- create initial runs using replacement selection (basically this is as in
the current implementation)
- add enough dummy runs of size 0 so that the number of sorted run minus one
can be divided by k-1 (k is merge fan in)
- repetitively merge k smallest runs until only 1 run left
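
The dummy-run step above can be made concrete: pad with empty runs so that (number of runs - 1) is divisible by (k - 1), then always merge the k smallest remaining runs. An illustrative Python sketch over run lengths (my own illustration of the merge-cost arithmetic, not a proposed implementation):

```python
import heapq

def knuth_549_merge(run_lengths, k):
    """Sketch of the Knuth 5.4.9 pattern over run *lengths*.

    Pad with dummy (length-0) runs so that (n - 1) is divisible by
    (k - 1); each k-way step then removes k runs and adds 1 back,
    landing on exactly one run. Always merging the k smallest runs
    minimizes the tuples moved, as in k-ary Huffman coding.
    """
    n = len(run_lengths)
    dummies = (k - 1 - (n - 1) % (k - 1)) % (k - 1)
    heap = list(run_lengths) + [0] * dummies
    heapq.heapify(heap)
    moved = 0                        # total tuples read and rewritten
    while len(heap) > 1:
        step = sum(heapq.heappop(heap) for _ in range(k))
        moved += step
        heapq.heappush(heap, step)
    return heap[0], moved

# Four runs of 3, 5, 2, and 4 tuples, fan-in 3: one dummy run is added,
# then two merge steps move 5 and 14 tuples respectively.
print(knuth_549_merge([3, 5, 2, 4], 3))  # -> (14, 19)
```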

I am new to postgresql, hence any advice would be appreciated.

Can anybody give me some advice on how it can be done?
1. How should a run be represented? Should I reuse the tape mechanism (e.g.
1 tape per run), or should I use a temp buffer?

2. How should memory be allocated? I assume I divide the memory equally
among the k runs, so each run gets M/k memory. Each read of a run would
then be M/k bytes.

3. Prefetching. We can precompute the read sequence of run blocks for the
entire merge process. Based on this, we know which blocks to prefetch at
any point in time.

3. Is it possible to perform read/write I/O while the merge is in
progress, so that we overlap I/O and computation?

4. any other issue needs consideration?

Best regards,

Don

On Thu, Jan 29, 2009 at 4:11 PM, Don Marvick donmarv...@gmail.com wrote:

> Dear All,
>
> I apologize if this has been discussed before. I have tried to search the
> mailing list and could not find an exact answer.
>
> Currently, Postgres uses Knuth's Algorithm 5.4.2D to perform external merge
> sort. IMHO the algorithm is designed for tapes.
>
> Why isn't the simple textbook merge pattern described in Knuth page 365
> (5.4.9) used for disk-based systems? The same merge pattern is also
> described in the Ramakrishnan textbook, and it is optimal if seek time is
> not counted, which of course is not the real-world case.
>
> Best regards,
>
> Don




[HACKERS] polyphase merge?

2009-01-29 Thread Don Marvick
Dear All,

I apologize if this has been discussed before. I have tried to search the
mailing list and could not find an exact answer.

Currently, Postgres uses Knuth's Algorithm 5.4.2D to perform external merge
sort. IMHO the algorithm is designed for tapes.

Why isn't the simple textbook merge pattern described in Knuth page 365
(5.4.9) used for disk-based systems? The same merge pattern is also
described in the Ramakrishnan textbook, and it is optimal if seek time is
not counted, which of course is not the real-world case.

Best regards,

Don


Re: [HACKERS] Polyphase Merge

2008-01-22 Thread Sam Mason
On Mon, Jan 21, 2008 at 04:13:32PM -0500, Tom Lane wrote:
> Sam Mason [EMAIL PROTECTED] writes:
>> It's really up to you to find answers to these questions, especially
>> the first one.  Once you've designed an efficient algorithm then the
>> second point (which I'm interpreting as how you'd go about changing
>> tuplestore(?) so that things can be read in reverse order) should
>> just drop out as an implementation detail :) I'm guessing you'll
>> end up not reading the store in reverse order but arranging things
>> differently---it'll be interesting to see.
>
> I agree --- having to read the run back from external storage, only to
> write it out again with no further useful work done on it, sounds like
> a guaranteed loser.

Manolo's idea (wherever it came from) will generate longer runs in some
specific non-random data distributions (i.e. hopefully real life), but
it'll obviously only be a net win if the gain isn't offset by extra work
reordering data. It would be great if it could be made to work!
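For context, the baseline being improved on here is classic single-heap replacement selection, sketched below (an in-memory illustration, not the tuplesort.c code; `memory` counts elements rather than bytes). On random input it yields runs averaging about twice the memory size, and on already-sorted input a single run, which is the behavior the longer-run refinement tries to beat.

```python
import heapq

def replacement_selection(stream, memory):
    """Classic single-heap replacement selection: emit sorted runs.
    Elements smaller than the last value output cannot join the current
    run, so they are set aside to seed the next one."""
    heap = stream[:memory]          # fill working memory
    heapq.heapify(heap)
    rest = iter(stream[memory:])
    runs, run, next_heap = [], [], []
    while heap:
        x = heapq.heappop(heap)
        run.append(x)               # x extends the current run
        y = next(rest, None)
        if y is not None:
            if y >= x:
                heapq.heappush(heap, y)   # can still join this run
            else:
                next_heap.append(y)       # must wait for the next run
        if not heap:                # current run is finished
            runs.append(run)
            run = []
            heap = next_heap
            heapq.heapify(heap)
            next_heap = []
    if run:
        runs.append(run)
    return runs
```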

 To make this work you'll need some kind of ju-jitsu
 rearrangement that logically puts the run where it needs to go without
 physically moving any data.

yup, that's the fun part :)


  Sam

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Polyphase Merge

2008-01-22 Thread mac_man2005



--
From: Tom Lane [EMAIL PROTECTED]
Sent: Monday, January 21, 2008 10:13 PM
To: Sam Mason [EMAIL PROTECTED]
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Polyphase Merge



I agree --- having to read the run back from external storage, only to
write it out again with no further useful work done on it, sounds like
a guaranteed loser.  To make this work you'll need some kind of ju-jitsu
rearrangement that logically puts the run where it needs to go without
physically moving any data.


I'm not going to write it back with no useful work done on it. I would just 
write the runs in reverse order during run formation (ju-jitsu couldn't help 
me in this case) or read them in reverse order while merging (ju-jitsu may 
help... the point is that I'm not so good at ju-jitsu).


One idea could be to manage a list of pointers to the runs contained in the 
tapes. Any comments?




regards, tom lane






[HACKERS] Polyphase Merge

2008-01-21 Thread mac_man2005
I'm trying to refine the sorting module, tuplesort.c.

During run creation I use two heaps instead of just one (yeah, it's still 
me... the one with the two heaps, still trying to get some answers/help from 
-hackers).

The two runs are built in such a way that if we concatenate one of them to 
the other one read upside down, they still form a run (recall that, 
following Knuth, a run is a sorted sequence of data). There is a good chance 
that with this refinement, logical runs will be longer than the ordinary 
runs built by plain replacement selection. Note that we build physical runs; 
the "logical run" is just a concept used to understand why we build runs 
that way.

ISSUES
a) How to distribute logical runs (that is, both of their physical runs) 
onto tapes?
b) One of the two physical runs of a logical run is to be read upside down 
while merging: how to do that efficiently?

Well, that's all for now.

I hope you can give me a few seconds of your precious time; that would 
allow me to go on developing my refinement. Or at least tell me not to 
bother until the next PostgreSQL release is out (when will it be 
released?), or not to bother anymore, since nobody will ever answer me.

Thanks for your attention.
Manolo.
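To make the "logical run" idea concrete, here is a heapless Python sketch of one way such a pair of runs could be maintained during a single input pass. The boundary conditions are my guess at the invariant (everything in the descending run must stay at or below the first element of the ascending run, so that the descending run read upside down prepends cleanly), not necessarily the exact rule in Manolo's code:

```python
def form_logical_run(stream):
    """Split one pass over `stream` into an ascending run `up` and a
    descending run `down` such that reversed(down) + up is sorted
    (one "logical run").  Items fitting neither run are returned as
    leftovers that would seed the next logical run."""
    up, down, leftover = [], [], []
    for x in stream:
        if not up or x >= up[-1]:
            up.append(x)                # extends the ascending run
        elif x <= up[0] and (not down or x <= down[-1]):
            down.append(x)              # extends the descending run
        else:
            leftover.append(x)          # dead for both: next run
    return up, down, leftover
```

Since `down` is non-increasing and bounded above by `up[0]`, reading it upside down and concatenating `up` after it yields a single sorted sequence without physically rewriting either run, which is exactly issue (b) above.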

Re: [HACKERS] Polyphase Merge

2008-01-21 Thread Sam Mason
On Mon, Jan 21, 2008 at 07:42:24PM +0100, [EMAIL PROTECTED] wrote:
 During run creation I use two heaps instead of just one (yeah,
 it's still me... the one with the two heaps, still trying to get some
 answers/help from -hackers)

Hi again!

 ISSUES
 a) How to distribute logical runs (that is, both of their physical runs)
 onto tapes?
 b) One of the two physical runs of a logical run is to be read upside
 down while merging: how to do that efficiently?

It's really up to you to find answers to these questions, especially
the first one.  Once you've designed an efficient algorithm then the
second point (which I'm interpreting as how you'd go about changing
tuplestore(?) so that things can be read in reverse order) should
just drop out as an implementation detail :) I'm guessing you'll
end up not reading the store in reverse order but arranging things
differently---it'll be interesting to see.

What does your current code look like and how have you solved it there?
If you've already written it then you'll need to be much more specific
with your questions about integrating it into PG.


  Sam



Re: [HACKERS] Polyphase Merge

2008-01-21 Thread Tom Lane
Sam Mason [EMAIL PROTECTED] writes:
 It's really up to you to find answers to these questions, especially
 the first one.  Once you've designed an efficient algorithm then the
 second point (which I'm interpreting as how you'd go about changing
 tuplestore(?) so that things can be read in reverse order) should
 just drop out as an implementation detail :) I'm guessing you'll
 end up not reading the store in reverse order but arranging things
 differently---it'll be interesting to see.

I agree --- having to read the run back from external storage, only to
write it out again with no further useful work done on it, sounds like
a guaranteed loser.  To make this work you'll need some kind of ju-jitsu
rearrangement that logically puts the run where it needs to go without
physically moving any data.

regards, tom lane
