Re: [HACKERS] Using quicksort for every external sort run

2016-04-07 Thread Peter Geoghegan
On Thu, Apr 7, 2016 at 11:10 AM, Robert Haas  wrote:
> I prefer units of tuples, with the GUC itself therefore being
> unitless.  I suggest we call the parameter replacement_sort_threshold
> and document that (1) the ideal value may depend on the amount of CPU
> cache available to running processes, with more cache implying higher
> values; and (2) the ideal value may depend somewhat on the input data,
> with more correlation implying higher values.  And then pick some
> value that you think is likely to work well for most people and call
> it good.
>
> If you could prepare a new patch with those changes and also making
> the changes requested in my other email, I will try to commit that
> before the deadline.  Thanks.

Attached revision of patch series:

* Breaks out the parts you don't want to commit right now, as agreed.

These separate patches in the rebased patch series are included here
for completeness, but will probably be submitted separately to 9.7. I
do still think you should commit 0002-* alongside 0001-*, though,
because it's useful to be able to enable the memory context dumps on
dev builds to debug external sorting. I won't insist on it, but that
is my recommendation.

* Fixes "over-zealous assertion" that I pointed out recently.

* Replaces replacement_sort_mem GUC with replacement_sort_tuples GUC,
since, as discussed, effective cut-off points for using replacement
selection for the first run are easier to derive from the size of
memtuples (the might-be heap) than from work_mem/maintenance_work_mem
(the fraction of all tuplesort memory that is used for memtuples
could be very low in cases with what Tomas called "padding").

Since you didn't get back to me on the name of the GUC, I just ran
with the name replacement_sort_tuples, but that's something I'm
totally unattached to. Feel free to change it to whatever you prefer,
including your original suggestion of replacement_sort_threshold if
you still think that works.
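
For concreteness, a unitless integer GUC along those lines would look
roughly like this as a guc.c entry. This is only a from-memory sketch of
guc.c's conventions; the description strings and limits are my
assumptions, and the 150,000 boot value is the default discussed next:

    /*
     * Sketch only; assumes "int replacement_sort_tuples;" is declared
     * alongside the other resource GUCs.
     */
    {
        {"replacement_sort_tuples", PGC_USERSET, RESOURCES_MEM,
            gettext_noop("Sets the maximum number of tuples to be sorted "
                         "using replacement selection."),
            gettext_noop("When more tuples than this are present, quicksort "
                         "will be used.")
        },
        &replacement_sort_tuples,
        150000, 0, INT_MAX,
        NULL, NULL, NULL
    },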

The new default value that I came up with for replacement_sort_tuples
is 150,000 tuples, which is intended as a rough generic break-even
point. Note that trace_sort reports how many tuples were in the heap
when replacement selection is actually chosen for the first run.
150,000 seems like a high enough generic delta between an out-of-order
tuple and its optimal in-order position; if *that* amount of buffer
space to "juggle" tuples isn't enough, it seems unlikely that
*anything* will be (anything less than 1/2 of the total number of
input tuples, at least).
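
In tuplesort.c terms, the decision then reduces to a single unitless
comparison at the point where we first spill. A minimal sketch, under
the assumption that the memtuples array is full when useselection() is
called (the real patch may compare a slightly different counter):

    static bool
    useselection(Tuplesortstate *state)
    {
        /*
         * Sketch: use a replacement selection heap for the first run
         * only when the would-be heap is small enough that heap
         * maintenance can plausibly remain CPU cache resident.
         */
        if (state->memtupcount > replacement_sort_tuples)
            return false;

        return true;
    }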

Note that I use the term "cache oblivious" in the documentation now,
per your suggestion that CPU cache characteristics be addressed. We
have traditionally avoided using jargon like that, but I think it
works well here. The reader is not required to know the definition.
Dropping that term provides breadcrumbs for advanced users to put all
this together in more detail, which I believe has value. It suggests
that increasing work_mem or maintenance_work_mem can have almost no
downside provided you don't need that memory for anything else, which
is true.

I will be glad to see this through. Thanks for your help with this, Robert.

-- 
Peter Geoghegan
From 167f9ef6720f87925b67d995a9e7bdf87e72767c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Sun, 12 Jul 2015 13:14:01 -0700
Subject: [PATCH 5/5] Perform memory prefetching when writing memtuples

This patch is based on, but is quite distinct from, a separately
submitted, more general version which performs prefetching in several
places [1].  This version now only performs prefetching of each "tuple
proper" during the writing of batches of tuples (an entire run, written
following a quicksort).  The case for prefetching each "tuple proper"
at several sites now seems weak due to differences in CPU
microarchitecture.
However, it might still be that there is a consistent improvement
observable when writing out tuples, because that involves a particularly
tight inner loop, with relatively predictable processing to hide memory
latency behind.  A helpful generic prefetch hint may be possible for
this case, even if it proves impossible elsewhere.

This has been shown to appreciably help on both a POWER7 server
processor [2], and an Intel Mobile processor.

[1] https://commitfest.postgresql.org/6/305/
[2] CAM3SWZR5rv3+F3FOKf35=dti7otmmcdfoe2vogur0pddg3j...@mail.gmail.com
---
 config/c-compiler.m4   | 17 +
 configure  | 31 +++
 configure.in   |  1 +
 src/backend/utils/sort/tuplesort.c | 14 ++
 src/include/c.h| 14 ++
 src/include/pg_config.h.in |  3 +++
 src/include/pg_config.h.win32  |  3 +++
 src/include/pg_config_manual.h | 10 ++
 8 files changed, 93 insertions(+)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index a7f6773..c181496 100644
--- 
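
To illustrate the shape of the hint involved: with GCC-style compilers,
a portable wrapper of this kind presumably reduces to
__builtin_prefetch. The macro name and configure plumbing below are my
assumptions; the patch's own definitions live in c.h and
pg_config_manual.h per the diffstat above:

    /* Hypothetical wrapper; USE_MEM_PREFETCH set by a configure test */
    #ifdef USE_MEM_PREFETCH
    #define pg_prefetch_mem(addr)   __builtin_prefetch((addr), 0, 3)
    #else
    #define pg_prefetch_mem(addr)   ((void) 0)
    #endif

    /*
     * Usage sketch: while dumping memtuples[i] to tape, hint that the
     * next tuple's "tuple proper" will be read shortly, hiding memory
     * latency behind the write of the current tuple.  ("i" here is the
     * hypothetical loop variable of the write loop.)
     */
    if (i + 1 < state->memtupcount)
        pg_prefetch_mem(state->memtuples[i + 1].tuple);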

Re: [HACKERS] Using quicksort for every external sort run

2016-04-07 Thread Peter Geoghegan
On Thu, Apr 7, 2016 at 11:10 AM, Robert Haas  wrote:
> I prefer units of tuples, with the GUC itself therefore being
> unitless.  I suggest we call the parameter replacement_sort_threshold
> and document that (1) the ideal value may depend on the amount of CPU
> cache available to running processes, with more cache implying higher
> values; and (2) the ideal value may depend somewhat on the input data,
> with more correlation implying higher values.  And then pick some
> value that you think is likely to work well for most people and call
> it good.

I really don't want to bikeshed about this, but I must ask: if the
name of the GUC must include the word "threshold", shouldn't it be
called quicksort_threshold?

My dictionary defines threshold as "any place or point of entering or
beginning". But this GUC does not govern where replacement selection
begins; it governs where it ends.

How do you feel about replacement_sort_tuples? We already use the word
"tuple" in the names of GUCs.

-- 
Peter Geoghegan




Re: [HACKERS] Using quicksort for every external sort run

2016-04-07 Thread Peter Geoghegan
On Thu, Apr 7, 2016 at 11:05 AM, Robert Haas  wrote:
> I spent some time today reading through the new 0001 and in general I
> think it looks pretty good.

Cool.

> 1. Changing cost_sort to consider disk access as 90% sequential, 10%
> random rather than 75% sequential, 25% random.  As far as I can recall
> from the thread, zero test results have been posted to demonstrate
> that this is a good idea.  It also seems completely unprincipled.

I think that it's less unprincipled than the existing behavior, which
imagines that I/O is a significant cost overall, something that is
demonstrably wrong (there is an XXX comment about the existing disk
access costings). Still, I agree that there is no logical reason to
connect it to the bulk of what I want to do here, except that maybe it
would be good if we were more optimistic about the cost of external
sorting now. cost_sort() knows nothing about cache efficiency, of
course, so naturally we cannot teach it to weigh cache efficiency less
heavily. I guess I was worried that the smaller run sizes would put
cost_sort() off external sorts even more, even as they became far
cheaper.
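
(For anyone following along, the costing in question is a one-line
blend of page costs in cost_sort(). Quoting from memory, so treat the
excerpt as approximate; the disputed change would make the weights
0.90/0.10:)

    /* Assume 3/4ths of accesses are sequential, 1/4th are not */
    startup_cost += npageaccesses *
        (seq_page_cost * 0.75 + random_page_cost * 0.25);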

> 2. Restricting the maximum number of tapes to 500.  This seems like a
> sound change and I don't object to it in theory.  But I've seen no
> benchmark results which demonstrate that this is a good idea, and it
> is quite separate from the core purpose of the patch.

Ditto. This is something that could be done separately. We've often
pondered if it made any sense at all (e.g. commit message of
c65ab0bfa97b71bceae6402498910f4074996279), and I'm sure that it
doesn't, but the memory refund stuff in the already-committed memory
management patch at least refunds the cost for the final on-the-fly
state->tuples).
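
(The cap itself amounts to a clamp on the merge order computed from
workMem, roughly as sketched below; the MAXORDER name and its exact
placement in tuplesort.c are my assumptions:)

    #define MINORDER        6       /* minimum merge order */
    #define MAXORDER        500     /* maximum merge order: the cap at issue */

        /* Sketch: mOrder is the merge order derived from workMem */
        mOrder = Max(mOrder, MINORDER);
        mOrder = Min(mOrder, MAXORDER);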

> Since time is short, I recommend we remove both of these things from
> the patch and you can resubmit them as separate patches later.  As far
> as I can see, neither of them is so tied into the rest of the patch
> that the main part of the patch can't be committed without those
> changes.

I agree to all this. Now that you've indicated where you stand on
replacement_sort_mem, I have all the information I need to produce a
new revision. I'll go do that.

Thanks
-- 
Peter Geoghegan




Re: [HACKERS] Using quicksort for every external sort run

2016-04-07 Thread Robert Haas
On Thu, Apr 7, 2016 at 1:17 PM, Peter Geoghegan  wrote:
>> I certainly agree that GUCs that aren't easy to tune are bad.  I'm
>> wondering whether the fact that this one is hard to tune is something
>> that can be fixed.  The comments about "padding" - a term I don't
>> like, because to me it implies a deliberate attempt to game the
>> benchmark when in reality wanting to sort a wide row is entirely
>> reasonable - make me wonder if this should be based on a number of
>> tuples rather than an amount of memory.  If considering the row width
>> makes us get the wrong answer, then let's not do that.
>
> That's a good point. While I don't think it will make it easy to tune
> the GUC, it will make it easier. Although, I think that it should
> probably still be GUC_UNIT_KB. That should just be something that my
> useselection() function compares to the overall size of memtuples
> alone when we must initially spill, not the value of
> work_mem/maintenance_work_mem. The degree of padding isn't entirely
> irrelevant, because not all comparisons will be resolved at the
> stup.datum1 level, but it's still clearly an improvement to not have
> wide tuples mess with things.
>
> Would you like me to revise the patch along those lines? Or, do you
> prefer units of tuples? Tuples are basically equivalent, but make it
> way less obvious what the relationship with CPU cache might be. If I
> revise the patch along these lines, I should also reduce the default
> replacement_sort_mem to produce roughly equivalent behavior for
> non-padded cases.

I prefer units of tuples, with the GUC itself therefore being
unitless.  I suggest we call the parameter replacement_sort_threshold
and document that (1) the ideal value may depend on the amount of CPU
cache available to running processes, with more cache implying higher
values; and (2) the ideal value may depend somewhat on the input data,
with more correlation implying higher values.  And then pick some
value that you think is likely to work well for most people and call
it good.

If you could prepare a new patch with those changes and also making
the changes requested in my other email, I will try to commit that
before the deadline.  Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Using quicksort for every external sort run

2016-04-07 Thread Robert Haas
On Mon, Mar 21, 2016 at 2:01 AM, Peter Geoghegan  wrote:
> On Thu, Mar 17, 2016 at 1:13 PM, Robert Haas  wrote:
>> OK, I have now committed 0001
>
> I attach a revision of the external quicksort patch and supplementary
> small patches, rebased on top of the master branch.

I spent some time today reading through the new 0001 and in general I
think it looks pretty good.  But I think that there is some stuff in
there that logically seems to me to deserve to be separate patches.
In particular:

1. Changing cost_sort to consider disk access as 90% sequential, 10%
random rather than 75% sequential, 25% random.  As far as I can recall
from the thread, zero test results have been posted to demonstrate
that this is a good idea.  It also seems completely unprincipled.  If
the cost of sorts decreases as a result of this patch, it is because
we've reduced the CPU cost, not the I/O cost.  The changes we're
talking about here make I/O more random, not less random, because we
will now have more tapes, not fewer; which means merges will have to
seek the disk head more frequently, not less frequently.  Now, it's
tempting to say that this patch should result in some change to the
cost model: if the patch doesn't make sorting faster, we shouldn't
commit it at all, and if it does, then surely the cost model should
change accordingly.  But the question for the cost model isn't whether
the change to the model somehow reflects the increase in execution
speed.  It's whether we get better query plans with the change than
without.  I don't think there's been a degree of review of that aspect
of this patch on list that would give me confidence to commit a change
like this.

2. Restricting the maximum number of tapes to 500.  This seems like a
sound change and I don't object to it in theory.  But I've seen no
benchmark results which demonstrate that this is a good idea, and it
is quite separate from the core purpose of the patch.

Since time is short, I recommend we remove both of these things from
the patch and you can resubmit them as separate patches later.  As far
as I can see, neither of them is so tied into the rest of the patch
that the main part of the patch can't be committed without those
changes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Using quicksort for every external sort run

2016-04-07 Thread Peter Geoghegan
On Thu, Apr 7, 2016 at 6:55 AM, Robert Haas  wrote:
>> In reality there will be multiple processes running at the same time (e.g.
>> backends when running parallel query), significantly reducing the amount of
>> cache per process, making the replacement sort inefficient and thus
>> eliminating the regressions (by making the master slower).
>
> Interesting point.

The effective use of CPU cache is *absolutely* critical here. I think
that this patch is valuable primarily because it makes sorting
predictable, and only secondarily because it makes it much faster.
Having discrete costs that can be modeled fairly accurately has
significant practical benefits for DBAs, and for query optimization,
especially when parallel worker sorts must be costed. Inefficient use
of CPU cache implies a big overall cost for the server, not just one
client; my sorting patches are usually tested on single client cases,
but the multi-client cases can be a lot more sympathetic (we saw this
with abbreviated keys at one point).

I wonder how many DBAs are put off by higher work_mem settings due to
issues with replacement selection... they are effectively denied the
ability to set work_mem appropriately across the board, because of
this one weak spot. It really is perverse that there is, in effect, a
"Blackjack" cost function for sorts, which runs counter to the general
intuition that more memory is better.

> I certainly agree that GUCs that aren't easy to tune are bad.  I'm
> wondering whether the fact that this one is hard to tune is something
> that can be fixed.  The comments about "padding" - a term I don't
> like, because to me it implies a deliberate attempt to game the
> benchmark when in reality wanting to sort a wide row is entirely
> reasonable - make me wonder if this should be based on a number of
> tuples rather than an amount of memory.  If considering the row width
> makes us get the wrong answer, then let's not do that.

That's a good point. While I don't think it will make it easy to tune
the GUC, it will make it easier. Although, I think that it should
probably still be GUC_UNIT_KB. That should just be something that my
useselection() function compares to the overall size of memtuples
alone when we must initially spill, not the value of
work_mem/maintenance_work_mem. The degree of padding isn't entirely
irrelevant, because not all comparisons will be resolved at the
stup.datum1 level, but it's still clearly an improvement to not have
wide tuples mess with things.

Would you like me to revise the patch along those lines? Or, do you
prefer units of tuples? Tuples are basically equivalent, but make it
way less obvious what the relationship with CPU cache might be. If I
revise the patch along these lines, I should also reduce the default
replacement_sort_mem to produce roughly equivalent behavior for
non-padded cases.

-- 
Peter Geoghegan




Re: [HACKERS] Using quicksort for every external sort run

2016-04-07 Thread Robert Haas
Sorry for not responding to this thread again sooner.  I was on
vacation Thursday-Sunday, and have been playing catch-up since then.

On Sun, Apr 3, 2016 at 8:24 AM, Tomas Vondra
 wrote:
> Secondly, master is faster only if there's enough on-CPU cache for the
> replacement sort (for the memtuples heap), but the benchmark is not
> realistic in this respect as it only ran 1 query at a time, so it used the
> whole cache (6MB for i5, 12MB for Xeon).
>
> In reality there will be multiple processes running at the same time (e.g.
> backends when running parallel query), significantly reducing the amount of
> cache per process, making the replacement sort inefficient and thus
> eliminating the regressions (by making the master slower).

Interesting point.

> 3) replacement_sort_mem GUC
>
> I'm not quite sure what's the plan with this GUC. It was useful for
> development, but it seems to me it's pretty difficult to tune it in practice
> (especially if you don't know the internals, which users generally don't).
>
> The current patch includes the new GUC right next to work_mem, which seems
> rather unfortunate - I do expect users to simply mess with it, assuming
> "more is better", which seems to be a rather poor idea.
>
> So I think we should either remove the GUC entirely, or move it to the
> developer section next to trace_sort (and removing it from the conf).

I certainly agree that GUCs that aren't easy to tune are bad.  I'm
wondering whether the fact that this one is hard to tune is something
that can be fixed.  The comments about "padding" - a term I don't
like, because to me it implies a deliberate attempt to game the
benchmark when in reality wanting to sort a wide row is entirely
reasonable - make me wonder if this should be based on a number of
tuples rather than an amount of memory.  If considering the row width
makes us get the wrong answer, then let's not do that.

> BTW couldn't we tune the value automatically for each sort, using the
> pg_stats.correlation for the sort keys, when available (increasing the
> replacement_sort_mem when correlation is close to 1.0)? Wouldn't that
> improve at least some of the regressions?

Surely not for 9.6.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Using quicksort for every external sort run

2016-04-03 Thread Peter Geoghegan
On Sun, Apr 3, 2016 at 4:08 PM, Greg Stark  wrote:
>> SELECT * FROM (SELECT * FROM (SELECT a FROM int_test UNION SELECT a
>> FROM int_test_padding) gg OFFSET 1e10) ff;
>
> ISTM OFFSET binds more loosely than UNION so these should be equivalent.

Not exactly:

postgres=# explain analyze select i from fff union select i from ggg
offset 1e10;
                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=357771.51..357771.51 rows=1 width=4) (actual time=2989.378..2989.378 rows=0 loops=1)
   ->  Unique  (cost=345771.50..357771.51 rows=242 width=4) (actual time=2031.044..2930.903 rows=151 loops=1)
         ->  Sort  (cost=345771.50..351771.51 rows=242 width=4) (actual time=2031.042..2543.167 rows=242 loops=1)
               Sort Key: fff.i
               Sort Method: external merge  Disk: 32840kB
               ->  Append  (cost=0.00..58620.04 rows=242 width=4) (actual time=0.048..435.408 rows=242 loops=1)
                     ->  Seq Scan on fff  (cost=0.00..14425.01 rows=101 width=4) (actual time=0.048..100.435 rows=101 loops=1)
                     ->  Seq Scan on ggg  (cost=0.00..20195.01 rows=141 width=4) (actual time=0.042..138.991 rows=141 loops=1)
 Planning time: 0.123 ms
 Execution time: 2999.564 ms
(10 rows)

postgres=# explain analyze select * from (select i from fff union
select i from ggg) fg offset 1e10;
                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=381771.53..381771.53 rows=1 width=4) (actual time=2982.519..2982.519 rows=0 loops=1)
   ->  Unique  (cost=345771.50..357771.51 rows=242 width=4) (actual time=2009.176..2922.874 rows=151 loops=1)
         ->  Sort  (cost=345771.50..351771.51 rows=242 width=4) (actual time=2009.174..2522.761 rows=242 loops=1)
               Sort Key: fff.i
               Sort Method: external merge  Disk: 32840kB
               ->  Append  (cost=0.00..58620.04 rows=242 width=4) (actual time=0.056..428.934 rows=242 loops=1)
                     ->  Seq Scan on fff  (cost=0.00..14425.01 rows=101 width=4) (actual time=0.055..100.806 rows=101 loops=1)
                     ->  Seq Scan on ggg  (cost=0.00..20195.01 rows=141 width=4) (actual time=0.042..139.994 rows=141 loops=1)
 Planning time: 0.127 ms
 Execution time: 2993.294 ms
(10 rows)

The startup and total costs are greater in the latter case, but the
costs match at and below the Unique node. Whether or not this was
relevant is probably unimportant, though. My habit is to do the offset
outside of the subquery.

My theory is that the master branch happened to get a HashAggregate
for the 128MB case that caused us both confusion, because it looked
cheaper than an external sort + unique when the sort required many
passes on the master branch only (where my cost_sort() changes that
lower the costing of external sorts were not included).  This wasn't a
low cardinality case, so the HashAggregate may have only won by a
small amount. I suppose that this could happen when the HashAggregate
was not predicted to use memory > work_mem, but a sort was. Then, as
the sort requires fewer merge passes with more work_mem, the master
branch starts to agree with the patch on the cheapest plan once again.
The trend of the patch being faster continues, after this one hiccup.

This is down to the cost_sort() changes, not the tuplesort.c changes.
But this was just a quirk, and the trend still seems clear. This
theory seems very likely based on this strange query's numbers for i5
master as work_mem increases:

Master: 16.711, 9.94, 4.891, 8.32, 4.88

Patch: 17.23, 9.77, 9.78, 4.95, 4.94

ISTM that master's last and third-from-last cases *both* use a
HashAggregate, where the patch behaves more consistently. After all,
the patch does smooth the cost function of sorting, an independently
useful goal beyond simply making sorting faster. We don't have to be
afraid of crossing an arbitrary, fuzzy threshold.

-- 
Peter Geoghegan




Re: [HACKERS] Using quicksort for every external sort run

2016-04-03 Thread Greg Stark
On Sun, Apr 3, 2016 at 12:50 AM, Peter Geoghegan  wrote:
> 1459308434.753 2016-03-30 05:27:14 CEST STATEMENT:  SELECT * FROM
> (SELECT a FROM int_test UNION SELECT a FROM int_test_padding OFFSET
> 1e10) ff;
>
> I think that this is invalid, because the query was intended as this:
>
> SELECT * FROM (SELECT * FROM (SELECT a FROM int_test UNION SELECT a
> FROM int_test_padding) gg OFFSET 1e10) ff;

ISTM OFFSET binds more loosely than UNION so these should be equivalent.


-- 
greg




Re: [HACKERS] Using quicksort for every external sort run

2016-04-03 Thread Peter Geoghegan
I just mean that, as you say, trace_sort is described in the documentation.

I don't think we'll end up with any kind of cost model here, so where that
would need to happen is only an academic matter. The create index parameter
would only be an option for the DBA. That's about the only case I can see
working for replacement selection: where indexes can be created with very
little memory quickly, by optimistically starting to write out the start of
the final index representation almost immediately, before most of the
underlying table has even been read in.

--
Peter Geoghegan


Re: [HACKERS] Using quicksort for every external sort run

2016-04-03 Thread Tomas Vondra

On 04/03/2016 09:41 PM, Peter Geoghegan wrote:

> Hi Tomas,
>
> ...
>
>> 3) replacement_sort_mem GUC
>>
>> I'm not quite sure what's the plan with this GUC. It was useful for
>> development, but it seems to me it's pretty difficult to tune it in practice
>> (especially if you don't know the internals, which users generally don't).
>
> I agree.
>
>> So I think we should either remove the GUC entirely, or move it to the
>> developer section next to trace_sort (and removing it from the conf).
>
> I'll let Robert decide what's best here, but I see your point.
>
> Side note: trace_sort actually is documented. It's a bit weird that we
> have those TRACE_SORT macros at all IMV. I think we should rip those
> out, and assume every build enables TRACE_SORT, because that's
> probably true anyway.

What do you mean by documented? I thought this might be a good place
for it:

http://www.postgresql.org/docs/devel/static/runtime-config-developer.html

which is where trace_sort is documented.

> I do think that replacement selection could be put to good use for
> CREATE INDEX if the CREATE INDEX utility command had a "presorted"
> parameter. Specifically, an implementation of the "presorted" idea
> that I recently sketched [1] could do better than any presorted
> replacement selection case we've seen so far because it allows the
> implementation to optimistically create the index on-the-fly (if that
> isn't possible, throw an error), without a second pass over tuples
> sorted on tape. Nothing needs to be stored on a tape/temp file *at
> all*; the only thing that is stored externally is the index itself.
> But this patch doesn't add that feature, which can be worked on
> without the user needing to know about replacement_sort_mem in 9.6.
>
> So, I'm not in favor of ripping out the replacement selection code,
> but think it could make sense to effectively disable it entirely for
> the time being (with some developer feature to turn it back on for
> testing). In general, I share your misgivings about the new GUC,
> though.

OK.

>> I'm wondering whether 16MB default is not a bit too much, actually. As
>> explained before, that's not the amount of cache we should expect per
>> process, so maybe ~2-4MB would be a better default value?
>
> The obvious presorted case is where we have a SERIAL column, but as I
> mentioned even that isn't helped by RS. Moreover, it will be
> significantly hurt with a default maintenance_work_mem of 64MB. Your
> int4 CREATE INDEX cases clearly show this.
>
>> BTW couldn't we tune the value automatically for each sort, using the
>> pg_stats.correlation for the sort keys, when available (increasing the
>> replacement_sort_mem when correlation is close to 1.0)? Wouldn't that
>> improve at least some of the regressions?
>
> Maybe, but that seems hard. That information isn't conveniently
> available to the executor/tuplesort, and as we've seen with CREATE
> INDEX int4 cases, it's far from clear that we'll win even when there
> definitely is presorted input. Replacement selection needs more than a
> simple correlation to win, so you'll end up building a cost model with
> many new problems if this is to work.

Sure, that's non-trivial and definitely not 9.6 material. I'm also
wondering whether we need to choose replacement_sort_mem at planning
time, or whether it could be done in the executor based on actually
observed data ...


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Using quicksort for every external sort run

2016-04-03 Thread Peter Geoghegan
Hi Tomas,

Overall, I agree with your summary.

On Sun, Apr 3, 2016 at 5:24 AM, Tomas Vondra
 wrote:
> So, let me sum this up, the way I understand the current status.
>
>
> 1) overall, the patch seems to be a clear performance improvement

I think that's clear. There are even cases that are over 5x faster,
which are representative of some real workloads (e.g., "CREATE INDEX x
ON numeric_test (a)" when low_cardinality_almost_asc +
maintenance_work_mem=512MB). A lot of the aggregate (datum sort)
cases, and heap tuple cases are 3x - 4x faster.

> 2) it's unlikely we can improve the performance further

I think it's very unlikely that these remaining regressions can be fixed, yes.

> Secondly, master is faster only if there's enough on-CPU cache for the
> replacement sort (for the memtuples heap), but the benchmark is not
> realistic in this respect as it only ran 1 query at a time, so it used the
> whole cache (6MB for i5, 12MB for Xeon).
>
> In reality there will be multiple processes running at the same time (e.g.
> backends when running parallel query), significantly reducing the amount of
> cache per process, making the replacement sort inefficient and thus
> eliminating the regressions (by making the master slower).

Agreed. And even though the 8MB work_mem cases always have more than
enough CPU cache to fit the replacement selection heap, it's still no
worse than a mixed picture. The replacement_work_mem=64KB + patch +
8MB (maintenance_)work_mem cases (i.e. replacement selection entirely
disabled) don't always do worse; they are often a draw, and sometimes
do much better. We *still* win in many cases, sometimes by quite a bit
(e.g. "SELECT COUNT(DISTINCT a) FROM int_test" typically loses about
50% of its runtime when patched and RS is disabled at work_mem=8MB).
The cases where we lose at work_mem=8MB involve padding and a
correlation. The really important case of CREATE INDEX on int4 almost
always wins, *even with sorted input* (the
almost-but-not-quite-asc-sorted case loses ~1%). We can shave 20% -
30% off the CREATE INDEX int4 cases with just maintenance_work_mem =
8MB.

Even in these cases with so much CPU cache relative to work_mem, you
need to search for regressed cases to find them, and they are less
representative cases. So, while the picture for the work_mem=8MB
column alone seems kind of bad, if you consider where the regressions
actually occur, you could argue that even that's a draw.

> 3) replacement_sort_mem GUC
>
> I'm not quite sure what's the plan with this GUC. It was useful for
> development, but it seems to me it's pretty difficult to tune it in practice
> (especially if you don't know the internals, which users generally don't).

I agree.

> So I think we should either remove the GUC entirely, or move it to the
> developer section next to trace_sort (and removing it from the conf).

I'll let Robert decide what's best here, but I see your point.

Side note: trace_sort actually is documented. It's a bit weird that we
have those TRACE_SORT macros at all IMV. I think we should rip those
out, and assume every build enables TRACE_SORT, because that's
probably true anyway.
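
(For context, the pattern I'd rip out looks roughly like this
throughout tuplesort.c; and since TRACE_SORT is defined by default in
pg_config_manual.h, the #ifdef guard is almost always compiled in
anyway:)

    #ifdef TRACE_SORT
        if (trace_sort)
            elog(LOG, "performsort starting: %s",
                 pg_rusage_show(&state->ru_start));
    #endif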

I do think that replacement selection could be put to good use for
CREATE INDEX if the CREATE INDEX utility command had a "presorted"
parameter. Specifically, an implementation of the "presorted" idea
that I recently sketched [1] could do better than any presorted
replacement selection case we've seen so far because it allows the
implementation to optimistically create the index on-the-fly (if that
isn't possible, throw an error), without a second pass over tuples
sorted on tape. Nothing needs to be stored on a tape/temp file *at
all*; the only thing that is stored externally is the index itself.
But this patch doesn't add that feature, which can be worked on
without the user needing to know about replacement_sort_mem in 9.6.

So, I'm not in favor of ripping out the replacement selection code,
but think it could make sense to effectively disable it entirely for
the time being (with some developer feature to turn it back on for
testing). In general, I share your misgivings about the new GUC,
though.

> I'm wondering whether 16MB default is not a bit too much, actually. As
> explained before, that's not the amount of cache we should expect per
> process, so maybe ~2-4MB would be a better default value?

The obvious presorted case is where we have a SERIAL column, but as I
mentioned even that isn't helped by RS. Moreover, it will be
significantly hurt with a default maintenance_work_mem of 64MB. Your
int4 CREATE INDEX cases clearly show this.

> BTW couldn't we tune the value automatically for each sort, using the
> pg_stats.correlation for the sort keys, when available (increasing the
> replacement_sort_mem when correlation is close to 1.0)? Wouldn't that
> improve at least some of the regressions?

Maybe, but that seems hard. That information isn't conveniently
available to the executor/tuplesort, and as we've seen with CREATE
INDEX int4 cases, it's far from clear that we'll win even when there
definitely is presorted input. Replacement selection needs more than a
simple correlation to win, so you'll end up building a cost model with
many new problems if this is to work.

Re: [HACKERS] Using quicksort for every external sort run

2016-04-03 Thread Tomas Vondra

Hi,

So, let me sum this up, the way I understand the current status.


1) overall, the patch seems to be a clear performance improvement

There are far more "green" cells than "red" ones in the spreadsheets, and 
the patch often shaves off 30-75% of the sort duration. Improvements are 
pretty much all over the board, for all data sets (low/high/unique 
cardinality, initial ordering) and data types.



2) it's unlikely we can improve the performance further

The regressions are limited to low work_mem settings, which we believe 
are not representative (or at least not as much as the higher work_mem 
values), for two main reasons.


Firstly, if you need to sort a lot of data (e.g. 10M, as benchmarked), 
it's quite reasonable to use larger work_mem values. It'd be a bit 
backwards to reject a patch that gets you 2-4x speedup with enough 
memory, on the grounds that it may have negative impact with 
unreasonably small work_mem values.


Secondly, master is faster only if there's enough on-CPU cache for the 
replacement sort (for the memtuples heap), but the benchmark is not 
realistic in this respect as it only ran 1 query at a time, so it used 
the whole cache (6MB for i5, 12MB for Xeon).


In reality there will be multiple processes running at the same time 
(e.g. backends when running parallel query), significantly reducing the 
amount of cache per process, making the replacement sort inefficient and 
thus eliminating the regressions (by making the master slower).



3) replacement_sort_mem GUC

I'm not quite sure what's the plan with this GUC. It was useful for 
development, but it seems to me it's pretty difficult to tune it in 
practice (especially if you don't know the internals, which users 
generally don't).


The current patch includes the new GUC right next to work_mem, which 
seems rather unfortunate - I do expect users to simply mess with it, 
assuming "more is better", which seems to be a rather poor idea.


So I think we should either remove the GUC entirely, or move it to the 
developer section next to trace_sort (and removing it from the conf).


I'm wondering whether 16MB default is not a bit too much, actually. As 
explained before, that's not the amount of cache we should expect per 
process, so maybe ~2-4MB would be a better default value?


Also, now that I'm re-reading the docs for the GUC, I realize it also 
depends on how the input data is correlated - that seems like a rather 
useless criterion for tuning, though, because that varies per sort node, 
so using it for a GUC value set in postgresql.conf does not seem very 
wise. Actually, even on a per-query basis that's rather dubious, as it 
depends on how the sort node gets data (some nodes preserve ordering, 
some don't).


BTW couldn't we tune the value automatically for each sort, using the 
pg_stats.correlation for the sort keys, when available (increasing the 
replacement_sort_mem when correlation is close to 1.0)? Wouldn't that 
improve at least some of the regressions?



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Using quicksort for every external sort run

2016-04-02 Thread Peter Geoghegan
On Sat, Apr 2, 2016 at 3:22 PM, Peter Geoghegan  wrote:
> On Sat, Apr 2, 2016 at 3:20 PM, Greg Stark  wrote:
>> There are also some weird cases in this list where there's a
>> significant regression at 32MB but not at 8MB. I would like to see
>> 16MB and perhaps 12MB and 24MB. They would help understand if these
>> are just quirks or there's a consistent pattern.
>
> I'll need to drill down to trace_sort output to see what happened there.

I looked into this.

I too noticed that queries like "SELECT a FROM int_test UNION SELECT a
FROM int_test_padding" looked strangely faster for 128MB +
high_cardinality_almost_asc + i5 for master branch. This made the
patch look relatively bad for the test with those exact properties
only; the patch was faster with both lower and higher work_mem
settings than 128MB. There was a weird spike in performance for the
master branch only.

Having drilled down to trace_sort output, I think I know roughly why.
I see output like this:

1459308434.753 2016-03-30 05:27:14 CEST STATEMENT:  SELECT * FROM
(SELECT a FROM int_test UNION SELECT a FROM int_test_padding OFFSET
1e10) ff;

I think that this is invalid, because the query was intended as this:

SELECT * FROM (SELECT * FROM (SELECT a FROM int_test UNION SELECT a
FROM int_test_padding) gg OFFSET 1e10) ff;

This would have controlled for client overhead, per my request to
Tomas, without altering the "underlying query" that you see in the
final spreadsheet. I don't have an exact explanation for why you'd see
this spike at 128MB for the master branch but not the other at the
moment, but it seems like that one test is basically invalid, and
should be discarded. I suspect that the patch didn't see its own
similar spike due to my changes to cost_sort(), which reflected that
sorts don't need to do so much expensive random I/O.

This is the only case that I saw that was not more or less consistent
with my expectations, which is good.

-- 
Peter Geoghegan




Re: [HACKERS] Using quicksort for every external sort run

2016-04-02 Thread Peter Geoghegan
On Sat, Apr 2, 2016 at 7:31 AM, Tomas Vondra
 wrote:
> So, I do have the results from both machines - I've attached the basic
> comparison spreadsheets, the complete summary is available here:
>
>https://github.com/tvondra/sort-benchmark
>
> The database log also includes the logs for trace_sort=on for each query
> (use the timestamp logged for each query in the spreadsheet to locate the
> right section of the log).

Thanks!

Each row in these spreadsheets shows what looks like a multimodal
distribution for the patch (if you focus on the actual run times, not
the ratios). IOW, you can clearly see the regressions are only where
master has its best case, and the patch its worst case; as the
work_mem increases for each benchmark case for the patch, by far the
largest improvement is usually seen as we cross the CPU cache
threshold. Master gets noticeably slower as work_mem goes from 8MB to
32MB, but the patch gets far, far faster. Things continue to improve
for the patch in absolute terms, and especially relative to master,
following further increases in work_mem, but not nearly as
dramatically as that first increment (unless we have lots of padding,
which makes the memtuples heap itself much smaller, so it happens one
step later). Master shows a slow decline at and past 32MB of work_mem.
If the test hardware had a larger L3 cache, we might expect to notice
a second big drop, but this hardware doesn't have the enormous L3
cache sizes of new Xeon processors (e.g. 32MB, 45MB).

> While it might look like I'm somehow opposed to this patch series, that's
> mostly because we tend to look only at the few cases that behave poorly.
>
> So let me be clear: I do think the patch seems to be a significant
> performance improvement for most of the queries, and I'm OK with accepting a
> few regressions (particularly if we agree those are pathological cases,
> unlikely to happen in real-world workloads).
>
> It's quite rare that a patch is a universal win without regressions, so it's
> important to consider how likely those regressions are and what's the net
> effect of the patch - and the patch seems to be a significant improvement in
> most cases (and regressions limited to pathological or rare corner cases).
>
> I don't think those are reasons not to push this into 9.6.

I didn't think that you opposed the patch. In fact, you did the right
thing by focussing on the low-end regressions, as I've said. I was
probably too concerned about Robert failing to consider that they were
not representative, particularly with regard to how small the
memtuples heap could be relative to the CPU cache; blame it on how
close I've become to this problem. I'm pretty confident that Robert
can be convinced that these do not matter enough to not commit the
patch. In any case, I'm pretty confident that I cannot fix any
remaining regressions.

> I haven't done any thorough investigation of the results yet, but in general
> it seems the results from both machines are quite similar - the numbers are
> different, but the speedup/slowdown patterns are mostly the same (with some
> exceptions that I'd guess are due to HW differences).

I agree. What we clearly see is the advantages of quicksort being
cache oblivious, especially relative to master's use of a heap. That
advantage becomes pronounced at slightly different points in each
case, but the overall picture is the same. This pattern demonstrates
why a cache oblivious algorithm is so useful in general -- we don't
have to care about tuning for that. As important as this is for serial
sorts, it's even more important for parallel sorts, where parallel
workers compete for memory bandwidth, and where it's practically
impossible to build a cost model for CPU cache size + memory use +
nworkers.

> The slowdown/speedup patterns (red/green cells in the spreadheets) are also
> similar to those collected originally. Some timings are much lower,
> presumably thanks to using the "OFFSET 1e10" pattern, but the patterns are
> the same.

I think it's notable that this made things more predictable, and made
the benefits clearer.

> The one thing that surprised me a bit is that
>
> replacement_sort_mem=64
>
> actually often made the results considerably worse in many cases. A common
> pattern is that the slowdown "spreads" to nearby cells - there are many
> queries where the 8MB case is 1:1 with master and 32MB is 1.5:1 (i.e. takes
> 1.5x more time), and setting replacement_sort_mem=64 just slows down the 8MB
> case.
>
> In general, replacement_sort_mem=64 seems to only affect the 8MB case, and
> in most cases it results in 100% slowdown (so 2x as long queries).

To be clear, for the benefit of other people: replacement_sort_mem=64
makes the patch never use a replacement selection heap, even at the
lowest tested work_mem setting of 8MB.

This is exactly what I expected. When replacement_sort_mem is the
proposed default of 16MB, it literally has zero impact on how the
patch behaves 

Re: [HACKERS] Using quicksort for every external sort run

2016-04-02 Thread Peter Geoghegan
On Sat, Apr 2, 2016 at 3:20 PM, Greg Stark  wrote:
> These are the averages across all queries across all data sets for the
> run-time for the patch versus master (not patched 64 which I think is
> the replacement_sort_mem=64MB which appears to not be a win). So even
> in the less successful cases on average quicksort is faster than
> replacement selection.

It's actually replacement_sort_mem=64 (64KB -- effectively disabled).
So where that case does better or worse, which can only be when
work_mem=8MB in practice, that's respectively good or bad for
replacement selection. So, typically RS does better when there are
presorted inputs with a positive (not inverse/DESC) correlation, and
there is little work_mem. As I've said, this is where the CPU cache is
large enough to fit the entire memtuples heap.

"Padded" cases are mostly bad because they make the memtuples heap
relatively small in each case. So with work_mem=32MB, you get a
memtuples heap structure similar to work_mem=8MB. The padding pushes
things out a bit further, which favors master.
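
To put rough numbers on that (back-of-the-envelope figures of my own,
not taken from Tomas's benchmark):

    /*
     * Illustrative arithmetic only.  A SortTuple is ~24 bytes on
     * 64-bit builds, plus the tuple it points to:
     *
     *   unpadded int4 rows, work_mem = 8MB:
     *       ~8MB / (24 + ~32) bytes    =>  ~150K entries in the heap
     *
     *   rows padded to ~1kB, work_mem = 32MB:
     *       ~32MB / (24 + ~1024) bytes =>  ~32K entries in the heap
     *
     * So the padded work_mem=32MB heap is actually *smaller* than the
     * unpadded 8MB one, and fits in a 6MB or 12MB CPU cache just as
     * easily -- which is where replacement selection does best.
     */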

-- 
Peter Geoghegan




Re: [HACKERS] Using quicksort for every external sort run

2016-04-02 Thread Peter Geoghegan
On Sat, Apr 2, 2016 at 3:20 PM, Greg Stark  wrote:
> There are also some weird cases in this list where there's a
> significant regression at 32MB but not at 8MB. I would like to see
> 16MB and perhaps 12MB and 24MB. They would help understand if these
> are just quirks or there's a consistent pattern.

I'll need to drill down to trace_sort output to see what happened there.

-- 
Peter Geoghegan




Re: [HACKERS] Using quicksort for every external sort run

2016-04-02 Thread Greg Stark
On Sat, Apr 2, 2016 at 3:31 PM, Tomas Vondra
 wrote:

> So let me be clear: I do think the patch seems to be a significant
> performance improvement for most of the queries, and I'm OK with accepting a
> few regressions (particularly if we agree those are pathological cases,
> unlikely to happen in real-world workloads).

The ultra-short version of this is:

8MB: 0.98
32MB:   0.79
128MB: 0.63
512MB: 0.51
1GB: 0.42

These are the averages across all queries across all data sets for the
run-time for the patch versus master (not patched 64 which I think is
the replacement_sort_mem=64MB which appears to not be a win). So even
in the less successful cases on average quicksort is faster than
replacement selection.

But selecting just the cases where 8MB is significantly slower than
master, it does look like the "padding" data sets are endemic.

On the one hand that's a very realistic use-case where I think a lot
of users find themselves. I know in my days as a web developer I
typically threw a lot of columns into my queries, along with a lot of
joins and ORDER BYs, and then left it to the application to pick
through the recordsets that were returned for the columns that were of
interest. The tuples being sorted were probably huge.

On the other hand perhaps this is something better tackled by the
planner. If the planner can arrange sorts to happen when the rows are
narrower, that would be a bigger win than trying to move a lot of
data around like this. (In the extreme if it were possible to replace
unnecessary columns by the tid and then refetching them later though
that's obviously more than a little tricky to do effectively.)

There are also some weird cases in this list where there's a
significant regression at 32MB but not at 8MB. I would like to see
16MB and perhaps 12MB and 24MB. They would help understand if these
are just quirks or there's a consistent pattern.




Re: [HACKERS] Using quicksort for every external sort run

2016-04-01 Thread Peter Geoghegan
On Thu, Feb 4, 2016 at 3:14 AM, Peter Geoghegan  wrote:
> Nyberg et al may have said it best in 1994, in the Alphasort Paper [1]:

This paper is available from
http://www.vldb.org/journal/VLDBJ4/P603.pdf (the previous link is now
dead).

> The paper also has very good analysis of the economics of sorting:
>
> "Even for surprisingly large sorts, it is economical to perform the
> sort in one pass."

I suggest taking a look at "Figure 2. Replacement-selection sort vs.
QuickSort" in the paper. It confirms what I said recently about cache
size. The diagram is annotated: "The tournament tree of
replacement-selection sort at left has bad cache behavior, unless the
entire tournament fits in cache". I think we're well justified in
giving no weight at all to cases where the *entire* tournament tree
(heap) fits in cache, because it's not economical to use a
cpu-cache-sized work_mem setting. It simply makes no sense.

I understand the reluctance to give up on replacement selection. The
authors of this paper were themselves reluctant to do so. As they put
it:

"""
We were reluctant to abandon replacement-selection sort, because it has
stability and it generates long runs. Our first approach was to improve
replacement-selection sort's cache locality. Standard replacement-selection
sort has terrible cache behavior, unless the tournament fits in cache. The
cache thrashes on the bottom levels of the tournament. If you think of the
tournament as a tree, each replacement-selection step traverses a path from a
pseudo-random leaf of the tree to the root. The upper parts of the tree may be
cache resident, but the bulk of the tree is not.

We investigated a replacement-selection sort that clusters tournament nodes so
that most parent-child node pairs are contained in the same cache line. This
technique reduces cache misses by a factor of two or three. Nevertheless,
replacement-selection sort is still less attractive than QuickSort because:

1. The cache behavior demonstrates less locality than QuickSort's. Even when
QuickSort runs did not fit entirely in cache, the average compare-exchange
time did not increase significantly.

2. Tournament sort is more CPU-intensive than QuickSort. Knuth calculated a 2:1
ratio for the programs he wrote. We observed a 2.5:1 speed advantage for
QuickSort over the best tournament sort we wrote.

The key to achieving high execution speeds on fast processors is to minimize
the number of references that cannot be serviced by the on-board cache (4MB in
the case of the DEC 7000 AXP). As mentioned before, QuickSort's memory access
patterns are sequential and, thus, have good cache behavior.

"""

This paper is co-authored by Jim Gray, a Turing award laureate, as
well as some other very notable researchers. The paper appeared in
"Readings in Database Systems, 4th edition", which was edited by by
Joseph Hellerstein and Michael Stonebraker. These days, the cheapest
consumer level CPUs have 4MB caches (in 1994, that was exceptional),
so if this analysis wasn't totally justified in 1994, when the paper
was written, it is today.

I've spent a lot of time analyzing this problem. I've been looking at
external sorting in detail for almost a year now. I've done my best to
avoid any low-end regressions. I am very confident that I cannot do
any better than I already have there, though. If various very
influential figures in the database research community could not do
better, then I have doubts that we can. I started with the intuition
that we should still use replacement selection myself, but that just
isn't well supported by benchmarking cases with sensible
work_mem:cache size ratios.

-- 
Peter Geoghegan




Re: [HACKERS] Using quicksort for every external sort run

2016-03-30 Thread Peter Geoghegan
On Wed, Mar 30, 2016 at 4:22 AM, Greg Stark  wrote:
> I'm sorry I was intending to run those benchmarks again this past week
> but haven't gotten around to it. But my plan was to run them on a good
> server I borrowed, an i7 with 8MB cache. I can still go ahead with
> that but I can also try running it on the home server again too if you
> want (and AMD N36L with 1MB cache).

I don't want to suggest that people not test the very low end on very
high end hardware. That's fine, as long as it's put in context.
Considerations about the economics of cache sizes and work_mem
settings are crucial to testing the patch objectively. If everything
fits in cache anyway, then you almost eliminate the advantages
quicksort has, but then you should be using an internal sort anyway. I
think that this is just common sense.

I would like to see a low-end benchmark for low-end work_mem settings
too, though. Maybe you could repeat the benchmark I linked to, but
with a recent version of the patch, including commit 0011c0091e886b.
Compare that to the master branch just before 0011c0091e886b went in.
I'm curious about how the more recent memory context resetting stuff
that made it into 0011c0091e886b left us regression-wise.  Tomas
tested that, of course, but I have some concerns about how
representative his numbers are at the low end.

> But even for the smaller machines I don't think we should really be
> caring about regressions in the 4-8MB work_mem range. Earlier in the
> fuzzer work I was surprised to find out it can take tens of megabytes
> to compile a single regular expression (iirc it was about 30MB for a
> 64-bit machine) before you get errors. It seems surprising to me that
> a single operator would consume more memory than an ORDER BY clause. I
> was leaning towards suggesting we just bump up the default work_mem to
> 8MB or 16MB.

Today, it costs less than USD $40 for a new Raspberry Pi 2, which has
1GB of memory. I couldn't figure out exactly how much CPU cache that
model has, but I'm pretty sure it's no more than 256KB. Memory just
isn't that expensive; memory bandwidth is expensive. I agree that we
could easily justify increasing work_mem to 8MB, or even 16MB.

It seems almost silly to point it out, but: Increasing sort
performance has the effect of decreasing the duration of sorts, which
could effectively decrease memory use on the system. Increasing the
memory available to sorts could decrease the overall use of memory.
Being really frugal with memory is expensive, maybe even if your
primary concern is the expense of memory usage, which it probably
isn't these days.

-- 
Peter Geoghegan




Re: [HACKERS] Using quicksort for every external sort run

2016-03-30 Thread Greg Stark
On Wed, Mar 30, 2016 at 7:23 AM, Peter Geoghegan  wrote:
> Anyway, what I liked about Greg's approach to finding regressions at
> the low end was that when testing, he used the cheapest possible VM
> available on Google's cloud platform. When testing the low end, he had
> low end hardware to go with the low end work_mem settings. This gave
> the patch the benefit of using quicksort to make good use of what I
> assume is a far smaller L2 cache; certainly nothing like 6MB or 12MB.
> I think Greg might have used a home server to test my patch in [1],
> actually, but I understand that it too was suitably low-end.

I'm sorry I was intending to run those benchmarks again this past week
but haven't gotten around to it. But my plan was to run them on a good
server I borrowed, an i7 with 8MB cache. I can still go ahead with
that but I can also try running it on the home server again too if you
want (and AMD N36L with 1MB cache).

But even for the smaller machines I don't think we should really be
caring about regressions in the 4-8MB work_mem range. Earlier in the
fuzzer work I was surprised to find out it can take tens of megabytes
to compile a single regular expression (iirc it was about 30MB for a
64-bit machine) before you get errors. It seems surprising to me that
a single operator would consume more memory than an ORDER BY clause. I
was leaning towards suggesting we just bump up the default work_mem to
8MB or 16MB.


-- 
greg




Re: [HACKERS] Using quicksort for every external sort run

2016-03-30 Thread Peter Geoghegan
On Tue, Mar 29, 2016 at 6:02 PM, Tomas Vondra
 wrote:
> That may be easily due to differences between the CPUs and configuration.
> For example the Xeon uses a way older CPU with different amounts of CPU
> cache, and it's also a multi-socket system. And so on.

So, having searched past threads I guess this was your Xeon E5450,
which has a 12MB cache. I also see that you have an Intel Core
i5-2500K Processor, which has 6MB of L2 cache. This hardware is
mid-end, and the CPUs were discontinued in 2010 and 2013 respectively.

Now, the i5 has a smaller L2 cache, so if anything I'd expect it to do
worse than the Xeon, not better. But leaving that aside, I think there
is an issue that we don't want to lose sight of, which is: in most of
the regressions we were discussing today, perhaps the entire heap
structure can fit in L2 cache. This would be true for stuff like int4
CREATE INDEX builds, where a significant fraction of memory is used
for IndexTuples that most or all comparisons never have to read. And
this is the case with a CPU that was discontinued by the manufacturer
just over 5 years ago. I think this is why "padding" cases can make
the patch look not much better and occasionally worse at the low end:
those keep the number of memtuples as a fraction of work_mem very low,
and so mask the problems with the replacement selection heap.

When Greg Stark benchmarked the patch at the low end, to identify
regressions, he did find some slight regressions at the lowest
work_mem settings with many many passes, but they were quite small
[1]. Greg also did some good analysis of the performance
characteristics of external sorting today [2] that I recommend reading
if you missed it. It's possible that those regressions have since been
fixed, because Greg did not apply/test the memory batching patch that
became commit 0011c0091e886b as part of this. It seems likely that
it's at least partially fixed, and it might even be better than master
overall, now.

Anyway, what I liked about Greg's approach to finding regressions at
the low end was that when testing, he used the cheapest possible VM
available on Google's cloud platform. When testing the low end, he had
low end hardware to go with the low end work_mem settings. This gave
the patch the benefit of using quicksort to make good use of what I
assume is a far smaller L2 cache; certainly nothing like 6MB or 12MB.
I think Greg might have used a home server to test my patch in [1],
actually, but I understand that it too was suitably low-end.

It's perfectly valid to bring economics into this; typically, an
external sort occurs only because memory isn't infinitely affordable,
or it isn't worth provisioning enough memory to be totally confident
that you can do every sort internally. With external sorting, the
constant factors are what researchers generally spend most of the time
worrying about. Knuth spends a lot of time discussing how the
characteristics of actual magnetic tape drives changed throughout the
1970s in TAOCP Volume III.

It's quite valid to ask if anyone would actually want to have an 8MB
work_mem setting on a machine that has 12MB of L2 cache, cache that an
external sort gets all to itself. Is that actually a practical setup
that anyone would want to use?

[1] 
http://www.postgresql.org/message-id/CAM-w4HOwt0C7ZndowHUuraw+xi+BhY5a6J008XoSq=r9z7h...@mail.gmail.com
[2] 
http://www.postgresql.org/message-id/CAM-w4HM4XW3u5kVEuUrr+L+KX3WZ=5JKk0A=djjzypkb-hy...@mail.gmail.com
-- 
Peter Geoghegan




Re: [HACKERS] Using quicksort for every external sort run

2016-03-29 Thread Peter Geoghegan
On Tue, Mar 29, 2016 at 6:02 PM, Tomas Vondra
 wrote:
> And why not? I mean, why should it be acceptable to slow down?

My point was that over 80% of execution time was spent in the
HashAggregate, which outputs tuples to the sort. That, and the huge
i5/Xeon inconsistency (in the extent to which this is regressed --
it's not at all, or it's regressed a lot) makes me suspicious that
there is something else going on. Possibly involving the scheduling of
I/O.

> That may be easily due to differences between the CPUs and configuration.
> For example the Xeon uses a way older CPU with different amounts of CPU
> cache, and it's also a multi-socket system. And so on.

We're talking about a huge relative difference with that HashAggregate
plan, though. I don't think that those relative differences are
explained by differing CPU characteristics. But I guess we'll find out
soon enough.

>> If there is ever a regression, it is only really sensible to talk
>> about it while looking at trace_sort output (and, I guess, the query
>> plan). I've asked Tomas for trace_sort output in all relevant cases.
>> There is no point in "flying blind" and speculating what the problem
>> was from a distance.
>
>
> The updated benchmarks are currently running. I'm out of office until
> Friday, and I'd like to process the results over the weekend. FWIW I'll have
> results for these cases:
>
> 1) unpatched (a414d96a)
> 2) patched, default settings
> 3) patched, replacement_sort_mem=64
>
> Also, I'll have trace_sort=on output for all the queries, so we can
> investigate further.

Thanks! That will tell us a lot more.

> Yeah. That was one of the goals of the benchmark, to come up with some
> tuning recommendations. On some systems significantly increasing memory GUCs
> may not be possible, though - say, on very small systems with very limited
> amounts of RAM.

Fortunately, such systems will probably mostly use external sorts for
CREATE INDEX cases, and there seems to be very little if any downside
there, at least according to your similarly varied tests of CREATE
INDEX.

>> I don't think they are representative. Greg Stark characterized the
>> regressions as being fairly limited, mostly at the very low end. And
>> that was *before* all the memory fragmentation stuff made that
>> better. I haven't done any analysis of how much better that made the
>> problem *across the board* yet, but for int4 cases I could make 1MB
>> work_mem queries faster with gigabytes of data on my laptop. I
>> believe I tested various datum sort cases there, like "select
>> count(distinct(foo)) from bar"; those are a very pure test of the
>> patch.
>>
>
> Well, I'd guess those conclusions may be a bit subjective.

I think that the conclusion that we should do something or not do
something based on this information is subjective. OTOH, whether and
to what extent these tests are representative of real user workloads
seems much less subjective. This is not a criticism of the test cases
you came up with, which rightly emphasized possibly regressed cases. I
think everyone already understood that the picture was very positive
at the high end, in memory rich environments.

-- 
Peter Geoghegan




Re: [HACKERS] Using quicksort for every external sort run

2016-03-29 Thread Tomas Vondra

Hi,

On 03/29/2016 09:43 PM, Peter Geoghegan wrote:

On Tue, Mar 29, 2016 at 9:11 AM, Robert Haas  wrote:

One test that kind of bothers me in particular is the "SELECT DISTINCT
a FROM numeric_test ORDER BY a" test on the high_cardinality_random
data set.  That's a wash at most work_mem values, but at 32MB it's
more than 3x slower.  That's very strange, and there are a number of
other results like that, where one particular work_mem value triggers
a large regression.  That's worrying.


That case is totally invalid as a benchmark for this patch. Here is
the query plan I get (doesn't matter if I run analyze) when I follow
Tomas' high_cardinality_random 10M instructions (including setting
work_mem to 32MB):

postgres=# explain analyze select distinct a from numeric_test order by a;
                                  QUERY PLAN
──────────────────────────────────────────────────────────────────────────
 Sort  (cost=268895.39..270373.10 rows=591082 width=8) (actual time=3907.917..4086.174 rows=999879 loops=1)
   Sort Key: a
   Sort Method: external merge  Disk: 18536kB
   ->  HashAggregate  (cost=206320.50..212231.32 rows=591082 width=8) (actual time=3109.619..3387.599 rows=999879 loops=1)
         Group Key: a
         ->  Seq Scan on numeric_test  (cost=0.00..175844.40 rows=12190440 width=8) (actual time=0.025..601.295 rows=1000 loops=1)
 Planning time: 0.088 ms
 Execution time: 4120.656 ms
(8 rows)

Does that seem like a fair test of this patch?


And why not? I mean, why should it be acceptable to slow down?



I must also point out an inexplicable difference between the i5 and
Xeon in relation to this query. It took about 10% less time on
the patched Xeon 10M case, not ~200% more (line 53 of the summary page
in each 10M case). So even if this case did exercise the patch well,
it's far from clear that it has even been regressed at all. It's far
easier to imagine that there was some problem with the i5 tests.


That may be easily due to differences between the CPUs and 
configuration. For example the Xeon uses a way older CPU with different 
amounts of CPU cache, and it's also a multi-socket system. And so on.



A complete do-over from Tomas would be best, here. He has already
acknowledged that the i5 CREATE INDEX results were completely invalid.
Pending a do-over from Tomas, I recommend ignoring the i5 tests
completely. Also, I should once again point out that many of the
work_mem cases actually had internal sorts at the high end, so the
code in the patches simply wasn't exercised at all at the high end
(the 1024MB cases, where the numbers might be expected to get really
good).

If there is ever a regression, it is only really sensible to talk
about it while looking at trace_sort output (and, I guess, the query
plan). I've asked Tomas for trace_sort output in all relevant cases.
There is no point in "flying blind" and speculating what the problem
was from a distance.


The updated benchmarks are currently running. I'm out of office until 
Friday, and I'd like to process the results over the weekend. FWIW I'll 
have results for these cases:


1) unpatched (a414d96a)
2) patched, default settings
3) patched, replacement_sort_mem=64

Also, I'll have trace_sort=on output for all the queries, so we can 
investigate further.





Also, it's pretty clear that the patch has more large wins than it
does large losses, but it seems pretty easy to imagine people who
haven't tuned any GUCs writing in to say that 9.6 is way slower on
their workload, because those people are going to be at
work_mem=4MB, maintenance_work_mem=64MB. At those numbers, if
Tomas's data is representative, it's not hard to imagine that the
number of people who see a significant regression might be quite a
bit larger than the number who see a significant speedup.


Yeah. That was one of the goals of the benchmark, to come up with some 
tuning recommendations. On some systems significantly increasing memory 
GUCs may not be possible, though - say, on very small systems with very 
limited amounts of RAM.




I don't think they are representative. Greg Stark characterized the
regressions as being fairly limited, mostly at the very low end. And
that was *before* all the memory fragmentation stuff made that
better. I haven't done any analysis of how much better that made the
problem *across the board* yet, but for int4 cases I could make 1MB
work_mem queries faster with gigabytes of data on my laptop. I
believe I tested various datum sort cases there, like "select
count(distinct(foo)) from bar"; those are a very pure test of the
patch.



Well, I'd guess those conclusions may be a bit subjective.

regards


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Using quicksort for every external sort run

2016-03-29 Thread Peter Geoghegan
On Tue, Mar 29, 2016 at 12:43 PM, Peter Geoghegan  wrote:
> A complete do-over from Tomas would be best, here. He has already
> acknowledged that the i5 CREATE INDEX results were completely invalid.

The following analysis is all based on Xeon numbers, which as I've
said we should focus on pending a do-over from Tomas. Especially
important here is the largest set -- the 10M numbers from
results-xeon-10m.ods.

I think that abbreviation distorts things here. We also see distortion
from "padding" cases.

Rather a lot of "padding" is used, FWIW. From Tomas' script:

INSERT INTO numeric_test_padding SELECT a, repeat(md5(a::text),10)
FROM data_float ORDER BY a;

This makes the tests have TOAST overhead.

Some important observations on results-xeon-10m:

* There are almost no regressions for types that don't use
abbreviation. There might be one exception when there is both padding
and presorted input -- the 32MB
high_cardinality_almost_asc/high_cardinality_sorted/unique_sorted
"SELECT * FROM int_test_padding ORDER BY a", which takes 26% - 35%
longer (those are all basically the same cases). But it's a big win in
the high_cardinality_random, unique_random, and even unique_almost_asc
categories, or when DESC order was requested in all categories (I note
that there is certainly an emphasis on pre-sorted cases in the choice
of categories). Other than that, no regressions from non-abbreviated
types.

* No CREATE INDEX case is ever appreciably regressed, even with
maintenance_work_mem at 8MB, 1/8 of its default value of 64MB. (Maybe
we lose 1% - 3% with the other (results-xeon-1m.ods) cases, where
maintenance_work_mem is close to or actually high enough to get an
internal sort). It's a bit odd that "CREATE INDEX x ON
text_test_padding (a)" is about a wash for
high_cardinality_almost_asc, but I think that's just because we're
super I/O bound for this presorted case and cannot make up for it
with quicksort's "bubble sort best case" precheck for presortedness;
replacement selection does better, in a way that might even result
in a clean draw. CREATE INDEX looks very good in general. I think
abbreviation might abort in one or two cases for text, but the picture
for the patch is still solid.

* "Padding" can really distort low-end cases, that become more about
moving big tuples around than actual sorting. If you really want to
see how high_cardinality_almost_asc queries like "SELECT * FROM
text_test_padding ORDER BY a" are testing the wrong thing, consider
the best and worst case for the master branch with any amount of
work_mem. The 10 million tuple high_cardinality_almost_asc case takes
40.16 seconds, 39.95 seconds, 40.98 seconds, and 41.28 seconds, and
42.1 seconds for respective work_mems of 8MB, 32MB, 128MB, 512MB, and
1024MB. This is a very narrow case because it totally deemphasizes
comparison cost and emphasizes moving tuples around, involves
abbreviation of text with a merge phase that cannot use abbreviation
that only the patch has due to RS best-case on master. The case is
seriously short changed by the memory batching refund thing in
practice. When is *high cardinality text* (not dates or something)
ever likely to be found in pre-sorted order for 10 million tuples in
the real world? Besides, we just stopped trusting strxfrm(), so the
case would probably be a wash now at worst.

* The more plausible padding + presorted + abbreviation case that is
sometimes regressed is "SELECT * FROM numeric_test_padding ORDER BY
a". But that's regressed a lot less than the aforementioned "SELECT *
FROM text_test_padding ORDER BY a" case, and only at the low end. It
is sometimes faster where the original case I mentioned is slower.

* Client overhead may distort things in the case of queries like
"SELECT * FROM foo ORDER BY bar". This could be worse for the patch,
which does relatively more computation during the final on-the-fly
merge phase (which is great when you can overlap that with I/O;
perhaps not when you get more icache misses with other computation).
Aside from just adding a lot of noise, this could unfairly make the
patch look a lot worse than master.

Now, I'm not saying all of this doesn't matter. But these are all
fairly narrow, pathological cases, often more about moving big tuples
around (in memory and on disk) than about sorting. These regressions
are well worth it. I don't think I can do any more than I already have
to fix these cases; it may be impossible. It's a very difficult thing
to come up with an algorithm that's unambiguously better in every
possible case. I bent over backwards to fix low-end regressions
already.

In memory rich environments with lots of I/O bandwidth, I've seen this
patch make CREATE INDEX ~2.5x faster for int4, on a logged table. More
importantly, the patch makes setting maintenance_work_mem easy. Users'
intuition for how sizing it ought to work now becomes more or less
correct: In general, for each individual utility command bound by
maintenance_work_mem, more 

Re: [HACKERS] Using quicksort for every external sort run

2016-03-29 Thread Peter Geoghegan
On Tue, Mar 29, 2016 at 9:11 AM, Robert Haas  wrote:
> One test that kind of bothers me in particular is the "SELECT DISTINCT
> a FROM numeric_test ORDER BY a" test on the high_cardinality_random
> data set.  That's a wash at most work_mem values, but at 32MB it's
> more than 3x slower.  That's very strange, and there are a number of
> other results like that, where one particular work_mem value triggers
> a large regression.  That's worrying.

That case is totally invalid as a benchmark for this patch. Here is
the query plan I get (doesn't matter if I run analyze) when I follow
Tomas' high_cardinality_random 10M instructions (including setting
work_mem to 32MB):

postgres=# explain analyze select distinct a from numeric_test order by a;
                                  QUERY PLAN
──────────────────────────────────────────────────────────────────────────
 Sort  (cost=268895.39..270373.10 rows=591082 width=8) (actual time=3907.917..4086.174 rows=999879 loops=1)
   Sort Key: a
   Sort Method: external merge  Disk: 18536kB
   ->  HashAggregate  (cost=206320.50..212231.32 rows=591082 width=8) (actual time=3109.619..3387.599 rows=999879 loops=1)
         Group Key: a
         ->  Seq Scan on numeric_test  (cost=0.00..175844.40 rows=12190440 width=8) (actual time=0.025..601.295 rows=1000 loops=1)
 Planning time: 0.088 ms
 Execution time: 4120.656 ms
(8 rows)

Does that seem like a fair test of this patch?
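
(If one did want this case to exercise the sort code, a sketch --
enable_hashagg is a stock planner GUC -- might be:

SET enable_hashagg = off;
EXPLAIN ANALYZE SELECT DISTINCT a FROM numeric_test ORDER BY a;

which forces the DISTINCT to be implemented by the sort itself, rather
than mostly by the HashAggregate.)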

I must also point out an inexplicable difference between the i5 and
Xeon in relation to this query. It took about 10% less time on
the patched Xeon 10M case, not ~200% more (line 53 of the summary page
in each 10M case). So even if this case did exercise the patch well,
it's far from clear that it has even been regressed at all. It's far
easier to imagine that there was some problem with the i5 tests.

A complete do-over from Tomas would be best, here. He has already
acknowledged that the i5 CREATE INDEX results were completely invalid.
Pending a do-over from Tomas, I recommend ignoring the i5 tests
completely. Also, I should once again point out that many of the
work_mem cases actually had internal sorts at the high end, so the
code in the patches simply wasn't exercised at all at the high end
(the 1024MB cases, where the numbers might be expected to get really
good).

If there is ever a regression, it is only really sensible to talk
about it while looking at trace_sort output (and, I guess, the query
plan). I've asked Tomas for trace_sort output in all relevant cases.
There is no point in "flying blind" and speculating what the problem
was from a distance.

> Also, it's pretty clear that the patch has more large wins than it
> does large losses, but it seems pretty easy to imagine people who
> haven't tuned any GUCs writing in to say that 9.6 is way slower on
> their workload, because those people are going to be at work_mem=4MB,
> maintenance_work_mem=64MB.  At those numbers, if Tomas's data is
> representative, it's not hard to imagine that the number of people who
> see a significant regression might be quite a bit larger than the
> number who see a significant speedup.

I don't think they are representative. Greg Stark characterized the
regressions as being fairly limited, mostly at the very low end. And
that was *before* all the memory fragmentation stuff made that better.
I haven't done any analysis of how much better that made the problem
*across the board* yet, but for int4 cases I could make 1MB work_mem
queries faster with gigabytes of data on my laptop. I believe I tested
various datum sort cases there, like "select count(distinct(foo)) from
bar"; those are a very pure test of the patch.

-- 
Peter Geoghegan




Re: [HACKERS] Using quicksort for every external sort run

2016-03-29 Thread Robert Haas
On Mon, Mar 28, 2016 at 11:18 PM, Peter Geoghegan  wrote:
> Note that amcheck V2, which I posted just now features tests for
> external sorting. The way these work requires discussion. The tests
> are motivated in part by the recent strxfrm() debacle, as well as by
> the need to have at least some test coverage for this patch. It's bad
> that external sorting currently has no test coverage. We should try
> and do better there as part of this overhaul to tuplesort.c.

Test coverage is good!

However, I don't see that you've responded to Tomas Vondra's report of
regressions.  Maybe you're waiting for more data from him, but we're
running out of time here.  I think what we need to decide is whether
these results are bad enough that the patch needs more work on the
regressed cases, or whether we're comfortable with some regressions in
low-memory configurations for the benefit of higher-memory
configurations.  I'm kind of on the fence about that, myself.

One test that kind of bothers me in particular is the "SELECT DISTINCT
a FROM numeric_test ORDER BY a" test on the high_cardinality_random
data set.  That's a wash at most work_mem values, but at 32MB it's
more than 3x slower.  That's very strange, and there are a number of
other results like that, where one particular work_mem value triggers
a large regression.  That's worrying.

Also, it's pretty clear that the patch has more large wins than it
does large losses, but it seems pretty easy to imagine people who
haven't tuned any GUCs writing in to say that 9.6 is way slower on
their workload, because those people are going to be at work_mem=4MB,
maintenance_work_mem=64MB.  At those numbers, if Tomas's data is
representative, it's not hard to imagine that the number of people who
see a significant regression might be quite a bit larger than the
number who see a significant speedup.

On the whole, I'm tempted to say this needs more work before we commit
to it, but I'd like to hear other opinions on that point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Using quicksort for every external sort run

2016-03-28 Thread Peter Geoghegan
On Thu, Mar 10, 2016 at 6:54 PM, Peter Geoghegan  wrote:
> I've used amcheck [2] to test this latest revision -- the tool ought
> to not see any problems with any index created with the patch applied.
> Reviewers might find it helpful to use amcheck, too. As 9.6 is
> stabilized, I anticipate that amcheck will give us a fighting chance
> at early detection of any bugs that might have slipped into tuplesort,
> or a B-Tree operator class. Since we still don't even have one single
> test of the external sort code [3], it's just as well. If we wanted to
> test external sorting, maybe we'd do that by adding tests to amcheck,
> that are not run by default, much like test_decoding, which tests
> logical decoding but is not targeted by "make installcheck"; that
> would allow the tests to be fairly comprehensive without being
> annoying. Using amcheck neatly side-steps issues with the portability
> of "expected" pg_regress output when collatable type sorting is
> tested.

Note that amcheck V2, which I posted just now features tests for
external sorting. The way these work requires discussion. The tests
are motivated in part by the recent strxfrm() debacle, as well as by
the need to have at least some test coverage for this patch. It's bad
that external sorting currently has no test coverage. We should try
and do better there as part of this overhaul to tuplesort.c.
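
For illustration, checking an index built by the patched external sort
path looks something like this (a sketch; it assumes amcheck is
installed as an extension exposing bt_index_check(), and uses one of
the benchmark tables):

CREATE EXTENSION amcheck;
CREATE INDEX ON int_test (a);
SELECT bt_index_check('int_test_a_idx');

bt_index_check() raises an error if the B-Tree invariants don't hold,
which is exactly the kind of corruption a broken external sort could
produce.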

Thanks
-- 
Peter Geoghegan




Re: [HACKERS] Using quicksort for every external sort run

2016-03-24 Thread Peter Geoghegan
On Sun, Mar 20, 2016 at 11:01 PM, Peter Geoghegan  wrote:
> Allowing 0 tuple runs in rare cases seems like the simplest solution.
> After all, mergeprereadone() is expressly prepared for 0 tuple runs.
> It says "ensure that we have at least one tuple, if any are to be
> had". There is no reason to assume that it says this only because it
> imagines that no tuples might be found *only after* the first preread
> for the merge (by which I mean I don't think that only applies when a
> final on-the-fly merge reloads tuples from one particular tape
> following running out of tuples of the tape/run in memory).

I just realized that there is what amounts to an over-zealous
assertion in dumpbatch():

> +* When this edge case hasn't occurred, the first memtuple should not
> +* be found to be heapified (nor should any other memtuple).
> +*/
> +   Assert(state->memtupcount == 0 ||
> +  state->memtuples[0].tupindex == HEAP_RUN_NEXT);

The problem is that state->memtuples[0].tupindex won't have been
*reliably* initialized here. We could make sure that it is for the
benefit of this assertion, but I think it would be better to just
remove the assertion, which isn't testing very much over and above the
similar assertions that appear in the only dumpbatch() caller,
dumptuples().

-- 
Peter Geoghegan




Re: [HACKERS] Using quicksort for every external sort run

2016-03-23 Thread Peter Geoghegan
On Wed, Mar 23, 2016 at 8:05 PM, Tomas Vondra
 wrote:
> FWIW, maintenance_work_mem was set to 1GB on the i5 machine and 256MB on the
> Xeon. Hmm, maybe that's why we see no difference for CREATE INDEX on the i5,
> and an improvement on the Xeon.

That would explain it.

> Not a big deal - it's easy enough to change the config and repeat the
> benchmark. Are there any particular replacement_sort_mem values that you
> think would be interesting to configure?

I would start with replacement_sort_mem=64, i.e. 64KB, which
effectively disables the replacement selection heap.
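
In a session with the patched build, that is just (a sketch):

SET replacement_sort_mem = 64;  -- GUC is in kilobytes; never build the heap
SET work_mem = '8MB';

With the heap out of the picture, any remaining low-end regression
can't be blamed on replacement selection.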

> I have to admit I'm a bit afraid we'll introduce a new GUC that only very
> few users will know how to set properly, and so most people will run with
> the default value or set it to something stupid.

I agree.

> Are you saying none of the queries triggers the memory context resets? What
> queries would trigger that (to test the theory)?

They will still do the context resetting and so on just the same, but
would use a heap for the first attempt. But replacement_sort_mem=64
would let us know what things look like with the heap taken out of the
picture entirely.

> But if we care about 9.5 -> 9.6 regressions, then perhaps we should include
> that commit into the benchmark, because that's what the users will see? Or
> have I misunderstood the second part?

I think it's good that you didn't include the March 17 commit of
memory batching to master when testing the master branch.
You should continue to do that, because we care about regressions
against 9.5 only. The only issue insofar as what code was tested is
that replacement_sort_mem was not set to 64 (to effectively disable
any use of the heap by the patch). I would like to see if we can get
rid of replacement_sort_mem without causing any real regressions,
which I think the memory context reset stuff makes possible.

There was a new version of my quicksort patch posted after March 17,
but don't worry about it -- that's totally cosmetic. Some minor
tweaks.

> BTW which patch does the memory batching? A quick search through git log did
> not return any recent patches mentioning these terms.

Commit 0011c0091e886b874e485a46ff2c94222ffbf550. But, like I said,
avoid changing what you're testing as master; do not include that. The
patch set you were testing is fine. Nothing is missing.

> As I mentioned above, I haven't realized work_mem does not matter for CREATE
> INDEX, and maintenance_work_mem was set to a fixed value for the whole test.
> And the two machines used different values for this particular configuration
> value - Xeon used just 256MB, while i5 used 1GB. So while on i5 it was just
> a single chunk, on Xeon there were multiple batches. Hence the different
> behavior.

Makes sense. Obviously this should be avoided, though.
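
For anyone repeating the benchmark: CREATE INDEX is bounded by
maintenance_work_mem rather than work_mem, so both machines need
matching settings for their CREATE INDEX results to be comparable. For
example:

SET maintenance_work_mem = '256MB';
CREATE INDEX ON int_test (a);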

> No it doesn't add overhead. The script actually does
>
> COPY (query) TO '/dev/null'
>
> on the server for all queries (except for the CREATE INDEX, obviously), so
> there should be pretty much no overhead due to transferring rows to the
> client and so on.

That still adds overhead, because the output functions are still used
to create a textual representation of data. This was how Andres tested
the improvement to the timestamptz output function committed to 9.6,
for example.
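
For reference, the server-side form under discussion is simply
something like:

COPY (SELECT * FROM int_test ORDER BY a) TO '/dev/null';

which avoids client transfer and client-side processing, but still runs
each datatype's output function once per row to produce the text written
to /dev/null.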

>> If you really wanted to make the patch look good, a sort with 5GB of
>> work_mem is the best way, FWIW. The heap data structure used by the
>> master branch tuplesort.c will handle that very badly. You use no
>> temp_tablespaces here. I wonder if the patch would do better with
>> that. Sorting can actually be quite I/O bound with the patch
>> sometimes, where it's usually CPU/memory bound with the heap,
>> especially with lots of work_mem. More importantly, it would be more
>> informative if the temp_tablespace was not affected by I/O from
>> Postgres' heap.
>
>
> I'll consider testing that. However, I don't think there was any significant
> I/O on the machines - particularly not on the Xeon, which has 16GB of RAM.
> So the temp files should fit into that quite easily.

Right, but with a bigger sort, there might well be more I/O.
Especially for the merge. It might be that that holds back the patch
from doing even better than the master branch does.

> The i5 machine has only 8GB of RAM, but it has 6 SSD drives in raid0. So I
> doubt it was I/O bound.

These patches can sometimes be significantly I/O bound on my laptop,
where that didn't happen before. Sounds unlikely here, though.

>> I also like seeing a sample of "trace_sort = on" output. I don't
>> expect you to carefully collect that in every case, but it can tell
>> us a lot about what's really going on when benchmarking.
>
>
> Sure, I can collect that.

Just for the interesting cases. Or maybe just dump it all and let me
figure it out for myself. trace_sort output shows me how many runs
there are, how abbreviation did, how memory was used, and even if the
sort was I/O bound at various stages (it dumps some getrusage stats to
the log, too). You can usually tell exactly what happened for 
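
A minimal way to capture that for a single session (trace_sort is
available in standard builds; client_min_messages merely surfaces the
LOG lines in psql) is something like:

SET trace_sort = on;
SET client_min_messages = log;
SELECT count(distinct(foo)) FROM bar;  -- hypothetical datum sort case

The LOG output then shows, per tuplesort, the run counts, memory
accounting, and getrusage numbers just mentioned.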

Re: [HACKERS] Using quicksort for every external sort run

2016-03-23 Thread Tomas Vondra

Hi,

On 03/24/2016 03:00 AM, Peter Geoghegan wrote:

On Tue, Mar 22, 2016 at 3:35 PM, Tomas Vondra
 wrote:

I've tested the patch you've sent on 2016/3/11, which I believe is the last
version. I haven't tuned the replacement_sort_mem at all, because
my understanding was that it's not 9.6 material (per your
message). So my intent was to test the configuration people are
likely to use by default.


I meant that using replacement selection in a special way with
CREATE INDEX was not 9.6 material. But replacement_sort_mem is. And
so, any case with the (maintenance)_work_mem <= 16MB will have used a
heap for the first run.


FWIW, maintenance_work_mem was set to 1GB on the i5 machine and 256MB on 
the Xeon. Hmm, maybe that's why we see no difference for CREATE INDEX on 
the i5, and an improvement on the Xeon.




I'm sorry I did not make a point of telling you this. It's my fault.
The result in any case is that pre-sorted cases will be similar with
and without the patch, since replacement selection can thereby make
one long run. But on non-sorted cases, the patch helps less because
it is in use less -- with not so much data overall, possibly much
less (which I think explains why the 1M row tests seem so much less
interesting than the 10M row tests).


Not a big deal - it's easy enough to change the config and repeat the 
benchmark. Are there any particular replacement_sort_mem values that you 
think would be interesting to configure?


I have to admit I'm a bit afraid we'll introduce a new GUC that only 
very few users will know how to set properly, and so most people will 
run with the default value or set it to something stupid.




I worry that at the low end, replacement_sort_mem makes the patch
have one long run, but still some other runs, so merging is
unbalanced. We should consider if the patch can beat the master
branch at the low end without using a replacement selection heap. It
would do better in at least some cases in low memory conditions,
possibly a convincing majority of cases. I had hoped that my recent
idea (since committed) of resetting memory contexts would help a lot
with regressions when work_mem is very low, and that particular
theory isn't really tested here.


Are you saying none of the queries triggers the memory context resets? 
What queries would trigger that (to test the theory)?





I'm not sure which commit you are referring to. The benchmark was
done on a414d96a (from 2016/3/10). However I'd expect that to
affect both sets of measurements, although it's possible that it
affects the patched version differently.


You did test the right patches. It just so happens that the master
branch has the memory batching stuff now, so it doesn't get
credited with that. I think this is good, though, because we care
about 9.5 -> 9.6 regressions.


So there's a commit in master (but not in 9.5), adding memory batching, 
but it got committed before a414d96a so the benchmark does not measure 
its impact (with respect to 9.5). Correct?


But if we care about 9.5 -> 9.6 regressions, then perhaps we should 
include that commit into the benchmark, because that's what the users 
will see? Or have I misunderstood the second part?


BTW which patch does the memory batching? A quick search through git log 
did not return any recent patches mentioning these terms.



Improvement ratio (patched time/master time) for Xeon 10 million row
case "SELECT * FROM int_test_padding ORDER BY a DESC":

For work_mem of 8MB = 0.83, 32MB = 0.62, 128MB = 0.52, 512MB = 0.47,
1024MB = 1.00

So, it gets faster than the master branch as more memory is
available, but then it goes to 1.00 -- a perfect draw. I think that
this happened simply because at that point, the sort was an internal
sort (even though similar CREATE INDEX case did not go internal at
the same point). The (internal) 1024MB case is not that much faster
than the 512MB external case, which is pretty good.


Indeed.



There are also "near draws", where the ratio is 0.98 or so. I think
that this is because abbreviation is aborted, which can be a problem
with synthetic data + text -- you get a very slow sort either way,


That is possible, yes. It's true that the worst regressions are on text, 
although there are a few on numeric too (albeit not as significant).



where most time is spent calling strcoll(), and cache
characteristics matter much less. Those cases seemingly take much
longer overall, so this theory makes sense. Unfortunately,
abbreviated keys for text that is not C locale text was basically
disabled across the board today due to a glibc problem. :-(


Yeah. Bummer :-(



Whenever I see that the patch is exactly as fast as the master
branch, I am skeptical. I am particularly skeptical of all i5
results (including 10M cases), because the patch seems to be almost
perfectly matched to the master branch for CREATE INDEX cases (which
are the best cases for the patch on your Xeon server) -- it's much
easier to 

Re: [HACKERS] Using quicksort for every external sort run

2016-03-23 Thread Peter Geoghegan
On Tue, Mar 22, 2016 at 2:27 PM, Tomas Vondra
 wrote:
> For example these two queries got almost 2x as slow for some data sets:
>
> SELECT a FROM numeric_test UNION SELECT a FROM numeric_test_padding
> SELECT a FROM text_test UNION SELECT a FROM text_test_padding
>
> I assume the slowdown is related to the batching (as it's only happening for
> low work_mem values), so perhaps there's an internal heuristic that we
> could tune?

Can you show trace_sort output for these cases? Both master, and patched?

Thanks
-- 
Peter Geoghegan




Re: [HACKERS] Using quicksort for every external sort run

2016-03-23 Thread Peter Geoghegan
On Tue, Mar 22, 2016 at 3:35 PM, Tomas Vondra
 wrote:
> I've tested the patch you've sent on 2016/3/11, which I believe is the last
> version. I haven't tuned the replacement_sort_mem at all, because my
> understanding was that it's not 9.6 material (per your message). So my
> intent was to test the configuration people are likely to use by default.

I meant that using replacement selection in a special way with CREATE
INDEX was not 9.6 material. But replacement_sort_mem is. And so, any
case with the (maintenance)_work_mem <= 16MB will have used a heap for
the first run.

I'm sorry I did not make a point of telling you this. It's my fault.
The result in any case is that pre-sorted cases will be similar with
and without the patch, since replacement selection can thereby make
one long run. But on non-sorted cases, the patch helps less because it
is in use less -- with not so much data overall, possibly much less
(which I think explains why the 1M row tests seem so much less
interesting than the 10M row tests).

I worry that at the low end, replacement_sort_mem makes the patch have
one long run, but still some other runs, so merging is
unbalanced. We should consider if the patch can beat the master branch
at the low end without using a replacement selection heap. It would do
better in at least some cases in low memory conditions, possibly a
convincing majority of cases. I had hoped that my recent idea (since
committed) of resetting memory contexts would help a lot with
regressions when work_mem is very low, and that particular theory
isn't really tested here.

> I'm not sure which commit you are referring to. The benchmark was done on
> a414d96a (from 2016/3/10). However I'd expect that to affect both sets of
> measurements, although it's possible that it affects the patched version
> differently.

You did test the right patches. It just so happens that the master
branch has the memory batching stuff now, so it doesn't get
credited with that. I think this is good, though, because we care
about 9.5 -> 9.6 regressions.

Improvement ratio (patched time/master time) for Xeon 10 million row
case "SELECT * FROM int_test_padding ORDER BY a DESC":

For work_mem of 8MB = 0.83, 32MB = 0.62, 128MB = 0.52, 512MB = 0.47,
1024MB = 1.00

So, it gets faster than the master branch as more memory is available,
but then it goes to 1.00 -- a perfect draw. I think that this happened
simply because at that point, the sort was an internal sort (even
though similar CREATE INDEX case did not go internal at the same
point). The (internal) 1024MB case is not that much faster than the
512MB external case, which is pretty good.

There are also "near draws", where the ratio is 0.98 or so. I think
that this is because abbreviation is aborted, which can be a problem
with synthetic data + text -- you get a very slow sort either way,
where most time is spent calling strcoll(), and cache characteristics
matter much less. Those cases seemingly take much longer overall, so
this theory makes sense. Unfortunately, abbreviated keys for text that
is not C locale text was basically disabled across the board today due
to a glibc problem. :-(

Whenever I see that the patch is exactly as fast as the master branch,
I am skeptical. I am particularly skeptical of all i5 results
(including 10M cases), because the patch seems to be almost perfectly
matched to the master branch for CREATE INDEX cases (which are the
best cases for the patch on your Xeon server) -- it's much easier to
believe that there was a problem during the test, honestly, like
maintenance_work_mem wasn't set correctly. Those two things are so
different that I have a hard time imagining that they'd ever really
draw. I mean, it's possible, but it's more likely to be a problem with
testing. And, queries like "SELECT * FROM int_test_padding ORDER BY a
DESC" return all rows, which adds noise from all the client overhead.
In fact, you often see that adding more memory helps no case here, so
it seems a bit pointless. Maybe they should be written like "SELECT *
FROM (select * from int_test_padding ORDER BY a DESC OFFSET 1e10) ff"
instead. And maybe queries like "SELECT DISTINCT a FROM int_test ORDER
BY a" would be better as "SELECT COUNT(DISTINCT a) FROM int_test", in
order to test the datum/aggregate case. Just suggestions.

If you really wanted to make the patch look good, a sort with 5GB of
work_mem is the best way, FWIW. The heap data structure used by the
master branch tuplesort.c will handle that very badly. You use no
temp_tablespaces here. I wonder if the patch would do better with
that. Sorting can actually be quite I/O bound with the patch
sometimes, where it's usually CPU/memory bound with the heap,
especially with lots of work_mem. More importantly, it would be more
informative if the temp_tablespace was not affected by I/O from
Postgres' heap.

I also like seeing a sample of "trace_sort = on" output. I don't
expect you to carefully collect that 

Re: [HACKERS] Using quicksort for every external sort run

2016-03-22 Thread Tomas Vondra

Hi,

On 03/22/2016 11:07 PM, Peter Geoghegan wrote:

On Tue, Mar 22, 2016 at 2:27 PM, Tomas Vondra
 wrote:

Each query was executed 5x for each work_mem value (between 8MB and 1GB),
and then a median of the runs was computed and that's what's on the
"comparison". This compares a414d96ad2b without (master) and with the
patches applied (patched). The last set of columns is simply a "speedup"
where "<1.0" means the patched code is faster, while >1.0 means it's slower.
Values below 0.9 or above 1.1 use a green or red background, to make the most
significant improvements or regressions clearly visible.

For the smaller data set (1M rows), things work pretty well. There are
pretty much no red cells (so no significant regressions), but quite a few
green ones (with duration reduced by up to 50%). There are some results in
the 1.0-1.05 range, but considering how short the queries are, I don't think
this is a problem. Overall the total duration was reduced by ~20%, which is
nice.

For the 10M data sets, total speedup is also almost ~20%, and the speedups
for most queries are also very nice (often ~50%).


To be clear, you seem to mean that ~50% of the runtime of the query
was removed. In other words, the quicksort version is twice as fast.


Yes, that's what I meant. Sorry for the inaccuracy.




But the number of regressions is considerably higher - there's a
small number of queries that got significantly slower for multiple
data sets, particularly for smaller work_mem values.


No time to fully consider these benchmarks right now, but: Did you
make sure to set replacement_sort_mem very low so that it was never
used when patched? And, was this on the latest version of the patch,
where memory contexts were reset (i.e. the version that got
committed recently)? You said something about memory batching, so
ISTM that you should set that to '64', to make sure you don't get one
longer run. That might mess with merging.


I've tested the patch you've sent on 2016/3/11, which I believe is the 
last version. I haven't tuned the replacement_sort_mem at all, because
my understanding was that it's not 9.6 material (per your message). So
my intent was to test the configuration people are likely to use by default.


I'm not sure about the batching - that was merely a guess of what might 
be the problem.




Note that the master branch has the memory batching patch as of a
few days back, so if that's the problem at the low end, then that's
bad.


I'm not sure which commit you are referring to. The benchmark was done
on a414d96a (from 2016/3/10). However I'd expect that to affect both 
sets of measurements, although it's possible that it affects the patched 
version differently.



But I don't think it is: I think that the regressions at the low end
are about abbreviated keys, particularly the numeric cases. There is
a huge gulf in the cost of those comparisons (abbreviated vs
authoritative), and it is legitimately a weakness of the patch that
it reduces the number in play. I think it's still well worth it, but
it is a downside. There is no reason why the authoritative numeric
comparator has to allocate memory, but right now that case isn't
optimized.


Yes, numeric and text are the most severely affected cases.



I find it weird that the patch is exactly the same as master in a
lot of cases. ISTM that with a case where you use 1GB of memory to
sort 1 million rows, you're so close to an internal sort that it
hardly matters (master will not need a merge step at all, most
likely). The patch works best with sorts that take tens of seconds,
and I don't think I see any here, nor any high memory tests where RS
flops. Now, I think you focused on regressions because that was what
was interesting, which is good. I just want to put that in context.


I don't think the tests on 1M rows are particularly interesting, and I 
don't see any noticeable regressions there. Perhaps you mean the tests 
on 10M rows instead?


Yes, you're correct - I was mostly looking for regressions. However, the 
worst cases of regressions are on relatively long sorts, e.g. slowing 
down from 35 seconds to 64 seconds, etc. So that's quite long, and it's 
surely using a non-trivial amount of memory. Or am I missing something?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Using quicksort for every external sort run

2016-03-22 Thread Peter Geoghegan
On Tue, Mar 22, 2016 at 2:27 PM, Tomas Vondra
 wrote:
> Each query was executed 5x for each work_mem value (between 8MB and 1GB),
> and then a median of the runs was computed and that's what's on the
> "comparison". This compares a414d96ad2b without (master) and with the
> patches applied (patched). The last set of columns is simply a "speedup"
> where "<1.0" means the patched code is faster, while >1.0 means it's slower.
> Values below 0.9 or above 1.1 use a green or red background, to make the most
> significant improvements or regressions clearly visible.
>
> For the smaller data set (1M rows), things work pretty well. There are
> pretty much no red cells (so no significant regressions), but quite a few
> green ones (with duration reduced by up to 50%). There are some results in
> the 1.0-1.05 range, but considering how short the queries are, I don't think
> this is a problem. Overall the total duration was reduced by ~20%, which is
> nice.
>
> For the 10M data sets, total speedup is also almost ~20%, and the speedups
> for most queries are also very nice (often ~50%).

To be clear, you seem to mean that ~50% of the runtime of the query
was removed. In other words, the quicksort version is twice as fast.

> But the number of
> regressions is considerably higher - there's a small number of queries that
> got significantly slower for multiple data sets, particularly for smaller
> work_mem values.

No time to fully consider these benchmarks right now, but: Did you
make sure to set replacement_sort_mem very low so that it was never
used when patched? And, was this on the latest version of the patch,
where memory contexts were reset (i.e. the version that got committed
recently)? You said something about memory batching, so ISTM that you
should set that to '64', to make sure you don't get one longer run.
That might mess with merging.

Note that the master branch has the memory batching patch as of a few
days back, so if that's the problem at the low end, then that's bad.
But I don't think it is: I think that the regressions at the low end
are about abbreviated keys, particularly the numeric cases. There is a
huge gulf in the cost of those comparisons (abbreviated vs
authoritative), and it is legitimately a weakness of the patch that it
reduces the number in play. I think it's still well worth it, but it
is a downside. There is no reason why the authoritative numeric
comparator has to allocate memory, but right now that case isn't
optimized.

I find it weird that the patch is exactly the same as master in a lot
of cases. ISTM that with a case where you use 1GB of memory to sort 1
million rows, you're so close to an internal sort that it hardly
matters (master will not need a merge step at all, most likely). The
patch works best with sorts that take tens of seconds, and I don't
think I see any here, nor any high memory tests where RS flops. Now, I
think you focused on regressions because that was what was
interesting, which is good. I just want to put that in context.

Thanks
-- 
Peter Geoghegan




Re: [HACKERS] Using quicksort for every external sort run

2016-03-21 Thread Peter Geoghegan
On Thu, Mar 17, 2016 at 1:13 PM, Robert Haas  wrote:
> OK, I have now committed 0001

I attach a revision of the external quicksort patch and supplementary
small patches, rebased on top of the master branch.

Changes:

1. As I mentioned on the "Improve memory management for external
sorts" -committers thread, we should protect against currentRun
integer overflow. This new revision does so.

I'm not sure if that change needs to be back-patched; I just don't
want to take any risks, and see this as low cost insurance. Really low
workMem sorts are now actually fast enough that this seems like
something that could happen on a misconfigured system.

2. Add explicit constants for special run numbers that replacement
selection needs to care about in particular.

I did this because change 1 reminded me of the "currentRun vs.
SortTuple.tupindex" run numbering subtleties. The explicit use of
certain run number constants seems to better convey some tricky
details, in part by letting me add a few documenting, if obvious,
assertions. It's educational to be able to grep for these
constants (e.g., the new HEAP_RUN_NEXT constant) to jump to the parts
of the code that need to think about replacement selection. As things
were, that code relied on too much from too great a distance (arguably
this is true even in the master branch). This change in turn led to
minor wordsmithing to adjacent areas here and there, most of it
subtractive.

As an example of where this helps, ISTM that the assertion added to
the routine tuplesort_heap_insert() is now self-documenting, which
wasn't the case before.

3. There was one very tricky consideration around an edge-case that
required careful thought. This was an issue within my new function
dumpbatch(). It could previously perform what turns out to be a
superfluous selectnewtape() call when we take the dumpbatch()
"state->memtupcount == 0" early return path (see the last revision for
full details of that now-axed code path). Now, we accept that there
may on occasion be 0 tuple runs. In other words, we now never return
early from within dumpbatch().

There was previously no explanation for why it was okay to have a
superfluous selectnewtape() call. However, needing to be certain that
any newly selected destTape tape will go on to receive a run is
implied for the general case by this existing selectnewtape() code
comment:

 * This is called after finishing a run when we know another run
 * must be started.  This implements steps D3, D4 of Algorithm D.

While the previous revision was correct anyway, I tried to explain why
it was correct in comments, and soon realized that that was a really
bad idea; the rationale was excessively complicated.

Allowing 0 tuple runs in rare cases seems like the simplest solution.
After all, mergeprereadone() is expressly prepared for 0 tuple runs.
It says "ensure that we have at least one tuple, if any are to be
had". There is no reason to assume that it says this only because it
imagines that no tuples might be found *only after* the first preread
for the merge (by which I mean I don't think that only applies when a
final on-the-fly merge reloads tuples from one particular tape
following running out of tuples of the tape/run in memory).

4. I updated the function beginmerge() to acknowledge an inconsistency
for pass-by-value datumsorts, which I mentioned in passing on this
thread a few days back. The specific change:

 @@ -2610,7 +2735,12 @@ beginmerge(Tuplesortstate *state, bool finalMergeBatch)

 if (finalMergeBatch)
 {
 -   /* Free outright buffers for tape never actually allocated */
 +   /*
 +* Free outright buffers for tape never actually allocated.  The
 +* !state->tuples case is actually entitled to have at least this much
 +* of a refund ahead of its final merge, but we don't go out of our way
 +* to marginally improve that case.
 +*/
 FREEMEM(state, (state->maxTapes - activeTapes) * TAPE_BUFFER_OVERHEAD);

It's not worth worrying about this case, since the savings are small
(especially now that maxTapes is capped). But it's worth acknowledging
that the "!state->tuples" case is being "short-changed", in the new
spirit of heavily scrutinizing where memory goes in tuplesort.c.

5. I updated the "Add MemoryContextStats() calls for debugging" patch.
I now formally propose that this debugging instrumentation be
committed.

This revised debugging instrumentation patch does not have the system
report anything about the memory context just because "trace_sort =
on". Rather, it does nothing on ordinary builds, where the macro
SHOW_MEMORY_STATS will not be defined (it also respects trace_sort).
This is about the same approach seen in postgres.c's
finish_xact_command(). ISTM that we ought to provide a way of
debugging memory use within tuplesort.c, since we now know that that
could be very important. Let's not forget where the useful places to
look for problems are.

6. 

Re: [HACKERS] Using quicksort for every external sort run

2016-03-19 Thread Robert Haas
On Thu, Mar 10, 2016 at 9:54 PM, Peter Geoghegan  wrote:
> 1. Creates a separate memory context for tuplesort's copies of
> caller's tuples, which can be reset at key points, avoiding
> fragmentation. Every SortTuple.tuple is allocated there (with trivial
> exception); *everything else*, including the memtuples array, is
> allocated in the existing tuplesort context, which becomes the parent
> of this new "caller's tuples" context. Roughly speaking, that means
> that about half of total memory for the sort is managed by each
> context in common cases. Even with a high work_mem memory budget,
> memory fragmentation could previously get so bad that tuplesort would
> in effect claim a share of memory from the OS that is *significantly*
> higher than the work_mem budget allotted to its sort. And with low
> work_mem settings, fragmentation previously made palloc() thrash the
> sort, especially during non-final merging. In this latest revision,
> tuplesort now almost gets to use 100% of the memory that was requested
> from the OS by palloc() in cases tested.

I spent some time looking at this part of the patch yesterday and
today.  This is not a full review yet, but here are some things I
noticed:

- I think that batchmemtuples() is somewhat weird.  Normally,
grow_memtuples() doubles the size of the array each time it's called.
So if you somehow called this function when you still had lots of
memory available, it would just double the size of the array.
However, I think the expectation is that it's only going to be called
when availMem is less than half of allowedMem, in which case we're
going to get the special "last increment of memtupsize" behavior,
where we expand the memtuples array by some multiple between 1.0 and
2.0 based on allowedMem/memNowUsed.  And after staring at this for a
long time ... yeah, I think this does the right thing.  But it
certainly is hairy.

- It's not exactly clear what you mean when you say that the tuplesort
context contains "everything else".  I don't understand why that only
ends up containing half the memory ... what, other than the memtuples
array, ends up there?

- If I understand correctly, the point of the MemoryContextReset call
is: there wouldn't be any tuples in memory at that point anyway.  But
the OS-allocated chunks might be divided up into a bunch of small
chunks that then got stuck on freelists, and those chunks might not be
the right size for the next merge pass.  Resetting the context avoids
that problem by blowing up the freelists.  Right?  Clever.

- I haven't yet figured out why we use batching only for the final
on-the-fly merge pass, instead of doing it for all merges.  I expect
you have a reason.  I just don't know what it is.

- I have also not yet figured out why you chose to replace
state->datumTypByVal with state->tuples and reverse the sense.  I bet
there's a reason for this, too.  I don't know what it is, either.

That's as far as I've gotten thus far.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Using quicksort for every external sort run

2016-03-19 Thread Robert Haas
On Wed, Mar 16, 2016 at 9:42 PM, Peter Geoghegan  wrote:
> On Wed, Mar 16, 2016 at 3:31 PM, Robert Haas  wrote:
>> I spent some time looking at this part of the patch yesterday and
>> today.
>
> Thanks for getting back to it.

OK, I have now committed 0001, and separately, some comment
improvements - or at least, I think they are improvements - based on
this discussion.

Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Using quicksort for every external sort run

2016-03-19 Thread Peter Geoghegan
On Wed, Mar 16, 2016 at 3:31 PM, Robert Haas  wrote:
> I spent some time looking at this part of the patch yesterday and
> today.

Thanks for getting back to it.

> - I think that batchmemtuples() is somewhat weird.  Normally,
> grow_memtuples() doubles the size of the array each time it's called.
> So if you somehow called this function when you still had lots of
> memory available, it would just double the size of the array.
> However, I think the expectation is that it's only going to be called
> when availMem is less than half of allowedMem, in which case we're
> going to get the special "last increment of memtupsize" behavior,
> where we expand the memtuples array by some multiple between 1.0 and
> 2.0 based on allowedMem/memNowUsed.

That's right. It might be possible for the simple doubling behavior to
happen under artificial conditions instead, for example when we have
enormous individual tuples, but if that does happen it's still
correct. I just didn't think it was worth worrying about giving back
more memory in such extreme edge-cases.
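
For illustration, the "last increment" rule amounts to something like
this standalone sketch (the names here are illustrative, not the
actual grow_memtuples() code):

#include <stdio.h>

/*
 * Grow by allowedMem/memNowUsed, clamped to the ordinary doubling,
 * and never shrinking the array.
 */
static long
last_increment(long memtupsize, long memNowUsed, long allowedMem)
{
    double      ratio = (double) allowedMem / (double) memNowUsed;

    if (ratio > 2.0)
        ratio = 2.0;
    if (ratio < 1.0)
        ratio = 1.0;
    return (long) (memtupsize * ratio);
}

int
main(void)
{
    /* e.g. 64MB allowed, 48MB used: multiplier of about 1.33 */
    printf("%ld\n", last_increment(100000, 48L << 20, 64L << 20));
    return 0;
}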

> And after staring at this for a
> long time ... yeah, I think this does the right thing.  But it
> certainly is hairy.

No arguments from me here. I think this is justified, though.

It's great that palloc() provides a simple, robust abstraction.
However, there are a small number of modules in the code, including
tuplesort.c, where we need to be very careful about memory management.
Probably no more than 5 and no fewer than 3. In these places, large
memory allocations are the norm. We ought to pay close attention to
memory locality, heap fragmentation, that memory is well balanced
among competing considerations, etc. It's entirely appropriate that
we'd go to significant lengths to get it right in these places using
somewhat ad-hoc techniques, simply because these are the places where
we'll get a commensurate benefit. Some people might call this adding
custom memory allocators, but I find that to be a loaded term because
it suggests intimate involvement from mcxt.c.

> - It's not exactly clear what you mean when you say that the tuplesort
> context contains "everything else".  I don't understand why that only
> ends up containing half the memory ... what, other than the memtuples
> array, ends up there?

I meant that the existing memory context "sortcontext" contains
everything else that has anything to do with sorting. Everything that
it contains in the master branch it continues to contain today, with
the sole exception of a vast majority of caller's tuples. So,
"sortcontext" continues to include everything you can think of:

* As you pointed out, the memtuples array.

* SortSupport state (assuming idiomatic usage of the API, at least).

* State specific to the cluster case.

* Transient state, specific to the index case (i.e. scankey memory)

* logtape.c stuff.

* Dynamically allocated stuff for managing tapes (see inittapes())

* For the sake of simplicity, a tiny number of remaining tuples (from
"overflow" allocations, or from when we need to free a tape's entire
batch when it is one tuple from exhaustion).

This is for tuples that the tuplesort caller needs to pfree() anyway,
per the tuplesort_get*tuple() API. It's just easier to put these
allocations in the "main" context, to avoid having to reason about any
consequences to calling MemoryContextReset() against our new tuple
context. This precaution is just future-proofing IMV.


I believe that this list is exhaustive.
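
As a rough sketch of the arrangement (using the memory context API as
it exists today; field names match the patch's intent, but details may
differ):

    state->sortcontext = AllocSetContextCreate(CurrentMemoryContext,
                                               "TupleSort",
                                               ALLOCSET_DEFAULT_MINSIZE,
                                               ALLOCSET_DEFAULT_INITSIZE,
                                               ALLOCSET_DEFAULT_MAXSIZE);
    /* child context holding only copies of caller's tuples */
    state->tuplecontext = AllocSetContextCreate(state->sortcontext,
                                                "Caller tuples",
                                                ALLOCSET_DEFAULT_MINSIZE,
                                                ALLOCSET_DEFAULT_INITSIZE,
                                                ALLOCSET_DEFAULT_MAXSIZE);

    /* ... */

    /* safe whenever no caller tuples remain (memtupcount == 0): */
    MemoryContextReset(state->tuplecontext);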

> - If I understand correctly, the point of the MemoryContextReset call
> is: there wouldn't be any tuples in memory at that point anyway.  But
> the OS-allocated chunks might be divided up into a bunch of small
> chunks that then got stuck on freelists, and those chunks might not be
> the right size for the next merge pass.  Resetting the context avoids
> that problem by blowing up the freelists.  Right?

Your summary of the practical benefit is accurate. While I've
emphasized regressions at the low-end with this latest revision, it's
also true that resetting helps in memory rich environments, when we
switch from retail palloc() calls to the final merge step's batch
allocation, which palloc() seemed to do very badly with. It makes
sense that this abrupt change in the pattern of allocations could
cause significant heap memory fragmentation.

> Clever.

Thanks.

Introducing a separate memory context that is strictly used for caller
tuples makes it clear and obvious that it's okay to call
MemoryContextReset() when state->memtupcount == 0. It's not okay to
put anything in the new context that could break the calls to
MemoryContextReset().

You might not have noticed that a second MemoryContextReset() call
appears in the quicksort patch, which helps a bit too. I couldn't
easily make that work with the replacement selection heap, because
master's tuplesort.c never fully empties its RS heap until the last
run. I can only perform the first call to MemoryContextReset() in the

Re: [HACKERS] Using quicksort for every external sort run

2016-03-19 Thread Peter Geoghegan
On Wed, Mar 16, 2016 at 6:42 PM, Peter Geoghegan  wrote:
>> - I think that batchmemtuples() is somewhat weird.  Normally,
>> grow_memtuples() doubles the size of the array each time it's called.
>> So if you somehow called this function when you still had lots of
>> memory available, it would just double the size of the array.
>> However, I think the expectation is that it's only going to be called
>> when availMem is less than half of allowedMem, in which case we're
>> going to get the special "last increment of memtupsize" behavior,
>> where we expand the memtuples array by some multiple between 1.0 and
>> 2.0 based on allowedMem/memNowUsed.
>
> That's right. It might be possible for the simple doubling behavior to
> happen under artificial conditions instead, for example when we have
> enormous individual tuples, but if that does happen it's still
> correct. I just didn't think it was worth worrying about giving back
> more memory in such extreme edge-cases.

Come to think of it, maybe the pass-by-value datum sort case should
also call batchmemtuples() too (or something similar). If you look at
how beginmerge() is called, you'll see that that doesn't happen.

Obviously this case is not entitled to a "memtupsize *
STANDARDCHUNKHEADERSIZE" refund, since of course there never was any
overhead like that at any point. And, obviously this case has no need
for batch memory at all. However, it is entitled to a refund for
unused tapes (tapes that were accounted for but, it turns out, never
allocated). It should then get the benefit of that refund by way of
growing memtuples through a similar "final, honestly, I really mean it
this time" call to grow_memtuples().

So, while the "memtupsize * STANDARDCHUNKHEADERSIZE refund" part
should still be batch-specific (i.e. used for the complement of
tuplesort cases, never the datum pass-by-val case), the new
grow_memtuples() thing should always happen with external sorts.
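
Something like this hypothetical fragment, in beginmerge() terms,
reusing identifiers that already appear in the patch (a sketch, not
tested code):

    if (!state->tuples)
    {
        /* refund accounting for tapes never actually allocated */
        FREEMEM(state, (state->maxTapes - activeTapes) *
                TAPE_BUFFER_OVERHEAD);
        /* a final "I really mean it this time" growth attempt */
        (void) grow_memtuples(state);
    }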

The more I think about it, the more I wonder if we should commit
something like the debugging patch 0004-* (enabled only when
trace_sort = on, of course). Close scrutiny of what tuplesort.c is
doing with memory is important.

-- 
Peter Geoghegan




Re: [HACKERS] Using quicksort for every external sort run

2016-03-18 Thread Robert Haas
On Wed, Mar 16, 2016 at 9:42 PM, Peter Geoghegan  wrote:
>> - I haven't yet figured out why we use batching only for the final
>> on-the-fly merge pass, instead of doing it for all merges.  I expect
>> you have a reason.  I just don't know what it is.
>
> The most obvious reason, and possibly the only reason, is that I have
> license to lock down memory accounting in the final on-the-fly merge
> phase. Almost equi-sized runs are the norm, and code like this is no
> longer obligated to work:
>
> FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
>
> That's why I explicitly give up on "conventional accounting". USEMEM()
> and FREEMEM() calls become unnecessary for this case that is well
> locked down. Oh, and I know that I won't use most tapes, so I can give
> myself a FREEMEM() refund before doing the new grow_memtuples() thing.
>
> I want to make batch memory usable for runs, too. I haven't done that
> either for similar reasons. FWIW, I see no great reason to worry about
> non-final merges.

Fair enough.  My concern was mostly whether the code would become
simpler if we always did this when merging, instead of only on the
final merge.  But the final merge seems to be special in quite a few
respects, so maybe not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Using quicksort for every external sort run

2016-03-18 Thread Peter Geoghegan
On Thu, Mar 17, 2016 at 1:13 PM, Robert Haas  wrote:
> OK, I have now committed 0001, and separately, some comment
> improvements - or at least, I think they are improvements - based on
> this discussion.

Thanks!

Your changes look good to me. It's always interesting to learn what
wasn't so obvious to you when you review my patches. It's probably
impossible to stare at something like tuplesort.c for as long as I
have and get that balance just right.

-- 
Peter Geoghegan




Re: [HACKERS] Using quicksort for every external sort run

2016-03-10 Thread Peter Geoghegan
On Sun, Feb 14, 2016 at 8:01 PM, Peter Geoghegan  wrote:
> The query I'm testing is: "reindex index pgbench_accounts_pkey;"
>
> Now, with a maintenance_work_mem of 5MB, the most recent revision of
> my patch takes about 54.2 seconds to complete this, as compared to
> master's 44.4 seconds. So, clearly a noticeable regression there of
> just under 20%. I did not see a regression with a 5MB
> maintenance_work_mem when pgbench scale was 100, though.

I've fixed this regression, and possibly all regressions where
workMem > 4MB. I've done so without resorting to making the heap
structure more complicated, or using a heap more often than when
replacement_sort_mem is exceeded by work_mem or maintenance_work_mem
(so replacement_sort_mem becomes something a bit different to what we
discussed, Robert -- more on that later). This seems like an
"everybody wins" situation, because in this revision the patch series
is now appreciably *faster* where the amount of memory available is
only a tiny fraction of the total input size.

Jeff Janes deserves a lot of credit for helping me to figure out how
to do this. I couldn't get over his complaint about the regression he
saw a few months back. He spoke of an "anti-sweetspot" in polyphase
merge, and how he suspected that to be the real culprit (after all,
most of his time was spent merging, with or without the patch
applied). He also said that reverting the memory batch/pool patch made
things go a bit faster, somewhat ameliorating his regression (when
just the quicksort patch was applied). This made no sense to me, since
I understood the memory batching patch to be orthogonal to the
quicksort thing, capable of being applied independently, and more or
less a strict improvement on master, no matter what the variables of
the sort are. Jeff's regressed case especially made no sense to me
(and, I gather, to him) given that the regression involved no
correlation, and so clearly wasn't reliant on generating far
fewer/longer runs than the patch (that's the issue we've discussed
more than any other now -- it's a red herring, it seems). As I
suspected out loud on February 14th, replacement selection mostly just
*masked* the real problem: the problem of palloc() fragmentation.
There doesn't seem to be much of an issue with the scheduling of
polyphase merging, once you fix palloc() fragmentation. I've created a
new revision, incorporating this new insight.

New Revision
============

Attached revision of patch series:

1. Creates a separate memory context for tuplesort's copies of
caller's tuples, which can be reset at key points, avoiding
fragmentation. Every SortTuple.tuple is allocated there (with trivial
exception); *everything else*, including the memtuples array, is
allocated in the existing tuplesort context, which becomes the parent
of this new "caller's tuples" context. Roughly speaking, that means
that about half of total memory for the sort is managed by each
context in common cases. Even with a high work_mem memory budget,
memory fragmentation could previously get so bad that tuplesort would
in effect claim a share of memory from the OS that is *significantly*
higher than the work_mem budget allotted to its sort. And with low
work_mem settings, fragmentation previously made palloc() thrash the
sort, especially during non-final merging. In this latest revision,
tuplesort now almost gets to use 100% of the memory that was requested
from the OS by palloc() in the cases tested.

2. Loses the "quicksort with spillover" case entirely, making the
quicksort patch significantly simpler. A *lot* of code was thrown out.

This change is especially significant because it allowed me to remove
the cost model that Robert took issue with so vocally. "Quicksort with
spillover" was always far less important than the basic idea of using
quicksort for external sorts, so I'm not sad to see it go. And, I
thought that the cost model was pretty bad myself.

3. Fixes cost_sort(), making optimizer account for the fact that runs
are now about sort_mem-sized, not (sort_mem * 2)-sized.

While I was at it, I made cost_sort() more optimistic about the amount
of random I/O required relative to sequential I/O. This additional
change to cost_sort() was probably overdue.
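
To illustrate the difference (this fragment mirrors cost_sort()'s
local variables, but it is not the actual code):

    double  sort_mem_bytes = sort_mem * 1024.0;   /* sort_mem is in KB */

    /* master, replacement selection: runs average about 2x memory */
    double  nruns_before = input_bytes / (2.0 * sort_mem_bytes);

    /* patched, quicksorted runs: runs are about memory-sized */
    double  nruns_after = input_bytes / sort_mem_bytes;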

4. Restores the ability of replacement selection to generate one run
and avoid any merging (previously, only one really long run and one
short run was possible, because at the time I conceptualized
replacement selection as being all about enabling "quicksort with
spillover", which quicksorted that second run in memory). This
only-one-run case is the case that Robert particularly cared about,
and it's fully restored when RS is in use (which can still only happen
for the first run, just never for the benefit of the now-axed
"quicksort with spillover" case).

5. Adds a new GUC, "replacement_sort_mem". The default setting is
16MB. Docs are included. If work_mem/maintenance_work_mem is less than
or equal to this, the first (and 

Re: [HACKERS] Using quicksort for every external sort run

2016-03-10 Thread Peter Geoghegan
On Thu, Mar 10, 2016 at 10:39 AM, Greg Stark  wrote:
> I want to rerun these on a dedicated machine and with trace_sort
> enabled so that we can see how many merge passes were actually
> happening and how much I/O was actually happening.

Putting the results in context, by keeping trace_sort output with the
results is definitely a good idea here. Otherwise, it's almost
impossible to determine what happened after the fact. I have had
"trace_sort = on" in my dev postgresql.conf for some time now. :-)

When I produce my next revision, we should focus on regressions at the
low end, like the 4MB work_mem for multiple GB table size cases you
show here. So, I ask that any benchmarks that you or Tomas do look at
that first and foremost. It's clear that in high memory environments
the patch significantly improves performance, often by as much as
2.5x, and so that isn't really a concern anymore. I think we may be
able to comprehensively address Robert's concerns about regressions
with very little work_mem and lots of data by fixing a problem with
polyphase merge. More to come soon.

-- 
Peter Geoghegan




Re: [HACKERS] Using quicksort for every external sort run

2016-03-10 Thread Greg Stark
On Thu, Mar 10, 2016 at 1:40 PM, Tomas Vondra  wrote:
> I was thinking about running some benchmarks on this patch, but the
> thread is pretty huge so I want to make sure I'm not missing something
> and this is indeed the most recent version.

I also ran some preliminary benchmarks just before FOSDEM and intend to get
back to it after running different benchmarks. These are preliminary
because it was only a single run and on a machine that wasn't dedicated for
benchmarks. These were comparing the quicksort-all-runs patch against HEAD
at the time without the memory management optimizations which I think are
independent of the sort algorithm.

It looks to me like the interesting space to test is on fairly small
work_mem compared to the data size. There's a general slowdown on 4MB-8MB
work_mem when the data set is more than a gigabyte, but even in the
worst case it's only a 30% slowdown and the speedup in the more realistic
scenarios looks at least as big.




I want to rerun these on a dedicated machine and with trace_sort enabled so
that we can see how many merge passes were actually happening and how much
I/O was actually happening.

-- 
greg


Re: [HACKERS] Using quicksort for every external sort run

2016-03-10 Thread Peter Geoghegan
On Thu, Mar 10, 2016 at 5:40 AM, Tomas Vondra  wrote:
> I was thinking about running some benchmarks on this patch, but the
> thread is pretty huge so I want to make sure I'm not missing something
> and this is indeed the most recent version.

Wait 24 - 48 hours, please. Big update coming.


-- 
Peter Geoghegan




Re: [HACKERS] Using quicksort for every external sort run

2016-03-10 Thread Tomas Vondra
Hi,

On Mon, 2015-12-28 at 15:03 -0800, Peter Geoghegan wrote:
> On Fri, Dec 18, 2015 at 11:57 AM, Peter Geoghegan  wrote:
> > BTW, I'm not necessarily determined to make the new special-purpose
> > allocator work exactly as proposed. It seemed useful to prioritize
> > simplicity, and currently so there is one big "huge palloc()" with
> > which we blow our memory budget, and that's it. However, I could
> > probably be more clever about "freeing ranges" initially preserved for
> > a now-exhausted tape. That kind of thing.
> 
> Attached is a revision that significantly overhauls the memory patch,
> with several smaller changes.

I was thinking about running some benchmarks on this patch, but the
thread is pretty huge so I want to make sure I'm not missing something
and this is indeed the most recent version.

Is that the case?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [HACKERS] Using quicksort for every external sort run

2016-03-07 Thread Peter Geoghegan
On Mon, Feb 15, 2016 at 3:45 PM, Greg Stark  wrote:
> I was thinking about this over the past couple weeks. I'm starting to
> think the quicksort runs gives at least the beginnings of a way
> forward on this front.

As I've already pointed out several times, I wrote a tool that makes
it easy to load sortbenchmark.org data into a PostgreSQL table:

https://github.com/petergeoghegan/gensort

(You should use the Python script that invokes the "gensort" utility
-- see its "--help" display for details).

This seems useful as a standard benchmark, since it's perfectly
deterministic, allowing the user to create arbitrarily large tables to
use for sort benchmarks. Still, it doesn't produce data that is any
way organic; sort data is uniformly distributed. Also, it produces a
table that really only has one attribute to sort on, a text attribute.

I suggest looking at real world data, too. I have downloaded UK land
registry data, which is a freely available dataset about property
sales in the UK since the 1990s, of which there have apparently been
about 20 million (I started with a 20 million line CSV file). I've
used COPY to load the data into one PostgreSQL table.

I attach instructions on how to recreate this, and some suggested
CREATE INDEX statements that seemed representative to me. There are a
variety of Postgres data types in use, including UUID, numeric, and
text. The final Postgres table is just under 3GB. I will privately
make available a URL that those CC'd here can use to download a custom
format dump of the table, which comes in at 1.1GB (ask me off-list if
you'd like to get that URL, but weren't CC'd here). This URL is
provided as a convenience for reviewers, who can skip my detailed
instructions.

An expensive rollup() query on the "land_registry_price_paid_uk" table
is interesting. Example:

select date_trunc('year', transfer_date), county, district, city,
sum(price) from land_registry_price_paid_uk group by rollup (1,
county, district, city);

Performance is within ~5% of an *internal* sort with the patch series
applied, even though ~80% of time is spent copying and sorting
SortTuples overall in the internal sort case (the internal case cannot
overlap sorting and aggregate processing, since it has no final merge
step). This is a nice demonstration of how this work has significantly
blurred the line between internal and external sorts.

-- 
Peter Geoghegan
Instructions
============

CSV File
--------

The land registry file from http://data.gov.uk is 3.2GB: a CSV file
with organic data that can be loaded into PostgreSQL. No registration
is required. See https://theodi.org/blog/the-status-of-csvs-on-datagovuk
for details on downloading the file pp-complete.csv.

SQL
---

begin;
create table land_registry_price_paid_uk(
  transaction uuid,
  price numeric,
  transfer_date date,
  postcode text,
  property_type char(1),
  newly_built boolean,
  duration char(1),
  paon text,
  saon text,
  street text,
  locality text,
  city text,
  district text,
  county text,
  ppd_category_type char(1));

copy land_registry_price_paid_uk FROM '/home/pg/Downloads/pp-complete.csv' with 
(format csv, freeze true, encoding 'win1252', header false, null '', quote '"', 
force_null (postcode, saon, paon, street, locality, city, district));
commit;

Resulting table
---------------

postgres=# \dt+
                              List of relations
 Schema │            Name             │ Type  │ Owner │  Size   │ Description
────────┼─────────────────────────────┼───────┼───────┼─────────┼─────────────
 public │ land_registry_price_paid_uk │ table │ pg    │ 2779 MB │
(1 row)

Interesting Indexes
===================

Many-attribute index (low-cardinality leading attributes):

postgres=# create index on land_registry_price_paid_uk (county,
district, city, locality, street);

UUID pk index (UUID type, high cardinality):

postgres=# create index on land_registry_price_paid_uk (transaction);

Price index (numeric, moderate cardinality):

postgres=# create index on land_registry_price_paid_uk (price);

Preview
=======

pg@hamster:~$ head ~/Downloads/pp-complete.csv
"{0C7ADEF5-878D-4066-B785-003ED74A}","163000","2003-02-21 00:00","UB5 
4PJ","T","N","F","106","","READING 
ROAD","NORTHOLT","NORTHOLT","EALING","GREATER LONDON","A"
"{35F67271-ABD4-40DA-AB09-0085B9D3}","247500","2005-07-15 00:00","TA19 
9DD","D","N","F","58","","ADAMS MEADOW","ILMINSTER","ILMINSTER","SOUTH 
SOMERSET","SOMERSET","A"
"{B20B1C74-E8E1-4137-AB3E-011DF342}","32","2010-09-10 00:00","W4 
1DZ","F","N","L","58","","WHELLOCK ROAD","","LONDON","EALING","GREATER 
LONDON","A"
"{7D6B0915-C56B-4275-AF9B-0156BCE7}","104000","1997-08-27 00:00","NE61 
2BH","D","N","F","17","","WESTGATE","MORPETH","MORPETH","CASTLE 
MORPETH","NORTHUMBERLAND","A"

Re: [HACKERS] Using quicksort for every external sort run

2016-02-15 Thread Greg Stark
On Mon, Feb 15, 2016 at 8:43 PM, Jim Nasby  wrote:
> On 2/7/16 8:57 PM, Peter Geoghegan wrote:
>>
>> It seems less than ideal that DBAs have to be so
>> conservative in sizing work_mem.
>
>
> +10


I was thinking about this over the past couple weeks. I'm starting to
think the quicksort runs gives at least the beginnings of a way
forward on this front. Keeping in mind that we know how many tapes we
can buffer in memory and the number is likely to be relatively large
-- on the order of 100+ is typical, what if we do something like the
following rough sketch:

Give users two knobs, a lower bound "sort in memory using quicksort"
memory size and an upper bound "absolutely never use more than this"
which they can set to a substantial fraction of physical memory. Then
when we overflow the lower bound we start generating runs, the first
one being of that length. Each run we generate we double (or increase
by 50% or something) until we hit the maximum. That means that the
first few runs may be shorter than necessary but we have enough tapes
available that that doesn't hurt much and we'll eventually get to a
large enough run size that we won't run out of tapes and can still do
a single final (on the fly) merge.

In fact what's really attractive about this idea is that it might give
us a reasonable spot to do some global system resource management.
Each time we want to increase the run size we check some shared memory
counter of how much memory is in use and refuse to increase if there's
too much in use (or if we're using too large a fraction of it or some
other heuristic). The key point is that since we don't need to decide
up front at the beginning of the sort and we don't need to track it
continuously there is neither too little nor too much contention on
this shared memory variable. Also the behaviour would be not too
chaotic if there's a user-tunable minimum and the other activity in
the system only controls how much more memory it can steal from the
global pool on top of that.
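
In code, the rough sketch might look like this (standalone, with
entirely hypothetical names and stubs standing in for the real
machinery):

#include <stdbool.h>

extern bool have_more_input(void);
extern bool global_pool_allows(long bytes);     /* shared counter check */
extern void build_run_with_quicksort(long bytes);

void
build_runs(long lower_bound, long upper_bound)
{
    long    run_size = lower_bound;     /* "sort in memory" knob */

    while (have_more_input())
    {
        build_run_with_quicksort(run_size);

        /* double the next run's budget if the global pool permits */
        if (run_size < upper_bound && global_pool_allows(run_size * 2))
        {
            run_size *= 2;
            if (run_size > upper_bound)
                run_size = upper_bound; /* "absolutely never more" */
        }
    }
}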

-- 
greg




Re: [HACKERS] Using quicksort for every external sort run

2016-02-15 Thread Jim Nasby

On 2/7/16 8:57 PM, Peter Geoghegan wrote:

It seems less than ideal that DBAs have to be so
conservative in sizing work_mem.


+10
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Using quicksort for every external sort run

2016-02-14 Thread Peter Geoghegan
On Sun, Feb 7, 2016 at 4:50 PM, Peter Geoghegan  wrote:
>> I'm not even sure this is necessary. The idea of missing out on
>> producing a single sorted run sounds bad but in practice since we
>> normally do the final merge on the fly there doesn't seem like there's
>> really any difference between reading one tape or reading two or three
>> tapes when outputting the final results. There will be the same amount
>> of I/O happening and a 2-way or 3-way merge for most data types should
>> be basically free.
>
> I basically agree with you, but it seems possible to fix the
> regression (generally misguided though those regressed cases are).
> It's probably easiest to just fix it.

Here is a benchmark on my laptop:

$ pgbench -i -s 500 --unlogged

This results in a ~1GB accounts PK:

postgres=# \di+ pgbench_accounts_pkey
List of relations
─[ RECORD 1 ]──
Schema  │ public
Name│ pgbench_accounts_pkey
Type│ index
Owner   │ pg
Table   │ pgbench_accounts
Size│ 1071 MB
Description │

The query I'm testing is: "reindex index pgbench_accounts_pkey;"

Now, with a maintenance_work_mem of 5MB, the most recent revision of
my patch takes about 54.2 seconds to complete this, as compared to
master's 44.4 seconds. So, clearly a noticeable regression there of
just under 20%. I did not see a regression with a 5MB
maintenance_work_mem when pgbench scale was 100, though. And, with the
default maintenance_work_mem of 64MB, it's a totally different story
-- my patch takes about 28.3 seconds, whereas master takes 48.5
seconds (i.e. longer than with 5MB). My patch needs a 56-way final
merge with the 64MB maintenance_work_mem case, and 47 distinct merge
steps, plus a final on-the-fly merge for the 5MB maintenance_work_mem
case. So, a huge amount of merging, but RS still hardly pays for
itself. With the regressed case for my patch, we finish sorting *runs*
about 15 seconds into a 54.2 second operation -- very early. So it
isn't "quicksort vs replacement selection", so much as "polyphase
merge vs replacement selection". There is a good reason to think that
we can make progress on fixing that regression by doubling down on the
general strategy of improving cache characteristics, and being
cleverer about memory use during non-final merging, too.

I looked at what it would take to make the heap a smaller part of
memtuples, along the lines Robert and I talked about, and I think it's
non-trivial because it needs to make the top of the heap something
other than memtuples[0]. I'd need to change the heap code, which
already has 3 reasons for existing (RS, merging, and top-N heap). I'll
find it really hard to justify the effort, and especially the risk of
adding bugs, for a benefit that there is *scant* evidence for. My
guess is that the easiest, and most sensible way to fix the ~20%
regression seen here is to introduce batch memory allocation to
non-final merge steps, which is where most time was spent. (For
simplicity, that currently only happens during the final merge phase,
but I could revisit that -- seems not that hard).
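
For reference, the snag is that the heap routines hard-code the root
at index zero; putting the heap anywhere else would mean offset index
arithmetic like the following everywhere (illustrative only, not
proposed code):

/*
 * Heap navigation if the heap began at memtuples[base] rather than
 * memtuples[0], which is what tuplesort_heap_siftup() assumes.
 */
static inline int
heap_left_child(int base, int i)
{
    return base + 2 * (i - base) + 1;
}

static inline int
heap_parent(int base, int i)
{
    return base + (i - base - 1) / 2;
}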

Now, I accept that the cost model has to go. So, what I think would be
best is if we still added a GUC, like the replacement_sort_mem
suggestion that Robert made. This would be a threshold for using what
is currently called "quicksort with spillover". There'd be no cost
model. Jeff Janes also suggested something like this.

The only regression that I find concerning is the one reported by Jeff
Janes [1]. That didn't even involve a correlation, though, so no
reason to think that it would be at all helped by what Robert and I
talked about. It seemed like the patch happened to have the effect of
tickling a pre-existing problem with polyphase merge -- what Jeff
called an "anti-sweetspot". Jeff had a plausible theory for why that
is.

So, what if we try to fix polyphase merge? That would be easier. We
could look at the tape buffer size, and the number of tapes, as well
as memory access patterns. We might even make more fundamental changes
to polyphase merge, since we don't use the more advanced variants that
Knuth describes (I still think correlation is a red herring here).
Knuth suggests that his algorithm 5.4.3, cascade merge, has more
efficient distribution of runs.

The bottom line is that there will always be some regression
somewhere. I'm not sure what the guiding principle for when that
becomes unacceptable is, but you did seem sympathetic to the idea that
really low work_mem settings (e.g. 4MB) with really big inputs were
not too concerning [2]. I'm emphasizing Jeff's case now because I,
like you [2], am much more worried about maintenance_work_mem default
cases with regressions than anything else, and that *was* such a case.

Like Jeff Janes, I don't care about his other regression of about 5%
[3], which involved a 4MB work_mem + 100 million tuples. The other
case (the one I do care about) was 64MB +  400 million tuples, and was
a much bigger regression, which is suggestive of 

Re: [HACKERS] Using quicksort for every external sort run

2016-02-07 Thread Peter Geoghegan
On Fri, Feb 5, 2016 at 9:31 AM, Robert Haas  wrote:
> Peter, please weigh in and let me know if I've gotten anything
> incorrect here or if you think of other concerns afterwards.

Right. Let me give you the executive summary first: I continue to
believe, following thinking about the matter in detail, that this is a
sensible compromise, that weighs everyone's concerns. It is pretty
close to a win-win. I just need you to confirm what I say here in
turn, so we're sure that we understand each other perfectly.

> The basic idea is that we will add a new GUC with a name like
> replacement_sort_mem that will have a default value in the range of
> 20-30MB; or possibly we will hardcode this value, but for purposes of
> this email I'm going to assume it's a GUC.  If the value of work_mem
> or maintenance_work_mem, whichever applies, is smaller than the value
> of replacement_sort_mem, then the latter has no effect.

By "no effect", you must mean that we always use a heap for the entire
first run (albeit for the tail, with a hybrid quicksort/heap
approach), but still use quicksort for every subsequent run, when it's
clearly established that we aren't going to get one huge run. Is that
correct?

It was my understanding, based on your emphasis on producing only a
single run, as well as your recent remarks on this thread about the
first run being special, that you are really only interested in the
presorted case, where one run is produced. That is, you are basically
not interested in preserving the general ability of replacement
selection to double run size in the event of a uniform distribution.
(That particular doubling property of replacement selection is now
technically lost by virtue of using this new hybrid model *anyway*,
although it will still make runs longer in general).

You don't want to change the behavior of the current patch for the
second or subsequent run; that should remain a quicksort, pure and
simple. Do I have that right?

BTW, parallel sort should probably never use a heap anyway (ISTM that
that will almost certainly be based on external sorts in the end). A
heap is not really compatible with the parallel heap scan model.

> One thing I just thought of (after the call) is that it might be
> better for this GUC to be in units of tuples rather than in units of
> memory; it's not clear to me why the optimal heap size should be
> dependent on the tuple size, so we could have a threshold like 300,000
> tuples or whatever.

I think you're right that a number of tuples is the logical way to
express the heap size (as a GUC unit). I think that the ideal setting
for the GUC is large enough to recognize significant correlations in
input data, which may be clustered, but no larger (at least while
things don't all fit in L1 cache, or maybe L2 cache). We should "go
for broke" with replacement selection -- we don't aim for anything
less than ending up with 1 run by using the heap (merging 2 or 3 runs
rather than 4 or 6 is far less useful, maybe harmful, when one of them
is much larger). Therefore, I don't expect that we'll be practically
disadvantaged by having fewer "hands to juggle" tuples here (we'll
simply almost always have enough in practice -- more on that later).
FWIW I don't think that any benchmark we've seen so far justifies
doing less than "going for broke" with RS, even if you happen to have
a very conservative perspective.

One advantage of a GUC is that you can set it to zero, and always get
a simple hybrid sort-merge strategy if that's desirable. I think that
it might not matter much with multi-gigabyte work_mem settings anyway,
though; you'll just see a small blip. Big (maintenance_)work_mem was
by far my greatest concern in relation to using a heap in general, so
I'm left pretty happy by this plan, I think. Lots of people can afford
a multi-GB maintenance_work_mem these days, and CREATE INDEX is gonna
be the most important case overall, by far.

> 2. If (maintenance_)work_mem fills up completely, we will quicksort
> all of the data we have in memory.  We will then regard the tail end
> of that sorted data, in an amount governed by replacement_sort_mem, as
> a heap, and use it to perform replacement selection until no tuples
> remain for the current run.  Meanwhile, the rest of the sorted data
> remains in memory untouched.  Logically, we're constructing a run of
> tuples which is split between memory and disk: the head of the run
> (what fits in all of (maintenance_)work_mem except for
> replacement_sort_mem) is in memory, and the tail of the run is on
> disk.

I went back and forth on this during our call, but I now think that I
was right that there will need to be changes in order to make the tail
of the run a heap (*not* the quicksorted head), because routines like
tuplesort_heap_siftup() assume that state->memtuples[0] is the head of
the heap. This is currently assumed by the master branch for both the
currentRun/nextRun replacement selection heap, as well as the 

Re: [HACKERS] Using quicksort for every external sort run

2016-02-07 Thread Greg Stark
On Sun, Feb 7, 2016 at 8:21 PM, Robert Haas  wrote:
> On Sun, Feb 7, 2016 at 11:00 AM, Peter Geoghegan  wrote:
> > It was my understanding, based on your emphasis on producing only a
> > single run, as well as your recent remarks on this thread about the
> > first run being special, that you are really only interested in the
> > presorted case, where one run is produced. That is, you are basically
> > not interested in preserving the general ability of replacement
> > selection to double run size in the event of a uniform distribution.
>...
> > You don't want to change the behavior of the current patch for the
> > second or subsequent run; that should remain a quicksort, pure and
> > simple. Do I have that right?
>
> Yes.

I'm not even sure this is necessary. The idea of missing out on
producing a single sorted run sounds bad but in practice since we
normally do the final merge on the fly there doesn't seem like there's
really any difference between reading one tape or reading two or three
tapes when outputting the final results. There will be the same amount
of I/O happening and a 2-way or 3-way merge for most data types should
be basically free.



On Sun, Feb 7, 2016 at 8:21 PM, Robert Haas  wrote:
> 3. At this point, we have one sorted tape per worker.  Perform a final
> merge pass to get the final result.

I don't even think you have to merge until you get one tape per
worker. You can statically decide how many tapes you can buffer in
memory based on work_mem and merge until you get N/workers tapes so
that a single merge in the gather node suffices. I would expect that
to nearly always mean the workers are only responsible for generating
the initial sorted runs and the single merge pass is done in the
gather node on the fly as the tuples are read.

-- 
greg




Re: [HACKERS] Using quicksort for every external sort run

2016-02-07 Thread Robert Haas
On Sun, Feb 7, 2016 at 11:00 AM, Peter Geoghegan  wrote:
> Right. Let me give you the executive summary first: I continue to
> believe, following thinking about the matter in detail, that this is a
> sensible compromise, that weighs everyone's concerns. It is pretty
> close to a win-win. I just need you to confirm what I say here in
> turn, so we're sure that we understand each other perfectly.

Makes sense to me.

>> The basic idea is that we will add a new GUC with a name like
>> replacement_sort_mem that will have a default value in the range of
>> 20-30MB; or possibly we will hardcode this value, but for purposes of
>> this email I'm going to assume it's a GUC.  If the value of work_mem
>> or maintenance_work_mem, whichever applies, is smaller than the value
>> of replacement_sort_mem, then the latter has no effect.
>
> By "no effect", you must mean that we always use a heap for the entire
> first run (albeit for the tail, with a hybrid quicksort/heap
> approach), but still use quicksort for every subsequent run, when it's
> clearly established that we aren't going to get one huge run. Is that
> correct?

Yes.

> It was my understanding, based on your emphasis on producing only a
> single run, as well as your recent remarks on this thread about the
> first run being special, that you are really only interested in the
> presorted case, where one run is produced. That is, you are basically
> not interested in preserving the general ability of replacement
> selection to double run size in the event of a uniform distribution.
> (That particular doubling property of replacement selection is now
> technically lost by virtue of using this new hybrid model *anyway*,
> although it will still make runs longer in general).
>
> You don't want to change the behavior of the current patch for the
> second or subsequent run; that should remain a quicksort, pure and
> simple. Do I have that right?

Yes.

> BTW, parallel sort should probably never use a heap anyway (ISTM that
> that will almost certainly be based on external sorts in the end). A
> heap is not really compatible with the parallel heap scan model.

I don't think I agree with this part, though I think it's unimportant
as far as the current patch is concerned.  My initial thought is that
parallel sort should work like this:

1. Each worker reads and sorts its input tuples just as it would in
non-parallel mode.

2. If, at the conclusion of the sort, the input tuples are still in
memory (quicksort) or partially in memory (quicksort with spillover),
then write them all to a tape.  If they are on multiple tapes, merge
those to a single tape.  If they are on a single tape, do nothing else
at this step.

3. At this point, we have one sorted tape per worker.  Perform a final
merge pass to get the final result.
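
In outline (hypothetical names; this is only a sketch of the scheme
above, not an actual patch):

#include <stdbool.h>

extern void read_and_sort_input(void);
extern bool result_still_in_memory(void);       /* fully or partially */
extern bool result_on_multiple_tapes(void);
extern void dump_to_single_tape(void);
extern void merge_to_single_tape(void);

/* Steps 1 and 2, run in each worker: */
void
parallel_sort_worker(void)
{
    read_and_sort_input();          /* step 1: as in non-parallel mode */

    if (result_still_in_memory())
        dump_to_single_tape();      /* step 2: guarantee one tape */
    else if (result_on_multiple_tapes())
        merge_to_single_tape();
    /* else: already exactly one tape; nothing more to do */
}

/* Step 3 then runs once in the leader: a final merge across the one
 * sorted tape that each worker produced. */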

The major disadvantage of this is that if the input hasn't been
relatively evenly partitioned across the workers, the work of sorting
will fall disproportionately on those that got more input.  We could,
in the future, make the logic more sophisticated.  For example, if
worker A is still reading the input and dumping sorted runs, worker B
could start merging those runs.  Or worker A could read tuples into a
DSM instead of backend-private memory, and worker B could then sort
them to produce a run.  While such optimizations are clearly
beneficial, I would not try to put them into a first parallel sort
patch.  It's too complicated.

>> One thing I just thought of (after the call) is that it might be
>> better for this GUC to be in units of tuples rather than in units of
>> memory; it's not clear to me why the optimal heap size should be
>> dependent on the tuple size, so we could have a threshold like 300,000
>> tuples or whatever.
>
> I think you're right that a number of tuples is the logical way to
> express the heap size (as a GUC unit). I think that the ideal setting
> for the GUC is large enough to recognize significant correlations in
> input data, which may be clustered, but no larger (at least while
> things don't all fit in L1 cache, or maybe L2 cache). We should "go
> for broke" with replacement selection -- we don't aim for anything
> less than ending up with 1 run by using the heap (merging 2 or 3 runs
> rather than 4 or 6 is far less useful, maybe harmful, when one of them
> is much larger). Therefore, I don't expect that we'll be practically
> disadvantaged by having fewer "hands to juggle" tuples here (we'll
> simply almost always have enough in practice -- more on that later).
> FWIW I don't think that any benchmark we've seen so far justifies
> doing less than "going for broke" with RS, even if you happen to have
> a very conservative perspective.
>
> One advantage of a GUC is that you can set it to zero, and always get
> a simple hybrid sort-merge strategy if that's desirable. I think that
> it might not matter much with multi-gigabyte work_mem settings anyway,
> though; you'll just see a small blip. Big 

Re: [HACKERS] Using quicksort for every external sort run

2016-02-07 Thread Peter Geoghegan
On Sun, Feb 7, 2016 at 4:50 PM, Peter Geoghegan  wrote:
>> I'm not even sure this is necessary. The idea of missing out on
>> producing a single sorted run sounds bad but in practice since we
>> normally do the final merge on the fly there doesn't seem like there's
>> really any difference between reading one tape or reading two or three
>> tapes when outputting the final results. There will be the same amount
>> of I/O happening and a 2-way or 3-way merge for most data types should
>> be basically free.
>
> I basically agree with you, but it seems possible to fix the
> regression (generally misguided though those regressed cases are).
> It's probably easiest to just fix it.

On a related note, we should probably come up with a way of totally
supplanting the work_mem model with something smarter in the next
couple of years. Something that treats memory as a shared resource
even when it's allocated privately, per-process. This external sort
stuff really smooths out the cost function of sorts. ISTM that that
makes the idea of dynamic memory budgets (in place of a one size fits
all work_mem) seem desirable for the first time. That said, I really
don't have a good sense of how to go about moving in that direction at
this point. It seems less than ideal that DBAs have to be so
conservative in sizing work_mem.

-- 
Peter Geoghegan




Re: [HACKERS] Using quicksort for every external sort run

2016-02-07 Thread Peter Geoghegan
On Sun, Feb 7, 2016 at 10:51 AM, Greg Stark  wrote:
>> > You don't want to change the behavior of the current patch for the
>> > second or subsequent run; that should remain a quicksort, pure and
>> > simple. Do I have that right?
>>
>> Yes.
>
> I'm not even sure this is necessary. The idea of missing out on
> producing a single sorted run sounds bad but in practice since we
> normally do the final merge on the fly there doesn't seem like there's
> really any difference between reading one tape or reading two or three
> tapes when outputting the final results. There will be the same amount
> of I/O happening and a 2-way or 3-way merge for most data types should
> be basically free.

I basically agree with you, but it seems possible to fix the
regression (generally misguided though those regressed cases are).
It's probably easiest to just fix it.

-- 
Peter Geoghegan




Re: [HACKERS] Using quicksort for every external sort run

2016-02-05 Thread Robert Haas
On Thu, Feb 4, 2016 at 6:14 AM, Peter Geoghegan  wrote:
> The economics of using 4MB or even 20MB to sort 10GB of data are
> already preposterously bad for everyone that runs a database server,
> no matter how budget conscious they may be. I can reluctantly accept
> that we need to still use a heap with very low work_mem settings to
> avoid the risk of a regression (in the event of a strong correlation)
> on general principle, but I'm well justified in proposing "just don't
> do that" as the best practical advice.
>
> I thought I had your agreement on that point, Robert; is that actually the 
> case?

Peter and I spent a few hours talking on Skype this morning about this
point and I believe we have agreed on an algorithm that I think will
address all of my concerns and hopefully also be acceptable to him.
Peter, please weigh in and let me know if I've gotten anything
incorrect here or if you think of other concerns afterwards.

The basic idea is that we will add a new GUC with a name like
replacement_sort_mem that will have a default value in the range of
20-30MB; or possibly we will hardcode this value, but for purposes of
this email I'm going to assume it's a GUC.  If the value of work_mem
or maintenance_work_mem, whichever applies, is smaller than the value
of replacement_sort_mem, then the latter has no effect.  However, if
replacement_sort_mem is the smaller value, then the amount of memory
that can be used for a heap with replacement selection is limited to
replacement_sort_mem: we can use more memory than that in total for
the sort, but the amount that can be used for a heap is restricted to
that value.  The way we do this is explained in more detail below.
One thing I just thought of (after the call) is that it might be
better for this GUC to be in units of tuples rather than in units of
memory; it's not clear to me why the optimal heap size should be
dependent on the tuple size, so we could have a threshold like 300,000
tuples or whatever.   But that's a secondary issue and I might be
wrong about it: the point is that in order to have a chance of
winning, a heap used for replacement selection needs to be not very
big at all by the standards of modern hardware, so the plan is to
limit it to a size at which it may have a chance.
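
In effect (an illustrative fragment; Min() as in PostgreSQL's c.h, and
the "maintenance" test is hypothetical):

    int     sort_mem = is_maintenance ? maintenance_work_mem : work_mem;
    int     heap_mem = Min(replacement_sort_mem, sort_mem);

    /* if sort_mem <= replacement_sort_mem, the new GUC has no effect */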

Here's how that will work, assuming Peter and I understand each other:

1. We start reading the input data.  If we reach the end of the input
data before (maintenance_)work_mem is exhausted, then we can simply
quicksort the data and we're done.  This is no different than what we
already do today.

2. If (maintenance_)work_mem fills up completely, we will quicksort
all of the data we have in memory.  We will then regard the tail end
of that sorted data, in an amount governed by replacement_sort_mem, as
a heap, and use it to perform replacement selection until no tuples
remain for the current run.  Meanwhile, the rest of the sorted data
remains in memory untouched.  Logically, we're constructing a run of
tuples which is split between memory and disk: the head of the run
(what fits in all of (maintenance_)work_mem except for
replacement_sort_mem) is in memory, and the tail of the run is on
disk.

3. If we reach the end of input before replacement selection runs out
of tuples for the current run, and if it finds no tuples for the next
run prior to that time, then we are done.  All of the tuples form a
single run and we can return the tuples in memory first followed by
the tuples on disk.  This case is highly likely to be a huge win over
what we have today, because (a) some portion of the tuples were sorted
via quicksort rather than heapsort and that's faster, (b) the tuples
that were sorted using a heap were sorted using a small heap rather
than a big one, and (c) we only wrote out the minimal number of tuples
to tape instead of, as we would have done today, all of them.

4. If we reach this step, then replacement selection with a small heap
wasn't able to sort the input in a single run.  We have a bunch of
sorted data in memory which is the head of the same run whose tail is
already on disk; we now spill all of these tuples to disk.  That
leaves only the heapified tuples in memory.  We just ignore the fact
that they are a heap and treat them as unsorted.  We repeatedly do the
following: read tuples until work_mem is full, sort them, and dump the
result to disk as a run.  When all runs have been created, we merge
runs just as we do today.
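
As a compressed sketch of steps 1 through 4 (every name here is a
hypothetical stand-in for real tuplesort machinery, and the units of
the replacement_sort_mem budget are still under discussion):

#include <stdbool.h>

extern bool read_until_mem_full_or_eof(void);   /* true if memory filled */
extern bool rs_run_covered_all_input(void);     /* step 3's condition */
extern void quicksort_memtuples(void);
extern void heapify_tail(int rs_budget);
extern void replacement_selection_until_run_ends(void);
extern void emit_memory_head_then_disk_tail(void);
extern void spill_head_then_quicksort_remaining_runs(void);
extern void merge_runs(void);

void
build_runs_hybrid(int rs_budget)
{
    if (!read_until_mem_full_or_eof())
    {
        quicksort_memtuples();      /* step 1: plain internal sort */
        return;
    }

    quicksort_memtuples();          /* step 2: sort all of memory... */
    heapify_tail(rs_budget);        /* ...then treat the tail as a heap */
    replacement_selection_until_run_ends();

    if (rs_run_covered_all_input())
    {
        emit_memory_head_then_disk_tail();  /* step 3: a single run */
        return;
    }

    spill_head_then_quicksort_remaining_runs(); /* step 4 */
    merge_runs();
}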

This algorithm seems very likely to beat what we do today in
practically all cases.  The benchmarking Peter and others have already
done shows that building runs with quicksort rather than replacement
selection can often win even if the larger number of tapes requires a
multi-pass merge.  The only cases where it didn't seem to be a clear
win involved data that was already in sorted order, or very close to
it.  But with this algorithm, presorted input is fine: we'll quicksort
some of it (which is faster than replacement 

Re: [HACKERS] Using quicksort for every external sort run

2016-02-04 Thread Peter Geoghegan
On Sat, Jan 30, 2016 at 5:29 AM, Robert Haas  wrote:
>> I meant use "quicksort with spillover" simply because an estimated
>> 90%+ of all tuples have already been consumed. Don't consider the
>> tuple width, etc.
>
> Hmm, it's a thought.

To be honest, it's a bit annoying that this is one issue we're stuck
on, because "quicksort with spillover" is clearly of less importance
overall. (This is a distinct issue from the issue of not using a
replacement selection style heap for the first run much of the time,
which seems to be a discussion about whether and to what extent the
*traditional* advantages of replacement selection hold today, as
opposed to a discussion about a very specific crossover point in my
patch.)

>> Really? Just try it with a heap that is not tiny. Performance tanks.
>> The fact that replacement selection can produce one long run then
>> becomes a liability, not a strength. With a work_mem of something like
>> 1GB, it's *extremely* painful.
>
> I'm not sure exactly what you think I should try.  I think a couple of
> people have expressed the concern that your patch might regress things
> on data that is all in order, but I'm not sure if you think I should
> try that case or some case that is not-quite-in-order.  "I don't see
> that you've justified that statement" is referring to the fact that
> you presented no evidence in your original post that it's important to
> sometimes use quicksorting even for run #1.  If you've provided some
> test data illustrating that point somewhere, I'd appreciate a pointer
> back to it.

I think that the answer to what you should try is simple: Any case
involving a large heap (say, a work_mem of 1GB). No other factor like
correlation seems to change the conclusion about that being generally
bad.

If you have a correlation, then that is *worse* if "quicksort with
spillover" always has us use a heap for the first run, because it
prolongs the pain of using the cache inefficient heap (note that this
is an observation about "quicksort with spillover" in particular, and
not replacement selection in general). The problem you'll see is that
there is a large heap which is __slow__ to spill from, and that's
pretty obvious with or without a correlation. In general it seems
unlikely that having one long run during the merge (i.e. no merge --
seen by having the heap build one long run because we got "lucky" and
"quicksort with spillover" encountered a correlation) can ever hope to
make up for this.

It *could* still make up for it if:

1. There isn't much to make up for in the first place, because the
heap is CPU cache resident. Testing this with a work_mem that is the
same size as CPU L3 cache seems a bit pointless to me, and I think
we've seen that a few times.

and:

2. There are many passes required without a replacement selection
heap, because the volume of data is just so much greater than the low
work_mem setting. Replacement selection makes the critical difference
because there is a correlation, perhaps strong enough to make it one
or two runs rather than, say, 10 or 20 or 100.

I've already mentioned many times that linear growth in the size of
work_mem sharply reduces the need for additional passes during the
merge phase (the observation about quadratic growth that I won't
repeat). These days, it's hard to recommend anything other than "use
more memory" to someone trying to use 4MB to sort 10GB of data. Yeah,
it would also be faster to use replacement selection for the first run
in the hope of getting lucky (actually lucky this time; no quotes),
but it's hard to imagine that that's going to be a better option, no
matter how frugal the user is. Helping users recognize when they could
use more memory effectively seems like the best strategy. That was the
idea behind multipass_warning, but you didn't like that (Greg Stark
was won over on multipass_warning, though). I hope we can
offer something roughly like that at some point (a view?), because it
makes sense.
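
To put rough numbers on the "use more memory" advice (a
back-of-the-envelope model of my own, borrowing tuplesort.c's
merge-order formula; treat the constants as assumptions):

#include <math.h>
#include <stdio.h>

/*
 * Quicksorted runs are roughly work_mem-sized, so sorting input_mb of
 * data yields about input_mb / work_mem_mb runs; a merge of order F
 * then needs about ceil(log_F(runs)) passes over the data.
 */
static int
merge_passes(double input_mb, double work_mem_mb, int merge_order)
{
    double  runs = ceil(input_mb / work_mem_mb);

    if (runs <= 1.0)
        return 0;               /* one run: no merging at all */
    return (int) ceil(log(runs) / log((double) merge_order));
}

int
main(void)
{
    /* 10GB of input: 4MB work_mem (merge order ~14) vs. 64MB (~233) */
    printf("4MB:  %d passes\n", merge_passes(10240, 4, 14));    /* 3 */
    printf("64MB: %d passes\n", merge_passes(10240, 64, 233));  /* 1 */
    return 0;
}

Growing work_mem shrinks the run count and raises the merge order at
the same time, which is why a 16x memory increase here takes a 10GB
sort from three passes over the data to one.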

> How about always starting with replacement selection, but limiting the
> amount of memory that can be used with replacement selection to some
> small value?  It could be a separate GUC, or a hard-coded constant
> like 20MB if we're fairly confident that the same value will be good
> for everyone.  If the tuples aren't in order, then we'll pretty
> quickly come to the end of the first run and switch to quicksort.

This seems acceptable, although note that we don't have to decide
until we reach the work_mem limit, and not before.

If you want to use a heap for the first run, I'm not excited about the
idea, but if you insist then I'm glad that you at least propose to
limit it to the kind of cases that we *actually* saw regressed (i.e.
low work_mem settings -- like the default work_mem setting, 4MB).
We've seen no actual case with a larger work_mem that is advantaged by
using a heap, even *with* a strong correlation (this is actually
*worst of all*); that's where I am 

Re: [HACKERS] Using quicksort for every external sort run

2016-02-04 Thread Peter Geoghegan
On Thu, Feb 4, 2016 at 1:46 AM, Peter Geoghegan  wrote:
> It wasn't my original insight that replacement selection has become
> all but obsolete. It took me a while to come around to that point of
> view.

Nyberg et al may have said it best in 1994, in the Alphasort Paper [1]:

"By comparison, OpenVMS sort uses a pure replacement-selection sort to
generate runs (Knuth, 1973). Replacement-selection is best for a
memory-constrained environment. On average, replacement-selection
generates runs that are twice as large as available memory, while the
QuickSort runs are typically less than half of available memory.
However, in a memory-rich environment, QuickSort is faster because it
is simpler, makes fewer exchanges on average, and has superior address
locality to exploit processor caching. "

(I believe that the authors state that "QuickSort runs are typically
less than half of available memory" because of the use of explicit
asynchronous I/O in each thread, which doesn't apply to us).

The paper also has very good analysis of the economics of sorting:

"Even for surprisingly large sorts, it is economical to perform the
sort in one pass."

Of course, memory capacities have scaled enormously in the 20 years
since this analysis was performed, so the analysis applies even at the
very low end these days. The high capacity memory system that they
advocate to get a one pass sort (instead of having faster disks) had
100MB of memory, which is of course tiny by contemporary standards. If
you pay Heroku $7 a month, you get a "Hobby Tier" database with 512MB
of memory. The smallest EC2 instance size, the t2.nano, costs about
$1.10 to run for one week, and has 0.5GB of memory.

The economics of using 4MB or even 20MB to sort 10GB of data are
already preposterously bad for everyone that runs a database server,
no matter how budget conscious they may be. I can reluctantly accept
that we need to still use a heap with very low work_mem settings to
avoid the risk of a regression (in the event of a strong correlation)
on general principle, but I'm well justified in proposing "just don't
do that" as the best practical advice.

I thought I had your agreement on that point, Robert; is that actually the case?

[1] http://www.cs.berkeley.edu/~rxin/db-papers/alphasort.pdf

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2016-01-30 Thread Robert Haas
On Sat, Jan 30, 2016 at 2:25 AM, Peter Geoghegan  wrote:
> On Fri, Jan 29, 2016 at 2:58 PM, Robert Haas  wrote:
>> I don't quite know what you mean by these numbers.  Add a generic,
>> conservative threshold to what?
>
> I meant use "quicksort with spillover" simply because an estimated
> 90%+ of all tuples have already been consumed. Don't consider the
> tuple width, etc.

Hmm, it's a thought.

>> Thinking about this some more, I really think we should think hard
>> about going back to the strategy which you proposed and discarded in
>> your original post: always generate the first run using replacement
>> selection, and every subsequent run by quicksorting.  In that post you
>> mention powerful advantages of this method: "even without a strong
>> logical/physical correlation, the algorithm tends to produce runs that
>> are about twice the size of work_mem. (It's also notable that
>> replacement selection only produces one run with mostly presorted
>> input, even where input far exceeds work_mem, which is a neat trick.)"
>>  You went on to dismiss that strategy, saying that "despite these
>> upsides, replacement selection is obsolete, and should usually be
>> avoided."  But I don't see that you've justified that statement.
>
> Really? Just try it with a heap that is not tiny. Performance tanks.
> The fact that replacement selection can produce one long run then
> becomes a liability, not a strength. With a work_mem of something like
> 1GB, it's *extremely* painful.

I'm not sure exactly what you think I should try.  I think a couple of
people have expressed the concern that your patch might regress things
on data that is all in order, but I'm not sure if you think I should
try that case or some case that is not-quite-in-order.  "I don't see
that you've justified that statement" is referring to the fact that
you presented no evidence in your original post that it's important to
sometimes use quicksorting even for run #1.  If you've provided some
test data illustrating that point somewhere, I'd appreciate a pointer
back to it.

> A compromise that may be acceptable is to always do a "quicksort with
> spillover" when there is a very low work_mem setting and the estimate
> of the number of input tuples is less than 10x of what we've seen so
> far. Maybe less than 20MB. That will achieve the same thing.

How about always starting with replacement selection, but limiting the
amount of memory that can be used with replacement selection to some
small value?  It could be a separate GUC, or a hard-coded constant
like 20MB if we're fairly confident that the same value will be good
for everyone.  If the tuples aren't in order, then we'll pretty
quickly come to the end of the first run and switch to quicksort.  If
we do end up using replacement selection for the whole sort, the
smaller heap is an advantage.  What I like about this sort of thing is
that it adds no reliance on any estimate; it's fully self-tuning.
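
In outline, that decision rule might look something like this (a sketch
only -- the names and the 20MB cap are hypothetical, and replacement
selection's run-boundary bookkeeping is collapsed into a single test):

#include <stdbool.h>
#include <stddef.h>

#define RS_MEM_CAP  (20 * 1024 * 1024)  /* hypothetical 20MB heap cap */

typedef struct SortState
{
    size_t  rs_heap_bytes;      /* memory used by the RS heap so far */
    bool    use_quicksort;      /* permanently switched to quicksort? */
} SortState;

/*
 * Called per input tuple while building run 1.  "fits_run1" is true when
 * the tuple sorts no lower than the current run head, so replacement
 * selection could still emit it as part of the first run.  Once this
 * returns false it never returns true again: everything that follows is
 * sorted in quicksorted runs.
 */
static bool
continue_replacement_selection(SortState *st, bool fits_run1,
                               size_t tuple_bytes)
{
    if (st->use_quicksort)
        return false;
    if (!fits_run1 || st->rs_heap_bytes + tuple_bytes > RS_MEM_CAP)
    {
        st->use_quicksort = true;   /* self-tuning: no estimates used */
        return false;
    }
    st->rs_heap_bytes += tuple_bytes;
    return true;
}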

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2016-01-29 Thread Mithun Cy
On Fri, Jan 29, 2016 at 5:11 PM, Mithun Cy  wrote:
> I just ran some tests on the above patch, mainly to compare how "longer
> sort keys" behave with the new (Qsort) and old (RS) algorithms for
> sorting. I have 8GB of RAM and SSD storage.
>
> Key length 520
>
> Number of records     320          640          1280         2560
> Data size             1.7GB        3.5GB        7GB          14GB
>
> CASE 1
>   RS      23654.677    35172.811    44965.442   106420.155
>   Qsort   14100.362    40612.829   101068.107   334893.391
>
> CASE 2
>   RS      13427.378    36882.898    98492.644   310670.15
>   Qsort   12475.133    32559.074   100772.531   322080.602
>
> CASE 3
>   RS      17202.966    45163.234   122323.299   337058.856
>   Qsort   12530.726    23343.753    59431.315   152862.837
>
> (work_mem = 64MB, 14GB of data)
>
> CASE 1    RS 128822.735    Qsort  90857.496
> CASE 2    RS 105631.775    Qsort 105938.334
> CASE 3    RS 152301.054    Qsort 149649.347

Sorry, I forgot to mention that the data in the tables above is in units
of ms, as returned by the psql client.


-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Using quicksort for every external sort run

2016-01-29 Thread Mithun Cy
On Tue, Dec 29, 2015 at 4:33 AM, Peter Geoghegan  wrote:
>Attached is a revision that significantly overhauls the memory patch,
>with several smaller changes.

I just ran some tests on the above patch, mainly to compare how "longer
sort keys" behave with the new (Qsort) and old (RS) algorithms for
sorting. I have 8GB of RAM and SSD storage.

Settings and results:

work_mem = DEFAULT (4MB)
key width = 520 bytes

CASE 1: data is presorted in sort-key order.

CASE 2: data is sorted in the opposite order of the sort key.

CASE 3: data is randomly distributed.


Key length 520

Number of records     320          640          1280         2560
Data size             1.7GB        3.5GB        7GB          14GB

CASE 1
  RS      23654.677    35172.811    44965.442   106420.155
  Qsort   14100.362    40612.829   101068.107   334893.391

CASE 2
  RS      13427.378    36882.898    98492.644   310670.15
  Qsort   12475.133    32559.074   100772.531   322080.602

CASE 3
  RS      17202.966    45163.234   122323.299   337058.856
  Qsort   12530.726    23343.753    59431.315   152862.837


If the data is already sorted in sort-key order, the current code
performs better than the proposed patch as the sort size increases.

The new algorithm does not appear to have any major impact if rows are
presorted in the opposite order.

For randomly distributed data, quicksort performs well compared to the
current sort method (RS).


==
Now increase work_mem to 64MB, sorting the 14GB data set.

CASE 1: Qsort is able to catch up with the current sort method (RS).
CASE 2: no impact.
CASE 3: RS is able to catch up with Qsort.


CASE 1    RS 128822.735    Qsort  90857.496
CASE 2    RS 105631.775    Qsort 105938.334
CASE 3    RS 152301.054    Qsort 149649.347

I think for long keys both the old (RS) and new (Qsort) sort methods
have their own characteristics based on data distribution. I think
work_mem is the key: if properly set, the new method (Qsort) will be
able to fit most of the cases. If work_mem is not tuned right, there are
cases where it can regress.


-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


Test Queries.sql
Description: application/sql

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2016-01-29 Thread Peter Geoghegan
On Fri, Jan 29, 2016 at 3:41 AM, Mithun Cy  wrote:
> I just ran some tests on the above patch, mainly to compare how "longer
> sort keys" behave with the new (Qsort) and old (RS) algorithms for
> sorting. I have 8GB of RAM and SSD storage.
>
> Settings and results:
>
> work_mem = DEFAULT (4MB)
> key width = 520 bytes

> If the data is already sorted in sort-key order, the current code
> performs better than the proposed patch as the sort size increases.
>
> The new algorithm does not appear to have any major impact if rows are
> presorted in the opposite order.
>
> For randomly distributed data, quicksort performs well compared to the
> current sort method (RS).
>
> ==
> Now increase work_mem to 64MB, sorting the 14GB data set.
>
> CASE 1: Qsort is able to catch up with the current sort method (RS).
> CASE 2: no impact.
> CASE 3: RS is able to catch up with Qsort.


I think that the basic method you're using to do these tests may have
additional overhead:

-- sort in ascending order.
CREATE FUNCTION test_orderby_asc() RETURNS int
AS $$
#print_strict_params on
DECLARE
    gs int;
    jk text;
BEGIN
    -- Force a full sort of table "so".  Note that SELECT ... INTO only
    -- keeps the first row of the sorted result; the sort itself is the
    -- point of the exercise.
    SELECT string_4k, generate_series INTO jk, gs
    FROM so ORDER BY string_4k, generate_series;

    RETURN gs;
END
$$ LANGUAGE plpgsql;

Anyway, these test cases all remove much of the advantage of increased
cache efficiency.  No comparisons are *ever* resolved using the
leading attribute, which calls into question why anyone would sort on
that. It's 512 bytes, which artificially makes the comparisons themselves
the bottleneck, as opposed to cache efficiency. You can't even fit the
second attribute in the same cacheline as the first in the "tuple
proper" (MinimalTuple).

You are using a 4MB work_mem setting, but you almost certainly have a
CPU with an L3 cache size that's a multiple of that, even with cheap
consumer grade hardware. You have 8GB of RAM; a 4MB work_mem setting
is a very small setting (small in an absolute sense, and even more so
relative to the size of the data).

You mentioned "CASE 3: RS is able to catchup with Qsort", which
doesn't make much sense to me. The only way I think that is possible
is by making the increased work_mem sufficient to have much longer
runs, because there is in fact somewhat of a correlation in the data,
and an increased work_mem makes the critical difference, allowing
perhaps one long run to be used -- there is now enough memory to
"juggle" tuples without ever needing to start a new run. But, how
could that be? You said case 3 was totally random data, so I'd only
expect incremental improvement. It could also be some weird effect
from polyphase merge. A discontinuity.

I also don't understand why the patch ("Qsort") can be so much slower
between case 1 and case 3 on 3.5GB+ sizes, but not the 1.7GB size.
Even leaving aside the differences between "RS" and "Qsort", it makes
no sense to me that *both* are faster with random data ("CASE 3") than
with presorted data ("CASE 1").

Another weird thing is that the traditional best case for replacement
selection ("RS") is a strong correlation, and a traditional worst case
is an inverse correlation, where run size is bound strictly by memory.
But you show just the opposite here -- the inverse correlation is
faster with RS in the 1.7 GB data case. So, I have no idea what's
going on here, and find it all very confusing.

In order for these numbers to be useful, they need more detail --
"trace_sort" output. There are enough confounding factors in general,
and especially here, that not having that information makes raw
numbers very difficult to interpret.

> I think for long keys both the old (RS) and new (Qsort) sort methods
> have their own characteristics based on data distribution. I think
> work_mem is the key: if properly set, the new method (Qsort) will be
> able to fit most of the cases. If work_mem is not tuned right, there
> are cases where it can regress.

work_mem is impossible to tune right with replacement selection.
That's a key advantage of the proposed new approach.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2016-01-29 Thread Robert Haas
On Wed, Jan 27, 2016 at 8:20 AM, Peter Geoghegan  wrote:
> Correct me if I'm wrong, but I think that the only outstanding issue
> with all patches posted here so far is the "quicksort with spillover"
> cost model. Hopefully this can be cleared up soon. As I've said, I am
> very receptive to other people's suggestions about how that should
> work.

I feel like this could be data driven.  I mean, the cost model is
based mainly on the tuple width and the size of the SortTuple array.
So, it should be possible to run tests of both algorithms on 32, 64, 96,
128, ... byte tuples with a SortTuple array that is 256MB, 512MB,
768MB, 1GB, ...  Then we can judge how closely the cost model comes to
mimicking the actual behavior.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2016-01-29 Thread Peter Geoghegan
On Fri, Jan 29, 2016 at 9:24 AM, Robert Haas  wrote:
> I feel like this could be data driven.  I mean, the cost model is
> based mainly on the tuple width and the size of the SortTuple array.
> So, it should be possible to run tests of both algorithms on 32, 64, 96,
> 128, ... byte tuples with a SortTuple array that is 256MB, 512MB,
> 768MB, 1GB, ...  Then we can judge how closely the cost model comes to
> mimicking the actual behavior.

You would also need to represent how much of the input actually ended
up being sorted with the heap in each case. Maybe that could be tested
at 50% (bad for "quicksort with spillover"), 25% (better), and 5%
(good).

An alternative approach that might be acceptable is to add a generic,
conservative 90% threshold (so 10% of tuples sorted by heap).

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2016-01-29 Thread Robert Haas
On Fri, Jan 29, 2016 at 12:46 PM, Peter Geoghegan  wrote:
> On Fri, Jan 29, 2016 at 9:24 AM, Robert Haas  wrote:
>> I feel like this could be data driven.  I mean, the cost model is
>> based mainly on the tuple width and the size of the SortTuple array.
>> So, it should be possible to run tests of both algorithms on 32, 64, 96,
>> 128, ... byte tuples with a SortTuple array that is 256MB, 512MB,
>> 768MB, 1GB, ...  Then we can judge how closely the cost model comes to
>> mimicking the actual behavior.
>
> You would also need to represent how much of the input actually ended
> up being sorted with the heap in each case. Maybe that could be tested
> at 50% (bad for "quicksort with spillover"), 25% (better), and 5%
> (good).
>
> An alternative approach that might be acceptable is to add a generic,
> conservative 90% threshold (so 10% of tuples sorted by heap).

I don't quite know what you mean by these numbers.  Add a generic,
conservative threshold to what?

Thinking about this some more, I really think we should think hard
about going back to the strategy which you proposed and discarded in
your original post: always generate the first run using replacement
selection, and every subsequent run by quicksorting.  In that post you
mention powerful advantages of this method: "even without a strong
logical/physical correlation, the algorithm tends to produce runs that
are about twice the size of work_mem. (It's also notable that
replacement selection only produces one run with mostly presorted
input, even where input far exceeds work_mem, which is a neat trick.)"
 You went on to dismiss that strategy, saying that "despite these
upsides, replacement selection is obsolete, and should usually be
avoided."  But I don't see that you've justified that statement.  It
seems pretty easy to construct cases where this technique regresses,
and a large percentage of those cases are precisely those where
replacement selection would have produced a single run, avoiding the
merge step altogether.  I think those cases are extremely important.
I'm quite willing to run somewhat more slowly than in other cases to
be certain of not regressing the case of completely or
almost-completely ordered input.  Even if that didn't seem like a
sufficient reason unto itself, I'd be willing to go that way just so
we don't have to depend on a cost model that might easily go wrong due
to bad input even if it were theoretically perfect in every other
respect (which I'm pretty sure is not true here anyway).

I also have another idea that might help squeeze more performance out
of your approach and avoid regressions.  Suppose that we add a new GUC
with a name like sort_mem_stretch_multiplier or something like that,
with a default value of 2.0 or 4.0 or whatever we think is reasonable.
When we've written enough runs that a polyphase merge will be
required, or when we're actually performing a polyphase merge, the
amount of memory we're allowed to use increases by this multiple.  The
idea is: we hope that people will set work_mem appropriately and
consequently won't experience polyphase merges at all, but they might
happen anyway.  However, that's almost certain not to happen very
frequently.
Therefore, using extra memory in such cases should be acceptable,
because while you might have every backend in the system using 1 or
more copies of work_mem for something if the system is very busy, it
is extremely unlikely that you will have more than a handful of
processes doing polyphase merges.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2016-01-29 Thread Greg Stark
On 29 Jan 2016 11:58 pm, "Robert Haas"  wrote:
> It
> seems pretty easy to construct cases where this technique regresses,
> and a large percentage of those cases are precisely those where
> replacement selection would have produced a single run, avoiding the
> merge step altogether.

Note that avoiding the merge phase altogether doesn't necessarily
represent any actual advantage.

We don't find out we've avoided the merge phase until the entire run has
been spilled to disk. Then we need to read it back in from disk to serve
up those tuples.

If we have tapes to merge but can do them in a single pass, we do that
lazily and merge as needed when we serve up the tuples. I doubt there's
any speed difference in reading two sequential streams with our
buffering versus one, especially in the midst of a query doing other
i/o. And N extra comparisons are less than the quicksort advantage.

If we could somehow predict that it'll be a single output run, that
would be a huge advantage. But having to spill all the tuples and then
find out isn't really helpful.


Re: [HACKERS] Using quicksort for every external sort run

2016-01-29 Thread Greg Stark
On 30 Jan 2016 8:27 am, "Greg Stark"  wrote:
>
>
> On 29 Jan 2016 11:58 pm, "Robert Haas"  wrote:
> > It
> > seems pretty easy to construct cases where this technique regresses,
> > and a large percentage of those cases are precisely those where
> > replacement selection would have produced a single run, avoiding the
> > merge step altogether.
>
> Note that avoiding the merge phase altogether doesn't necessarily
> represent any actual advantage.
>
> We don't find out we've avoided the merge phase until the entire run
> has been spilled to disk.

Hm, sorry about the phone typos. I thought I proofread it as I went but
obviously not that effectively...


Re: [HACKERS] Using quicksort for every external sort run

2016-01-29 Thread Peter Geoghegan
On Fri, Jan 29, 2016 at 2:58 PM, Robert Haas  wrote:
> I don't quite know what you mean by these numbers.  Add a generic,
> conservative threshold to what?

I meant use "quicksort with spillover" simply because an estimated
90%+ of all tuples have already been consumed. Don't consider the
tuple width, etc.

> Thinking about this some more, I really think we should think hard
> about going back to the strategy which you proposed and discarded in
> your original post: always generate the first run using replacement
> selection, and every subsequent run by quicksorting.  In that post you
> mention powerful advantages of this method: "even without a strong
> logical/physical correlation, the algorithm tends to produce runs that
> are about twice the size of work_mem. (It's also notable that
> replacement selection only produces one run with mostly presorted
> input, even where input far exceeds work_mem, which is a neat trick.)"
>  You went on to dismiss that strategy, saying that "despite these
> upsides, replacement selection is obsolete, and should usually be
> avoided."  But I don't see that you've justified that statement.

Really? Just try it with a heap that is not tiny. Performance tanks.
The fact that replacement selection can produce one long run then
becomes a liability, not a strength. With a work_mem of something like
1GB, it's *extremely* painful.

> It seems pretty easy to construct cases where this technique regresses,
> and a large percentage of those cases are precisely those where
> replacement selection would have produced a single run, avoiding the
> merge step altogether.

...*and* where many passes are otherwise required (otherwise, the
merge is still cheap enough to leave us ahead). Typically with very
small work_mem settings, like 4MB, and far larger data volumes. It's
easy to construct those cases, but that doesn't mean that they
particularly matter. Using 4MB of work_mem to sort 10GB of data is
penny wise and pound foolish. The cases we've seen regressed are
mostly a concern because misconfiguration happens.

A compromise that may be acceptable is to always do a "quicksort with
spillover" when there is a very low work_mem setting and the estimate
of the number of input tuples is less than 10x of what we've seen so
far. Maybe less than 20MB. That will achieve the same thing.

> I'm quite willing to run somewhat more slowly than in other cases to
> be certain of not regressing the case of completely or
> almost-completely ordered input.  Even if that didn't seem like a
> sufficient reason unto itself, I'd be willing to go that way just so
> we don't have to depend on a cost model that might easily go wrong due
> to bad input even if it were theoretically perfect in every other
> respect (which I'm pretty sure is not true here anyway).

The consequences of being wrong either way are not severe (note that
making one long run isn't a goal of the cost model currently).

> I also have another idea that might help squeeze more performance out
> of your approach and avoid regressions.  Suppose that we add a new GUC
> with a name like sort_mem_stretch_multiplier or something like that,
> with a default value of 2.0 or 4.0 or whatever we think is reasonable.
> When we've written enough runs that a polyphase merge will be
> required, or when we're actually performing a polyphase merge, the
> amount of memory we're allowed to use increases by this multiple.  The
> idea is: we hope that people will set work_mem appropriately and
> consequently won't experience polyphase merges at all, but they might
> happen anyway.  However, that's almost certain not to happen very
> frequently.
> Therefore, using extra memory in such cases should be acceptable,
> because while you might have every backend in the system using 1 or
> more copies of work_mem for something if the system is very busy, it
> is extremely unlikely that you will have more than a handful of
> processes doing polyphase merges.

I'm not sure that that's practical. Currently, tuplesort decides on a
number of tapes ahead of time. When we're constrained on those, the
stretch multiplier would apply, but I think that that could be
invasive because the number of tapes ("merge order" + 1) was a
function of non-stretched work_mem.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2016-01-27 Thread Peter Geoghegan
On Wed, Dec 23, 2015 at 9:37 AM, Robert Haas  wrote:
>> But yes, let me concede more clearly: the cost model is based on
>> frobbing. But at least it's relatively honest about that, and is
>> relatively simple. I think it might be possible to make it simpler,
>> but I have a feeling that anything we can come up with will basically
>> have the same quality that you so dislike. I don't know how to do
>> better. Frankly, I'd rather be roughly correct than exactly wrong.
>
> Sure, but the fact that the model has huge discontinuities - perhaps
> most notably a case where adding a single tuple to the estimated
> cardinality changes the crossover point by a factor of two - suggests
> that you are probably wrong.  The actual behavior does not change
> sharply when the size of the SortTuple array crosses 1GB, but the
> estimates do.

Here is some fairly interesting analysis of Quicksort vs. Heapsort,
from Bentley, coauthor of our own Quicksort implementation:

https://youtu.be/QvgYAQzg1z8?t=16m15s

(This link picks up at the right point to see the comparison, complete
with an interesting graph).

It probably doesn't tell you much that you didn't already know, at
least at this exact point, but it's nice to see Bentley's graph. This
perhaps gives you some idea of why my "quicksort with spillover" cost
model had a cap of MaxAllocSize on SortTuples, past which we always
needed a very compelling case. That was my rough guess of where the
Heapsort graph takes a sharp upward turn. Before then, Bentley shows
that it's close enough to a straight line.

Correct me if I'm wrong, but I think that the only outstanding issue
with all patches posted here so far is the "quicksort with spillover"
cost model. Hopefully this can be cleared up soon. As I've said, I am
very receptive to other people's suggestions about how that should
work.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2015-12-28 Thread Peter Geoghegan
On Fri, Dec 18, 2015 at 11:57 AM, Peter Geoghegan  wrote:
> BTW, I'm not necessarily determined to make the new special-purpose
> allocator work exactly as proposed. It seemed useful to prioritize
> simplicity, and currently so there is one big "huge palloc()" with
> which we blow our memory budget, and that's it. However, I could
> probably be more clever about "freeing ranges" initially preserved for
> a now-exhausted tape. That kind of thing.

Attached is a revision that significantly overhauls the memory patch,
with several smaller changes.

We can now grow memtuples to rebalance the size of the array
(memtupsize) against the need for memory for tuples. Doing this makes
a big difference with a 500MB work_mem setting in this datum sort
case, as my newly expanded trace_sort instrumentation shows:

LOG:  grew memtuples 1.40x from 9362286 (219429 KB) to 13107200
(307200 KB) for final merge
LOG:  tape 0 initially used 34110 KB of 34110 KB batch (1.000) and
13107200 slots remaining
LOG:  tape 1 initially used 34110 KB of 34110 KB batch (1.000) and has
1534 slots remaining
LOG:  tape 2 initially used 34110 KB of 34110 KB batch (1.000) and has
1535 slots remaining
LOG:  tape 3 initially used 34110 KB of 34110 KB batch (1.000) and has
1533 slots remaining
LOG:  tape 4 initially used 34110 KB of 34110 KB batch (1.000) and has
1534 slots remaining
LOG:  tape 5 initially used 34110 KB of 34110 KB batch (1.000) and has
1535 slots remaining

This is a big improvement. With the new batchmemtuples() call
commented out (i.e. no new grow_memtuples() call), the LOG output
around the same point is:

LOG:  tape 0 initially used 24381 KB of 48738 KB batch (0.500) and has
1 slots remaining
LOG:  tape 1 initially used 24381 KB of 48738 KB batch (0.500) and has
1 slots remaining
LOG:  tape 2 initially used 24381 KB of 48738 KB batch (0.500) and has
1 slots remaining
LOG:  tape 3 initially used 24381 KB of 48738 KB batch (0.500) and has
1 slots remaining
LOG:  tape 4 initially used 24381 KB of 48738 KB batch (0.500) and has
1 slots remaining
LOG:  tape 5 initially used 24381 KB of 48738 KB batch (0.500) and has
1 slots remaining

(I actually added a bit more detail to what you see here during final clean-up)

Obviously we're using memory a lot more efficiently here as compared
to my last revision (or the master branch -- it always has palloc()
overhead, of course). With no grow_memtuples, we're not wasting ~1530
slots per tape anymore (which is a tiny fraction of 1% of the total),
but we are wasting 50% of all batch memory, or almost 30% of all
work_mem.

Note that this improvement is possible despite the fact that memory is
still MAXALIGN()'d -- I'm mostly just clawing back what I can, having
avoided much STANDARDCHUNKHEADERSIZE overhead for the final on-the-fly
merge. I tend to think that the bigger problem here is that we use so
many memtuples when merging in the first place though (e.g. 60% in the
above case), because memtuples are much less useful than something
like a simple array of pointers when merging; I can certainly see why
you'd need 6 memtuples here, for the merge heap, but the other ~13
million seem mostly unnecessary. Anyway, what I have now is as far as
I want to go to accelerate merging for 9.6, since parallel CREATE
INDEX is where the next big win will come from. As wasteful as this
can be, I think it's of secondary importance.

With this revision, I've given up on the idea of trying to map
USEMEM()/FREEMEM() to "logical" allocations and deallocations that
consume from each tape's batch. The existing merge code in the master
branch is concerned exclusively with making each tape's use of memory
fair; each tape only gets so many "slots" (memtuples), and so much
memory, and that's it (there is never any shuffling of those resource
budgets between tapes). I get the same outcome from simply only
allowing tapes to get memory from their own batch allocation, which
isn't much complexity, because only READTUP() routines regularly need
memory. We detect when memory has been exhausted within
mergeprereadone() in a special way, not using LACKMEM() at all -- this
seems simpler. (Specifically, we use something called overflow
allocations for this purpose. This means that there are still a very
limited number of retail palloc() calls.)
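
A toy model of the per-tape batch scheme (my own simplification, purely
to illustrate the shape of it; the real patch differs in the details):

#include <stdbool.h>
#include <stdlib.h>

typedef struct TapeBatch
{
    char   *mem;                /* this tape's slab */
    size_t  used;
    size_t  size;
} TapeBatch;

static bool
tape_batch_init(TapeBatch *batch, size_t size)
{
    batch->mem = malloc(size);
    batch->used = 0;
    batch->size = size;
    return batch->mem != NULL;
}

/*
 * Bump-allocate from this tape's slab only.  Returns NULL when the batch
 * is exhausted; the caller then falls back to an ordinary "overflow"
 * allocation, which is also how exhaustion is detected -- no LACKMEM().
 */
static void *
tape_batch_alloc(TapeBatch *batch, size_t len)
{
    char   *p;

    len = (len + 7) & ~(size_t) 7;      /* MAXALIGN-style rounding */
    if (batch->used + len > batch->size)
        return NULL;
    p = batch->mem + batch->used;
    batch->used += len;
    return p;
}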

This new version somewhat formalizes the idea that batch allocation
may one day have uses beyond the final on-the-fly merge phase, which
makes a lot of sense. We should really be saving a significant amount
of memory when initially sorting runs, too. This revision also
pfree()s tape memory early if the tape is exhausted early, which will
help a lot when there is a logical/physical correlation.

Overall, I'm far happier with how memory is managed in this revision,
mostly because it's easier to reason about. trace_sort now closely
monitors where memory goes, and I think that's a good idea in general.
That makes production performance problems a lot easier to reason
about -- 

Re: [HACKERS] Using quicksort for every external sort run

2015-12-23 Thread Robert Haas
On Tue, Dec 22, 2015 at 8:10 PM, Peter Geoghegan  wrote:
>> Sure, there are arbitrary numbers all over the code, driven by
>> empirical observations about what factors are important to model.  But
>> this is not that.  You don't have a thing called seq_page_cost and a
>> thing called cpu_tuple_cost and then say, well, empirically the ratio
>> is about 100:1, so let's make the former 1 and the latter 0.01.  You
>> just have some numbers, and it's not clear what, if anything, they
>> actually represent.
>
> What I find difficult to accept about what you say here is that at
> *this* level, something like cost_sort() has little to recommend it.
> It costs a sort of a text attribute at the same level as the cost of
> sorting the same tuples using an int4 attribute (based on the default
> cpu_operator_cost for C functions -- without any attempt to
> differentiate text and int4).
>
> Prior to 9.5, sorting text took about 5 - 10 times longer than this
> similar int4 sort. That's a pretty big difference, and yet I recall no
> complaints. The cost of a comparison in a sort can hardly be
> considered in isolation, anyway -- cache efficiency is at least as
> important.
>
> Of course, the point is that the goal of a cost model is not to
> simulate reality as closely as possible -- it's to produce a good
> outcome for performance purposes under realistic assumptions.
> Realistic assumptions include that you can't hope to account for
> certain differences in cost. Avoiding a terrible outcome is very
> important, but the worst case for useselection() is no worse than
> today's behavior (or a lost opportunity to do better than today's
> behavior).

I agree with that.  So, the question for any given cost model is: does
it model the effects that matter?

If you think that the cost of sorting integers vs. sorting text
matters to the crossover point, then that should be modeled here.  If
it doesn't matter, then don't include it.

The point is, nobody can tell WHAT effects this is modeling.
Increasing the tuple size makes the crossover go up.  Or down.

> Recently, the paper that was posted to the list about the Postgres
> optimizer stated formally what I know I had a good intuitive sense of
> for a long time: that better selectivity estimates are much more
> important than better cost models in practice. The "empirical
> observations" driving something like DEFAULT_EQ_SEL are very weak --
> but what are you gonna do?

This analogy is faulty.  It's true that when we run across a qual
whose selectivity we cannot estimate in any meaningful way, we have to
just take a stab in the dark and hope for the best.  Similarly, if we
have no information about what the crossover point for a given sort
is, we'd have to take some arbitrary estimate, like 75%, and hope for
the best.  But in this case, we DO have information.  We have an
estimated row count and an estimated row width.  And those values are
not being ignored, they are getting used.  The problem is that they
are being used in an arbitrary way that is not justified by any chain
of reasoning.

> But yes, let me concede more clearly: the cost model is based on
> frobbing. But at least it's relatively honest about that, and is
> relatively simple. I think it might be possible to make it simpler,
> but I have a feeling that anything we can come up with will basically
> have the same quality that you so dislike. I don't know how to do
> better. Frankly, I'd rather be roughly correct than exactly wrong.

Sure, but the fact that the model has huge discontinuities - perhaps
most notably a case where adding a single tuple to the estimated
cardinality changes the crossover point by a factor of two - suggests
that you are probably wrong.  The actual behavior does not change
sharply when the size of the SortTuple array crosses 1GB, but the
estimates do.  That means that either the estimates are wrong for
44,739,242 tuples or they are wrong for 44,739,243 tuples.  The
behavior cannot be right in both cases unless that one extra tuple
changes the behavior radically, or unless the estimate doesn't matter
in the first place.

> By the way, I think that there needs to be a little work done to
> cost_sort() too, which so far I've avoided.

Yeah, I agree, but that can be a separate topic.

>> I agree, but that's not what I proposed.  You don't want to keep
>> re-sorting to incorporate new tuples into the run, but if you've got
>> 1010 tuples and you can fit 1000 tuples in, you can (a) quicksort the
>> first 1000 tuples, (b) read in 10 more tuples, dumping the first 10
>> tuples from run 0 to disk, (c) quicksort the last 10 tuples to create
>> run 1, and then (d) merge run 0 [which is mostly in memory] with run 1
>> [which is entirely in memory].  In other words, yes, quicksorting
>> doesn't let you add things to the sort incrementally, but you can
>> still write out the run incrementally, writing only as many tuples as
>> you need to dump to get the rest of the input data into 

Re: [HACKERS] Using quicksort for every external sort run

2015-12-23 Thread Michael Paquier
On Thu, Dec 24, 2015 at 8:44 AM, Peter Geoghegan  wrote:
> [long blahblah]

(Patch moved to next CF; work is ongoing. Thanks to the people here for being active.)
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2015-12-23 Thread Peter Geoghegan
On Wed, Dec 23, 2015 at 1:03 PM, Jeff Janes  wrote:
> The regression I found when building an index on a column of
> 400,000,000 md5(random()::text) with 64MB maintenance_work_mem was not
> hard to find at all.  I still don't understand what is going on with
> it, but it is reproducible.  Perhaps it is very unlikely and I just
> got very lucky in finding it immediately after switching to that
> data-type for my tests, but I wouldn't assume that on current
> evidence.

Well, that is a lot of tuples to sort with such a small amount of memory.

I have a new theory. Maybe part of the problem here is that in very
low memory conditions, the tape overhead really is kind of wasteful,
and we're back to having to worry about per-tape overhead (6 tapes may
have been far too miserly as a universal number back before that was
fixed [1], but that doesn't mean that the per-tape overhead is
literally zero). You get a kind of thrashing, perhaps. Also, more
tapes results in more random I/O, and that's an added cost, too; the
cure may be worse than the disease.

I also think that this might be a problem in your case:

 * In this calculation we assume that each tape will cost us about 3 blocks
 * worth of buffer space (which is an underestimate for very large data
 * volumes, but it's probably close enough --- see logtape.c).

I wonder, what's the situation here like with the attached patch
applied on top of what you were testing? I think that we might be
better off with more merge steps when under enormous memory pressure
at the low end, in order to be able to store more tuples per tape (and
do more sorting using quicksort). I also think that under conditions
such as you describe, this code may play havoc with memory accounting:

/*
 * Decrease availMem to reflect the space needed for tape buffers; but
 * don't decrease it to the point that we have no room for tuples. (That
 * case is only likely to occur if sorting pass-by-value Datums; in all
 * other scenarios the memtuples[] array is unlikely to occupy more than
 * half of allowedMem.  In the pass-by-value case it's not important to
 * account for tuple space, so we don't care if LACKMEM becomes
 * inaccurate.)
 */
tapeSpace = (int64) maxTapes * TAPE_BUFFER_OVERHEAD;

if (tapeSpace + GetMemoryChunkSpace(state->memtuples) < state->allowedMem)
USEMEM(state, tapeSpace);

Remember, this is after the final grow_memtuples() call that uses your
intelligent resizing logic [2], so we'll USEMEM() in a way that
effectively makes some non-trivial proportion of our optimal memtuples
sizing unusable. Again, that could be really bad for cases like yours,
with very little memory relatively to data volume.

Thanks

[1] Commit df700e6b4
[2] Commit 8ae35e918
-- 
Peter Geoghegan
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index cf1cdcb..bd8c0ee 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -201,6 +201,7 @@ typedef enum
  * tape during a preread cycle (see discussion at top of file).
  */
 #define MINORDER		6		/* minimum merge order */
+#define MAXORDER		250		/* maximum merge order */
 #define TAPE_BUFFER_OVERHEAD		(BLCKSZ * 3)
 #define MERGE_BUFFER_SIZE			(BLCKSZ * 32)
 
@@ -2066,8 +2067,15 @@ tuplesort_merge_order(int64 allowedMem)
 	mOrder = (allowedMem - TAPE_BUFFER_OVERHEAD) /
 		(MERGE_BUFFER_SIZE + TAPE_BUFFER_OVERHEAD);
 
-	/* Even in minimum memory, use at least a MINORDER merge */
+	/*
+	 * Even in minimum memory, use at least a MINORDER merge.  At the same
+	 * time, cap the maximum merge order quasi-arbitrarily.  With more than
+	 * several hundred tapes, the overhead per-tape is likely to become a
+	 * concern, since that will only happen in the event of severe memory
+	 * pressure.
+	 */
 	mOrder = Max(mOrder, MINORDER);
+	mOrder = Min(mOrder, MAXORDER);
 
 	return mOrder;
 }

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2015-12-23 Thread Peter Geoghegan
On Wed, Dec 23, 2015 at 1:03 PM, Jeff Janes  wrote:
> If we do think it is important to almost never cause regressions at
> the default maintenance_work_mem (I am agnostic on the importance of
> that), then I think we have more work to do here.  I just don't know
> what that work is.

My next revision will use grow_memtuples() in advance of the final
on-the-fly merge step, in a way that considers that we won't be losing
out to palloc() overhead (so it'll mostly be the memory patch that is
revised). This can make a large difference to the number of slots
(memtuples) available. I think I measured a 6% or 7% additional
improvement for a case with a fairly small number of runs to merge. It
might help significantly more when there are more runs to merge.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2015-12-23 Thread Peter Geoghegan
On Wed, Dec 23, 2015 at 9:37 AM, Robert Haas  wrote:
> The point is, nobody can tell WHAT effects this is modeling.
> Increasing the tuple size makes the crossover go up.  Or down.

There are multiple, competing considerations.

> This analogy is faulty.  It's true that when we run across a qual
> whose selectivity we cannot estimate in any meaningful way, we have to
> just take a stab in the dark and hope for the best.  Similarly, if we
> have no information about what the crossover point for a given sort
> is, we'd have to take some arbitrary estimate, like 75%, and hope for
> the best.  But in this case, we DO have information.  We have an
> estimated row count and an estimated row width.  And those values are
> not being ignored, they are getting used.  The problem is that they
> are being used in an arbitrary way that is not justified by any chain
> of reasoning.

There is a chain of reasoning. It's not particularly satisfactory that
it's so fuzzy, certainly, but the competing considerations here are
substantive (and include erring towards not proceeding with
replacement selection/"quicksort with spillover" when the benefits are
low relative to the costs, which, to repeat myself, is itself novel).

I am more than open to suggestions on alternatives. As I said, I don't
particularly care for my current approach, either. But doing something
analogous to cost_sort() for our private "Do we quicksort with
spillover?"/useselection() model is going to be strictly worse than
what I have proposed.

Any cost model will have to be sensitive to different types of CPU
costs at the level that matters here -- such as the size of the heap,
and its cache efficiency. That's really important, but very
complicated, and variable enough that erring against using replacement
selection seems like a good idea with bigger heaps especially. That
(cache efficiency) is theoretically the only difference that matters
here (other than I/O, of course, but avoiding I/O is only the upside
of proceeding, and if we only weigh that then the cost model always
gives the same answer).

Perhaps you can suggest an alternative model that weighs these
factors. Most sorts are less than 1GB, and it seems worthwhile to
avoid I/O at the level where an internal sort is just out of reach.
Really big CREATE INDEX sorts are not really what I have in mind with
"quicksort with spillover".

This cost_sort() code seems pretty bogus to me, FWIW:

/* Assume 3/4ths of accesses are sequential, 1/4th are not */
startup_cost += npageaccesses *
(seq_page_cost * 0.75 + random_page_cost * 0.25);

I think we can afford to be a lot more optimistic about the proportion
of sequential accesses.

>> Merging is still sorting. The 10 tuples are not very cheap to merge
>> against the 1000 tuples, because you'll probably still end up reading
>> most of the 1000 tuples to do so.
>
> You're going to read all of the 1000 tuples no matter what, because
> you need to return them, but you will also need to make comparisons on
> most of them, unless the data distribution is favorable.   Assuming no
> special good luck, it'll take something close to X + Y - 1 comparisons
> to do the merge, so something around 1009 comparisons here.
> Maintaining the heap property is not free either, but it might be
> cheaper.

I'm pretty sure that it's cheaper. Some of the really good cases for
"quicksort with spillover" where only a little bit slower than a fully
internal sort when the work_mem threshold was just crossed.

>> Another factor is that the heap could be useful for other stuff in the
>> future. As Simon Riggs pointed out, for deduplicating values as
>> they're read in by tuplesort. (Okay, that's really the only other
>> thing, but it's a good one).
>
> Not sure how that would work?

Tuplesort would have license to discard tuples with matching existing
values, because the caller gave it permission to. This is something
that you can easily imagine occurring with ordered set aggregates, for
example. It would work in a way not unlike a top-N heapsort does
today. This would work well when it can substantially lower the use of
memory (initially heapification when the threshold is crossed would
probably measure the number of duplicates, and proceed only when it
looked like a promising strategy).
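
A sketch of the discard-on-insert idea (purely illustrative and my own,
not Simon's design; a sorted array of ints stands in for the heap and
its comparator):

#include <stdlib.h>
#include <string.h>

typedef struct DedupSet
{
    int    *vals;               /* kept sorted ascending */
    size_t  n;
    size_t  cap;
} DedupSet;

/*
 * Returns 1 if the value was new and kept, 0 if it matched an existing
 * value and was discarded -- the "license to discard" described above.
 */
static int
dedup_insert(DedupSet *s, int v)
{
    size_t  lo = 0,
            hi = s->n;

    while (lo < hi)             /* binary search for v's position */
    {
        size_t  mid = (lo + hi) / 2;

        if (s->vals[mid] < v)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < s->n && s->vals[lo] == v)
        return 0;               /* duplicate: drop it, saving memory */
    if (s->n == s->cap)
    {
        s->cap = s->cap ? s->cap * 2 : 64;
        s->vals = realloc(s->vals, s->cap * sizeof(int));
    }
    memmove(&s->vals[lo + 1], &s->vals[lo], (s->n - lo) * sizeof(int));
    s->vals[lo] = v;
    s->n++;
    return 1;
}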

By the way, I think the heap currently does quite badly with many
duplicated values. That case seemed significantly slower than a
similar case with high cardinality tuples.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2015-12-23 Thread Jeff Janes
On Mon, Dec 14, 2015 at 7:22 PM, Peter Geoghegan  wrote:
> On Mon, Dec 14, 2015 at 6:58 PM, Greg Stark  wrote:
>> I ran sorts with various parameters on my small NAS server.
>
> ...
>
>> without the extra memory optimizations.
>
> Thanks for taking the time to benchmark the patch!
>
> While I think it's perfectly fair that you didn't apply the final
> on-the-fly merge "memory pool" patch, I also think that it's quite
> possible that the regression you see at the very low end would be
> significantly ameliorated or even eliminated by applying that patch,
> too. After all, Jeff Janes had a much harder time finding a
> regression, probably because he benchmarked all patches together.

The regression I found when building an index on a column of
400,000,000 md5(random()::text) with 64MB maintenance_work_mem was not
hard to find at all.  I still don't understand what is going on with
it, but it is reproducible.  Perhaps it is very unlikely and I just
got very lucky in finding it immediately after switching to that
data-type for my tests, but I wouldn't assume that on current
evidence.

If we do think it is important to almost never cause regressions at
the default maintenance_work_mem (I am agnostic on the importance of
that), then I think we have more work to do here.  I just don't know
what that work is.

Cheers,

Jeff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2015-12-23 Thread Peter Geoghegan
On Wed, Dec 23, 2015 at 1:16 PM, Robert Haas  wrote:
> On Wed, Dec 23, 2015 at 3:31 PM, Peter Geoghegan  wrote:
>> On Wed, Dec 23, 2015 at 9:37 AM, Robert Haas  wrote:
>>> The point is, nobody can tell WHAT effects this is modeling.
>>> Increasing the tuple size makes the crossover go up.  Or down.
>>
>> There are multiple, competing considerations.
>
> Please explain what they are and how they lead you to believe that the
> cost factors you have chosen are good ones.

Alright.

I've gone on at length about how I'm blurring the distinction between
internal and external sorting, or about how modern hardware
characteristics allow that. There are several reasons for that. Now,
we all know that main memory sizes have increased dramatically since
the 1970s, and storage characteristics are very different, and that
CPU caching effects have become very important, and that everyone has
lots more data.

There is one thing that hasn't really become bigger in all that time,
though: the width of tuples. So, as I go into in comments within
useselection(), that's the main reason why avoiding I/O isn't all that
impressive, especially at the high end. It's just not that big of a
cost at the high end. Beyond that, as linear costs go, palloc() is a
much bigger concern to me at this point. I think we can waste a lot
less time by amortizing that more extensively (to say nothing of the
saving in memory). This is really obvious by just looking at
trace_sort output with my patch applied when dealing with many runs,
sorting millions of tuples: There just isn't that much time spent on
I/O at all, and it's well hidden by foreground processing that is CPU
bound. With smaller work_mem sizes and far fewer tuples, a case much
more common within sort nodes (as opposed to utility statements), this
is less true. Sorting 1,000 or 10,000 tuples is an entirely different
thing to sorting 1,000,000 tuples.

So, first of all, the main consideration is that saving I/O turns out
to not matter that much at the high end. That's why we get very
conservative past the fairly arbitrary MaxAllocSize memtuples
threshold (which has a linear relationship to the number of tuples --
*not* the amount of memory used or disk space that may be used).

A second consideration is how much I/O we can save -- one would hope
it would be a lot, certainly the majority, to make up for the downside
of using a cache inefficient technique. That is a different thing to
the number of memtuples. If you had really huge tuples, there would be
a really big saving in I/O, often without a corresponding degradation
in cache performance (since there still may not be that many
memtuples, which is more the problem for the heap than anything else).
This distinction is especially likely to matter for the CLUSTER case,
where wide heap tuples (including heap tuple headers, visibility info)
are kind of along for the ride, which is less true elsewhere,
particularly for the CREATE INDEX case.

The cache inefficiency of spilling incrementally from a heap isn't so
bad if we only end up sorting a small number of tuples that way. So as
the number of tuples that we end up actually sorting that way
increases, the cache inefficiency becomes worse, while at the same
time, we save less I/O. The former is a bigger problem than the
latter, by a wide margin, I believe.

This code is an attempt to credit cases with really wide tuples:

/*
 * Starting from a threshold of 90%, refund 7.5% per 32 byte
 * average-size-increment.
 */
increments = MAXALIGN_DOWN((int) avgTupleSize) / 32;
crossover = 0.90 - (increments * 0.075);

Most cases won't get too many "increments" of credit (although CLUSTER
sorts will probably get relatively many).
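
To make the arithmetic concrete (my own worked numbers, simply applying
the code above): a 64-byte average tuple gets increments = 2, so
crossover = 0.90 - 2 * 0.075 = 0.75, while a 320-byte average tuple --
plausible for CLUSTER -- gets increments = 10 and crossover = 0.15.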

A third consideration is that we should be stingy about giving too
much credit to wider tuples because the cache inefficiency hurts more
as we achieve mere linear savings in I/O. So, most of the savings off
a 99.99% theoretical baseline threshold are fixed (you usually save
9.99% off that up-front).

A fourth consideration is that the heap seems to do really badly past
1GB in general, due to cache characteristics. This is certainly not
something that I know how to model well.

I don't blame you for calling this voodoo, because to some extent it
is. But I remind you that the consequences of making the wrong
decision here are still better than the status quo today -- probably
far better, overall. I also remind you that voodoo code is something
you'll find in well regarded code bases at times. Have you ever
written networking code? Packet switching is based on some handwavy
observations about the real world. Practical implementations often
contain voodoo magic numbers. So, to answer your earlier question:
Yes, maybe it wouldn't be so bad, all things considered, to let
someone complain about this if they have a real-world problem with it.
The complexity of what we're talking about makes me 

Re: [HACKERS] Using quicksort for every external sort run

2015-12-23 Thread Robert Haas
On Wed, Dec 23, 2015 at 3:31 PM, Peter Geoghegan  wrote:
> On Wed, Dec 23, 2015 at 9:37 AM, Robert Haas  wrote:
>> The point is, nobody can tell WHAT effects this is modeling.
>> Increasing the tuple size makes the crossover go up.  Or down.
>
> There are multiple, competing considerations.

Please explain what they are and how they lead you to believe that the
cost factors you have chosen are good ones.

My point here is: even if I were to concede that your cost model
yields perfect answers in every case, the patch needs to give at least
some hint as to why.  Right now, it really doesn't.

>>> Another factor is that the heap could be useful for other stuff in the
>>> future. As Simon Riggs pointed out, for deduplicating values as
>>> they're read in by tuplesort. (Okay, that's really the only other
>>> thing, but it's a good one).
>>
>> Not sure how that would work?
>
> Tuplesort would have license to discard tuples with matching existing
> values, because the caller gave it permission to. This is something
> that you can easily imagine occurring with ordered set aggregates, for
> example. It would work in a way not unlike a top-N heapsort does
> today. This would work well when it can substantially lower the use of
> memory (initially heapification when the threshold is crossed would
> probably measure the number of duplicates, and proceed only when it
> looked like a promising strategy).

It's not clear to me how having a heap helps with that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Using quicksort for every external sort run

2015-12-23 Thread Peter Geoghegan
On Wed, Dec 23, 2015 at 7:48 PM, Peter Geoghegan  wrote:
> I wonder, what's the situation here like with the attached patch
> applied on top of what you were testing? I think that we might be
> better off with more merge steps when under enormous memory pressure
> at the low end, in order to be able to store more tuples per tape (and
> do more sorting using quicksort).

Actually, now that I look into it, I think your 64MB work_mem setting
would have 234 tapes in total, so my patch won't do anything for your
case. Maybe change MAXORDER to 100 within the patch, to see where that
leaves things? I want to see if there is any improvement.

234 tapes means that approximately 5.7MB of memory would go to just
using tapes (for accounting purposes, which is mostly my concern
here). However, for a case like this, where you're well short of being
able to do everything in one pass, there is no benefit to having more
than about 6 tapes (I guess that's probably still true these days).
That 5.7MB of tape space for accounting purposes (and also in reality)
may not only increase the amount of random I/O required, and not only
throw off the memtuples estimate within grow_memtuples() (its balance
against everything else), but also decrease the cache efficiency in
the final on-the-fly merge (the efficiency in accessing tuples).
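
For what it's worth, the "approximately 5.7MB" figure can be
reconstructed from the per-tape accounting constant in tuplesort.c
(the constant is real, but the arithmetic below, and reading "MB" as
millions of bytes, are my own assumptions):

#include <stdio.h>

/* tuplesort.c charges TAPE_BUFFER_OVERHEAD against the memory budget
 * for each tape it creates; BLCKSZ assumed to be the default 8192. */
#define BLCKSZ                8192
#define TAPE_BUFFER_OVERHEAD  (BLCKSZ * 3)

int
main(void)
{
    int     maxTapes = 234;
    long    tapeSpace = (long) maxTapes * TAPE_BUFFER_OVERHEAD;

    printf("%ld bytes (~%.2f million bytes)\n",
           tapeSpace, tapeSpace / 1e6);
    /* prints: 5750784 bytes (~5.75 million bytes) */
    return 0;
}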

-- 
Peter Geoghegan


Re: [HACKERS] Using quicksort for every external sort run

2015-12-22 Thread Robert Haas
On Tue, Dec 22, 2015 at 4:37 PM, Peter Geoghegan  wrote:
>> This looks like voodoo to me.  I assume you tested it and maybe it
>> gives correct answers, but it's got to be some kind of world record
>> for number of arbitrary constants per SLOC, and there's no real
>> justification for any of it.  The comments say, essentially, well, we
>> do this because it works.  But suppose I try it on some new piece of
>> hardware and it doesn't work well.  What do I do?  Email the author
>> and ask him to tweak the arbitrary constants?
>
> That's not fair. DEFAULT_EQ_SEL, DEFAULT_RANGE_INEQ_SEL, and
> DEFAULT_NUM_DISTINCT are each about as arbitrary. We have to do
> something, though.
>
> MaxAllocHugeSize is used fairly arbitrarily in pg_stat_statements.c.
> And that part (the MaxAllocSize part of my patch) only defines a point
> after which we require a really favorable case for replacement
> selection/quicksort with spillover to proceed. It's a safety valve. We
> try to err on the side of not using replacement selection.

Sure, there are arbitrary numbers all over the code, driven by
empirical observations about what factors are important to model.  But
this is not that.  You don't have a thing called seq_page_cost and a
thing called cpu_tuple_cost and then say, well, empirically the ratio
is about 100:1, so let's make the former 1 and the latter 0.01.  You
just have some numbers, and it's not clear what, if anything, they
actually represent.  In the space of 7 lines of code, you introduce 9
nameless constants:

The crossover point is clamped to a minimum of 40% [constant #1] and a
maximum of 85% [constant #2] when the size of the SortTuple array is
no more than MaxAllocSize.  Between those bounds, the crossover point
is 90% [constant #3] minus 7.5% [constant #4] per 32-byte increment
[constant #5] of estimated average tuple size.  On the other hand,
when the total size of the SortTuple array exceeds MaxAllocSize, the
crossover point is either 90% [constant #6] or 95% [constant #7]
depending on whether the average tuple size is greater than 32 bytes
[constant #8].  But if the row count hint is less than 50% [constant
#9] of the rows we've already seen, then we ignore it and do not use
selection.

You make no attempt to justify why any of these numbers are correct,
or what underlying physical reality they represent.  The comment which
describes the manner in which crossover point is computed for
SortTuple volumes under 1GB says "Starting from a threshold of 90%,
refund 7.5% per 32 byte average-size-increment."  That is a precise
restatement of what the code does, but it doesn't attempt to explain
why it's a good idea.  Perhaps the reader should infer that the
crossover point drops as the tuples get bigger, except that in the
over-1GB case, a larger tuple size causes the crossover point to go
*up* while in the under-1GB case, a larger tuple size causes the
crossover point to go *down*.  Concretely, if we're sorting 44,739,242
224-byte tuples, the estimated crossover point is 40%.  If we're
sorting 44,739,243 224-byte tuples, the estimated crossover point is
90%.  That's an extremely sharp discontinuity, and it seems very
unlikely that any real system behaves that way.

I'm prepared to concede that constant #9 - ignoring the input row
estimate if we've already seen twice that many rows - probably doesn't
need a whole lot of justification here, and what justification it does
need is provided by the fact that (we think) replacement selection
only wins when there are going to be less than 2 quicksorted runs.
But the other 8 constants here have to have reasons why they exist,
what they represent, and why they have the values they do, and that
explanation needs to be something that can be understood by people
besides you.  The overall cost model needs some explanation of the
theory of operation, too.

In my opinion, reasoning in terms of a crossover point is a strange
way of approaching the problem.  What would be more typical at least
in our code, and I suspect in general, is do a cost estimate of using
selection and a cost estimate of not using selection and compare them.
Replacement selection has a CPU cost and an I/O cost, each of which is
estimable based on the tuple count, chosen comparator, and expected
I/O volume.  Quicksort has those same costs, in different amounts.  If
those respective costs are accurately estimated, then you can pick the
strategy with the lower cost and expect to win.
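
In other words, something with the shape of this minimal sketch (the
struct and its cost fields are hypothetical, taken neither from any
patch nor from costsize.c):

#include <stdbool.h>

typedef struct
{
    double  cpu_cost;           /* tuple count * per-comparison cost */
    double  io_cost;            /* expected tape I/O volume */
} RunBuildCost;

static bool
replacement_selection_wins(RunBuildCost heap, RunBuildCost quicksort)
{
    /* pick whichever run-building strategy is estimated to be cheaper */
    return (heap.cpu_cost + heap.io_cost) <
           (quicksort.cpu_cost + quicksort.io_cost);
}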

>> I wonder if there's a way to accomplish what you're trying to do here
>> that avoids the need to have a cost model at all.  As I understand it,
>> and please correct me wherever I go off the rails, the situation is:
>>
>> 1. If we're sorting a large amount of data, such that we can't fit it
>> all in memory, we will need to produce a number of sorted runs and
>> then merge those runs.  If we generate each run using a heap with
>> replacement selection, rather than quicksort, we will produce runs
>> that are, on the 

Re: [HACKERS] Using quicksort for every external sort run

2015-12-22 Thread Peter Geoghegan
On Tue, Dec 22, 2015 at 2:57 PM, Robert Haas  wrote:
> On Tue, Dec 22, 2015 at 4:37 PM, Peter Geoghegan  wrote:
>> That's not fair. DEFAULT_EQ_SEL, DEFAULT_RANGE_INEQ_SEL, and
>> DEFAULT_NUM_DISTINCT are each about as arbitrary. We have to do
>> something, though.
>>

> Sure, there are arbitrary numbers all over the code, driven by
> empirical observations about what factors are important to model.  But
> this is not that.  You don't have a thing called seq_page_cost and a
> thing called cpu_tuple_cost and then say, well, empirically the ratio
> is about 100:1, so let's make the former 1 and the latter 0.01.  You
> just have some numbers, and it's not clear what, if anything, they
> actually represent.

What I find difficult to accept about what you say here is that at
*this* level, something like cost_sort() has little to recommend it.
It costs a sort on a text attribute at the same level as a sort of
the same tuples on an int4 attribute (based on the default
cpu_operator_cost for C functions -- without any attempt to
differentiate text and int4).

Prior to 9.5, sorting text took about 5 - 10 times longer than a
similar int4 sort. That's a pretty big difference, and yet I recall no
complaints. The cost of a comparison in a sort can hardly be
considered in isolation, anyway -- cache efficiency is at least as
important.

Of course, the point is that the goal of a cost model is not to
simulate reality as closely as possible -- it's to produce a good
outcome for performance purposes under realistic assumptions.
Realistic assumptions include that you can't hope to account for
certain differences in cost. Avoiding a terrible outcome is very
important, but the worst case for useselection() is no worse than
today's behavior (or a lost opportunity to do better than today's
behavior).

Recently, the paper that was posted to the list about the Postgres
optimizer stated formally what I know I had a good intuitive sense of
for a long time: that better selectivity estimates are much more
important than better cost models in practice. The "empirical
observations" driving something like DEFAULT_EQ_SEL are very weak --
but what are you gonna do?

> The crossover point is clamped to a minimum of 40% [constant #1] and a
> maximum of 85% [constant #2] when the size of the SortTuple array is
> no more than MaxAllocSize.  Between those bounds, the crossover point
> is 90% [constant #3] minus 7.5% [constant #4] per 32-byte increment
> [constant #5] of estimated average tuple size.  On the other hand,
> when the total size of the SortTuple array exceeds MaxAllocSize, the
> crossover point is either 90% [constant #6] or 95% [constant #7]
> depending on whether the average tuple size is greater than 32 bytes
> [constant #8].  But if the row count hint is less than 50% [constant
> #9] of the rows we've already seen, then we ignore it and do not use
> selection.
>
> You make no attempt to justify why any of these numbers are correct,
> or what underlying physical reality they represent.

Just like selfuncs.h for the most part, then.

> The comment which
> describes the manner in which crossover point is computed for
> SortTuple volumes under 1GB says "Starting from a threshold of 90%,
> refund 7.5% per 32 byte average-size-increment."  That is a precise
> restatement of what the code does, but it doesn't attempt to explain
> why it's a good idea.  Perhaps the reader should infer that the
> crossover point drops as the tuples get bigger, except that in the
> over-1GB case, a larger tuple size causes the crossover point to go
> *up* while in the under-1GB case, a larger tuple size causes the
> crossover point to go *down*.  Concretely, if we're sorting 44,739,242
> 224-byte tuples, the estimated crossover point is 40%.  If we're
> sorting 44,739,243 224-byte tuples, the estimated crossover point is
> 90%.  That's an extremely sharp discontinuity, and it seems very
> unlikely that any real system behaves that way.

Again, the goal of the cost model is not to model reality as such.
This cost model is conservative about using replacement selection. It
makes sense when you consider that there tend to be relatively few
external sorts on a realistic workload -- if we can cut that number in
half, which seems quite possible, that's pretty good, especially from
a DBA's practical perspective. I want to buffer DBAs against suddenly
incurring more I/O, but with only minimal exposure to the risk of a
far longer sort for the first run. The cost
model weighs the cost of the hint being wrong to some degree (which is
indeed novel). I think it makes sense in light of the cost and
benefits in this case, although I will add that I'm not entirely
comfortable with it. I just don't imagine that there is a solution
that I will be fully comfortable with. There may be one that
superficially looks correct, but I see little point in that.

> I'm prepared to concede that constant #9 - 

Re: [HACKERS] Using quicksort for every external sort run

2015-12-22 Thread Peter Geoghegan
On Tue, Dec 22, 2015 at 9:10 AM, Robert Haas  wrote:
> So I was looking at the 0001 patch

Thanks. I'm going to produce a revision of 0002 shortly, so perhaps
hold off on that one. The big change there will be to call
grow_memtuples() to allow us to increase the number of slots without
palloc() overhead spuriously being weighed (since the memory for the
final on-the-fly merge phase doesn't have palloc() overhead). Also,
I will incorporate what you and Jeff wanted around terminology.

> This looks like voodoo to me.  I assume you tested it and maybe it
> gives correct answers, but it's got to be some kind of world record
> for number of arbitrary constants per SLOC, and there's no real
> justification for any of it.  The comments say, essentially, well, we
> do this because it works.  But suppose I try it on some new piece of
> hardware and it doesn't work well.  What do I do?  Email the author
> and ask him to tweak the arbitrary constants?

That's not fair. DEFAULT_EQ_SEL, DEFAULT_RANGE_INEQ_SEL, and
DEFAULT_NUM_DISTINCT are each about as arbitrary. We have to do
something, though.

MaxAllocHugeSize is used fairly arbitrarily in pg_stat_statements.c.
And that part (the MaxAllocSize part of my patch) only defines a point
after which we require a really favorable case for replacement
selection/quicksort with spillover to proceed. It's a safety valve. We
try to err on the side of not using replacement selection.

> I wonder if there's a way to accomplish what you're trying to do here
> that avoids the need to have a cost model at all.  As I understand it,
> and please correct me wherever I go off the rails, the situation is:
>
> 1. If we're sorting a large amount of data, such that we can't fit it
> all in memory, we will need to produce a number of sorted runs and
> then merge those runs.  If we generate each run using a heap with
> replacement selection, rather than quicksort, we will produce runs
> that are, on the average, about twice as long, which means that we
> will have fewer runs to merge at the end.
>
> 2. Replacement selection is slower than quicksort on a per-tuple
> basis.  Furthermore, merging more runs isn't necessarily any slower
> than merging fewer runs.  Therefore, building runs via replacement
> selection tends to lose even though it tends to reduce the number of
> runs to merge.  Even when having a larger number of runs results in an
> increase in the number merge passes, we save so much time building the
> runs that we often (maybe not always) still come out ahead.

I'm with you so far. I'll only add: doing multiple passes ought to be
very rare anyway.

> 3. However, when replacement selection would result in a single run,
> and quicksort results in multiple runs, using quicksort loses.  This
> is especially true when we the amount of data we have is between one
> and two times work_mem.  If we fit everything into one run, we do not
> need to write any data to tape, but if we overflow by even a single
> tuple, we have to write a lot of data to tape.

No, this is where you lose me. I think that it's basically not true
that replacement selection can ever be faster than quicksort, even in
the cases where the conventional wisdom would have you believe so
(e.g. what you say here). Unless you have very little memory relative
to data size, or something along those lines. The conventional wisdom
obviously needs some revision, but it was perfectly correct in the
1970s and 1980s.

However, where replacement selection can still help is avoiding I/O
*entirely*. If we can avoid spilling 95% of tuples in the first place,
and quicksort the remaining (heapified) tuples that were not spilled,
and merge an in-memory run with an on-tape run, then we can win big.
Quicksort is not amenable to incremental spilling at all. I call this
"quicksort with spillover" (it is a secondary optimization that the
patch adds). This shows up in EXPLAIN ANALYZE, and avoids a stark
discontinuity in the cost function of sorts. That could really help
with admission control, and simplifying the optimizer, making merge
joins less scary. So with the patch, "quicksort with spillover" and
"replacement selection" are almost synonymous, except that we
acknowledge the historic importance of replacement selection to some
degree. The patch completely discards the conventional use of
replacement selection -- it just preserves its priority queue (heap)
implementation where incrementalism is thought to be particularly
useful (avoiding I/O entirely).
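
To illustrate, here is a toy model of "quicksort with spillover" on
plain longs, with arrays standing in for tapes -- a sketch of my own,
under one big simplifying assumption: no incoming value sorts below
the last value spilled, i.e. the single-run case where the technique
applies at all:

#include <stdio.h>
#include <stdlib.h>

#define MEMSLOTS 4              /* toy stand-in for the memtuples array */

/* restore the min-heap invariant below node i */
static void
sift_down(long *heap, int n, int i)
{
    for (;;)
    {
        int     left = 2 * i + 1, right = left + 1, min = i;

        if (left < n && heap[left] < heap[min])
            min = left;
        if (right < n && heap[right] < heap[min])
            min = right;
        if (min == i)
            break;
        long    tmp = heap[i]; heap[i] = heap[min]; heap[min] = tmp;
        i = min;
    }
}

static int
cmp_long(const void *a, const void *b)
{
    long    x = *(const long *) a, y = *(const long *) b;

    return (x > y) - (x < y);
}

int
main(void)
{
    long    input[] = {9, 2, 11, 5, 6, 14, 7, 13};
    long    mem[MEMSLOTS], tape[8], output[8];
    int     ninput = 8, ntape = 0, t = 0, m = 0;

    /* "memory" fills up after MEMSLOTS tuples; heapify what we have */
    for (int i = 0; i < MEMSLOTS; i++)
        mem[i] = input[i];
    for (int i = MEMSLOTS / 2 - 1; i >= 0; i--)
        sift_down(mem, MEMSLOTS, i);

    /* spill incrementally: write the current minimum, reuse its slot */
    for (int i = MEMSLOTS; i < ninput; i++)
    {
        tape[ntape++] = mem[0];
        mem[0] = input[i];
        sift_down(mem, MEMSLOTS, 0);
    }

    /* quicksort the tuples that never spilled ... */
    qsort(mem, MEMSLOTS, sizeof(long), cmp_long);

    /* ... then merge the on-"tape" run with the in-memory run */
    for (int i = 0; i < ninput; i++)
        output[i] = (t < ntape && (m >= MEMSLOTS || tape[t] <= mem[m])) ?
            tape[t++] : mem[m++];

    for (int i = 0; i < ninput; i++)
        printf("%ld ", output[i]);      /* prints: 2 5 6 7 9 11 13 14 */
    printf("\n");
    return 0;
}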

But this comparison has nothing to do with comparing the master branch
with my patch, since the master branch never attempts to avoid I/O
having committed to an external sort. It uses replacement selection in
a way that is consistent with the conventional wisdom, wisdom which
has now been shown to be obsolete.

BTW, I think that abandoning incrementalism (replacement selection)
will have future benefits for memory management. I bet we can get away
with one big palloc() 

Re: [HACKERS] Using quicksort for every external sort run

2015-12-22 Thread Robert Haas
On Sun, Dec 6, 2015 at 7:25 PM, Peter Geoghegan  wrote:
> On Tue, Nov 24, 2015 at 4:33 PM, Peter Geoghegan  wrote:
>> So, the bottom line is: This patch seems very good, is unlikely to
>> have any notable downside (no case has been shown to be regressed),
>> but has yet to receive code review. I am working on a new version with
>> the first two commits consolidated, and better comments, but that will
>> have the same code, unless I find bugs or am dissatisfied. It mostly
>> needs thorough code review, and to a lesser extent some more
>> performance testing.
>
> I'm currently spending a lot of time working on parallel CREATE INDEX.
> I should not delay posting a new version of my patch series any
> further, though. I hope to polish up parallel CREATE INDEX to be able
> to show people something in a couple of weeks.
>
> This version features consolidated commits, the removal of the
> multipass_warning parameter, and improved comments and commit
> messages. It has almost entirely unchanged functionality.
>
> The only functional changes are:
>
> * The function useselection() is taught to distrust an obviously bogus
> caller reltuples hint (when it's already less than half of what we
> know to be the minimum number of tuples that the sort must sort,
> immediately after LACKMEM() first becomes true -- this is probably a
> generic estimate).
>
> * Prefetching only occurs when writing tuples. Explicit prefetching
> appears to hurt in some cases, as David Rowley has shown over on the
> dedicated thread. But it might still be that writing tuples is a case
> that is simple enough to benefit consistently, due to the relatively
> uniform processing that memory latency can hide behind for that case
> (before, the same prefetching instructions were used for CREATE INDEX
> and for aggregates, for example).
>
> Maybe we should consider trying to get patch 0002 (the memory
> pool/merge patch) committed first, something Greg Stark suggested
> privately. That might actually be an easier way of integrating this
> work, since it changes nothing about the algorithm we use for merging
> (it only improves memory locality), and so is really an independent
> piece of work (albeit one that makes a huge overall difference due to
> the other patches increasing the time spent merging in absolute terms,
> and especially as a proportion of the total).

So I was looking at the 0001 patch and came across this code:

+    /*
+     * Crossover point is somewhere between where memtuples is between 40%
+     * and all-but-one of total tuples to sort.  This weighs approximate
+     * savings in I/O, against generic heap sorting cost.
+     */
+    avgTupleSize = (double) memNowUsed / (double) state->memtupsize;
+
+    /*
+     * Starting from a threshold of 90%, refund 7.5% per 32 byte
+     * average-size-increment.
+     */
+    increments = MAXALIGN_DOWN((int) avgTupleSize) / 32;
+    crossover = 0.90 - (increments * 0.075);
+
+    /*
+     * Clamp, making either outcome possible regardless of average size.
+     *
+     * 40% is about the minimum point at which "quicksort with spillover"
+     * can still occur without a logical/physical correlation.
+     */
+    crossover = Max(0.40, Min(crossover, 0.85));
+
+    /*
+     * The point where the overhead of maintaining the heap invariant is
+     * likely to dominate over any saving in I/O is somewhat arbitrarily
+     * assumed to be the point where memtuples' size exceeds MaxAllocSize
+     * (note that overall memory consumption may be far greater).  Past
+     * this point, only the most compelling cases use replacement selection
+     * for their first run.
+     *
+     * This is not about cache characteristics so much as the O(n log n)
+     * cost of sorting larger runs dominating over the O(n) cost of
+     * writing/reading tuples.
+     */
+    if (sizeof(SortTuple) * state->memtupcount > MaxAllocSize)
+        crossover = avgTupleSize > 32 ? 0.90 : 0.95;

This looks like voodoo to me.  I assume you tested it and maybe it
gives correct answers, but it's got to be some kind of world record
for number of arbitrary constants per SLOC, and there's no real
justification for any of it.  The comments say, essentially, well, we
do this because it works.  But suppose I try it on some new piece of
hardware and it doesn't work well.  What do I do?  Email the author
and ask him to tweak the arbitrary constants?

The dependency on MaxAllocSize seems utterly bizarre to me.  If we
decide to modify our TOAST infrastructure so that we support datums up
to 2GB in size, or alternatively datums of up to only 512MB in size,
do you expect that to change the behavior of tuplesort.c?  I bet not,
but that's a major reason why MaxAllocSize is defined the way it is.

I wonder if there's a way to accomplish what you're trying to do here
that avoids the need to have a cost model at all.  As I understand it,
and please correct me wherever I go off the rails, the situation is:

Re: [HACKERS] Using quicksort for every external sort run

2015-12-18 Thread Robert Haas
On Sat, Dec 12, 2015 at 5:28 PM, Peter Geoghegan  wrote:
> On Sat, Dec 12, 2015 at 12:10 AM, Jeff Janes  wrote:
>> I have a question about the terminology used in this patch.  What is a
>> tuple proper?  What is it in contradistinction to?  I would think that
>> a tuple which is located in its own palloc'ed space is the "proper"
>> one, leaving a tuple allocated in the bulk memory pool to be
>> called...something else.  I don't know what the
>> non-judgmental-sounding antonym of postpositive "proper" is.
>
> "Tuple proper" is a term that appears 5 times in tuplesort.c today. As
> it says at the top of that file:
>
> /*
>  * The objects we actually sort are SortTuple structs.  These contain
>  * a pointer to the tuple proper (might be a MinimalTuple or IndexTuple),
>  * which is a separate palloc chunk --- we assume it is just one chunk and
>  * can be freed by a simple pfree().  SortTuples also contain the tuple's
>  * first key column in Datum/nullflag format, and an index integer.

I see only three.  In each case, "the tuple proper" could be replaced
by "the tuple itself" or "the actual tuple" without changing the
meaning, at least according to my understanding of the meaning.  If
that's causing confusion, perhaps we should just change the existing
wording.

Anyway, I agree with Jeff that this terminology shouldn't creep into
function and structure member names.

I don't really like the term "memory pool" either.  We're growing a
bunch of little special-purpose allocators all over the code base
because of palloc's somewhat dubious performance and memory usage
characteristics, but if any of those are referred to as memory pools
it has thus far escaped my notice.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Using quicksort for every external sort run

2015-12-18 Thread Peter Geoghegan
On Fri, Dec 18, 2015 at 10:12 AM, Robert Haas  wrote:
> I don't really like the term "memory pool" either.  We're growing a
> bunch of little special-purpose allocators all over the code base
> because of palloc's somewhat dubious performance and memory usage
> characteristics, but if any of those are referred to as memory pools
> it has thus far escaped my notice.

BTW, I'm not necessarily determined to make the new special-purpose
allocator work exactly as proposed. It seemed useful to prioritize
simplicity, and so currently there is one big "huge palloc()" with
which we blow our memory budget, and that's it. However, I could
probably be more clever about "freeing ranges" initially preserved for
a now-exhausted tape. That kind of thing.

With the on-the-fly merge memory patch, I'm improving locality of
access (for each "tuple proper"/"tuple itself"). If I also happen to
improve the situation around palloc() fragmentation at the same time,
then so much the better, but that's clearly secondary.

-- 
Peter Geoghegan


Re: [HACKERS] Using quicksort for every external sort run

2015-12-18 Thread Peter Geoghegan
On Fri, Dec 18, 2015 at 10:12 AM, Robert Haas  wrote:
> Anyway, I agree with Jeff that this terminology shouldn't creep into
> function and structure member names.

Okay.

> I don't really like the term "memory pool" either.  We're growing a
> bunch of little special-purpose allocators all over the code base
> because of palloc's somewhat dubious performance and memory usage
> characteristics, but if any of those are referred to as memory pools
> it has thus far escaped my notice.

It's a widely accepted term: https://en.wikipedia.org/wiki/Memory_pool

But, sure, I'm not attached to it.

-- 
Peter Geoghegan


Re: [HACKERS] Using quicksort for every external sort run

2015-12-18 Thread Peter Geoghegan
On Fri, Dec 18, 2015 at 12:50 PM, Robert Haas  wrote:
>> BTW, I'm not necessarily determined to make the new special-purpose
>> allocator work exactly as proposed. It seemed useful to prioritize
>> simplicity, and so currently there is one big "huge palloc()" with
>> which we blow our memory budget, and that's it. However, I could
>> probably be more clever about "freeing ranges" initially preserved for
>> a now-exhausted tape. That kind of thing.
>
> What about the case where we think that there will be a lot of data
> and have a lot of work_mem available, but then the user sends us 4
> rows because of some mis-estimation?

The memory patch only changes the final on-the-fly merge phase. There
is no estimate involved there.

I continue to use whatever "slots" (memtuples) are available for the
final on-the-fly merge. However, I allocate all remaining memory that
I have budget for at once. My remarks about the efficient use of that
memory was only really about each tape's use of their part of that
over time.

Again, to emphasize, this is only for the final on-the-fly merge phase.

>> With the on-the-fly merge memory patch, I'm improving locality of
>> access (for each "tuple proper"/"tuple itself"). If I also happen to
>> improve the situation around palloc() fragmentation at the same time,
>> then so much the better, but that's clearly secondary.
>
> I don't really understand this comment.

I just mean that I wrote the memory patch with memory locality in
mind, not palloc() fragmentation or other overhead.

-- 
Peter Geoghegan


Re: [HACKERS] Using quicksort for every external sort run

2015-12-18 Thread Robert Haas
On Fri, Dec 18, 2015 at 2:57 PM, Peter Geoghegan  wrote:
> On Fri, Dec 18, 2015 at 10:12 AM, Robert Haas  wrote:
>> I don't really like the term "memory pool" either.  We're growing a
>> bunch of little special-purpose allocators all over the code base
>> because of palloc's somewhat dubious performance and memory usage
>> characteristics, but if any of those are referred to as memory pools
>> it has thus far escaped my notice.
>
> BTW, I'm not necessarily determined to make the new special-purpose
> allocator work exactly as proposed. It seemed useful to prioritize
> simplicity, and so currently there is one big "huge palloc()" with
> which we blow our memory budget, and that's it. However, I could
> probably be more clever about "freeing ranges" initially preserved for
> a now-exhausted tape. That kind of thing.

What about the case where we think that there will be a lot of data
and have a lot of work_mem available, but then the user sends us 4
rows because of some mis-estimation?

> With the on-the-fly merge memory patch, I'm improving locality of
> access (for each "tuple proper"/"tuple itself"). If I also happen to
> improve the situation around palloc() fragmentation at the same time,
> then so much the better, but that's clearly secondary.

I don't really understand this comment.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Using quicksort for every external sort run

2015-12-14 Thread Greg Stark
I ran sorts with various parameters on my small NAS server. This is a
fairly slow CPU and limited memory machine with lots of disk so I thought
it would actually make a good test case for smaller servers. The following
is the speedup (for values < 100%) or slowdown (values > 100%) for the
first patch only, the "quicksort all runs" without the extra memory
optimizations.

At first glance there's a clear pattern: the extra runs do cause a
slowdown whenever they cause more polyphase merges, which is bad news. But on
further inspection, look just how low work_mem had to be to have a
significant effect. Only the 4MB and 8MB work_mem cases were significantly
impacted, and only when sorting over a GB of data (which was 2.7 - 7GB with
the tuple overhead). The savings when work_mem was 64MB or 128MB were
substantial.

Table Size   Sort Size   128MB   64MB   32MB   16MB   8MB    4MB
6914MB       2672 MB     64%     70%    93%    110%   133%   137%
3457MB       1336 MB     64%     67%    90%    92%    137%   120%
2765MB       1069 MB     68%     66%    84%    95%    111%   137%
1383MB       535 MB      66%     70%    72%    92%    99%    96%
691MB        267 MB      65%     69%    70%    86%    99%    98%
346MB        134 MB      65%     69%    73%    67%    90%    87%

The raw numbers in seconds. I've only run the test once so far on the NAS
and there are some other things running on it so I really should rerun it a
few more times at least.

HEAD:
Table Size   Sort Size   128MB     64MB     32MB     16MB     8MB      4MB
6914MB       2672 MB     1068.07   963.23   1041.94  1246.54  1654.35  2472.79
3457MB       1336 MB     529.34    482.3    450.77   555.76   657.34   1027.57
2765MB       1069 MB     404.02    394.36   348.31   414.48   507.38   657.17
1383MB       535 MB      196.48    194.26   173.48   182.57   214.42   258.05
691MB        267 MB      95.93     93.79    87.73    80.4     93.67    105.24
346MB        134 MB      45.6      44.24    42.39    44.22    46.17    49.85

With the quicksort patch:
Table Size   Sort Size   128MB   64MB    32MB    16MB    8MB     4MB
6914MB       2672 MB     683.6   679.0   969.4   1366.2  2193.6  3379.3
3457MB       1336 MB     339.1   325.1   404.9   509.8   902.2   1229.1
2765MB       1069 MB     275.3   260.1   292.4   395.4   561.9   898.7
1383MB       535 MB      129.9   136.4   124.6   167.5   213.2   247.1
691MB        267 MB      62.3    64.3    61.4    69.2    92.3    103.2
346MB        134 MB      29.8    30.7    30.9    29.4    41.6    43.4


Re: [HACKERS] Using quicksort for every external sort run

2015-12-14 Thread Peter Geoghegan
On Mon, Dec 14, 2015 at 6:58 PM, Greg Stark  wrote:
> I ran sorts with various parameters on my small NAS server.

...

> without the extra memory optimizations.

Thanks for taking the time to benchmark the patch!

While I think it's perfectly fair that you didn't apply the final
on-the-fly merge "memory pool" patch, I also think that it's quite
possible that the regression you see at the very low end would be
significantly ameliorated or even eliminated by applying that patch,
too. After all, Jeff Janes had a much harder time finding a
regression, probably because he benchmarked all patches together.

-- 
Peter Geoghegan


Re: [HACKERS] Using quicksort for every external sort run

2015-12-14 Thread Peter Geoghegan
On Mon, Dec 14, 2015 at 7:22 PM, Peter Geoghegan  wrote:
> Thanks for taking the time to benchmark the patch!

Also, I should point out that you didn't add work_mem past the point
where the master branch will get slower, while the patch continues to
get faster. This seems to happen fairly reliably, certainly if
work_mem is sized at about 1GB, and often at lower settings. With the
POWER7 "Hydra" server, external sorting for a CREATE INDEX operation
could put any possible maintenance_work_mem setting to good use -- my
test case got faster with a 15GB maintenance_work_mem setting (the
server has 64GB of RAM). I think I tried 25GB as a
maintenance_work_mem setting next, but started to get OOM errors at
that point.

Again, I point this out because I want to account for why my numbers
were better (for the benefit of other people -- I think you get this,
and are being fair).

-- 
Peter Geoghegan


Re: [HACKERS] Using quicksort for every external sort run

2015-12-13 Thread Jeff Janes
On Sun, Dec 13, 2015 at 3:40 PM, Peter Geoghegan  wrote:
> On Sat, Dec 12, 2015 at 4:41 PM, Jeff Janes  wrote:
>
>>>> Also, if I am reading this correctly, when we refill a pool from a
>>>> logical tape we still transform each tuple as it is read from the disk
>>>> format to the memory format.  This inflates the size quite a bit, at
>>>> least for single-datum tuples.  If we instead just read the disk
>>>> format directly into the pool, and converted them into the in-memory
>>>> format when each tuple came due for the merge heap, would that destroy
>>>> the locality of reference you are seeking to gain?
>>>
>>> Are you talking about alignment?
>>
>> Maybe alignment, but also the size of the SortTuple struct itself,
>> which is not present on tape but is present in memory if I understand
>> correctly.
>>
>> When reading 128kb (32 blocks) worth of in-memory pool, it seems like
>> it only gets to read 16 to 18 blocks of tape to fill them up, in the
>> case of building an index on single column 32-byte random md5 digests.
>> I don't exactly know where all of that space goes, I'm taking an
>> experimentalist approach.
>
> I'm confused.
>
> readtup_datum(), just like every other READTUP() variant, has the new
> function tupproperalloc() as a drop-in replacement for the master
> branch palloc() + USEMEM() calls.

Right, I'm not comparing what your patch does to what the existing
code does.  I'm comparing it to what it could be doing.  Only call
READTUP when you need to go from the pool to the heap, not when you
need to go from tape to the pool.  If you store the data in the pool
the same way they are stored on tape, then we no longer need memtuples
at all.  There is already a "mergetupcur" per tape pointing to the
first tuple of the tape, and since the tuples are now stored
contiguously, that is all that is needed; once you are done with one
tuple, the pointer is left pointing at the next one.
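
A sketch of the kind of per-tape cursor being described (entirely
hypothetical -- nothing in the patch series works this way; the
length-word framing is only meant to mirror how the WRITETUP()/
READTUP() routines frame tuples on tape):

#include <stddef.h>
#include <string.h>

/* Hypothetical cursor over one tape's refill buffer, which holds
 * tuples in their on-tape format; a SortTuple would only be built for
 * the tuple at the cursor once it reaches the top of the merge heap. */
typedef struct
{
    char   *pos;            /* next unread on-tape tuple in the buffer */
    char   *end;            /* end of the currently buffered region */
} TapeCursor;

static char *
next_tape_tuple(TapeCursor *cur, unsigned *len)
{
    char   *tup;

    if (cur->pos >= cur->end)
        return NULL;        /* buffer exhausted; refill from tape */
    memcpy(len, cur->pos, sizeof(unsigned));
    tup = cur->pos + sizeof(unsigned);  /* payload follows length word */
    cur->pos = tup + *len;  /* cursor now rests on the next tuple */
    return tup;
}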

The reason for memtuples is to handle random access.  Since we are no
longer doing random access, we no longer need it.

We could free memtuples, re-allocate just enough to form the binary
heap for the N-way merge, and use all the rest of that space (which
could be a significant fraction of work_mem) as part of the new pool.


Cheers,

Jeff


Re: [HACKERS] Using quicksort for every external sort run

2015-12-13 Thread Peter Geoghegan
On Sat, Dec 12, 2015 at 4:41 PM, Jeff Janes  wrote:
> Those usages make sense to me, as they are locally self-contained and
> it is clear what they are in contradistinction to.   But your usage is
> spread throughout (even in function names, not just comments) and
> seems to contradict the current usage as yours are not separately
> palloced, as the "proper" ones described here are.  I think that
> "proper" only works when the same comment also defines the
> alternative, rather than as some file-global description.  Maybe
> "pooltuple" rather than "tupleproper"

I don't think of it that way. The "tuple proper" is the thing that the
client passes to their tuplesort -- the thing they are actually
interested in having sorted. Like an IndexTuple for CREATE INDEX
callers, for example. SortTuple is just an internal implementation
detail. (That appears all over the file tuplesort.c, just as my new
references to "tuple proper" do. But neither appear elsewhere.)

>>> Also, if I am reading this correctly, when we refill a pool from a
>>> logical tape we still transform each tuple as it is read from the disk
>>> format to the memory format.  This inflates the size quite a bit, at
>>> least for single-datum tuples.  If we instead just read the disk
>>> format directly into the pool, and converted them into the in-memory
>>> format when each tuple came due for the merge heap, would that destroy
>>> the locality of reference you are seeking to gain?
>>
>> Are you talking about alignment?
>
> Maybe alignment, but also the size of the SortTuple struct itself,
> which is not present on tape but is present in memory if I understand
> correctly.
>
> When reading 128kb (32 blocks) worth of in-memory pool, it seems like
> it only gets to read 16 to 18 blocks of tape to fill them up, in the
> case of building an index on single column 32-byte random md5 digests.
> I don't exactly know where all of that space goes, I'm taking an
> experimentalist approach.

I'm confused.

readtup_datum(), just like every other READTUP() variant, has the new
function tupproperalloc() as a drop-in replacement for the master
branch palloc() + USEMEM() calls.

It is true that tupproperalloc() (and a couple of other places
relating to preloading) know *a little* about the usage pattern --
tupproperalloc() accepts a "tape number" argument to know what
partition within the large pool/buffer to use for each logical
allocation. However, from the point of view of correctness,
tupproperalloc() should function as a drop-in replacement for palloc()
+ USEMEM() calls in the context of the various READTUP() routines.

I have done nothing special with any particular READTUP() routine,
including readtup_datum() (all READTUP() routines have received the
same treatment). Nothing else was changed in those routines, including
how tuples are stored on tape. The datum case does kind of store the
SortTuples on tape today in one very limited sense, which is that the
length is stored fairly naively (that's already available from the
IndexTuple in the case of writetup_index(), for example, but length
must be stored explicitly for the datum case).

My guess is that your confusion comes from the fact that the memtuples
array (the array of SortTuple) is also factored in to memory
accounting, but that grows at geometric intervals, whereas the
existing READTUP() retail palloc() calls (and their USEMEM() memory
accounting calls) occur in drips and drabs. It's probably the case
that the sizing of the memtuples array and the amount of memory we use
for that, rather than for retail palloc()/"tuple proper" memory, is
kind of arbitrary (why should the needs be the same when SortTuples are
merge step "slots"?), but I don't think that's the biggest problem in
this general area at all.

-- 
Peter Geoghegan


Re: [HACKERS] Using quicksort for every external sort run

2015-12-13 Thread Jeff Janes
On Tue, Dec 8, 2015 at 6:39 PM, Greg Stark  wrote:
> On Wed, Dec 9, 2015 at 12:02 AM, Jeff Janes  wrote:
>>
>>
>> Then in the next (final) merge, it has to read in this huge
>> fragmented tape run emulation, generating a lot of random IO to read
>> it.
>
> This seems fairly plausible. Logtape.c is basically implementing a
> small filesystem and doesn't really make any attempt to avoid
> fragmentation. The reason it does this is so that we can reuse blocks
> and avoid needing to store 2x disk space for the temporary space. I
> wonder if we're no longer concerned about keeping the number of tapes
> down if it makes sense to give up on this goal too and just write out
> separate files for each tape letting the filesystem avoid
> fragmentation. I suspect it would also be better for filesystems like
> ZFS and SSDs where rewriting blocks can be expensive.

During my testing I actually ran into space problems, where the index
I was building and the temp files used to do the sort for it could not
coexist, and I was wondering if there wasn't a way to free up some of
those temp files as the index was growing. So I don't think we want to
throw caution to the wind here.

(Also, I think it does make *some* attempt to reduce fragmentation,
but it could probably do more.)

>
>
>> With the patched code, the average length of reads on files in
>> pgsql_tmp between lseeks or changing to a different file descriptor is
>> 8, while in the unpatched code it is 14.
>
> I don't think Peter did anything to the scheduling of the merges so I
> don't see how this would be different. It might just have hit a
> preexisting case by changing the number and size of tapes.

Correct.  (There was a small additional increase with the memory pool,
but it was small enough that I am not worried about it).  But, this
changing number and size of tapes was exactly what Robert was worried
about, so I don't want to just dismiss it without further
investigation.

>
> I also don't think the tapes really ought to be so unbalanced. I've
> noticed some odd things myself -- like what does a 1-way merge mean
> here?

I noticed some of those (although in my case they were always the
first merges which were one-way) and I just attributed it to the fact
that the algorithm doesn't know how many runs there will be up front,
and so can't optimally distribute them among the tapes.

But it does occur to me that we are taking the tape analogy rather too
far in that case.  We could say that we have only 223 tape *drives*,
but that each run is a separate tape which can be remounted amongst
the drives in any combination, as long as only 223 are active at one
time.

I started looking into this at one time, before I got sidetracked on
the fact that the memory usage pattern would often leave a few bytes
less than half of work_mem completely unused.  Once that memory usage
got fixed, I never returned to the original examination.  And it would
be a shame to sink more time into it now, when we are trying to avoid
these polyphase merges altogether.

So, is a sometimes-regression at 64MB really a blocker to substantial
improvement most of the time at 64MB, and even more so at more
realistic modern settings for large index building?


> Fwiw attached are two patches for perusal. One is a trivial patch to
> add the size of the tape to trace_sort output. I guess I'll just apply
> that without discussion.

+1 there.  Having this in place would make evaluating the other things
be easier.

Cheers,

Jeff


Re: [HACKERS] Using quicksort for every external sort run

2015-12-13 Thread Peter Geoghegan
On Sun, Dec 13, 2015 at 7:31 PM, Jeff Janes  wrote:
> The reason for memtuples is to handle random access.  Since we are no
> longer doing random access, we no longer need it.
>
> We could free memtuples, re-allocate just enough to form the binary
> heap for the N-way merge, and use all the rest of that space (which
> could be a significant fraction of work_mem) as part of the new pool.

Oh, you're talking about having the final on-the-fly merge use a
tuplestore-style array of pointers to "tuple proper" memory (this was
how tuplesort.c worked in all cases about 15 years ago, actually).

I thought about that. It's not obvious how we'd do without
SortTuple.tupindex during the merge phase, since it sometimes
represents an offset into memtuples (the SortTuple array). See "free
list" management within mergeonerun().

-- 
Peter Geoghegan


Re: [HACKERS] Using quicksort for every external sort run

2015-12-12 Thread Jeff Janes
On Sun, Dec 6, 2015 at 4:25 PM, Peter Geoghegan  wrote:

> Maybe we should consider trying to get patch 0002 (the memory
> pool/merge patch) committed first, something Greg Stark suggested
> privately. That might actually be an easier way of integrating this
> work, since it changes nothing about the algorithm we use for merging
> (it only improves memory locality), and so is really an independent
> piece of work (albeit one that makes a huge overall difference due to
> the other patches increasing the time spent merging in absolute terms,
> and especially as a proportion of the total).

I have a question about the terminology used in this patch.  What is a
tuple proper?  What is it in contradistinction to?  I would think that
a tuple which is located in its own palloc'ed space is the "proper"
one, leaving a tuple allocated in the bulk memory pool to be
called...something else.  I don't know what the
non-judgmental-sounding antonym of postpositive "proper" is.

Also, if I am reading this correctly, when we refill a pool from a
logical tape we still transform each tuple as it is read from the disk
format to the memory format.  This inflates the size quite a bit, at
least for single-datum tuples.  If we instead just read the disk
format directly into the pool, and converted them into the in-memory
format when each tuple came due for the merge heap, would that destroy
the locality of reference you are seeking to gain?

Cheers,

Jeff


Re: [HACKERS] Using quicksort for every external sort run

2015-12-12 Thread Ants Aasma
On Sat, Dec 12, 2015 at 12:41 AM, Greg Stark  wrote:
> On Wed, Dec 9, 2015 at 2:44 AM, Peter Geoghegan  wrote:
>> On Tue, Dec 8, 2015 at 6:39 PM, Greg Stark  wrote:
>>
>> I guess you mean insertion sort. What's the theoretical justification
>> for the change?
>
> Well my thinking was that hard coding a series of comparisons would be
> faster than a loop doing an O(n^2) algorithm even for small constants.
> And sort networks are perfect for hard coded sorts because they do the
> same comparisons regardless of the results of previous comparisons so
> there are no branches. And even better the comparisons are as much as
> possible independent of each other -- sort networks are typically
> measured by the depth which assumes any comparisons between disjoint
> pairs can be done in parallel. Even if it's implemented in serial the
> processor is probably parallelizing some of the work.
>
> So I implemented a quick benchmark outside Postgres based on sorting
> actual SortTuples with datum1 defined to be random 64-bit integers (no
> nulls). Indeed the sort networks perform faster on average despite
> doing more comparisons. That makes me think the cpu is indeed doing
> some of the work in parallel.

The open coded version you shared bloats the code by 37kB, and I'm
not sure it is pulling its weight, especially given relatively heavy
comparators. A quick index creation test on int4's profiled with perf
shows about 3% of CPU being spent in the code being replaced. Any
improvement on that is going to be too small to easily quantify.

Since the open coding doesn't help with eliminating control flow
dependencies, my idea is to encode the sort network comparison
order in an array and use that to drive a simple loop. The code size
would be pretty similar to insertion sort, and the loop overhead should
mostly be hidden by the CPU OoO machinery. Probably won't help much,
but it would be interesting and simple enough to try out. Can you share
your code for the benchmark so I can try it out?
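
Something like this minimal standalone sketch, on plain longs (the
net5 comparator order is copied from the open-coded "case 5" in the
benchmark attached downthread; everything else here is hypothetical):

#include <stdio.h>

/* a sorting network is just a fixed list of compare/exchange pairs */
typedef struct
{
    unsigned char a, b;
} CmpPair;

/* 5-element network, same pair order as the open-coded version */
static const CmpPair net5[] = {
    {1, 2}, {3, 4}, {1, 3}, {0, 2}, {2, 4},
    {0, 3}, {0, 1}, {2, 3}, {1, 2}
};

static void
net_sort(long *v, const CmpPair *net, int ncmp)
{
    for (int k = 0; k < ncmp; k++)
    {
        int     a = net[k].a, b = net[k].b;

        if (v[a] > v[b])        /* stand-in for ApplySortComparator() */
        {
            long    tmp = v[a];

            v[a] = v[b];
            v[b] = tmp;
        }
    }
}

int
main(void)
{
    long    v[5] = {42, 7, 19, 3, 25};

    net_sort(v, net5, (int) (sizeof(net5) / sizeof(net5[0])));
    for (int i = 0; i < 5; i++)
        printf("%ld ", v[i]);   /* prints: 3 7 19 25 42 */
    printf("\n");
    return 0;
}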

Regards,
Ants Aasma


Re: [HACKERS] Using quicksort for every external sort run

2015-12-12 Thread Greg Stark
On Sat, Dec 12, 2015 at 7:42 PM, Ants Aasma  wrote:
> As the open coding doesn't help with eliminating control flow
> dependencies, so my idea is to encode the sort network comparison
> order in an array and use that to drive a simple loop. The code size
> would be pretty similar to insertion sort and the loop overhead should
> mostly be hidden by the CPU OoO machinery. Probably won't help much,
> but would be interesting and simple enough to try out. Can you share
> you code for the benchmark so I can try it out?

I can. But the further results showing the number of comparisons is
higher than for insertion sort have dampened my enthusiasm for the
change. I'm assuming that even if it's faster for a simple integer
sort, it'll be much slower for anything that requires calling out to
the datatype comparator. I also hadn't actually measured what
percentage of the sort was being spent in the insertion sort. I had
guessed it would be higher.

The test is attached. qsort_tuple.c is copied from tuplesort (with the
ifdef for NOPRESORT added, but you could skip that if you want).
Compile with something like:

gcc -DNOPRESORT -O3 -DCOUNTS -Wall -Wno-unused-function simd-sort-test.c


-- 
greg
#include <stdio.h>		/* the original header names were eaten by the */
#include <stdlib.h>		/* archive; these are plausible reconstructions */
#include <string.h>
#include <stdint.h>
#include <limits.h>
#include <sys/time.h>
#include <time.h>

#ifdef COUNTS
unsigned nswaps;
unsigned ncmps;
#define countswap (nswaps++)
#define countcmp  (ncmps++)
#else
#define countswap ((void)0)
#define countcmp  ((void)0)
#endif

typedef void *Tuplesortstate;
typedef long long int Datum;
typedef int  bool;

typedef struct
{
	void	   *tuple;			/* the tuple proper */
	Datum		datum1;			/* value of first key column */
	bool		isnull1;		/* is first key column NULL? */
	int			tupindex;		/* see notes above */
} SortTuple;

typedef SortTuple elem_t;

typedef int (*SortTupleComparator) (const SortTuple *a, const SortTuple *b,
	Tuplesortstate *state);

typedef struct SortSupportData *SortSupport;

static int comparator(Datum a, Datum b, SortSupport ssup) {
	if (b > a)
		return -1;
	else if (b < a)
		return 1;
	else return 0;
};

struct SortSupportData {
	bool ssup_nulls_first;
	bool ssup_reverse;
	int (*comparator)(Datum,Datum,SortSupport);
};

struct SortSupportData ssup = {0, 0, comparator};

static inline int
ApplySortComparator(Datum datum1, bool isNull1,
	Datum datum2, bool isNull2,
	SortSupport ssup)
{
	int			compare;

	countcmp;

	if (isNull1)
	{
		if (isNull2)
			compare = 0;		/* NULL "=" NULL */
		else if (ssup->ssup_nulls_first)
			compare = -1;		/* NULL "<" NOT_NULL */
		else
			compare = 1;		/* NULL ">" NOT_NULL */
	}
	else if (isNull2)
	{
		if (ssup->ssup_nulls_first)
			compare = 1;		/* NOT_NULL ">" NULL */
		else
			compare = -1;		/* NOT_NULL "<" NULL */
	}
	else
	{
		compare = (*ssup->comparator) (datum1, datum2, ssup);
		if (ssup->ssup_reverse)
			compare = -compare;
	}

	return compare;
}

#define CHECK_FOR_INTERRUPTS() do {} while (0)

#define Min(a,b) ((a)<(b)?(a):(b))
#define Max(a,b) ((a)<(b)?(b):(a))

#include "qsort_tuple.c"

#define mycmp(a,b)	\
	(ApplySortComparator(list[a].datum1, list[a].isnull1, \
		 list[b].datum1, list[b].isnull1, \
		 &ssup))

#define myswap(a,b)	\
	do {			\
		elem_t _tmp;\
		_tmp = list[a];\
		list[a] = list[b];			\
		list[b] = _tmp;\
		countswap;	\
	} while (0)

#define myswapif(a,b)\
	do {			\
		if (mycmp(a,b) > 0)			\
			myswap(a,b);\
	} while (0)


static int insertion_ok(unsigned N) {
	return N<1000;
}

static void insertion(elem_t *list, unsigned N) {
	if (N > 1000) {
		printf("insertion sort not feasible for large N\n");
		exit(1);
	}

	for (unsigned pm = 1; pm < N; pm++)
		for (unsigned pl = pm; pl && mycmp(pl-1, pl) > 0; pl--)
			myswap(pl, pl-1);
}

static int sort_networks_ok(unsigned N) {
	return N<=8;
}

static void sort_networks(elem_t *list, unsigned N) {
	if (N > 8) {
		printf("sort_networks only implemented for N in 0..8\n");
		exit(1);
	}
	switch(N) {
	case 0: 
	case 1:
		break;
		
	case 2:
		myswapif(0,1);
		break;
		
	case 3:
		myswapif(0,1); myswapif(1,2);
		myswapif(0,1);
		break;
		
	case 4:
		myswapif(0,1); myswapif(2,3);
		myswapif(1,3); myswapif(0,2);
		myswapif(1,2);
		break;
		
	case 5:
		myswapif(1,2); myswapif(3,4);
		myswapif(1,3); myswapif(0,2);
		myswapif(2,4); myswapif(0,3);
		myswapif(0,1); myswapif(2,3);
		myswapif(1,2);
		break;
		
	case 6:
		myswapif(0,1); myswapif(2,3); myswapif(4,5);
		myswapif(0,2); myswapif(3,5); myswapif(1,4);
		myswapif(0,1); myswapif(2,3); myswapif(4,5);
		myswapif(1,2); myswapif(3,4);
		myswapif(2,3);
		break;
		
	case 7:
		myswapif(1,2); myswapif(3,4); myswapif(5,6);
		myswapif(0,2); myswapif(4,6); myswapif(3,5);
		myswapif(2,6); myswapif(1,5); myswapif(0,4);
		myswapif(2,5); myswapif(0,3);
		myswapif(2,4); myswapif(1,3);
		myswapif(0,1); myswapif(2,3); myswapif(4,5);
		break;
		
	case 8:
		myswapif(0, 1); myswapif(2, 3); myswapif(4, 5); myswapif(6, 7);
		myswapif(0, 3); myswapif(1, 2); myswapif(4, 7); myswapif(5, 6);
		

Re: [HACKERS] Using quicksort for every external sort run

2015-12-12 Thread Jeff Janes
On Sat, Dec 12, 2015 at 2:28 PM, Peter Geoghegan  wrote:
> On Sat, Dec 12, 2015 at 12:10 AM, Jeff Janes  wrote:
>> I have a question about the terminology used in this patch.  What is a
>> tuple proper?  What is it in contradistinction to?  I would think that
>> a tuple which is located in its own palloc'ed space is the "proper"
>> one, leaving a tuple allocated in the bulk memory pool to be
>> called...something else.  I don't know what the
>> non-judgmental-sounding antonym of postpositive "proper" is.
>
> "Tuple proper" is a term that appears 5 times in tuplesort.c today. As
> it says at the top of that file:
>
> /*
>  * The objects we actually sort are SortTuple structs.  These contain
>  * a pointer to the tuple proper (might be a MinimalTuple or IndexTuple),
>  * which is a separate palloc chunk --- we assume it is just one chunk and
>  * can be freed by a simple pfree().  SortTuples also contain the tuple's
>  * first key column in Datum/nullflag format, and an index integer.

Those usages make sense to me, as they are locally self-contained and
it is clear what they are in contradistinction to.   But your usage is
spread throughout (even in function names, not just comments) and
seems to contradict the current usage as yours are not separately
palloced, as the "proper" ones described here are.  I think that
"proper" only works when the same comment also defines the
alternative, rather than as some file-global description.  Maybe
"pooltuple" rather than "tupleproper"


>
>> Also, if I am reading this correctly, when we refill a pool from a
>> logical tape we still transform each tuple as it is read from the disk
>> format to the memory format.  This inflates the size quite a bit, at
>> least for single-datum tuples.  If we instead just read the disk
>> format directly into the pool, and converted them into the in-memory
>> format when each tuple came due for the merge heap, would that destroy
>> the locality of reference you are seeking to gain?
>
> Are you talking about alignment?

Maybe alignment, but also the size of the SortTuple struct itself,
which is not present on tape but is present in memory if I understand
correctly.

When reading 128kb (32 blocks) worth of in-memory pool, it seems like
it only gets to read 16 to 18 blocks of tape to fill them up, in the
case of building an index on single column 32-byte random md5 digests.
I don't exactly know where all of that space goes, I'm taking an
experimentalist approach.

Cheers,

Jeff


  1   2   >