Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Fri, Jul 31, 2015 at 12:59 AM, Heikki Linnakangas hlinn...@iki.fi wrote: On 07/31/2015 02:01 AM, Peter Geoghegan wrote: What prevents the tuple at the top of the in-memory heap at the point of tuplesort_performsort() (say, one of the ones added to the heap as our glut of memory was*partially* consumed) being less than the last/greatest tuple on tape? If the answer is nothing, a merge step is clearly required. Oh, ok, I was confused on how the heap works. I think I explained this badly, by referencing a secondary reason why we must do a merge. I will now do a better job of explaining why a merge of in-memory and on disk tuples is necessary, for the benefit of other people (I think you get it). The main reason why a merge step is required is that the memtuples array will contain some tuples that were classified as belonging to a second run. Therefore, those tuples may well be lower than the highest on-tape tuples in terms of sort order (in fact, they may be lower than any on-tape tuple). I cannot simply return all tape tuples followed by all in-memory tuples to the caller, and so I must merge, and so only !randomAccess callers may get a quicksort with spillover. I can only get away with **avoiding dumping all tuples** and just merging instead because I reject this determination that a second *traditional/tape* run is needed. I am therefore free of any obligation to merge this would-be traditional second run separately. Another way of explaining it is that I do an all-in-memory merge of some part of the first run, and all the second run (by quicksorting). I then merge this with the original chunk of the first run that is sorted on tape (that was sorted by incremental spilling from the heap). The next version of the patch (the patch may be split in two -- quicksort with spillover, and merge sort optimization) will make sure that any comparisons that go into maintaining the heap invariant are not wasted on the second run, since it will always be quicksorted. We only need to compare the second run tuples pre-quicksort in order to determine that they belong to that run and not the current (first) run. Does that make sense? Have I explained that well? -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Tue, Aug 4, 2015 at 1:24 AM, Heikki Linnakangas hlinn...@iki.fi wrote: Yeah, I don't think there's a big performance difference between the two approaches. I'm not wedded to either approach. Whichever approach we use, my main point was that it would be better to handle this in the logical tape abstraction. In my approach, you would have those in-memory tails attached to the last two tapes. In Peter's approach, you would have one tape that's completely in memory, backed by the array. In either case, the tapes would look like normal tapes to most of tuplesort.c. There would be no extra TSS state, it would be TSS_SORTEDONTAPE or TSS_FINALMERGE as usual. TBH, it's not clear to me why making the logical tape abstraction manage some fraction of memtuples has any advantage at all. The only case that I can imagine where this could be useful is where a logical tape does asynchronous I/O, and has the benefit of some portion of memtuples (probably half) as scratch space (most I/O probably occurs while tuplesort quicksorts the other half of memtuples). But that has nothing to do with building a better abstraction. The more expansive patch that I'm currently working on, that covers every external sorting case -- the patch to *also* quicksort runs past the first run, regardless of whether or not we can get away with a quicksort with spillover -- is going very well. I haven't had a solid block of time to work on it this week due to other commitments, but it isn't very complicated. Performance benefits are considerable even without saving any I/O. I can be as much as twice as fast for merge sort sorts in some cases. So not quite as nice as quicksort with spillover, but still a significant improvement considering writing everything out is inevitable for the cases helped. As I said before, the upcoming patch has tuplesort give up on memtuples *ever* being a heap after the first run, whatever happens. I just quicksort and dump in batches past the first run. Since I give up on replacement selection sort only after the first run, I still have most of the upside of replacement selection, but little of the big downside of heap maintenance. This will result in smaller runs on average past the first run. I can give you at least 3 very strong arguments for why this is totally worth it in every case, but I'll wait till I'm asked for them. :-) One useful saving made in this upcoming multi-tape-run patch is that it never treats any non-current run as part of the heap beyond its run number, even when currentRun is the first (run 0). So no comparisons occur beyond the first run to maintain the heap invariant *even when the first run is current* -- tuples are simply appended that belong to the second run (we only do an extra comparison to determine that that's the run they belong in). So the second run (run 1) is not trusted to be heapified by dumptuples(), and is quicksorted (either by quicksort with spillover, along with much of the first run, or on its own, when there are multiple conventional on-tape runs; it doesn't matter which way it is quicksorted). From there on, every run is quicksorted when memtuples fills, and written out entirely in memtuple sized batches. The logical tape abstraction is currently too low-level for that. It's just a tape of bytes, and tuplesort.c determines where a tuple begins and ends. That needs to be changed so that the logical tape abstraction works tuple-at-a-time instead. For example, instead of LogicalTapeRead(N) which reads N bytes, you would have LogicalTapeReadTuple(), which reads next tuple, and returns its length and the tuple itself. But that would be quite sensible anyway. Why would it be sensible? I honestly wonder why you want to do things that way. What is the advantage of not having what I call the in-memory subrun managed by a logical tape? It's already nothing like any other type of run in several ways. Aside from being all in-memory, it is often much larger. It's special in that it kind of rejects the preliminary determination that some tuples within memtuples need to be in a second, traditional, on-tape run (because we can just quicksort everything and merge with the existing single on-tape run). Also, we now need tuplesort_gettuple_common() to ask READTUP() what to tell its caller about freeing memory that is allocated in within tuplesort.c directly. The memtuples array is already treated as an array, a heap, the head of each tape that is merged, and maybe one other thing that I forgot about offhand. The field SortTuple.tupindex has a total of 3 context-dependent meanings. Playing these kind of games with the memtuples array is very much something that happens already. More than anything else, I think that the new TSS_MEMTAPEMERGE state is justified as a special case because quicksort with spillover is legitimately a special case. Users will want to know how close they were to an internal sort when looking at EXPLAIN ANALYZE and so on. When cost_sort() is
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Tue, Aug 4, 2015 at 1:24 AM, Heikki Linnakangas hlinn...@iki.fi wrote: Yeah, something like that. To paraphrase, if I'm now understanding it correctly, Peter's idea is: When all the tuples have been fed to tuplesort, and it's time to perform the sort, quicksort all the tuples currently in the heap, ignoring the run numbers, and turn the resulting array into another tape. That tape is special: it's actually stored completely in memory. It is merged with the real tapes when tuples are returned from the tuplesort, just like regular tapes in TSS_FINALMERGE. Yeah. I imagine that we'll want to put memory prefetch hints for the new case, since I've independently shown that that works well for the in-memory case, which this can be very close to. My next patch will also include quicksorting of runs after we give up on heapification (after there is more than one run and it is established that we cannot use my quicksort with spillover optimization, so there are two or more real runs on tape). Once there is clearly not going to be one huge run (which can happen due to everything largely being in order, even when work_mem is small), and once incrementally spilling does not end in time to do a quicksort with spillover, then the replacement selection thing isn't too valuable. Especially with large memory sizes but memory bandwidth + latency as a bottleneck, which is the norm these days. This seems simpler than my earlier idea of reusing half the memtuples array only, and resorting the entire array each time, to have something that consistently approximates replacement selection but with quicksorting + batching, which I discussed before. I have this working, and it takes about a good chunk of the runtime off a sort that merges 3 runs on one reasonable case tested where work_mem was 300MB. It went from about 56.6 seconds with master to 35.8 seconds with this new approach when tested just now (this approach saves no writing of tuples, so it's not as effective as the original quicksort with spillover patch can be, but covers a fundamentally different case). I just need to clean up the patch, and see if I missed any further optimizations, but this feels like the way forward multi-run wise. I think it's worth giving up on replacement selection style runs after the first run is produced, because that's where the benefits are, if anywhere. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On 08/03/2015 11:36 PM, Robert Haas wrote: On Mon, Aug 3, 2015 at 3:33 PM, Peter Geoghegan p...@heroku.com wrote: When it's time to drain the heap, in performsort, divide the array into two arrays, based on the run number of each tuple, and then quicksort the arrays separately. The first array becomes the in-memory tail of the current tape, and the second array becomes the in-memory tail of the next tape. You wouldn't want to actually allocate two arrays and copy SortTuples around, but keep using the single large array, just logically divided into two. So the bookkeeping isn't trivial, but seems doable. You're talking about a new thing here, that happens when it is necessary to dump everything and do a conventional merging of on-tape runs. IOW, we cannot fit a significant fraction of overall tuples in memory, and we need much of the memtuples array for the next run (usually this ends as a TSS_FINALMERGE). That being the case (...right?), I don't think that's what Heikki is talking about. He can correct me if I'm wrong, but what I think he's saying is that we should try to exploit the fact that we've already determined which in-memory tuples can be part of the current run and which in-memory tuples must become part of the next run. Suppose half the tuples in memory can become part of the current run and the other half must become part of the next run. If we throw all of those tuples into a single bucket and quicksort it, we're losing the benefit of the comparisons we did to figure out which tuples go in which runs. Instead, we could quicksort the current-run tuples and, separately, quick-sort the next-run tuples. Ignore the current-run tuples completely until the tape is empty, and then merge them with any next-run tuples that remain. Yeah, something like that. To paraphrase, if I'm now understanding it correctly, Peter's idea is: When all the tuples have been fed to tuplesort, and it's time to perform the sort, quicksort all the tuples currently in the heap, ignoring the run numbers, and turn the resulting array into another tape. That tape is special: it's actually stored completely in memory. It is merged with the real tapes when tuples are returned from the tuplesort, just like regular tapes in TSS_FINALMERGE. And my idea is: When all the tuples have been fed to tuplesort, and it's time to perform the sort, take all the tuples in the heap belonging to currentRun, quicksort them, and make them part of the current tape. They're not just pushed to the tape as usual, however, but attached as in-memory tail of the current tape. The logical tape abstraction will return them after all the tuples already in the tape, as if they were pushed to the tape as usual. Then take all the remaining tuples in the heap (if any), belonging to next tape, and do the same for them. They become an in-memory tail of the next tape. I'm not sure if there's any reason to believe that would be faster than your approach. In general, sorting is O(n lg n) so sorting two arrays that are each half as large figures to be slightly faster than sorting one big array. But the difference may not amount to much. Yeah, I don't think there's a big performance difference between the two approaches. I'm not wedded to either approach. Whichever approach we use, my main point was that it would be better to handle this in the logical tape abstraction. In my approach, you would have those in-memory tails attached to the last two tapes. In Peter's approach, you would have one tape that's completely in memory, backed by the array. In either case, the tapes would look like normal tapes to most of tuplesort.c. There would be no extra TSS state, it would be TSS_SORTEDONTAPE or TSS_FINALMERGE as usual. The logical tape abstraction is currently too low-level for that. It's just a tape of bytes, and tuplesort.c determines where a tuple begins and ends. That needs to be changed so that the logical tape abstraction works tuple-at-a-time instead. For example, instead of LogicalTapeRead(N) which reads N bytes, you would have LogicalTapeReadTuple(), which reads next tuple, and returns its length and the tuple itself. But that would be quite sensible anyway. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Mon, Aug 3, 2015 at 1:36 PM, Robert Haas robertmh...@gmail.com wrote: I don't think that's what Heikki is talking about. He can correct me if I'm wrong, but what I think he's saying is that we should try to exploit the fact that we've already determined which in-memory tuples can be part of the current run and which in-memory tuples must become part of the next run. Suppose half the tuples in memory can become part of the current run and the other half must become part of the next run. If we throw all of those tuples into a single bucket and quicksort it, we're losing the benefit of the comparisons we did to figure out which tuples go in which runs. Instead, we could quicksort the current-run tuples and, separately, quick-sort the next-run tuples. Ignore the current-run tuples completely until the tape is empty, and then merge them with any next-run tuples that remain. Oh. Well, the benefit of the comparisons we did to figure out which tuples go in which runs is that we can determine the applicability of this optimization. Also, by keeping run 1 (if any) usually in memory, and run 0 partially on disk we avoid having to worry about run 1 as a thing that spoils the optimization (in the current single run optimization, dumping all tuples can make us acknowledge run 1 (i.e. currentRun++), preventing single run optimization, which we handily avoid in the patch). Finally, it saves us a bunch of real COMPARETUP() comparisons as HEAPCOMPARE() is called as tuples are inserted into the still-heapified memtuples array. I'm not sure if there's any reason to believe that would be faster than your approach. In general, sorting is O(n lg n) so sorting two arrays that are each half as large figures to be slightly faster than sorting one big array. But the difference may not amount to much. IMV, the smart way of avoiding wasting the comparisons we did to figure out which tuples go in which runs is to rig HEAPCOMPARE() to only do a COMPARETUP() for the currentRun, and make sure that we don't mess up and forget that if we don't end up quicksorting. The second run that is in memory can only consist of whatever tuples were added after heapification that were less than any of those previously observed tuples (a big majority, usually). So like you, I can't see any of these techniques helping much, even my smart technique. Maybe I should look at a case involving text or something to be sure. Thinking about it some more, I don't think it would be easy to maintain a clear separation between run 0 and run 1 in the memtuples array in terms of a cutoff point. It's still a heap at that stage, of course. You'd have to rig each tuple comparator so that COMPARETUP() cared about tupindex before comparing datum1 just for this, which seems rather unappealing. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Fri, Jul 31, 2015 at 12:59 AM, Heikki Linnakangas hlinn...@iki.fi wrote: Oh, ok, I was confused on how the heap works. You could still abstract this as in-memory tails of the tapes, but it's more complicated than I thought at first: When it's time to drain the heap, in performsort, divide the array into two arrays, based on the run number of each tuple, and then quicksort the arrays separately. The first array becomes the in-memory tail of the current tape, and the second array becomes the in-memory tail of the next tape. You wouldn't want to actually allocate two arrays and copy SortTuples around, but keep using the single large array, just logically divided into two. So the bookkeeping isn't trivial, but seems doable. Since you're talking about the case where we must drain all tuples within tuplesort_performsort(), I think you're talking about a distinct idea here (since surely you're not suggesting giving up on my idea of avoiding dumping most tuples, which is a key strength of the patch). That's fine, but I just want to be clear on that. Importantly, my optimization does not care about the fact that the array may have two runs in it, because it quicksorts the array and forgets about it being a heap. It only cares whether or not more than one run made it out to tape, which makes it more widely applicable than it would otherwise be. Also, the fact that much I/O can be avoided is clearly something that can only happen when work_mem is at least ~50% of a work_mem setting that would have resulted in an (entirely) internal sort. You're talking about a new thing here, that happens when it is necessary to dump everything and do a conventional merging of on-tape runs. IOW, we cannot fit a significant fraction of overall tuples in memory, and we need much of the memtuples array for the next run (usually this ends as a TSS_FINALMERGE). That being the case (...right?), I'm confused that you're talking about doing something clever within tuplesort_performsort(). In the case you're targeting, won't the vast majority of tuple dumping (through calls to dumptuples()) occur within puttuple_common()? I think that my optimization should probably retain it's own state.status even if we do this (i.e. TSS_MEMTAPEMERGE should stay). Hmm, I can see another possible optimization here, in the way the heap is managed in TSS_BUILDRUNS state. Instead of keeping a single heap, with tupindex as the leading key, it would be more cache efficient to keep one heap for the currentRun, and an unsorted array of tuples belonging to currentRun + 1. When the heap becomes empty, and currentRun is incemented, quicksort the unsorted array to become the new heap. That's a completely separate idea from your patch, although if you did it that way, you wouldn't need the extra step to divide the large array into two, as you'd maintain that division all the time. This sounds to me like a refinement of your first idea (the idea that I just wrote about). I think the biggest problem with tuplesort after this patch of mine is committed is that it is still too attached to the idea of incrementally spilling and sifting. It makes sense to some degree where it makes my patch possible...if we hang on to the idea of incrementally spilling tuples on to tape in sorted order for a while, then maybe we can hang on for long enough to quicksort most tuples, *and* to avoid actually dumping most tuples (incremental spills make the latter possible). But when work_mem is only, say, 10% of the setting required for a fully internal sort, then the incremental spilling and sifting starts to look dubious, at least to me, because the TSS_MEMTAPEMERGE optimization added by my patch could not possibly apply, and dumping and merging many times is inevitable. What I think you're getting at here is that we still have a heap, but we don't use the heap to distinguish between tuples within a run. In other words, HEAPCOMPARE() often/always only cares about run number. We quicksort after a deferred period of time, probably just before dumping everything. Perhaps I've misunderstood, but I don't see much point in quicksorting a run before being sure that you're sorting as opposed to heapifying at that point (you're not clear on what we've decided on once we quicksort). I think it could make sense to make HEAPCOMPARE() not care about tuples within a run that is not currentRun, though. I think that anything that gives up on replacement selection's ability to generate large runs, particularly for already sorted inputs will be too hard a sell (not that I think that's what you proposed). That's way, way less of an advantage than it was in the past (back when external sorts took place using actual magnetic tape, it was a huge), but the fact remains that it is an advantage. And so, I've been prototyping an approach where we don't heapify once it is established that this TSS_MEMTAPEMERGE optimization of mine cannot possibly apply. We quicksort in batches rather
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Mon, Aug 3, 2015 at 3:33 PM, Peter Geoghegan p...@heroku.com wrote: When it's time to drain the heap, in performsort, divide the array into two arrays, based on the run number of each tuple, and then quicksort the arrays separately. The first array becomes the in-memory tail of the current tape, and the second array becomes the in-memory tail of the next tape. You wouldn't want to actually allocate two arrays and copy SortTuples around, but keep using the single large array, just logically divided into two. So the bookkeeping isn't trivial, but seems doable. You're talking about a new thing here, that happens when it is necessary to dump everything and do a conventional merging of on-tape runs. IOW, we cannot fit a significant fraction of overall tuples in memory, and we need much of the memtuples array for the next run (usually this ends as a TSS_FINALMERGE). That being the case (...right?), I don't think that's what Heikki is talking about. He can correct me if I'm wrong, but what I think he's saying is that we should try to exploit the fact that we've already determined which in-memory tuples can be part of the current run and which in-memory tuples must become part of the next run. Suppose half the tuples in memory can become part of the current run and the other half must become part of the next run. If we throw all of those tuples into a single bucket and quicksort it, we're losing the benefit of the comparisons we did to figure out which tuples go in which runs. Instead, we could quicksort the current-run tuples and, separately, quick-sort the next-run tuples. Ignore the current-run tuples completely until the tape is empty, and then merge them with any next-run tuples that remain. I'm not sure if there's any reason to believe that would be faster than your approach. In general, sorting is O(n lg n) so sorting two arrays that are each half as large figures to be slightly faster than sorting one big array. But the difference may not amount to much. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Thu, Jul 30, 2015 at 4:01 PM, Peter Geoghegan p...@heroku.com wrote: On Thu, Jul 30, 2015 at 12:00 AM, Heikki Linnakangas hlinn...@iki.fi wrote: Hmm. You don't really need to merge the in-memory array into the tape, as you know that all the tuples in the in-memory must come after the tuples already on the tape. You can just return all the tuples from the tape first, and then all the tuples from the array. It's more complicated than it appears, I think. Tuples may be variable sized. WRITETUP() performs a pfree(), and gives us back a variable amount of availMem. What if we dumped a single, massive, outlier tuple out when a caller passes it and it goes to the root of the heap? We'd dump that massive tuple in one go (this would be an incremental dumptuples() call, which we still do in the patch), making things !LACKMEM() again, but by an usually comfortable margin. We read in a few more regular tuples, but we're done consuming tuples before things ever get LACKMEM() again (no more dumping needed, at least with this patch applied). What prevents the tuple at the top of the in-memory heap at the point of tuplesort_performsort() (say, one of the ones added to the heap as our glut of memory was *partially* consumed) being less than the last/greatest tuple on tape? If the answer is nothing, a merge step is clearly required. It's simple to prove this with the attached rough patch, intended to be applied on top of Postgres with my patch. It hacks tuplesort_gettuple_common() to always return tape tuples first, before returning memtuples only when tape tuples have been totally exhausted. If you run my cursory regression test suite with this, you'll see serious regressions. I also attach a regression test diff file from my development system, to save you the trouble of trying this yourself. Note how the count(distinct(s)) numbers get closer to being correct (lower) as work_mem increases make tuplesort approach an internal sort. It's possible that we can get away with something cheaper than a merge step, but my impression right now is that it isn't terribly expensive. OTOH, if we can make this work with the randomAccess case by being more clever about merging, that could be worthwhile. -- Peter Geoghegan diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index 5371f4f..ce79347 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -2022,7 +2022,6 @@ just_memtuples: */ Assert(state-current state-memtupcount); - if (COMPARETUP(state, stup, state-memtuples[state-current]) = 0) { /* * Tape tuple less than or equal to memtuple array current @@ -2031,18 +2030,6 @@ just_memtuples: state-cached = false; *should_free = true; } - else - { -/* - * Tape tuple greater than memtuple array's current tuple. - * - * Return current memtuple tuple, and cache tape tuple for - * next call, where it may be returned. - */ -state-cached = true; -*should_free = false; -*stup = state-memtuples[state-current++]; - } return true; case TSS_FINALMERGE: regression.diffs Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On 07/31/2015 02:01 AM, Peter Geoghegan wrote: What prevents the tuple at the top of the in-memory heap at the point of tuplesort_performsort() (say, one of the ones added to the heap as our glut of memory was*partially* consumed) being less than the last/greatest tuple on tape? If the answer is nothing, a merge step is clearly required. Oh, ok, I was confused on how the heap works. You could still abstract this as in-memory tails of the tapes, but it's more complicated than I thought at first: When it's time to drain the heap, in performsort, divide the array into two arrays, based on the run number of each tuple, and then quicksort the arrays separately. The first array becomes the in-memory tail of the current tape, and the second array becomes the in-memory tail of the next tape. You wouldn't want to actually allocate two arrays and copy SortTuples around, but keep using the single large array, just logically divided into two. So the bookkeeping isn't trivial, but seems doable. Hmm, I can see another possible optimization here, in the way the heap is managed in TSS_BUILDRUNS state. Instead of keeping a single heap, with tupindex as the leading key, it would be more cache efficient to keep one heap for the currentRun, and an unsorted array of tuples belonging to currentRun + 1. When the heap becomes empty, and currentRun is incemented, quicksort the unsorted array to become the new heap. That's a completely separate idea from your patch, although if you did it that way, you wouldn't need the extra step to divide the large array into two, as you'd maintain that division all the time. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Thu, Jul 30, 2015 at 12:00 AM, Heikki Linnakangas hlinn...@iki.fi wrote: Hmm. You don't really need to merge the in-memory array into the tape, as you know that all the tuples in the in-memory must come after the tuples already on the tape. You can just return all the tuples from the tape first, and then all the tuples from the array. It's more complicated than it appears, I think. Tuples may be variable sized. WRITETUP() performs a pfree(), and gives us back a variable amount of availMem. What if we dumped a single, massive, outlier tuple out when a caller passes it and it goes to the root of the heap? We'd dump that massive tuple in one go (this would be an incremental dumptuples() call, which we still do in the patch), making things !LACKMEM() again, but by an usually comfortable margin. We read in a few more regular tuples, but we're done consuming tuples before things ever get LACKMEM() again (no more dumping needed, at least with this patch applied). What prevents the tuple at the top of the in-memory heap at the point of tuplesort_performsort() (say, one of the ones added to the heap as our glut of memory was *partially* consumed) being less than the last/greatest tuple on tape? If the answer is nothing, a merge step is clearly required. This is not a problem when every single tuple is dumped, but that doesn't happen anymore. I probably should have shown more tests, that tested HeapTuple sorts (not just datum tuple sorts). I agree that things at least usually happen as you describe, FWIW. I think it'd make sense to structure the code differently, to match the way I described this optimization above. Instead of adding a new tuplesort state for this, abstract this in the logical tape code. Add a function to attach an in-memory tail to a tape, and have LogicalTapeRead() read from the tail after reading the on-disk tape. The rest of the code wouldn't need to care that sometimes part of the tape is actually in memory. I'll need to think about all of that. Certainly, quicksorting runs in a more general way seems like a very promising idea, and I know that this patch does not go far enough. But it also add this idea of not dumping out most tuples, which seems impossible to generalize further, so maybe it's a useful special case to start from for that reason (and also because it's where the pain is currently felt most often). +* Note that there might actually be 2 runs, but only the +* contents of one of them went to tape, and so we can +* safely pretend that there is only 1 run (since we're +* about to give up on the idea of the memtuples array being +* a heap). This means that if our sort happened to require +* random access, the similar single run optimization +* below (which sets TSS_SORTEDONTAPE) might not be used at +* all. This is because dumping all tuples out might have +* forced an otherwise equivalent randomAccess case to +* acknowledge a second run, which we can avoid. Is that really true? We don't start a second run until we have to, i.e. when it's time to dump the first tuple of the second run to tape. So I don't think the case you describe above, where you have two runs but only one of them has tuples on disk, can actually happen. I think we're talking about two slightly different things. I agree that I am avoiding starting a second run because I am avoiding dumping tuples, just as you say (I describe this as avoiding acknowledging a second run). But there could still be SortTuples that have a tupindex that is 0 (they could be 1, to be specific). It's pretty clear from looking at the TSS_BUILDRUNS case within puttuple_common() that this is true. So, if instead you define starting a tuple as adding a sortTuple with a tupindex that is 0, then yes, this comment is true. The important thing is that since we're not dumping every tuple, it doesn't matter whether or not a that TSS_BUILDRUNS case within puttuple_common() ever took the currentRun + 1 insertion path (which can easily happen), provided things aren't so skewed that it ends up on tape even without dumping all tuples (which seems much less likely). As I've said, this optimization will occur a lot more often then the existing one run optimization (assuming !randomAccess), as a nice side benefit of not dumping every tuple. Quicksort does not use HEAPCOMPARE(), so clearly having multiple runs in that subrun is a non-issue. Whether or not we say that a second run started, or that there was merely the unfulfilled intent to start a new, second run is just semantics. While I certainly care about semantics, my point is that we agree that this useful pretend there is only one run thing happens (I think). The existing one run optimization only really occurs when the range of values in the set of tuples is well characterized by the tuple values observed during initial heapification, which is bad. Or would be bad, if the existing
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Thu, Jul 30, 2015 at 11:32 AM, Robert Haas robertmh...@gmail.com wrote: Very interesting. And great performance numbers. Thanks for taking the time to investigate this - really cool. Thanks. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Thu, Jul 30, 2015 at 3:47 AM, Simon Riggs si...@2ndquadrant.com wrote: This is a good optimization for the common case where tuples are mostly already in order. We could increase the usefulness of this by making UPDATE pick blocks that are close to a tuple's original block, rather than putting them near the end of a relation. Not sure what you mean here. So here's a shorter/different explanation of this optimization: When it's time to perform the sort, instead of draining the in-memory heap one tuple at a time to the last tape, you sort the heap with quicksort, and pretend that the sorted heap belongs to the last tape, after all the other tuples in the tape. Some questions/thoughts on that: Isn't that optimization applicable even when you have multiple runs? Quicksorting the heap and keeping it as an array in memory is surely always faster than heapsorting and pushing it to the tape. It's about use of memory. If you have multiple runs on tape, then they will need to be merged and you need memory to do that efficiently. If there are tuples in the last batch still in memory then it can work, but it depends upon how full memory is from the last batch and how many batches there are. I agree that this optimization has a lot to do with exploiting the fact that you don't need to free the memtuples array for future runs because you've already received all tuples (or keep space free for previous runs). I think that we should still use quicksort on runs where this optimization doesn't work out, but I also still think that that's a different idea. Doing what I've proposed here when there are multiple runs seems less valuable, if only because you're not going to avoid that much writing. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On Wed, Jul 29, 2015 at 9:05 PM, Peter Geoghegan p...@heroku.com wrote: The behavior of external sorts that do not require any merge step due to only having one run (what EXPLAIN ANALYZE output shows as an external sort, and not a merge sort) seems like an area that can be significantly improved upon. As noted in code comments, this optimization did not appear in The Art of Computer Programming, Volume III. It's not an unreasonable idea, but it doesn't work well on modern machines due to using heapsort, which is known to use the cache ineffectively. It also often implies significant additional I/O for little or no benefit. I suspect that all the reports we've heard of smaller work_mem sizes improving sort performance are actually down to this one-run optimization *hurting* performance. Very interesting. And great performance numbers. Thanks for taking the time to investigate this - really cool. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On 07/30/2015 04:05 AM, Peter Geoghegan wrote: Patch -- quicksort with spillover = With the attached patch, I propose to add an additional, better one run special case optimization. This new special case splits the single run into 2 subruns. One of the runs is comprised of whatever tuples were in memory when the caller finished passing tuples to tuplesort. To sort that, we use quicksort, which in general has various properties that make it much faster than heapsort -- it's a cache oblivious algorithm, which is important these days. The other subrun is whatever tuples were on-tape when tuplesort_performsort() was called. This will often be a minority of the total, but certainly never much more than half. This is already sorted when tuplesort_performsort() is reached. This spillover is already inserted at the front of the sorted-on-tape tuples, and so already has reasonably good cache characteristics. With the patch, we perform an on-the-fly merge that is somewhat similar to the existing (unaffected) merge sort TSS_FINALMERGE case, except that one of the runs is in memory, and is potentially much larger than the on-tape/disk run (but never much smaller), and is quicksorted. The existing merge sort case similarly is only used when the caller specified !randomAccess. Hmm. You don't really need to merge the in-memory array into the tape, as you know that all the tuples in the in-memory must come after the tuples already on the tape. You can just return all the tuples from the tape first, and then all the tuples from the array. So here's a shorter/different explanation of this optimization: When it's time to perform the sort, instead of draining the in-memory heap one tuple at a time to the last tape, you sort the heap with quicksort, and pretend that the sorted heap belongs to the last tape, after all the other tuples in the tape. Some questions/thoughts on that: Isn't that optimization applicable even when you have multiple runs? Quicksorting the heap and keeping it as an array in memory is surely always faster than heapsorting and pushing it to the tape. I think it'd make sense to structure the code differently, to match the way I described this optimization above. Instead of adding a new tuplesort state for this, abstract this in the logical tape code. Add a function to attach an in-memory tail to a tape, and have LogicalTapeRead() read from the tail after reading the on-disk tape. The rest of the code wouldn't need to care that sometimes part of the tape is actually in memory. It should be pretty easy to support randomAccess too. If you think of the in-memory heap as a tail of the last tape, you can easily move backwards from the in-memory heap back to the on-disk tape, too. +* Note that there might actually be 2 runs, but only the +* contents of one of them went to tape, and so we can +* safely pretend that there is only 1 run (since we're +* about to give up on the idea of the memtuples array being +* a heap). This means that if our sort happened to require +* random access, the similar single run optimization +* below (which sets TSS_SORTEDONTAPE) might not be used at +* all. This is because dumping all tuples out might have +* forced an otherwise equivalent randomAccess case to +* acknowledge a second run, which we can avoid. Is that really true? We don't start a second run until we have to, i.e. when it's time to dump the first tuple of the second run to tape. So I don't think the case you describe above, where you have two runs but only one of them has tuples on disk, can actually happen. Performance == Impressive! Predictability == Even more impressive! Future work = As an extra optimization, you could delay quicksorting the in-memory array until it's time to read the first tuple from it. If the caller reads only the top-N tuples from the sort for some reason (other than LIMIT, which we already optimize for), that could avoid a lot of work. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On 30 July 2015 at 08:00, Heikki Linnakangas hlinn...@iki.fi wrote: Hmm. You don't really need to merge the in-memory array into the tape, as you know that all the tuples in the in-memory must come after the tuples already on the tape. You can just return all the tuples from the tape first, and then all the tuples from the array. Agreed This is a good optimization for the common case where tuples are mostly already in order. We could increase the usefulness of this by making UPDATE pick blocks that are close to a tuple's original block, rather than putting them near the end of a relation. So here's a shorter/different explanation of this optimization: When it's time to perform the sort, instead of draining the in-memory heap one tuple at a time to the last tape, you sort the heap with quicksort, and pretend that the sorted heap belongs to the last tape, after all the other tuples in the tape. Some questions/thoughts on that: Isn't that optimization applicable even when you have multiple runs? Quicksorting the heap and keeping it as an array in memory is surely always faster than heapsorting and pushing it to the tape. It's about use of memory. If you have multiple runs on tape, then they will need to be merged and you need memory to do that efficiently. If there are tuples in the last batch still in memory then it can work, but it depends upon how full memory is from the last batch and how many batches there are. -- Simon Riggshttp://www.2ndQuadrant.com/ http://www.2ndquadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run external sort
On 07/30/2015 01:47 PM, Simon Riggs wrote: On 30 July 2015 at 08:00, Heikki Linnakangas hlinn...@iki.fi wrote: So here's a shorter/different explanation of this optimization: When it's time to perform the sort, instead of draining the in-memory heap one tuple at a time to the last tape, you sort the heap with quicksort, and pretend that the sorted heap belongs to the last tape, after all the other tuples in the tape. Some questions/thoughts on that: Isn't that optimization applicable even when you have multiple runs? Quicksorting the heap and keeping it as an array in memory is surely always faster than heapsorting and pushing it to the tape. It's about use of memory. If you have multiple runs on tape, then they will need to be merged and you need memory to do that efficiently. If there are tuples in the last batch still in memory then it can work, but it depends upon how full memory is from the last batch and how many batches there are. True, you need a heap to hold the next tuple from each tape in the merge step. To avoid exceeding work_mem, you'd need to push some tuples from the in-memory array to the tape to make room for that. In practice, though, the memory needed for the merge step's heap is tiny. Even if you merge 1000 tapes, you only need memory for 1000 tuples in the heap. But yeah, you'll need some logic to share/divide the in-memory array between the heap and the in-memory tail of the last tape. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers