Re: [PATCHES] [HACKERS] Automatically setting work_mem

2006-04-23 Thread Bruno Wolff III
On Sat, Apr 22, 2006 at 14:20:32 -0700,
  daveg <[EMAIL PROTECTED]> wrote:
> On Sat, Apr 22, 2006 at 01:49:25PM -0700, David Fetter wrote:
> > On Sat, Apr 22, 2006 at 01:14:42PM -0700, David Gould wrote:
> > 
> > > To avoid running out of swap and triggering the oom killer we have
> > > had to reduce work_mem below what we prefer.
> > 
> > Dunno about your work_mem, but you can make sure the OOM killer
> > doesn't kill you as follows .
> 
> Or I could run with overcommit turned off, but we like overcommit because
> things like vacuum appear to allocate maintenance_work_mem when they start,
> so if that is set at, say, 100MB it will allocate 100MB even to vacuum a
> two-page table. Overcommit lets this sort of thing get by without creating
> a need for even more swap.

I would expect that you would still come out ahead committing some disk
space to swap that will probably never be used, if that allows you to better
configure your memory usage.



Re: [PATCHES] [HACKERS] Automatically setting work_mem

2006-04-22 Thread daveg
On Sat, Apr 22, 2006 at 01:49:25PM -0700, David Fetter wrote:
> On Sat, Apr 22, 2006 at 01:14:42PM -0700, David Gould wrote:
> 
> > To avoid running out of swap and triggering the oom killer we have
> > had to reduce work_mem below what we prefer.
> 
> Dunno about your work_mem, but you can make sure the OOM killer
> doesn't kill you as follows .

Or I could run with overcommit turned off, but we like overcommit because
things like vacuum appear to allocate maintenance_work_mem when they start,
so if that is set at, say, 100MB it will allocate 100MB even to vacuum a
two-page table. Overcommit lets this sort of thing get by without creating
a need for even more swap.
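
To make that concrete, here is a minimal standalone C sketch of the
behaviour (illustrative only; it assumes Linux-style overcommit, the
default 8KB block size, and the hypothetical 100MB setting from above):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t maint_mem = 100 * 1024 * 1024;   /* the hypothetical 100MB setting */

    /* With overcommit enabled this only reserves address space; no RAM or
     * swap is consumed until pages are actually touched. With overcommit
     * disabled, the full 100MB is charged against commit immediately. */
    char *buf = malloc(maint_mem);
    if (buf == NULL)
        return 1;

    /* Vacuuming a 2-page table would touch only a tiny fraction of it. */
    memset(buf, 0, 2 * 8192);

    printf("allocated %zu bytes, touched %d\n", maint_mem, 2 * 8192);
    free(buf);
    return 0;
}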

-dg


-- 
David Gould  [EMAIL PROTECTED]
If simplicity worked, the world would be overrun with insects.



Re: [PATCHES] [HACKERS] Automatically setting work_mem

2006-04-22 Thread David Fetter
On Sat, Apr 22, 2006 at 01:14:42PM -0700, David Gould wrote:

> To avoid running out of swap and triggering the oom killer we have
> had to reduce work_mem below what we prefer.

Dunno about your work_mem, but you can make sure the OOM killer
doesn't kill you as follows .

HTH :)

Cheers,
D
-- 
David Fetter <[EMAIL PROTECTED]> http://fetter.org/
phone: +1 415 235 3778    AIM: dfetter666
  Skype: davidfetter

Remember to vote!



Re: [PATCHES] [HACKERS] Automatically setting work_mem

2006-04-22 Thread daveg
On Sat, Apr 22, 2006 at 06:38:53PM +0100, Simon Riggs wrote:
> On Sat, 2006-04-22 at 13:17 -0400, Tom Lane wrote:
> > Simon Riggs <[EMAIL PROTECTED]> writes:
> > > I still do, for multi-user systems. Releasing unused memory from a large
> > > CREATE INDEX will allow that memory to be swapped out, even if the brk
> > > point can't be changed.
> > 
> > Say what?  It can get "swapped out" anyway, whether we free() it or not.
> 
> Of course it can, but if the memory is not actively used by the sort
> then being swapped out is both acceptable and fairly likely. If we
> actively use the memory for the sort it would be less likely to be
> swapped out, and a bad thing if it were.
> 
> > More to the point, though: I don't believe that the proposed patch is a
> > good idea --- it does not reduce the peak sortmem use, which I think is
> > the critical factor for a multiuser system, 
> 
> I agree peak memory use is the critical factor. There is only one
> performsort in progress at any one time, though there can be many final
> merges/retrievals in progress concurrently. If the majority of the
> memory used by performsort is released afterwards then it can be made
> available for subsequent sorts/hashes etc. without further increasing
> peak memory use.
> 
> > and what it does do is
> > reduce the locality of access to the sort temp file during the merge
> > phases.  That will definitely have some impact; maybe small, but some;
> > and I don't see where the benefit comes in.
> 
> That I already accept.

I'd like to add a user perspective: we run dual Opteron servers with
16GB of memory and 16GB of swap. When we are busy we can have 20 to
30 substantial queries running at one time. It is very common for us to
have several sorts and also hash joins running concurrently, some for a
minute or two, some for much longer.  To avoid running out of swap and
triggering the OOM killer we have had to reduce work_mem below what we
prefer.

We could add more swap, but at some point this has diminishing returns.
The proposed patch seems as if it would be helpful in our situation.

-dg

-- 
David Gould  [EMAIL PROTECTED]
If simplicity worked, the world would be overrun with insects.



Re: [PATCHES] [HACKERS] Automatically setting work_mem

2006-04-22 Thread Simon Riggs
On Sat, 2006-04-22 at 13:17 -0400, Tom Lane wrote:
> Simon Riggs <[EMAIL PROTECTED]> writes:
> > I still do, for multi-user systems. Releasing unused memory from a large
> > CREATE INDEX will allow that memory to be swapped out, even if the brk
> > point can't be changed.
> 
> Say what?  It can get "swapped out" anyway, whether we free() it or not.

Of course it can, but if the memory is not actively used by the sort
then being swapped out is both acceptable and fairly likely. If we
actively use the memory for the sort it would be less likely to be
swapped out, and a bad thing if it were.

> More to the point, though: I don't believe that the proposed patch is a
> good idea --- it does not reduce the peak sortmem use, which I think is
> the critical factor for a multiuser system, 

I agree peak memory use is the critical factor. There is only one
performsort in progress at any one time, though there can be many final
merges/retrievals in progress concurrently. If the majority of the
memory used by performsort is released afterwards then it can be made
available for subsequent sorts/hashes etc. without further increasing
peak memory use.

> and what it does do is
> reduce the locality of access to the sort temp file during the merge
> phases.  That will definitely have some impact; maybe small, but some;
> and I don't see where the benefit comes in.

That I already accept.

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com/




Re: [PATCHES] [HACKERS] Automatically setting work_mem

2006-04-22 Thread Jim C. Nasby
On Sat, Apr 22, 2006 at 01:17:08PM -0400, Tom Lane wrote:
> Simon Riggs <[EMAIL PROTECTED]> writes:
> > I still do, for multi-user systems. Releasing unused memory from a large
> > CREATE INDEX will allow that memory to be swapped out, even if the brk
> > point can't be changed.
> 
> Say what?  It can get "swapped out" anyway, whether we free() it or not.
> 
> More to the point, though: I don't believe that the proposed patch is a
> good idea --- it does not reduce the peak sortmem use, which I think is
> the critical factor for a multiuser system, and what it does do is
> reduce the locality of access to the sort temp file during the merge
> phases.  That will definitely have some impact; maybe small, but some;
> and I don't see where the benefit comes in.

Do we have any info on how long the final phase of a sort typically
takes compared to the rest of the sort? If it can take a substantial
amount of time, then reducing the memory usage during that time will at
least allow the OS to use that memory for caching again. In the future,
if we have a better means of controlling sort memory usage, then freeing
the memory earlier would also put it back in the pool earlier, which
would benefit the multiple concurrent sorts case.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [PATCHES] [HACKERS] Automatically setting work_mem

2006-04-22 Thread Tom Lane
Simon Riggs <[EMAIL PROTECTED]> writes:
> I still do, for multi-user systems. Releasing unused memory from a large
> CREATE INDEX will allow that memory to be swapped out, even if the brk
> point can't be changed.

Say what?  It can get "swapped out" anyway, whether we free() it or not.

More to the point, though: I don't believe that the proposed patch is a
good idea --- it does not reduce the peak sortmem use, which I think is
the critical factor for a multiuser system, and what it does do is
reduce the locality of access to the sort temp file during the merge
phases.  That will definitely have some impact; maybe small, but some;
and I don't see where the benefit comes in.

regards, tom lane



Re: [PATCHES] [HACKERS] Automatically setting work_mem

2006-04-22 Thread Simon Riggs
On Fri, 2006-04-21 at 23:07 -0400, Bruce Momjian wrote:
> Where are we on this patch?

Well the patches work and have been performance tested, with results
posted. Again, the title of this thread doesn't precisely describe the
patch any longer.

The question is do people believe there is benefit in reducing the
amount of memory for the final sort phase, and if so, to what level?

I still do, for multi-user systems. Releasing unused memory from a large
CREATE INDEX will allow that memory to be swapped out, even if the brk
point can't be changed. For large queries with multiple sorts the memory
can be reused immediately.

The patch does sound somewhat obscure and a corner case, I grant you,
but the more memory you give a sort the smaller the number of runs you
are likely to have. So the situation of having enough memory to merge,
say, 500 runs while actually having fewer than 10 runs is IMHO the
common case.

Patch now is: "Reducing memory usage in sort final merge phase."

[I've also completed a Cascade Merge sort implementation, ready for unit
testing, but will not be finishing that work for a few weeks yet.]

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com/




Re: [PATCHES] [HACKERS] Automatically setting work_mem

2006-04-21 Thread Bruce Momjian

Where are we on this patch?

---

Simon Riggs wrote:
> On Tue, 2006-03-21 at 17:47 -0500, Tom Lane wrote:
> 
> > I'm fairly unconvinced about Simon's underlying premise --- that we
> > can't make good use of work_mem in sorting after the run building phase
> > --- anyway.  
> 
> We can make good use of memory, but there does come a point in final
> merging where too much is of no further benefit. That point seems to be
> at about 256 blocks per tape; patch enclosed for testing. (256 blocks
> per tape roughly doubles performance over 32 blocks at that stage).
> 
> That is never the case during run building - more is always better.
> 
> > If we cut back our memory usage 
> Simon inserts the words: "too far"
> > then we'll be forcing a
> > significantly more-random access pattern to the temp file(s) during
> > merging, because we won't be able to pre-read as much at a time.
> 
> Yes, that's right.
> 
> If we have 512MB of memory, that gives us enough for 2000 tapes, yet the
> initial run building might only produce a few runs. There's just no way
> that all 512MB of memory is needed to optimise the performance of reading
> in a few tapes at the time of the final merge.
> 
> I'm suggesting we always keep 2MB per active tape, or the full
> allocation, whichever is lower. In the above example that could release
> over 500MB of memory, which more importantly can be reused by subsequent
> sorts if/when they occur.
> 
> 
> Enclose two patches:
> 1. mergebuffers.patch allows measurement of the effects of different
> merge buffer sizes, current default=32
> 
> 2. reassign2.patch which implements the two kinds of resource
> deallocation/reassignment proposed.
> 
> Best Regards, Simon Riggs
> 

[ Attachment, skipping... ]

[ Attachment, skipping... ]


-- 
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [PATCHES] [HACKERS] Automatically setting work_mem

2006-03-25 Thread Jim C. Nasby
On Sat, Mar 25, 2006 at 12:24:00PM +, Simon Riggs wrote:
> memory. Using too much memory could also impact overall elapsed time
> when we have concurrent users, so the question is should we optimise
> resources for the multi-user case or for the single user case? Where is
> the right balance point? 

Sounds like what we need is a GUC... I know I certainly have cases where
I'll take faster at the cost of more memory over the alternative.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [PATCHES] [HACKERS] Automatically setting work_mem

2006-03-25 Thread Simon Riggs
On Wed, 2006-03-22 at 10:03 +, Simon Riggs wrote:

> Recent test results show that with a 512MB test sort we can reclaim
> 97% of memory during final merge with only a noise-level (+2%)
> increase in overall elapsed time. (That's just an example; your mileage
> may vary.) So a large query would use and keep about 536MB memory
> rather than 1536MB.

Large performance test output, credit to Ayush Parashar, Greenplum.

We test a very common case for large sorts with high work_mem: high
work_mem significantly reduces the number of runs required while also
significantly increasing MaxTapes, so the situation Nruns << MaxTapes
arises frequently; this patch seeks to optimise the final merge (only)
for that case.
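
To give a feel for the arithmetic behind Nruns << MaxTapes, here is a
rough sketch only, using the MERGE_BUFFER_SIZE and TAPE_BUFFER_OVERHEAD
constants visible in the patch hunks quoted elsewhere in the thread
rather than the exact tuplesort.c formula; the 4GB input size is
hypothetical:

#include <stdio.h>

#define BLCKSZ                8192              /* default PostgreSQL block size */
#define MERGE_BUFFER_SIZE     (BLCKSZ * 32)     /* pre-patch per-tape preread buffer */
#define TAPE_BUFFER_OVERHEAD  (BLCKSZ * 3)

int main(void)
{
    long long work_mem   = 512LL * 1024 * 1024;         /* the 512MB test sort */
    long long input_size = 4LL * 1024 * 1024 * 1024;    /* hypothetical 4GB of input */

    /* Roughly how many tapes the allocation can support: */
    long long max_tapes = work_mem / (MERGE_BUFFER_SIZE + TAPE_BUFFER_OVERHEAD);

    /* Replacement selection produces runs averaging about 2 * work_mem: */
    long long nruns = (input_size + 2 * work_mem - 1) / (2 * work_mem);

    printf("MaxTapes ~ %lld, Nruns ~ %lld\n", max_tapes, nruns);  /* ~1872 vs ~4 */
    return 0;
}

With those numbers the sort could merge well over a thousand runs at
once, yet the test below produced only 8.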

            elapsed   final merge   CPU for final merge
with patch  385 s     100.65 s      5.48s/71.05u s
w/o patch   377 s     84.73 s       4.79s/72.32u s

So looking at just the final merge in isolation we have a 19% increase
in elapsed time from a 97% reduction in memory usage (based upon the
assumption that reducing available slots by 97% will lead to an overall
97% reduction in memory usage from slots+tuples). This uses an earlier
result that the optimal merge buffer size for the final merge is 8 times
larger than the overall optimal merge buffer size of 32 blocks; altering
this ratio would bring down elapsed time at the cost of increasing
memory. Using too much memory could also impact overall elapsed time
when we have concurrent users, so the question is should we optimise
resources for the multi-user case or for the single user case? Where is
the right balance point? 

Resource usage: (memory in use) multiplied by (time in use)
with patch: 147,000 MB.secs (512 MB for 285s, then 15MB for 100s)
w/o patch:  189,000 MB.secs (512 MB for 377s)
so overall resource consumption is reduced to 77% of current usage, or,
put the other way, 45% additional users on a throughput basis.

Increase in final merge time is likely due to increased I/O. If this
final merge were input to other nodes in a complex query we may not
consume the tuples at maximum speed, so the additional time might easily
be covered by other actions.

Non-final-merge test results were within 3% of each other; the patch
doesn't touch that aspect at all, so from that we can say that the test
results are a reasonably useful comparison.
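
For a feel of what the "shrinking resources" step in the log below
amounts to, a minimal standalone C sketch (illustrative only, not the
patch itself; the slot counts are the ones from the log line, and
SortSlot is a hypothetical stand-in for the real sort slot array
entries):

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for one in-memory sort slot. */
typedef struct { void *tuple; int tupindex; } SortSlot;

int main(void)
{
    size_t run_building_slots = 4194304;   /* slots held during run building */
    size_t final_merge_slots  = 146686;    /* ~3%: enough for the 8-way final merge */

    SortSlot *slots = malloc(run_building_slots * sizeof(SortSlot));
    if (slots == NULL)
        return 1;

    /* ... run building fills and drains the array here ... */

    /* Once performsort is done, shrink the array to what the final merge
     * needs; the released memory becomes available to other sorts. */
    SortSlot *shrunk = realloc(slots, final_merge_slots * sizeof(SortSlot));
    if (shrunk != NULL)
        slots = shrunk;

    printf("kept %zu of %zu slots (%.1f%%)\n",
           final_merge_slots, run_building_slots,
           100.0 * final_merge_slots / run_building_slots);
    free(slots);
    return 0;
}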

- - - -

With patch:

LOG:  switching to external sort with 1831 tapes: CPU 2.86s/1.96u sec elapsed 7.58 sec
LOG:  finished writing run 1 to tape 0: CPU 7.36s/27.67u sec elapsed 42.05 sec
LOG:  finished writing run 2 to tape 1: CPU 12.55s/56.85u sec elapsed 79.78 sec
LOG:  finished writing run 3 to tape 2: CPU 17.88s/86.42u sec elapsed 120.94 sec
LOG:  finished writing run 4 to tape 3: CPU 23.06s/116.46u sec elapsed 159.06 sec
LOG:  finished writing run 5 to tape 4: CPU 28.57s/146.25u sec elapsed 201.59 sec
LOG:  finished writing run 6 to tape 5: CPU 33.76s/176.14u sec elapsed 239.87 sec
LOG:  performsort starting: CPU 38.13s/200.71u sec elapsed 272.83 sec
LOG:  finished writing run 7 to tape 6: CPU 38.23s/204.51u sec elapsed 276.76 sec
LOG:  finished writing final run 8 to tape 7: CPU 38.50s/211.93u sec elapsed 284.51 sec
LOG:  shrinking resources to 3% (from 4194304 to 146686 slots): CPU 38.52s/211.93u sec elapsed 284.69 sec
LOG:  performsort done (except 8-way final merge): CPU 38.53s/212.00u sec elapsed 284.85 sec
LOG:  final merge: tape 7 exhausted: CPU 42.70s/270.65u sec elapsed 368.06 sec
LOG:  reassigning resources; each tape gets: +2619 slots, +6770980 mem: CPU 42.70s/270.70u sec elapsed 368.12 sec
LOG:  final merge: tape 2 exhausted: CPU 43.68s/283.05u sec elapsed 385.00 sec
LOG:  final merge: tape 3 exhausted: CPU 43.68s/283.05u sec elapsed 385.00 sec
LOG:  final merge: tape 5 exhausted: CPU 43.68s/283.05u sec elapsed 385.00 sec
LOG:  final merge: tape 0 exhausted: CPU 43.68s/283.05u sec elapsed 385.00 sec
LOG:  final merge: tape 6 exhausted: CPU 43.68s/283.05u sec elapsed 385.00 sec
LOG:  final merge: tape 1 exhausted: CPU 43.68s/283.05u sec elapsed 385.00 sec
LOG:  final merge: tape 4 exhausted: CPU 43.68s/283.05u sec elapsed 385.00 sec
LOG:  external sort ended, 293182 disk blocks used: CPU 44.01s/283.05u sec elapsed 385.50 sec

Without patch:

LOG:  switching to external sort with 1873 tapes: CPU 2.72s/2.03u sec elapsed 7.07 sec
LOG:  finished writing run 1 to tape 0: CPU 7.08s/28.42u sec elapsed 39.96 sec
LOG:  finished writing run 2 to tape 1: CPU 12.10s/58.47u sec elapsed 79.37 sec
LOG:  finished writing run 3 to tape 2: CPU 17.35s/89.39u sec elapsed 120.18 sec
LOG:  finished writing run 4 to tape 3: CPU 22.50s/120.55u sec elapsed 161.24 sec
LOG:  finished writing run 5 to tape 4: CPU 27.84s/151.41u sec elapsed 202.11 sec
LOG:  finished writing run 6 to tape 5: CPU 33.15s/182.57u sec elapsed 243.34 sec
LOG:  performsort starting: CPU 37.53s/208.36u sec elapsed 277.51 sec
LOG:  finished writing run 7

Re: [PATCHES] [HACKERS] Automatically setting work_mem

2006-03-22 Thread Simon Riggs
On Wed, 2006-03-22 at 07:48 +, Simon Riggs wrote:
> On Tue, 2006-03-21 at 17:47 -0500, Tom Lane wrote:
> 
> > I'm fairly unconvinced about Simon's underlying premise --- that we
> > can't make good use of work_mem in sorting after the run building phase
> > --- anyway.  
> 
> We can make good use of memory, but there does come a point in final
> merging where too much is of no further benefit. That point seems to be
> at about 256 blocks per tape; patch enclosed for testing. (256 blocks
> per tape roughly doubles performance over 32 blocks at that stage).
> 
> That is never the case during run building - more is always better.
> 
> > If we cut back our memory usage 
> Simon inserts the words: "too far"
> > then we'll be forcing a
> > significantly more-random access pattern to the temp file(s) during
> > merging, because we won't be able to pre-read as much at a time.
> 
> Yes, that's right.
> 
> If we have 512MB of memory, that gives us enough for 2000 tapes, yet the
> initial run building might only produce a few runs. There's just no way
> that all 512MB of memory is needed to optimise the performance of reading
> in a few tapes at the time of the final merge.
> 
> I'm suggesting we always keep 2MB per active tape, or the full
> allocation, whichever is lower. In the above example that could release
> over 500MB of memory, which more importantly can be reused by subsequent
> sorts if/when they occur.
> 
> 
> Enclose two patches:
> 1. mergebuffers.patch allows measurement of the effects of different
> merge buffer sizes, current default=32
> 
> 2. reassign2.patch which implements the two kinds of resource
> deallocation/reassignment proposed.

Missed a couple of minor points in the patch: reassign3.patch is attached
to completely replace reassign2.patch.

Recent test results show that with a 512MB test sort we can reclaim 97%
of memory during final merge with only a noise-level (+2%) increase in
overall elapsed time. (That's just an example; your mileage may vary.) So
a large query would use and keep about 536MB memory rather than 1536MB.

Best Regards, Simon Riggs
Index: src/backend/utils/sort/tuplesort.c
===
RCS file: /projects/cvsroot/pgsql/src/backend/utils/sort/tuplesort.c,v
retrieving revision 1.65
diff -c -r1.65 tuplesort.c
*** src/backend/utils/sort/tuplesort.c	10 Mar 2006 23:19:00 -	1.65
--- src/backend/utils/sort/tuplesort.c	22 Mar 2006 09:34:58 -
***
*** 179,186 
   */
  #define MINORDER		6		/* minimum merge order */
  #define TAPE_BUFFER_OVERHEAD		(BLCKSZ * 3)
! #define MERGE_BUFFER_SIZE			(BLCKSZ * 32)
! 
  /*
   * Private state of a Tuplesort operation.
   */
--- 179,187 
   */
  #define MINORDER		6		/* minimum merge order */
  #define TAPE_BUFFER_OVERHEAD		(BLCKSZ * 3)
! #define OPTIMAL_MERGE_BUFFER_SIZE	(BLCKSZ * 32)
! #define PREFERRED_MERGE_BUFFER_SIZE (BLCKSZ * 256)
! #define REUSE_SPACE_LIMIT   RELSEG_SIZE
  /*
   * Private state of a Tuplesort operation.
   */
***
*** 255,260 
--- 256,270 
  	 */
  	int			currentRun;
  
+ /*
+  * These variables are used during final merge to reassign resources
+  * as they become available for each tape
+  */
+ int lastPrereadTape;/* last tape preread from */
+ int numPrereads;/* num times last tape has been selected */
+ int reassignableSlots;  /* how many slots can be reassigned */
+ longreassignableMem;/* how much memory can be reassigned */
+ 
  	/*
  	 * Unless otherwise noted, all pointer variables below are pointers
  	 * to arrays of length maxTapes, holding per-tape data.
***
*** 294,299 
--- 304,310 
  	int		   *tp_runs;		/* # of real runs on each tape */
  	int		   *tp_dummy;		/* # of dummy runs for each tape (D[]) */
  	int		   *tp_tapenum;		/* Actual tape numbers (TAPE[]) */
+ 
  	int			activeTapes;	/* # of active input tapes in merge pass */
  
  	/*
***
*** 398,408 
--- 409,423 
  
  static Tuplesortstate *tuplesort_begin_common(int workMem, bool randomAccess);
  static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
+ static void grow_memtuples(Tuplesortstate *state);
+ static void shrink_memtuples(Tuplesortstate *state);
  static void inittapes(Tuplesortstate *state);
  static void selectnewtape(Tuplesortstate *state);
  static void mergeruns(Tuplesortstate *state);
  static void mergeonerun(Tuplesortstate *state);
  static void beginmerge(Tuplesortstate *state);
+ static void assignResourcesUniformly(Tuplesortstate *state, bool initialAssignment);
+ static void reassignresources(Tuplesortstate *state, int srcTape);
  static void mergepreread(Tuplesortstate *state);
  static void mergeprereadone(Tuplesortstate *state, int srcTape);
  static void dumptuples(Tuplesortstate *state, bool alltuples);
***
*** 727,733 
   * moves around with tuple addition/removal, this might result in thrashing.
  

Re: [PATCHES] [HACKERS] Automatically setting work_mem

2006-03-21 Thread Simon Riggs
On Tue, 2006-03-21 at 17:47 -0500, Tom Lane wrote:

> I'm fairly unconvinced about Simon's underlying premise --- that we
> can't make good use of work_mem in sorting after the run building phase
> --- anyway.  

We can make good use of memory, but there does come a point in final
merging where too much is of no further benefit. That point seems to be
at about 256 blocks per tape; patch enclosed for testing. (256 blocks
per tape roughly doubles performance over 32 blocks at that stage).

That is never the case during run building - more is always better.

> If we cut back our memory usage 
Simon inserts the words: "too far"
> then we'll be forcing a
> significantly more-random access pattern to the temp file(s) during
> merging, because we won't be able to pre-read as much at a time.

Yes, that's right.

If we have 512MB of memory, that gives us enough for 2000 tapes, yet the
initial run building might only produce a few runs. There's just no way
that all 512MB of memory is needed to optimise the performance of reading
in a few tapes at the time of the final merge.

I'm suggesting we always keep 2MB per active tape, or the full
allocation, whichever is lower. In the above example that could release
over 500MB of memory, which more importantly can be reused by subsequent
sorts if/when they occur.
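
A rough model of that rule (a sketch only; the constants are
PREFERRED_MERGE_BUFFER_SIZE and TAPE_BUFFER_OVERHEAD from the attached
patch, and the 8 tapes are just an example of "a few runs"):

#include <stdio.h>

#define BLCKSZ                      8192
#define TAPE_BUFFER_OVERHEAD        (BLCKSZ * 3)
#define PREFERRED_MERGE_BUFFER_SIZE (BLCKSZ * 256)   /* ~2MB per tape */

/* Each active tape keeps its preferred ~2MB buffer, or an equal share of
 * the allocation if that is smaller. */
static long keep_per_tape(long allowed_mem, int active_tapes)
{
    long share = allowed_mem / active_tapes;
    long want  = PREFERRED_MERGE_BUFFER_SIZE + TAPE_BUFFER_OVERHEAD;
    return (want < share) ? want : share;
}

int main(void)
{
    long allowed = 512L * 1024 * 1024;     /* the 512MB example above */
    int  tapes   = 8;                      /* "a few runs", e.g. 8 */
    long kept    = (long) tapes * keep_per_tape(allowed, tapes);

    printf("keep ~%ld MB of %ld MB; ~%ld MB released\n",
           kept / (1024 * 1024), allowed / (1024 * 1024),
           (allowed - kept) / (1024 * 1024));
    return 0;
}

For this example that works out to roughly 16MB kept, with the rest of
the allocation released.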


Enclose two patches:
1. mergebuffers.patch allows measurement of the effects of different
merge buffer sizes, current default=32

2. reassign2.patch which implements the two kinds of resource
deallocation/reassignment proposed.

Best Regards, Simon Riggs

Index: src/backend/utils/sort/tuplesort.c
===
RCS file: /projects/cvsroot/pgsql/src/backend/utils/sort/tuplesort.c,v
retrieving revision 1.65
diff -c -r1.65 tuplesort.c
*** src/backend/utils/sort/tuplesort.c	10 Mar 2006 23:19:00 -	1.65
--- src/backend/utils/sort/tuplesort.c	21 Mar 2006 19:20:23 -
***
*** 179,186 
   */
  #define MINORDER		6		/* minimum merge order */
  #define TAPE_BUFFER_OVERHEAD		(BLCKSZ * 3)
! #define MERGE_BUFFER_SIZE			(BLCKSZ * 32)
! 
  /*
   * Private state of a Tuplesort operation.
   */
--- 179,187 
   */
  #define MINORDER		6		/* minimum merge order */
  #define TAPE_BUFFER_OVERHEAD		(BLCKSZ * 3)
! #define OPTIMAL_MERGE_BUFFER_SIZE	(BLCKSZ * 32)
! #define PREFERRED_MERGE_BUFFER_SIZE (BLCKSZ * 256)
! #define REUSE_SPACE_LIMIT   RELSEG_SIZE
  /*
   * Private state of a Tuplesort operation.
   */
***
*** 255,260 
--- 256,270 
  	 */
  	int			currentRun;
  
+ /*
+  * These variables are used during final merge to reassign resources
+  * as they become available for each tape
+  */
+ int lastPrereadTape;/* last tape preread from */
+ int numPrereads;/* num times last tape has been selected */
+ int reassignableSlots;  /* how many slots can be reassigned */
+ longreassignableMem;/* how much memory can be reassigned */
+ 
  	/*
  	 * Unless otherwise noted, all pointer variables below are pointers
  	 * to arrays of length maxTapes, holding per-tape data.
***
*** 398,408 
--- 408,422 
  
  static Tuplesortstate *tuplesort_begin_common(int workMem, bool randomAccess);
  static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
+ static void grow_memtuples(Tuplesortstate *state);
+ static void shrink_memtuples(Tuplesortstate *state);
  static void inittapes(Tuplesortstate *state);
  static void selectnewtape(Tuplesortstate *state);
  static void mergeruns(Tuplesortstate *state);
  static void mergeonerun(Tuplesortstate *state);
  static void beginmerge(Tuplesortstate *state);
+ static void assignResourcesUniformly(Tuplesortstate *state, bool initialAssignment);
+ static void reassignresources(Tuplesortstate *state, int srcTape);
  static void mergepreread(Tuplesortstate *state);
  static void mergeprereadone(Tuplesortstate *state, int srcTape);
  static void dumptuples(Tuplesortstate *state, bool alltuples);
***
*** 727,733 
   * moves around with tuple addition/removal, this might result in thrashing.
   * Small increases in the array size are likely to be pretty inefficient.
   */
! static bool
  grow_memtuples(Tuplesortstate *state)
  {
  	/*
--- 741,747 
   * moves around with tuple addition/removal, this might result in thrashing.
   * Small increases in the array size are likely to be pretty inefficient.
   */
! static void
  grow_memtuples(Tuplesortstate *state)
  {
  	/*
***
*** 740,752 
  	 * this assumption should be good.  But let's check it.)
  	 */
  	if (state->availMem <= (long) (state->memtupsize * sizeof(SortTuple)))
! 		return false;
  	/*
  	 * On a 64-bit machine, allowedMem could be high enough to get us into
  	 * trouble with MaxAllocSize, too.
  	 */
  	if ((Size) (state->memtupsize * 2) >= MaxAllocSize / sizeof(SortTuple))
! 		return false;
  
  	FREEMEM(state, GetMemoryChunkS