Re: [HACKERS] Automatically setting work_mem

2006-04-23 Thread Bruno Wolff III
On Sat, Apr 22, 2006 at 14:20:32 -0700,
  daveg [EMAIL PROTECTED] wrote:
 On Sat, Apr 22, 2006 at 01:49:25PM -0700, David Fetter wrote:
  On Sat, Apr 22, 2006 at 01:14:42PM -0700, David Gould wrote:
  
   To avoid running out of swap and triggering the oom killer we have
   had to reduce work_mem below what we prefer.
  
  Dunno about your work_mem, but you can make sure the OOM killer
  doesn't kill you as follows http://lwn.net/Articles/104185/.
 
 Or I could run with overcommit turned off, but we like overcommit because
 things like vacuum appear to allocate maintenance_work_mem when they start, so
 if that is set at, say, 100MB it will allocate 100MB even to vacuum a 2
 page table. Overcommit lets this sort of thing get by without creating
 a need for even more swap.

I would expect that you would still come out ahead committing some disk
space to swap that will probably never be used, if doing so allows you to
better configure your memory usage.



Re: [PATCHES] [HACKERS] Automatically setting work_mem

2006-04-22 Thread Simon Riggs
On Fri, 2006-04-21 at 23:07 -0400, Bruce Momjian wrote:
 Where are we on this patch?

Well, the patches work and have been performance tested, with results
posted. Again, the title of this thread no longer precisely describes the
patch.

The question is do people believe there is benefit in reducing the
amount of memory for the final sort phase, and if so, to what level?

I still do, for multi-user systems. Releasing unused memory from a large
CREATE INDEX will allow that memory to be swapped out, even if the brk
point can't be changed. For large queries with multiple sorts the memory
can be reused immediately.

The patch does sound somewhat obscure and a corner case, I grant you, but
the more memory you give a sort the fewer runs you are likely to have. So
the situation of having enough memory to merge, say, 500 runs while
actually having fewer than 10 runs is IMHO the common case.

Patch now is: Reducing memory usage in sort final merge phase.

[I've also got Cascade Merge sort completed and ready for unit testing,
but won't be finishing that work for a few weeks yet.]

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com/




Re: [PATCHES] [HACKERS] Automatically setting work_mem

2006-04-22 Thread Jim C. Nasby
On Sat, Apr 22, 2006 at 01:17:08PM -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  I still do, for multi-user systems. Releasing unused memory from a large
  CREATE INDEX will allow that memory to be swapped out, even if the brk
  point can't be changed.
 
 Say what?  It can get swapped out anyway, whether we free() it or not.
 
 More to the point, though: I don't believe that the proposed patch is a
 good idea --- it does not reduce the peak sortmem use, which I think is
 the critical factor for a multiuser system, and what it does do is
 reduce the locality of access to the sort temp file during the merge
 phases.  That will definitely have some impact; maybe small, but some;
 and I don't see where the benefit comes in.

Do we have any info on how long the final phase of a sort typically
takes compared to the rest of the sort? If it can take a substantial
amount of time, then reducing the memory usage during that time will at
least allow the OS to use that memory for caching again. In the future,
if we have a better means of controlling sort memory usage, then freeing
the memory earlier would also put it back in the pool earlier, which
would benefit the multiple concurrent sorts case.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [PATCHES] [HACKERS] Automatically setting work_mem

2006-04-22 Thread Simon Riggs
On Sat, 2006-04-22 at 13:17 -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  I still do, for multi-user systems. Releasing unused memory from a large
  CREATE INDEX will allow that memory to be swapped out, even if the brk
  point can't be changed.
 
 Say what?  It can get swapped out anyway, whether we free() it or not.

Of course it can, but if the memory is not actively used by the sort then
being swapped out is both harmless and fairly likely. If we actively use
the memory for the sort, it is less likely to be swapped out, and it would
be a bad thing if it were.

 More to the point, though: I don't believe that the proposed patch is a
 good idea --- it does not reduce the peak sortmem use, which I think is
 the critical factor for a multiuser system, 

I agree peak memory use is the critical factor. There is only one
performsort in progress at any one time, though there can be many final
merges/retrievals in progress concurrently. If the majority of the
memory used by performsort is released afterwards then it can be made
available for subsequent sorts/hashes etc. without further increasing the
peak memory use.

 and what it does do is
 reduce the locality of access to the sort temp file during the merge
 phases.  That will definitely have some impact; maybe small, but some;
 and I don't see where the benefit comes in.

That I already accept.

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com/




Re: [PATCHES] [HACKERS] Automatically setting work_mem

2006-04-22 Thread daveg
On Sat, Apr 22, 2006 at 06:38:53PM +0100, Simon Riggs wrote:
 On Sat, 2006-04-22 at 13:17 -0400, Tom Lane wrote:
  Simon Riggs [EMAIL PROTECTED] writes:
   I still do, for multi-user systems. Releasing unused memory from a large
   CREATE INDEX will allow that memory to be swapped out, even if the brk
   point can't be changed.
  
  Say what?  It can get swapped out anyway, whether we free() it or not.
 
 Of course it can, but if the memory is not actively used by the sort
 then being swapped out is both harmless and fairly likely. If we
 actively use the memory for the sort, it is less likely to be
 swapped out, and it would be a bad thing if it were.
 
  More to the point, though: I don't believe that the proposed patch is a
  good idea --- it does not reduce the peak sortmem use, which I think is
  the critical factor for a multiuser system, 
 
 I agree peak memory use is the critical factor. There is only one
 performsort in progress at any one time, though there can be many final
 merges/retrievals in progress concurrently. If the majority of the
 memory used by performsort is released afterwards then it can be made
 available for subsequent sorts/hashes etc. without further increasing the
 peak memory use.
 
  and what it does do is
  reduce the locality of access to the sort temp file during the merge
  phases.  That will definitely have some impact; maybe small, but some;
  and I don't see where the benefit comes in.
 
 That I already accept.

I'd like to add a user perspective: we run dual Opteron servers with
16GB of memory and 16GB of swap. When we are busy we can have 20 to
30 substantial queries running at one time. It is very common for us to
have several sorts and also hash joins running concurrently, some for a
minute or two, some for much longer.  To avoid running out of swap and
triggering the OOM killer we have had to reduce work_mem below what we
prefer.

We could add more swap, but at some point this has diminishing returns.
The proposed patch seems as if it would be helpful in our situation.

-dg

-- 
David Gould  [EMAIL PROTECTED]
If simplicity worked, the world would be overrun with insects.



Re: [PATCHES] [HACKERS] Automatically setting work_mem

2006-04-22 Thread David Fetter
On Sat, Apr 22, 2006 at 01:14:42PM -0700, David Gould wrote:

 To avoid running out of swap and triggering the oom killer we have
 had to reduce work_mem below what we prefer.

Dunno about your work_mem, but you can make sure the OOM killer
doesn't kill you as described here: http://lwn.net/Articles/104185/.

HTH :)

Cheers,
D
-- 
David Fetter [EMAIL PROTECTED] http://fetter.org/
phone: +1 415 235 3778AIM: dfetter666
  Skype: davidfetter

Remember to vote!



Re: [PATCHES] [HACKERS] Automatically setting work_mem

2006-04-22 Thread daveg
On Sat, Apr 22, 2006 at 01:49:25PM -0700, David Fetter wrote:
 On Sat, Apr 22, 2006 at 01:14:42PM -0700, David Gould wrote:
 
  To avoid running out of swap and triggering the oom killer we have
  had to reduce work_mem below what we prefer.
 
 Dunno about your work_mem, but you can make sure the OOM killer
 doesn't kill you as follows http://lwn.net/Articles/104185/.

Or I could run with overcommit turned off, but we like overcommit because
things like vacuum appear to allocate maintenance_work_mem when they start, so
if that is set at, say, 100MB it will allocate 100MB even to vacuum a 2
page table. Overcommit lets this sort of thing get by without creating
a need for even more swap.

-dg


-- 
David Gould  [EMAIL PROTECTED]
If simplicity worked, the world would be overrun with insects.



Re: [PATCHES] [HACKERS] Automatically setting work_mem

2006-04-21 Thread Bruce Momjian

Where are we on this patch?

---

Simon Riggs wrote:
 On Tue, 2006-03-21 at 17:47 -0500, Tom Lane wrote:
 
  I'm fairly unconvinced about Simon's underlying premise --- that we
  can't make good use of work_mem in sorting after the run building phase
  --- anyway.  
 
 We can make good use of memory, but there does come a point in final
 merging where too much is of no further benefit. That point seems to be
 at about 256 blocks per tape; patch enclosed for testing. (256 blocks
 per tape roughly doubles performance over 32 blocks at that stage).
 
 That is never the case during run building - more is always better.
 
  If we cut back our memory usage 
 Simon inserts the words: too far
  then we'll be forcing a
  significantly more-random access pattern to the temp file(s) during
  merging, because we won't be able to pre-read as much at a time.
 
 Yes, that's right.
 
 If we have 512MB of memory that gives us enough for 2000 tapes, yet the
 initial run building might only produce a few runs. There's just no way
 that all 512MB of memory is needed to optimise the performance of reading
 in a few tapes at the time of the final merge.
 
 I'm suggesting we always keep 2MB per active tape, or the full
 allocation, whichever is lower. In the above example that could release
 over 500MB of memory, which more importantly can be reused by subsequent
 sorts if/when they occur.
 
 
 Enclosed are two patches:
 1. mergebuffers.patch allows measurement of the effects of different
 merge buffer sizes, current default=32
 
 2. reassign2.patch which implements the two kinds of resource
 deallocation/reassignment proposed.
 
 Best Regards, Simon Riggs
 

[ Attachment, skipping... ]

[ Attachment, skipping... ]

 

-- 
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [PATCHES] [HACKERS] Automatically setting work_mem

2006-03-25 Thread Simon Riggs
On Wed, 2006-03-22 at 10:03 +, Simon Riggs wrote:

 Recent test results show that with a 512MB test sort we can reclaim
 97% of memory during the final merge with only a noise-level (+2%)
 increase in overall elapsed time. (That's just an example, your mileage
 may vary). So a large query would use and keep about 536MB of memory
 rather than 1536MB.

Large performance test output, credit to Ayush Parashar, Greenplum.

We test a very common case for large sorts with high work_mem: high
work_mem significantly reduces the number of runs required, whereas it
significantly increases MaxTapes, so we will frequently have
Nruns < MaxTapes; this patch seeks to optimise the final merge (only) for
that case.

            elapsed   final merge   CPU for final merge
with patch  385 s     100.65 s      5.48s/71.05u s
w/o patch   377 s     84.73 s       4.79s/72.32u s

So looking at just the final merge in isolation we have a 19% increase
in elapsed time from a 97% reduction in memory usage (based upon the
assumption that reducing available slots by 97% will lead to an overall
97% reduction in memory usage from slots+tuples). This uses an earlier
result that the optimal merge buffer size for the final merge is 8 times
larger than the overall optimal merge buffer size of 32 blocks; altering
this ratio would bring down elapsed time at the cost of increasing
memory. Using too much memory could also impact overall elapsed time
when we have concurrent users, so the question is: should we optimise
resources for the multi-user case or for the single-user case? Where is
the right balance point?

Resource usage: (resource usage) multiplied by (time in use)
with patch: 147,000 MB.secs (512MB for 285s, then 15MB for 100s)
w/o patch:  189,000 MB.secs (512MB for 377s)
so overall resource consumption is reduced to 77% of current usage, or,
put the other way, 45% additional users on a throughput basis.

Increase in final merge time is likely due to increased I/O. If this
final merge were input to other nodes in a complex query we may not
consume the tuples at maximum speed, so the additional time might easily
be covered by other actions.

Non-final-merge test results were within 3% of each other; the patch
doesn't touch that aspect at all, so from that we can say that the test
results are a reasonably useful comparison.

- - - -

With patch:

LOG:  switching to external sort with 1831 tapes: CPU 2.86s/1.96u sec elapsed 7.58 sec
LOG:  finished writing run 1 to tape 0: CPU 7.36s/27.67u sec elapsed 42.05 sec
LOG:  finished writing run 2 to tape 1: CPU 12.55s/56.85u sec elapsed 79.78 sec
LOG:  finished writing run 3 to tape 2: CPU 17.88s/86.42u sec elapsed 120.94 sec
LOG:  finished writing run 4 to tape 3: CPU 23.06s/116.46u sec elapsed 159.06 sec
LOG:  finished writing run 5 to tape 4: CPU 28.57s/146.25u sec elapsed 201.59 sec
LOG:  finished writing run 6 to tape 5: CPU 33.76s/176.14u sec elapsed 239.87 sec
LOG:  performsort starting: CPU 38.13s/200.71u sec elapsed 272.83 sec
LOG:  finished writing run 7 to tape 6: CPU 38.23s/204.51u sec elapsed 276.76 sec
LOG:  finished writing final run 8 to tape 7: CPU 38.50s/211.93u sec elapsed 284.51 sec
LOG:  shrinking resources to 3% (from 4194304 to 146686 slots): CPU 38.52s/211.93u sec elapsed 284.69 sec
LOG:  performsort done (except 8-way final merge): CPU 38.53s/212.00u sec elapsed 284.85 sec
LOG:  final merge: tape 7 exhausted: CPU 42.70s/270.65u sec elapsed 368.06 sec
LOG:  reassigning resources; each tape gets: +2619 slots, +6770980 mem: CPU 42.70s/270.70u sec elapsed 368.12 sec
LOG:  final merge: tape 2 exhausted: CPU 43.68s/283.05u sec elapsed 385.00 sec
LOG:  final merge: tape 3 exhausted: CPU 43.68s/283.05u sec elapsed 385.00 sec
LOG:  final merge: tape 5 exhausted: CPU 43.68s/283.05u sec elapsed 385.00 sec
LOG:  final merge: tape 0 exhausted: CPU 43.68s/283.05u sec elapsed 385.00 sec
LOG:  final merge: tape 6 exhausted: CPU 43.68s/283.05u sec elapsed 385.00 sec
LOG:  final merge: tape 1 exhausted: CPU 43.68s/283.05u sec elapsed 385.00 sec
LOG:  final merge: tape 4 exhausted: CPU 43.68s/283.05u sec elapsed 385.00 sec
LOG:  external sort ended, 293182 disk blocks used: CPU 44.01s/283.05u sec elapsed 385.50 sec

Without patch:

LOG:  switching to external sort with 1873 tapes: CPU 2.72s/2.03u sec elapsed 7.07 sec
LOG:  finished writing run 1 to tape 0: CPU 7.08s/28.42u sec elapsed 39.96 sec
LOG:  finished writing run 2 to tape 1: CPU 12.10s/58.47u sec elapsed 79.37 sec
LOG:  finished writing run 3 to tape 2: CPU 17.35s/89.39u sec elapsed 120.18 sec
LOG:  finished writing run 4 to tape 3: CPU 22.50s/120.55u sec elapsed 161.24 sec
LOG:  finished writing run 5 to tape 4: CPU 27.84s/151.41u sec elapsed 202.11 sec
LOG:  finished writing run 6 to tape 5: CPU 33.15s/182.57u sec elapsed 243.34 sec
LOG:  performsort starting: CPU 37.53s/208.36u sec elapsed 277.51 sec
LOG:  finished writing run 7 to 

Re: [PATCHES] [HACKERS] Automatically setting work_mem

2006-03-25 Thread Jim C. Nasby
On Sat, Mar 25, 2006 at 12:24:00PM +, Simon Riggs wrote:
 memory. Using too much memory could also impact overall elapsed time
 when we have concurrent users, so the question is should we optimise
 resources for the multi-user case or for the single user case? Where is
 the right balance point? 

Sounds like what we need is a GUC... I know I certainly have cases where
I'll take faster execution using more memory over the alternative.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [PATCHES] [HACKERS] Automatically setting work_mem

2006-03-22 Thread Simon Riggs
On Wed, 2006-03-22 at 07:48 +, Simon Riggs wrote:
 On Tue, 2006-03-21 at 17:47 -0500, Tom Lane wrote:
 
  I'm fairly unconvinced about Simon's underlying premise --- that we
  can't make good use of work_mem in sorting after the run building phase
  --- anyway.  
 
 We can make good use of memory, but there does come a point in final
 merging where too much is of no further benefit. That point seems to be
 at about 256 blocks per tape; patch enclosed for testing. (256 blocks
 per tape roughly doubles performance over 32 blocks at that stage).
 
 That is never the case during run building - more is always better.
 
  If we cut back our memory usage 
 Simon inserts the words: too far
  then we'll be forcing a
  significantly more-random access pattern to the temp file(s) during
  merging, because we won't be able to pre-read as much at a time.
 
 Yes, that's right.
 
 If we have 512MB of memory that gives us enough for 2000 tapes, yet the
 initial run building might only produce a few runs. There's just no way
 that all 512MB of memory is needed to optimise the performance of reading
 in a few tapes at the time of the final merge.
 
 I'm suggesting we always keep 2MB per active tape, or the full
 allocation, whichever is lower. In the above example that could release
 over 500MB of memory, which more importantly can be reused by subsequent
 sorts if/when they occur.
 
 
 Enclosed are two patches:
 1. mergebuffers.patch allows measurement of the effects of different
 merge buffer sizes, current default=32
 
 2. reassign2.patch which implements the two kinds of resource
 deallocation/reassignment proposed.

Missed a couple of minor points in the patch: reassign3.patch is attached
to completely replace reassign2.patch.

Recent test results show that with a 512MB test sort we can reclaim 97%
of memory during the final merge with only a noise-level (+2%) increase in
overall elapsed time. (That's just an example, your mileage may vary). So
a large query would use and keep about 536MB of memory rather than 1536MB.

Best Regards, Simon Riggs
Index: src/backend/utils/sort/tuplesort.c
===
RCS file: /projects/cvsroot/pgsql/src/backend/utils/sort/tuplesort.c,v
retrieving revision 1.65
diff -c -r1.65 tuplesort.c
*** src/backend/utils/sort/tuplesort.c	10 Mar 2006 23:19:00 -	1.65
--- src/backend/utils/sort/tuplesort.c	22 Mar 2006 09:34:58 -
***
*** 179,186 
   */
  #define MINORDER		6		/* minimum merge order */
  #define TAPE_BUFFER_OVERHEAD		(BLCKSZ * 3)
! #define MERGE_BUFFER_SIZE			(BLCKSZ * 32)
! 
  /*
   * Private state of a Tuplesort operation.
   */
--- 179,187 
   */
  #define MINORDER		6		/* minimum merge order */
  #define TAPE_BUFFER_OVERHEAD		(BLCKSZ * 3)
! #define OPTIMAL_MERGE_BUFFER_SIZE	(BLCKSZ * 32)
! #define PREFERRED_MERGE_BUFFER_SIZE (BLCKSZ * 256)
! #define REUSE_SPACE_LIMIT   RELSEG_SIZE
  /*
   * Private state of a Tuplesort operation.
   */
***
*** 255,260 
--- 256,270 
  	 */
  	int			currentRun;
  
+ /*
+  * These variables are used during final merge to reassign resources
+  * as they become available for each tape
+  */
+ int lastPrereadTape;/* last tape preread from */
+ int numPrereads;/* num times last tape has been selected */
+ int reassignableSlots;  /* how many slots can be reassigned */
+ longreassignableMem;/* how much memory can be reassigned */
+ 
  	/*
  	 * Unless otherwise noted, all pointer variables below are pointers
  	 * to arrays of length maxTapes, holding per-tape data.
***
*** 294,299 
--- 304,310 
  	int		   *tp_runs;		/* # of real runs on each tape */
  	int		   *tp_dummy;		/* # of dummy runs for each tape (D[]) */
  	int		   *tp_tapenum;		/* Actual tape numbers (TAPE[]) */
+ 
  	int			activeTapes;	/* # of active input tapes in merge pass */
  
  	/*
***
*** 398,408 
--- 409,423 
  
  static Tuplesortstate *tuplesort_begin_common(int workMem, bool randomAccess);
  static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
+ static void grow_memtuples(Tuplesortstate *state);
+ static void shrink_memtuples(Tuplesortstate *state);
  static void inittapes(Tuplesortstate *state);
  static void selectnewtape(Tuplesortstate *state);
  static void mergeruns(Tuplesortstate *state);
  static void mergeonerun(Tuplesortstate *state);
  static void beginmerge(Tuplesortstate *state);
+ static void assignResourcesUniformly(Tuplesortstate *state, bool initialAssignment);
+ static void reassignresources(Tuplesortstate *state, int srcTape);
  static void mergepreread(Tuplesortstate *state);
  static void mergeprereadone(Tuplesortstate *state, int srcTape);
  static void dumptuples(Tuplesortstate *state, bool alltuples);
***
*** 727,733 
   * moves around with tuple addition/removal, this might result in thrashing.
   * Small increases in the array size are 

Re: [HACKERS] Automatically setting work_mem

2006-03-22 Thread Luke Lonergan
Tom,

On 3/21/06 3:06 PM, Tom Lane [EMAIL PROTECTED] wrote:

 The real problem we are facing with a whole lot of our optimization
 issues (not only sorting) is that it's not all that trivial to get
 credible experimental results that we can expect will hold up across
 a range of usage scenarios.

As proven by the qsort tests - point taken.

- Luke





Re: [HACKERS] Automatically setting work_mem

2006-03-21 Thread Simon Riggs
On Fri, 2006-03-17 at 09:46 -0500, Tom Lane wrote:
 Qingqing Zhou [EMAIL PROTECTED] writes:
  So what's the difference between these two strategy?
  (1) Running time: do they use the same amount of memory? Why option 2 is
  better than 1?
  (2) Idle time: after sort done, option 1 will return all 1024 to the OS and
  2 will still keep 512?
 
 Point 2 is actually a serious flaw in Simon's proposal, because there
 is no portable way to make malloc return freed memory to the OS.  Some
 mallocs will do that ... in some cases ... but many simply don't ever
 move the brk address down.  It's not an easy thing to do when the arena
 gets cluttered with a lot of different alloc chunks and only some of
 them get freed.

I'm aware of that objection and agree it's an issue...

One of the situations I am looking at is larger queries with multiple
sorts in them. I'm getting some reasonable results for the final merge even
after releasing lots of memory (say 50-90%). The memtuples array is not
required at all for randomAccess sorts (100% reduction).

The largest requirement for memory is the run building during
performsort. That portion of the code is not concurrently executed
within the same query. If we can reduce memory usage after that phase
completes, then we stand a chance of avoiding the situation where a big
query over-uses memory that we are then unable to reclaim.

So overall memory usage could be as low as work_mem + (numsorts * 0.1 *
work_mem), which is a lot less than numsorts * work_mem (e.g. with three
sorts, 130% of work_mem rather than 300%).

 So the semantics we'd have to adopt is that once a backend claims some
 shared work mem, it keeps it until process exit.  I don't think that
 makes the idea worthless, because there's usually a clear distinction
 between processes doing expensive stuff and processes doing cheap
 stuff.  But it's definitely a limitation.  

...Hopefully less so with mem reduction changes.

The other way is of course to allocate *all* sort/hash/big stuff space
out of shared memory and then let all the backends fight it out
(somehow...) to see who gets access to it. That way backends stay small
and we have well bounded memory usage.

 Also, if you've got a process
 doing expensive stuff, it's certainly possible to expect the user to
 just increase work_mem locally.

Doing that is fine, but you have to retune the system every time the
memory usage changes for any reason. So most of the time manual tuning
has to be very conservative to avoid getting it wrong.

 My own thoughts about the problems with our work_mem arrangement are
 that the real problem is the rule that we can allocate work_mem per sort
 or hash operation; this makes the actual total memory use per backend
 pretty unpredictable for nontrivial queries.  I don't know how to fix
 this though.  The planner needs to know the work_mem that will be used
 for any one of these operations in order to estimate costs, so simply
 trying to divide up work_mem among the operations of a completed plan
 tree is not going to improve matters.

We spent about 5 hours discussing that, and the best answer was "use
queuing".

= = = = 

Anyway, thinking just about sort, I've got the following concrete
suggestions (the first two of which are coded and tested):

1. I originally picked MERGE_BUFFER_SIZE at 32 blocks as a guess. Better
test results show that is indeed the optimum when we take into account
both intermediate and final merging. However, the preferred buffer size
would be about 256 blocks in the case that Nruns < Ntapes, i.e. when
work_mem is set high. In this case memory is reallocated to reduce the
overall usage after "performsort done" has happened.

2. When a tape runs out of tuples, its memory is reallocated to
remaining tapes to increase their I/O efficiency. This should help to
increase performance for smaller work_mem settings with large sorts, or
anything where the final merge Nruns is close to Ntapes. You can see
this occurring in the example below. The reallocation is either done
uniformly or all onto a single tape, depending upon the preread pattern.
This mostly doesn't occur with well sorted output, so there is little
overhead from doing this in the general case.

Right now, large sort performance is very good, whereas smaller sort
performance is still fairly bad.

3. We implement a new merge algorithm as Tom suggested...
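
To make point 2 concrete, here is a rough, self-contained C sketch of the
kind of reassignment described there: when a tape is exhausted during the
final merge, its preread memory is split uniformly across the tapes that
are still active. The function and variable names are illustrative only,
not the patch's actual code.

#include <stdbool.h>
#include <stdio.h>

/* Give an exhausted tape's preread memory back to the remaining tapes. */
static void
reassign_exhausted_tape(long *tapeMem, bool *tapeActive, int numTapes, int srcTape)
{
    int     remaining = 0;
    long    share;
    int     i;

    tapeActive[srcTape] = false;        /* this tape has no more tuples */

    for (i = 0; i < numTapes; i++)
        if (tapeActive[i])
            remaining++;

    if (remaining == 0)
        return;                         /* merge is finished */

    /* split the freed memory uniformly across the surviving tapes */
    share = tapeMem[srcTape] / remaining;
    tapeMem[srcTape] = 0;

    for (i = 0; i < numTapes; i++)
        if (tapeActive[i])
            tapeMem[i] += share;
}

int
main(void)
{
    long    tapeMem[4] = { 2097152, 2097152, 2097152, 2097152 };    /* 2MB per tape */
    bool    tapeActive[4] = { true, true, true, true };

    reassign_exhausted_tape(tapeMem, tapeActive, 4, 2);
    printf("tape 0 now has %ld bytes of preread buffer\n", tapeMem[0]);
    return 0;
}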

Best Regards, Simon Riggs





Re: [HACKERS] Automatically setting work_mem

2006-03-21 Thread Simon Riggs
On Sat, 2006-03-18 at 13:21 -0800, Luke Lonergan wrote:

 In short, faster performance through more aggressive runtime compilation.  A
 JIT for the database kernel.  It's not like I'm on shaky ground here - other
 commercial DBMS have done it for over a decade.

I think what Luke may be referring to is the ability to compile WHERE
clauses to remove the bottleneck around Eval for complex searches.

Best Regards, Simon Riggs




Re: [HACKERS] Automatically setting work_mem

2006-03-21 Thread Martijn van Oosterhout
On Tue, Mar 21, 2006 at 08:05:50PM +, Simon Riggs wrote:
  Point 2 is actually a serious flaw in Simon's proposal, because there
  is no portable way to make malloc return freed memory to the OS.  Some
  mallocs will do that ... in some cases ... but many simply don't ever
  move the brk address down.  It's not an easy thing to do when the arena
  gets cluttered with a lot of different alloc chunks and only some of
  them get freed.

snip

 The largest requirement for memory is the run building during
 performsort. That portion of the code is not concurrently executed
 within the same query. If we can reduce memory usage after that phase
 completes, then we stand a chance of avoiding the situation where a big
 query over-uses memory that we are then unable to reclaim.

There is one way to guarantee the memory is released to the OS after
completion. Make the allocator allocate work_mem bytes using mmap()
rather than malloc(). munmap() will then definitely return the memory to
the OS. Unfortunately, the coding required would probably not be
straight-forward... Glibc will only convert malloc() to an mmap() on
allocations > 128KB, and I don't think PostgreSQL ever does that.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.




Re: [HACKERS] Automatically setting work_mem

2006-03-21 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 There is one way to guarantee the memory is released to the OS after
 completion. Make the allocator allocate work_mem bytes using mmap()
 rather than malloc(). munmap() will then definitely return the memory to
 the OS. Unfortunately, the coding required would probably not be
 straight-forward...

Nor portable.

 Glibc will only convert malloc() to an mmap() on
 allocations > 128KB, and I don't think PostgreSQL ever does that.

Actually, we do: it doesn't take very long for the sequence of block
allocations within a context to ramp up to 128K.  (And I wouldn't be
opposed to tweaking the logic in aset.c to make it happen faster, once
an initial small allocation is filled up.)  Also, individual chunk
requests exceeding 8K or thereabouts are fed directly to malloc, so
stuff like the SortTuple array might well be effectively mmap'd.

I'm fairly unconvinced about Simon's underlying premise --- that we
can't make good use of work_mem in sorting after the run building phase
--- anyway.  If we cut back our memory usage then we'll be forcing a
significantly more-random access pattern to the temp file(s) during
merging, because we won't be able to pre-read as much at a time.

regards, tom lane



Re: [HACKERS] Automatically setting work_mem

2006-03-21 Thread Luke Lonergan
Tom,

On 3/21/06 2:47 PM, Tom Lane [EMAIL PROTECTED] wrote:

 I'm fairly unconvinced about Simon's underlying premise --- that we
 can't make good use of work_mem in sorting after the run building phase
 --- anyway.  If we cut back our memory usage then we'll be forcing a
 significantly more-random access pattern to the temp file(s) during
 merging, because we won't be able to pre-read as much at a time.

I thought we let the OS do that ;-)

Seriously, I've suggested an experiment to evaluate the effectiveness of
internal buffering with ridiculously low amounts of RAM (work_mem) compared
to bypassing it entirely and preferring the buffer cache and OS I/O cache.
I suspect the work_mem caching of merge results, while algorithmically
appropriate, may not work effectively with the tiny amount of RAM allocated
to it, and could be better left to the OS because of its liberal use of
read-ahead and disk caching.

Experiment should take but a minute to validate or disprove the hypothesis.

- Luke  





Re: [HACKERS] Automatically setting work_mem

2006-03-21 Thread Tom Lane
Luke Lonergan [EMAIL PROTECTED] writes:
 Experiment should take but a minute to validate or disprove the hypothesis.

Only if you're prepared to trust the results of one experiment on one
platform with a not-very-large amount of data.  Otherwise it's going to
take quite a few minutes ...

The real problem we are facing with a whole lot of our optimization
issues (not only sorting) is that it's not all that trivial to get
credible experimental results that we can expect will hold up across
a range of usage scenarios.

regards, tom lane



Re: [HACKERS] Automatically setting work_mem

2006-03-21 Thread Simon Riggs
On Tue, 2006-03-21 at 17:47 -0500, Tom Lane wrote:

 I'm fairly unconvinced about Simon's underlying premise --- that we
 can't make good use of work_mem in sorting after the run building phase
 --- anyway.  

We can make good use of memory, but there does come a point in final
merging where too much is of no further benefit. That point seems to be
at about 256 blocks per tape; patch enclosed for testing. (256 blocks
per tape roughly doubles performance over 32 blocks at that stage).

That is never the case during run building - more is always better.

 If we cut back our memory usage 
Simon inserts the words: too far
 then we'll be forcing a
 significantly more-random access pattern to the temp file(s) during
 merging, because we won't be able to pre-read as much at a time.

Yes, that's right.

If we have 512MB of memory that gives us enough for 2000 tapes, yet the
initial run building might only produce a few runs. There's just no way
that all 512MB of memory is needed to optimise the performance of reading
in a few tapes at the time of the final merge.

I'm suggesting we always keep 2MB per active tape, or the full
allocation, whichever is lower. In the above example that could release
over 500MB of memory, which more importantly can be reused by subsequent
sorts if/when they occur.
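
As an illustration of that retention rule only (keep roughly 2MB, i.e.
256 x 8KB blocks, per active tape, capped at the full allocation), here is
a minimal C sketch; the constant and function names are mine, not the
patch's actual code.

#include <stdio.h>

#define BLCKSZ                      8192
#define PREFERRED_MERGE_BUFFER_SIZE ((long) BLCKSZ * 256)   /* ~2MB per tape */

/* How much memory to keep for the final merge. */
static long
final_merge_retained_mem(int activeTapes, long allowedMem)
{
    long    wanted = (long) activeTapes * PREFERRED_MERGE_BUFFER_SIZE;

    /* keep 2MB per active tape, or the full allocation, whichever is lower */
    return (wanted < allowedMem) ? wanted : allowedMem;
}

int
main(void)
{
    long    allowedMem = 512L * 1024 * 1024;    /* work_mem = 512MB */
    int     activeTapes = 4;                    /* only a few runs to merge */
    long    keep = final_merge_retained_mem(activeTapes, allowedMem);

    printf("retain %ld MB, release %ld MB\n",
           keep / (1024 * 1024), (allowedMem - keep) / (1024 * 1024));
    return 0;
}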


Enclosed are two patches:
1. mergebuffers.patch allows measurement of the effects of different
merge buffer sizes, current default=32

2. reassign2.patch which implements the two kinds of resource
deallocation/reassignment proposed.

Best Regards, Simon Riggs

Index: src/backend/utils/sort/tuplesort.c
===
RCS file: /projects/cvsroot/pgsql/src/backend/utils/sort/tuplesort.c,v
retrieving revision 1.65
diff -c -r1.65 tuplesort.c
*** src/backend/utils/sort/tuplesort.c	10 Mar 2006 23:19:00 -	1.65
--- src/backend/utils/sort/tuplesort.c	21 Mar 2006 19:20:23 -
***
*** 179,186 
   */
  #define MINORDER		6		/* minimum merge order */
  #define TAPE_BUFFER_OVERHEAD		(BLCKSZ * 3)
! #define MERGE_BUFFER_SIZE			(BLCKSZ * 32)
! 
  /*
   * Private state of a Tuplesort operation.
   */
--- 179,187 
   */
  #define MINORDER		6		/* minimum merge order */
  #define TAPE_BUFFER_OVERHEAD		(BLCKSZ * 3)
! #define OPTIMAL_MERGE_BUFFER_SIZE	(BLCKSZ * 32)
! #define PREFERRED_MERGE_BUFFER_SIZE (BLCKSZ * 256)
! #define REUSE_SPACE_LIMIT   RELSEG_SIZE
  /*
   * Private state of a Tuplesort operation.
   */
***
*** 255,260 
--- 256,270 
  	 */
  	int			currentRun;
  
+ /*
+  * These variables are used during final merge to reassign resources
+  * as they become available for each tape
+  */
+ int lastPrereadTape;/* last tape preread from */
+ int numPrereads;/* num times last tape has been selected */
+ int reassignableSlots;  /* how many slots can be reassigned */
+ longreassignableMem;/* how much memory can be reassigned */
+ 
  	/*
  	 * Unless otherwise noted, all pointer variables below are pointers
  	 * to arrays of length maxTapes, holding per-tape data.
***
*** 398,408 
--- 408,422 
  
  static Tuplesortstate *tuplesort_begin_common(int workMem, bool randomAccess);
  static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
+ static void grow_memtuples(Tuplesortstate *state);
+ static void shrink_memtuples(Tuplesortstate *state);
  static void inittapes(Tuplesortstate *state);
  static void selectnewtape(Tuplesortstate *state);
  static void mergeruns(Tuplesortstate *state);
  static void mergeonerun(Tuplesortstate *state);
  static void beginmerge(Tuplesortstate *state);
+ static void assignResourcesUniformly(Tuplesortstate *state, bool initialAssignment);
+ static void reassignresources(Tuplesortstate *state, int srcTape);
  static void mergepreread(Tuplesortstate *state);
  static void mergeprereadone(Tuplesortstate *state, int srcTape);
  static void dumptuples(Tuplesortstate *state, bool alltuples);
***
*** 727,733 
   * moves around with tuple addition/removal, this might result in thrashing.
   * Small increases in the array size are likely to be pretty inefficient.
   */
! static bool
  grow_memtuples(Tuplesortstate *state)
  {
  	/*
--- 741,747 
   * moves around with tuple addition/removal, this might result in thrashing.
   * Small increases in the array size are likely to be pretty inefficient.
   */
! static void
  grow_memtuples(Tuplesortstate *state)
  {
  	/*
***
*** 740,752 
  	 * this assumption should be good.  But let's check it.)
  	 */
  	if (state->availMem <= (long) (state->memtupsize * sizeof(SortTuple)))
! 		return false;
  	/*
  	 * On a 64-bit machine, allowedMem could be high enough to get us into
  	 * trouble with MaxAllocSize, too.
  	 */
  	if ((Size) (state->memtupsize * 2) >= MaxAllocSize / sizeof(SortTuple))
! 		return false;
  
  	FREEMEM(state, 

Re: [HACKERS] Automatically setting work_mem

2006-03-19 Thread Andreas Pflug

Thomas Hallgren wrote:
 Luke Lonergan wrote:
  Tom,

  On 3/17/06 9:59 PM, Tom Lane [EMAIL PROTECTED] wrote:
   This would buy what exactly?

  I guess you didn't read the other 80% of the post.

  In short, faster performance through more aggressive runtime compilation.  A
  JIT for the database kernel.  It's not like I'm on shaky ground here - other
  commercial DBMS have done it for over a decade.

 In exactly a fortnight from today, I'll repost my suggestion to rewrite
 the backend in Java ;-)

Please do! We haven't seen this suggestion for quite a while, I'm
starting to miss it...


Regards,
Andreas



Re: [HACKERS] Automatically setting work_mem

2006-03-18 Thread Luke Lonergan
Tom,


On 3/17/06 9:59 PM, Tom Lane [EMAIL PROTECTED] wrote:

 This would buy what exactly?

I guess you didn't read the other 80% of the post.

In short, faster performance through more aggressive runtime compilation.  A
JIT for the database kernel.  It's not like I'm on shaky ground here - other
commercial DBMS have done it for over a decade.

- Luke





Re: [HACKERS] Automatically setting work_mem

2006-03-18 Thread Luke Lonergan
Tom,

On 3/17/06 12:18 PM, Tom Lane [EMAIL PROTECTED] wrote:

 One user with ability to enter arbitrary SQL commands can *always* blow
 your resource planning away.  Blaming such things on work_mem is
 seriously misguided.

Agreed - that's why we need to split this discussion into the two categories
of (1) scheduling for concurrency protection and (2) dynamic resource
allocation.

Topic (1) is best handled by statement queuing IMO and as demonstrated by
other commercial DBMS.  This allows queues of different resource demands to
be used for ensuring that statements can not over consume memory, temp disk,
etc, and that queries with large requirements for some or all of those can
be allocated as much as possible, and those with smaller requirements will
be run (likely at much higher rates) while longer running queries take up
the larger resource pool.

(2) is what this thread is mostly talking about, and the dynamic allocation
of memory to plan nodes (sort, hash) needs to be done so that we are much
more efficient in memory footprint and give more where it's needed.  (2)
will require some way of putting an overall memory footprint to a statement,
then sub-allocating within it.  I suggest we assume that the overall memory
footprint is constrained somehow, perhaps another GUC that describes a per
statement maximum memory consumption, then at plan time we determine the
sub-allocations that best achieve the plan.  This would fit within a scheme
for (1) when we develop one.
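
Purely as an illustration of that sub-allocation idea (the GUC name
statement_work_mem and everything else below are assumptions, nothing that
exists today): scale each memory-hungry node's request proportionally so
the statement as a whole stays within its budget.

#include <stdio.h>

/* Scale per-node memory requests to fit within a per-statement budget. */
static void
suballocate(long budget, const long requested[], long granted[], int nnodes)
{
    long    total = 0;
    int     i;

    for (i = 0; i < nnodes; i++)
        total += requested[i];

    for (i = 0; i < nnodes; i++)
    {
        if (total <= budget)
            granted[i] = requested[i];      /* everything fits as requested */
        else
            granted[i] = (long) ((double) requested[i] * budget / total);
    }
}

int
main(void)
{
    long    budget = 256L * 1024 * 1024;    /* assumed statement_work_mem = 256MB */
    long    requested[3] = { 200L << 20, 100L << 20, 50L << 20 };   /* sort, hash, sort */
    long    granted[3];
    int     i;

    suballocate(budget, requested, granted, 3);
    for (i = 0; i < 3; i++)
        printf("node %d: asked %ld MB, granted %ld MB\n",
               i, requested[i] >> 20, granted[i] >> 20);
    return 0;
}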

- Luke 





Re: [HACKERS] Automatically setting work_mem

2006-03-18 Thread Thomas Hallgren

Luke Lonergan wrote:
 Tom,

 On 3/17/06 9:59 PM, Tom Lane [EMAIL PROTECTED] wrote:
  This would buy what exactly?

 I guess you didn't read the other 80% of the post.

 In short, faster performance through more aggressive runtime compilation.  A
 JIT for the database kernel.  It's not like I'm on shaky ground here - other
 commercial DBMS have done it for over a decade.


In exactly a fortnight from today, I'll repost my suggestion to rewrite the 
backend in Java ;-)

Regards,
Thomas Hallgren




Re: [HACKERS] Automatically setting work_mem

2006-03-17 Thread Simon Riggs
On Fri, 2006-03-17 at 13:29 +0800, Qingqing Zhou wrote:
 Simon Riggs [EMAIL PROTECTED] wrote
 
 Interesting, I understand that shared_work_mem is process-wise,
 allocate-when-use, request-may-or-may-not-get-it (as you have pointed out,
 this may make planner in a hard situation if we are sensitive to work_mem).
 But I still have something unclear. Let's say we have a sort operation need
 1024 memory. So the DBA may have the following two options:
 
 (1) SET work_mem = 1024; SET shared_work_mem = 0; do sort;
 (2) SET work_mem = 512; SET shared_work_mem = 512; do sort;
 
 So what's the difference between these two strategy?
 (1) Running time: do they use the same amount of memory? Why option 2 is
 better than 1?
 (2) Idle time: after sort done, option 1 will return all 1024 to the OS and
 2 will still keep 512?

The differences are
(1) no performance difference - all memory would be allocated and
deallocated at same time in either case
(2) shared_work_mem is SUSET rather than USERSET as work_mem is...
(3) The value is set for the whole server rather than by individual
tuning, so it would not make sense to use it as you have shown, even
though you could if you were the superuser

The goal is to do this:

do sort;    /* no work_mem settings at all */

with shared_work_mem set once for the whole server in postgresql.conf

Best Regards, Simon Riggs




Re: [HACKERS] Automatically setting work_mem

2006-03-17 Thread Tom Lane
Qingqing Zhou [EMAIL PROTECTED] writes:
 So what's the difference between these two strategy?
 (1) Running time: do they use the same amount of memory? Why option 2 is
 better than 1?
 (2) Idle time: after sort done, option 1 will return all 1024 to the OS and
 2 will still keep 512?

Point 2 is actually a serious flaw in Simon's proposal, because there
is no portable way to make malloc return freed memory to the OS.  Some
mallocs will do that ... in some cases ... but many simply don't ever
move the brk address down.  It's not an easy thing to do when the arena
gets cluttered with a lot of different alloc chunks and only some of
them get freed.

So the semantics we'd have to adopt is that once a backend claims some
shared work mem, it keeps it until process exit.  I don't think that
makes the idea worthless, because there's usually a clear distinction
between processes doing expensive stuff and processes doing cheap
stuff.  But it's definitely a limitation.  Also, if you've got a process
doing expensive stuff, it's certainly possible to expect the user to
just increase work_mem locally.

(BTW, given that work_mem is locally increasable, I'm not sure what's
the point of considering that shared_work_mem has to be SUSET.  It's
not to prevent users from blowing out memory.)

My own thoughts about the problems with our work_mem arrangement are
that the real problem is the rule that we can allocate work_mem per sort
or hash operation; this makes the actual total memory use per backend
pretty unpredictable for nontrivial queries.  I don't know how to fix
this though.  The planner needs to know the work_mem that will be used
for any one of these operations in order to estimate costs, so simply
trying to divide up work_mem among the operations of a completed plan
tree is not going to improve matters.

regards, tom lane



Re: [HACKERS] Automatically setting work_mem

2006-03-17 Thread Csaba Nagy
 My own thoughts about the problems with our work_mem arrangement are
 that the real problem is the rule that we can allocate work_mem per sort
 or hash operation; this makes the actual total memory use per backend
 pretty unpredictable for nontrivial queries.  I don't know how to fix
 this though.  The planner needs to know the work_mem that will be used
 for any one of these operations in order to estimate costs, so simply
 trying to divide up work_mem among the operations of a completed plan
 tree is not going to improve matters.

I know this is not exactly on point for what is being discussed in this
thread, and that it would need some serious work, but how about a
mechanism to allow plans some flexibility at run time? What I mean is
not making all the decisions at plan time, but including some branches in
the plan and executing one branch or the other depending on actual
parameter values, current statistics, currently available memory, and
other run-time resources.

This would make it a lot more feasible to cache query plans long-term. For
example, you wouldn't have to worry too much about changing statistics if
you can check them again at run time, and you could put in decision points
based on current memory resources. Of course there must still be a balance
between the number of decision points (which ultimately means the
size of the plan) and robustness against changing conditions, i.e.
branches should only go in for conditions likely to change.

Is this completely infeasible with the current Postgres architecture? I
have no idea how planning/execution works internally.

It's worth a look at how Apache Derby handles query planning, where a
planned query is actually a compiled Java class, i.e. the executable
byte code that will run to fetch the results, created and compiled by
the planner... an interesting approach that allows for lots of flexibility
at run time, but probably won't work with C :-)

Cheers,
Csaba.





Re: [HACKERS] Automatically setting work_mem

2006-03-17 Thread Josh Berkus
Tom,

 My own thoughts about the problems with our work_mem arrangement are
 that the real problem is the rule that we can allocate work_mem per sort
 or hash operation; this makes the actual total memory use per backend
 pretty unpredictable for nontrivial queries.  I don't know how to fix
 this though.  The planner needs to know the work_mem that will be used
 for any one of these operations in order to estimate costs, so simply
 trying to divide up work_mem among the operations of a completed plan
 tree is not going to improve matters.

Yes ... the unpredictability is the problem:
(1) We can only allocate the # of connections and default work_mem per 
operation.
(2) There are a variable # of concurrent queries per connection (0..1)
(3) Each query has a variable # of operations requiring work_mem, which 
will require a variable amount of work_mem.  If your malloc is good, this 
is limited to concurrent operations, but for some OSes this is all 
operations per query.  Thus the former uses 0..3xwork_mem per query and 
the latter 0..7x in general practice.

Overall, this means that based on a specific max_connections and work_mem, 
a variable amount between 1*work_mem and (connections*3*work_mem) memory 
may be needed at once.  Since the penalty for overallocating RAM is severe 
on most OSes, DBAs are forced to allow for the worst case.  This results 
in around 2/3 underallocation on systems with unpredictable loads.  This 
means that, even on heavily loaded DB systems, most of the time you're 
wasting a big chunk of your RAM.

Simon and I met about this (and other stuff) at the GreenPlum offices last 
summer.   The first plan we came up with is query queueing.  Query 
queueing would eliminate variability (2), which would then make the only 
variability one of how much work_mem would be needed per query, reducing 
(but not eliminating) the underallocation.   Additionally, we thought to 
tie the query queues to ROLES, which would allow the administrator to 
better control how much work_mem per type of query was allowed.  It would 
also allow admins to balance priorities better on mixed-load machines.

Mind you, I'm also thinking that on enterprise installations with 
multi-department use of the database, the fact that work_mem is 
inalienably USERSET is also an allocation problem.   One user with a SET 
command can blow all of your resource planning away.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [HACKERS] Automatically setting work_mem

2006-03-17 Thread Tom Lane
Josh Berkus josh@agliodbs.com writes:
 Mind you, I'm also thinking that on enterprise installations with 
 multi-department use of the database, the fact that work_mem is 
 inalienably USERSET is also an allocation problem.   One user with a SET 
 command can blow all of your resource planning away.

One user with ability to enter arbitrary SQL commands can *always* blow
your resource planning away.  Blaming such things on work_mem is
seriously misguided.

regards, tom lane



Re: [HACKERS] Automatically setting work_mem

2006-03-17 Thread Hannu Krosing
On Fri, 2006-03-17 at 09:46, Tom Lane wrote:
 Qingqing Zhou [EMAIL PROTECTED] writes:
  So what's the difference between these two strategy?
  (1) Running time: do they use the same amount of memory? Why option 2 is
  better than 1?
  (2) Idle time: after sort done, option 1 will return all 1024 to the OS and
  2 will still keep 512?
 
 Point 2 is actually a serious flaw in Simon's proposal, because there
 is no portable way to make malloc return freed memory to the OS. 

So perhaps we could keep the shared_work_mem in actual shared memory,
and alloc it to processes from there?

We probably can't get it as a contiguous chunk, but at least we can
give it back for other backends to use when done.

 My own thoughts about the problems with our work_mem arrangement are
 that the real problem is the rule that we can allocate work_mem per sort
 or hash operation; this makes the actual total memory use per backend
 pretty unpredictable for nontrivial queries.  I don't know how to fix
 this though.  The planner needs to know the work_mem that will be used
 for any one of these operations in order to estimate costs, so simply
 trying to divide up work_mem among the operations of a completed plan
 tree is not going to improve matters.

Why not make the work_mem allocation one of the variable
parameters that is fed to the planner, and try optimising for different
sets of sub-work_mems?

---
Hannu





Re: [HACKERS] Automatically setting work_mem

2006-03-17 Thread Tom Lane
Hannu Krosing [EMAIL PROTECTED] writes:
 So perhaps we could keep the shared_work_mem in actual shared memory,
 and alloc it to processes from there?

No, that's utterly not reasonable, both from an allocation point of view
(you'd have to make shared memory enormous, and not all platforms will
like that) and from a locking point of view.

regards, tom lane



Re: [HACKERS] Automatically setting work_mem

2006-03-17 Thread Martijn van Oosterhout
On Fri, Mar 17, 2006 at 04:45:17PM -0500, Tom Lane wrote:
 Hannu Krosing [EMAIL PROTECTED] writes:
  So perhaps we could keep the shared_work_mem in actual shared memory,
  and alloc it to processes from there?
 
 No, that's utterly not reasonable, both from an allocation point of view
 (you'd have to make shared memory enormous, and not all platforms will
 like that) and from a locking point of view.

Perhaps we just need to tweak the memory allocation routines to use
mmap() for large allocations rather than malloc(). Then the memory can
easily be returned to the system, unlike memory on the heap. glibc does
this automatically sometimes.

Though you have to be careful: repeated mmap()/munmap() is more
expensive than malloc()/free(), because mmap()ed memory must be zeroed
out, which costs cycles...
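
For illustration only, here is a minimal sketch of that idea, assuming a
platform that provides MAP_ANONYMOUS (this is not PostgreSQL code):
allocate a work_mem-sized arena with mmap() and hand it back to the OS
with munmap() when the sort is done.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int
main(void)
{
    size_t  arena_size = 64UL * 1024 * 1024;    /* pretend work_mem = 64MB */
    void   *arena;

    /* anonymous private mapping; pages are zero-filled by the kernel */
    arena = mmap(NULL, arena_size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (arena == MAP_FAILED)
    {
        perror("mmap");
        return 1;
    }

    memset(arena, 0x7f, arena_size);            /* stand-in for sort work */

    /* unlike free(), this definitely gives the memory back to the OS */
    if (munmap(arena, arena_size) != 0)
    {
        perror("munmap");
        return 1;
    }

    puts("arena released to the OS");
    return 0;
}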

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.




Re: [HACKERS] Automatically setting work_mem

2006-03-17 Thread Luke Lonergan
Csaba,

On 3/17/06 7:07 AM, Csaba Nagy [EMAIL PROTECTED] wrote:

 It's worth a look at how Apache Derby handles query planning, where a
 planned query is actually a compiled Java class, i.e. the executable
 byte code that will run to fetch the results, created and compiled by
 the planner... an interesting approach that allows for lots of flexibility
 at run time, but probably won't work with C :-)

We've looked at using the open source llvm compiler to create an
intermediate representation of the plan, then generate machine code and
dispatch for execution.

This would have the advantage of being able to place runtime constants into
the intermediate representation as constants (like the address of a
comparator function or operator), then let the compiler optimize them out,
hoist, etc.  You can't do this at compile time, and there would be no change
to the nice abstract code in the executor.

It's on our list - anyone else interested?

- Luke





Re: [HACKERS] Automatically setting work_mem

2006-03-17 Thread Tom Lane
Luke Lonergan [EMAIL PROTECTED] writes:
 We've looked at using the open source llvm compiler to create an
 intermediate representation of the plan, then generate machine code and
 dispatch for execution.

This would buy what exactly?

regards, tom lane



[HACKERS] Automatically setting work_mem

2006-03-16 Thread Simon Riggs
One of the key points influencing the performance of certain operations is
the amount of memory that is available to them. Sorts, Materialize, Hash
Joins, Aggs and hashed subquery plans all want lots of memory.

Static allocation of memory is good in some situations, but not in
others. In many cases, sysadmins want to be able to tell the server how
much memory it can use and then have the server work out how to allocate
that according to the work already executing. (Whatever we do, the
static allocation of memory via work_mem should remain, so I suggest
only additional options rather than change of the existing mechanisms.)

My goal is a simple and effective way of increasing performance without
needing to sweat over particular settings for individual backends, all
of which need to be individually reconsidered when we upgrade memory.
Small additional amounts of memory can make huge differences to elapsed
times; we need a flexible way to use more when it makes sense to do so.

I envisage a new setting, shared_work_mem, that would allow a sysadmin
to define a secondary pool of memory that backends could dip into
once they run out of their primary allocation (work_mem).
shared_work_mem=0 would provide the current behaviour.

shared_work_mem would *not* be allocated until time of use; it is an
abstract concept only. As explained below, it would not be shared memory
at all, but privately allocated process memory.

We would only look at dynamically changing work_mem in those few
restricted cases where we track that against the work_mem limit. If we
hit that limit, we would make a request to the central pool: "Can I be
allotted another 2MB, please?" (etc.). The central allotment mechanism
would then say yes or no. If allotted the memory, the backend would then
palloc up to that limit. The backend may return later for additional
allotments, but for now it has been allowed to dynamically increase its
memory usage. This allotment would be noted in the memory context
header, so that when the memory context is freed, the allotment can be
returned to the central pool by a deallotment call. This is now easier
than before since each sort within a query has its own memory context.

(I use the term "allot" to differentiate it from the term "allocate",
which describes the execution of malloc etc. First the server would
conceptually allot memory, then we would physically allocate it. Once
allocated, memory would not be deallocated any earlier than normal.
Memory would never be deallotted until the memory is deallocated.)
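
To make the shape of the proposal easier to see, here is a heavily
simplified, illustrative C sketch of an allot/deallot API, assuming an
all-or-nothing grant policy; none of these names exist in PostgreSQL, and
a real version would keep the pool in shared memory under a lock.

#include <stdio.h>

typedef struct WorkMemPool
{
    long    total;          /* shared_work_mem, in bytes */
    long    allotted;       /* currently handed out to backends */
} WorkMemPool;

/* Ask the central pool for 'request' more bytes; returns what was granted. */
static long
workmem_allot(WorkMemPool *pool, long request)
{
    long    available = pool->total - pool->allotted;
    long    granted = (request <= available) ? request : 0;     /* all or nothing */

    pool->allotted += granted;
    return granted;
}

/* Return an allotment, e.g. when the sort's memory context is freed. */
static void
workmem_deallot(WorkMemPool *pool, long amount)
{
    pool->allotted -= amount;
}

int
main(void)
{
    WorkMemPool pool = { 256L * 1024 * 1024, 0 };   /* shared_work_mem = 256MB */
    long    grant = workmem_allot(&pool, 2L * 1024 * 1024);     /* "another 2MB please" */

    printf("granted %ld bytes; pool now has %ld of %ld bytes allotted\n",
           grant, pool.allotted, pool.total);
    workmem_deallot(&pool, grant);                  /* memory context freed */
    return 0;
}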

shared_work_mem would be a SUSET parameter, allowing it to be changed
up/down while server running. All of the details around this would be
built into a new API for re/palloc calls aimed at larger requests.

Some thorny points are:
1. what is the allotment policy?
2. what do we do when the memory pool has all been used up?
3. do we make the allocation at planning time? - allowing us to
potentially use a different plan because we know we will have the memory
to use when we get to the executor.

My current thoughts are:
(1) allow different allotment policies, possibly even using an external
function call, but basically giving everybody the flexibility they want.
For now, we can provide a simple mechanism for starters, then add more
later
e.g. shared_work_mem_policy = 'firstcomefirstserved(10)'
I don't want to introduce too many new parameters here...

(2) Various options here: 
a) query queues for allocation
b) query throws ERROR OOM (unlikely to be a popular one)
c) query gets nothing (good alloc scheme can prevent harshness here...)
Simplicity says c), since a) is lots of work and b) not useful

(3) let's do this simply to start with - allocation only occurs in the
executor, so it is tied neatly into the executor memory contexts.

Another idea is to introduce transactional statement queuing, but that
may be a sledgehammer to crack a fairly simple nut.

There are some additional technical points which need not be discussed
yet, though they would also need to be addressed.

Best Regards, Simon Riggs





Re: [HACKERS] Automatically setting work_mem

2006-03-16 Thread Qingqing Zhou

Simon Riggs [EMAIL PROTECTED] wrote

 We would only look at dynamically changing work_mem in those few
 restricted cases where we track that against the work_mem limit. If we
 hit that limit, we would make a request to the central pool: Can I be
 allotted another 2MB please? (etc). The central allotment mechanism
 would then say Yes or No. If allotted the memory, the backend would then
 palloc up to that limit. The backend may return later for additional
 allotments, but for now it has been allowed to dynamically increase its
 memory usage. This allotment would be noted in the memory context
 header, so that when the memory context is freed, the allotment can be
 returned to the central pool by a deallotment call. This is now easier
 than before since each sort within a query has its own memory context.


Interesting. I understand that shared_work_mem is process-wise,
allocate-when-use, request-may-or-may-not-get-it (as you have pointed out,
this may put the planner in a hard situation if we are sensitive to work_mem).
But I still have something unclear. Let's say we have a sort operation that
needs 1024 of memory. So the DBA may have the following two options:

(1) SET work_mem = 1024; SET shared_work_mem = 0; do sort;
(2) SET work_mem = 512; SET shared_work_mem = 512; do sort;

So what's the difference between these two strategies?
(1) Running time: do they use the same amount of memory? Why is option 2
better than option 1?
(2) Idle time: after the sort is done, option 1 will return all 1024 to the
OS and option 2 will still keep 512?

Regards,
Qingqing


