Re: [HACKERS] 2nd Level Buffer Cache

2011-04-26 Thread Bruce Momjian
Josh Berkus wrote:
 
  Was it really all that bad?  IIRC we replaced ARC with the current clock
  sweep due to patent concerns.  (Maybe there were performance concerns as
  well, I don't remember).
 
 Yeah, that was why the patent was frustrating.  Performance was poor and
 we were planning on replacing ARC in 8.2 anyway.  Instead we had to
 backport it.

[ Replying late.]

FYI, the performance problem was that while ARC was slightly better than
clock sweep in keeping useful buffers in the cache, it was terrible when
multiple CPUs were all modifying the buffer cache, which is why we were
going to remove it anyway.

In summary, any new algorithm has to be better at keeping useful data in
the cache, and also not slow down workloads on multiple CPUs.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-31 Thread Greg Smith

On 03/24/2011 03:36 PM, Jim Nasby wrote:

On Mar 23, 2011, at 5:12 PM, Tom Lane wrote:
   

Robert Haasrobertmh...@gmail.com  writes:
 

It looks like the only way anything can ever get put on the free list
right now is if a relation or database is dropped.  That doesn't seem
too good.
   

Why not?  AIUI the free list is only for buffers that are totally dead,
ie contain no info that's possibly of interest to anybody.  It is *not*
meant to substitute for running the clock sweep when you have to discard
a live buffer.
 

Turns out we've had this discussion before: 
http://archives.postgresql.org/pgsql-hackers/2010-12/msg01088.php and 
http://archives.postgresql.org/pgsql-hackers/2010-12/msg00689.php
   


Investigating this has been on the TODO list for four years now:

http://archives.postgresql.org/pgsql-hackers/2007-04/msg00781.php

I feel that work in this area is blocked behind putting together a 
decent mix of benchmarks that can be used to test whether changes here 
are actually good or bad.  All of the easy changes to buffer allocation 
strategy, ones that you could verify by inspection and simple tests, 
were made in 8.3.  The stuff that's left has the potential to either 
improve or reduce performance, and which will happen is very workload 
dependent.


Setting up systematic benchmarks of multiple workloads to run 
continuously on big hardware is a large, boring, expensive problem that 
few can justify financing (except for Jim of course), and even fewer 
want to volunteer time toward.  This whole discussion of cache policy 
tweaks is fun, but I just delete all the discussion now because it's 
just going in circles without a good testing regime.  The right way to 
start is by saying this is the benchmark I'm going to improve with this 
change, and it has a profiled hotspot at this point.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-26 Thread Jeff Janes
On Fri, Mar 25, 2011 at 8:07 AM, Gurjeet Singh singh.gurj...@gmail.com wrote:
 On Tue, Mar 22, 2011 at 3:53 PM, Robert Haas robertmh...@gmail.com wrote:

 On Tue, Mar 22, 2011 at 11:24 AM, Jeff Janes jeff.ja...@gmail.com wrote:
  On Fri, Mar 18, 2011 at 9:19 AM, Robert Haas robertmh...@gmail.com
  wrote:
 
  A related area that could use some looking at is why performance tops
  out at shared_buffers ~8GB and starts to fall thereafter.
 
  Under what circumstances does this happen?  Can a simple pgbench -S
  with a large scaling factor elicit this behavior?

 To be honest, I'm mostly just reporting what I've heard Greg Smith say
 on this topic.   I don't have any machine with that kind of RAM.

 I can sponsor a few hours (say 10) of one High-memory on-demand Quadruple
 Extra Large instance (26 EC2 Compute Units (8 virtual cores with 3.25 EC2
 Compute Units each), 1690 GB of local instance storage, 64-bit platform).
 That's the largest memory AWS has.

Does AWS have machines with battery-backed write cache?  I think
people running servers with 192G probably have BBWC, so it may be hard
to do realistic tests without also having one on the test machine.

But probably a bigger problem is that (to the best of my knowledge) we
don't seem to have a non-proprietary, generally implementable
benchmark system or load-generator which is known to demonstrate the
problem.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-25 Thread Gurjeet Singh
On Tue, Mar 22, 2011 at 3:53 PM, Robert Haas robertmh...@gmail.com wrote:

 On Tue, Mar 22, 2011 at 11:24 AM, Jeff Janes jeff.ja...@gmail.com wrote:
  On Fri, Mar 18, 2011 at 9:19 AM, Robert Haas robertmh...@gmail.com
 wrote:
  On Fri, Mar 18, 2011 at 11:14 AM, Kevin Grittner
  kevin.gritt...@wicourts.gov wrote:
  Maybe the thing to focus on first is the oft-discussed benchmark
  farm (similar to the build farm), with a good mix of loads, so
  that the impact of changes can be better tracked for multiple
  workloads on a variety of platforms and configurations.  Without
  something like that it is very hard to justify the added complexity
  of an idea like this in terms of the performance benefit gained.
 
  A related area that could use some looking at is why performance tops
  out at shared_buffers ~8GB and starts to fall thereafter.
 
  Under what circumstances does this happen?  Can a simple pgbench -S
  with a large scaling factor elicit this behavior?

 To be honest, I'm mostly just reporting what I've heard Greg Smith say
 on this topic.   I don't have any machine with that kind of RAM.


I can sponsor a few hours (say 10) of one High-memory on-demand Quadruple
Extra Large instance (26 EC2 Compute Units (8 virtual cores with 3.25 EC2
Compute Units each), 1690 GB of local instance storage, 64-bit platform).
That's the largest memory AWS has.

Let me know if I can help.

Regards,
-- 
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-25 Thread Jim Nasby
On Mar 25, 2011, at 10:07 AM, Gurjeet Singh wrote:
 On Tue, Mar 22, 2011 at 3:53 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Mar 22, 2011 at 11:24 AM, Jeff Janes jeff.ja...@gmail.com wrote:
  On Fri, Mar 18, 2011 at 9:19 AM, Robert Haas robertmh...@gmail.com wrote:
  On Fri, Mar 18, 2011 at 11:14 AM, Kevin Grittner
  kevin.gritt...@wicourts.gov wrote:
  Maybe the thing to focus on first is the oft-discussed benchmark
  farm (similar to the build farm), with a good mix of loads, so
  that the impact of changes can be better tracked for multiple
  workloads on a variety of platforms and configurations.  Without
  something like that it is very hard to justify the added complexity
  of an idea like this in terms of the performance benefit gained.
 
  A related area that could use some looking at is why performance tops
  out at shared_buffers ~8GB and starts to fall thereafter.
 
  Under what circumstances does this happen?  Can a simple pgbench -S
  with a large scaling factor elicit this behavior?
 
 To be honest, I'm mostly just reporting what I've heard Greg Smith say
 on this topic.   I don't have any machine with that kind of RAM.
 
 I can sponsor a few hours (say 10) of one High-memory on-demand Quadruple 
 Extra Large instance (26 EC2 Compute Units (8 virtual cores with 3.25 EC2 
 Compute Units each), 1690 GB of local instance storage, 64-bit platform). 
 That's the largest memory AWS has.

Related to that... after talking to Greg Smith at PGEast last night, he felt it 
would be very valuable just to profile how much time is being spent 
waiting/holding the freelist lock in a real environment. I'm going to see if we 
can do that on one of our slave databases.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-25 Thread Jeff Janes
On Thu, Mar 24, 2011 at 7:51 PM, Greg Stark gsst...@mit.edu wrote:
 On Thu, Mar 24, 2011 at 11:33 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 I tried under the circumstances I thought were mostly likely to show a
 time difference, and I was unable to detect a reliable difference in
 timing between free list and clock sweep.

 It strikes me that it shouldn't be terribly hard to add a profiling
 option to Postgres to dump out a list of precisely which blocks of
 data were accessed in which order. Then it's fairly straightforward to
 process that list using different algorithms to measure which
 generates the fewest cache misses.

It is pretty easy to get the list by adding a couple elog.  To be
safe you probably also need to record pins and unpins, as you can't
evict a pinned buffer no matter how other-wise eligible it might be.
For most workloads you might be able to get away with just assuming
that if it is eligible for replacement under any reasonable strategy,
than it is very unlikely to still be pinned.  Also, if the list is
derived from a concurrent environment, then the order of access you
see under a particular policy might no longer be the same if a
different policy were adopted.

But whose work-load would you use to do the testing?  The ones I was
testing were simple enough that I just know what the access pattern
is, the root and 1st level branch blocks are almost always in shared
buffer, the leaf and table blocks almost never are.

Here my concern was not how to choose which block to replace in a
conceptual way, but rather how to code that selection in way that is
fast and concurrent and low latency for the latency-sensitive
processes.  Either method will evict the same blocks, with the
exception of differences introduced by race conditions that get
resolved differently.

A benefit of focusing on the implementation rather than the high level
selection strategy is that improvements in implementation are more
likely to better carry over to other workloads.

My high level conclusions were that the running of the selection is
generally not a bottleneck, and in the cases where it was, the
bottleneck was due to contention on the LWLock, regardless of what was
done under that lock.  Changing who does the clock-sweep is probably
not meaningful unless it facilitates a lock-strength reduction or
other contention reduction.

I have also played with simulations of different algorithms for
managing the usage_count, and I could get improvements but they
weren't big enough or general enough to be very exciting.  It was
generally the case were if the data size was X, the improvement was
maybe 30% over the current, but if the data size was 0.8X or 1.2X,
there was no difference.  So not very general.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-25 Thread Robert Haas
On Mar 25, 2011, at 11:58 AM, Jim Nasby j...@nasby.net wrote:
 Related to that... after talking to Greg Smith at PGEast last night, he felt 
 it would be very valuable just to profile how much time is being spent 
 waiting/holding the freelist lock in a real environment. I'm going to see if 
 we can do that on one of our slave databases.

Yeah, that would be great. Also, some LWLOCK_STATS output or oprofile output 
would be definitely be useful.

...Robert
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-24 Thread Jim Nasby
On Mar 23, 2011, at 5:12 PM, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
 It looks like the only way anything can ever get put on the free list
 right now is if a relation or database is dropped.  That doesn't seem
 too good.
 
 Why not?  AIUI the free list is only for buffers that are totally dead,
 ie contain no info that's possibly of interest to anybody.  It is *not*
 meant to substitute for running the clock sweep when you have to discard
 a live buffer.

Turns out we've had this discussion before: 
http://archives.postgresql.org/pgsql-hackers/2010-12/msg01088.php and 
http://archives.postgresql.org/pgsql-hackers/2010-12/msg00689.php

Tom made the point in the first one that it might be good to proactively move 
buffers to the freelist so that backends would normally just have to hit the 
freelist and not run the sweep.

Unfortunately I haven't yet been able to do any performance testing of any of 
this... perhaps someone else can try and measure the amount of time spent by 
backends running the clock sweep with different shared buffer sizes.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-24 Thread Radosław Smogura
Jim Nasby j...@nasby.net Thursday 24 March 2011 20:36:48
 On Mar 23, 2011, at 5:12 PM, Tom Lane wrote:
  Robert Haas robertmh...@gmail.com writes:
  It looks like the only way anything can ever get put on the free list
  right now is if a relation or database is dropped.  That doesn't seem
  too good.
  
  Why not?  AIUI the free list is only for buffers that are totally dead,
  ie contain no info that's possibly of interest to anybody.  It is *not*
  meant to substitute for running the clock sweep when you have to discard
  a live buffer.
 
 Turns out we've had this discussion before:
 http://archives.postgresql.org/pgsql-hackers/2010-12/msg01088.php and
 http://archives.postgresql.org/pgsql-hackers/2010-12/msg00689.php
 
 Tom made the point in the first one that it might be good to proactively
 move buffers to the freelist so that backends would normally just have to
 hit the freelist and not run the sweep.
 
 Unfortunately I haven't yet been able to do any performance testing of any
 of this... perhaps someone else can try and measure the amount of time
 spent by backends running the clock sweep with different shared buffer
 sizes. --
 Jim C. Nasby, Database Architect   j...@nasby.net
 512.569.9461 (cell) http://jim.nasby.net

Will not be enough to take spin lock (or make ASM (lock) and increment call 
for Intels/AMD) around increment StrategyControl-nextVictimBuffer, everything 
here may be controlled by macro GetNextVictimBuffer(). Within for (;;) the 
valid buffer may be obtained with modulo NBuffers, to decrease lock time. We 
may try to calculate how many buffers we had skipped, and decrease e.g. 
trycount by this value, and put some additional restriction like no more 
passes then NBuffers*4 calls, and notify error.

This will made clock sweep concurrent.

Regards,
Radek

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-24 Thread Robert Haas
On Wed, Mar 23, 2011 at 6:12 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 It looks like the only way anything can ever get put on the free list
 right now is if a relation or database is dropped.  That doesn't seem
 too good.

 Why not?  AIUI the free list is only for buffers that are totally dead,
 ie contain no info that's possibly of interest to anybody.  It is *not*
 meant to substitute for running the clock sweep when you have to discard
 a live buffer.

It seems at least plausible that buffer allocation could be
significantly faster if it need only pop the head of a list, rather
than scanning until it finds a suitable candidate.  Moving as much
buffer allocation work as possible into the background seems like it
ought to be useful.

Granted, I've made no attempt to code or test this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-24 Thread Greg Stark
On Thu, Mar 24, 2011 at 8:59 PM, Robert Haas robertmh...@gmail.com wrote:
 It seems at least plausible that buffer allocation could be
 significantly faster if it need only pop the head of a list, rather
 than scanning until it finds a suitable candidate.  Moving as much
 buffer allocation work as possible into the background seems like it
 ought to be useful.


Linked lists are notoriously non-concurrent, that's the whole reason
for the clock sweep algorithm to exist at all instead of just using an
LRU directly. That said, an LRU needs to be able to remove elements
from the middle and not just enqueue elements on the tail, so the
situation isn't exactly equivalent.

Just popping off the head is simple enough but the bgwriter would need
to be able to add elements to the tail of the list and the people
popping elements off the head would need to compete with it for the
lock on the list. And I think you need a single lock for the whole
list because of the cases where the list becomes a single element or
empty.

The main impact this list would have is that it would presumably need
some real number of buffers to satisfy the pressure for victim buffers
for a real amount of time. That would represent a decrease in cache
size, effectively evicting buffers from cache as if the cache were
smaller by that amount.

Theoretical results are that a small change in cache size affects
cache hit rates substantially. I'm not sure that's born out by
practical experience with Postgres though. People tend to either be
doing mostly i/o or very little i/o. Cache hit rate only really
matters and is likely to be affected by small changes in cache size in
the space in between

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-24 Thread Robert Haas
On Thu, Mar 24, 2011 at 5:34 PM, Greg Stark gsst...@mit.edu wrote:
 On Thu, Mar 24, 2011 at 8:59 PM, Robert Haas robertmh...@gmail.com wrote:
 It seems at least plausible that buffer allocation could be
 significantly faster if it need only pop the head of a list, rather
 than scanning until it finds a suitable candidate.  Moving as much
 buffer allocation work as possible into the background seems like it
 ought to be useful.

 Linked lists are notoriously non-concurrent, that's the whole reason
 for the clock sweep algorithm to exist at all instead of just using an
 LRU directly. That said, an LRU needs to be able to remove elements
 from the middle and not just enqueue elements on the tail, so the
 situation isn't exactly equivalent.

 Just popping off the head is simple enough but the bgwriter would need
 to be able to add elements to the tail of the list and the people
 popping elements off the head would need to compete with it for the
 lock on the list. And I think you need a single lock for the whole
 list because of the cases where the list becomes a single element or
 empty.

 The main impact this list would have is that it would presumably need
 some real number of buffers to satisfy the pressure for victim buffers
 for a real amount of time. That would represent a decrease in cache
 size, effectively evicting buffers from cache as if the cache were
 smaller by that amount.

 Theoretical results are that a small change in cache size affects
 cache hit rates substantially. I'm not sure that's born out by
 practical experience with Postgres though. People tend to either be
 doing mostly i/o or very little i/o. Cache hit rate only really
 matters and is likely to be affected by small changes in cache size in
 the space in between

You wouldn't really have to reduce the effective cache size - there's
logic in there to just skip to the next buffer if the first one you
pull off the freelist has been reused.  But the cache hit rates on
those buffers would (you'd hope) be fairly low, since they are the
ones we're about to reuse.  Maybe it doesn't work out to a win,
though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-24 Thread Radosław Smogura
Robert Haas robertmh...@gmail.com Thursday 24 March 2011 22:41:19
 On Thu, Mar 24, 2011 at 5:34 PM, Greg Stark gsst...@mit.edu wrote:
  On Thu, Mar 24, 2011 at 8:59 PM, Robert Haas robertmh...@gmail.com 
wrote:
  It seems at least plausible that buffer allocation could be
  significantly faster if it need only pop the head of a list, rather
  than scanning until it finds a suitable candidate.  Moving as much
  buffer allocation work as possible into the background seems like it
  ought to be useful.
  
  Linked lists are notoriously non-concurrent, that's the whole reason
  for the clock sweep algorithm to exist at all instead of just using an
  LRU directly. That said, an LRU needs to be able to remove elements
  from the middle and not just enqueue elements on the tail, so the
  situation isn't exactly equivalent.
  
  Just popping off the head is simple enough but the bgwriter would need
  to be able to add elements to the tail of the list and the people
  popping elements off the head would need to compete with it for the
  lock on the list. And I think you need a single lock for the whole
  list because of the cases where the list becomes a single element or
  empty.
  
  The main impact this list would have is that it would presumably need
  some real number of buffers to satisfy the pressure for victim buffers
  for a real amount of time. That would represent a decrease in cache
  size, effectively evicting buffers from cache as if the cache were
  smaller by that amount.
  
  Theoretical results are that a small change in cache size affects
  cache hit rates substantially. I'm not sure that's born out by
  practical experience with Postgres though. People tend to either be
  doing mostly i/o or very little i/o. Cache hit rate only really
  matters and is likely to be affected by small changes in cache size in
  the space in between
 
 You wouldn't really have to reduce the effective cache size - there's
 logic in there to just skip to the next buffer if the first one you
 pull off the freelist has been reused.  But the cache hit rates on
 those buffers would (you'd hope) be fairly low, since they are the
 ones we're about to reuse.  Maybe it doesn't work out to a win,
 though.
If I may,
Under unnormal circumstances (like current process is held by kernel) 
obtaining buffer from list may be cheaper
this code 
while (StrategyControl-firstFreeBuffer = 0)
{
buf = BufferDescriptors[StrategyControl-firstFreeBuffer];
Assert(buf-freeNext != FREENEXT_NOT_IN_LIST);

/* Unconditionally remove buffer from freelist */
StrategyControl-firstFreeBuffer = buf-freeNext;
buf-freeNext = FREENEXT_NOT_IN_LIST;
could look 
do
{
SpinLock();
if (StrategyControl-firstFreeBuffer = 0) {
Unspin()
break;
}

buf = BufferDescriptors[StrategyControl-firstFreeBuffer];
Unspin();

Assert(buf-freeNext != FREENEXT_NOT_IN_LIST);

/* Unconditionally remove buffer from freelist */
StrategyControl-firstFreeBuffer = buf-freeNext;
buf-freeNext = FREENEXT_NOT_IN_LIST;like this
}while(true);
and aquirng spin lock for linked list is enaugh, and cheaper then taking 
lwlock is more complex than spin on this.

after this simmilary with spin lock on 
trycounter = NBuffers*4;
for (;;)
{
spinlock()
buf = BufferDescriptors[StrategyControl-nextVictimBuffer];

if (++StrategyControl-nextVictimBuffer = NBuffers)
{
StrategyControl-nextVictimBuffer = 0;
StrategyControl-completePasses++;
}
unspin();

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-24 Thread Jeff Janes
On Thu, Mar 24, 2011 at 12:36 PM, Jim Nasby j...@nasby.net wrote:
 On Mar 23, 2011, at 5:12 PM, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
 It looks like the only way anything can ever get put on the free list
 right now is if a relation or database is dropped.  That doesn't seem
 too good.

 Why not?  AIUI the free list is only for buffers that are totally dead,
 ie contain no info that's possibly of interest to anybody.  It is *not*
 meant to substitute for running the clock sweep when you have to discard
 a live buffer.

 Turns out we've had this discussion before: 
 http://archives.postgresql.org/pgsql-hackers/2010-12/msg01088.php and 
 http://archives.postgresql.org/pgsql-hackers/2010-12/msg00689.php

 Tom made the point in the first one that it might be good to proactively move 
 buffers to the freelist so that backends would normally just have to hit the 
 freelist and not run the sweep.

 Unfortunately I haven't yet been able to do any performance testing of any of 
 this... perhaps someone else can try and measure the amount of time spent by 
 backends running the clock sweep with different shared buffer sizes.

I tried under the circumstances I thought were mostly likely to show a
time difference, and I was unable to detect a reliable difference in
timing between free list and clock sweep.


Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-24 Thread Greg Stark
On Thu, Mar 24, 2011 at 11:33 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 I tried under the circumstances I thought were mostly likely to show a
 time difference, and I was unable to detect a reliable difference in
 timing between free list and clock sweep.

It strikes me that it shouldn't be terribly hard to add a profiling
option to Postgres to dump out a list of precisely which blocks of
data were accessed in which order. Then it's fairly straightforward to
process that list using different algorithms to measure which
generates the fewest cache misses.

This is usually how the topic is handled in academic discussions. The
optimal cache policy is the one which flushes the cache entry which
will be used next the furthest into the future. Given a precalculated
file you can calculate the results from that optimal strategy and then
compare each strategy against that one.


-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-23 Thread Radosław Smogura
Merlin Moncure mmonc...@gmail.com Monday 21 March 2011 20:58:16
 On Mon, Mar 21, 2011 at 2:08 PM, Greg Stark gsst...@mit.edu wrote:
  On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure mmonc...@gmail.com 
wrote:
  Can't you make just one large mapping and lock it in 8k regions? I
  thought the problem with mmap was not being able to detect other
  processes
  (http://www.mail-archive.com/pgsql-general@postgresql.org/msg122301.htm
  l) compatibility issues (possibly obsolete), etc.
  
  I was assuming that locking part of a mapping would force the kernel
  to split the mapping. It has to record the locked state somewhere so
  it needs a data structure that represents the size of the locked
  section and that would, I assume, be the mapping.
  
  It's possible the kernel would not in fact fall over too badly doing
  this. At some point I'll go ahead and do experiments on it. It's a bit
  fraught though as it the performance may depend on the memory
  management features of the chipset.
  
  That said, that's only part of the battle. On 32bit you can't map the
  whole database as your database could easily be larger than your
  address space. I have some ideas on how to tackle that but the
  simplest test would be to just mmap 8kB chunks everywhere.
 
 Even on 64 bit systems you only have 48 bit address space which is not
 a theoretical  limitation.  However, at least on linux you can map in
 and map out pretty quick (10 microseconds paired on my linux vm) so
 that's not so big of a deal.  Dealing with rapidly growing files is a
 problem.  That said, probably you are not going to want to reserve
 multiple gigabytes in 8k non contiguous chunks.
 
  But it's worse than that. Since you're not responsible for flushing
  blocks to disk any longer you need some way to *unlock* a block when
  it's possible to be flushed. That means when you flush the xlog you
  have to somehow find all the blocks that might no longer need to be
  locked and atomically unlock them. That would require new
  infrastructure we don't have though it might not be too hard.
  
  What would be nice is a mlock_until() where you eventually issue a
  call to tell the kernel what point in time you've reached and it
  unlocks everything older than that time.
 
 I wonder if there is any reason to mlock at all...if you are going to
 'do' mmap, can't you just hide under current lock architecture for
 actual locking and do direct memory access without mlock?
 
 merlin
I can't reproduce this. Simple test shows 2x faster read with mmap that 
read();

I'm sending this what I done with mmap (really ugly, but I'm in forest). It is 
read only solution, so init database first with some amount of data (I have 
about 300MB) (2nd level scripts may do this for You).

This what I found:
1. If I not require to put new mmap (mmap with FIXED) in previous region (just 
I do munmap / mmap) with each query, execution time grows, about 10%.

2. Sometimes is enough just to comment or uncomment something that do not have 
side effects on code flow (bufmgr.c; (un)comment some unused if; put NULL, it 
will be replaced), and e.g. query execution time may grow 2x.

3. My initial solution, was 2% faster, about 9ms when reading, now it's 10% 
slower, after making them more usable.

Regards,
Radek


pg_mmap_20110323.patch.bz2
Description: application/bzip

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-23 Thread Radosław Smogura
Merlin Moncure mmonc...@gmail.com Tuesday 22 March 2011 23:06:02
 On Tue, Mar 22, 2011 at 4:28 PM, Radosław Smogura
 
 rsmog...@softperience.eu wrote:
  Merlin Moncure mmonc...@gmail.com Monday 21 March 2011 20:58:16
  
  On Mon, Mar 21, 2011 at 2:08 PM, Greg Stark gsst...@mit.edu wrote:
   On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure mmonc...@gmail.com
  
  wrote:
   Can't you make just one large mapping and lock it in 8k regions? I
   thought the problem with mmap was not being able to detect other
   processes
   (http://www.mail-archive.com/pgsql-general@postgresql.org/msg122301.h
   tm l) compatibility issues (possibly obsolete), etc.
   
   I was assuming that locking part of a mapping would force the kernel
   to split the mapping. It has to record the locked state somewhere so
   it needs a data structure that represents the size of the locked
   section and that would, I assume, be the mapping.
   
   It's possible the kernel would not in fact fall over too badly doing
   this. At some point I'll go ahead and do experiments on it. It's a bit
   fraught though as it the performance may depend on the memory
   management features of the chipset.
   
   That said, that's only part of the battle. On 32bit you can't map the
   whole database as your database could easily be larger than your
   address space. I have some ideas on how to tackle that but the
   simplest test would be to just mmap 8kB chunks everywhere.
  
  Even on 64 bit systems you only have 48 bit address space which is not
  a theoretical  limitation.  However, at least on linux you can map in
  and map out pretty quick (10 microseconds paired on my linux vm) so
  that's not so big of a deal.  Dealing with rapidly growing files is a
  problem.  That said, probably you are not going to want to reserve
  multiple gigabytes in 8k non contiguous chunks.
  
   But it's worse than that. Since you're not responsible for flushing
   blocks to disk any longer you need some way to *unlock* a block when
   it's possible to be flushed. That means when you flush the xlog you
   have to somehow find all the blocks that might no longer need to be
   locked and atomically unlock them. That would require new
   infrastructure we don't have though it might not be too hard.
   
   What would be nice is a mlock_until() where you eventually issue a
   call to tell the kernel what point in time you've reached and it
   unlocks everything older than that time.
  
  I wonder if there is any reason to mlock at all...if you are going to
  'do' mmap, can't you just hide under current lock architecture for
  actual locking and do direct memory access without mlock?
  
  merlin
  
  Actually after dealing with mmap and adding munmap I found crucial thing
  why to not use mmap:
  You need to munmap, and for me this takes much time, even if I read with
  SHARED | PROT_READ, it's looks like Linux do flush or something else,
  same as with MAP_FIXED, MAP_PRIVATE, etc.
 
 can you produce small program demonstrating the problem?  This is not
 how things should work AIUI.
 
 I was thinking about playing with mmap implementation of clog system
 -- it's perhaps better fit.  clog is rigidly defined size, and has
 very high performance requirements.  Also it's much less changes than
 reimplementing heap buffering, and maybe not so much affected by
 munmap.
 
 merlin

Ah... just one thing, maybe usefull why performance is lost with huge memory. 
I saw mmaped buffers are allocated in something like 0x007, so definitly above 
4gb.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-23 Thread Jim Nasby
On Mar 22, 2011, at 2:53 PM, Robert Haas wrote:
 On Tue, Mar 22, 2011 at 11:24 AM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Fri, Mar 18, 2011 at 9:19 AM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Mar 18, 2011 at 11:14 AM, Kevin Grittner
 kevin.gritt...@wicourts.gov wrote:
 Maybe the thing to focus on first is the oft-discussed benchmark
 farm (similar to the build farm), with a good mix of loads, so
 that the impact of changes can be better tracked for multiple
 workloads on a variety of platforms and configurations.  Without
 something like that it is very hard to justify the added complexity
 of an idea like this in terms of the performance benefit gained.
 
 A related area that could use some looking at is why performance tops
 out at shared_buffers ~8GB and starts to fall thereafter.
 
 Under what circumstances does this happen?  Can a simple pgbench -S
 with a large scaling factor elicit this behavior?
 
 To be honest, I'm mostly just reporting what I've heard Greg Smith say
 on this topic.   I don't have any machine with that kind of RAM.

When we started using 192G servers we tried switching our largest OLTP database 
(would have been about 1.2TB at the time) from 8GB shared buffers to 28GB. 
Performance went down enough to notice; I don't have any solid metrics, but I'd 
ballpark it at 10-15%.

One thing that I've always wondered about is the logic of having backends run 
the clocksweep on a normal basis. OS's that use clock-sweep have a dedicated 
process to run the clock in the background, with the intent of keeping X amount 
of pages on the free list. We actually have most of the mechanisms to do that, 
we just don't have the added process. I believe bg_writer was intended to 
handle that, but in reality I don't think it actually manages to keep much of 
anything on the free list. Once we have a performance testing environment I'd 
be interested to test a modified version that includes a dedicated background 
clock sweep process that strives to keep X amount of buffers on the free list.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-23 Thread Robert Haas
On Wed, Mar 23, 2011 at 1:53 PM, Jim Nasby j...@nasby.net wrote:
 When we started using 192G servers we tried switching our largest OLTP 
 database (would have been about 1.2TB at the time) from 8GB shared buffers to 
 28GB. Performance went down enough to notice; I don't have any solid metrics, 
 but I'd ballpark it at 10-15%.

 One thing that I've always wondered about is the logic of having backends run 
 the clocksweep on a normal basis. OS's that use clock-sweep have a dedicated 
 process to run the clock in the background, with the intent of keeping X 
 amount of pages on the free list. We actually have most of the mechanisms to 
 do that, we just don't have the added process. I believe bg_writer was 
 intended to handle that, but in reality I don't think it actually manages to 
 keep much of anything on the free list. Once we have a performance testing 
 environment I'd be interested to test a modified version that includes a 
 dedicated background clock sweep process that strives to keep X amount of 
 buffers on the free list.

It looks like the only way anything can ever get put on the free list
right now is if a relation or database is dropped.  That doesn't seem
too good.  I wonder if the background writer shouldn't be trying to
maintain the free list.  That is, perhaps BgBufferSync() should notice
when the number of free buffers drops below some threshold, and run
the clock sweep enough to get it back up to that threshold.

On a related note, I've been thinking about whether we could make
bgwriter_delay adaptively self-tuning.  If we notice that we
overslept, we don't sleep as long the next time; if not much happens
while we sleep, we sleep longer the next time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-23 Thread Greg Stark
On Wed, Mar 23, 2011 at 8:00 PM, Robert Haas robertmh...@gmail.com wrote:
 It looks like the only way anything can ever get put on the free list
 right now is if a relation or database is dropped.  That doesn't seem
 too good.  I wonder if the background writer shouldn't be trying to
 maintain the free list.  That is, perhaps BgBufferSync() should notice
 when the number of free buffers drops below some threshold, and run
 the clock sweep enough to get it back up to that threshold.


I think this is just a terminology discrepancy. In postgres the free
list is only used for buffers that contain no useful data at all. The
only time there are buffers on the free list is at startup or if a
relation or database is dropped.

Most of the time blocks are read into buffers that already contain
other data. Candidate buffers to evict are buffers that have been used
least recently. That's what the clock sweep implements.

What the bgwriter's responsible for is looking at the buffers *ahead*
of the clock sweep and flushing them to disk. They stay in ram and
don't go on the free list, all that changes is that they're clean and
therefore can be reused without having to do any i/o.

I'm a bit skeptical that this works because as soon as bgwriter
saturates the i/o the os will throttle the rate at which it can write.
When that happens even a few dozens of milliseconds will be plenty to
allow the purely user-space processes consuming the buffers to catch
up instantly.

But Greg Smith has done a lot of work tuning the bgwriter so that it
is at least useful in some circumstances. I could well see it being
useful for systems where latency matters and the i/o is not saturated.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-23 Thread Radosław Smogura
Greg Stark gsst...@mit.edu Wednesday 23 March 2011 21:30:04
 On Wed, Mar 23, 2011 at 8:00 PM, Robert Haas robertmh...@gmail.com wrote:
  It looks like the only way anything can ever get put on the free list
  right now is if a relation or database is dropped.  That doesn't seem
  too good.  I wonder if the background writer shouldn't be trying to
  maintain the free list.  That is, perhaps BgBufferSync() should notice
  when the number of free buffers drops below some threshold, and run
  the clock sweep enough to get it back up to that threshold.
 
 I think this is just a terminology discrepancy. In postgres the free
 list is only used for buffers that contain no useful data at all. The
 only time there are buffers on the free list is at startup or if a
 relation or database is dropped.
 
 Most of the time blocks are read into buffers that already contain
 other data. Candidate buffers to evict are buffers that have been used
 least recently. That's what the clock sweep implements.
 
 What the bgwriter's responsible for is looking at the buffers *ahead*
 of the clock sweep and flushing them to disk. They stay in ram and
 don't go on the free list, all that changes is that they're clean and
 therefore can be reused without having to do any i/o.
 
 I'm a bit skeptical that this works because as soon as bgwriter
 saturates the i/o the os will throttle the rate at which it can write.
 When that happens even a few dozens of milliseconds will be plenty to
 allow the purely user-space processes consuming the buffers to catch
 up instantly.
 
 But Greg Smith has done a lot of work tuning the bgwriter so that it
 is at least useful in some circumstances. I could well see it being
 useful for systems where latency matters and the i/o is not saturated.

Freelist is almost useless under normal operations, but it's only one check if 
it's empty or not, which could be optimized by checking (... -1), or !(...  
0)

Regards,
Radek

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-23 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 It looks like the only way anything can ever get put on the free list
 right now is if a relation or database is dropped.  That doesn't seem
 too good.

Why not?  AIUI the free list is only for buffers that are totally dead,
ie contain no info that's possibly of interest to anybody.  It is *not*
meant to substitute for running the clock sweep when you have to discard
a live buffer.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-22 Thread KONDO Mitsumasa

Hi, hackers.

I am interested in this discussion!
So I surveyed current buffer algorithms around other software. I share about it.
(sorry, it is easy survey..)

CLOCK-PRO and LIRS are popular in current buffer algorithms in my easy survey.
Their algorithms are same author that is Song Jiang.
CLOCK-PRO is improved LIRS algorithm based on CLOCK algorithm.

CLOCK-PRO is used by Apache Derby and NetBSD.
And, LIRS is used by MySQL.


The following is easy explanation of LIRS.

LRU use Recency metric that is the number of other blocks accessed from last 
reference to the current time.

Strong points of LRU
 - Low overhead and simplicity data structure
 - LRU assumption is works well

Weak points of LRU
 - A recently used block will not necessarily be used again or soon
 - The prediction is based on a single source information


LIRS algorithm use Recency metric and Inter-Reference Recency(IRR) metric that 
is the number of other unique blocks accessed between two consecutive 
references to the block.
The priority in LIRS algorithm is the order of IRR and Recency.
IRR metric compensate for LRU weak points.

LIRS paper insists on the following.
 - LIRS is same overhead as LRU.
 - Results of experiments were indicated that LIRS is higher buffer hit rate 
than LRU and other buffer algorithms.
   * Their experiment is used LIRS and other algorithms in PostgreSQL buffer 
system.


In CLOCK-PRO paper is indicated that CLOCK-PRO is superior than LIRS and other 
buffer algorithms (including Arc).


I think that PostgreSQL is very powerful and reliable database!
So I hope that PostgreSQL buffer system will be more powerful and more 
intelligent.

Thanks.

[Refference]
 - CLOCK-PRO: 
http://www.ece.eng.wayne.edu/~sjiang/pubs/papers/jiang05_CLOCK-Pro.pdf
 - LIRS: 
http://dragonstar.ict.ac.cn/course_09/XD_Zhang/%286%29-LIRS-replacement.pdf
 - Apache Derbey(Google Summer Code): 
http://www.eecg.toronto.edu/~gokul/derby/derby-report-aug-19-2006.pdf
 - NetBSD source code: 
http://fxr.watson.org/fxr/source/uvm/uvm_pdpolicy_clockpro.c?v=NETBSD
 - MySQL source code: 
http://mysql.lamphost.net/sources/doxygen/mysql-5.1/structPgman_1_1Page__entry.html
 - Song Jiang HP: http://www.ece.eng.wayne.edu/~sjiang/

--
Kondo Mitsumasa
NTT Corporation, NTT Open Source Software Center


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-22 Thread Jeff Janes
On Fri, Mar 18, 2011 at 9:19 AM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Mar 18, 2011 at 11:14 AM, Kevin Grittner
 kevin.gritt...@wicourts.gov wrote:
 Maybe the thing to focus on first is the oft-discussed benchmark
 farm (similar to the build farm), with a good mix of loads, so
 that the impact of changes can be better tracked for multiple
 workloads on a variety of platforms and configurations.  Without
 something like that it is very hard to justify the added complexity
 of an idea like this in terms of the performance benefit gained.

 A related area that could use some looking at is why performance tops
 out at shared_buffers ~8GB and starts to fall thereafter.

Under what circumstances does this happen?  Can a simple pgbench -S
with a large scaling factor elicit this behavior?

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-22 Thread Jeff Janes
On Fri, Mar 18, 2011 at 8:14 AM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 rsmogura rsmog...@softperience.eu wrote:

 Yes, there is some change, and I looked at this more carefully, as
 my performance results wasn't such as I expected. I found PG uses
 BufferAccessStrategy to do sequence scans, so my test query took
 only 32 buffers from pool and didn't overwritten index pool too
 much. This BAS is really surprising. In any case when I end
 polishing I will send good patch, with proof.

 Yeah, that heuristic makes this less critical, for sure.

 Actually idea of this patch was like this:
 Some operations requires many buffers, PG uses clock sweep to
 get next free buffer, so it may overwrite index buffer. From point
 of view of good database design We should use indices, so purging
 out index from cache will affect performance.

 As the side effect I saw that this 2nd level keeps pg_* indices
 in memory too, so I think to include 3rd level cache for some pg_*
 tables.

 Well, the more complex you make it the more overhead there is, which
 makes it harder to come out ahead.  FWIW, in musing about it (as
 recently as this week), my idea was to add another field which would
 factor into the clock sweep calculations.  For indexes, it might be
 levels above leaf pages.

The high level blocks of frequently used indexes do a pretty good job
of keeping their usage counts high already, and so probably stay in
the buffer pool already.  And to the extent they don't, promoting all
indexes (even infrequently used ones, which I think most databases
have) would probably not be the way to encourage the others.

I would be more interested in looking at the sweep algorithm itself.
One thing I noticed in simulating the clock sweep is that the entry of
pages into the buffer with a usage count of 1 might not be very
useful.  That give that page 2 sweeps of the clock arm before getting
evicted, so they have an opportunity to get used again.  But since all
the blocks they are competing against also do the same thing, that
just means the arm sweeps about twice as fast, so they don't really
get much more of an opportunity.  The other thought was that each
buffers gets its usage incremented by 2 or 3 rather than 1 each time
it is found already in the cache.




 Maybe the thing to focus on first is the oft-discussed benchmark
 farm (similar to the build farm), with a good mix of loads, so
 that the impact of changes can be better tracked for multiple
 workloads on a variety of platforms and configurations.

Yeah, that sounds great.  Even just having a centrally organized group
of scripts/programs that have a good mix of loads, without the
automated farm to go with it, would be a help.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-22 Thread Andrew Dunstan



On 03/22/2011 12:47 PM, Jeff Janes wrote:



Maybe the thing to focus on first is the oft-discussed benchmark
farm (similar to the build farm), with a good mix of loads, so
that the impact of changes can be better tracked for multiple
workloads on a variety of platforms and configurations.

Yeah, that sounds great.  Even just having a centrally organized group
of scripts/programs that have a good mix of loads, without the
automated farm to go with it, would be a help.





Part of the reason for releasing the buildfarm server code a few months 
ago (see https://github.com/PGBuildFarm/server-code) was to encourage 
development of a benchmark farm, amoong other offspring. But I haven't 
seen such an animal emerging.


Someone just needs to sit down and do it and present us with a fait 
accompli.



cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-22 Thread Robert Haas
On Tue, Mar 22, 2011 at 11:24 AM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Fri, Mar 18, 2011 at 9:19 AM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Mar 18, 2011 at 11:14 AM, Kevin Grittner
 kevin.gritt...@wicourts.gov wrote:
 Maybe the thing to focus on first is the oft-discussed benchmark
 farm (similar to the build farm), with a good mix of loads, so
 that the impact of changes can be better tracked for multiple
 workloads on a variety of platforms and configurations.  Without
 something like that it is very hard to justify the added complexity
 of an idea like this in terms of the performance benefit gained.

 A related area that could use some looking at is why performance tops
 out at shared_buffers ~8GB and starts to fall thereafter.

 Under what circumstances does this happen?  Can a simple pgbench -S
 with a large scaling factor elicit this behavior?

To be honest, I'm mostly just reporting what I've heard Greg Smith say
on this topic.   I don't have any machine with that kind of RAM.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-22 Thread Devrim GÜNDÜZ
On Tue, 2011-03-22 at 15:53 -0400, Robert Haas wrote:
 
 To be honest, I'm mostly just reporting what I've heard Greg Smith say
 on this topic.   I don't have any machine with that kind of RAM. 

I thought we had a machine for hackers who want to do performance
testing. Mark?
-- 
Devrim GÜNDÜZ
Principal Systems Engineer @ EnterpriseDB: http://www.enterprisedb.com
PostgreSQL Danışmanı/Consultant, Red Hat Certified Engineer
Community: devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr
http://www.gunduz.org  Twitter: http://twitter.com/devrimgunduz


signature.asc
Description: This is a digitally signed message part


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-22 Thread Josh Berkus

Radek,


I have implemented initial concept of 2nd level cache. Idea is to keep some
segments of shared memory for special buffers (e.g. indices) to prevent
overwrite those by other operations. I added those functionality to nbtree
index scan.


The problem with any special buffering of database objects (other than 
maybe the system catalogs) improves one use case at the expense of 
others.  For example, special buffering of indexes would have a negative 
effect on use cases which are primarily seq scans.  Also, how would your 
index buffer work for really huge indexes, like GiST and GIN indexes?


In general, I think that improving the efficiency/scalability of our 
existing buffer system is probably going to bear a lot more fruit than 
adding extra levels of buffering.


That being said, one my argue that the root pages of btree indexes are a 
legitimate special case.   However, it seems like clock-sweep would end 
up keeping those in shared buffers all the time regardless.


--
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-22 Thread Radosław Smogura
Merlin Moncure mmonc...@gmail.com Monday 21 March 2011 20:58:16
 On Mon, Mar 21, 2011 at 2:08 PM, Greg Stark gsst...@mit.edu wrote:
  On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure mmonc...@gmail.com 
wrote:
  Can't you make just one large mapping and lock it in 8k regions? I
  thought the problem with mmap was not being able to detect other
  processes
  (http://www.mail-archive.com/pgsql-general@postgresql.org/msg122301.htm
  l) compatibility issues (possibly obsolete), etc.
  
  I was assuming that locking part of a mapping would force the kernel
  to split the mapping. It has to record the locked state somewhere so
  it needs a data structure that represents the size of the locked
  section and that would, I assume, be the mapping.
  
  It's possible the kernel would not in fact fall over too badly doing
  this. At some point I'll go ahead and do experiments on it. It's a bit
  fraught though as it the performance may depend on the memory
  management features of the chipset.
  
  That said, that's only part of the battle. On 32bit you can't map the
  whole database as your database could easily be larger than your
  address space. I have some ideas on how to tackle that but the
  simplest test would be to just mmap 8kB chunks everywhere.
 
 Even on 64 bit systems you only have 48 bit address space which is not
 a theoretical  limitation.  However, at least on linux you can map in
 and map out pretty quick (10 microseconds paired on my linux vm) so
 that's not so big of a deal.  Dealing with rapidly growing files is a
 problem.  That said, probably you are not going to want to reserve
 multiple gigabytes in 8k non contiguous chunks.
 
  But it's worse than that. Since you're not responsible for flushing
  blocks to disk any longer you need some way to *unlock* a block when
  it's possible to be flushed. That means when you flush the xlog you
  have to somehow find all the blocks that might no longer need to be
  locked and atomically unlock them. That would require new
  infrastructure we don't have though it might not be too hard.
  
  What would be nice is a mlock_until() where you eventually issue a
  call to tell the kernel what point in time you've reached and it
  unlocks everything older than that time.
 
 I wonder if there is any reason to mlock at all...if you are going to
 'do' mmap, can't you just hide under current lock architecture for
 actual locking and do direct memory access without mlock?
 
 merlin

Actually after dealing with mmap and adding munmap I found crucial thing why 
to not use mmap:
You need to munmap, and for me this takes much time, even if I read with 
SHARED | PROT_READ, it's looks like Linux do flush or something else, same as 
with MAP_FIXED, MAP_PRIVATE, etc.

Regards,
Radek

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-22 Thread Merlin Moncure
On Tue, Mar 22, 2011 at 4:28 PM, Radosław Smogura
rsmog...@softperience.eu wrote:
 Merlin Moncure mmonc...@gmail.com Monday 21 March 2011 20:58:16
 On Mon, Mar 21, 2011 at 2:08 PM, Greg Stark gsst...@mit.edu wrote:
  On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure mmonc...@gmail.com
 wrote:
  Can't you make just one large mapping and lock it in 8k regions? I
  thought the problem with mmap was not being able to detect other
  processes
  (http://www.mail-archive.com/pgsql-general@postgresql.org/msg122301.htm
  l) compatibility issues (possibly obsolete), etc.
 
  I was assuming that locking part of a mapping would force the kernel
  to split the mapping. It has to record the locked state somewhere so
  it needs a data structure that represents the size of the locked
  section and that would, I assume, be the mapping.
 
  It's possible the kernel would not in fact fall over too badly doing
  this. At some point I'll go ahead and do experiments on it. It's a bit
  fraught though as it the performance may depend on the memory
  management features of the chipset.
 
  That said, that's only part of the battle. On 32bit you can't map the
  whole database as your database could easily be larger than your
  address space. I have some ideas on how to tackle that but the
  simplest test would be to just mmap 8kB chunks everywhere.

 Even on 64 bit systems you only have 48 bit address space which is not
 a theoretical  limitation.  However, at least on linux you can map in
 and map out pretty quick (10 microseconds paired on my linux vm) so
 that's not so big of a deal.  Dealing with rapidly growing files is a
 problem.  That said, probably you are not going to want to reserve
 multiple gigabytes in 8k non contiguous chunks.

  But it's worse than that. Since you're not responsible for flushing
  blocks to disk any longer you need some way to *unlock* a block when
  it's possible to be flushed. That means when you flush the xlog you
  have to somehow find all the blocks that might no longer need to be
  locked and atomically unlock them. That would require new
  infrastructure we don't have though it might not be too hard.
 
  What would be nice is a mlock_until() where you eventually issue a
  call to tell the kernel what point in time you've reached and it
  unlocks everything older than that time.

 I wonder if there is any reason to mlock at all...if you are going to
 'do' mmap, can't you just hide under current lock architecture for
 actual locking and do direct memory access without mlock?

 merlin

 Actually after dealing with mmap and adding munmap I found crucial thing why
 to not use mmap:
 You need to munmap, and for me this takes much time, even if I read with
 SHARED | PROT_READ, it's looks like Linux do flush or something else, same as
 with MAP_FIXED, MAP_PRIVATE, etc.

can you produce small program demonstrating the problem?  This is not
how things should work AIUI.

I was thinking about playing with mmap implementation of clog system
-- it's perhaps better fit.  clog is rigidly defined size, and has
very high performance requirements.  Also it's much less changes than
reimplementing heap buffering, and maybe not so much affected by
munmap.

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-21 Thread Greg Stark
On Fri, Mar 18, 2011 at 11:55 PM, Josh Berkus j...@agliodbs.com wrote:
 To take the opposite approach... has anyone looked at having the OS just 
 manage all caching for us? Something like MMAPed shared buffers? Even if we 
 find the issue with large shared buffers, we still can't dedicate serious 
 amounts of memory to them because of work_mem issues. Granted, that's 
 something else on the TODO list, but it really seems like we're re-inventing 
 the wheels that the OS has already created here...

A lot of people have talked about it. You can find references to mmap
going at least as far back as 2001 or so. The problem is that it would
depend on the OS implementing things in a certain way and guaranteeing
things we don't think can be portably assumed. We would need to mlock
large amounts of address space which most OS's don't allow, and we
would need to at least mlock and munlock lots of small bits of memory
all over the place which would create lots and lots of mappings which
the kernel and hardware implementations would generally not
appreciate.

 As far as I know, no OS has a more sophisticated approach to eviction
 than LRU.  And clock-sweep is a significant improvement on performance
 over LRU for frequently accessed database objects ... plus our
 optimizations around not overwriting the whole cache for things like VACUUM.

The clock-sweep algorithm was standard OS design before you or I knew
how to type. I would expect any half-decent OS to have sometihng at
least as good -- perhaps better because it can rely on hardware
features to handle things.

However the second point is the crux of the issue and of all similar
issues on where to draw the line between the OS and Postgres. The OS
knows better about the hardware characteristics and can better
optimize the overall system behaviour, but Postgres understands better
its own access patterns and can better optimize its behaviour whereas
the OS is stuck reverse-engineering what Postgres needs, usually from
simple heuristics.


 2-level caches work well for a variety of applications.

I think 2-level caches with simple heuristics like pin all the
indexes is unlikely to be helpful. At least it won't optimize the
average case and I think that's been proven. It might be helpful for
optimizing the worst-case which would reduce the standard deviation.
Perhaps we're at the point now where that matters.

Where it might be helpful is as a more refined version of the
sequential scans use limited set of buffers patch. Instead of having
each sequential scan use a hard coded number of buffers, perhaps all
sequential scans should share a fraction of the global buffer pool
managed separately from the main pool. Though in my thought
experiments I don't see any real win here. In the current scheme if
there's any sign the buffer is useful it gets thrown from the
sequential scan's set of buffers to reuse anyways.

 Now, what would be *really* useful is some way to avoid all the data
 copying we do between shared_buffers and the FS cache.


Well the two options are mmap/mlock or directio. The former might be a
fun experiment but I expect any OS to fall over pretty quickly when
faced with thousands (or millions) of 8kB mappings. The latter would
need Postgres to do async i/o and hopefully a global view of its i/o
access patterns so it could do prefetching in a lot more cases.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-21 Thread rsmogura

On Mon, 21 Mar 2011 10:24:22 +, Greg Stark wrote:
On Fri, Mar 18, 2011 at 11:55 PM, Josh Berkus j...@agliodbs.com 
wrote:
To take the opposite approach... has anyone looked at having the OS 
just manage all caching for us? Something like MMAPed shared buffers? 
Even if we find the issue with large shared buffers, we still can't 
dedicate serious amounts of memory to them because of work_mem 
issues. Granted, that's something else on the TODO list, but it 
really seems like we're re-inventing the wheels that the OS has 
already created here...


A lot of people have talked about it. You can find references to mmap
going at least as far back as 2001 or so. The problem is that it 
would
depend on the OS implementing things in a certain way and 
guaranteeing

things we don't think can be portably assumed. We would need to mlock
large amounts of address space which most OS's don't allow, and we
would need to at least mlock and munlock lots of small bits of memory
all over the place which would create lots and lots of mappings which
the kernel and hardware implementations would generally not
appreciate.
Actually, just from curious, I done test with mmap, and I got 2% boost 
on data reading, maybe because of skipping memcpy in fread. I really 
curious how fast, if even, it will be if I add some good and needed 
stuff and how e.g. vacuum will work.


snip


2-level caches work well for a variety of applications.


I think 2-level caches with simple heuristics like pin all the
indexes is unlikely to be helpful. At least it won't optimize the
average case and I think that's been proven. It might be helpful for
optimizing the worst-case which would reduce the standard deviation.
Perhaps we're at the point now where that matters.

Actually, 2nd level caches do not pin index buffer. It's just, in 
simple words, some set of reserved buffers' ids to be used for index 
pages, all logic with pining, etc. it's same, the difference is that 
default level operation will not touch 2nd level. I post some reports 
from my simple tests. When I was experimenting with 2nd level caches I 
saw that some operations may swap out system tables buffers, too.


snip

Regards,
Radek

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-21 Thread Merlin Moncure
On Mon, Mar 21, 2011 at 5:24 AM, Greg Stark gsst...@mit.edu wrote:
 On Fri, Mar 18, 2011 at 11:55 PM, Josh Berkus j...@agliodbs.com wrote:
 To take the opposite approach... has anyone looked at having the OS just 
 manage all caching for us? Something like MMAPed shared buffers? Even if we 
 find the issue with large shared buffers, we still can't dedicate serious 
 amounts of memory to them because of work_mem issues. Granted, that's 
 something else on the TODO list, but it really seems like we're 
 re-inventing the wheels that the OS has already created here...

 A lot of people have talked about it. You can find references to mmap
 going at least as far back as 2001 or so. The problem is that it would
 depend on the OS implementing things in a certain way and guaranteeing
 things we don't think can be portably assumed. We would need to mlock
 large amounts of address space which most OS's don't allow, and we
 would need to at least mlock and munlock lots of small bits of memory
 all over the place which would create lots and lots of mappings which
 the kernel and hardware implementations would generally not
 appreciate.

 As far as I know, no OS has a more sophisticated approach to eviction
 than LRU.  And clock-sweep is a significant improvement on performance
 over LRU for frequently accessed database objects ... plus our
 optimizations around not overwriting the whole cache for things like VACUUM.

 The clock-sweep algorithm was standard OS design before you or I knew
 how to type. I would expect any half-decent OS to have sometihng at
 least as good -- perhaps better because it can rely on hardware
 features to handle things.

 However the second point is the crux of the issue and of all similar
 issues on where to draw the line between the OS and Postgres. The OS
 knows better about the hardware characteristics and can better
 optimize the overall system behaviour, but Postgres understands better
 its own access patterns and can better optimize its behaviour whereas
 the OS is stuck reverse-engineering what Postgres needs, usually from
 simple heuristics.


 2-level caches work well for a variety of applications.

 I think 2-level caches with simple heuristics like pin all the
 indexes is unlikely to be helpful. At least it won't optimize the
 average case and I think that's been proven. It might be helpful for
 optimizing the worst-case which would reduce the standard deviation.
 Perhaps we're at the point now where that matters.

 Where it might be helpful is as a more refined version of the
 sequential scans use limited set of buffers patch. Instead of having
 each sequential scan use a hard coded number of buffers, perhaps all
 sequential scans should share a fraction of the global buffer pool
 managed separately from the main pool. Though in my thought
 experiments I don't see any real win here. In the current scheme if
 there's any sign the buffer is useful it gets thrown from the
 sequential scan's set of buffers to reuse anyways.

 Now, what would be *really* useful is some way to avoid all the data
 copying we do between shared_buffers and the FS cache.


 Well the two options are mmap/mlock or directio. The former might be a
 fun experiment but I expect any OS to fall over pretty quickly when
 faced with thousands (or millions) of 8kB mappings. The latter would
 need Postgres to do async i/o and hopefully a global view of its i/o
 access patterns so it could do prefetching in a lot more cases.

Can't you make just one large mapping and lock it in 8k regions? I
thought the problem with mmap was not being able to detect other
processes 
(http://www.mail-archive.com/pgsql-general@postgresql.org/msg122301.html)
compatibility issues (possibly obsolete), etc.

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-21 Thread Heikki Linnakangas

On 21.03.2011 17:54, Merlin Moncure wrote:

Can't you make just one large mapping and lock it in 8k regions? I
thought the problem with mmap was not being able to detect other
processes 
(http://www.mail-archive.com/pgsql-general@postgresql.org/msg122301.html)
compatibility issues (possibly obsolete), etc.


That mail is about replacing SysV shared memory with mmap(). Detecting 
other processes is a problem in that use, but that's not an issue with 
using mmap() to replace shared buffers.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-21 Thread Josh Berkus
On 3/21/11 3:24 AM, Greg Stark wrote:
 2-level caches work well for a variety of applications.
 
 I think 2-level caches with simple heuristics like pin all the
 indexes is unlikely to be helpful. At least it won't optimize the
 average case and I think that's been proven. It might be helpful for
 optimizing the worst-case which would reduce the standard deviation.
 Perhaps we're at the point now where that matters.

You're missing my point ... Postgres already *has* a 2-level cache:
shared_buffers and the FS cache.  Anything we add to that will be adding
levels.

We already did that, actually, when we implemented ARC: effectively gave
PostgreSQL a 3-level cache.  The results were not very good, although
the algorithm could be at fault there.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-21 Thread Greg Stark
On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure mmonc...@gmail.com wrote:
 Can't you make just one large mapping and lock it in 8k regions? I
 thought the problem with mmap was not being able to detect other
 processes 
 (http://www.mail-archive.com/pgsql-general@postgresql.org/msg122301.html)
 compatibility issues (possibly obsolete), etc.

I was assuming that locking part of a mapping would force the kernel
to split the mapping. It has to record the locked state somewhere so
it needs a data structure that represents the size of the locked
section and that would, I assume, be the mapping.

It's possible the kernel would not in fact fall over too badly doing
this. At some point I'll go ahead and do experiments on it. It's a bit
fraught though as it the performance may depend on the memory
management features of the chipset.

That said, that's only part of the battle. On 32bit you can't map the
whole database as your database could easily be larger than your
address space. I have some ideas on how to tackle that but the
simplest test would be to just mmap 8kB chunks everywhere.

But it's worse than that. Since you're not responsible for flushing
blocks to disk any longer you need some way to *unlock* a block when
it's possible to be flushed. That means when you flush the xlog you
have to somehow find all the blocks that might no longer need to be
locked and atomically unlock them. That would require new
infrastructure we don't have though it might not be too hard.

What would be nice is a mlock_until() where you eventually issue a
call to tell the kernel what point in time you've reached and it
unlocks everything older than that time.


-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-21 Thread Greg Stark
On Mon, Mar 21, 2011 at 4:47 PM, Josh Berkus j...@agliodbs.com wrote:
 You're missing my point ... Postgres already *has* a 2-level cache:
 shared_buffers and the FS cache.  Anything we add to that will be adding
 levels.

I don't think those two levels are interesting -- they don't interact
cleverly at all.

I was assuming the two levels were segments of the shared buffers that
didn't interoperate at all. If you kick buffers from the higher level
cache into the lower level one then why not just increase the number
of clock sweeps before you flush a buffer and insert non-index pages
into a lower clock level instead of writing code for two levels?

I don't think it will outperform in general because LRU is provably
within some margin from optimal and the clock sweep is an approximate
LRU. The only place you're going to find wins is when you know
something extra about the *future* access pattern that the lru/clock
doesn't know based on the past behaviour. Just saying indexes are
heavily used or system tables are heavily used isn't really extra
information since the LRU can figure that out. Something like
sequential scans of tables larger than shared buffers don't go back
and read old pages before they age out is.

The other place you might win is if you have some queries that you
want to always be fast at the expense of slower queries. So your short
web queries that only need to touch a few small tables and system
tables can tag buffers that are higher priority and shouldn't be
swapped out to achieve a slightly higher hit rate on the global cache.


-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-21 Thread Alvaro Herrera
Excerpts from Josh Berkus's message of lun mar 21 13:47:21 -0300 2011:

 We already did that, actually, when we implemented ARC: effectively gave
 PostgreSQL a 3-level cache.  The results were not very good, although
 the algorithm could be at fault there.

Was it really all that bad?  IIRC we replaced ARC with the current clock
sweep due to patent concerns.  (Maybe there were performance concerns as
well, I don't remember).

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-21 Thread Josh Berkus

 Was it really all that bad?  IIRC we replaced ARC with the current clock
 sweep due to patent concerns.  (Maybe there were performance concerns as
 well, I don't remember).

Yeah, that was why the patent was frustrating.  Performance was poor and
we were planning on replacing ARC in 8.2 anyway.  Instead we had to
backport it.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-21 Thread Merlin Moncure
On Mon, Mar 21, 2011 at 2:08 PM, Greg Stark gsst...@mit.edu wrote:
 On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure mmonc...@gmail.com wrote:
 Can't you make just one large mapping and lock it in 8k regions? I
 thought the problem with mmap was not being able to detect other
 processes 
 (http://www.mail-archive.com/pgsql-general@postgresql.org/msg122301.html)
 compatibility issues (possibly obsolete), etc.

 I was assuming that locking part of a mapping would force the kernel
 to split the mapping. It has to record the locked state somewhere so
 it needs a data structure that represents the size of the locked
 section and that would, I assume, be the mapping.

 It's possible the kernel would not in fact fall over too badly doing
 this. At some point I'll go ahead and do experiments on it. It's a bit
 fraught though as it the performance may depend on the memory
 management features of the chipset.

 That said, that's only part of the battle. On 32bit you can't map the
 whole database as your database could easily be larger than your
 address space. I have some ideas on how to tackle that but the
 simplest test would be to just mmap 8kB chunks everywhere.

Even on 64 bit systems you only have 48 bit address space which is not
a theoretical  limitation.  However, at least on linux you can map in
and map out pretty quick (10 microseconds paired on my linux vm) so
that's not so big of a deal.  Dealing with rapidly growing files is a
problem.  That said, probably you are not going to want to reserve
multiple gigabytes in 8k non contiguous chunks.

 But it's worse than that. Since you're not responsible for flushing
 blocks to disk any longer you need some way to *unlock* a block when
 it's possible to be flushed. That means when you flush the xlog you
 have to somehow find all the blocks that might no longer need to be
 locked and atomically unlock them. That would require new
 infrastructure we don't have though it might not be too hard.

 What would be nice is a mlock_until() where you eventually issue a
 call to tell the kernel what point in time you've reached and it
 unlocks everything older than that time.

I wonder if there is any reason to mlock at all...if you are going to
'do' mmap, can't you just hide under current lock architecture for
actual locking and do direct memory access without mlock?

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-21 Thread Radosław Smogura
Merlin Moncure mmonc...@gmail.com Monday 21 March 2011 20:58:16
 On Mon, Mar 21, 2011 at 2:08 PM, Greg Stark gsst...@mit.edu wrote:
  On Mon, Mar 21, 2011 at 3:54 PM, Merlin Moncure mmonc...@gmail.com 
wrote:
  Can't you make just one large mapping and lock it in 8k regions? I
  thought the problem with mmap was not being able to detect other
  processes
  (http://www.mail-archive.com/pgsql-general@postgresql.org/msg122301.htm
  l) compatibility issues (possibly obsolete), etc.
  
  I was assuming that locking part of a mapping would force the kernel
  to split the mapping. It has to record the locked state somewhere so
  it needs a data structure that represents the size of the locked
  section and that would, I assume, be the mapping.
  
  It's possible the kernel would not in fact fall over too badly doing
  this. At some point I'll go ahead and do experiments on it. It's a bit
  fraught though as it the performance may depend on the memory
  management features of the chipset.
  
  That said, that's only part of the battle. On 32bit you can't map the
  whole database as your database could easily be larger than your
  address space. I have some ideas on how to tackle that but the
  simplest test would be to just mmap 8kB chunks everywhere.
 
 Even on 64 bit systems you only have 48 bit address space which is not
 a theoretical  limitation.  However, at least on linux you can map in
 and map out pretty quick (10 microseconds paired on my linux vm) so
 that's not so big of a deal.  Dealing with rapidly growing files is a
 problem.  That said, probably you are not going to want to reserve
 multiple gigabytes in 8k non contiguous chunks.
 
  But it's worse than that. Since you're not responsible for flushing
  blocks to disk any longer you need some way to *unlock* a block when
  it's possible to be flushed. That means when you flush the xlog you
  have to somehow find all the blocks that might no longer need to be
  locked and atomically unlock them. That would require new
  infrastructure we don't have though it might not be too hard.
  
  What would be nice is a mlock_until() where you eventually issue a
  call to tell the kernel what point in time you've reached and it
  unlocks everything older than that time.
Sorry for curious, but I think mlock is for swap prevent not for flush 
prevent.

 I wonder if there is any reason to mlock at all...if you are going to
 'do' mmap, can't you just hide under current lock architecture for
 actual locking and do direct memory access without mlock?
 
 merlin

mmap man do not say anything about when flush occurs when mmap is file and is 
shared, so flushes may be intended or not. Much more, this what I read, SysV 
shared memory is emulated by mmap (and I think this mmap is on /dev/shm)

Radek

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-18 Thread rsmogura

On Thu, 17 Mar 2011 16:02:18 -0500, Kevin Grittner wrote:

Rados*aw Smogurarsmog...@softperience.eu wrote:


I have implemented initial concept of 2nd level cache. Idea is to
keep some segments of shared memory for special buffers (e.g.
indices) to prevent overwrite those by other operations. I added
those functionality to nbtree index scan.

I tested this with doing index scan, seq read, drop system
buffers, do index scan and in few places I saw performance
improvements, but actually, I'm not sure if this was just random
or intended improvement.


I've often wondered about this.  In a database I developed back in
the '80s it was clearly a win to have a special cache for index
entries and other special pages closer to the database than the
general cache.  A couple things have changed since the '80s (I mean,
besides my waistline and hair color), and PostgreSQL has many
differences from that other database, so I haven't been sure it
would help as much, but I have wondered.

I can't really look at this for a couple weeks, but I'm definitely
interested.  I suggest that you add this to the next CommitFest as a
WIP patch, under the Performance category.

https://commitfest.postgresql.org/action/commitfest_view/open


There is few places to optimize code as well, and patch need many
work, but may you see it and give opinions?


For something like this it makes perfect sense to show proof of
concept before trying to cover everything.

-Kevin


Yes, there is some change, and I looked at this more carefully, as my 
performance results wasn't such as I expected. I found PG uses 
BufferAccessStrategy to do sequence scans, so my test query took only 32 
buffers from pool and didn't overwritten index pool too much. This BAS 
is really surprising. In any case when I end polishing I will send good 
patch, with proof.


Actually idea of this patch was like this:
Some operations requires many buffers, PG uses clock sweep to get 
next free buffer, so it may overwrite index buffer. From point of view 
of good database design We should use indices, so purging out index from 
cache will affect performance.


As the side effect I saw that this 2nd level keeps pg_* indices in 
memory too, so I think to include 3rd level cache for some pg_* tables.


Regards,
Radek

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-18 Thread Kevin Grittner
rsmogura rsmog...@softperience.eu wrote:
 
 Yes, there is some change, and I looked at this more carefully, as
 my performance results wasn't such as I expected. I found PG uses 
 BufferAccessStrategy to do sequence scans, so my test query took
 only 32 buffers from pool and didn't overwritten index pool too
 much. This BAS is really surprising. In any case when I end
 polishing I will send good patch, with proof.
 
Yeah, that heuristic makes this less critical, for sure.
 
 Actually idea of this patch was like this:
 Some operations requires many buffers, PG uses clock sweep to
 get next free buffer, so it may overwrite index buffer. From point
 of view of good database design We should use indices, so purging
 out index from cache will affect performance.
 
 As the side effect I saw that this 2nd level keeps pg_* indices
 in memory too, so I think to include 3rd level cache for some pg_*
 tables.
 
Well, the more complex you make it the more overhead there is, which
makes it harder to come out ahead.  FWIW, in musing about it (as
recently as this week), my idea was to add another field which would
factor into the clock sweep calculations.  For indexes, it might be
levels above leaf pages.  I haven't reviewed the code in depth to
know how to use it, this was just idle daydreaming based on that
prior experience.  It's far from certain that the concept will
actually prove beneficial in PostgreSQL.
 
Maybe the thing to focus on first is the oft-discussed benchmark
farm (similar to the build farm), with a good mix of loads, so
that the impact of changes can be better tracked for multiple
workloads on a variety of platforms and configurations.  Without
something like that it is very hard to justify the added complexity
of an idea like this in terms of the performance benefit gained.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-18 Thread Robert Haas
On Fri, Mar 18, 2011 at 11:14 AM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Maybe the thing to focus on first is the oft-discussed benchmark
 farm (similar to the build farm), with a good mix of loads, so
 that the impact of changes can be better tracked for multiple
 workloads on a variety of platforms and configurations.  Without
 something like that it is very hard to justify the added complexity
 of an idea like this in terms of the performance benefit gained.

A related area that could use some looking at is why performance tops
out at shared_buffers ~8GB and starts to fall thereafter.  InnoDB can
apparently handle much larger buffer pools without a performance
drop-off.  There are some advantages to our reliance on the OS buffer
cache, to be sure, but as RAM continues to grow this might start to
get annoying.  On a 4GB system you might have shared_buffers set to
25% of memory, but on a 64GB system it'll be a smaller percentage, and
as memory capacities continue to clime it'll be smaller still.
Unfortunately I don't have the hardware to investigate this, but it's
worth thinking about, especially if we're thinking of doing things
that add more caching.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-18 Thread Alvaro Herrera
Excerpts from rsmogura's message of vie mar 18 11:57:48 -0300 2011:

  Actually idea of this patch was like this:
  Some operations requires many buffers, PG uses clock sweep to get 
  next free buffer, so it may overwrite index buffer. From point of view 
  of good database design We should use indices, so purging out index from 
  cache will affect performance.

The BufferAccessStrategy stuff was written to solve this problem.

  As the side effect I saw that this 2nd level keeps pg_* indices in 
  memory too, so I think to include 3rd level cache for some pg_* tables.

Keep in mind that there's already another layer of caching (see
syscache.c) for system catalogs on top of the buffer cache.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-18 Thread Jim Nasby
On Mar 18, 2011, at 11:19 AM, Robert Haas wrote:
 On Fri, Mar 18, 2011 at 11:14 AM, Kevin Grittner
 kevin.gritt...@wicourts.gov wrote:
 A related area that could use some looking at is why performance tops
 out at shared_buffers ~8GB and starts to fall thereafter.  InnoDB can
 apparently handle much larger buffer pools without a performance
 drop-off.  There are some advantages to our reliance on the OS buffer
 cache, to be sure, but as RAM continues to grow this might start to
 get annoying.  On a 4GB system you might have shared_buffers set to
 25% of memory, but on a 64GB system it'll be a smaller percentage, and
 as memory capacities continue to clime it'll be smaller still.
 Unfortunately I don't have the hardware to investigate this, but it's
 worth thinking about, especially if we're thinking of doing things
 that add more caching.

+1

To take the opposite approach... has anyone looked at having the OS just manage 
all caching for us? Something like MMAPed shared buffers? Even if we find the 
issue with large shared buffers, we still can't dedicate serious amounts of 
memory to them because of work_mem issues. Granted, that's something else on 
the TODO list, but it really seems like we're re-inventing the wheels that the 
OS has already created here...
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-18 Thread Robert Haas
On Fri, Mar 18, 2011 at 2:15 PM, Jim Nasby j...@nasby.net wrote:
 +1

 To take the opposite approach... has anyone looked at having the OS just 
 manage all caching for us? Something like MMAPed shared buffers? Even if we 
 find the issue with large shared buffers, we still can't dedicate serious 
 amounts of memory to them because of work_mem issues. Granted, that's 
 something else on the TODO list, but it really seems like we're re-inventing 
 the wheels that the OS has already created here...

The problem is that the OS doesn't offer any mechanism that would
allow us to obey the WAL-before-data rule.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-18 Thread Radosław Smogura
Kevin Grittner kevin.gritt...@wicourts.gov Thursday 17 March 2011 22:02:18
 Rados*aw Smogurarsmog...@softperience.eu wrote:
  I have implemented initial concept of 2nd level cache. Idea is to
  keep some segments of shared memory for special buffers (e.g.
  indices) to prevent overwrite those by other operations. I added
  those functionality to nbtree index scan.
  
  I tested this with doing index scan, seq read, drop system
  buffers, do index scan and in few places I saw performance
  improvements, but actually, I'm not sure if this was just random
  or intended improvement.
 
 I've often wondered about this.  In a database I developed back in
 the '80s it was clearly a win to have a special cache for index
 entries and other special pages closer to the database than the
 general cache.  A couple things have changed since the '80s (I mean,
 besides my waistline and hair color), and PostgreSQL has many
 differences from that other database, so I haven't been sure it
 would help as much, but I have wondered.
 
 I can't really look at this for a couple weeks, but I'm definitely
 interested.  I suggest that you add this to the next CommitFest as a
 WIP patch, under the Performance category.
 
 https://commitfest.postgresql.org/action/commitfest_view/open
 
  There is few places to optimize code as well, and patch need many
  work, but may you see it and give opinions?
 
 For something like this it makes perfect sense to show proof of
 concept before trying to cover everything.
 
 -Kevin

Here I attach latest version of patch with few performance improvements (code 
is still dirty) and some reports from test, as well my simple tests.

Actually there is small improvement without dropping system caches, and bigger 
with dropping. I have small performance decrease (if we can talk about 
measuring basing on this tests) to original PG version when dealing with same 
configuration, but increase is with 2nd level buffers... or maybe I badly 
compared reports.

In tests I tried to choose typical, simple queries. 

Regards,
Radek


2nd_lvl_cache_20110318.diff.bz2
Description: application/bzip


test-scritps_20110319_0026.tar.bz2
Description: application/bzip-compressed-tar


reports_20110318.tar.bz2
Description: application/bzip-compressed-tar

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-18 Thread Josh Berkus
On 3/18/11 11:15 AM, Jim Nasby wrote:
 To take the opposite approach... has anyone looked at having the OS just 
 manage all caching for us? Something like MMAPed shared buffers? Even if we 
 find the issue with large shared buffers, we still can't dedicate serious 
 amounts of memory to them because of work_mem issues. Granted, that's 
 something else on the TODO list, but it really seems like we're re-inventing 
 the wheels that the OS has already created here...

As far as I know, no OS has a more sophisticated approach to eviction
than LRU.  And clock-sweep is a significant improvement on performance
over LRU for frequently accessed database objects ... plus our
optimizations around not overwriting the whole cache for things like VACUUM.

2-level caches work well for a variety of applications.

Now, what would be *really* useful is some way to avoid all the data
copying we do between shared_buffers and the FS cache.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] 2nd Level Buffer Cache

2011-03-17 Thread Radosław Smogura
Hi,

I have implemented initial concept of 2nd level cache. Idea is to keep some 
segments of shared memory for special buffers (e.g. indices) to prevent 
overwrite those by other operations. I added those functionality to nbtree 
index scan.

I tested this with doing index scan, seq read, drop system buffers, do index 
scan and in few places I saw performance improvements, but actually, I'm not 
sure if this was just random or intended improvement.

There is few places to optimize code as well, and patch need many work, but 
may you see it and give opinions?

Regards,
Radek
diff --git a/.gitignore b/.gitignore
index 3f11f2e..6542e35 100644
--- a/.gitignore
+++ b/.gitignore
@@ -22,3 +22,4 @@ lcov.info
 /GNUmakefile
 /config.log
 /config.status
+/nbproject/private/
\ No newline at end of file
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 2796445..0229f5a 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -508,7 +508,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
 	if (blkno != P_NEW)
 	{
 		/* Read an existing block of the relation */
-		buf = ReadBuffer(rel, blkno);
+		buf = ReadBufferLevel(rel, blkno, BUFFER_LEVEL_2ND);
 		LockBuffer(buf, access);
 		_bt_checkpage(rel, buf);
 	}
@@ -548,7 +548,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
 			blkno = GetFreeIndexPage(rel);
 			if (blkno == InvalidBlockNumber)
 break;
-			buf = ReadBuffer(rel, blkno);
+			buf = ReadBufferLevel(rel, blkno, BUFFER_LEVEL_2ND);
 			if (ConditionalLockBuffer(buf))
 			{
 page = BufferGetPage(buf);
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index dadb49d..2922711 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -22,6 +22,7 @@ BufferDesc *BufferDescriptors;
 char	   *BufferBlocks;
 int32	   *PrivateRefCount;
 
+BufferLevelDesc *bufferLevels;
 
 /*
  * Data Structures:
@@ -72,8 +73,7 @@ int32	   *PrivateRefCount;
 void
 InitBufferPool(void)
 {
-	bool		foundBufs,
-foundDescs;
+	bool		foundBufs, foundDescs, foundBufferLevels = false;
 
 	BufferDescriptors = (BufferDesc *)
 		ShmemInitStruct(Buffer Descriptors,
@@ -83,19 +83,38 @@ InitBufferPool(void)
 		ShmemInitStruct(Buffer Blocks,
 		NBuffers * (Size) BLCKSZ, foundBufs);
 
-	if (foundDescs || foundBufs)
+bufferLevels = (BufferLevelDesc*)
+ShmemInitStruct(Buffer Levels Descriptors Table,
+		sizeof(BufferLevelDesc) * BUFFER_LEVEL_SIZE, 
+foundBufferLevels);
+	if (foundDescs || foundBufs || foundBufferLevels)
 	{
 		/* both should be present or neither */
-		Assert(foundDescs  foundBufs);
+		Assert(foundDescs  foundBufs  foundBufferLevels);
 		/* note: this path is only taken in EXEC_BACKEND case */
 	}
 	else
 	{
 		BufferDesc *buf;
+BufferLevelDesc *bufferLevelDesc;
+
 		int			i;
-
+
 		buf = BufferDescriptors;
 
+/* Initialize buffer levels. */
+//1st Level - Default
+bufferLevelDesc = bufferLevels;
+bufferLevelDesc-index = 0;
+bufferLevelDesc-super = BUFFER_LEVEL_END_OF_LIST;
+bufferLevelDesc-lower = BUFFER_LEVEL_END_OF_LIST;
+
+//2nd Level - For indices
+bufferLevelDesc++;
+bufferLevelDesc-index = 1;
+bufferLevelDesc-super = BUFFER_LEVEL_END_OF_LIST;
+bufferLevelDesc-lower = 0;
+
 		/*
 		 * Initialize all the buffer headers.
 		 */
@@ -117,6 +136,10 @@ InitBufferPool(void)
 			 */
 			buf-freeNext = i + 1;
 
+/* Assign buffer level. */
+//TODO Currently hardcoded - 
+buf-buf_level = ( 0.3 * NBuffers  i ) ? BUFFER_LEVEL_DEFAULT : BUFFER_LEVEL_2ND;
+
 			buf-io_in_progress_lock = LWLockAssign();
 			buf-content_lock = LWLockAssign();
 		}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1f89e52..867bae0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -47,7 +47,8 @@
 #include storage/standby.h
 #include utils/rel.h
 #include utils/resowner.h
-
+#include catalog/pg_type.h
+#include funcapi.h
 
 /* Note: these two macros only work on shared buffers, not local ones! */
 #define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)-buf_id) * BLCKSZ))
@@ -85,7 +86,7 @@ static volatile BufferDesc *PinCountWaitBuf = NULL;
 static Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
   ForkNumber forkNum, BlockNumber blockNum,
   ReadBufferMode mode, BufferAccessStrategy strategy,
-  bool *hit);
+  bool *hit, BufferLevel bufferLevel);
 static bool PinBuffer(volatile BufferDesc *buf, 

Re: [HACKERS] 2nd Level Buffer Cache

2011-03-17 Thread Kevin Grittner
Rados*aw Smogurarsmog...@softperience.eu wrote:
 
 I have implemented initial concept of 2nd level cache. Idea is to
 keep some segments of shared memory for special buffers (e.g.
 indices) to prevent overwrite those by other operations. I added
 those functionality to nbtree index scan.
 
 I tested this with doing index scan, seq read, drop system
 buffers, do index scan and in few places I saw performance
 improvements, but actually, I'm not sure if this was just random
 or intended improvement.
 
I've often wondered about this.  In a database I developed back in
the '80s it was clearly a win to have a special cache for index
entries and other special pages closer to the database than the
general cache.  A couple things have changed since the '80s (I mean,
besides my waistline and hair color), and PostgreSQL has many
differences from that other database, so I haven't been sure it
would help as much, but I have wondered.
 
I can't really look at this for a couple weeks, but I'm definitely
interested.  I suggest that you add this to the next CommitFest as a
WIP patch, under the Performance category.
 
https://commitfest.postgresql.org/action/commitfest_view/open
 
 There is few places to optimize code as well, and patch need many
 work, but may you see it and give opinions?
 
For something like this it makes perfect sense to show proof of
concept before trying to cover everything.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers