Re: [PERFORM] First set of OSDL Shared Mem scalability results, some

2004-10-25 Thread Tom Lane
I wrote:
 I don't have the URL at hand but it was posted just a few days ago.

... actually, it was the beginning of this here thread ...

regards, tom lane



Re: [PERFORM] First set of OSDL Shared Mem scalability results, some

2004-10-24 Thread Tom Lane
Curt Sampson [EMAIL PROTECTED] writes:
 On Sat, 23 Oct 2004, Tom Lane wrote:
 Seems to me the overhead of any such scheme would swamp the savings from
 avoiding kernel/userspace copies ...

 Well, one really can't know without testing, but memory copies are
 extremely expensive if they go outside of the cache.

Sure, but what about all the copying from write queue to page?

 the locking issues alone would be painful.

 I don't see why they would be any more painful than the current locking
 issues.

Because there are more locks --- the write queue data structure will
need to be locked separately from the page.  (Even with a separate write
queue per page, there will need to be a shared data structure that
allows you to allocate and find write queues, and that thing will be a
subject of contention.  See BufMgrLock, which is not held while actively
twiddling the contents of pages, but is a serious cause of contention
anyway.)

regards, tom lane



Re: [PERFORM] First set of OSDL Shared Mem scalability results, some

2004-10-24 Thread Curt Sampson
On Sun, 24 Oct 2004, Tom Lane wrote:

  Well, one really can't know without testing, but memory copies are
  extremely expensive if they go outside of the cache.

 Sure, but what about all the copying from write queue to page?

There's a pretty big difference between a few-hundred-bytes-on-write memory
copy and an eight-kilobytes-with-every-read one.

As for the queue allocation, again, I have no data to back this up, but
I don't think it would be as bad as BufMgrLock. Not every page will have
a write queue, and a hot page is only going to get one once. (If a
page has a write queue, you might as well leave it with the page after
flushing it, and get rid of it only when the page leaves memory.)

I see the OS issues related to mapping that much memory as a much bigger
potential problem.

cjs
-- 
Curt Sampson  [EMAIL PROTECTED]   +81 90 7737 2974   http://www.NetBSD.org
 Make up enjoying your city life...produced by BIC CAMERA



Re: [PERFORM] First set of OSDL Shared Mem scalability results, some

2004-10-24 Thread Tom Lane
Curt Sampson [EMAIL PROTECTED] writes:
 I see the OS issues related to mapping that much memory as a much bigger
 potential problem.

I see potential problems everywhere I look ;-)

Considering that the available numbers suggest we could win just a few
percent (and that's assuming that all this extra mechanism has zero
cost), I can't believe that the project is worth spending manpower on.
There is a lot of much more attractive fruit hanging at lower levels.
The bitmap-indexing stuff that was recently being discussed, for
instance, would certainly take less effort than this; it would create
no new portability issues; and at least for the queries where it helps,
it could offer integer-multiple speedups, not percentage points.

My engineering professors taught me that you put large effort where you
have a chance at large rewards.  Converting PG to mmap doesn't seem to
meet that test, even if I believed it would work.

regards, tom lane



Re: [PERFORM] First set of OSDL Shared Mem scalability results, some

2004-10-24 Thread Curt Sampson
On Sun, 24 Oct 2004, Tom Lane wrote:

 Considering that the available numbers suggest we could win just a few
 percent...

I must confess that I was completely unaware of these numbers. Where
do I find them?

cjs
-- 
Curt Sampson  [EMAIL PROTECTED]   +81 90 7737 2974   http://www.NetBSD.org
 Make up enjoying your city life...produced by BIC CAMERA



Re: [PERFORM] First set of OSDL Shared Mem scalability results, some

2004-10-23 Thread Curt Sampson
On Sat, 9 Oct 2004, Tom Lane wrote:

 mmap provides msync which is comparable to fsync, but AFAICS it
 provides no way to prevent an in-memory change from reaching disk too
 soon.  This would mean that WAL entries would have to be written *and
 flushed* before we could make the data change at all, which would
 convert multiple updates of a single page into a series of write-and-
 wait-for-WAL-fsync steps.  Not good.  fsync'ing WAL once per transaction
 is bad enough, once per atomic action is intolerable.

Back when I was working out how to do this, I reckoned that you could
use mmap by keeping a write queue for each modified page. Reading,
you'd have to read the datum from the page and then check the write
queue for that page to see if that datum had been updated, using the
new value if it's there. Writing, you'd add the modified datum to the
write queue, but not apply the write queue to the page until you'd had
confirmation that the corresponding transaction log entry had been
written. So multiple writes are no big deal; they just all queue up in
the write queue, and at any time you can apply as much of the write
queue to the page itself as the current log entry will allow.
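
In very rough C, the bookkeeping I have in mind would look something like
this (the names and types are mine and purely illustrative; nothing like
them exists in the tree):

    /* Illustrative sketch only -- not actual PostgreSQL code. */
    #include <stdint.h>
    #include <string.h>

    typedef struct QueuedWrite
    {
        uint16_t    offset;         /* where in the 8K page the datum goes */
        uint16_t    len;            /* length of the modified datum */
        uint64_t    required_lsn;   /* WAL position that must be flushed first */
        char        data[];         /* the new bytes */
    } QueuedWrite;

    typedef struct PageWriteQueue
    {
        QueuedWrite **writes;       /* pending writes, in insertion order */
        int           nwrites;
    } PageWriteQueue;

    /* Readers overlay any still-queued writes on what the mapped page says. */
    static void
    read_datum(const char *mapped_page, const PageWriteQueue *q,
               uint16_t offset, uint16_t len, char *out)
    {
        memcpy(out, mapped_page + offset, len);
        for (int i = 0; i < q->nwrites; i++)
            if (q->writes[i]->offset == offset && q->writes[i]->len == len)
                memcpy(out, q->writes[i]->data, len);   /* newer value wins */
    }

    /* Once the WAL is known to be flushed through flushed_lsn, apply (and
     * drop) every queued write the log now allows to reach the page.
     * (Freeing the applied entries is omitted here.) */
    static void
    apply_write_queue(char *mapped_page, PageWriteQueue *q, uint64_t flushed_lsn)
    {
        int kept = 0;

        for (int i = 0; i < q->nwrites; i++)
        {
            if (q->writes[i]->required_lsn <= flushed_lsn)
                memcpy(mapped_page + q->writes[i]->offset,
                       q->writes[i]->data, q->writes[i]->len);
            else
                q->writes[kept++] = q->writes[i];
        }
        q->nwrites = kept;
    }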

There are several different strategies available for mapping and
unmapping the pages, and in fact there might need to be several
available to get the best performance out of different systems. Most
OSes do not seem to be optimized for having thousands or tens of
thousands of small mappings (certainly NetBSD isn't), but I've never
done any performance tests to see what kind of strategies might work
well or not.

cjs
-- 
Curt Sampson  [EMAIL PROTECTED]   +81 90 7737 2974   http://www.NetBSD.org
 Make up enjoying your city life...produced by BIC CAMERA



Re: [PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...

2004-10-23 Thread Tom Lane
Curt Sampson [EMAIL PROTECTED] writes:
 Back when I was working out how to do this, I reckoned that you could
 use mmap by keeping a write queue for each modified page. Reading,
 you'd have to read the datum from the page and then check the write
 queue for that page to see if that datum had been updated, using the
 new value if it's there. Writing, you'd add the modified datum to the
 write queue, but not apply the write queue to the page until you'd had
 confirmation that the corresponding transaction log entry had been
 written. So multiple writes are no big deal; they just all queue up in
 the write queue, and at any time you can apply as much of the write
 queue to the page itself as the current log entry will allow.

Seems to me the overhead of any such scheme would swamp the savings from
avoiding kernel/userspace copies ... the locking issues alone would be
painful.

regards, tom lane



Re: [PERFORM] First set of OSDL Shared Mem scalability results, some

2004-10-23 Thread Curt Sampson
On Sat, 23 Oct 2004, Tom Lane wrote:

 Seems to me the overhead of any such scheme would swamp the savings from
 avoiding kernel/userspace copies ...

Well, one really can't know without testing, but memory copies are
extremely expensive if they go outside of the cache.

 the locking issues alone would be painful.

I don't see why they would be any more painful than the current locking
issues. In fact, I don't see any reason to add more locking than we
already use when updating pages.

cjs
-- 
Curt Sampson  [EMAIL PROTECTED]   +81 90 7737 2974   http://www.NetBSD.org
 Make up enjoying your city life...produced by BIC CAMERA



Re: [PERFORM] First set of OSDL Shared Mem scalability results, some

2004-10-18 Thread Jan Wieck
On 10/14/2004 6:36 PM, Simon Riggs wrote:
> [...]
> I think Jan has said this also in far fewer words, but I'll leave that to
> Jan to agree/disagree...

I do agree. The total DB size has as little to do with the optimum
shared buffer cache size as the total available RAM of the machine does.

After reading your comments it appears clearer to me. All that those tests
did show is the amount of frequently accessed data in this particular
database population and workload combination.

> I say this: ARC in 8.0 PostgreSQL allows us to sensibly allocate as large a
> shared_buffers cache as is required by the database workload, and this
> should not be constrained to a small percentage of server RAM.

Right.

Jan
--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #


Re: [PERFORM] First set of OSDL Shared Mem scalability results, some

2004-10-18 Thread Jan Wieck
On 10/14/2004 8:10 PM, Christopher Browne wrote:
> Quoth [EMAIL PROTECTED] (Simon Riggs):
>> I say this: ARC in 8.0 PostgreSQL allows us to sensibly allocate as
>> large a shared_buffers cache as is required by the database
>> workload, and this should not be constrained to a small percentage
>> of server RAM.
>
> I don't think that this particularly follows from what ARC does.

The combination of ARC together with the background writer is supposed
to allow us to allocate the optimum even if that is large. The former
implementation of the LRU without a background writer would just hang the
server for a long time during a checkpoint, which is absolutely
unacceptable for any OLTP system.

Jan

> What ARC does is to prevent certain conspicuous patterns of
> sequential accesses from essentially trashing the contents of the
> cache.
>
> If a particular benchmark does not include conspicuous vacuums or
> sequential scans on large tables, then there is little reason to
> expect ARC to have a noticeable impact on performance.
>
> It _could_ be that this implies that ARC allows you to get some use
> out of a larger shared cache, as it won't get blown away by vacuums
> and Seq Scans.  But it is _not_ obvious that this is a necessary
> truth.
>
> _Other_ truths we know about are:
>
>  a) If you increase the shared cache, that means more data that is
>     represented in both the shared cache and the OS buffer cache,
>     which seems rather a waste;
>
>  b) The larger the shared cache, the more pages there are for the
>     backend to rummage through before it looks to the filesystem,
>     and therefore the more expensive cache misses get.  Cache hits
>     get more expensive, too.  Searching through memory is not
>     costless.

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #


Re: [PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...

2004-10-15 Thread Kevin Brown
Tom Lane wrote:
 Kevin Brown [EMAIL PROTECTED] writes:
  Hmm...something just occurred to me about this.
 
  Would a hybrid approach be possible?  That is, use mmap() to handle
  reads, and use write() to handle writes?
 
 Nope.  Have you read the specs regarding mmap-vs-stdio synchronization?
 Basically it says that there are no guarantees whatsoever if you try
 this.  The SUS text is a bit weaselly (the application must ensure
 correct synchronization) but the HPUX mmap man page, among others,
 lays it on the line:
 
  It is also unspecified whether write references to a memory region
  mapped with MAP_SHARED are visible to processes reading the file and
  whether writes to a file are visible to processes that have mapped the
  modified portion of that file, except for the effect of msync().
 
 It might work on particular OSes but I think depending on such behavior
 would be folly...

Yeah, and at this point it can't be considered portable in any real
way because of this.  Thanks for the perspective.  I should have
expected the general specification to be quite broken in this regard,
not to mention certain implementations.  :-)

Good thing there's a lot of lower-hanging fruit than this...



-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [PERFORM] First set of OSDL Shared Mem scalability results, some

2004-10-15 Thread Alan Stange
Tom Lane wrote:
> Kevin Brown [EMAIL PROTECTED] writes:
>> Hmm...something just occurred to me about this.
>>
>> Would a hybrid approach be possible?  That is, use mmap() to handle
>> reads, and use write() to handle writes?
>
> Nope.  Have you read the specs regarding mmap-vs-stdio synchronization?
> Basically it says that there are no guarantees whatsoever if you try
> this.  The SUS text is a bit weaselly (the application must ensure
> correct synchronization) but the HPUX mmap man page, among others,
> lays it on the line:
>
>      It is also unspecified whether write references to a memory region
>      mapped with MAP_SHARED are visible to processes reading the file and
>      whether writes to a file are visible to processes that have mapped the
>      modified portion of that file, except for the effect of msync().
>
> It might work on particular OSes but I think depending on such behavior
> would be folly...

We have some anecdotal experience along these lines: there was a set
of kernel bugs in Solaris 2.6 or 7 related to this as well.  We had
several kernel panics and it took a bit to chase down, but the basic
feedback was "oops, we're screwed."  I've forgotten most of the
details right now; the basic problem was that a file was being read+written
via mmap and read()/write() at (essentially) the same time from the same
pid.  It would panic the system quite reliably.  I believe the bugs
related to this have been resolved in Solaris, but it was unpleasant to
chase that problem down...

-- Alan


Re: [PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...

2004-10-15 Thread Sean Chittenden
> this.  The SUS text is a bit weaselly (the application must ensure
> correct synchronization) but the HPUX mmap man page, among others,
> lays it on the line:
>
>      It is also unspecified whether write references to a memory region
>      mapped with MAP_SHARED are visible to processes reading the file and
>      whether writes to a file are visible to processes that have mapped the
>      modified portion of that file, except for the effect of msync().
>
> It might work on particular OSes but I think depending on such behavior
> would be folly...

Agreed.  Only OSes with a coherent file system buffer cache should ever
use mmap(2).  In order for this to work on HPUX, msync(2) would need to
be used.  -sc

--
Sean Chittenden


Re: [PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...

2004-10-14 Thread Simon Riggs

First off, I'd like to get involved with these tests - only pressure of
other work has prevented me.

Here's my take on the results so far:

I think taking the ratio of the memory allocated to shared_buffers against
the total memory available on the server is completely fallacious. That is
why the results cannot be explained - IMHO the ratio has no real theoretical
basis.

The important ratio for me is the amount of shared_buffers against the total
size of the database in the benchmark test. Every database workload has a
differing percentage of the total database size that represents the working
set, or the memory that can be beneficially cached. For the tests that
DBT-2 is performing, I say that there are only so many blocks that are worth
the trouble of caching. If you cache more than this, you are wasting your time.

For me, these tests don't show that there is a sweet spot that you should
set your shared_buffers to, only that for that specific test, you have
located the correct size for shared_buffers. It would be an incorrect
inference to interpret this as the percentage of the available RAM where
the sweet spot lies for all workloads.

The theoretical basis for my comments is this: DBT-2 is essentially a static
workload. That means that, for a long test, we can work out with reasonable
certainty the probability that a block will be requested, for every single
block in the database. Given a particular size of cache, you can work out
what your overall cache hit ratio is and therefore what your speedup is
compared with retrieving every single block from disk (the "no cache"
scenario). If you draw a graph of speedup (y) against cache size as a % of
total database size, the graph looks like an upside-down "L" - i.e. the
graph rises steeply as you give it more memory, then turns sharply at a
particular point, after which it flattens out. The turning point is the
sweet spot we all seek - the optimum amount of cache memory to allocate -
but this spot depends upon the workload and database size, not on available
RAM on the system under test.
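
In rough notation (mine, and heavily idealized): if p(i) is the long-run
probability that block i is requested, and the cache holds the c most
frequently requested blocks, then

    h(c)       = p(1) + p(2) + ... + p(c)
    speedup(c) ~ t_disk / ( h(c) * t_mem + (1 - h(c)) * t_disk )

With t_mem far smaller than t_disk and a skewed access distribution, h(c)
saturates once c covers the working set, which is exactly what produces the
steep rise and the flat top of that curve.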

Clearly, the presence of the OS disk cache complicates this. Since we have
two caches both allocated from the same pot of memory, it should be clear
that if we overallocate one cache beyond its optimum effectiveness, while
the second cache is still in its "more is better" stage, then we will get
reduced performance. That seems to be the case here. I wouldn't accept that
a fixed ratio between the two caches exists for ALL, or even the majority
of, workloads - though clearly broad-brush workloads such as OLTP and Data
Warehousing do have similar-ish requirements.

Let's look at an example: an application with two tables. SmallTab has
10,000 rows of 100 bytes each (so the table is ~1 MB) - one row per photo
in a photo gallery web site. LargeTab has large objects within it and has
10,000 photos, average size 10 MB (so the table is ~100 GB). Assuming all
photos are requested randomly, you can see that an optimum cache size for
this workload is 1 MB RAM, 100 GB disk. Trying to up the cache doesn't have
much effect on the probability that a photo (from LargeTab) will be in
cache, unless you have a large % of 100 GB of RAM, when you do start to
make gains. (Please don't be picky about indexes, catalog, block size,
etc.) That clearly has absolutely nothing at all to do with the RAM of the
system on which it is running.

I think Jan has said this also in far fewer words, but I'll leave that to
Jan to agree/disagree...

I say this: ARC in 8.0 PostgreSQL allows us to sensibly allocate as large a
shared_buffers cache as is required by the database workload, and this
should not be constrained to a small percentage of server RAM.

Best Regards,

Simon Riggs

 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] Behalf Of Josh Berkus
 Sent: 08 October 2004 22:43
 To: [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]
 Subject: [PERFORM] First set of OSDL Shared Mem scalability results,
 some wierdness ...


 Folks,

 I'm hoping that some of you can shed some light on this.

 I've been trying to peg the sweet spot for shared memory using OSDL's
 equipment.   With Jan's new ARC patch, I was expecting that the desired
 amount of shared_buffers to be greatly increased.  This has not
 turned out to
 be the case.

 The first test series was using OSDL's DBT2 (OLTP) test, with 150
 warehouses.   All tests were run on a 4-way Pentium III 700mhz
 3.8GB RAM
 system hooked up to a rather high-end storage device (14
 spindles).Tests
 were on PostgreSQL 8.0b3, Linux 2.6.7.

 Here's a top-level summary:

 shared_buffers   % RAM   NOTPM20*
 1000             0.2%    1287
 23000            5%      1507
 46000            10%     1481
 69000            15%     1382
 92000            20%     1375
 115000           25%     1380
 138000           30%     1344

Re: [PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...

2004-10-14 Thread Josh Berkus
Simon,

lots of good stuff clipped

 If you draw a graph of speedup (y) against cache size as a 
 % of total database size, the graph looks like an upside-down L - i.e.
 the graph rises steeply as you give it more memory, then turns sharply at a
 particular point, after which it flattens out. The turning point is the
 sweet spot we all seek - the optimum amount of cache memory to allocate -
 but this spot depends upon the worklaod and database size, not on available
 RAM on the system under test.

Hmmm ... how do you explain, then, the "camel hump" nature of the real
performance?  That is, when we allocated even a few MB more than the
optimum ~190MB, overall performance started to drop quickly.  The result is
that allocating 2x optimum RAM is nearly as bad as allocating too little
(e.g. 8MB).

The only explanation I've heard of this so far is that there is a significant
loss of efficiency with larger caches.  Or do you think the loss of 200MB out
of 3500MB would actually affect the kernel cache that much?

Anyway, one test of your theory that I can run immediately is to run the exact 
same workload on a bigger, faster server and see if the desired quantity of 
shared_buffers is roughly the same.  I'm hoping that you're wrong -- not 
because I don't find your argument persuasive, but because if you're right it 
leaves us without any reasonable ability to recommend shared_buffers settings.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...

2004-10-14 Thread Christopher Browne
Quoth [EMAIL PROTECTED] (Simon Riggs):
 I say this: ARC in 8.0 PostgreSQL allows us to sensibly allocate as
 large a shared_buffers cache as is required by the database
 workload, and this should not be constrained to a small percentage
 of server RAM.

I don't think that this particularly follows from what ARC does.

What ARC does is to prevent certain conspicuous patterns of
sequential accesses from essentially trashing the contents of the
cache.

If a particular benchmark does not include conspicuous vacuums or
sequential scans on large tables, then there is little reason to
expect ARC to have a noticeable impact on performance.

It _could_ be that this implies that ARC allows you to get some use
out of a larger shared cache, as it won't get blown away by vacuums
and Seq Scans.  But it is _not_ obvious that this is a necessary
truth.

_Other_ truths we know about are:

 a) If you increase the shared cache, that means more data that is
represented in both the shared cache and the OS buffer cache,
which seems rather a waste;

 b) The larger the shared cache, the more pages there are for the
backend to rummage through before it looks to the filesystem,
and therefore the more expensive cache misses get.  Cache hits
get more expensive, too.  Searching through memory is not
costless.
-- 
(format nil [EMAIL PROTECTED] cbbrowne acm.org)
http://linuxfinances.info/info/linuxdistributions.html
The X-Files are too optimistic.  The truth is *not* out there...
-- Anthony Ord [EMAIL PROTECTED]



Re: [PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...

2004-10-14 Thread Tom Lane
Kevin Brown [EMAIL PROTECTED] writes:
 Hmm...something just occurred to me about this.

 Would a hybrid approach be possible?  That is, use mmap() to handle
 reads, and use write() to handle writes?

Nope.  Have you read the specs regarding mmap-vs-stdio synchronization?
Basically it says that there are no guarantees whatsoever if you try
this.  The SUS text is a bit weaselly (the application must ensure
correct synchronization) but the HPUX mmap man page, among others,
lays it on the line:

 It is also unspecified whether write references to a memory region
 mapped with MAP_SHARED are visible to processes reading the file and
 whether writes to a file are visible to processes that have mapped the
 modified portion of that file, except for the effect of msync().

It might work on particular OSes but I think depending on such behavior
would be folly...
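
To make the hazard concrete, the proposed hybrid boils down to something
like this (a sketch only; error handling is omitted and the file/block
layout is made up purely for illustration):

    /* Illustrative only: map the file for reads, use pwrite() for writes. */
    #include <sys/mman.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    static char *
    map_for_reads(int fd, size_t len)
    {
        /* readers would go through this mapping ... */
        return mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    }

    static void
    write_page(int fd, long blockno, const char *page)
    {
        /* ... while writers go through the file descriptor.  Per the text
         * quoted above, nothing guarantees that a later read through the
         * MAP_SHARED mapping will see these bytes, msync() aside. */
        pwrite(fd, page, BLCKSZ, (off_t) blockno * BLCKSZ);
    }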

regards, tom lane



Re: [PERFORM] First set of OSDL Shared Mem scalability results, some

2004-10-13 Thread Jan Wieck
On 10/9/2004 7:20 AM, Kevin Brown wrote:
> Christopher Browne wrote:
>> Increasing the number of cache buffers _is_ likely to lead to some
>> slowdowns:
>>
>>  - Data that passes through the cache also passes through kernel
>>    cache, so it's recorded twice, and read twice...
>
> Even worse, memory that's used for the PG cache is memory that's not
> available to the kernel's page cache.  Even if the overall memory

Which underlines my previous statement, that a PG shared cache much
larger than the frequently accessed data portion of the DB is
counterproductive. Double buffering (kernel disk buffer plus shared
buffer) only makes sense for data that would otherwise cause excessive
memory copies in and out of the shared buffer. After that, it only
lowers the memory available for disk buffers.

Jan

> usage in the system isn't enough to cause some paging to disk, most
> modern kernels will adjust the page/disk cache size dynamically to fit
> the memory demands of the system, which in this case means it'll be
> smaller if running programs need more memory for their own use.
>
> This is why I sometimes wonder whether or not it would be a win to use
> mmap() to access the data and index files -- doing so under a truly
> modern OS would surely at the very least save a buffer copy (from the
> page/disk cache to program memory) because the OS could instead
> directly map the buffer cache pages to the program's memory
> space.
>
> Since PG often has to have multiple files open at the same time, and
> in a production database many of those files will be rather large, PG
> would have to limit the size of the mmap()ed region on 32-bit
> platforms, which means that things like the order of mmap() operations
> to access various parts of the file can become just as important in
> the mmap()ed case as it is in the read()/write() case (if not more
> so!).  I would imagine that the use of mmap() on a 64-bit platform
> would be a much, much larger win because PG would most likely be able
> to mmap() entire files and let the OS work out how to order disk reads
> and writes.
>
> The biggest problem as I see it is that (I think) mmap() would have to
> be made to cooperate with malloc() for virtual address space.  I
> suspect issues like this have already been worked out by others,
> however...


--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #


Re: [PERFORM] First set of OSDL Shared Mem scalability results, some

2004-10-13 Thread Jan Wieck
On 10/14/2004 12:22 AM, Greg Stark wrote:
> Jan Wieck [EMAIL PROTECTED] writes:
>> Which would require that shared memory is not allowed to be swapped out, and
>> that is allowed in Linux by default IIRC, not to completely distort the entire
>> test.
>
> Well if it's getting swapped out then it's clearly not being used effectively.

Is it really that easy if 3 different cache algorithms (PG cache, kernel
buffers and swapping) are competing for the same chips?

Jan

> There are APIs to bar swapping out pages and the tests could be run without
> swap. I suggested it only as an experiment though, there are lots of details
> between here and having it be a good configuration for production use.

--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #


Re: [PERFORM] First set of OSDL Shared Mem scalability results, some

2004-10-13 Thread Jan Wieck
On 10/13/2004 11:52 PM, Greg Stark wrote:
> Jan Wieck [EMAIL PROTECTED] writes:
>> On 10/8/2004 10:10 PM, Christopher Browne wrote:
>>> [EMAIL PROTECTED] (Josh Berkus) wrote:
>>>> I've been trying to peg the sweet spot for shared memory using
>>>> OSDL's equipment.  With Jan's new ARC patch, I was expecting that
>>>> the desired amount of shared_buffers to be greatly increased.  This
>>>> has not turned out to be the case.
>>>
>>> That doesn't surprise me.
>>
>> Neither does it surprise me.
>
> There's been some speculation that having a large shared buffers be about 50%
> of your RAM is pessimal as it guarantees the OS cache is merely doubling up on
> all the buffers postgres is keeping. I wonder whether there's a second sweet
> spot where the postgres cache is closer to the total amount of RAM.

Which would require that shared memory is not allowed to be swapped out,
and that is allowed in Linux by default IIRC, not to completely distort
the entire test.

Jan
--
#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #


Re: [PERFORM] First set of OSDL Shared Mem scalability results, some

2004-10-13 Thread Greg Stark
Jan Wieck [EMAIL PROTECTED] writes:

 On 10/8/2004 10:10 PM, Christopher Browne wrote:
 
  [EMAIL PROTECTED] (Josh Berkus) wrote:
  I've been trying to peg the sweet spot for shared memory using
  OSDL's equipment.  With Jan's new ARC patch, I was expecting that
  the desired amount of shared_buffers to be greatly increased.  This
  has not turned out to be the case.
  That doesn't surprise me.
 
 Neither does it surprise me.

There's been some speculation that having a large shared buffers be about 50%
of your RAM is pessimal as it guarantees the OS cache is merely doubling up on
all the buffers postgres is keeping. I wonder whether there's a second sweet
spot where the postgres cache is closer to the total amount of RAM.

That configuration would have disadvantages for servers running other jobs
besides postgres. And I was led to believe earlier that postgres starts each
backend with a fairly fresh slate as far as the ARC algorithm goes, so it
wouldn't work well for a postgres server that had lots of short- to
moderate-life sessions.

But if it were even close it could be interesting. Reading the data with
O_DIRECT and having a single global cache could be interesting experiments. I
know there are arguments against each of these, but ...

I'm still pulling for an mmap approach to eliminate postgres's buffer cache
entirely in the long term, but it seems like slim odds now. But one way or the
other having two layers of buffering seems like a waste.

-- 
greg




Re: [PERFORM] First set of OSDL Shared Mem scalability results, some

2004-10-10 Thread Dennis Bjorklund
On Fri, 8 Oct 2004, Josh Berkus wrote:

 As you can see, the sweet spot appears to be between 5% and 10% of RAM, 
 which is if anything *lower* than recommendations for 7.4!   

What recommendation is that? Having shared buffers be about 10% of the
RAM sounds familiar to me. What was recommended for 7.4? In the past we
used to say that the worst value is 50%, since then the same things might
be cached both by pg and the OS disk cache.

Why do we expect the shared buffer size sweet spot to change because of
the new ARC stuff? And why would it make it better to have bigger shared
mem?

Wouldn't it be the opposite - that now we don't invalidate as much of the
cache for vacuums and seq. scans, so we can do as good caching as
before but with less shared buffers?

That said, testing and getting some numbers for good sizes of shared mem
is good.

-- 
/Dennis Björklund




Re: [PERFORM] First set of OSDL Shared Mem scalability results, some

2004-10-09 Thread Matthew
Christopher Browne wrote:
> [EMAIL PROTECTED] (Josh Berkus) wrote:
>> This result is so surprising that I want people to take a look at it
>> and tell me if there's something wrong with the tests or some
>> bottlenecking factor that I've not seen.
>
> I'm aware of two conspicuous scenarios where ARC would be expected to
> _substantially_ improve performance:
>
>  1.  When it allows a VACUUM not to throw useful data out of
>      the shared cache in that VACUUM now only 'chews' on one
>      page of the cache;

Right, Josh, I assume you didn't run these tests with pg_autovacuum
running, which might be interesting.

Also, how do these numbers compare to 7.4?  They may not be what you
expected, but they might still be an improvement.

Matthew


Re: [PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...

2004-10-09 Thread Kevin Brown
I wrote:
 That said, if it's typical for many changes to made to a page
 internally before PG needs to commit that page to disk, then your
 argument makes sense, and that's especially true if we simply cannot
 have the page written to disk in a partially-modified state (something
 I can easily see being an issue for the WAL -- would the same hold
 true of the index/data files?).

Also, even if multiple changes would be made to the page, with the
page being valid for a disk write only after all such changes are
made, the use of mmap() (in conjunction with an internal buffer that
would then be copied to the mmap()ed memory space at the appropriate
time) would potentially save a system call over the use of write()
(even if write() were used to write out multiple pages).  However,
there is so much lower-hanging fruit than this that an mmap()
implementation almost certainly isn't worth pursuing for this alone.

So: it seems to me that mmap() is worth pursuing only if most internal
buffers tend to be written to only once or if it's acceptable for a
partially modified data/index page to be written to disk (which I
suppose could be true for data/index pages in the face of a rock-solid
WAL).


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...

2004-10-09 Thread Tom Lane
Kevin Brown [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 mmap() is Right Out because it does not afford us sufficient control
 over when changes to the in-memory data will propagate to disk.

 ... that's especially true if we simply cannot
 have the page written to disk in a partially-modified state (something
 I can easily see being an issue for the WAL -- would the same hold
 true of the index/data files?).

You're almost there.  Remember the fundamental WAL rule: log entries
must hit disk before the data changes they describe.  That means that we
need not only a way of forcing changes to disk (fsync) but a way of
being sure that changes have *not* gone to disk yet.  In the existing
implementation we get that by just not issuing write() for a given page
until we know that the relevant WAL log entries are fsync'd down to
disk.  (BTW, this is what the LSN field on every page is for: it tells
the buffer manager the latest WAL offset that has to be flushed before
it can safely write the page.)

mmap provides msync which is comparable to fsync, but AFAICS it
provides no way to prevent an in-memory change from reaching disk too
soon.  This would mean that WAL entries would have to be written *and
flushed* before we could make the data change at all, which would
convert multiple updates of a single page into a series of write-and-
wait-for-WAL-fsync steps.  Not good.  fsync'ing WAL once per transaction
is bad enough, once per atomic action is intolerable.
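
To spell the ordering out in simplified form (a sketch of the rule only,
not the actual bufmgr code; error handling omitted):

    /* Sketch only: the WAL-before-data rule as applied to one page. */
    #include "postgres.h"
    #include "access/xlog.h"
    #include "storage/bufpage.h"
    #include <unistd.h>

    static void
    flush_one_page(int fd, off_t offset, char *page)
    {
        XLogRecPtr  page_lsn = PageGetLSN((Page) page);

        /* First force WAL out through the LSN stamped on the page ... */
        XLogFlush(page_lsn);

        /* ... and only then let the data page reach disk.  With mmap there
         * is no equivalent way to hold the page back: the kernel may write
         * the mapped page at any time, before the WAL flush has happened. */
        pwrite(fd, page, BLCKSZ, offset);
    }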

There is another reason for doing things this way.  Consider a backend
that goes haywire and scribbles all over shared memory before crashing.
When the postmaster sees the abnormal child termination, it forcibly
kills the other active backends and discards shared memory altogether.
This gives us fairly good odds that the crash did not affect any data on
disk.  It's not perfect of course, since another backend might have been
in process of issuing a write() when the disaster happens, but it's
pretty good; and I think that that isolation has a lot to do with PG's
good reputation for not corrupting data in crashes.  If we had a large
fraction of the address space mmap'd then this sort of crash would be
just about guaranteed to propagate corruption into the on-disk files.

regards, tom lane



Re: [PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...

2004-10-08 Thread Tom Lane
Josh Berkus [EMAIL PROTECTED] writes:
 Here's a top-level summary:

 shared_buffers   % RAM   NOTPM20*
 1000             0.2%    1287
 23000            5%      1507
 46000            10%     1481
 69000            15%     1382
 92000            20%     1375
 115000           25%     1380
 138000           30%     1344

 As you can see, the sweet spot appears to be between 5% and 10% of RAM, 
 which is if anything *lower* than recommendations for 7.4!   

This doesn't actually surprise me a lot.  There are a number of aspects
of Postgres that will get slower the more buffers there are.

One thing that I hadn't focused on till just now, which is a new
overhead in 8.0, is that StrategyDirtyBufferList() scans the *entire*
buffer list *every time it's called*, which is to say once per bgwriter
loop.  And to add insult to injury, it's doing that with the BufMgrLock
held (not that it's got any choice).

We could alleviate this by changing the API between this function and
BufferSync, such that StrategyDirtyBufferList can stop as soon as it's
found all the buffers that are going to be written in this bgwriter
cycle ... but AFAICS that means abandoning the bgwriter_percent knob
since you'd never really know how many dirty pages there were
altogether.
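
Schematically, the difference is between these two loops (illustrative code
only, not the actual freelist.c functions; buffer_is_dirty() is a made-up
helper):

    #include "postgres.h"
    #include "storage/lwlock.h"

    extern bool buffer_is_dirty(int buf_id);    /* hypothetical helper */

    /* Today: every bgwriter cycle walks the whole list under BufMgrLock. */
    static int
    collect_dirty_buffers_today(int *dirty, int nbuffers)
    {
        int n = 0;

        LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
        for (int i = 0; i < nbuffers; i++)      /* entire list, every time */
            if (buffer_is_dirty(i))
                dirty[n++] = i;
        LWLockRelease(BufMgrLock);
        return n;
    }

    /* The alternative: stop once we have as many buffers as this cycle
     * will write -- but then we never learn the total number of dirty
     * pages, which is what bgwriter_percent is defined against. */
    static int
    collect_dirty_buffers_capped(int *dirty, int nbuffers, int max_to_write)
    {
        int n = 0;

        LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
        for (int i = 0; i < nbuffers && n < max_to_write; i++)
            if (buffer_is_dirty(i))
                dirty[n++] = i;
        LWLockRelease(BufMgrLock);
        return n;
    }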

BTW, what is the actual size of the test database (disk footprint wise)
and how much of that do you think is heavily accessed during the run?
It's possible that the test conditions are such that adjusting
shared_buffers isn't going to mean anything anyway.

regards, tom lane



Re: [PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...

2004-10-08 Thread Josh Berkus
Tom,

 BTW, what is the actual size of the test database (disk footprint wise)
 and how much of that do you think is heavily accessed during the run?
 It's possible that the test conditions are such that adjusting
 shared_buffers isn't going to mean anything anyway.

The raw data is 32GB, but a lot of the activity is incremental, that is,
inserts and updates to recent inserts.  Still, according to Mark, most of
the data does get queried in the course of filling orders.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...

2004-10-08 Thread Christopher Browne
[EMAIL PROTECTED] (Josh Berkus) wrote:
 I've been trying to peg the sweet spot for shared memory using
 OSDL's equipment.  With Jan's new ARC patch, I was expecting that
 the desired amount of shared_buffers to be greatly increased.  This
 has not turned out to be the case.

That doesn't surprise me.

My primary expectation would be that ARC would be able to make small
buffers much more effective alongside vacuums and seq scans than they
used to be.  That does not establish anything about the value of
increasing the size of the buffer cache...

 This result is so surprising that I want people to take a look at it
 and tell me if there's something wrong with the tests or some
 bottlenecking factor that I've not seen.

I'm aware of two conspicuous scenarios where ARC would be expected to
_substantially_ improve performance:

 1.  When it allows a VACUUM not to throw useful data out of 
 the shared cache in that VACUUM now only 'chews' on one
 page of the cache;

 2.  When it allows a Seq Scan to not push useful data out of
 the shared cache, for much the same reason.

I don't imagine either scenario are prominent in the OSDL tests.

Increasing the number of cache buffers _is_ likely to lead to some
slowdowns:

 - Data that passes through the cache also passes through kernel
   cache, so it's recorded twice, and read twice...

 - The more cache pages there are, the more work is needed for
   PostgreSQL to manage them.  That will notably happen anywhere
   that there is a need to scan the cache.

 - If there are any inefficiencies in how the OS kernel manages shared
   memory, as their size scales, well, that will obviously cause a
   slowdown.
-- 
If this was helpful, http://svcs.affero.net/rm.php?r=cbbrowne rate me
http://www.ntlug.org/~cbbrowne/internet.html
One World. One Web. One Program.   -- MICROS~1 hype
Ein Volk, ein Reich, ein Fuehrer   -- Nazi hype
(One people, one country, one leader)
