Re: [PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...

2004-10-23 Thread Tom Lane
Curt Sampson [EMAIL PROTECTED] writes:
 Back when I was working out how to do this, I reckoned that you could
 use mmap by keeping a write queue for each modified page. Reading,
 you'd have to read the datum from the page and then check the write
 queue for that page to see if that datum had been updated, using the
 new value if it's there. Writing, you'd add the modified datum to the
 write queue, but not apply the write queue to the page until you'd had
 confirmation that the corresponding transaction log entry had been
 written. So multiple writes are no big deal; they just all queue up in
 the write queue, and at any time you can apply as much of the write
 queue to the page itself as the current log entry will allow.

Seems to me the overhead of any such scheme would swamp the savings from
avoiding kernel/userspace copies ... the locking issues alone would be
painful.

regards, tom lane
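
For readers trying to picture the scheme Curt describes, here is a minimal
sketch of such a per-page write queue.  Every type and function name below is
hypothetical (none of this is PostgreSQL code), and it only illustrates the
bookkeeping involved -- which is also where Tom's locking concern shows up:
every read has to take the queue's lock and walk it.

/*
 * Hypothetical sketch of a per-page write queue.  Assumes a datum fits
 * in 64 bytes; real code would need variable-length entries.
 */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>

typedef uint64_t XLogOffset;            /* position in the transaction log */

typedef struct PendingWrite {
    size_t      offset;                 /* byte offset within the page */
    size_t      len;
    char        data[64];               /* the not-yet-applied datum */
    XLogOffset  required_log_pos;       /* WAL must be flushed to here first */
    struct PendingWrite *next;
} PendingWrite;

typedef struct PageWriteQueue {
    pthread_mutex_t lock;               /* every reader and writer takes this */
    PendingWrite   *head;               /* oldest queued change first */
} PageWriteQueue;

/* Writing: never touch the mapped page directly; just queue the new value. */
static void
queue_write(PageWriteQueue *q, size_t offset, const char *data, size_t len,
            XLogOffset required_log_pos)
{
    PendingWrite *pw = malloc(sizeof(PendingWrite));

    pw->offset = offset;
    pw->len = len;
    memcpy(pw->data, data, len);
    pw->required_log_pos = required_log_pos;
    pw->next = NULL;

    pthread_mutex_lock(&q->lock);
    PendingWrite **tail = &q->head;     /* append, preserving log order */
    while (*tail != NULL)
        tail = &(*tail)->next;
    *tail = pw;
    pthread_mutex_unlock(&q->lock);
}

/* Reading: take the value from the page, then let queued changes override it. */
static void
read_datum(const char *mapped_page, PageWriteQueue *q,
           size_t offset, size_t len, char *out)
{
    memcpy(out, mapped_page + offset, len);

    pthread_mutex_lock(&q->lock);
    for (PendingWrite *pw = q->head; pw != NULL; pw = pw->next)
    {
        if (pw->offset == offset && pw->len == len)
            memcpy(out, pw->data, len); /* a newer queued value wins */
    }
    pthread_mutex_unlock(&q->lock);
}

/* Applying: copy queued changes into the mmap'd page, but only those whose
 * WAL records are already safely flushed. */
static void
apply_queue(char *mapped_page, PageWriteQueue *q, XLogOffset flushed_to)
{
    pthread_mutex_lock(&q->lock);
    while (q->head != NULL && q->head->required_log_pos <= flushed_to)
    {
        PendingWrite *pw = q->head;

        memcpy(mapped_page + pw->offset, pw->data, pw->len);
        q->head = pw->next;
        free(pw);
    }
    pthread_mutex_unlock(&q->lock);
}

Even in this toy form, the overhead Tom is worried about is visible: a mutex
acquisition plus a list walk on every datum fetch, before any savings from
avoiding the kernel/userspace copy.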



Re: [PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...

2004-10-15 Thread Kevin Brown
Tom Lane wrote:
 Kevin Brown [EMAIL PROTECTED] writes:
  Hmm...something just occurred to me about this.
 
  Would a hybrid approach be possible?  That is, use mmap() to handle
  reads, and use write() to handle writes?
 
 Nope.  Have you read the specs regarding mmap-vs-stdio synchronization?
 Basically it says that there are no guarantees whatsoever if you try
 this.  The SUS text is a bit weaselly ("the application must ensure
 correct synchronization") but the HPUX mmap man page, among others,
 lays it on the line:
 
  It is also unspecified whether write references to a memory region
  mapped with MAP_SHARED are visible to processes reading the file and
  whether writes to a file are visible to processes that have mapped the
  modified portion of that file, except for the effect of msync().
 
 It might work on particular OSes but I think depending on such behavior
 would be folly...

Yeah, and at this point it can't be considered portable in any real
way because of this.  Thanks for the perspective.  I should have
expected the general specification to be quite broken in this regard,
not to mention certain implementations.  :-)

Good thing there's a lot of lower-hanging fruit than this...



-- 
Kevin Brown   [EMAIL PROTECTED]
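
For concreteness, here is a toy of the hybrid Kevin floated -- reads served
straight from a MAP_SHARED mapping, writes pushed out with pwrite().  The file
name and page size are made up, and the comments mark exactly the step whose
behaviour the quoted standard leaves unspecified.

/*
 * Toy hybrid: read through the mapping, write through pwrite().
 * Assumes "datafile" exists and is at least one 8K page long.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE_SZ 8192

int
main(void)
{
    int fd = open("datafile", O_RDWR);          /* hypothetical data file */
    if (fd < 0) { perror("open"); return 1; }

    off_t len = lseek(fd, 0, SEEK_END);
    char *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    /* "Read path": look at the mapping directly, no read() system call. */
    char first_byte = map[0];
    printf("first byte: %d\n", first_byte);

    /* "Write path": modify a private copy, push it out with pwrite(). */
    char page[PAGE_SZ];
    memcpy(page, map, PAGE_SZ);
    page[0] = first_byte + 1;
    if (pwrite(fd, page, PAGE_SZ, 0) != PAGE_SZ) { perror("pwrite"); return 1; }

    /*
     * Whether map[0] now shows the new value is exactly what SUS leaves
     * unspecified: some kernels keep the page cache and mappings coherent,
     * others need an msync() before the mapping catches up.
     */
    printf("mapping now sees: %d\n", map[0]);

    munmap(map, len);
    close(fd);
    return 0;
}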



Re: [PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...

2004-10-15 Thread Sean Chittenden
this.  The SUS text is a bit weaselly ("the application must ensure
correct synchronization") but the HPUX mmap man page, among others,
lays it on the line:
 It is also unspecified whether write references to a memory region
 mapped with MAP_SHARED are visible to processes reading the file and
 whether writes to a file are visible to processes that have mapped the
 modified portion of that file, except for the effect of msync().

It might work on particular OSes but I think depending on such behavior
would be folly...
Agreed.  Only OSes with a coherent file system buffer cache should ever 
use mmap(2).  In order for this to work on HPUX, msync(2) would need to 
be used.  -sc

--
Sean Chittenden
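
For reference, the msync(2) calls Sean is referring to look like this -- both
flags are standard POSIX, and addr/length are whatever region was mapped
(error handling kept minimal for the sketch):

#include <stddef.h>
#include <sys/mman.h>

/* After writing through a MAP_SHARED mapping: force the pages to the file
 * so that other readers (and the disk) see the change. */
static int
flush_mapping(void *addr, size_t length)
{
    return msync(addr, length, MS_SYNC);
}

/* After the file has been changed behind the mapping's back (e.g. by a
 * write() elsewhere): drop stale cached copies so the next reference
 * refetches data consistent with the file. */
static int
refresh_mapping(void *addr, size_t length)
{
    return msync(addr, length, MS_INVALIDATE);
}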


Re: [PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...

2004-10-14 Thread Simon Riggs

First off, I'd like to get involved with these tests - only the pressure of
other work has prevented me.

Here's my take on the results so far:

I think taking the ratio of the memory allocated to shared_buffers against
the total memory available on the server is completely fallacious. That is
why the results cannot be explained - IMHO the ratio has no real theoretical
basis.

The important ratio for me is the amount of shared_buffers against the total
size of the database in the benchmark test. Every database workload has a
differing percentage of the total database size that represents the working
set, or the memory that can be beneficially cached. For the tests that
DBT-2 is performing, I say that there are only so many blocks that are worth
the trouble of caching. If you cache more than this, you are wasting your time.

For me, these tests don't show that there is a sweet spot that you should
set your shared_buffers to, only that for that specific test, you have
located the correct size for shared_buffers. It would be an incorrect
inference to interpret this as the percentage of available RAM where the
sweet spot lies for all workloads.

The theoretical basis for my comments is this: DBT-2 is essentially a static
workload. That means that, for a long test, we can work out with reasonable
certainty the probability that a block will be requested, for every single
block in the database. Given a particular size of cache, you can work out
what your overall cache hit ratio is and therefore what your speedup is
compared with retrieving every single block from disk (the "no cache"
scenario). If you draw a graph of speedup (y) against cache size as a % of
total database size, the graph looks like an upside-down "L" - i.e. the
graph rises steeply as you give it more memory, then turns sharply at a
particular point, after which it flattens out. The turning point is the
sweet spot we all seek - the optimum amount of cache memory to allocate -
but this spot depends upon the workload and database size, not on available
RAM on the system under test.

Clearly, the presence of the OS disk cache complicates this. Since we have
two caches both allocated from the same pot of memory, it should be clear
that if we overallocate one cache beyond its optimum effectiveness, while
the second cache is still in its "more is better" stage, then we will get
reduced performance. That seems to be the case here. I wouldn't accept that
a fixed ratio between the two caches exists for ALL, or even the majority of,
workloads - though clearly broad-brush workloads such as OLTP and Data
Warehousing do have similar-ish requirements.

Let's look at an example:
An application with two tables: SmallTab has 10,000 rows of 100 bytes each
(so the table is ~1 MB) - one row per photo in a photo gallery web site.
LargeTab holds large objects and has 10,000 photos, average size 10 MB (so
the table is ~100 GB). Assuming all photos are requested randomly, you can
see that the optimum cache size for this workload is 1 MB of RAM and 100 GB
of disk. Upping the cache doesn't have much effect on the probability that a
photo (from LargeTab) will be in cache, unless you have RAM amounting to a
large fraction of 100 GB, at which point you do start to make gains. (Please
don't be picky about indexes, catalog, block size etc.) That clearly has
absolutely nothing at all to do with the RAM of the system on which it is
running.
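
To make the arithmetic behind that example explicit, here is a
back-of-envelope version of it.  The access times are invented round numbers;
only the shape of the result matters:

/*
 * Rough model of the photo-gallery example: SmallTab always fits in
 * cache, LargeTab (~100 GB) is hit uniformly at random, so the cache
 * hit probability is simply the cached fraction of LargeTab.
 */
#include <stdio.h>

int
main(void)
{
    const double small_tab_mb = 1.0;
    const double large_tab_mb = 100.0 * 1024.0;      /* ~100 GB */
    const double t_hit_ms  = 0.01;                   /* invented cache fetch time */
    const double t_miss_ms = 8.0;                    /* invented disk fetch time */
    const double cache_mb[] = {1, 256, 1024, 10240, 51200, 102400};
    const int    ncases = (int) (sizeof(cache_mb) / sizeof(cache_mb[0]));

    for (int i = 0; i < ncases; i++)
    {
        double for_large = cache_mb[i] > small_tab_mb
                         ? cache_mb[i] - small_tab_mb : 0.0;
        double hit = for_large / large_tab_mb;       /* uniform random access */
        if (hit > 1.0)
            hit = 1.0;
        double cost = hit * t_hit_ms + (1.0 - hit) * t_miss_ms;

        printf("cache %8.0f MB   P(hit on LargeTab) = %5.3f   avg fetch = %.2f ms\n",
               cache_mb[i], hit, cost);
    }
    return 0;
}

Run it and you see the point: nothing improves noticeably until the cache
approaches the size of the working set (here, essentially all of LargeTab),
and the total RAM of the machine never enters the calculation at all.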

I think Jan has said this also in far fewer words, but I'll leave that to
Jan to agree/disagree...

I say this: ARC in 8.0 PostgreSQL allows us to sensibly allocate as large a
shared_buffers cache as is required by the database workload, and this
should not be constrained to a small percentage of server RAM.

Best Regards,

Simon Riggs

 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] Behalf Of Josh Berkus
 Sent: 08 October 2004 22:43
 To: [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]
 Subject: [PERFORM] First set of OSDL Shared Mem scalability results,
 some wierdness ...


 Folks,

 I'm hoping that some of you can shed some light on this.

 I've been trying to peg the sweet spot for shared memory using OSDL's
 equipment.   With Jan's new ARC patch, I was expecting that the desired
 amount of shared_buffers would be greatly increased.  This has not turned
 out to be the case.

 The first test series was using OSDL's DBT2 (OLTP) test, with 150
 warehouses.   All tests were run on a 4-way Pentium III 700MHz, 3.8GB RAM
 system hooked up to a rather high-end storage device (14 spindles).  Tests
 were on PostgreSQL 8.0b3, Linux 2.6.7.

 Here's a top-level summary:

 shared_buffers   % RAM   NOTPM20*
 1000             0.2%    1287
 23000            5%      1507
 46000            10%     1481
 69000            15%     1382
 92000            20%     1375
 115000           25%     1380
 138000           30%     1344

Re: [PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...

2004-10-14 Thread Josh Berkus
Simon,

lots of good stuff clipped

 If you draw a graph of speedup (y) against cache size as a
 % of total database size, the graph looks like an upside-down "L" - i.e.
 the graph rises steeply as you give it more memory, then turns sharply at a
 particular point, after which it flattens out. The turning point is the
 sweet spot we all seek - the optimum amount of cache memory to allocate -
 but this spot depends upon the workload and database size, not on available
 RAM on the system under test.

Hmmm ... how do you explain, then, the "camel hump" nature of the real 
performance?   That is, when we allocated even a few MB more than the 
optimum ~190MB, overall performance started to drop quickly.   The result is 
that allocating 2x the optimum RAM is nearly as bad as allocating too little 
(e.g. 8MB).  

The only explanation I've heard of this so far is that there is a significant 
loss of efficiency with larger caches.  Or do you think the loss of 200MB out 
of 3500MB would actually affect the kernel cache that much?

Anyway, one test of your theory that I can run immediately is to run the exact 
same workload on a bigger, faster server and see if the desired quantity of 
shared_buffers is roughly the same.  I'm hoping that you're wrong -- not 
because I don't find your argument persuasive, but because if you're right it 
leaves us without any reasonable ability to recommend shared_buffer settings.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...

2004-10-14 Thread Christopher Browne
Quoth [EMAIL PROTECTED] (Simon Riggs):
 I say this: ARC in 8.0 PostgreSQL allows us to sensibly allocate as
 large a shared_buffers cache as is required by the database
 workload, and this should not be constrained to a small percentage
 of server RAM.

I don't think that this particularly follows from what ARC does.

What ARC does is to prevent certain conspicuous patterns of
sequential accesses from essentially trashing the contents of the
cache.

If a particular benchmark does not include conspicuous vacuums or
sequential scans on large tables, then there is little reason to
expect ARC to have a noticeable impact on performance.

It _could_ be that this implies that ARC allows you to get some use
out of a larger shared cache, as it won't get blown away by vacuums
and Seq Scans.  But it is _not_ obvious that this is a necessary
truth.

_Other_ truths we know about are:

 a) If you increase the shared cache, that means more data that is
represented in both the shared cache and the OS buffer cache,
which seems rather a waste;

 b) The larger the shared cache, the more pages there are for the
backend to rummage through before it looks to the filesystem,
and therefore the more expensive cache misses get.  Cache hits
get more expensive, too.  Searching through memory is not
costless.
-- 
(format nil [EMAIL PROTECTED] cbbrowne acm.org)
http://linuxfinances.info/info/linuxdistributions.html
The X-Files are too optimistic.  The truth is *not* out there...
-- Anthony Ord [EMAIL PROTECTED]



Re: [PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...

2004-10-14 Thread Tom Lane
Kevin Brown [EMAIL PROTECTED] writes:
 Hmm...something just occurred to me about this.

 Would a hybrid approach be possible?  That is, use mmap() to handle
 reads, and use write() to handle writes?

Nope.  Have you read the specs regarding mmap-vs-stdio synchronization?
Basically it says that there are no guarantees whatsoever if you try
this.  The SUS text is a bit weaselly ("the application must ensure
correct synchronization") but the HPUX mmap man page, among others,
lays it on the line:

 It is also unspecified whether write references to a memory region
 mapped with MAP_SHARED are visible to processes reading the file and
 whether writes to a file are visible to processes that have mapped the
 modified portion of that file, except for the effect of msync().

It might work on particular OSes but I think depending on such behavior
would be folly...

regards, tom lane



Re: [PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...

2004-10-09 Thread Kevin Brown
I wrote:
 That said, if it's typical for many changes to be made to a page
 internally before PG needs to commit that page to disk, then your
 argument makes sense, and that's especially true if we simply cannot
 have the page written to disk in a partially-modified state (something
 I can easily see being an issue for the WAL -- would the same hold
 true of the index/data files?).

Also, even if multiple changes would be made to the page, with the
page being valid for a disk write only after all such changes are
made, the use of mmap() (in conjunction with an internal buffer that
would then be copied to the mmap()ed memory space at the appropriate
time) would potentially save a system call over the use of write()
(even if write() were used to write out multiple pages).  However,
there is so much lower-hanging fruit than this that an mmap()
implementation almost certainly isn't worth pursuing for this alone.
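
A sketch of where the saved system call would be -- the names are invented
and this is not PostgreSQL code, just the two publication paths side by side:

#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SZ 8192

/* write()-based path: publishing the finished page costs a system call. */
static int
publish_with_write(int fd, off_t pageno, const char *private_buf)
{
    return pwrite(fd, private_buf, PAGE_SZ, pageno * PAGE_SZ) == PAGE_SZ ? 0 : -1;
}

/* mmap-based path: publishing is a plain memcpy into the MAP_SHARED
 * mapping -- no system call, but also no control over when the kernel
 * writes the page back, which is the problem discussed elsewhere in
 * this thread. */
static void
publish_with_mmap(char *mapped_file, off_t pageno, const char *private_buf)
{
    memcpy(mapped_file + pageno * PAGE_SZ, private_buf, PAGE_SZ);
}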

So: it seems to me that mmap() is worth pursuing only if most internal
buffers tend to be written to only once or if it's acceptable for a
partially modified data/index page to be written to disk (which I
suppose could be true for data/index pages in the face of a rock-solid
WAL).


-- 
Kevin Brown   [EMAIL PROTECTED]



Re: [PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...

2004-10-09 Thread Tom Lane
Kevin Brown [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 mmap() is Right Out because it does not afford us sufficient control
 over when changes to the in-memory data will propagate to disk.

 ... that's especially true if we simply cannot
 have the page written to disk in a partially-modified state (something
 I can easily see being an issue for the WAL -- would the same hold
 true of the index/data files?).

You're almost there.  Remember the fundamental WAL rule: log entries
must hit disk before the data changes they describe.  That means that we
need not only a way of forcing changes to disk (fsync) but a way of
being sure that changes have *not* gone to disk yet.  In the existing
implementation we get that by just not issuing write() for a given page
until we know that the relevant WAL log entries are fsync'd down to
disk.  (BTW, this is what the LSN field on every page is for: it tells
the buffer manager the latest WAL offset that has to be flushed before
it can safely write the page.)
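
A simplified rendering of that rule, with made-up stand-ins for the WAL and
storage routines (it is not the actual buffer manager code, just the ordering
it enforces):

#include <stdio.h>

typedef unsigned long long XLogRecPtr;   /* stand-in for the real LSN type */

static XLogRecPtr wal_flushed_upto = 0;  /* how far the log is known to be on disk */

static void
FlushWALUpTo(XLogRecPtr ptr)             /* hypothetical: fsync WAL through ptr */
{
    wal_flushed_upto = ptr;
}

static void
WritePageToDisk(const char *page)        /* hypothetical: write() the data page */
{
    (void) page;
}

/* The rule: before a dirty page may hit disk, the WAL must be flushed at
 * least up to the LSN stamped on that page. */
static void
flush_buffer(const char *page, XLogRecPtr page_lsn)
{
    if (page_lsn > wal_flushed_upto)     /* WAL not far enough along yet? */
        FlushWALUpTo(page_lsn);          /* ... then flush it first */

    WritePageToDisk(page);               /* only now is the write safe */
}

int
main(void)
{
    char page[8192] = {0};

    flush_buffer(page, 42);              /* toy LSN */
    printf("WAL flushed through %llu before the page was written\n",
           wal_flushed_upto);
    return 0;
}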

mmap provides msync which is comparable to fsync, but AFAICS it
provides no way to prevent an in-memory change from reaching disk too
soon.  This would mean that WAL entries would have to be written *and
flushed* before we could make the data change at all, which would
convert multiple updates of a single page into a series of write-and-
wait-for-WAL-fsync steps.  Not good.  fsync'ing WAL once per transaction
is bad enough, once per atomic action is intolerable.

There is another reason for doing things this way.  Consider a backend
that goes haywire and scribbles all over shared memory before crashing.
When the postmaster sees the abnormal child termination, it forcibly
kills the other active backends and discards shared memory altogether.
This gives us fairly good odds that the crash did not affect any data on
disk.  It's not perfect of course, since another backend might have been
in the process of issuing a write() when the disaster happens, but it's
pretty good; and I think that that isolation has a lot to do with PG's
good reputation for not corrupting data in crashes.  If we had a large
fraction of the address space mmap'd then this sort of crash would be
just about guaranteed to propagate corruption into the on-disk files.

regards, tom lane



[PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...

2004-10-08 Thread Josh Berkus
Folks,

I'm hoping that some of you can shed some light on this.

I've been trying to peg the sweet spot for shared memory using OSDL's 
equipment.   With Jan's new ARC patch, I was expecting that the desired 
amount of shared_buffers would be greatly increased.  This has not turned out to 
be the case.

The first test series was using OSDL's DBT2 (OLTP) test, with 150 
warehouses.   All tests were run on a 4-way Pentium III 700MHz, 3.8GB RAM 
system hooked up to a rather high-end storage device (14 spindles).  Tests 
were on PostgreSQL 8.0b3, Linux 2.6.7.

Here's a top-level summary:

shared_buffers  % RAM   NOTPM20*
1000            0.2%    1287
23000           5%      1507
46000           10%     1481
69000           15%     1382
92000           20%     1375
115000          25%     1380
138000          30%     1344

* = New Order Transactions Per Minute, last 20 Minutes
 Higher is better.  The maximum possible is 1800.

As you can see, the sweet spot appears to be between 5% and 10% of RAM, 
which is if anything *lower* than recommendations for 7.4!   

This result is so surprising that I want people to take a look at it and tell 
me if there's something wrong with the tests or some bottlenecking factor 
that I've not seen.

in order above:
http://khack.osdl.org/stp/297959/
http://khack.osdl.org/stp/297960/
http://khack.osdl.org/stp/297961/
http://khack.osdl.org/stp/297962/
http://khack.osdl.org/stp/297963/
http://khack.osdl.org/stp/297964/
http://khack.osdl.org/stp/297965/

Please note that many of the graphs in these reports are broken.  For one 
thing, some aren't recorded (flat lines) and the CPU usage graph has 
mislabeled lines.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco




Re: [PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...

2004-10-08 Thread Tom Lane
Josh Berkus [EMAIL PROTECTED] writes:
 Here's a top-level summary:

 shared_buffers  % RAM   NOTPM20*
 1000            0.2%    1287
 23000           5%      1507
 46000           10%     1481
 69000           15%     1382
 92000           20%     1375
 115000          25%     1380
 138000          30%     1344

 As you can see, the sweet spot appears to be between 5% and 10% of RAM, 
 which is if anything *lower* than recommendations for 7.4!   

This doesn't actually surprise me a lot.  There are a number of aspects
of Postgres that will get slower the more buffers there are.

One thing that I hadn't focused on till just now, which is a new
overhead in 8.0, is that StrategyDirtyBufferList() scans the *entire*
buffer list *every time it's called*, which is to say once per bgwriter
loop.  And to add insult to injury, it's doing that with the BufMgrLock
held (not that it's got any choice).

We could alleviate this by changing the API between this function and
BufferSync, such that StrategyDirtyBufferList can stop as soon as it's
found all the buffers that are going to be written in this bgwriter
cycle ... but AFAICS that means abandoning the bgwriter_percent knob
since you'd never really know how many dirty pages there were
altogether.
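
Schematically, the proposed change looks like this -- StrategyDirtyBufferList
and bgwriter_percent are the real names discussed above, but the code below is
an invented stand-in, not 8.0 source:

#include <stdio.h>

typedef struct BufferDescStub { int dirty; } BufferDescStub;

static BufferDescStub pool[100000];      /* stand-in for the shared buffer pool */

static int
collect_dirty_buffers(const BufferDescStub *buffers, int nbuffers,
                      int *dirty_out, int max_to_write)
{
    int found = 0;

    for (int i = 0; i < nbuffers; i++)   /* today: every entry, under the lock */
    {
        if (!buffers[i].dirty)
            continue;

        dirty_out[found++] = i;

        /*
         * Proposed early exit: enough buffers for this bgwriter cycle.
         * It avoids walking the rest of the pool, but we can no longer
         * report what fraction of the whole pool is dirty -- which is
         * what bgwriter_percent needs.
         */
        if (found >= max_to_write)
            break;
    }
    return found;
}

int
main(void)
{
    int dirty[32];

    pool[3].dirty = pool[7].dirty = pool[90000].dirty = 1;

    int n = collect_dirty_buffers(pool, 100000, dirty, 32);
    printf("stopped after finding %d dirty buffers\n", n);
    return 0;
}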

BTW, what is the actual size of the test database (disk footprint wise)
and how much of that do you think is heavily accessed during the run?
It's possible that the test conditions are such that adjusting
shared_buffers isn't going to mean anything anyway.

regards, tom lane



Re: [PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...

2004-10-08 Thread Josh Berkus
Tom,

 BTW, what is the actual size of the test database (disk footprint wise)
 and how much of that do you think is heavily accessed during the run?
 It's possible that the test conditions are such that adjusting
 shared_buffers isn't going to mean anything anyway.

The raw data is 32GB, but a lot of the activity is incremental, that is, 
inserts and updates to recent inserts.  Still, according to Mark, most of 
the data does get queried in the course of filling orders.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] First set of OSDL Shared Mem scalability results, some wierdness ...

2004-10-08 Thread Christopher Browne
[EMAIL PROTECTED] (Josh Berkus) wrote:
 I've been trying to peg the sweet spot for shared memory using
 OSDL's equipment.  With Jan's new ARC patch, I was expecting that
 the desired amount of shared_buffers would be greatly increased.  This
 has not turned out to be the case.

That doesn't surprise me.

My primary expectation would be that ARC would be able to make small
buffers much more effective alongside vacuums and seq scans than they
used to be.  That does not establish anything about the value of
increasing the size of buffer caches...

 This result is so surprising that I want people to take a look at it
 and tell me if there's something wrong with the tests or some
 bottlenecking factor that I've not seen.

I'm aware of two conspicuous scenarios where ARC would be expected to
_substantially_ improve performance:

 1.  When it allows a VACUUM not to throw useful data out of 
 the shared cache in that VACUUM now only 'chews' on one
 page of the cache;

 2.  When it allows a Seq Scan to not push useful data out of
 the shared cache, for much the same reason.

I don't imagine either scenario is prominent in the OSDL tests.

Increasing the number of cache buffers _is_ likely to lead to some
slowdowns:

 - Data that passes through the cache also passes through kernel
   cache, so it's recorded twice, and read twice...

 - The more cache pages there are, the more work is needed for
   PostgreSQL to manage them.  That will notably happen anywhere
   that there is a need to scan the cache.

 - If there are any inefficiencies in how the OS kernel manages shared
   memory, as their size scales, well, that will obviously cause a
   slowdown.
-- 
If this was helpful, http://svcs.affero.net/rm.php?r=cbbrowne rate me
http://www.ntlug.org/~cbbrowne/internet.html
One World. One Web. One Program.   -- MICROS~1 hype
Ein Volk, ein Reich, ein Fuehrer   -- Nazi hype
(One people, one country, one leader)
