Re: [PERFORM] First set of OSDL Shared Mem scalability results, some weirdness ...
Curt Sampson <[EMAIL PROTECTED]> writes:
> Back when I was working out how to do this, I reckoned that you could
> use mmap by keeping a write queue for each modified page. Reading,
> you'd have to read the datum from the page and then check the write
> queue for that page to see if that datum had been updated, using the
> new value if it's there. Writing, you'd add the modified datum to the
> write queue, but not apply the write queue to the page until you'd had
> confirmation that the corresponding transaction log entry had been
> written. So multiple writes are no big deal; they just all queue up in
> the write queue, and at any time you can apply as much of the write
> queue to the page itself as the current log entry will allow.

Seems to me the overhead of any such scheme would swamp the savings from avoiding kernel/userspace copies ... the locking issues alone would be painful.

regards, tom lane
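For concreteness, here is a minimal sketch of the per-page write queue Curt describes -- all names are invented for illustration, and none of this is PostgreSQL code. Reads overlay any still-pending writes on top of the mapped page; queued writes are applied to the page only once the WAL position covering them is known to be flushed (the queue is kept in LSN order, oldest first).

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint64_t Lsn;               /* WAL position (hypothetical type) */

typedef struct QueuedWrite {
    size_t  offset;                 /* byte offset within the page      */
    size_t  len;
    Lsn     lsn;                    /* WAL record covering this write   */
    char    data[64];               /* payload (fixed cap for the sketch) */
    struct QueuedWrite *next;       /* kept in LSN order, oldest first  */
} QueuedWrite;

typedef struct PageQueue {
    char        *page;              /* the mmap()ed page itself         */
    QueuedWrite *head;              /* oldest pending write first       */
} PageQueue;

/* Read: copy from the page, then overlay any still-pending writes. */
void page_read(PageQueue *pq, size_t off, size_t len, char *out)
{
    memcpy(out, pq->page + off, len);
    for (QueuedWrite *w = pq->head; w != NULL; w = w->next)
        if (w->offset < off + len && off < w->offset + w->len) {
            /* an overlapping queued write wins over the page contents */
            size_t start = (w->offset > off) ? w->offset : off;
            size_t end   = (w->offset + w->len < off + len)
                             ? w->offset + w->len : off + len;
            memcpy(out + (start - off), w->data + (start - w->offset),
                   end - start);
        }
}

/* Apply queued writes whose WAL records are already durable. */
void page_apply(PageQueue *pq, Lsn flushed_lsn)
{
    while (pq->head && pq->head->lsn <= flushed_lsn) {
        QueuedWrite *w = pq->head;
        memcpy(pq->page + w->offset, w->data, w->len);
        pq->head = w->next;
        free(w);
    }
}

Tom's objection maps directly onto the sketch: every read now walks a list before it can trust the page, and both paths would need locking against concurrent access.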
Re: [PERFORM] First set of OSDL Shared Mem scalability results, some weirdness ...
Tom Lane wrote:
> Kevin Brown <[EMAIL PROTECTED]> writes:
> > Hmm...something just occurred to me about this. Would a hybrid
> > approach be possible? That is, use mmap() to handle reads, and use
> > write() to handle writes?
>
> Nope. Have you read the specs regarding mmap-vs-stdio synchronization?
> Basically it says that there are no guarantees whatsoever if you try
> this. The SUS text is a bit weaselly ("the application must ensure
> correct synchronization") but the HPUX mmap man page, among others,
> lays it on the line:
>
>     It is also unspecified whether write references to a memory region
>     mapped with MAP_SHARED are visible to processes reading the file
>     and whether writes to a file are visible to processes that have
>     mapped the modified portion of that file, except for the effect of
>     msync().
>
> It might work on particular OSes but I think depending on such
> behavior would be folly...

Yeah, and at this point it can't be considered portable in any real way because of this. Thanks for the perspective. I should have expected the general specification to be quite broken in this regard, not to mention certain implementations. :-)

Good thing there's a lot of lower-hanging fruit than this...

-- Kevin Brown <[EMAIL PROTECTED]>
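To make the hybrid idea concrete: a sketch in plain POSIX C (error handling omitted, nothing PostgreSQL-specific). Reads come straight out of a MAP_SHARED mapping with no syscall, writes go through the file descriptor, and the final read back through the mapping is exactly the case the quoted spec text leaves unspecified.

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE_SZ 8192

int main(void)
{
    int   fd  = open("datafile", O_RDWR);
    char *map = mmap(NULL, PAGE_SZ, PROT_READ, MAP_SHARED, fd, 0);
    char  page[PAGE_SZ];

    /* Read path: just dereference the mapping -- no read() syscall. */
    memcpy(page, map, PAGE_SZ);

    /* Write path: modify a private copy and push it with a syscall. */
    page[0] ^= 1;
    pwrite(fd, page, PAGE_SZ, 0);

    /* Unspecified: does the mapping now show the new byte?  SUS only
     * guarantees coherence via msync(); some OSes happen to be
     * coherent anyway, others are not. */
    char maybe_stale = map[0];
    (void) maybe_stale;

    munmap(map, PAGE_SZ);
    close(fd);
    return 0;
}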
Re: [PERFORM] First set of OSDL Shared Mem scalability results, some weirdness ...
> Basically it says that there are no guarantees whatsoever if you try
> this. The SUS text is a bit weaselly ("the application must ensure
> correct synchronization") but the HPUX mmap man page, among others,
> lays it on the line:
>
>     It is also unspecified whether write references to a memory region
>     mapped with MAP_SHARED are visible to processes reading the file
>     and whether writes to a file are visible to processes that have
>     mapped the modified portion of that file, except for the effect of
>     msync().
>
> It might work on particular OSes but I think depending on such
> behavior would be folly...

Agreed. Only OSes with a coherent file system buffer cache should ever use mmap(2). In order for this to work on HPUX, msync(2) would need to be used.

-sc

-- Sean Chittenden
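In other words, on such a system every modification would have to be chased with an explicit msync() before readers of the file, or of other mappings, could trust what they see. Roughly, as a sketch:

#include <stddef.h>
#include <sys/mman.h>

/* After modifying a MAP_SHARED region, force the change out so other
 * processes (and plain read()s of the file) see it.  MS_SYNC blocks
 * until the data is written; MS_ASYNC merely schedules the write. */
int flush_mapped_range(void *addr, size_t len)
{
    return msync(addr, len, MS_SYNC);
}

An msync() per page write costs at least as many syscalls as the write() it replaces, which rather cancels the point of using mmap() for the write path on such systems.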
Re: [PERFORM] First set of OSDL Shared Mem scalability results, some weirdness ...
First off, I'd like to get involved with these tests - only pressure of other work has prevented me. Here's my take on the results so far:

I think taking the ratio of the memory allocated to shared_buffers against the total memory available on the server is completely fallacious. That is why the results cannot be explained - IMHO the ratio has no real theoretical basis.

The important ratio for me is the amount of shared_buffers against the total size of the database in the benchmark test. Every database workload has a differing percentage of the total database size that represents the working set, or the memory that can be beneficially cached. For the tests that DBT-2 is performing, I say that there are only so many blocks that are worth the trouble of caching. If you cache more than this, you are wasting your time.

For me, these tests don't show that there is a sweet spot that you should set your shared_buffers to, only that for that specific test, you have located the correct size for shared_buffers. It would be an incorrect inference to interpret this as the percentage of available RAM where the sweet spot lies for all workloads.

The theoretical basis for my comments is this: DBT-2 is essentially a static workload. That means, for a long test, we can work out with reasonable certainty the probability that a block will be requested, for every single block in the database. Given a particular size of cache, you can work out what your overall cache hit ratio is and therefore what your speedup is compared with retrieving every single block from disk (the no-cache scenario).

If you draw a graph of speedup (y) against cache size as a % of total database size, the graph looks like an upside-down "L" - i.e. the graph rises steeply as you give it more memory, then turns sharply at a particular point, after which it flattens out. The turning point is the sweet spot we all seek - the optimum amount of cache memory to allocate - but this spot depends upon the workload and database size, not on available RAM on the system under test.

Clearly, the presence of the OS disk cache complicates this. Since we have two caches both allocated from the same pot of memory, it should be clear that if we overallocate one cache beyond its optimum effectiveness, while the second cache is still in its "more is better" stage, then we will get reduced performance. That seems to be the case here. I wouldn't accept that a fixed ratio between the two caches exists for ALL, or even the majority of, workloads - though clearly broad-brush workloads such as OLTP and Data Warehousing do have similar-ish requirements.

Let's look at an example: an application with two tables. SmallTab has 10,000 rows of 100 bytes each (so the table is ~1 MB) - one row per photo in a photo gallery web site. LargeTab has large objects within it and has 10,000 photos, average size 10 MB (so the table is ~100 GB). Assuming all photos are requested randomly, you can see that an optimum cache size for this workload is 1 MB RAM, 100 GB disk. Trying to up the cache doesn't have much effect on the probability that a photo (from LargeTab) will be in cache, unless you have a large % of 100 GB of RAM, when you do start to make gains. (Please don't be picky about indexes, catalog, block size etc.) That clearly has absolutely nothing at all to do with the RAM of the system on which it is running.

I think Jan has said this also in far fewer words, but I'll leave that to Jan to agree/disagree...
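Simon's upside-down "L" is easy to reproduce with a toy model. The sketch below uses illustrative constants only (loosely modelled on the SmallTab/LargeTab example, not on any benchmark numbers): an ideal cache keeps the most frequently accessed blocks first, and the printed speedup climbs steeply until the hot set fits, then flattens.

#include <stdio.h>

/* Toy model: a small "hot" set absorbing half the accesses and a
 * large "cold" remainder.  All constants are arbitrary illustrations. */
#define HOT_BLOCKS   128          /* ~1 MB of 8 KB blocks              */
#define COLD_BLOCKS  13107200     /* ~100 GB of 8 KB blocks            */
#define HOT_SHARE    0.5          /* fraction of accesses to hot set   */

int main(void)
{
    double t_mem = 0.0001, t_disk = 1.0;   /* relative access costs */

    for (long cache = 16; cache <= 1L << 22; cache *= 4) {
        /* An ideal cache holds the most-probable blocks first. */
        long   hot_cached  = cache < HOT_BLOCKS ? cache : HOT_BLOCKS;
        long   cold_cached = cache - hot_cached;
        double hit = HOT_SHARE * (double) hot_cached / HOT_BLOCKS
                   + (1.0 - HOT_SHARE) * (double) cold_cached / COLD_BLOCKS;
        double time    = hit * t_mem + (1.0 - hit) * t_disk;
        double speedup = t_disk / time;
        printf("cache=%8ld blocks  hit=%6.4f  speedup=%6.2f\n",
               cache, hit, speedup);
    }
    return 0;
}

With these constants the speedup roughly doubles by the time the cache reaches 256 blocks, then barely moves until the cache approaches a large fraction of the cold set -- the turning point tracks the working set, not total RAM.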
I say this: ARC in 8.0 PostgreSQL allows us to sensibly allocate as large a shared_buffers cache as is required by the database workload, and this should not be constrained to a small percentage of server RAM.

Best Regards,

Simon Riggs

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Josh Berkus
Sent: 08 October 2004 22:43
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: [PERFORM] First set of OSDL Shared Mem scalability results, some weirdness ...

Folks, I'm hoping that some of you can shed some light on this. I've been trying to peg the "sweet spot" for shared memory using OSDL's equipment. With Jan's new ARC patch, I was expecting the desired amount of shared_buffers to be greatly increased. This has not turned out to be the case.

The first test series was using OSDL's DBT2 (OLTP) test, with 150 warehouses. All tests were run on a 4-way Pentium III 700MHz, 3.8GB RAM system hooked up to a rather high-end storage device (14 spindles). Tests were on PostgreSQL 8.0b3, Linux 2.6.7. Here's a top-level summary:

  shared_buffers   % RAM   NOTPM20*
  1000             0.2%    1287
  23000            5%      1507
  46000            10%     1481
  69000            15%     1382
  92000            20%     1375
  115000           25%     1380
  138000           30%     1344

* = New Order Transactions Per Minute, last 20 minutes
Re: [PERFORM] First set of OSDL Shared Mem scalability results, some weirdness ...
Simon,

<lots of good stuff clipped>

> If you draw a graph of speedup (y) against cache size as a % of total
> database size, the graph looks like an upside-down "L" - i.e. the
> graph rises steeply as you give it more memory, then turns sharply at
> a particular point, after which it flattens out. The turning point is
> the sweet spot we all seek - the optimum amount of cache memory to
> allocate - but this spot depends upon the workload and database size,
> not on available RAM on the system under test.

Hmmm ... how do you explain, then, the "camel hump" nature of the real performance? That is, when we allocated even a few MB more than the optimum ~190MB, overall performance started to drop quickly. The result is that allocating 2x optimum RAM is nearly as bad as allocating too little (e.g. 8MB).

The only explanation I've heard of this so far is that there is a significant loss of efficiency with larger caches. Or do you see the loss of 200MB out of 3500MB as actually affecting the kernel cache that much?

Anyway, one test of your theory that I can run immediately is to run the exact same workload on a bigger, faster server and see if the desired quantity of shared_buffers is roughly the same. I'm hoping that you're wrong -- not because I don't find your argument persuasive, but because if you're right it leaves us without any reasonable ability to recommend shared_buffer settings.

-- Josh Berkus, Aglio Database Solutions, San Francisco
Re: [PERFORM] First set of OSDL Shared Mem scalability results, some weirdness ...
Quoth [EMAIL PROTECTED] (Simon Riggs):
> I say this: ARC in 8.0 PostgreSQL allows us to sensibly allocate as
> large a shared_buffers cache as is required by the database workload,
> and this should not be constrained to a small percentage of server
> RAM.

I don't think that this particularly follows from what ARC does. What ARC does is to prevent certain conspicuous patterns of sequential accesses from essentially trashing the contents of the cache. If a particular benchmark does not include conspicuous vacuums or sequential scans on large tables, then there is little reason to expect ARC to have a noticeable impact on performance.

It _could_ be that this implies that ARC allows you to get some use out of a larger shared cache, as it won't get blown away by vacuums and Seq Scans. But it is _not_ obvious that this is a necessary truth.

_Other_ truths we know about are:

a) If you increase the shared cache, that means more data that is represented in both the shared cache and the OS buffer cache, which seems rather a waste;

b) The larger the shared cache, the more pages there are for the backend to rummage through before it looks to the filesystem, and therefore the more expensive cache misses get. Cache hits get more expensive, too. Searching through memory is not costless.

-- cbbrowne
Re: [PERFORM] First set of OSDL Shared Mem scalability results, some weirdness ...
Kevin Brown <[EMAIL PROTECTED]> writes:
> Hmm...something just occurred to me about this. Would a hybrid
> approach be possible? That is, use mmap() to handle reads, and use
> write() to handle writes?

Nope. Have you read the specs regarding mmap-vs-stdio synchronization? Basically it says that there are no guarantees whatsoever if you try this. The SUS text is a bit weaselly ("the application must ensure correct synchronization") but the HPUX mmap man page, among others, lays it on the line:

    It is also unspecified whether write references to a memory region
    mapped with MAP_SHARED are visible to processes reading the file
    and whether writes to a file are visible to processes that have
    mapped the modified portion of that file, except for the effect of
    msync().

It might work on particular OSes but I think depending on such behavior would be folly...

regards, tom lane
Re: [PERFORM] First set of OSDL Shared Mem scalability results, some weirdness ...
I wrote:
> That said, if it's typical for many changes to be made to a page
> internally before PG needs to commit that page to disk, then your
> argument makes sense, and that's especially true if we simply cannot
> have the page written to disk in a partially-modified state
> (something I can easily see being an issue for the WAL -- would the
> same hold true of the index/data files?).

Also, even if multiple changes would be made to the page, with the page being valid for a disk write only after all such changes are made, the use of mmap() (in conjunction with an internal buffer that would then be copied to the mmap()ed memory space at the appropriate time) would potentially save a system call over the use of write() (even if write() were used to write out multiple pages). However, there is so much lower-hanging fruit than this that an mmap() implementation almost certainly isn't worth pursuing for this alone.

So: it seems to me that mmap() is worth pursuing only if most internal buffers tend to be written to only once, or if it's acceptable for a partially modified data/index page to be written to disk (which I suppose could be true for data/index pages in the face of a rock-solid WAL).

-- Kevin Brown <[EMAIL PROTECTED]>
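As a sketch, the syscall-saving pattern Kevin describes (plain C with hypothetical types, not PostgreSQL code): changes accumulate in a private copy, and publishing the finished page is a memcpy into the MAP_SHARED view rather than a write().

#include <stddef.h>
#include <string.h>

#define PAGE_SZ 8192

typedef struct {
    char  local[PAGE_SZ];   /* private working copy of the page      */
    char *mapped;           /* MAP_SHARED view of the same file page */
} BufferedPage;

/* Accumulate any number of modifications in the private copy... */
void page_modify(BufferedPage *bp, size_t off, const void *src, size_t len)
{
    memcpy(bp->local + off, src, len);
}

/* ...then publish the finished page with a memcpy instead of a
 * write(): no syscall, but also no control over when the kernel
 * pushes the mapped page to disk. */
void page_publish(BufferedPage *bp)
{
    memcpy(bp->mapped, bp->local, PAGE_SZ);
}

The catch, per Tom's reply below, is that once the memcpy has happened the kernel may flush the mapped page whenever it likes.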
Re: [PERFORM] First set of OSDL Shared Mem scalability results, some weirdness ...
Kevin Brown <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
> > mmap() is Right Out because it does not afford us sufficient
> > control over when changes to the in-memory data will propagate to
> > disk.
>
> ... that's especially true if we simply cannot have the page written
> to disk in a partially-modified state (something I can easily see
> being an issue for the WAL -- would the same hold true of the
> index/data files?).

You're almost there. Remember the fundamental WAL rule: log entries must hit disk before the data changes they describe. That means that we need not only a way of forcing changes to disk (fsync) but a way of being sure that changes have *not* gone to disk yet. In the existing implementation we get that by just not issuing write() for a given page until we know that the relevant WAL log entries are fsync'd down to disk. (BTW, this is what the LSN field on every page is for: it tells the buffer manager the latest WAL offset that has to be flushed before it can safely write the page.)

mmap provides msync, which is comparable to fsync, but AFAICS it provides no way to prevent an in-memory change from reaching disk too soon. This would mean that WAL entries would have to be written *and flushed* before we could make the data change at all, which would convert multiple updates of a single page into a series of write-and-wait-for-WAL-fsync steps. Not good. fsync'ing WAL once per transaction is bad enough; once per atomic action is intolerable.

There is another reason for doing things this way. Consider a backend that goes haywire and scribbles all over shared memory before crashing. When the postmaster sees the abnormal child termination, it forcibly kills the other active backends and discards shared memory altogether. This gives us fairly good odds that the crash did not affect any data on disk. It's not perfect of course, since another backend might have been in process of issuing a write() when the disaster happens, but it's pretty good; and I think that that isolation has a lot to do with PG's good reputation for not corrupting data in crashes. If we had a large fraction of the address space mmap'd, then this sort of crash would be just about guaranteed to propagate corruption into the on-disk files.

regards, tom lane
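The ordering rule Tom states fits in a few lines. In this sketch the names echo real PostgreSQL routines (XLogFlush, the per-page LSN), but the bodies are hypothetical stubs, not the actual buffer-manager code:

#include <stdint.h>

typedef uint64_t XLogRecPtr;            /* a WAL position */

typedef struct Page {
    XLogRecPtr lsn;      /* latest WAL record that touched this page */
    char       data[8192];
} Page;

/* Hypothetical stub: block until all WAL up to 'upto' is fsync'd. */
void XLogFlush(XLogRecPtr upto) { (void) upto; }

/* Hypothetical stub: issue the actual write() of the page. */
void smgr_write(Page *page) { (void) page; }

/* The fundamental WAL rule: a data page may reach disk only after
 * every WAL record describing its changes is safely on disk. */
void FlushBuffer(Page *page)
{
    XLogFlush(page->lsn);   /* first force the log ...            */
    smgr_write(page);       /* ... only then write the data page  */
}

mmap offers no equivalent hook: the kernel may clean a dirty mapped page at any moment, before any XLogFlush() has happened.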
[PERFORM] First set of OSDL Shared Mem scalability results, some weirdness ...
Folks,

I'm hoping that some of you can shed some light on this. I've been trying to peg the "sweet spot" for shared memory using OSDL's equipment. With Jan's new ARC patch, I was expecting the desired amount of shared_buffers to be greatly increased. This has not turned out to be the case.

The first test series was using OSDL's DBT2 (OLTP) test, with 150 warehouses. All tests were run on a 4-way Pentium III 700MHz, 3.8GB RAM system hooked up to a rather high-end storage device (14 spindles). Tests were on PostgreSQL 8.0b3, Linux 2.6.7. Here's a top-level summary:

  shared_buffers   % RAM   NOTPM20*
  1000             0.2%    1287
  23000            5%      1507
  46000            10%     1481
  69000            15%     1382
  92000            20%     1375
  115000           25%     1380
  138000           30%     1344

* = New Order Transactions Per Minute, last 20 minutes. Higher is better. The maximum possible is 1800.

As you can see, the "sweet spot" appears to be between 5% and 10% of RAM, which is if anything *lower* than recommendations for 7.4! This result is so surprising that I want people to take a look at it and tell me if there's something wrong with the tests or some bottlenecking factor that I've not seen.

In order above:
http://khack.osdl.org/stp/297959/
http://khack.osdl.org/stp/297960/
http://khack.osdl.org/stp/297961/
http://khack.osdl.org/stp/297962/
http://khack.osdl.org/stp/297963/
http://khack.osdl.org/stp/297964/
http://khack.osdl.org/stp/297965/

Please note that many of the graphs in these reports are broken. For one thing, some aren't recorded (flat lines) and the CPU usage graph has mislabeled lines.

-- Josh Berkus, Aglio Database Solutions, San Francisco
Re: [PERFORM] First set of OSDL Shared Mem scalability results, some weirdness ...
Josh Berkus <[EMAIL PROTECTED]> writes:
> Here's a top-level summary:
>
>   shared_buffers   % RAM   NOTPM20*
>   1000             0.2%    1287
>   23000            5%      1507
>   46000            10%     1481
>   69000            15%     1382
>   92000            20%     1375
>   115000           25%     1380
>   138000           30%     1344
>
> As you can see, the "sweet spot" appears to be between 5% and 10% of
> RAM, which is if anything *lower* than recommendations for 7.4!

This doesn't actually surprise me a lot. There are a number of aspects of Postgres that will get slower the more buffers there are.

One thing that I hadn't focused on till just now, which is a new overhead in 8.0, is that StrategyDirtyBufferList() scans the *entire* buffer list *every time it's called*, which is to say once per bgwriter loop. And to add insult to injury, it's doing that with the BufMgrLock held (not that it's got any choice). We could alleviate this by changing the API between this function and BufferSync, such that StrategyDirtyBufferList can stop as soon as it's found all the buffers that are going to be written in this bgwriter cycle ... but AFAICS that means abandoning the bgwriter_percent knob, since you'd never really know how many dirty pages there were altogether.

BTW, what is the actual size of the test database (disk footprint wise) and how much of that do you think is heavily accessed during the run? It's possible that the test conditions are such that adjusting shared_buffers isn't going to mean anything anyway.

regards, tom lane
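A sketch of the API change Tom suggests (simplified structures, not the actual 8.0 source). The current behaviour pays a full scan per bgwriter cycle; the early-exit variant stops once it has enough victims, but can no longer report the total dirty count that bgwriter_percent needs.

typedef struct BufferDesc {
    int buf_id;
    int dirty;                      /* simplified dirty flag */
} BufferDesc;

/* Current behaviour (simplified): scan ALL 'nbuffers' descriptors,
 * so the cost of every bgwriter cycle grows with shared_buffers. */
int StrategyDirtyBufferList(BufferDesc *bufs, int nbuffers,
                            int *dirty_ids, int max_wanted)
{
    int found = 0;
    for (int i = 0; i < nbuffers; i++) {
        if (bufs[i].dirty) {
            if (found < max_wanted)
                dirty_ids[found] = bufs[i].buf_id;
            found++;                /* keeps counting to the end:
                                     * that total is what the
                                     * bgwriter_percent knob needs */
        }
    }
    return found;
}

/* Proposed variant: bail out once max_wanted dirty buffers are in
 * hand.  Much cheaper under BufMgrLock, but the total dirty count
 * is unknown, so bgwriter_percent loses its meaning. */
int StrategyDirtyBufferListEarlyExit(BufferDesc *bufs, int nbuffers,
                                     int *dirty_ids, int max_wanted)
{
    int found = 0;
    for (int i = 0; i < nbuffers && found < max_wanted; i++)
        if (bufs[i].dirty)
            dirty_ids[found++] = bufs[i].buf_id;
    return found;
}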
Re: [PERFORM] First set of OSDL Shared Mem scalability results, some weirdness ...
Tom,

> BTW, what is the actual size of the test database (disk footprint
> wise) and how much of that do you think is heavily accessed during
> the run? It's possible that the test conditions are such that
> adjusting shared_buffers isn't going to mean anything anyway.

The raw data is 32GB, but a lot of the activity is incremental - that is, inserts and updates to recent inserts. Still, according to Mark, most of the data does get queried in the course of filling orders.

-- Josh Berkus, Aglio Database Solutions, San Francisco
Re: [PERFORM] First set of OSDL Shared Mem scalability results, some weirdness ...
[EMAIL PROTECTED] (Josh Berkus) wrote:
> I've been trying to peg the "sweet spot" for shared memory using
> OSDL's equipment. With Jan's new ARC patch, I was expecting the
> desired amount of shared_buffers to be greatly increased. This has
> not turned out to be the case.

That doesn't surprise me. My primary expectation would be that ARC would make small buffers much more effective alongside vacuums and seq scans than they used to be. That does not establish anything about the value of increasing the size of buffer caches...

> This result is so surprising that I want people to take a look at it
> and tell me if there's something wrong with the tests or some
> bottlenecking factor that I've not seen.

I'm aware of two conspicuous scenarios where ARC would be expected to _substantially_ improve performance:

1. When it allows a VACUUM not to throw useful data out of the shared cache, in that VACUUM now only "chews" on one page of the cache;

2. When it allows a Seq Scan to not push useful data out of the shared cache, for much the same reason.

I don't imagine either scenario is prominent in the OSDL tests.

Increasing the number of cache buffers _is_ likely to lead to some slowdowns:

- Data that passes through the cache also passes through the kernel cache, so it's recorded twice, and read twice...

- The more cache pages there are, the more work is needed for PostgreSQL to manage them. That will notably happen anywhere that there is a need to scan the cache.

- If there are any inefficiencies in how the OS kernel manages shared memory as its size scales, well, that will obviously cause a slowdown.

-- cbbrowne
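A deliberately simplified illustration of the scan behaviour Christopher describes. This is not how ARC is implemented (ARC works through its recency/frequency lists), but it shows the effect in its simplest form: a scan that recycles a small private ring of buffers can never evict more than a handful of cache pages, no matter how large the table.

#define SCAN_RING_SIZE 16   /* handful of buffers reused by the scan */

typedef struct ScanRing {
    int buffers[SCAN_RING_SIZE];   /* buffer ids owned by this scan */
    int next;                      /* next slot to recycle          */
} ScanRing;

/* Instead of claiming a victim from the main cache (and evicting
 * possibly-hot data), the sequential scan or VACUUM recycles its own
 * small ring, so at most SCAN_RING_SIZE pages of cache are ever
 * "chewed" regardless of how many pages the scan reads. */
int scan_get_buffer(ScanRing *ring)
{
    int buf = ring->buffers[ring->next];
    ring->next = (ring->next + 1) % SCAN_RING_SIZE;
    return buf;
}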