Re: [PERFORM] lru_multiplier and backend page write-outs
Hello,

> At one point I envisioned making it smart enough to try and handle the
> scenario you describe--on an idle system, you may very well want to write
> out dirty and recently accessed buffers if there's nothing else going on.
> But such behavior is counter-productive on a busy system, which is why a
> similar mechanism that existed before 8.3 was removed. Making that only
> happen when idle requires a metric for what "busy" means, which is tricky
> to do given the information available to this particular process.
>
> Short version: if you never fill the buffer cache, buffers_clean will
> always be zero, and you'll only see writes by checkpoints and things not
> operating with the standard client buffer allocation mechanism. Which
> brings us to...

Sure. I am not really out to get the background writer to pre-emptively do idle trickling. Though I can see cases where one might care about this (such as lessening the impact of OS buffer cache delays on checkpoints), it's not what I am after now.

> > One theory: Is it the auto vacuum process? Stracing those I've seen
> > that they very often do writes directly to disk.
>
> In order to keep it from using up the whole cache with maintenance
> overhead, vacuum allocates a 256K ring of buffers and re-uses buffers
> from that ring whenever possible. That will generate buffers_backend
> writes when the ring fills but vacuum has more left to scan. Your theory
> that all the backend writes are coming from vacuum seems consistent with
> what you've described.

The bit that is inconsistent with this theory, given the above ring buffer description, is that I saw the backend write-out count increasing constantly during the write activity I was generating to the database. However (because in this particular case it was a small database used for some latency related testing), no table was ever large enough that 256k buffers would ever be filled by the process of vacuuming a single table. Most tables would likely have been a handful to a couple of hundred pages large.

In addition, when I say "constantly" above I mean that the count increases even between successive SELECT:s (of the stat table) with only a second or two in between. In the absence of long-running vacuums, that discounts vacuuming, because the naptime is 1 minute. In fact this already discounted vacuuming even without the added information you provided above, but I didn't realize that when originally posting. The reason I mentioned vacuuming was that the use case is such that we do have a lot of tables constantly getting writes and updates, but they are all small.

Anything else known that might be generating the writes, if it is not vacuuming?

> You might even want to drop the two background writer parameters you've
> tweaked upwards back down closer to their original values. I get the
> impression you might have increased those hoping for more background
> writer work because you weren't seeing any. If you ever do get to where
> your buffer cache is full and the background writer starts doing
> something, those could jump from ineffective to wastefully heavy at that
> point.

I tweaked them in order to eliminate backends having to do synchronous (with respect to the operating system, even if not with respect to the underlying device) writes. The idea is that writes to the operating system are less understood/controlled, in terms of any latency they may cause.
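For reference, the "stat table" sampling I have been doing is nothing fancier than repeatedly running something along these lines and watching which counters move:

  -- Sample the background writer stats; buffers_backend growing while
  -- buffers_clean stays at zero is the behavior discussed here.
  SELECT buffers_clean, buffers_backend, buffers_alloc
  FROM pg_stat_bgwriter;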
It would be very nice if the backend writes were always zero under normal circumstances (or at least growing very, very rarely in edge cases where the JIT policy did not succeed), in order to make it a more relevant and rare observation that the backend write-outs are systematically increasing.

On this topic btw, was it considered to allow the administrator to specify a fixed-size margin to use when applying the JIT policy? (The JIT policy and logic itself being exactly the same still.) Especially with larger buffer caches, that would perhaps allow the administrator to make a call to truly eliminate synchronous writes during normal operation, while not adversely affecting anything (if the buffer cache is 1 GB, having a margin of, say, 50 MB does not really matter much in terms of wasting memory, yet could have a significant impact on eliminating synchronous write-outs).

On a system where you really want to keep backend writes at exactly 0 under normal circumstances (discounting vacuuming), and which has a large buffer cache (say the one gig), it might be nice to be able to say: "ok - I have 1 GB of buffer cache. For the purpose of the JIT algorithm, please pretend it's only 900 MB." The result is 100 MB of constantly sized margin, with respect to ensuring writes are asynchronous.

-- / Peter Schuller
Re: [PERFORM] lru_multiplier and backend page write-outs
> > no table was ever large enough that 256k buffers would ever be filled
> > by the process of vacuuming a single table.
>
> Not 256K buffers--256K as in bytes, i.e. 32 buffers.

Ok.

> > In addition, when I say "constantly" above I mean that the count
> > increases even between successive SELECT:s (of the stat table) with
> > only a second or two in between.
>
> Writes to the database when only doing read operations are usually
> related to hint bits: http://wiki.postgresql.org/wiki/Hint_Bits

Sorry, I didn't mean to imply read-only operations (I did read the hint bits information a while back though). What I meant was that while I was constantly generating the insert/delete/update activity, I was selecting the bgwriter stats with only a second or two in between. The intent was to convey that the count of backend written pages was systematically and constantly (as in a few hundred per handful of seconds) increasing, in spite of no long-running vacuum and the buffer cache not being close to full.

> > On this topic btw, was it considered to allow the administrator to
> > specify a fixed-size margin to use when applying the JIT policy?
>
> Right now, there's no way to know exactly what's in the buffer cache
> without scanning the individual buffers, which requires locking their
> headers so you can see them consistently. No one process can get the big
> picture without doing something intrusive like that, and on a busy system
> the overhead of collecting more data to know exactly how far ahead the
> cleaning is can drag down overall performance. A lot can happen while the
> background writer is sleeping.

Understood.

> One next-generation design which has been sketched out but not even
> prototyped would take cleaned buffers and add them to the internal list
> of buffers that are free, which right now is usually empty on the theory
> that cached data is always more useful than a reserved buffer. If you
> developed a reasonable model for how many buffers you needed and padded
> that appropriately, that's the easiest way (given the rest of the buffer
> manager code) to get close to ensuring there aren't any backend writes.
> Because you've got the OS buffering writes anyway in most cases, it's
> hard to pin down whether that actually improved worst-case latency
> though. And moving in that direction always seems to reduce average
> throughput even in write-heavy benchmarks.

Ok.

> The important thing to remember is that the underlying OS has its own
> read and write caching mechanisms here, and unless the PostgreSQL ones
> are measurably better than those, you might as well let the OS manage the
> problem instead.

The problem though is that although the OS may be good in the common cases it is designed for, it can have specific features that are directly counter-productive if your goals do not line up with those of the commonly designed-for use case (in particular, if you care about latency a lot and not necessarily about absolute max throughput).

For example, in Linux up until recently, if not still, there is the 1024 per-inode buffer limit that limits the number of buffers written as a result of expiry. This means that when PostgreSQL does its fsync(), you may end up having a lot more to write out than would have been the case if the centisecs_expiry had been enforced, regardless of whether PostgreSQL was tuned to write dirty pages out sufficiently aggressively. If the amount built up exceeds the capacity of the RAID controller cache... I had a case where I suspect this was exacerbating the situation.
Manually doing a 'sync' on the system every few seconds noticeably helped (the theory being: because it forced page write-outs to happen earlier and in smaller storms).

> It's easy to demonstrate that's happening when you give a decent amount
> of memory to shared_buffers; it's much harder to prove that's the case
> for an improved write scheduling algorithm. Stepping back a bit, you
> might even consider that one reason PostgreSQL has grown as well as it
> has in scalability is exactly because it's been riding improvements in
> the underlying OS in many of these cases, rather than trying to do all
> the I/O scheduling itself.

Sure. In this case with the backend writes, I am more interested in understanding better what is happening, and in having better indications of when backends block on I/O, than necessarily having a proven improvement in throughput or latency. It makes it easier to reason about what is happening when you *do* have a measured performance problem.

Thanks for all the insightful information.

-- / Peter Schuller
[PERFORM] lru_multiplier and backend page write-outs
Hello, I've had the feeling for a while that the pg_stat_bgwriter statistics don't work quite the way I have understood them (based on the excellent [1] and the pg docs).

I am now monitoring a database that has an lru_multiplier of 4.0, a delay of 200ms and a maxpages of 1000. Current stats:

postgres=# select * from pg_stat_bgwriter;
 checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc
-------------------+-----------------+--------------------+---------------+------------------+-----------------+---------------
               241 |              17 |              72803 |             0 |                0 |           81015 |         81708
(1 row)

This is while the database is undergoing continuous activity (almost exclusively writing), but at a rate that does not saturate the underlying storage (caching RAID controller; all write ops are fast, cache is never filled).

In addition, PostgreSQL is not even close to filling its buffer cache. The buffer cache is configured at 1 GB, and the resident size of the PostgreSQL process is only 80-90 MB so far. So even independently of any lru_multiplier setting, delays and whatever else, I don't see why any backend would ever have to do its own write-outs in order to allocate a page from the buffer cache.

One theory: Is it the auto vacuum process? Stracing those I've seen that they very often do writes directly to disk.

In any case, the reason I am fixating on buffers_backend is that I am after a clear indication of whether any normal backend (non-autovacuum or anything like that) is ever having to block on disk writes, other than WAL fsync:s.

Is a non-zero buffers_backend consistent with expected behavior?

[1] http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm

-- / Peter Schuller
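(As an aside, for double-checking the "not even close to filling its buffer cache" claim from within the database rather than from process size: assuming the contrib/pg_buffercache module is installed, something like the following counts the shared buffers actually in use.)

  -- Requires the pg_buffercache contrib module; buffers that have never
  -- been used show up with a NULL relfilenode.
  SELECT count(*) AS used_buffers
  FROM pg_buffercache
  WHERE relfilenode IS NOT NULL;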
Re: [PERFORM] NOW vs CURRENT_DATE
> I just discovered a big - not only big, huge - difference between NOW()
> and CURRENT_DATE. Did you already know about it, and do you know why?
>
> DELETE FROM blacklist where bl_date < (NOW() - interval '2 DAY');
> on 6 000 000 of records: 699 ms
>
> DELETE FROM blacklist where bl_date < (CURRENT_DATE - interval '2 DAY');
> on 6 000 000 of records

Is this a one-off run, one after the other (e.g. with a ROLLBACK in between)? If so, I suspect the difference is due to caching, and if you re-ran the NOW() version it would also be fast.

Also, NOW() is equivalent to CURRENT_TIMESTAMP rather than CURRENT_DATE. Perhaps the date vs. timestamp difference has some implication for how the query is planned.

-- / Peter Schuller
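(One way to test the caching theory without actually deleting anything is to run both variants inside rolled-back transactions, in both orders, and compare the timings:)

  -- Run each variant twice, in both orders; if the first run is always
  -- the slow one regardless of NOW()/CURRENT_DATE, caching is the culprit.
  BEGIN;
  EXPLAIN ANALYZE DELETE FROM blacklist
    WHERE bl_date < (NOW() - interval '2 DAY');
  ROLLBACK;

  BEGIN;
  EXPLAIN ANALYZE DELETE FROM blacklist
    WHERE bl_date < (CURRENT_DATE - interval '2 DAY');
  ROLLBACK;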
Re: [PERFORM] Large number of tables slow insert
> I use Postgresql 8.3.1-1 to store a lot of data coming from a large
> amount of sensors. In order to have good performances on querying by
> timestamp on each sensor, I partitioned my measures table for each
> sensor. Thus I create a lot of tables. I simulated a large sensor network
> with 3000 nodes so I have ~3000 tables. And it appears that each insert
> (in separate transactions) in the database takes about 300ms (3-4 inserts
> per second) in tables where there are just a few tuples (< 10). I think
> you can understand that it's not efficient at all because I need to treat
> a lot of inserts. Do you have any idea why it is that slow? and how can I
> get good insert performance?
>
> My test machine: Intel P4 2.4GHz, 512MB RAM, Ubuntu Server (ext3).
> iostat tells me that I do: 0.5MB/s reading and ~6-7MB/s writing while
> constantly inserting.

Have you checked what you are bottlenecking on - CPU or disk? Try iostat/top/etc during the inserts. Also check actual disk utilization (iostat -x on linux/freebsd; varies on others) to see what percentage of time the disk/storage device is busy.

You say you have 3-4 inserts/second causing 6-7 MB/s writing. That suggests to me the inserts are fairly large. Are they in the MB range, which would account for the I/O?

My suspicion is that you are bottlenecking on CPU, since in my experience there is definitely something surprisingly slow about encoding/decoding data at the protocol level, or somewhere else that is involved in backend/client communication. I.e., I suspect your client and/or server is spending a lot of CPU time on things not directly related to the actual table inserts. If so, various suggested schemes w.r.t. indexing, table bloat etc. won't help at all.

In short, 6-7 MB/second would be fairly consistent with INSERT/COPY operations being CPU bound on a modern CPU, in my experience. It may be that this is entirely untrue in your case, but it sounds like a reasonable thing to at least consider.

-- / Peter Schuller
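(If it does turn out to be per-statement/protocol overhead, a quick experiment is to compare the individual INSERTs against a batched COPY of the same data. The table and file names below are hypothetical, matching nothing in the original post:)

  -- Same rows, one statement: COPY avoids most of the per-row parse,
  -- plan and protocol round-trip overhead of individual INSERTs.
  COPY measures_sensor_0001 (captured_at, value)
  FROM '/tmp/sensor_0001_batch.csv' WITH CSV;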
[PERFORM] Identifying the nature of blocking I/O
[for the purpose of this post, 'blocking' refers to an I/O operation taking a long time for reasons other than the amount of work the I/O operation itself actually implies; not to the use of blocking I/O calls or anything like that]

Hello, I have a situation in which deterministic latency is a lot more important than throughput. I realize this is a hugely complex topic and that there is interaction between many different things (pg buffer cache, os buffer cache, raid controller caching, wal buffers, storage layout, etc). I already know several things I definitely want to do to improve matters.

But in general, it would be very interesting to see, at any given moment, what PostgreSQL backends are actually blocking on, from the perspective of PostgreSQL. So for example, if I have 30 COMMIT:s that are active, to know whether it is simply waiting on the WAL fsync, or actually waiting on a data fsync because a checkpoint is being created. Or similarly, for non-commits, whether they are blocking because the WAL buffers are full and writing them out is blocking, etc.

This would make it easier to observe and draw conclusions when tweaking different things in pg/the os/the raid controller.

Is there currently a way of dumping such information? I.e., asking PG "what are backends waiting on right now?".

-- / Peter Schuller
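(The closest thing I am aware of out of the box is pg_stat_activity, but its "waiting" column only covers lock waits, not the kinds of I/O waits described above - so the sketch below is a starting point rather than an answer:)

  -- Shows active backends; "waiting" is true only for lock waits,
  -- so backends stalled on fsync/write do not show up as waiting here.
  SELECT procpid, waiting, current_query
  FROM pg_stat_activity
  WHERE current_query <> '<IDLE>';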
[PERFORM] VACUUM ANALYZE blocking both reads and writes to a table
Hello, my understanding, and generally my experience, has been that VACUUM and VACUUM ANALYZE (but not VACUUM FULL) are never supposed to block either SELECT:s or UPDATE:s/INSERT:s/DELETE:s to a table. This is seemingly confirmed by reading the explicit locking documentation, in terms of the locks acquired by various forms of vacuuming and which other lock modes they conflict with.

I have now seen it happen twice that a VACUUM ANALYZE has seemingly been the triggering factor in blocking queries.

In the first instance, we had two particularly interesting things going on:

VACUUM ANALYZE thetable
LOCK TABLE thetable IN ACCESS SHARE MODE

In addition there was one SELECT from the table, and a bunch of INSERT:s (this is based on pg_stat_activity). While I am unsure of why there is an explicit LOCK going on with ACCESS SHARE MODE (no explicit locking is ever done on this table by the application), it is supposed to be the locking used for selects. I suspect it may be a referential integrity related acquisition generated by PG.

The second time it happened, there was again a single SELECT, a bunch of INSERT:s, and then:

VACUUM ANALYZE thetable

This time there was no explicit LOCK visible.

In both cases, activity was completely blocked until the VACUUM ANALYZE completed.

Does anyone have input on why this could be happening? The PostgreSQL version is 8.2.4[1]. Am I correct in that it *should* not be possible for this to happen?

For the next time this happens I will try to have a query prepared that will dump as much relevant information as possible regarding acquired locks.

If it makes a difference, the SELECT does have a subselect that also selects from the same table - a MAX(column) on an indexed column.

[1] I did check the ChangeLog for the 8.2.x releases above .4, and the 8.3 releases, but did not see anything that indicated locking/conflict related fixes in relation to vacuums.

-- / Peter Schuller
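(For the record, a sketch of the kind of lock-dumping query I have in mind - joining pg_locks against pg_stat_activity for the table in question; 'thetable' stands in for the real table name:)

  -- Who holds or waits on locks on the table, and what they are running.
  SELECT l.pid, l.mode, l.granted, a.current_query
  FROM pg_locks l
  JOIN pg_stat_activity a ON a.procpid = l.pid
  WHERE l.relation = 'thetable'::regclass;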
Re: [PERFORM] VACUUM ANALYZE blocking both reads and writes to a table
Hello,

> No. VACUUM takes an exclusive lock at the end of the operation to
> truncate empty pages. (If it cannot get the lock then it'll just skip
> this step.) In 8.2.4 there was a bug that caused it to sleep according to
> vacuum_delay during the scan to identify possibly empty pages. This was
> fixed in 8.2.5: [snip revision log]

Thank you very much! This does indeed seem to be the likely culprit. I will try to either upgrade, or if that is not possible in time for the next occurrence, confirm that this is what is happening based on pg_locks.

Thanks again for the very informative response.

-- / Peter Schuller
Re: [PERFORM] VACUUM ANALYZE blocking both reads and writes to a table
Actually, while on the topic:

> date: 2007-09-10 13:58:50 -0400; author: alvherre; state: Exp; lines: +6 -2;
> Remove the vacuum_delay_point call in count_nondeletable_pages, because
> we hold an exclusive lock on the table at this point, which we want to
> release as soon as possible. This is called in the phase of lazy vacuum
> where we truncate the empty pages at the end of the table.

Even with the fix, the lock is still held. Is the operation expected to be fast (for some definition of "fast") and in-memory, or is this something that causes significant disk I/O and/or scales badly with table size or similar?

I.e., is this enough that, even without the 8.2.4 bug, one should not really consider VACUUM ANALYZE non-blocking with respect to other transactions? (I realize various exclusive locks are taken for short periods of time even for things that are officially declared non-blocking; the question is whether this falls into that category.)

-- / Peter Schuller
Re: [PERFORM] shared_buffers in 8.3 w/ lots of RAM on dedicated PG machine
> PostgreSQL only uses direct I/O for writing to the WAL; everything else
> goes through the regular OS buffer cache unless you force it to do
> otherwise at the OS level (like some Solaris setups do with
> forcedirectio). This is one reason it still may not make sense to give an
> extremely high percentage of RAM to PostgreSQL even with improvements in
> managing it.

Ok - thank you for the input (that goes for everyone).

> Another is that shared_buffers memory has to be reconciled with disk at
> every checkpoint, where OS buffers do not.

Hmm. Am I interpreting that correctly in that dirty buffers need to be flushed to disk at checkpoints? That makes perfect sense - but why would that not be the case with OS buffers? My understanding is that the point of the checkpoint is to essentially obsolete old WAL data in order to recycle the space, which would require flushing the data in question first (i.e., normally you just fsync the WAL, but when you want to recycle space you need the fsync() as a barrier and are then free to nuke the old WAL).

-- / Peter Schuller
[PERFORM] shared_buffers in 8.3 w/ lots of RAM on dedicated PG machine
Hello, my impression has been that in the past, there has been a general semi-consensus that upping shared_buffers to use the majority of RAM has not generally been recommended, with reliance on the OS buffer cache instead being the recommendation.

Given the changes that have gone into 8.3, in particular with regards to minimizing the impact of large sequential scans, would it be correct to say that given that

- enough memory is left for other PG bits (sort mems and whatnot else)
- only PG is running on the machine
- you're on 64 bit so you do not run into address space issues
- the database working set is larger than RAM

it would be generally advisable to pump up shared_buffers pretty much as far as possible, instead of relying on the OS buffer cache?

-- / Peter Schuller
Re: [PERFORM] shared_buffers in 8.3 w/ lots of RAM on dedicated PG machine
> PostgreSQL still depends on the OS for file access and caching. I think
> that the current recommendation is to have up to 25% of your RAM in the
> shared buffer cache.

This feels strange. Given a reasonable amount of RAM (let's say 8 GB in this case), I cannot imagine why 75% of that would be efficiently used for anything but the buffer cache (ignoring work_mem, stacks, etc). Obviously the OS will need memory to do its usual stuff (buffering to do efficient I/O, and so on). But the need for that should not increase with the amount of RAM in the machine, all else being equal.

What type of file I/O, other than reading pages of PostgreSQL data which are eligible for the PostgreSQL buffer cache, does PostgreSQL do that would take advantage of the operating system caching so much data? (Assuming the database is not extreme to the point of file system metadata being huge.)

If the 25% rule still holds true, even under circumstances where the assumption is that the PostgreSQL buffer cache is more efficient (in terms of hit ratio) at caching PostgreSQL database data pages, it would be useful to understand why, in order to understand the trade-offs involved and make appropriate decisions.

Or is it a matter of PostgreSQL doing non-direct I/O, such that anything cached in shared_buffers will also be cached by the OS?

-- / Peter Schuller
[PERFORM] Non-blocking vacuum full
Hello, I was wondering whether any thought has previously been given to having a non-blocking "vacuum full", in the sense of space reclamation and table compaction.

The motivation is that it is useful to be able to assume that operations that span a table will *roughly* scale linearly with the size of the table. But when you have a table that over an extended period of time begins small, grows large, and grows small again (where large might be, say, 200 GB), that assumption is most definitely not correct when you're on the downward slope of that graph. Having this assumption remain true simplifies things a lot for certain workloads (= my particular workload ;)).

I have only looked very, very briefly at the PG code, so I don't know how far-fetched it is, but my thought was that it should be possible to have a slow background process (similar to normal non-full vacuums now) that would, instead of registering dead tuples in the FSM, move live tuples around. Combine that slow-moving operation with a new tuple space allocation policy that prefers earlier locations on-disk, and it should in time result in a situation where the physical on-disk file contains only dead tuples after a certain percentage location. At that point the file can be truncated, giving space back to the OS as well as eliminating all that dead space from having to be covered by sequential scans on the table.

This does of course increase the total cost of all updates and deletes, but would be very useful in some scenarios. It also has the interesting property that the scan for live tuples to move need not touch the entire table to be effective; it could by design be applied to the last n percent of the table, where n would be scaled appropriately with the frequency of the checks relative to update/insert frequency.

Other benefits:

* Never vacuum full - EVER. Not even after discovering too small max_fsm_pages or too infrequent vacuums and needing to retroactively shrink the table.

* Increased locality in general, even if one does not care about the disk space or sequential scanning. Particularly relevant for low-update-frequency tables suffering from sudden shrinkage, where a blocking VACUUM FULL is not acceptable.

* Non-blocking CLUSTER is perhaps suddenly more trivial to implement? Or at least "SORTOFCLUSTER" when you want it for reasons other than perfect order ("mostly sorted").

Opinions/thoughts?

-- / Peter Schuller
Re: [PERFORM] max_fsm_pages, shared_buffers and checkpoint_segments
> increasing checkpoint_segments, which is also a disk thing. However,
> setting it to 25, and then increasing any of the other 2 variables, the
> postgresql daemon stops working. meaning it does not start upon reboot.
> When I bring

Sounds like you need to increase your shared memory limits. Unfortunately this will require a reboot on FreeBSD :(

See: http://www.postgresql.org/docs/8.2/static/kernel-resources.html

Last time I checked, PostgreSQL should be complaining about the shared memory on startup rather than silently failing, though. Check your logs perhaps. Though I believe the RC script will cause the message to be printed interactively at the console too, if you run it. (Assuming you have it installed from ports.)

-- / Peter Schuller
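(Once the limits are raised and the server starts again, a quick sanity check from psql that the new values actually took effect - these are the three settings discussed in this thread:)

  -- Verify the running server picked up the new configuration.
  SHOW shared_buffers;
  SHOW max_fsm_pages;
  SHOW checkpoint_segments;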
Re: [PERFORM] Postgres Benchmark Results
> - Deferred Transactions, since adding a comment to a blog post doesn't
> need the same guarantees as submitting a paid order, it makes sense that
> the application could tell postgres which transactions we care about if
> power is lost. This will massively boost performance for websites, I
> believe.

This would be massively useful. Very often all I care about is that the transaction is semantically committed; that is, that other transactions starting from that moment will see the modifications done. As opposed to actually persisting data to disk.

In particular I have a situation where I attempt to utilize available hardware by using concurrency. The problem is that I have to either hugely complicate my client code or COMMIT more often than I would like in order to satisfy dependencies between different transactions. If a deferred/delayed commit were possible, I could get all the performance benefit without the code complexity, and with no penalty (because in this case persistence is not important).

-- / Peter Schuller
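(As it happens, PostgreSQL 8.3 later shipped exactly this feature as "asynchronous commit", controllable per transaction via synchronous_commit. A minimal sketch, with a hypothetical table name:)

  -- 8.3+: COMMIT returns once the change is visible to other
  -- transactions, without waiting for the WAL flush; a crash may lose
  -- the last few such transactions but cannot corrupt the database.
  BEGIN;
  SET LOCAL synchronous_commit TO off;
  INSERT INTO blog_comments (post_id, body) VALUES (42, 'hello');
  COMMIT;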
Re: [PERFORM] Scaling SELECT:s with the number of disks on a stripe
Hello,

> The next question then is whether anything in your postgres configuration
> is preventing it getting useful performance from the OS. What settings
> have you changed in postgresql.conf?

The only options not commented out are the following (it's not even tweaked for buffer sizes and such, since in this case I am not interested in things like sort performance and cache locality other than as an afterthought):

hba_file = '/etc/postgresql/8.1/main/pg_hba.conf'
ident_file = '/etc/postgresql/8.1/main/pg_ident.conf'
external_pid_file = '/var/run/postgresql/8.1-main.pid'
listen_addresses = '*'
port = 5432
max_connections = 100
unix_socket_directory = '/var/run/postgresql'
ssl = true
shared_buffers = 1000
log_line_prefix = '%t '
stats_command_string = on
stats_row_level = on
autovacuum = on
lc_messages = 'C'
lc_monetary = 'C'
lc_numeric = 'C'
lc_time = 'C'

> Are you using any unusual settings within the OS itself?

No. It's a pretty standard kernel. The only local tweaking done is enabling/disabling various things; there are no special patches used or attempts to create a minimalistic kernel or anything like that.

> You're forgetting the LIMIT clause. For the straight index scan, the
> query aborts when the LIMIT is reached, having scanned only the specified
> number of index rows (plus any index entries that turned out to be dead
> in the heap). For the bitmap scan case, the limit can be applied only
> after the heap scan is under way, therefore the index scan to build the
> bitmap will need to scan ~50k rows, not the 10k specified in the limit,
> so the amount of time spent scanning the index is 50 times larger than in
> the straight index scan case.

Ok - makes sense that it has to scan the entire subset of the index for the value in question. I will have to tweak the CPU/disk cost settings (which I have, on purpose, not yet done).

> However, I do suspect you have a problem here somewhere, because in my
> tests the time taken to do the bitmap index scan on 50k rows, with the
> index in cache, is on the order of 30ms (where the data is cached in
> shared_buffers) to 60ms (where the data is cached by the OS). That's on a
> 2.8GHz xeon.

This is on a machine with 2.33GHz xeons, and I wasn't trying to exaggerate. I timed it, and it is CPU bound (in userspace; next to no system CPU usage at all) for about 15 seconds for the case of selecting with a limit of 10000. Given that there is no disk activity, I can't imagine any buffer sizes or such affecting this other than userspace vs. kernelspace CPU concerns (since obviously the data being worked on is in RAM). Or am I missing something?

It is worth noting that the SELECT of fewer entries is entirely disk bound; there is almost no CPU usage whatsoever. Even taking the cumulative CPU usage into account (gut-feeling calculation, nothing scientific) and multiplying by 50, you are nowhere near 15 seconds of CPU-boundness. So it is indeed strange.

-- / Peter Schuller
Re: [PERFORM] Scaling SELECT:s with the number of disks on a stripe
Hello,

> I'd always do benchmarks with a realistic value of shared_buffers (i.e.
> much higher than that). Another thought that comes to mind is that the
> bitmap index scan does depend on the size of work_mem. Try increasing
> your shared_buffers to a reasonable working value (say 10%-15% of RAM - I
> was testing on a machine with 4GB of RAM, using a shared_buffers setting
> of 5), and increase work_mem to 16364, and see if there are any
> noticeable changes in behaviour.

Increasing the buffer size and work_mem did have a significant effect. I can understand it in the case of the heap scan, but I am still surprised at the index scan. Could pg be serializing the entire query as a result of insufficient buffers/work_mem to satisfy multiple concurrent queries?

With both turned up, not only is the heap scan no longer visibly CPU bound, I am seeing some nice scaling in terms of disk I/O. I have not yet benchmarked to the point of being able to say whether it's entirely linear, but it certainly seems to at least be approaching the ballpark.

Thank you for the help! I guess I made a bad call in not tweaking this. My thinking was that I explicitly did not want to turn it up, so that I could benchmark the raw performance of disk I/O rather than having things be cached in memory more than they already would be. But apparently it had other side effects I did not consider.

Thanks again,

-- / Peter Schuller
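(For anyone repeating this experiment: the shared_buffers = 1000 from earlier in the thread is only about 8 MB worth of 8 KB pages, and changing it requires a server restart, whereas work_mem can be raised per session. In 8.1 the value is given in kilobytes:)

  -- Per-session experiment; 16384 KB = 16 MB. shared_buffers itself can
  -- only be changed in postgresql.conf followed by a restart.
  SET work_mem = 16384;
  SHOW work_mem;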
Re: [PERFORM] Scaling SELECT:s with the number of disks on a stripe
Hello,

> If you are dealing with timed data or similar, you may consider
> partitioning your table(s).

Unfortunately this is not the case; the insertion is more or less random (not quite, but for the purposes of this problem it is).

Thanks for the pointers though. That is sure to be useful in some other context down the road.

-- / Peter Schuller
Re: [PERFORM] Scaling SELECT:s with the number of disks on a stripe
Hello,

> > SELECT * FROM test WHERE value = 'xxx' LIMIT 1000;
>
> I tested this on a 14-way software raid10 on freebsd, using pg 8.1.6, and
> couldn't reproduce anything like it. With one client I get about 200 disk
> requests per second, scaling almost exactly linearly for the first 5 or
> so clients, as expected. At 14 clients it was down to about 150 reqs/sec
> per client, but the total throughput continued to increase with
> additional concurrency up to about 60 clients, giving about 3600 reqs/sec
> (260 per disk, which is about right for 10krpm scsi disks under highly
> concurrent random loads).

Ok. That is very interesting; so there is definitely nothing fundamental in PG that prevents the scaling (even if on FreeBSD).

> A good question. Have you tried testing the disks directly? e.g. create
> some huge files, and run a few concurrent random readers on them? That
> would test the array and the filesystem without involving postgres.

I have confirmed that I am seeing expected performance for random short and highly concurrent reads in one large (> 200 GB) file. The I/O is done using libaio, however, so depending on the implementation I suppose the I/O scheduling behavior of the fs/raid driver might be affected compared to having a number of concurrent threads doing synchronous reads. I will try to confirm performance in a way that more closely matches PostgreSQL's behavior. I have to say though that I will be pretty surprised if the performance is not matched in that test.

Is there any chance there is some operating system conditional code in pg itself that might affect this behavior? Some kind of purposeful serialization of I/O, for example (even if that sounds like an extremely strange thing to do)?

> This is entirely expected. With the larger row count, it is more likely
> (or so the planner estimates) that rows will need to be fetched from
> adjacent or at least nearby blocks, thus a plan which fetches rows in
> physical table order rather than index order would be expected to be
> superior. The planner takes into account the estimated startup cost and
> per-row cost when planning LIMIT queries; therefore it is no surprise
> that for larger limits, it switches to a plan with a higher startup cost
> but lower per-row cost.

Roger that, makes sense. I had misunderstood the meaning of the heap scan.

> Most likely your index is small enough that large parts of it will be
> cached in RAM, so that the scan of the index to build the bitmap does not
> need to hit the disk much if at all.

Even so, however, several seconds of CPU activity to scan the index for a few tens of thousands of entries sounds a bit excessive. Or does it not? Because at that level, the CPU-bound period alone is approaching the time it would take to seek for each entry instead. But then I presume the amount of work is similar/the same in the other case, except it's being done at the beginning of the query instead of before each seek.

-- / Peter Schuller
[PERFORM] Scaling SELECT:s with the number of disks on a stripe
Hello, I am looking to use PostgreSQL for storing some very simple flat data, mostly in a single table. The amount of data will be in the hundreds of gigabytes range. Each row is on the order of 100-300 bytes in size; in other words, small enough that I am expecting disk I/O to be seek bound (even if PostgreSQL reads a full pg page at a time, since a page is significantly smaller than the stripe size of the volume).

The only important performance characteristics are insertion/deletion performance, and the performance of trivial SELECT queries whose WHERE clause tests equality on one of the columns.

Other than absolute performance, an important goal is to be able to scale fairly linearly with the number of underlying disk drives. We are fully willing to take a disk seek per item selected, as long as it scales.

To this end I have been doing some benchmarking to see whether the plan is going to be feasible. On a 12-disk hardware stripe, insertion performance does scale somewhat with concurrent inserters. However, I am seeing surprising effects with SELECT:s: a single selecter generates the same amount of disk activity as two concurrent selecters (I was easily expecting about twice).

The query is simple:

SELECT * FROM test WHERE value = 'xxx' LIMIT 1000;

No ordering, no joins, no nothing. Selecting concurrently with two different values of 'xxx' yields the same amount of disk activity (never any significant CPU activity). Note that the total amount of data is too large to fit in RAM (> 500 million rows), and the number of distinct values in the value column is 10000. The column in the WHERE clause is indexed.

So my first question is - why am I not seeing this scaling? The absolute amount of disk activity with a single selecter is consistent with what I would expect from a SINGLE disk, which is completely expected since I never thought PostgreSQL would introduce disk I/O concurrency on its own. But this means that adding additional readers doing random-access reads *should* scale very well with 12 underlying disks in a stripe. (Note that I have seen fairly similar results on other RAID variants too, including software RAID5 (yes yes I know), in addition to the hardware stripe.)

These tests have been done on Linux 2.6.19.3 and PostgreSQL 8.1.

Secondly, I am seeing a query plan switch after a certain threshold. Observe:

perftest=# explain select * from test where val='7433' limit 1000;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Limit  (cost=0.00..4016.50 rows=1000 width=143)
   ->  Index Scan using test_val_ix on test  (cost=0.00..206620.88 rows=51443 width=143)
         Index Cond: ((val)::text = '7433'::text)
(3 rows)

Now increasing to a limit of 10000:

perftest=# explain select * from test where val='7433' limit 10000;
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Limit  (cost=360.05..38393.36 rows=10000 width=143)
   ->  Bitmap Heap Scan on test  (cost=360.05..196014.82 rows=51443 width=143)
         Recheck Cond: ((val)::text = '7433'::text)
         ->  Bitmap Index Scan on test_val_ix  (cost=0.00..360.05 rows=51443 width=0)
               Index Cond: ((val)::text = '7433'::text)
(5 rows)

The interesting part is that the latter query is entirely CPU bound (no disk I/O at all) for an extended period of time before even beginning to read data from disk. And when it *does* start performing disk I/O, the performance is about the same as for the other case. In other words, the change in query plan seems to do nothing but add overhead.

What is the bitmap heap scan supposed to be doing that would increase performance above a "seek once per matching row" plan?
I haven't been able to Google my way to an understanding of what the intended benefit of a bitmap heap scan is over a plain index scan.

-- / Peter Schuller
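(One way to compare the two plans head-to-head on the same data is to disable the bitmap scan for a session; the planner enable_* toggles are meant exactly for this kind of experimentation, not for production use:)

  -- Force the plain index scan for the same query and compare timings.
  SET enable_bitmapscan = off;
  EXPLAIN ANALYZE SELECT * FROM test WHERE val = '7433' LIMIT 10000;
  SET enable_bitmapscan = on;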