Re: [PERFORM] Sorted group by
Original query:

explain analyse select * from tracker where objectid > 120;
                                                            QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Index Scan using tracker_objectid on tracker  (cost=0.00..915152.62 rows=3684504 width=33) (actual time=0.061..5402.608 rows=3790872 loops=1)
   Index Cond: (objectid > 120)
 Total runtime: 9134.362 ms
(3 rows)

On Tue, 10 Aug 2010, hubert depesz lubaczewski wrote:
> select distinct on (group) * from table order by group desc, number desc;

This solution is rather obvious, and works on older versions of Postgres. Thanks. However, the burden of sorting by two columns (actually, in our application the group is two columns, so we sort by three columns instead) makes this significantly slower than just copying the whole data through our application (which effectively does a hash aggregation).

explain analyse select distinct on (objectid, fieldname) objectid, fieldname, sourcename, version
    from tracker where objectid > 120
    order by objectid, fieldname, version desc;
                                                            QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=1330828.11..1357953.05 rows=361666 width=34) (actual time=12815.878..22452.737 rows=1782996 loops=1)
   ->  Sort  (cost=1330828.11..1339869.76 rows=3616658 width=34) (actual time=12815.873..16608.903 rows=3790872 loops=1)
         Sort Key: objectid, fieldname, version
         Sort Method:  quicksort  Memory: 420980kB
         ->  Index Scan using tracker_objectid on tracker  (cost=0.00..936861.47 rows=3616658 width=34) (actual time=0.061..5441.050 rows=3790872 loops=1)
               Index Cond: (objectid > 120)
 Total runtime: 24228.724 ms
(7 rows)

On Tue, 10 Aug 2010, Thomas Kellerer wrote:
> select group_, value_
> from (
>   select group_, value_, number_,
>          row_number() over (partition by group_ order by value_ desc) as row_num
>   from numbers
> ) t
> where row_num = 1
> order by group_ desc

This looks quite cute; however, it is slightly slower than the DISTINCT ON approach.
explain analyse select objectid, fieldname, sourcename
    from (select objectid, fieldname, sourcename, version,
                 row_number() over (partition by objectid, fieldname order by version desc) as row_num
          from tracker where objectid > 120) as t
    where row_num = 1;
                                                            QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Subquery Scan t  (cost=1330828.11..1457411.14 rows=18083 width=68) (actual time=12835.553..32220.075 rows=1782996 loops=1)
   Filter: (t.row_num = 1)
   ->  WindowAgg  (cost=1330828.11..1412202.92 rows=3616658 width=34) (actual time=12835.541..26471.802 rows=3790872 loops=1)
         ->  Sort  (cost=1330828.11..1339869.76 rows=3616658 width=34) (actual time=12822.560..16646.112 rows=3790872 loops=1)
               Sort Key: tracker.objectid, tracker.fieldname, tracker.version
               Sort Method:  quicksort  Memory: 420980kB
               ->  Index Scan using tracker_objectid on tracker  (cost=0.00..936861.47 rows=3616658 width=34) (actual time=0.067..5433.790 rows=3790872 loops=1)
                     Index Cond: (objectid > 120)
 Total runtime: 34002.828 ms
(9 rows)

On Tue, 10 Aug 2010, Kevin Grittner wrote:
> select group, value from tbl x
>   where not exists (select * from tbl y
>                       where y.group = x.group and y.number > x.number);

This is a join, which is quite a bit slower:

explain analyse select objectid, fieldname, sourcename
    from tracker as a
    where not exists (select * from tracker as b
                        where a.objectid = b.objectid and a.fieldname = b.fieldname
                          and a.version < b.version and b.objectid > 120)
      and a.objectid > 120;
                                                            QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Merge Anti Join  (cost=2981427.73..3042564.32 rows=2411105 width=30) (actual time=24834.372..53939.131 rows=1802376 loops=1)
   Merge Cond: ((a.objectid = b.objectid) AND (a.fieldname = b.fieldname))
   Join Filter: (a.version < b.version)
   ->  Sort  (cost=1490713.86..1499755.51 rows=3616658 width=34) (actual time=12122.478..15944.255 rows=3790872 loops=1)
         Sort Key: a.objectid, a.fieldname
         Sort Method:  quicksort  Memory: 420980kB
         ->  Index Scan using tracker_objectid on tracker a  (cost=0.00..1096747.23 rows=3616658 width=34) (actual time=0.070..5403.235 rows=3790872 loops=1)
               Index Cond: (objectid > 120)
   ->  Sort  (cost=1490713.86..1499755.51 rows=3616658 width=17) (actual time=12710.564..20952.841 rows=8344994 loops=1)
         Sort Key: b.objectid, b.fieldname
         Sort Method:  quicksort  Memory: 336455kB
         ->  Index Scan using
[PERFORM] 32 vs 64 bit build on Solaris Sparc
With a 16 CPU, 32 GB Solaris Sparc server, is there any conceivable reason to use a 32 bit build rather than a 64 bit build? Apparently the Sun PostgreSQL package includes a README that indicates you might want to think twice about using 64 bit because it is slower -- this seems like outdated advice, but I was looking for confirmation one way or the other.

Also semi-related question: when building from source, using gcc, enabling debug (but *not* assert) is normally not much of a performance hit. Is the same true if you build with the Sun CC?

Thanks in advance for any thoughts/experiences.

Joe

--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, 24x7 Support
Re: [PERFORM] 32 vs 64 bit build on Solaris Sparc
Joseph Conway m...@joeconway.com writes:
> Also semi-related question: when building from source, using gcc, enabling debug (but *not* assert) is normally not much of a performance hit. Is the same true if you build with the Sun CC?

Most non-gcc compilers disable optimization altogether if you enable debug :-(. Perhaps that isn't true of Sun's, but I'd check its documentation before considering --enable-debug for a production build.

			regards, tom lane
Re: [PERFORM] Sorted group by
Excerpts from Matthew Wakeling's message of mar ago 10 11:40:16 -0400 2010:
> I am trying to retrieve, for many sets of rows grouped on a couple of fields, the value of an ungrouped field where the row has the highest value in another ungrouped field.

I think this does what you want (schema is from the tenk1 table in the regression database):

select string4 as group,
       (array_agg(stringu1 order by unique1 desc))[1] as value
from tenk1
group by 1;

Please let me know how it performs with your data. The plan is rather simple:

regression=# explain analyze select string4 as group,
    (array_agg(stringu1 order by unique1 desc))[1] as value
    from tenk1 group by 1;
                                                      QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 GroupAggregate  (cost=0.00..1685.16 rows=4 width=132) (actual time=22.825..88.922 rows=4 loops=1)
   ->  Index Scan using ts4 on tenk1  (cost=0.00..1635.11 rows=1 width=132) (actual time=0.135..33.188 rows=1 loops=1)
 Total runtime: 89.348 ms
(3 filas)

--
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
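For comparison with the plans earlier in the thread, the same array_agg trick might map onto the tracker table roughly as follows. This is an untested sketch: the column names and the objectid filter are taken from the queries above, and the ordered aggregate requires PostgreSQL 9.0 or later.

    -- Sketch: greatest-version-per-(objectid, fieldname) using an ordered aggregate,
    -- avoiding the explicit Sort node that DISTINCT ON and row_number() needed.
    select objectid, fieldname,
           (array_agg(sourcename order by version desc))[1] as sourcename,
           max(version) as version
    from tracker
    where objectid > 120
    group by objectid, fieldname;

Whether this beats the DISTINCT ON variant presumably depends on whether the planner can feed the GroupAggregate from an index in group order, as it does with the ts4 index in the regression example above.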
Re: [PERFORM] Testing Sandforce SSD
Greg Smith wrote:
> * How to test for power failure?
>
> I've had good results using one of the early programs used to investigate this class of problems: http://brad.livejournal.com/2116715.html?page=2

FYI, this tool is mentioned in the Postgres documentation:

	http://www.postgresql.org/docs/9.0/static/wal-reliability.html

--
Bruce Momjian  br...@momjian.us  http://momjian.us
EnterpriseDB   http://enterprisedb.com

  + It's impossible for everything to be true. +
Re: [PERFORM] Completely un-tuned Postgresql benchmark results: SSD vs desktop HDD
On Aug 10, 2010, at 9:21 AM, Greg Smith wrote:

> Scott Carey wrote:
>> Also, the amount of data at risk in a power loss varies between drives. For Intel's drives, it's a small chunk of data (< 256K). For some other drives, the cache can be over 30MB of outstanding writes. For some workloads this is acceptable
>
> No, it isn't ever acceptable. You can expect the type of data loss you get when a cache fails to honor write flush calls results in catastrophic database corruption. It's not "I lost the last few seconds";

I never said it was.

> it's "the database is corrupted and won't start after a crash".

Which is sometimes acceptable. There is NO GUARANTEE that you won't lose data, ever. An increase in the likelihood is an acceptable tradeoff in some situations, especially when it is small.

On ANY power loss event, with or without battery backed caches and such, you should do a consistency check on the system proactively. With less reliable hardware, that task becomes much more of a burden, and is much more likely to require restoring data from somewhere.

What is the likelihood that your RAID card fails, or that the battery that reported 'good health' only lasts 5 minutes and you lose data before power is restored? What is the likelihood of human error? Not that far off from the likelihood of power failure in a datacenter with redundant power. One MUST have a DR plan. Never assume that your perfect hardware won't fail.

> This is why we pound on this topic on this list. A SSD that fails to honor flush requests is completely worthless for anything other than toy databases.

Overblown. Not every DB and use case is a financial application or business critical app. Many are not toys at all.

Slave, read only DB's (or simply subset tablespaces) ...
Indexes. (per application, schema)
Tables. (per application, schema)
System tables / indexes.
WAL.

Each has different reliability requirements and different consequences from losing recently written data. Less than 8K of lost writes can be fatal to the WAL, or to table data. Corrupting some tablespaces is not a big deal; corrupting others is catastrophic. The problem with the assertion that this hardware is worthless is that it implies that every user, every use case, is at the far end of the reliability requirement spectrum. Yes, that can be a critical requirement for many, perhaps most, DB's. But there are many uses for slightly unsafe storage systems.

> You can expect significant work to recover any portion of your data after the first unexpected power loss under heavy write load in this environment, during which you're down. We do database corruption recovery at 2ndQuadrant; while I can't talk about the details of some recent incidents, I am not speaking theoretically when I warn about this.

I've done the single-user mode "recover system tables by hand" thing myself at 4AM, on a system with battery backed RAID 10, redundant power, etc. RAID cards die, and 10TB recovery times from backup are long. It's a game of balancing your data loss tolerance with the likelihood of power failure. Both of these variables are highly variable, and not just with 'toy' DBs. If you know what you are doing, you can use 'fast but not completely safe' storage for many things safely. The chance of loss is NEVER zero; do not assume that 'good' hardware is flawless.

Imagine a common internet case where synchronous_commit=false is fine. Recovery from backups is a pain (but a daily snapshot is taken of the important tables, and weekly for easily recoverable other stuff). If you expect one power related failure every 2 years, it might be perfectly reasonable to use 'unsafe' SSD's in order to support high transaction load, on the risk that that once-every-2-years downtime is 12 hours long instead of 30 minutes, and includes losing up to a day's information. Applications like this exist all over the place.

> --
> Greg Smith  2ndQuadrant US  Baltimore, MD
> PostgreSQL Training, Services and Support
> g...@2ndquadrant.com   www.2ndQuadrant.us
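As a concrete illustration of the "common internet case" mentioned above: the commit-durability tradeoff can be made per transaction rather than cluster-wide, which is available since 8.3. The table name below is hypothetical; this is only a sketch of the pattern, and note it trades away recent commits, not protection against the cache-lying corruption discussed earlier in the thread.

    -- Risk only this transaction's recent commits, not overall consistency:
    -- a lost commit here costs a few hundred milliseconds of work, not the database.
    begin;
    set local synchronous_commit = off;
    insert into page_views (url, viewed_at) values ('/index.html', now());
    commit;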
Re: [PERFORM] Completely un-tuned Postgresql benchmark results: SSD vs desktop HDD
On Aug 10, 2010, at 11:28 AM, Greg Smith wrote:

> Brad Nicholson wrote:
>> What about putting indexes on them? If the drive fails and drops writes on those, they could be rebuilt - assuming your system can function without the index(es) temporarily.
>
> Dumping indexes on SSD is one of the better uses for them, presuming you can survive what is likely to be an outage from a "can the site handle full load?" perspective while they rebuild after a crash. As I'm sure Brad is painfully aware of already, index rebuilding in PostgreSQL can take a while. To spin my broken record here again, the main thing to note when you consider that--relocate indexes onto SSD--is that the ones you are most concerned about the performance of were likely to be already sitting in RAM anyway, meaning the SSD speedup doesn't help reads much. So the giant performance boost just isn't there in that case.

For an OLTP type system, yeah. But for DW/OLAP and batch processing the gains are pretty big. Those indexes get kicked out of RAM and then pulled back in a lot. I'm talking about a server with 72GB of RAM that can't keep enough indexes in memory to avoid a lot of random access. Putting the indexes on an SSD has lowered the random I/O load on the other drives a lot, letting them get through sequential scans a lot faster.

Estimated power failure: once every 18 months (mostly due to human error). Rebuild indexes offline for 40 minutes every 18 months? No problem.

> --
> Greg Smith  2ndQuadrant US  Baltimore, MD
> PostgreSQL Training, Services and Support
> g...@2ndquadrant.com   www.2ndQuadrant.us
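For readers who want to try the index-relocation approach being discussed, a minimal sketch of the mechanics follows. The path, tablespace name, table and index names are all hypothetical; CREATE TABLESPACE needs a superuser and an existing empty directory owned by the postgres user.

    -- Put rebuildable indexes on the SSD; keep WAL, catalogs and heap data on safe storage.
    create tablespace ssd_indexes location '/mnt/ssd/pg_indexes';
    create index orders_placed_at_idx on orders (placed_at) tablespace ssd_indexes;

    -- After a crash involving the less-safe drive, rebuild rather than trust its contents:
    reindex index orders_placed_at_idx;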
Re: [PERFORM] Completely un-tuned Postgresql benchmark results: SSD vs desktop HDD
On Aug 10, 2010, at 11:38 AM, Karl Denninger wrote:

> Scott Marlowe wrote:
>> On Tue, Aug 10, 2010 at 12:13 PM, Karl Denninger k...@denninger.net wrote:
>>> ANY disk that says "write is complete" when it really is not is entirely unsuitable for ANY real database use. It is simply a matter of time
>>
>> What about read only slaves where there's a master with 100+ spinning hard drives getting it right and you need a half dozen or so read slaves? I can imagine that being ok, as long as you don't restart a server after a crash without checking on it.
>
> A read-only slave isn't read-only, is it? I mean, c'mon - how does the data get there? IF you mean a server that only accepts SELECTs, does not accept UPDATEs or INSERTs, and on a crash **reloads the entire database from the master**, then ok.

ENTIRE database? Depends on your tablespace setup and schema usage pattern. If:

* 90% of your data tables are partitioned by date, and untouched a week after insert. Partitions are backed up incrementally.
* The remaining 10% of it is backed up daily, and of that, 9% can be re-generated from data elsewhere if it is lost.
* System catalogs and WAL are on 'safest of safe' hardware.

then your 'bulk' data on a slave can be on less than flawless hardware. Simply restore the tables from the last week from the master or backup when the (rare) power failure occurs. The remaining data is safe, since it is not written to. Split up your 10% of non-date-partitioned data into what needs to be on safe hardware and what does not (maybe some indexes, etc); a sketch of this kind of date partitioning follows below.

Most of the time, the incremental cost of getting a BBU is too small to not do it, so the above hardly applies. But if you have data that is known to be read-only, you can do many unconventional things with it safely.

> Most people who will do this won't reload it after a crash. They'll inspect the database and say "ok", and put it back online. Bad Karma will ensue in the future.

Anyone going with something unconventional better know what they are doing and not just blindly plug it in and think everything will be OK. I'd never recommend unconventional setups for a user that wasn't an expert and understood the tradeoff.

> Incidentally, that risk is not theoretical either (I know about this one from hard experience. Fortunately the master was still ok and I was able to force a full-table copy. I didn't like it as the database was a few hundred GB, but I had no choice.)

Been there with 10TB, with hardware that should have been perfectly safe. 5 days of copying, and wishing that pg_dump supported lzo compression so that the dump portion had a chance at keeping up with the much faster restore portion, with some level of compression on to save the copy bandwidth.

--
Karl
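The date-partitioned layout described above could look roughly like this with the inheritance/constraint-exclusion partitioning available in the 8.x series. Every name here is hypothetical, and it is only a sketch of the idea that week-old partitions are never written again.

    -- The 'unsafe but fast' tablespace for bulk partitions (directory must already exist).
    create tablespace cheap_ssd location '/mnt/ssd/bulk';

    -- Parent stays empty; each week's child lands on the cheap tablespace.
    create table events (event_time timestamptz not null, payload text);

    create table events_2010_w32 (
        check (event_time >= '2010-08-09' and event_time < '2010-08-16')
    ) inherits (events) tablespace cheap_ssd;

    -- Older partitions receive no writes, so losing the drive only costs the newest
    -- partition, which can be restored from the master or from an incremental backup.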
Re: [PERFORM] Completely un-tuned Postgresql benchmark results: SSD vs desktop HDD
Scott Carey wrote:
> On Aug 10, 2010, at 11:38 AM, Karl Denninger wrote:
>> Most people who will do this won't reload it after a crash. They'll inspect the database and say "ok", and put it back online. Bad Karma will ensue in the future.
>
> Anyone going with something unconventional better know what they are doing and not just blindly plug it in and think everything will be OK. I'd never recommend unconventional setups for a user that wasn't an expert and understood the tradeoff.

True.

>> Incidentally, that risk is not theoretical either (I know about this one from hard experience. Fortunately the master was still ok and I was able to force a full-table copy. I didn't like it as the database was a few hundred GB, but I had no choice.)
>
> Been there with 10TB with hardware that should have been perfectly safe. 5 days of copying, and wishing that pg_dump supported lzo compression so that the dump portion had a chance at keeping up with the much faster restore portion with some level of compression on to save the copy bandwidth.

Pipe it through ssh -C.

PS: This works for SLONY and Bucardo too - set up a tunnel and then change the port temporarily. This is especially useful when the DB being COPY'd across has big fat honking BYTEA fields in it, which otherwise expand about 400% - or more - on the wire.

--
Karl
Re: [PERFORM] Completely un-tuned Postgresql benchmark results: SSD vs desktop HDD
A number of amusing aspects to this discussion.

- I've carried out similar tests using the Intel X-25M with both PG and DB2 (both on linux). While it is a simple matter to build parallel databases on DB2, on HDD and SSD, with buffers and tablespaces and logging and so on set to recreate as many scenarios as one wishes using a single engine instance, not so for PG. While PG is the best OS database, from a tuning and admin point of view there's rather a long way to go. No one should think that retail SSD should be used to support an enterprise database. People have gotten lulled into thinking otherwise as a result of the blurring of the two use cases in the HDD world, where the difference is generally just QA.

- All flash SSDs munge the byte stream, some (SandForce controlled in particular) more than others. Industrial strength flash SSDs can have 64 internal channels, written in parallel; they don't run on commodity controllers. Treating SSD as just a faster HDD is a trip on the road to perdition. Industrial strength (DRAM) SSDs have been used by serious database folks for a couple of decades, but not by the storefront semi-professionals who pervade the web start up world.

- The value of SSD in the database world is not as A Faster HDD(tm). Never was, despite the naïve who assert otherwise. The value of SSD is to enable BCNF datastores. Period. If you're not going to do that, don't bother. Silicon storage will never reach equivalent volumetric density, ever. SSD will never be useful in the byte-bloat world of xml and other flat file datastores (resident in databases or not). Industrial strength SSD will always be more expensive per GB, and likely by a lot. (Re)factoring to high normalization strips out an order of magnitude of byte bloat, increases native data integrity by as much, reduces much of the redundant code, and puts the ACID where it belongs. All good things, but not effortless.

You're arguing about the wrong problem. Sufficiently bulletproof flash SSDs exist and have for years, but their names are not well known (no one on this thread has named any); neither the Intel parts nor any of their retail cousins have any place in the mix except development machines. Real SSDs have MTBFs measured in decades; OEMs have qualified such parts, but you won't find them on the shelf at Best Buy. You need to concentrate on understanding what can be done with such drives that can't be done with vanilla HDD that costs 1/50 the dollars. Just being faster won't be the answer. Removing the difference between sequential file processing and true random access is what makes SSD worth the bother; it makes true relational datastores second nature rather than rocket science.

Robert
Re: [PERFORM] Questions on query planner, join types, and work_mem
Tom Lane wrote:
> Hannu Krosing ha...@2ndquadrant.com writes:
>> Of course there are more variables than just *_page_cost, so if you nail down any other one, you may end with less than 1 for both page costs. I have always used seq_page_cost = 1 in my thinking and adjusted others relative to it.
>
> Right, seq_page_cost = 1 is sort of the traditional reference point, but you don't have to do it that way. The main point here is that for an all-in-RAM database, the standard page access costs are too high relative to the CPU effort costs:
>
> regression=# select name, setting from pg_settings where name like '%cost';
>           name          | setting
> ------------------------+---------
>  cpu_index_tuple_cost   | 0.005
>  cpu_operator_cost      | 0.0025
>  cpu_tuple_cost         | 0.01
>  random_page_cost       | 4
>  seq_page_cost          | 1
> (5 rows)
>
> To model an all-in-RAM database, you can either dial down both random_page_cost and seq_page_cost to 0.1 or so, or set random_page_cost to 1 and increase all the CPU costs. The former is less effort ;-)
>
> It should be noted also that there's not all that much evidence backing up the default values of the cpu_xxx_cost variables. In the past those didn't matter much because I/O costs always swamped CPU costs anyway. But I can foresee us having to twiddle those defaults and maybe refine the CPU cost model more, as all-in-RAM cases get more common.

This confused me. If we are assuming the data is in effective_cache_size, why are we adding sequential/random page costs to the query cost routines?

--
Bruce Momjian  br...@momjian.us  http://momjian.us
EnterpriseDB   http://enterprisedb.com

  + It's impossible for everything to be true. +
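A session-level sketch of the "dial down both page costs" experiment Tom describes, which leaves postgresql.conf untouched; the query to re-test is whatever plan is being investigated.

    -- Try the planner with page costs that no longer swamp the CPU cost terms.
    set seq_page_cost = 0.1;
    set random_page_cost = 0.1;

    -- Confirm the session now sees the lowered values:
    select name, setting from pg_settings where name like '%cost';

    -- ...then EXPLAIN ANALYSE the query whose plan you are tuning, and revert:
    reset seq_page_cost;
    reset random_page_cost;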
Re: [PERFORM] Questions on query planner, join types, and work_mem
On Wed, Aug 11, 2010 at 9:42 PM, Bruce Momjian br...@momjian.us wrote:
> This confused me. If we are assuming the data is in effective_cache_size, why are we adding sequential/random page costs to the query cost routines?

See the comments for index_pages_fetched(). We basically assume that all data starts uncached at the beginning of each query - in fact, each plan node. effective_cache_size only measures the chances that if we hit the same block again later in the execution of something like a nested-loop-with-inner-indexscan, it'll still be in cache.

It's an extremely weak knob, and unless you have tables or indices that are larger than RAM, the only mistake you can make is setting it too low.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
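To check the "larger than RAM" condition Robert mentions, one possibility is a quick catalog query along these lines (a sketch; compare the sizes against physical RAM and the effective_cache_size setting):

    -- Largest tables and indexes in the current database, biggest first.
    select relname, relkind, pg_size_pretty(pg_relation_size(oid)) as size
    from pg_class
    where relkind in ('r', 'i')
    order by pg_relation_size(oid) desc
    limit 10;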
Re: [PERFORM] performance sol10 zone (fup)
On Wed, Aug 4, 2010 at 1:16 AM, Heiko L. hei...@fh-lausitz.de wrote:
> Hallo,
> Im running pg-8, pgpoolII on sol10-zone.

I noticed later that you mention 8.3.1. Two points: you're missing a year of updates, bug fixes, security patches etc. Assuming this version was fast before, we'll assume it's not the cause of this problem; however, you're asking for trouble with a version that old. There are bugs that might not bite you today, but may well in the future. Please upgrade to 8.3.11.

> After update sol10u7, queries on coltype timestamp are very slow.
> System: sparc, 2GB RAM

Is it possible you had an index that was working that now isn't? Are the queries you included the real ones or approximations? It looks like you have a bunch of seq scans happening. If they're all happening on the same table or a small set of them, then a lot of queries should be able to access them in any order together in 8.3.

Are sequential scans normal for this query when it runs fast? What does vmstat 10 and/or iostat -xd 10 have to say while this is running?

--
To understand recursion, one must first understand recursion.
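In the same spirit as the questions above, a quick way to see whether an expected index on the timestamp column still exists and whether the planner will consider it might look like this (table and column names are placeholders, not taken from Heiko's schema):

    -- List indexes on the suspect table, then look at the plan for a typical timestamp filter.
    select indexname, indexdef from pg_indexes where tablename = 'mytable';

    explain analyse
    select * from mytable
    where logtime >= now() - interval '1 day';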
Re: [PERFORM] Completely un-tuned Postgresql benchmark results: SSD vs desktop HDD
Scott Carey wrote:
> What is the likelihood that your RAID card fails, or that the battery that reported 'good health' only lasts 5 minutes and you lose data before power is restored? What is the likelihood of human error?

These are all things that happen sometimes, sure. The problem with the cheap SSDs is that they happen downright often if you actually test for it. If someone is aware of the risk and makes an informed decision, fine. But most of the time I see articles like the one that started this thread that are oblivious to the issue, and that's really bad.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us
Re: [PERFORM] Completely un-tuned Postgresql benchmark results: SSD vs desktop HDD
gnuo...@rcn.com wrote:
> Sufficiently bulletproof flash SSD exist and have for years, but their names are not well known (no one on this thread has named any)

The models perceived as bulletproof are the really dangerous ones to deploy. First, people let their guard down and stop being as paranoid as they should be when they use them. Second, it becomes much more difficult for them to justify buying more than one of the uber-SSD. That combination makes it easier to go back to having a single copy of their data, and there's a really bad road to wander down. The whole idea that kicked off this thread was to enable building systems cheap enough to allow making more inexpensive copies of the data.

My systems at home for example follow this model to some degree. There's not a single drive more expensive than $100 to be found here, but everything important to me is sitting on four of them in two systems within seconds after I save it. However, even here I've found it worth dropping enough money for a real battery-backed write cache, to reduce the odds of write corruption on the more important of the servers. Not doing so would be a dangerously cheap decision.

That's similar to how I feel about SSDs right now too. You need them to be expensive enough that corruption is unusual rather than expected after a crash--it's ridiculous to not spend enough to get something that's not completely broken by design--while not spending so much that you can't afford to deploy many of them.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us