Re: [PERFORM] perf problem with huge table
> Hi all, I am trying to move my app from MSSQL to PGsql, but I need a bit of help :)

Apart from all the other good advice about partitioning the dataset and such, there is another aspect to keep in mind. When you have a large dataset and your queries become IO-bound, tuple density is going to hit you in two ways. Your dataset seems to have a natural clustering around time, which is also what you would use for partitioning. That also means that if you roughly have the data clustered on disk, the tuples you need to satisfy a query will sit on the same page or on pages close to it.

The cost of checking visibility for a tuple is to some degree a function of the tuple size, so if you can do anything to increase the tuple density, that will most likely benefit speed in two ways:

* You increase the likelihood that the next tuple is in the same page, so fetching it doesn't cost a random I/O seek.
* You increase the total number of tuples you have sitting in your system cache in the same amount of pages (memory), so they don't cost a random I/O seek either.

So... if you are carrying around columns you don't really need, throw them away (that could be columns that can trivially be computed from other columns) - but you need to do your own testing here.

To stress the first point, here is a sample run on a fairly old desktop with one SATA drive. testtable has an integer id and a data column of 486 bytes of text; testtable2 has an integer id and an integer data column. Both are filled with 10M tuples, and PG was restarted and the OS caches dropped beforehand to simulate a totally disk-bound system.

testdb=# select count(id) from testtable where id > 800 and id < 850;
 count
-------
    49
(1 row)
Time: 7909.464 ms

testdb=# select count(id) from testtable2 where id > 800 and id < 850;
 count
-------
    49
(1 row)
Time: 2149.509 ms

In this sample: about 4 times faster, even though the query does not touch the data column at all. (On better hardware you'll most likely see better results.)
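For reference, here is a sketch of how a test setup like the one above could be built. The table definitions and the fill method are my assumptions, not the poster's actual script; repeat() is used to approximate the 486-byte text payload.

```sql
-- Hypothetical reconstruction of the two test tables.
CREATE TABLE testtable  (id integer, data text);
CREATE TABLE testtable2 (id integer, data integer);

-- Fill both with 10M tuples; repeat() approximates the 486-byte column.
INSERT INTO testtable
  SELECT i, repeat('x', 486) FROM generate_series(1, 10000000) AS i;
INSERT INTO testtable2
  SELECT i, i FROM generate_series(1, 10000000) AS i;

-- Indexes so the range predicate on id can use an index scan.
CREATE INDEX testtable_id_idx  ON testtable (id);
CREATE INDEX testtable2_id_idx ON testtable2 (id);
ANALYZE;
```

To reproduce the disk-bound case, restart PostgreSQL and drop the OS page cache (on Linux, `echo 3 > /proc/sys/vm/drop_caches` as root) before timing the queries.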
If the columns are needed, you can push the less frequently used columns out to a 1:1 relation. That gives you some administrative overhead, but it lets you decide at query time whether you want to pay the extra random seeks to access that data. You have the same picture the other way around: if your queries are accessing data sitting in TOAST, you'll be paying double the random IO cost to get the tuple. So it is definitely a tradeoff that should be made with care.

I've monkeypatched my own PG using this patch to toy around with the criteria for sending less frequently used data to a TOAST table: http://article.gmane.org/gmane.comp.db.postgresql.devel.general/135158/match= Google "vertical partitioning" for more; this is basically what it is. (I believe this could benefit my own application, so I'm also trying to push some interest into the area.)

-- Jesper

-- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
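As a sketch of the 1:1-relation approach described above (all table and column names here are illustrative, not from the poster's schema): the hot, frequently scanned columns stay in a dense main table, while wide, rarely used columns move to a side table sharing the primary key.

```sql
-- Hypothetical vertical split: hot columns stay in the dense table,
-- wide/cold columns move to a 1:1 side table keyed by the same id.
CREATE TABLE events_hot (
    id     integer PRIMARY KEY,
    ts     timestamp,
    status integer
);
CREATE TABLE events_cold (
    id          integer PRIMARY KEY REFERENCES events_hot (id),
    description text,
    raw_payload text
);

-- Most queries touch only the dense table, keeping tuple density high:
SELECT count(*) FROM events_hot WHERE ts >= '2010-01-01';

-- ...and pay the extra random seeks only when they ask for cold data:
SELECT h.id, c.description
FROM events_hot h JOIN events_cold c USING (id)
WHERE h.id = 42;
```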
Re: [PERFORM] moving pg_xlog -- yeah, it's worth it!
Kevin Grittner wrote:
> Jesper Krogh jes...@krogh.cc wrote:
>> Sorry if it is obvious.. but what filesystem/OS are you using, and do you have BBU-writeback on the main data catalog also?
>
> Sorry for not providing more context.
>
> ATHENA:/var/pgsql/data # uname -a
> Linux ATHENA 2.6.16.60-0.39.3-smp #1 SMP Mon May 11 11:46:34 UTC 2009 x86_64 x86_64 x86_64 GNU/Linux
> ATHENA:/var/pgsql/data # cat /etc/SuSE-release
> SUSE Linux Enterprise Server 10 (x86_64)
> VERSION = 10
> PATCHLEVEL = 2
>
> File system is xfs (noatime,nobarrier) for all data; OS is on ext3. I *think* the pg_xlog mirrored pair is hanging off the same BBU-writeback controller as the big RAID, but I'd have to track down the hardware tech to confirm, and he's out today. System has 16 Xeon CPUs and 64 GB RAM.

I would be surprised if the RAID controller didn't have a BBU-writeback cache. I don't think having xlog share a BBU-writeback cache makes things slower, and if it does, I would love for someone to explain why.

-- Bruce Momjian br...@momjian.us http://momjian.us
EnterpriseDB http://enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
[PERFORM] Dell PERC H700/H800
Just a heads up - apparently the more recent Dell RAID controllers will no longer recognise hard discs that weren't sold through Dell.

http://www.channelregister.co.uk/2010/02/10/dell_perc_11th_gen_qualified_hdds_only/

As one of the comments points out, that kind of makes them no longer SATA or SAS compatible, and they shouldn't be allowed to use those acronyms any more.

Matthew

-- An optimist sees the glass as half full, a pessimist as half empty, and an engineer as having redundant storage capacity.
Re: [PERFORM] Re: 512,600ms query becomes 7500ms... but why? Postgres 8.3 query planner quirk?
On Wed, Feb 10, 2010 at 8:52 PM, Bryce Nesbitt bry...@obviously.com wrote:
> If you guys succeed in making this class of query perform, you'll have beat out the professional consulting firm we hired, which was all but useless! The query is usually slow, but particular combinations of words seem to make it obscenely slow.

Heh heh heh... "professional consulting firm".

> production=# EXPLAIN ANALYZE SELECT context_key FROM article_words WHERE word_key = 3675;
>  Index Scan using article_words_wc on article_words (cost=0.00..21433.53 rows=11309 width=4) (actual time=0.025..7.579 rows=4003 loops=1)
>    Index Cond: (word_key = 3675)
>  Total runtime: 11.704 ms

That's surprisingly inaccurate. Since this table is large:

> production=# explain analyze select count(*) from article_words;
>  Aggregate (cost=263831.63..263831.64 rows=1 width=0) (actual time=35851.654..35851.655 rows=1 loops=1)
>    -> Seq Scan on words (cost=0.00..229311.30 rows=13808130 width=0) (actual time=0.043..21281.124 rows=13808184 loops=1)
>  Total runtime: 35851.723 ms

...you may need to crank up the statistics target. I would probably try cranking it all the way up to the max, though there is a risk that might backfire, in which case you'll need to decrease it again.

ALTER TABLE article_words ALTER COLUMN word_key SET STATISTICS 1000;

That's probably not going to fix your whole problem, but it should be interesting to see whether it makes things better or worse and by how much.

...Robert
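For completeness, one way to apply and then verify the suggestion above (the statement and the 1000 cap follow the post; the re-check query reuses the example already shown):

```sql
-- Raise the per-column statistics target (1000 is the max on 8.3)
-- and re-analyze so the planner builds the larger histogram/MCV list.
ALTER TABLE article_words ALTER COLUMN word_key SET STATISTICS 1000;
ANALYZE article_words;

-- Then compare the planner's row estimate against the actual count:
EXPLAIN ANALYZE
SELECT context_key FROM article_words WHERE word_key = 3675;
```

If the estimated rows move closer to the actual 4003 (they were 11309 before), the larger statistics target is helping.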
Re: [PERFORM] perf problem with huge table
Dave Crooke wrote:

On Wed, Feb 10, 2010 at 5:30 PM, Jon Lewison jlewis...@gmail.com wrote:
> Just a nit, but Oracle implements MVCC. 90% of the databases out there do.

Sorry, I spoke imprecisely. What I meant was the difference in how the rows are stored internally. In Oracle, the main tablespace contains only the newest version of a row, which is (where possible) updated in place - queries in a transaction that can still see an older version have to pull it from the UNDO tablespace (rollback segments in Oracle 8 and older). In Postgres, all versions of all rows are in the main table, and have validity ranges associated with them ("this version of this row existed between transaction ids x and y"). Once a version goes out of scope, it has to be garbage collected by the vacuuming process so the space can be re-used.

In general, this means Oracle is faster *if* you're only doing lots of small transactions (consider how these different models handle an update to a single field in a single row), but it is more sensitive to the scale of transactions: doing a really big transaction against a database with an OLTP workload can upset Oracle's digestion, as it causes a lot of UNDO lookups, whereas PG's performance is a lot more predictable in this regard. Both models have benefits and drawbacks; when designing a schema for performance it's important to understand these differences.

> I find partitioning pretty useful in this scenario if the data allows it. Aging out data just means dropping a partition rather than a delete statement.

Forgot to say this - yes, I absolutely agree that dropping a table is a lot cheaper than a transactional delete. In general, I think partitioning is more important / beneficial with PG's style of MVCC than with Oracle or SQL Server (which I think is closer to Oracle than to PG).
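The aging-out pattern mentioned above can be sketched with 8.x-era inheritance partitioning (table and column names here are illustrative): each time range lives in its own child table, so retiring old data is a catalog operation rather than a vacuum-generating bulk DELETE.

```sql
-- Parent table plus one child per month, constrained by a CHECK so
-- constraint_exclusion can skip irrelevant partitions.
CREATE TABLE measurements (ts timestamp, value numeric);

CREATE TABLE measurements_2010_01 (
    CHECK (ts >= '2010-01-01' AND ts < '2010-02-01')
) INHERITS (measurements);

-- Aging out January is a cheap catalog operation:
DROP TABLE measurements_2010_01;

-- ...instead of a transactional bulk delete that leaves dead tuples:
-- DELETE FROM measurements WHERE ts < '2010-02-01';
```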
I would like to disagree here a little bit.

Where Oracle's table partitioning comes in very handy is, for example, when you have to replace the data of a big (read-only) table on a regular basis (typically data replicated from another system). In this case, you just create a partitioned table with exactly the same columns/indexes etc. as the data table. To load, you then load the data into the partitioned table, i.e.:

- truncate the partitioned table, disable constraints, drop indexes
- load the data into the partitioned table
- rebuild all indexes etc. on the partitioned table

During all this time (even if it takes hours) the application can still access the data in the data table without the bulk load interfering. Once you have prepared the data in the partitioned table, you exchange the partition with the data table, which is a dictionary operation. That means the application is (if ever) only blocked during this operation, which is in the sub-second range. If you had to do this with conventional updates or deletes/inserts instead, it might not even be possible in the given timeframe.

Just as an example,
Leo

p.s. Just to make it a little bit clearer about the famous ORA-01555: Oracle is not "forgetting" the data, as the Oracle RDBMS is of course also ACID-compliant. The ORA-01555 can happen:

- when the rollback tablespace is really too small to hold all the data changed in the transaction (which I consider a configuration error)
- when a long running (read) transaction needs a record which has already been updated AND COMMITTED by another transaction.

The key here is that a second transaction has changed a record which is also needed by the first transaction, and the second transaction committed the work. Committing the change means the data in the rollback segment is no longer needed, as it can be read directly from the data block (after all, it is committed, which means valid and visible to other transactions).
If the first transaction now tries to read the data from the rollback segment to see the unchanged state, it will still succeed (it is still there; nothing has happened to the rollback segment so far). The problem of the ORA-01555 shows up only if a third transaction now needs space in the rollback segment. As the entry from the first/second transaction is marked committed (and therefore no longer needed), it is perfectly valid for transaction #3 to grab this rollback segment and store its old value there. If THEN (and only then) transaction #1 comes back, asking for the old, unchanged value from when its transaction started, THEN the famous ORA-01555 is raised, as this value has now been overwritten by transaction #3. That's why in newer versions you have to set the retention time of the rollback blocks/segments to a value bigger than your expected longest transaction. This will decrease the likelihood of the ORA-01555 drastically (but it is still not zero, as you could easily construct an example where it still fails with ORA-01555, since a transaction can still run longer than you expect).
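Returning to the load-and-exchange pattern described earlier in this message: PostgreSQL has no EXCHANGE PARTITION, but a roughly equivalent effect can be sketched (table names are hypothetical) by loading into a shadow table and swapping it in with renames inside one transaction, so readers are blocked only for the duration of the catalog change.

```sql
-- Build the fresh copy without touching the live table...
CREATE TABLE lookup_new (LIKE lookup INCLUDING DEFAULTS);
-- COPY / INSERT the replicated data into lookup_new here,
-- then build indexes on it:
CREATE INDEX lookup_new_key_idx ON lookup_new (key);

-- ...and swap: the rename is a dictionary operation, so the
-- application is blocked only for this brief transaction.
BEGIN;
ALTER TABLE lookup RENAME TO lookup_old;
ALTER TABLE lookup_new RENAME TO lookup;
COMMIT;

DROP TABLE lookup_old;
```

Note the swapped-in table's indexes keep their "_new" names, and dependent views/foreign keys do not follow the rename, so this sketch fits best for standalone, bulk-replaced lookup tables.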
Re: [PERFORM] moving pg_xlog -- yeah, it's worth it!
On Thu, Feb 11, 2010 at 4:29 AM, Bruce Momjian br...@momjian.us wrote:
> Kevin Grittner wrote:
>> Jesper Krogh jes...@krogh.cc wrote:
>>> Sorry if it is obvious.. but what filesystem/OS are you using and do you have BBU-writeback on the main data catalog also?
>>
>> File system is xfs noatime,nobarrier for all data; OS is on ext3. I *think* the pg_xlog mirrored pair is hanging off the same BBU-writeback controller as the big RAID, but I'd have to track down the hardware tech to confirm, and he's out today. System has 16 Xeon CPUs and 64 GB RAM.
>
> I would be surprised if the RAID controller didn't have a BBU-writeback cache. I don't think having xlog share a BBU-writeback makes things slower, and if it does, I would love for someone to explain why.

I believe in the past when this discussion showed up, it was mainly pg_xlog being on the same file system as the data (and not the lack of separate hardware) that made the biggest difference. I recall there being a noticeable performance gain even from having two file systems on the same logical RAID device.
Re: [PERFORM] Dell PERC H700/H800
On Thu, 2010-02-11 at 12:39 +, Matthew Wakeling wrote:
> Just a heads up - apparently the more recent Dell RAID controllers will no longer recognise hard discs that weren't sold through Dell.
> http://www.channelregister.co.uk/2010/02/10/dell_perc_11th_gen_qualified_hdds_only/
> As one of the comments points out, that kind of makes them no longer SATA or SAS compatible, and they shouldn't be allowed to use those acronyms any more.

That's interesting. I know that IBM, at least on some of their models, have done the same. Glad I use HP :)

Joshua D. Drake

-- PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering
Respect is earned, not gained through arbitrary and repetitive use of "Mr." or "Sir."
Re: [PERFORM] Dell PERC H700/H800
On Thu, Feb 11, 2010 at 5:39 AM, Matthew Wakeling matt...@flymine.org wrote:
> Just a heads up - apparently the more recent Dell RAID controllers will no longer recognise hard discs that weren't sold through Dell.
> http://www.channelregister.co.uk/2010/02/10/dell_perc_11th_gen_qualified_hdds_only/
> As one of the comments points out, that kind of makes them no longer SATA or SAS compatible, and they shouldn't be allowed to use those acronyms any more.

Yet one more reason I'm glad I no longer source servers from Dell. I just ask my guy at Aberdeen if he thinks drive X is a good choice, we discuss it like adults, and I make my decision. And I generally listen to him because he's usually right. But I'd spit nails if my RAID controller refused to work with whatever drives I decided to plug into it.
Re: [PERFORM] moving pg_xlog -- yeah, it's worth it!
Kevin Grittner kevin.gritt...@wicourts.gov wrote:
> Jesper Krogh jes...@krogh.cc wrote:
>> Sorry if it is obvious.. but what filesystem/OS are you using and do you have BBU-writeback on the main data catalog also?
>
> File system is xfs noatime,nobarrier for all data; OS is on ext3. I *think* the pg_xlog mirrored pair is hanging off the same BBU-writeback controller as the big RAID, but I'd have to track down the hardware tech to confirm

Another example of why I shouldn't trust my memory. Per the hardware tech:

OS: /dev/sda is RAID1 - 2 x 2.5" 15k SAS disks
pg_xlog: /dev/sdb is RAID1 - 2 x 2.5" 15k SAS disks
These reside on a ServeRAID-MR10k controller with 256MB BB cache.

data: /dev/sdc is RAID5 - 30 x 3.5" 15k SAS disks
These reside on the DS3200 disk subsystem with 512MB BB cache per controller and redundant drive loops.

At least I had the file systems and options right. ;-)

-Kevin
Re: [PERFORM] moving pg_xlog -- yeah, it's worth it!
Kevin Grittner wrote:
> Another example of why I shouldn't trust my memory. Per the hardware tech:
>
> OS: /dev/sda is RAID1 - 2 x 2.5" 15k SAS disks
> pg_xlog: /dev/sdb is RAID1 - 2 x 2.5" 15k SAS disks
> These reside on a ServeRAID-MR10k controller with 256MB BB cache.
>
> data: /dev/sdc is RAID5 - 30 x 3.5" 15k SAS disks
> These reside on the DS3200 disk subsystem with 512MB BB cache per controller and redundant drive loops.

Hmm, so maybe the performance benefit is not from it being on a separate array, but from it being RAID1 instead of RAID5?

-- Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [PERFORM] moving pg_xlog -- yeah, it's worth it!
* Alvaro Herrera alvhe...@commandprompt.com [100211 12:58]:
> Hmm, so maybe the performance benefit is not from it being on a separate array, but from it being RAID1 instead of RAID5?

Or the cumulative effects of:
1) Dedicated spindles/RAID1
2) More BBU cache available (I can't imagine the OS pair writing much)
3) Not being queued behind data writes before getting to the controller
4) Not waiting for BBU cache to become available (which is shared with all data writes), which requires RAID5 writes to complete first...

Really, there are *lots* of variables here. The basics being that WAL on the same FS as data, on a RAID5, even with BBU, is worse than WAL on a dedicated set of RAID1 spindles with its own BBU. Wow! ;-)

-- Aidan Van Dyk ai...@highrise.ca http://www.highrise.ca/
Create like a god, command like a king, work like a slave.
Re: [PERFORM] moving pg_xlog -- yeah, it's worth it!
Aidan Van Dyk ai...@highrise.ca wrote:
> Alvaro Herrera alvhe...@commandprompt.com wrote:
>> Hmm, so maybe the performance benefit is not from it being on a separate array, but from it being RAID1 instead of RAID5?
>
> Or the cumulative effects of:
> 1) Dedicated spindles/RAID1
> 2) More BBU cache available (I can't imagine the OS pair writing much)
> 3) Not being queued behind data writes before getting to the controller
> 4) Not waiting for BBU cache to become available (which is shared with all data writes), which requires RAID5 writes to complete first...
>
> Really, there are *lots* of variables here. The basics being that WAL on the same FS as data, on a RAID5, even with BBU, is worse than WAL on a dedicated set of RAID1 spindles with its own BBU. Wow!

Sure, OK, but what surprised me was that a set of 15 read-only queries (with pretty random reads) took almost twice as long when the WAL files were on the same file system. That's with OS writes being only about 10% of reads, and *that's* with 128 GB of RAM, which keeps a lot of the reads from having to go to disk. I would not have expected that a read-mostly environment like this would be so sensitive to WAL file placement. (OK, I *did* request the separate file system for them anyway, but I thought it was going to be a marginal benefit, not something this big.)

-Kevin
Re: [PERFORM] Dell PERC H700/H800
2010/2/11 James Mansion ja...@mansionfamily.plus.com:
> Matthew Wakeling wrote:
>> Just a heads up - apparently the more recent Dell RAID controllers will no longer recognise hard discs that weren't sold through Dell.
>> http://www.channelregister.co.uk/2010/02/10/dell_perc_11th_gen_qualified_hdds_only/
>
> I think that's potentially FUD. It's all about 'Dell qualified drives'. I can't see anything that suggests that Dell will OEM drives and somehow tag them so that the drive must have come from them. Do you have any further detail?

For example: the SAMSUNG MCCOE50G, a 50GB SSD which you can buy only from Dell. It's unknown on Samsung's pages. I think they can easily order their own model.

-- Łukasz Jagiełło
System Administrator
G-Forces Web Management Polska sp. z o.o. (www.gforces.pl)
Ul. Kruczkowskiego 12, 80-288 Gdańsk
Spółka wpisana do KRS pod nr 246596 decyzją Sądu Rejonowego Gdańsk-Północ
Re: [PERFORM] Dell PERC H700/H800
On Thu, Feb 11, 2010 at 1:11 PM, James Mansion ja...@mansionfamily.plus.com wrote:
> I think that's potentially FUD. It's all about 'Dell qualified drives'. I can't see anything that suggests that Dell will OEM drives and somehow tag them so that the drive must have come from them. Do you have any further detail?

In a post to the Dell mailing list ( http://lists.us.dell.com/pipermail/linux-poweredge/2010-February/041335.html ) it was pointed out that the user had installed Seagate ES.2 drives, which are enterprise-class drives that have been around a while and are pretty much the standard enterprise-class SATA drive, and are listed as such by Seagate:

http://www.seagate.com/www/en-us/products/servers/barracuda_es/barracuda_es.2

These drives were marked as BLOCKED and unusable by the system. The PDF linked to in the Dell forum specifically states that the hard drives are loaded with a Dell-specific firmware. The PDF seems otherwise free of useful information, and is mostly a marketing tool as near as I can tell.
Re: [PERFORM] Dell PERC H700/H800
Matthew Wakeling wrote:
> Just a heads up - apparently the more recent Dell RAID controllers will no longer recognise hard discs that weren't sold through Dell.
> http://www.channelregister.co.uk/2010/02/10/dell_perc_11th_gen_qualified_hdds_only/
> As one of the comments points out, that kind of makes them no longer SATA or SAS compatible, and they shouldn't be allowed to use those acronyms any more.

I think that's potentially FUD. It's all about 'Dell qualified drives'. I can't see anything that suggests that Dell will OEM drives and somehow tag them so that the drive must have come from them. Of course they are big enough that they could have a special BIOS, I guess, but I read it as saying that the drive types (and presumably revisions thereof) had to be recognised by the controller from a list, which presumably can be reflashed - which is not quite saying that if some WD enterprise drive model is 'qualified' then you have to buy it from Dell. Do you have any further detail?