Re: [PERFORM] Finding bottleneck
I think I have a solution for you. You have posted that you presently have these RAID volumes and behaviors:

sda: data (10 spindles, raid10)
sdb: xlog & clog (2 spindles, raid1)
sdc: os and other stuff

Usually iostat (2 second interval) says:

avg-cpu:  %user  %nice  %sys  %iowait  %idle
          32.38   0.00  12.88    11.62  43.12

Device:    tps      kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda        202.00   1720.00    0.00       3440     0
sdb        152.50   4.00       2724.00    8        5448
sdc        0.00     0.00       0.00       0        0

And during checkpoint:

avg-cpu:  %user  %nice  %sys  %iowait  %idle
          31.25   0.00  14.75    54.00   0.00

Device:    tps      kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda        3225.50  1562.00    35144.00   3124     70288
sdb        104.50   10.00      2348.00    20       4696
sdc        0.00     0.00       0.00       0        0

During checkpoints sda is becoming saturated, essentially halting all other DB activity involving sda. A lesser version of the problem is probably occurring every time multiple entities on sda are being accessed simultaneously, particularly simultaneous writes.

My Proposed Solution:

Put comment and its index on their own dedicated RAID volume.
Put comment_archive and its index on their own dedicated RAID volume.
Put the rest of the tables currently part of "data" on their own dedicated RAID volume.
Put the rest of the indexes to the tables currently part of "data" on their own dedicated RAID volume.
Put xlog on its own dedicated RAID volume.

The general idea here is to put any tables or indexes that tend to require simultaneous access, particularly write access, on different spindles. Like all things, there's a point of diminishing returns that is dependent on the HW used and the DB load. If you must wring every last bit of IO out of the HD subsystem, a more exact set of spindle assignments can be made by analyzing your queries and then 1) making sure writes that tend to be simultaneous go to different spindles, then (if you still need better IO) 2) making sure reads that tend to be simultaneous go to different spindles. At some point, your controller will become the bottleneck. At some point beyond that, the IO channels on the mainboard will become the bottleneck. My suggestion should get you to within 80-90% of optimal if I've understood the implications of your posts correctly.

The other suggestion I'd make is to bump your RAM from 16GB to 32GB as soon as you can afford it and then tune your PostgreSQL parameters to make best use of it. The more RAM resident your DB, the better.

Hope this helps,
Ron Peacetree

===Original Message Follows===
From: Kari Lavikka
To: Merlin Moncure
Subject: Re: Finding bottleneck
Date: Mon, 8 Aug 2005 19:19:09 +0300 (EETDST)

Actually I modified postgresql.conf a bit and there isn't commit delay any more. That didn't make a noticeable difference though.

Workload is generated by a website with about 1000 dynamic page views a second. Finland's biggest site among youths btw. Anyway, there are about 70 tables and here are some of the most important:

     relname     |  reltuples
-----------------+-------------
 comment         | 1.00723e+08
 comment_archive | 9.12764e+07
 channel_comment | 6.93912e+06
 image           | 5.80314e+06
 admin_event     | 5.1936e+06
 user_channel    | 3.36877e+06
 users           | 325929
 channel         | 252267

Queries to "comment" table are mostly IO-bound but are performing quite well.
Here's an example:

(SELECT u.nick, c.comment, c.private, c.admin, c.visible, c.parsable,
        c.uid_sender, to_char(c.stamp, 'DD.MM.YY HH24:MI') AS stamp,
        c.comment_id
   FROM comment c
   INNER JOIN users u ON u.uid = c.uid_sender
  WHERE u.status = 'a'
    AND c.image_id = 15500900
    AND c.uid_target = 780345
  ORDER BY uid_target DESC, image_id DESC, c.comment_id DESC)
LIMIT 36

And explain analyze:

 Limit  (cost=0.00..6.81 rows=1 width=103) (actual time=0.263..17.522 rows=12 loops=1)
   ->  Nested Loop  (cost=0.00..6.81 rows=1 width=103) (actual time=0.261..17.509 rows=12 loops=1)
         ->  Index Scan Backward using comment_uid_target_image_id_comment_id_20050527 on "comment" c  (cost=0.00..3.39 rows=1 width=92) (actual time=0.129..16.213 rows=12 loops=1)
               Index Cond: ((uid_target = 780345) AND (image_id = 15500900))
         ->  Index Scan using users_pkey on users u  (cost=0.00..3.40 rows=1 width=15) (actual time=0.084..0.085 rows=1 loops=12)
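For what it's worth, the volume split proposed above maps directly onto the tablespace feature in PostgreSQL 8.0. A minimal sketch, assuming hypothetical mount points for the new RAID volumes (the /vol_* paths and tablespace names below are made up for illustration and have not been tested against Kari's schema):

-- one tablespace per dedicated RAID volume
CREATE TABLESPACE comment_data LOCATION '/vol_comment/pgdata';
CREATE TABLESPACE comment_idx  LOCATION '/vol_comment_idx/pgdata';
CREATE TABLESPACE archive_data LOCATION '/vol_archive/pgdata';
CREATE TABLESPACE archive_idx  LOCATION '/vol_archive_idx/pgdata';

-- move the hot table and its index; each move rewrites the relation,
-- so schedule it for a quiet window
ALTER TABLE comment SET TABLESPACE comment_data;
ALTER INDEX comment_uid_target_image_id_comment_id_20050527 SET TABLESPACE comment_idx;
ALTER TABLE comment_archive SET TABLESPACE archive_data;
-- (repeat for comment_archive's index and for the remaining "data" tables/indexes)

-- pg_xlog is relocated at the filesystem level (move the directory and symlink it),
-- not via tablespaces.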
Re: [PERFORM] Need for speed
At 05:15 AM 8/17/2005, Ulrich Wisser wrote:

>Hello,
>
>thanks for all your suggestions. I can see that the Linux system is 90% waiting for disc io.

A clear indication that you need to improve your HD IO subsystem.

>At that time all my queries are *very* slow.

To be more precise, your server performance at that point is essentially equal to your HD IO subsystem performance.

>My scsi raid controller and disc are already the fastest available.

Oh, REALLY? This is the description of the system you gave us:

"We have a box with Linux Fedora Core 3, Postgres 7.4.2 Intel(R) Pentium(R) 4 CPU 2.40GHz 2 scsi 76GB disks (15.000RPM, 2ms)"

That is far, Far, FAR from "the fastest available" in terms of SW, OS, CPU host, _or_ HD subsystem.

The "fastest available" means
1= you should be running 8.0.3
2= you should be running the latest stable 2.6 based kernel
3= you should be running an Opteron based server
4= Fibre Channel HDs are higher performance than SCSI ones.
5= (and this is the big one) YOU NEED MORE SPINDLES AND A HIGHER END RAID CONTROLLER.

The absolute "top of the line" for RAID controllers is something based on Fibre Channel from Xyratex (who make the RAID engines for EMC and NetApps), Engino (the enterprise division of LSI Logic who sell mostly to IBM. Apple has a server based on an Engino card), or dot-hill (who bought Chaparral among others). I suspect you can't afford them even if they would do business with you. The ante for a FC-based RAID subsystem in this class is in the ~$32K to ~$128K range, even if you buy direct from the actual RAID HW manufacturer rather than an OEM like EMC, IBM, or NetApp.

In the retail commodity market, the current best RAID controllers are probably the 16 and 24 port versions of the Areca cards ( www.areca.us ). They come darn close to saturating the Real World Peak Bandwidth of a 64b 133MHz PCI-X bus.

>I did put pg_xlog on another file system on other discs. The query plan uses indexes and "vacuum analyze" is run once a day.
>
>To avoid aggregating too many rows, I already made some aggregation tables which will be updated after the import from the Apache logfiles. That did help, but only to a certain level. I believe the biggest problem is disc io.
>
>Reports for very recent data are quite fast, these are used very often and therefore already in the cache. But reports can contain (and regularly do) very old data. In that case the whole system slows down. To me this sounds like the recent data is flushed out of the cache and now all data for all queries has to be fetched from disc.
>
>My machine has 2GB memory,

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] Need for speed
At 05:15 AM 8/17/2005, Ulrich Wisser wrote:

>Hello,
>
>thanks for all your suggestions. I can see that the Linux system is 90% waiting for disc io.

A clear indication that you need to improve your HD IO subsystem if possible.

>At that time all my queries are *very* slow.

To be more precise, your server performance at that point is essentially equal to your HD IO subsystem performance.

>My scsi raid controller and disc are already the fastest available.

Oh, REALLY? This is the description of the system you gave us:

"We have a box with Linux Fedora Core 3, Postgres 7.4.2 Intel(R) Pentium(R) 4 CPU 2.40GHz 2 scsi 76GB disks (15.000RPM, 2ms)"

That is far, Far, FAR from "the fastest available" in terms of SW, OS, CPU host, _or_ HD subsystem.

The "fastest available" means
1= you should be running PostgreSQL 8.0.3
2= you should be running the latest stable 2.6 based kernel
3= you should be running an Opteron based server
4= Fibre Channel HDs are slightly higher performance than SCSI ones.
5= (and this is the big one) YOU NEED MORE SPINDLES AND A HIGHER END RAID CONTROLLER.

Your description of your workload was:

"one of our services is click counting for on line advertising. We do this by importing Apache log files every five minutes. This results in a lot of insert and delete statements. At the same time our customers shall be able to do on line reporting."

There are two issues here:

1= your primary usage is OLTP-like, but you are also expecting to do reports against the same schema that is supporting your OLTP-like usage. Bad Idea. Schemas that are optimized for reporting and other data mining-like operations are pessimal for OLTP-like applications and vice versa. You need two schemas: one optimized for lots of inserts and deletes (OLTP-like), and one optimized for reporting (data-mining like).

2= 2 spindles, even 15K rpm spindles, is minuscule. Real enterprise class RAID subsystems have at least 10-20x that many spindles, usually split into 6-12 sets dedicated to different groups of tables in the DB. Putting xlog on its own dedicated spindles is just the first step.

The absolute "top of the line" for RAID controllers is something based on Fibre Channel from Xyratex (who make the RAID engines for EMC and NetApps), Engino (the enterprise division of LSI Logic who sell mostly to IBM. Apple has a server based on an Engino card), or dot-hill (who bought Chaparral among others). I suspect you can't afford them even if they would do business with you. The ante for a FC-based RAID subsystem in this class is in the ~$32K to ~$128K range, even if you buy direct from the actual RAID HW manufacturer rather than an OEM like EMC, IBM, or NetApp who will 2x or 4x the price. OTOH, these subsystems will provide OLTP or OLTP-like DB apps with performance that is head-and-shoulders better than anything else to be found. Numbers like 50K-200K IOPS. You get what you pay for.

In the retail commodity market where you are more realistically going to be buying, the current best RAID controllers are probably the Areca cards ( www.areca.us ). They come darn close to saturating the Real World Peak Bandwidth of a 64b 133MHz PCI-X bus and have better IOPS numbers than their commodity brethren. However, _none_ of the commodity RAID cards have IOPS numbers anywhere near as high as those mentioned above.

>To avoid aggregating too many rows, I already made some aggregation tables which will be updated after the import from the Apache logfiles. That did help, but only to a certain level. I believe the biggest problem is disc io.
>Reports for very recent data are quite fast, these are used very often and therefore already in the cache. But reports can contain (and regularly do) very old data. In that case the whole system slows down. To me this sounds like the recent data is flushed out of the cache and now all data for all queries has to be fetched from disc.

I completely agree. Hopefully my above suggestions make sense and are of use to you.

>My machine has 2GB memory,

...and while we are at it, OLTP-like apps benefit less from RAM than data mining ones, but still 2GB of RAM is just not that much for a real DB server...

Ron Peacetree

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq
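To make the "two schemas" advice above a bit more concrete, here is a minimal sketch of what the reporting side could look like: a summary table that the five-minute import job refreshes after loading the raw Apache log rows. Every table and column name below is invented for illustration; the real schema will obviously differ.

-- reporting-side summary table, kept apart from the raw click tables
CREATE TABLE clicks_per_ad_daily (
    ad_id     integer NOT NULL,
    click_day date    NOT NULL,
    clicks    bigint  NOT NULL,
    PRIMARY KEY (ad_id, click_day)
);

-- run once after each five-minute import:
-- throw away today's partial totals and rebuild them from the raw data
BEGIN;
DELETE FROM clicks_per_ad_daily WHERE click_day = current_date;
INSERT INTO clicks_per_ad_daily (ad_id, click_day, clicks)
SELECT ad_id, current_date, count(*)
  FROM raw_clicks
 WHERE stamp >= current_date
 GROUP BY ad_id;
COMMIT;

Reports then read only the small summary table, while the insert/delete churn stays on the raw tables (ideally on their own spindles, per the advice above).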
Re: [PERFORM] limit number of concurrent callers to a stored
At 09:40 PM 8/17/2005, Alan Stange wrote:

>is there a simple way to limit the number of concurrent callers to a stored proc?
>
>The problem we have is about 50 clients come and perform the same operation at nearly the same time. Typically, this query takes a few seconds to run, but in the case of this thundering herd the query time jumps to 70 seconds or much more. The query can return up to 15MB of data.

I'm assuming there is some significant write activity going on at some point as a result of the query, since MVCC should not care about concurrent read activity?

Is that "a few seconds each query" or "a few seconds total if we run 50 queries sequentially but 70+ seconds per query if we try to run 50 queries concurrently"?

A) If the former, "a few seconds" * 50 can easily be 70+ seconds, and things are what you should expect. Getting higher performance in that situation means reducing per query times, which may or may not be easy. Looking at the stored procedure code with an eye towards optimization would be a good place to start.

B) If the latter, then table access contention is driving performance into the ground, and there are a few things you can try:

1= lock the table(s) under these circumstances so only one query of the 50 can be acting on it at a time. If the table(s) is/are small enough to be made RAM resident, this may be a particularly low-cost, low-effort, reasonable solution (see the sketch below).

2= put a queue into place and only let some small number n of queries run against the table(s) concurrently. Adjust n until you get best performance. There are a few ways this could be done.

3= Buy a SSD and put the table(s) in question on it. IIRC, 3.5" format SSDs that can "drop in" replace HDs are available in up to 147GB capacities.

>The machine is a dual opteron, 8 GB memory, lots of fiber channel disk, Linux 2.6, etc.
>
>So, I'm thinking that a semaphore that will block more than N clients from being in the core of the function at one time would be a good thing.

This will only help in case "B" above. If you go the "hard" route of using systems programming, you will have a lot of details that must be paid attention to correctly or Bad Things (tm) will happen. Putting the semaphore in place is the tip of the iceberg.

Hope this helps,
Ron Peacetree

---(end of broadcast)---
TIP 4: Have you searched our list archives?

http://archives.postgresql.org
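As a rough illustration of option 1= above (serializing the heavy callers at the table level), something along these lines could wrap the call. The table and function names are placeholders, not Alan's actual objects, and this is an untested sketch:

-- EXCLUSIVE mode conflicts with itself but not with plain SELECTs,
-- so only one of the ~50 identical callers runs the expensive part at a time
-- while ordinary readers of the table are not blocked.
BEGIN;
LOCK TABLE big_report_table IN EXCLUSIVE MODE;
SELECT * FROM expensive_report_proc(42);  -- placeholder call to the stored proc in question
COMMIT;  -- releases the lock, letting the next waiting caller proceed

Option 2= (a queue admitting only n concurrent callers) needs either application-side coordination or the kind of systems programming Ron warns about above.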
Re: [PERFORM] extremly low memory usage
At 01:55 PM 8/18/2005, John Arbash Meinel wrote:

Jeremiah Jahn wrote:

>here's an example standard query. I really have to make the first hit go
>faster. The table is clustered on full_name as well. 'Smith%'
>took 87 seconds on the first hit. I wonder if I set up my array wrong.
>I remember seeing something about DMA access versus something else, and
>chose DMA access. LVM maybe?

It would be nice if you would format your queries to be a little bit easier to read before posting them. However, I believe I am reading it correctly, to say that the index scan on identity is not your slow point. In fact, as near as I can tell, it only takes 52ms to complete. The expensive parts are the 4915 lookups into the litigant_details (each one takes approx 4ms for a total of ~20s). And then you do it again on case_data (average 3ms each * 4906 loops = ~15s).

How big are litigant_details and case_data? If they can fit in RAM, preload them using methods like the "cat to /dev/null" trick and those table lookups will be ~100-1000x faster. If they won't fit into RAM but the machine can be expanded to hold enough RAM to fit the tables, it's well worth the ~$75-$150/GB to upgrade the server so that the tables will fit into RAM.

If they can't be made to fit into RAM as atomic entities, you have a few choices:

A= Put the data tables and indexes on separate dedicated spindles and put litigant_details and case_data each on their own dedicated spindles. This will lower seek conflicts. Again, if this requires buying some more HDs, it's well worth it.

B= Break litigant_details and case_data into a set of smaller tables (based on something sane like the first n characters of the primary key) such that the smaller tables easily fit into RAM. Given that you've said only 10GB/60GB is "hot", this could work very well. Combine it with "A" above (put all the litigant_details sub tables on one dedicated spindle set and all the case_data sub tables on another spindle set) for added oomph.

C= Buy a SSD big enough to hold litigant_details and case_data and put them there. Again, this can be combined with "A" and "B" above to lessen the size of the SSD needed.

So there is no need for preloading your indexes on the identity table. It is definitely not the bottleneck.

So a few design bits, which may help your database. Why is "actor_id" a text field instead of a number? You could try creating an index on "litigant_details (actor_id, court_ori)" so that it can do just an index lookup, rather than an index + filter.

Yes, that certainly sounds like it would be more efficient.

More importantly, though, the planner seems to think the join of identity to litigant_details will only return 1 row, not 5000. Do you regularly vacuum analyze your tables? Just as a test, try running:

set enable_nestloop to off;

And then run EXPLAIN ANALYZE again, just to see if it is faster. You probably need to increase some statistics targets, so that the planner can design better plans.
>  ->  Nested Loop  (cost=0.00..20411.83 rows=2 width=173) (actual time=12.891..38317.017 rows=4906 loops=1)
>        ->  Nested Loop  (cost=0.00..20406.48 rows=1 width=159) (actual time=12.826..23232.106 rows=4906 loops=1)
>              ->  Nested Loop  (cost=0.00..20403.18 rows=1 width=138) (actual time=12.751..22885.439 rows=4906 loops=1)
>                    Join Filter: (("outer".case_id)::text = ("inner".case_id)::text)
>                    ->  Index Scan using name_speed on identity  (cost=0.00..1042.34 rows=4868 width=82) (actual time=0.142..52.538 rows=4915 loops=1)
>                          Index Cond: (((full_name)::text >= 'MILLER'::character varying) AND ((full_name)::text < 'MILLES'::character varying))
>                          Filter: (((court_ori)::text = 'IL081025J'::text) AND ((full_name)::text ~~ 'MILLER%'::text))
>                    ->  Index Scan using lit_actor_speed on litigant_details  (cost=0.00..3.96 rows=1 width=81) (actual time=4.631..4.635 rows=1 loops=4915)
>                          Index Cond: (("outer".actor_id)::text = (litigant_details.actor_id)::text)
>                          Filter: ('IL081025J'::text = (court_ori)::text)
>              ->  Seq Scan on court  (cost=0.00..3.29 rows=1 width=33) (actual time=0.053..0.062 rows=1 loops=4906)
>                    Filter: ('IL081025J'::text = (id)::text)
>        ->  Index Scan using case_speed on case_data  (cost=0.00..5.29 rows=3 width=53) (actual time=3.049..3.058 rows=1 loops=4906)
>              Index Cond: (('IL081025J'::text = (case_data.court_ori)::text) AND ((case_data.case_id)::text = ("outer".case_id)::text))

---(end of broadcast)---
TIP 6: explain analyze is your friend
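A sketch of the specific experiments suggested above, in SQL form. The composite index name and the statistics target of 200 are illustrative guesses rather than tested values, and the column choices assume the filters shown in the plan:

-- let the litigant_details probe be satisfied by the index alone (no extra filter step)
CREATE INDEX lit_actor_court_idx ON litigant_details (actor_id, court_ori);

-- give the planner better row estimates for the mis-estimated join
ALTER TABLE litigant_details ALTER COLUMN actor_id SET STATISTICS 200;
ANALYZE litigant_details;
ANALYZE identity;

-- quick planner experiment, for this session only
SET enable_nestloop TO off;
-- ...then re-run EXPLAIN ANALYZE on the original query and compare plans...
SET enable_nestloop TO on;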
Re: [PERFORM] sustained update load of 1-2k/sec
Alex mentions a nice setup, but I'm pretty sure I know how to beat that IO subsystem HW's performance by at least 1.5x or 2x. Possibly more. (No, I do NOT work for any vendor I'm about to discuss.)

Start by replacing the WD Raptors with Maxtor Atlas 15K II's. At 5.5ms average access, 97.4MB/s outer track throughput, 85.9MB/s average, and 74.4MB/s inner track throughput, they have the best performance characteristics of any tested shipping HDs I know of. (Supposedly the new SAS versions will _sustain_ ~98MB/s, but I'll believe that only if I see it under independent testing.) In comparison, the numbers on the WD740GD are 8.1ms average access, 71.8, 62.9, and 53.9 MB/s outer, average and inner track throughputs respectively. Be prepared to use as many of them as possible (read: as many as you can afford) if you want to maximize transaction rates, particularly for small transactions like this application seems to be mentioning.

Next, use a better RAID card. The TOL enterprise stuff (Xyratex, Engino, Dot-hill) is probably too expensive, but in the commodity market benchmarks indicate that Areca's 1GB buffer RAID cards currently outperform all the other commodity RAID stuff. 9 Atlas II's per card in a RAID 5 set, or 16 per card in a RAID 10 set, should max the RAID card's throughput and come very close to, if not attaining, the real world peak bandwidth of the 64b 133MHz PCI-X bus they are plugged into. Say somewhere in the 700-800MB/s range.

Repeat the above for as many independent PCI-X buses as you have for a very fast commodity RAID IO subsystem. Two such configured cards used in the same manner as mentioned by Alex should easily attain 1.5x - 2x the transaction numbers mentioned by Alex unless there's a bottleneck somewhere else in the system design.

Hope this helps,
Ron Peacetree

At 08:40 AM 8/19/2005, Alex Turner wrote:

I have managed tx speeds that high from postgresql going even as high as 2500/sec for small tables, but it does require a good RAID controller card (yes I'm even running with fsync on). I'm using 3ware 9500S-8MI with Raptor drives in multiple RAID 10s. The box wasn't too $$$ at just around $7k. I have two independent controllers on two independent PCI buses to give max throughput, one with a 6 drive RAID 10 and the other with two 4 drive RAID 10s.

Alex Turner
NetEconomist

On 8/19/05, Mark Cotner <[EMAIL PROTECTED]> wrote:
> Hi all,
> I bet you get tired of the same ole questions over and over.
>
> I'm currently working on an application that will poll thousands of cable
> modems per minute and I would like to use PostgreSQL to maintain state
> between polls of each device. This requires a very heavy amount of updates
> in place on a reasonably large table (100k-500k rows, ~7 columns mostly
> integers/bigint). Each row will be refreshed every 15 minutes, or at least
> that's how fast I can poll via SNMP. I hope I can tune the DB to keep up.
>
> The app is threaded and will likely have well over 100 concurrent db
> connections. Temp tables for storage aren't a preferred option since this
> is designed to be a shared nothing approach and I will likely have several
> polling processes.
>
> Here are some of my assumptions so far . . .
>
> HUGE WAL
> Vacuum hourly if not more often
>
> I'm getting 1700tx/sec from MySQL and I would REALLY prefer to use PG.
> I don't need to match the number, just get close.
>
> Is there a global temp table option? In memory tables would be very
> beneficial in this case. I could just flush it to disk occasionally with
> an insert into blah select from memory table.
>
> Any help or creative alternatives would be greatly appreciated. :)
>
> 'njoy,
> Mark
>
> --
> Writing software requires an intelligent person,
> creating functional art requires an artist.
> -- Unknown
>
> ---(end of broadcast)---
> TIP 3: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
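Since Mark asks about in-memory tables flushed to disk periodically, here is one hedged way that idea is usually approximated in plain PostgreSQL: each polling process batches its SNMP results into a per-session temp table and applies them to the real table in one transaction. All table and column names here are invented for this sketch, not anything from Mark's application.

-- per-connection staging area; backed by local temp buffers
CREATE TEMP TABLE modem_poll_staging (
    modem_id  bigint    NOT NULL,
    status    integer   NOT NULL,
    polled_at timestamp NOT NULL
);

-- the poller COPYs or INSERTs a batch into the staging table, then flushes it in one shot:
BEGIN;
UPDATE modem_state
   SET status    = s.status,
       polled_at = s.polled_at
  FROM modem_poll_staging s
 WHERE modem_state.modem_id = s.modem_id;
TRUNCATE modem_poll_staging;
COMMIT;

This is not a true global in-memory table (PostgreSQL has no such object), but it cuts the per-row update chatter down to one bulk statement per batch.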
Re: [PERFORM] sustained update load of 1-2k/sec
At 09:58 AM 8/19/2005, Andreas Pflug wrote:

The 1-2k xact/sec for MySQL seems suspicious, sounds very much like write-back cached, not write-through, esp. considering that heavy concurrent write access isn't said to be MySQL's strength...

Don't be suspicious. I haven't seen the code under discussion, but I have seen mySQL easily achieve these kinds of numbers using the myISAM storage engine in write-through cache mode. myISAM can be =FAST=. Particularly when decent HW is thrown at it.

Ron

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] sustained update load of 1-2k/sec
At 12:34 PM 8/19/2005, Jeffrey W. Baker wrote:

On Fri, 2005-08-19 at 10:54 -0400, Ron wrote:
> Maxtor Atlas 15K II's.
> Areca's 1GB buffer RAID cards

The former are SCSI disks and the latter is an SATA controller. The combination would have a transaction rate of approximately 0.

You are evidently thinking of the Areca ARC-11xx controllers (and you are certainly right for that HW combination ;-) ). Those are not the only products Areca makes that can be upgraded to a 1GB cache.

Until SAS infrastructure is good enough, U320 SCSI and FC HD's remain the top performing HD's realistically available. At the most fundamental, your DBMS is only as good as your HD IO subsystem, and your HD IO subsystem is only as good as your HDs. As others have said here, skimping on your HDs is _not_ a good design choice where DBMSs are concerned.

As an aside, the Atlas 15K II's are now available in SAS:
http://www.maxtor.com/portal/site/Maxtor/menuitem.ba88f6d7cf664718376049b291346068/?channelpath=/en_us/Products/SCSI%20Hard%20Drives/Atlas%2015K%20Family/Atlas%2015K%20II%20SAS

I haven't seen independent benches on them, so I explicitly referenced the U320 Atlas 15K II's known performance numbers instead. As I said, Maxtor is claiming even better for the SAS version of the Atlas 15K II.

None of the SAS <-> PCI-X or PCI-E RAID cards I know of are ready for mass market yet, although a few are in beta.

I can vouch for the Areca controllers, however. You can certainly achieve pgbench transaction rates in the hundreds per second even with only 5 7200RPM disks and 128MB cache. Don't forget to buy the battery.

Agreed.

Hope this is helpful,
Ron Peacetree

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] extremly low memory usage
At 01:18 PM 8/19/2005, John A Meinel wrote:

Jeremiah Jahn wrote:
> Sorry about the formatting.
>
> On Thu, 2005-08-18 at 12:55 -0500, John Arbash Meinel wrote:
>
>>Jeremiah Jahn wrote:
>>
>> ...
>>The expensive parts are the 4915 lookups into the litigant_details (each
>>one takes approx 4ms for a total of ~20s).
>>And then you do it again on case_data (average 3ms each * 4906 loops =
>>~15s).
>
> Is there some way to avoid this?
>

Well, in general, 3ms for a single lookup seems really long. Maybe your index is bloated by not vacuuming often enough. Do you tend to get a lot of updates to litigant_details?

Given that the average access time for a 15Krpm HD is in the 5.5-6ms range (7.5-8ms for a 10Krpm HD), having an average of 3ms for a single lookup implies that ~1/2 (the 15Krpm case) or ~1/3 (the 10Krpm case) of the table accesses require a seek. This implies a poor match between physical layout and access pattern.

If I understand correctly, the table should not be very fragmented given that this is a reasonably freshly loaded DB? That implies that the fields being looked up are not well sorted in the table compared to the query pattern.

If the entire table could fit in RAM, this would be far less of a consideration. Failing that, the physical HD layout has to be improved or the query pattern has to be changed to reduce seeks.

There are a couple possibilities at this point. First, you can REINDEX the appropriate index, and see if that helps. However, if this is a test box, it sounds like you just did a dump and reload, which wouldn't have bloat in an index.

Another possibility. Is this the column that you usually use when pulling information out of litigant_details? If so, you can CLUSTER litigant_details on the appropriate index. This will help things be close together that should be, which decreases the index lookup costs. However, if this is not the common column, then you probably will slow down whatever other accesses you may have on this table.

After CLUSTER, the current data will stay clustered, but new data will not, so you have to continually CLUSTER, the same way that you might VACUUM. *However*, IIRC CLUSTER grabs an Exclusive lock, so it is as expensive as a VACUUM FULL. Be aware of this, but it might vastly improve your performance, so it would be worth it.

CLUSTER can be a very large maintenance overhead/problem if the table(s) in question actually need to be "continually" re-CLUSTERed. If there is no better solution available, then you do what you have to, but it feels like there should be a better answer here.

Perhaps the DB schema needs examining to see if it matches up well with its real usage?

Ron Peacetree

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster
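For reference, the REINDEX/CLUSTER experiments being discussed would look roughly like this on the 8.0-era server in question. The index and table names are taken from the posted plan; whether lit_actor_speed is actually the right clustering index depends on the common access pattern, as noted above.

-- rebuild the possibly-bloated index
REINDEX INDEX lit_actor_speed;

-- physically reorder the table by that index (takes an exclusive lock,
-- comparable in cost to VACUUM FULL, and must be repeated as new rows arrive)
CLUSTER lit_actor_speed ON litigant_details;
ANALYZE litigant_details;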
Re: [PERFORM] sustained update load of 1-2k/sec
At 03:31 PM 8/19/2005, Alex Turner wrote:

Don't forget that Ultra 320 is the speed of the bus, not each drive. No matter how many honking 15k disks you put on a 320MB bus, you can only get 320MB/sec, and only have so many outstanding IO/s on the bus!

Of course. This is exactly why multi-channel SCSI and multichannel Fibre Channel cards exist; and why external RAID enclosures usually have multiple such cards in them...

Even moderately acceptable U320 SCSI cards are dual channel at this point (think Adaptec dual channel AHA's), and Quad channel ones are just as common. The Quads will, of course, saturate a 64b 133MHz PCI-X bus. _IF_ the chipset on them can keep up.

The current kings of RAID card performance are all Fibre Channel based, and all the ones I know of are theoretically capable of saturating a 64b 133MHz PCI-X bus. Again, _IF_ the chipset on them can keep up.

Most commodity RAID cards have neither adequate CPU nor enough buffer. Regardless of the peripheral IO technology they use.

Not so with SATA! Each drive is on its own bus, and you are only limited by the speed of your PCI-X Bus, which can be as high as 800MB/sec at 133Mhz/64bit. That's the Theory anyway, and latency should be lower as well.

OTOH, as my wife likes to say "In theory, Theory and Practice are the same. In practice, they almost never are."

You are only getting the performance you mention as long as your card can keep up with multiplexing N IO streams, crunching RAID 5 XORs (assuming you are using RAID 5), etc, etc. As I'm sure you know, "The chain is only as strong as its weakest link."

Most commodity SATA RAID cards brag about being able to pump 300MB/s (they were all over LW SF bragging about this!?), which in this context is woefully unimpressive. Sigh.

I'm impressed with the Areca cards because they usually have CPUs that actually can come close to pushing the theoretical IO limit of the bus they are plugged into; and they can be upgraded to (barely) acceptable buffer amounts (come on, manufacturers! 4GB of DDR PC3200 is only -2- DIMMs, and shortly that will be enough to hold 8GB of DDR PC3200. Give us more buffer!).

It's cheap and it's fast - all you have to do is pay for the enclosure, which can be a bit pricey, but there are some nice 24bay and even 40bay enclosures out there for SATA.

I've even seen 48 bay ones. However, good enclosures, particularly for larger numbers of HDs, are examples of non-trivial engineering and priced accordingly. Too many times I see people buy "bargain" enclosures and set themselves and their organizations up for some _very_ unpleasant times that could easily have been avoided by being careful to buy quality products. "Pay when you buy or pay much more later."

Yes a 15k RPM drive will give you better seek time and better peak through put, but put them all on a single U320 bus and you won't see much return past a stripe size of 3 or 4.

Agreed. Same holds for 2Gbps FC. Haven't tested 4Gbps FC personally yet, but I'm told the limit is higher in the manner you'd expect.

If it's raw transactions per second data warehouse style, it's all about the xlog baby which is sequential writes, and all about large block reads, which is sequential reads.

Alex Turner
NetEconomist
P.S. Sorry if I'm a bit punchy, I've been up since yesterday with server upgrade nightmares that continue ;)

My condolences and sympathies. I've definitely been there and done that.

Ron Peacetree

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq
Re: [PERFORM] extremly low memory usage
At 04:11 PM 8/19/2005, Jeremiah Jahn wrote:

On Fri, 2005-08-19 at 14:23 -0500, John A Meinel wrote:
> Ron wrote:
> > At 01:18 PM 8/19/2005, John A Meinel wrote:
> >
> >> Jeremiah Jahn wrote:
> >> > Sorry about the formatting.
> >> >
> >> > On Thu, 2005-08-18 at 12:55 -0500, John Arbash Meinel wrote:
> >> >
> >> >>Jeremiah Jahn wrote:
> >> >>
> >> ...
> >>
> >> >>The expensive parts are the 4915 lookups into the litigant_details
> >> >>(each one takes approx 4ms for a total of ~20s).
> >> >>And then you do it again on case_data (average 3ms each * 4906 loops =
> >> >>~15s).
> >> >
> >> > Is there some way to avoid this?
> >> >
> >>
> >> Well, in general, 3ms for a single lookup seems really long. Maybe your
> >> index is bloated by not vacuuming often enough. Do you tend to get a lot
> >> of updates to litigant_details?
> >
> > Given that the average access time for a 15Krpm HD is in the 5.5-6ms
> > range (7.5-8ms for a 10Krpm HD), having an average of 3ms for a single
> > lookup implies that ~1/2 (the 15Krpm case) or ~1/3 (the 10Krpm case)
> > of the table accesses require a seek.
> >

I think LVM may be a problem, since it also seems to break things up on the file system. My access time on the seek should be around 1/7th the 15Krpm I believe since it's a 14 disk raid 10 array. And no other traffic at the moment.

Oops. There's a misconception here. RAID arrays increase _throughput_ AKA _bandwidth_ through parallel access to HDs. OTOH, access time is _latency_, and that is not changed. Access time for a RAID set is equal to that of the slowest access time, AKA highest latency, HD in the RAID set.

> Well, from what he has said, the total indexes are < 1GB and he has 6GB
> of ram. So everything should fit. Not to mention he is only accessing
> 5000/several million rows.

I table spaced some of the indexes and they are around 211066880 bytes for the name_speed index and 149825330 for the lit_actor_speed index. Tables seem to be about a gig.

Hmm. And you think you are only using 250MB out of your 6GB of RAM? Something doesn't seem to add up here. From what's been posted, I'd expect much more RAM to be in use.

> > This implies a poor match between physical layout and access pattern.
>
> This seems to be the case. But since this is not the only query, it may
> be that other access patterns are more important to optimize for.
>
> > If I understand correctly, the table should not be very fragmented given
> > that this is a reasonably freshly loaded DB? That implies that the
> > fields being looked up are not well sorted in the table compared to the
> > query pattern.
> >
> > If the entire table could fit in RAM, this would be far less of a
> > consideration. Failing that, the physical HD layout has to be improved
> > or the query pattern has to be changed to reduce seeks.
> >
> > ...
>
> >> After CLUSTER, the current data will stay clustered, but new data will
> >> not, so you have to continually CLUSTER, the same way that you might
> >> VACUUM. *However*, IIRC CLUSTER grabs an Exclusive lock, so it is as
> >> expensive as a VACUUM FULL. Be aware of this, but it might vastly
> >> improve your performance, so it would be worth it.
> >
> > CLUSTER can be a very large maintenance overhead/problem if the table(s)
> > in question actually need to be "continually" re-CLUSTERed.
> >
> > If there is no better solution available, then you do what you have to,
> > but it feels like there should be a better answer here.
> >
> > Perhaps the DB schema needs examining to see if it matches up well with
> > its real usage?
> >
> > Ron Peacetree
> >
>
> I certainly agree that CLUSTER is expensive, and is an on-going
> maintenance issue. If it is the normal access pattern, though, it may be
> worth it.

The query I've sent you is one of the most common I get; just change the name. I handle about 180K of them a day, mostly between 8 and 5. The clustering has never really been a problem. Like I said before, I do it about once a week. I handle about 3000 updates an hour consisting of about 1000-3000 statements per update, ie about 2.5 million updates per hour. In the last few months or so I've filtered these down to about 400K update/delete/insert statements per hour.

2.5M updates per hour = ~695 updates per second. 400K per hour = ~112 updates per sec. These should be well wi
Re: [PERFORM] extremly low memory usage
At 02:53 PM 8/20/2005, Jeremiah Jahn wrote:

On Fri, 2005-08-19 at 16:03 -0500, John A Meinel wrote:
> Jeremiah Jahn wrote:
> > On Fri, 2005-08-19 at 12:18 -0500, John A Meinel wrote:
> >
> > it's cached alright. I'm getting a read rate of about 150MB/sec. I would
> > have thought is would be faster with my raid setup. I think I'm going to
> > scrap the whole thing and get rid of LVM. I'll just do a straight ext3
> > system. Maybe that will help. Still trying to get suggestions for a
> > stripe size.
> >
> I don't think 150MB/s is out of the realm for a 14 drive array.
> How fast is
> time dd if=/dev/zero of=testfile bs=8192 count=1000000

time dd if=/dev/zero of=testfile bs=8192 count=1000000
1000000+0 records in
1000000+0 records out

real    1m24.248s
user    0m0.381s
sys     0m33.028s

> (That should create a 8GB file, which is too big to cache everything)
> And then how fast is:
> time dd if=testfile of=/dev/null bs=8192 count=1000000

time dd if=testfile of=/dev/null bs=8192 count=1000000
1000000+0 records in
1000000+0 records out

real    0m54.139s
user    0m0.326s
sys     0m8.916s

and on a second run:

real    0m55.667s
user    0m0.341s
sys     0m9.013s

>
> That should give you a semi-decent way of measuring how fast the RAID
> system is, since it should be too big to cache in ram.

about 150MB/Sec. Is there no better way to make this go faster...?

Assuming these are U320 15Krpm 147GB HDs, a RAID 10 array of 14 of them doing raw sequential IO like this should be capable of ~7*75MB/s= 525MB/s using Seagate Cheetah 15K.4's, ~7*79MB/s= 553MB/s if using Fujitsu MAU's, and ~7*86MB/s= 602MB/s if using Maxtor Atlas 15K II's to devices external to the RAID array.

_IF_ the controller setup is high powered enough to keep that kind of IO rate up. This will require a controller or controllers providing dual channel U320 bandwidth externally and quad channel U320 bandwidth internally. IOW, it needs a controller or controllers talking 64b 133MHz PCI-X, reasonably fast DSP/CPU units, and probably a decent sized IO buffer as well.

AFAICT, the Dell PERC4 controllers use various flavors of the LSI Logic MegaRAID controllers. What I don't know is which exact one yours is, nor do I know if it (or any of the MegaRAID controllers) are high powered enough. Talk to your HW supplier to make sure you have controllers adequate to your HD's.

...and yes, your average access time will be in the 5.5ms - 6ms range when doing a physical seek. Even with RAID, you want to minimize seeks and maximize sequential IO when accessing them. Best to not go to HD at all ;-)

Hope this helps,
Ron Peacetree

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] extremly low memory usage
At 02:16 PM 8/20/2005, Jeremiah Jahn wrote:

I'm just watching gnome-system-monitor. Which, after careful consideration and looking at dstat, means I'm on CRACK... GSM isn't showing cached memory usage. I assume that the cache memory usage is where data off of the disks would be cached...?

memory output from dstat is this for a few seconds:

---procs--- ----memory-usage---- ---paging-- --disk/sda----disk/sdb-- ---swap--- ----total-cpu-usage----
run blk new|_used _buff _cach _free|__in_ _out_|_read write:_read write|_used _free|usr sys idl wai hiq siq
  0   0   0|1336M   10M 4603M   17M| 490B  833B|3823B 3503k:1607k 4285k| 160k 2048M|  4   1  89   7   0   0
  1   0   0|1337M   10M 4600M   18M|   0     0 |   0     0 :   0   464k| 160k 2048M| 25   0  75   0   0   0
  1   0   0|1334M   10M 4596M   25M|   0     0 |   0     0 :   0    56k| 160k 2048M| 21   4  75   0   0   0

Then the "low memory usage" was a chimera. Excellent!

Given the evidence in this thread, IMO you should upgrade your box to 16GB of RAM ASAP. That should be enough to cache most, if not all, of the 10GB of the "hot" part of your DB; thereby dedicating your HD subsystem as much as possible to writes (which is unavoidable HD IO).

As I've posted before, at $75-$150/GB, it's well worth the investment whenever you can prove it will help as we have here.

Hope this helps,
Ron Peacetree

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
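If the RAM upgrade happens, the planner and buffer settings should be told about it. A hedged sketch for an 8.0-era setup on a 16GB box; the numbers below are illustrative guesses rather than tuned values, and 8.0 takes raw page/buffer counts rather than "12GB"-style unit strings:

-- in postgresql.conf (values are counts of 8KB pages/buffers; shared_buffers needs a restart):
--   shared_buffers = 50000            # roughly 390MB of dedicated buffer cache
--   effective_cache_size = 1500000    # roughly 11.5GB: what the OS is expected to cache
-- effective_cache_size can also be tried per session before touching the config file:
SET effective_cache_size = 1500000;
SHOW effective_cache_size;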
Re: [PERFORM] extremly low memory usage
At 10:54 AM 8/21/2005, Jeremiah Jahn wrote:

On Sat, 2005-08-20 at 21:32 -0500, John A Meinel wrote:
> Ron wrote:
> >
> Well, since you can get a read of the RAID at 150MB/s, that means that
> it is actual I/O speed. It may not be cached in RAM. Perhaps you could
> try the same test, only using say 1G, which should be cached.

[EMAIL PROTECTED] pgsql]# time dd if=/dev/zero of=testfile bs=1024 count=1000000
1000000+0 records in
1000000+0 records out

real    0m8.885s
user    0m0.299s
sys     0m6.998s

This is abysmally slow.

[EMAIL PROTECTED] pgsql]# time dd of=/dev/null if=testfile bs=1024 count=1000000
1000000+0 records in
1000000+0 records out

real    0m1.654s
user    0m0.232s
sys     0m1.415s

This transfer rate is the only one out of the 4 you have posted that is in the vicinity of where it should be.

The raid array I have is currently set up to use a single channel. But I have dual controllers in the array. And dual external slots on the card. The machine is brand new and has pci-e backplane.

So you have 2 controllers each with 2 external slots? But you are currently only using 1 controller and only one external slot on that controller?

> > Assuming these are U320 15Krpm 147GB HDs, a RAID 10 array of 14 of them
> > doing raw sequential IO like this should be capable of
> > ~7*75MB/s= 525MB/s using Seagate Cheetah 15K.4's

BTW I'm using Seagate Cheetah 15K.4's

OK, now we have that nailed down.

> > AFAICT, the Dell PERC4 controllers use various flavors of the LSI Logic
> > MegaRAID controllers. What I don't know is which exact one yours is,
> > nor do I know if it (or any of the MegaRAID controllers) are high
> > powered enough.

PERC4eDC-PCI Express, 128MB Cache, 2-External Channels

Looks like they are using the LSI Logic MegaRAID SCSI 320-2E controller. IIUC, you have 2 of these, each with 2 external channels?

The specs on these appear a bit strange. They are listed as being a PCI-Ex8 card, which means they should have a max bandwidth of 20Gb/s= 2GB/s, yet they are also listed as only supporting dual channel U320= 640MB/s when they could easily support quad channel U320= 1.28GB/s. Why bother building a PCI-Ex8 card when only a PCI-Ex4 card (which is a more standard physical format) would've been enough? Or if you are going to build a PCI-Ex8 card, why not support quad channel U320? This smells like there's a problem with LSI's design.

The 128MB buffer also looks suspiciously small, and I do not see any upgrade path for it on LSI Logic's site. "Serious" RAID controllers from companies like Xyratex, Engino, and Dot-hill can have up to 1-2GB of buffer, and there's sound technical reasons for it. See if there's a buffer upgrade available or if you can get controllers that have larger buffer capabilities.

Regardless of the above, each of these controllers should still be good for about 80-85% of 640MB/s, or ~510-540 MB/s apiece when doing raw sequential IO if you plug 3-4 fast enough HD's into each SCSI channel. Cheetah 15K.4's certainly are fast enough. Optimal setup is probably to split each RAID 1 pair so that one HD is on each of the SCSI channels, and then RAID 0 those pairs. That will also protect you from losing the entire disk subsystem if one of the SCSI channels dies.

That 128MB of buffer cache may very well be too small to keep the IO rate up, and/or there may be a more subtle problem with the LSI card, and/or you may have a configuration problem, but _something(s)_ need fixing since you are only getting raw sequential IO of ~100-150MB/s when it should be above 500MB/s. This will make the most difference for initial reads (first time you load a table, first time you make a given query, etc) and for any writes.

Your HW provider should be able to help you, even if some of the HW in question needs to be changed. You paid for a solution. As long as this stuff is performing at so much less than what it is supposed to, you have not received the solution you paid for.

BTW, on the subject of RAID stripes, IME the sweet spot tends to be in the 64KB to 256KB range (very large, very read heavy data mines can want larger RAID stripes). Only experimentation will tell you what results in the best performance for your application.

I'm not really worried about the writing, it's the reading that needs to be faster.

Initial reads are only going to be as fast as your HD subsystem, so there's a reason for making the HD subsystem faster even if all you care about is reads. In addition, I'll repeat my previous advice that upgrading to 16GB of RAM would be well worth it for you.

Hope this helps,
Ron Peacetree

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Need for speed 2
At 03:10 AM 8/25/2005, Ulrich Wisser wrote:

>I realize I need to be much more specific. Here is a more detailed description of my hardware and system design.
>
>Pentium 4 2.4GHz
>Memory 4x DIMM DDR 1GB PC3200 400MHZ CAS3, KVR
>Motherboard chipset 'I865G', two IDE channels on board

First suggestion: Get better server HW. An AMD Opteron based dual processor board is the current best in terms of price/performance ratio, _particularly_ for DB applications like the one you have described. Such mainboards cost ~$400-$500. RAM will cost about $75-$150/GB. Opteron 2xx are ~$200-$700 apiece. So a 2P AMD system can be had for as little as ~$850 + the cost of the RAM you need. In the worst case where you need 24GB of RAM (~$3600), the total comes in at ~$4450. As you can see from the numbers, buying only what RAM you actually need can save you a great deal of money.

Given what little you said about how much of your DB is frequently accessed, I'd suggest buying a server based around the 2P 16 DIMM slot IWill DK88 mainboard (Tyan has announced a 16 DIMM slot mainboard, but I do not think it is actually being sold yet). Then fill it with the minimum amount of RAM that will allow the "working set" of the DB to be cached in RAM. In the worst case where DB access is essentially uniform and essentially random, you will need 24GB of RAM to hold the 22GB DB + OS + etc. That worst case is _rare_. Usually DB's have a working set that is smaller than the entire DB. You want to keep that working set in RAM. If you can't identify the working set, buy enough RAM to hold the entire DB.

In particular, you want to make sure that any frequently accessed read only tables or indexes are kept in RAM. The "read only" part is very important. Tables (and their indexes) that are frequently written to _have_ to access HD. Therefore you get much less out of having them in RAM. Read only tables and their indexes can be loaded into tmpfs at boot time, thereby keeping them out of the way of the file system buffer cache. tmpfs does not save data if the host goes down, so it is very important that you ONLY use this trick with read only tables. The other half of the trick is to make sure that the file system buffer cache does _not_ cache whatever you have loaded into tmpfs.

>2x SEAGATE BARRACUDA 7200.7 80GB 7200RPM ATA/100 (software raid 1, system, swap, pg_xlog)
>ADAPTEC SCSI RAID 2100S ULTRA160 32MB 1-CHANNEL
>2x SEAGATE CHEETAH 15K.3 73GB ULTRA320 68-PIN WIDE (raid 1, /var/lib/pgsql)

Second suggestion: you need a MUCH better IO subsystem. In fact, given that you have described this system as being primarily OLTP like, this is more important than the above server HW. Best would be to upgrade everything, but if you are strapped for cash, upgrade the IO subsystem first.

You need many more spindles and a decent RAID card or cards. You want 15Krpm (best) or 10Krpm HDs. As long as all of the HD's are at least 10Krpm, more spindles is more important than faster spindles. If it's a choice between more 10Krpm discs or fewer 15Krpm discs, buy the 10Krpm discs. Get the spindle count as high as your RAID cards can handle.

Whatever RAID cards you get should have as much battery backed write buffer as possible. In the commodity market, presently the highest performance RAID cards I know of, and the ones that support the largest battery backed write buffer, are made by Areca.

>Database size on disc is 22GB. (without pg_xlog)

Find out what the working set, ie the most frequently accessed portion, of this 22GB is and you will know how much RAM is worth having.
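One hedged way to get a first cut at that working set from the catalogs themselves (approximate: relpages is only as current as the last VACUUM/ANALYZE, and it says nothing about access frequency, only size):

-- rough per-relation footprint, largest first; each page is 8KB
SELECT relname,
       relkind,
       relpages * 8 / 1024 AS approx_mb
  FROM pg_class
 WHERE relkind IN ('r', 'i')
 ORDER BY relpages DESC
 LIMIT 20;

Cross-checking the biggest relations against which reports actually touch them is what tells you whether the "hot" portion is a few GB or closer to the full 22GB.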
4GB is definitely too little!

>Please find my postgresql.conf below.

Third suggestion: make sure you are running a 2.6 based kernel and at least PG 8.0.3. Helping beta test PG 8.1 might be an option for you as well.

>Putting pg_xlog on the IDE drives gave about 10% performance improvement. Would faster disks give more performance?
>
>What my application does:
>
>Every five minutes a new logfile will be imported. Depending on the source of the request it will be imported in one of three "raw click" tables. (data from two months back, to be able to verify customer complaints) For reporting I have a set of tables. These contain data from the last two years. My app deletes all entries from today and reinserts updated data calculated from the raw data tables.

The raw data tables seem to be read only? If so, you should buy enough RAM to load them into tmpfs at boot time and have them be completely RAM resident in addition to having enough RAM for the OS to cache an appropriate amount of the rest of the DB.

>The queries contain no joins, only aggregates. I have several indexes to speed different kinds of queries. My problems occur when one user does a report that contains too much old data. In that case all cache mechanisms will fail and disc io is the limiting factor. If one query contains so much data, that a f
Re: [PERFORM] Read/Write block sizes
At 03:45 PM 8/25/2005, Josh Berkus wrote: Jeff, > Ask me sometime about my replacement for GNU sort. It uses the same > sorting algorithm, but it's an order of magnitude faster due to better > I/O strategy. Someday, in my infinite spare time, I hope to demonstrate > that kind of improvement with a patch to pg. Since we desperately need some improvements in sort performance, I do hope you follow up on this. --Josh

I'll generalize that. IMO we desperately need any and all improvements in IO performance. Even more so than we need improvements in sorting or sorting IO performance.

Ron ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Read/Write block sizes
At 04:49 PM 8/25/2005, Chris Browne wrote: [EMAIL PROTECTED] (Ron) writes: > At 03:45 PM 8/25/2005, Josh Berkus wrote: >> > Ask me sometime about my replacement for GNU sort. It uses the >> > same sorting algorithm, but it's an order of magnitude faster due >> > to better I/O strategy. Someday, in my infinite spare time, I >> > hope to demonstrate that kind of improvement with a patch to pg. >> >>Since we desperately need some improvements in sort performance, I >>do hope you follow up on this. > > I'll generalize that. IMO we desperately need any and all > improvements in IO performance. Even more so than we need > improvements in sorting or sorting IO performance. That's frankly a step backwards. Feel free to "specialise" that instead.

We can agree to disagree, I'm cool with that. I'm well aware that a Systems Approach to SW Architecture is not always popular in the Open Source world. Nonetheless, my POV is that if we want to be taken seriously and beat "the big boys", we have to do everything smarter and faster, as well as cheaper, than they do. You are not likely to be able to do that consistently without using some of the "icky" stuff one is required to study as part of formal training in the Comp Sci and SW Engineering fields.

A patch that improves some specific aspect of performance is a thousand times better than any sort of "desperate desire for any and all improvements in I/O performance."

Minor twisting of my words: substituting "desire" for "need". The need is provable. Just put "the big 5" (SQL Server, Oracle, DB2, mySQL, and PostgreSQL) into some realistic benches to see that.

Major twisting of my words: the apparent implication by you that I don't appreciate improvements in the IO behavior of specific things like sorting as much as I'd appreciate more "general" IO performance improvements. Performance optimization is best done as an iterative improvement process that starts with measuring where the need is greatest, then improving that greatest need by the most you can, then repeating the whole cycle. _Every_ improvement in such a process is a specific improvement, even if the improvement is a decision to re-architect the entire product to solve the current biggest issue. Improving sorting IO is cool. OTOH, if pg's biggest IO problems are elsewhere, then the amount of overall benefit we will get from improving sorting IO is going to be minimized until we improve the bigger problem(s). Amdahl's Law.

The "specialized patch" is also pointedly better in that a *confidently submitted* patch is likely to be way better than any sort of "desperate clutching at whatever may come to hand."

Another distortion of my statement and POV. I never suggested nor implied any sort of "desperate clutching...". We have _measurable_ IO issues that need to be addressed in order for pg to be a better competitor in the marketplace. Just as we do with sorting performance.

Far too often, I see people trying to address performance problems via the "desperate clutching at whatever seems near to hand," and that generally turns out very badly as a particular result of the whole "desperate clutching" part. If you can get a sort improvement submitted, that's a concrete improvement...

As I said, I'm all in favor of concrete, measurable improvement. I do not think I ever stated I was in favor of anything else. You evidently are mildly ranting because you've seen some examples of poor SW Engineering Discipline/Practice by people with perhaps inadequate skills for the issues they were trying to address.
We all have. "90% of everything is dreck (e.g., of too low a quality)." OTOH, I do not think I've given you any reason to think I lack such Clue, nor do I think my post was advocating such thrashing. My post was intended to say that we need an Overall Systems Approach to pg optimization rather than just applying what compiler writers call "peephole optimizations" to pg. No more, no less. I apologize if I somehow misled you, Ron Peacetree ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Inefficient queryplan for query with
At 10:27 AM 8/27/2005, Tom Lane wrote: Arjen van der Meijden <[EMAIL PROTECTED]> writes: > But apparently there is a bug in the explain mechanism of the 8.1devel > I'm using (I downloaded a nightly of 25 August somewhere in the morning > (CEST)), since it returned: > ERROR: bogus varno: 9 Yeah, someone else sent in a test case for this failure (or at least one with a similar symptom) yesterday. I'll try to fix it today. > Is a nested loop normally so much (3x) more costly than a hash join? Or > is it just this query that gets estimated wrongly? There's been some discussion that we are overestimating the cost of nestloops in general, because we don't take into account that successive scans of the inner relation are likely to find many pages already in cache from the earlier scans. So far no one's come up with a good cost model to use for this, though. regards, tom lane

It certainly seems common in the EXPLAIN ANALYZE output I see that the (estimated) cost of Nested Loop is far higher than the actual time measured (a quick way to collect such data points is sketched below).

What happened when someone tried the naive approach of telling the planner to estimate the cost of a nested loop based on fitting whatever entities are involved in the nested loop in RAM as much as possible? When there are multiple such mappings, use whichever one results in the lowest cost for the NL in question. Clearly, this should lead to an underestimate of the constant factor in the cost of the operation involved, but since nested loops have the only polynomial growth function of the planner's choices, NL's should still have a decent chance of being more expensive than other choices under most circumstances. In addition, if those costs are based on actual measurements of how long it takes to do such scans, then the estimated cost has a decent chance of being fairly accurate under such circumstances. It might not work well, but it seems like a reasonable first attempt at a solution?

Ron Peacetree ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
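One low tech way to collect estimated-vs-actual data points for a query you care about is to force the planner away from its preferred join and compare the two EXPLAIN ANALYZE outputs. The query below is made up purely for illustration; substitute one of your own:

-- whatever the planner picks on its own (often a hash join)
EXPLAIN ANALYZE
  SELECT count(*) FROM orders o JOIN customers c ON c.id = o.customer_id
  WHERE c.region = 42;

-- now force a nested loop, in this session only, and compare
SET enable_hashjoin = off;
SET enable_mergejoin = off;
EXPLAIN ANALYZE
  SELECT count(*) FROM orders o JOIN customers c ON c.id = o.customer_id
  WHERE c.region = 42;

RESET enable_hashjoin;
RESET enable_mergejoin;

Run the nested loop variant twice so the second run reflects a warm cache; the gap between its estimated cost and its actual time is exactly the effect Tom describes.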
Re: [PERFORM] RAID Configuration Suggestion
At 08:37 AM 8/30/2005, Alvaro Nunes Melo wrote: Hello, We are about to install a new PostgreSQL server, and despite being a very humble configuration compared to the ones we see in the list, it's the biggest one we've got till now. The server is a Dual Xeon 3.0 with 2 GB RAM and two SCSI disks. Our main doubt is what is the best configuration for the disks. We are thinking about using them in a RAID-0 array. Is this the best option? What do you suggest on partitioning? Separate partitions for the OS, data and pg_xlog?

This is _very_ modest HW. Unless your DB and/or DB load is similarly modest, you are not going to be happy with the performance of your DBMS.

At a minimum, for safety reasons you want 4 HDs: 2 for a RAID 1 set for the DB, and 2 for a RAID 1 set for the OS + pg_xlog. 2 extra HDs, even SCSI HDs, is cheap. Especially when compared to the cost of corrupted or lost data. HD's and RAM are cheap enough that you should be able to upgrade in more ways, but do at least that "upgrade"!

Beyond that, the best ways to spend your limited $ are highly dependent on your exact DB and its usage pattern.

Ron Peacetree ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] RAID Configuration Suggestion
At 12:56 PM 8/30/2005, Joshua D. Drake wrote: Ron wrote: At 08:37 AM 8/30/2005, Alvaro Nunes Melo wrote: Hello, We are about to install a new PostgreSQL server, and despite being a very humble configuration compared to the ones we see in the list, it's the biggest one we've got till now. The server is a Dual Xeon 3.0 with 2 GB RAM and two SCSI disks. Our main doubt is what is the best configuration for the disks. We are thinking about using them in a RAID-0 array. Is this the best option? What do you suggest on partitioning? Separate partitions for the OS, data and pg_xlog?

This is _very_ modest HW. Unless your DB and/or DB load is similarly modest, you are not going to be happy with the performance of your DBMS.

Well that is a pretty blanket statement. I have many customers who happily run on less hardware than what is mentioned above. It all depends on the application itself and how the database is utilized.

If your customers "run happily" on 2 HD's, then IME they have very modest DB storage and/or DB performance needs. For safety reasons, the best thing to do if you only have 2 HD's is to run them as a RAID 1 with everything on them. The slightly better performing but considerably less safe alternative is to put the OS + logs on 1 HD and the DB on the other. Any resemblance to a semi-serious OLTP load will reduce either such system to an HD IO bound one with poor IO rates.

If, as above, your DBMS is bounded by the performance of one HD, then you are AT BEST getting the raw IO rate of such a device: say ~70-80MB/s in average sustained raw sequential IO. File system overhead and any seeking behavior will rapidly reduce that number to considerably less. Consider that the CPU <-> memory IO subsystem is easily capable of ~3.2GBps. So you are talking about slowing the DB server to at most ~1/40, maybe even as little as ~1/200, of its potential under such circumstances. If your DB can fit completely in RAM and/or does light duty write IO, this may not be a serious issue. OTOH, once you start using those HD's to any reasonable extent, most of the rest of the investment you've made in server HW is wasted.

As I keep saying, the highest priority when buying HW for a DBMS is to make sure you have enough HD IO bandwidth. RAM comes second, and CPU is a distant third.

At a minimum, for safety reasons you want 4 HDs: 2 for a RAID 1 set for the DB, and 2 for a RAID 1 set for the OS + pg_xlog. 2 extra HDs, even SCSI HDs, is cheap. Especially when compared to the cost of corrupted or lost data.

Your real test is going to be prototyping the performance you need. A single RAID 1 mirror (don't use RAID 0) may be more than enough. However based on the fact that you speced Xeons my guess is you spent money on CPUs when you should have spent money on hard drives.

I agree with Josh on both points. Don't use RAID 0 for persistent data unless you like losing data. Spend more on HDs and RAM and less on CPU's (fast FSB is far more important than high clock rate. In general buy the highest FSB with the slowest clock rate.). In fact, if you are that strapped for cash, exchange those 2 SCSI HD's for their $ equivalent in SATA HD's. The extra spindles will be well worth it.

If you still have the budget, I would suggest considering either what Ron suggested or possibly using a 4 drive RAID 10 instead.

IME, with only 4 HDs, it's usually better to split them into two RAID 1's (one for the db, one for everything else including the logs) than it is to put everything on one RAID 10. YMMV.
Ron Peacetree ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] RAID Configuration Suggestion
At 03:27 PM 8/30/2005, Joshua D. Drake wrote: If you still have the budget, I would suggest considering either what Ron suggested or possibly using a 4 drive RAID 10 instead.

IME, with only 4 HDs, it's usually better to split them into two RAID 1's (one for the db, one for everything else including the logs) than it is to put everything on one RAID 10. YMMV.

Really? That's interesting. My experience is different, I assume SCSI? Software/Hardware Raid?

The issue exists regardless of technologies used, although the technology used does affect when things become an irritation or serious problem. The issue with "everything on the same HD set" seems to be that under light loads anything works reasonably well, but as load increases, contention between DB table access, OS access, and xlog writes can cause performance problems. In particular, _everything_ else hangs while logs are being written with "everything on the same HD set". That leaves you with the nasty choice of small log writes, which cause more seeking behavior and the resultant poor overall HD IO performance, or large log writes, which basically freeze the server until they are done.

Having the logs on a different HD, and if possible a different IO bus, reduces this effect to a minimum and seems to be a better choice than the "shared everything" approach.

Although this effect seems largest when there are fewest HDs, the general pattern is that one should use as many spindles as one can make use of and that they should be as dedicated as possible in their purpose(s). That's why the TPC benchmarked systems tend to have literally 100's of HD's and they tend to be split into very focused purposes.

Ron Peacetree ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] RAID Configuration Suggestion
At 08:04 PM 8/30/2005, Michael Stone wrote: On Tue, Aug 30, 2005 at 07:02:28PM -0400, Ron wrote: purpose(s). That's why the TPC benchmarked systems tend to have literally 100's of HD's and they tend to be split into very focused purposes. Of course, TPC benchmark systems are constructed such that cost and storage capacity are irrelevant--in the real world things tend to be more complicated.

The scary thing is that I've worked on RW production systems that bore a striking resemblance to a TPC benchmark system. As you can imagine, they uniformly belonged to BIG organizations (read: lots 'o $$$) who were using the systems for mission critical stuff where either it was company existence threatening for the system to be down, or they would lose much $$$ per min of down time, or both.

Financial institutions, insurance companies, central data mines for Fortune 2000 companies, etc _all_ build systems that push the state of the art in how much storage can be managed and how many HDs, CPUs, RAM DIMMs, etc are usable.

Historically, this has been the sole province of Oracle and DB2 on the SW side and equally outrageously priced custom HW. Clearly, I'd like to see PostgreSQL change that ;-)

Ron Peacetree ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] RAID Configuration Suggestion
At 08:43 PM 8/30/2005, Michael Stone wrote: On Tue, Aug 30, 2005 at 08:41:40PM -0400, Ron wrote: The scary thing is that I've worked on RW production systems that bore a striking resemblance to a TPC benchmark system. As you can imagine, they uniformly belonged to BIG organizations (read: lots 'o $$$) who were using the systems for mission critical stuff where either it was company existence threatening for the system to be down, or they would lose much $$$ per min of down time, or both. Yeah, and that market is relevant to someone with one dell server and 2 hard disks how?

Because successful small companies that _start_ with one small server and 2 HDs grow to _become_ companies that need far more HW; ...and in the perfect world their SW scales to their increased needs _without_ exponentially increasing their costs or overhead (as Oracle and DB2 currently do). THIS is the real long term promise of OS DBMS.

Ron Peacetree ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] Poor performance on HP Package Cluster
Your HD raw IO rate seems fine, so the problem is not likely to be with the HDs. That consistent ~10x increase in how long it takes to do an import or a select is noteworthy. This "smells" like an interconnect problem. Was the Celeron locally connected to the HDs while the new Xeons are network connected? Getting 10's or even 100's of MBps throughput out of local storage is much easier than it is to do over a network. 1GbE is required if you want HDs to push 72.72MBps over a network, and not even one 10GbE line will allow you to match local buffered IO of 1885.34MBps. What size are those network connects (Server A <-> storage, Server B <-> storage, Server A <-> Server B)?

Ron Peacetree

At 10:16 AM 9/1/2005, Ernst Einstein wrote: I've set up a Package Cluster (Fail-Over Cluster) on our two HP DL380 G4 with MSA Storage G2 (Xeon 3.4GHz, 6GB RAM, 2x [EMAIL PROTECTED] Raid1). The system is running under Suse Linux Enterprise Server. My problem is that the performance is very low. On our old server (Celeron 2GHz with 2GB of RAM) an import of our data takes about 10 minutes (1.1GB data). On one of the DL380s it takes more than 90 minutes... Select response times have also increased: Celeron 3 sec, Xeon 30-40 sec. I've been trying to fix the problem for two days now, googled a lot, but I don't know what to do. Top says my CPU spends ~50% of its time in wait io.

top - 14:07:34 up 22 min, 3 users, load average: 1.09, 1.04, 0.78
Tasks: 74 total, 3 running, 71 sleeping, 0 stopped, 0 zombie
Cpu(s): 50.0% us, 5.0% sy, 0.0% ni, 0.0% id, 45.0% wa, 0.0% hi, 0.0% si
Mem: 6050356k total, 982004k used, 5068352k free, 60300k buffers
Swap: 2097136k total, 0k used, 2097136k free, 786200k cached

  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
 9939 postgres 18  0  254m 143m 140m R 49.3  2.4 8:35.43 postgres: postgres plate [local] INSERT
 9938 postgres 16  0 13720 1440 1120 S  4.9  0.0 0:59.08 psql -d plate -f dump.sql
10738 root     15  0  3988 1120  840 R  4.9  0.0 0:00.05 top -d 0.2
    1 root     16  0   640  264  216 S  0.0  0.0 0:05.03 init [3]
    2 root     34 19     0    0    0 S  0.0  0.0 0:00.00 [ksoftirqd/0]

vmstat 1:

ClusterNode2 root $ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b swpd    free    buff   cache  si so  bi   bo   in   cs  us sy id wa
 1  0    0 5032012  60888  821008   0  0 216 6938 1952 5049  40  8 15 37
 0  1    0 5031392  60892  821632   0  0   0 8152 2126 5725  45  6  0 49
 0  1    0 5030896  60900  822144   0  0   0 8124 2052 5731  46  6  0 47
 0  1    0 5030400  60908  822768   0  0   0 8144 2124 5717  44  7  0 50
 1  0    0 5029904  60924  823272   0  0   0 8304 2062 5763  43  7  0 49

I've read (2004) that Xeon may have problems with context switching - does the problem still exist? Can I do something to minimize the problem?

postgresql.conf:
shared_buffers = 28672
effective_cache_size = 40
random_page_cost = 2
shmall & shmmax are set to 268435456

hdparm:
ClusterNode2 root $ hdparm -tT /dev/cciss/c0d0p1
/dev/cciss/c0d0p1:
 Timing buffer-cache reads: 3772 MB in 2.00 seconds = 1885.34 MB/sec
 Timing buffered disk reads: 150 MB in 2.06 seconds = 72.72 MB/sec

---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Massive performance issues
This should be able to run _very_ fast.

At 01:42 PM 9/1/2005, Matthew Sackman wrote: Hi, I'm having performance issues with a table consisting of 2,043,133 rows. The schema is:

\d address
              Table "public.address"
     Column      |          Type          | Modifiers
-----------------+------------------------+-----------
 postcode_top    | character varying(2)   | not null
 postcode_middle | character varying(4)   | not null
 postcode_bottom | character varying(7)   | not null
 postcode        | character varying(10)  | not null
 property_type   | character varying(15)  | not null
 sale_type       | character varying(10)  | not null
 flat_extra      | character varying(100) | not null
 number          | character varying(100) | not null
 street          | character varying(100) | not null
 locality_1      | character varying(100) | not null
 locality_2      | character varying(100) | not null
 city            | character varying(100) | not null
 county          | character varying(100) | not null
Indexes:
 "address_city_index" btree (city)
 "address_county_index" btree (county)
 "address_locality_1_index" btree (locality_1)
 "address_locality_2_index" btree (locality_2)
 "address_pc_bottom_index" btree (postcode_bottom)
 "address_pc_middle_index" btree (postcode_middle)
 "address_pc_top_index" btree (postcode_top)
 "address_pc_top_middle_bottom_index" btree (postcode_top, postcode_middle, postcode_bottom)
 "address_pc_top_middle_index" btree (postcode_top, postcode_middle)
 "address_postcode_index" btree (postcode)
 "address_property_type_index" btree (property_type)
 "address_street_index" btree (street)
 "street_prefix" btree (lower("substring"((street)::text, 1, 1)))

IOW, each row takes ~1KB on HD. First suggestion: format your HD to use 8KB pages with 1KB segments. That'll put each row down on HD as an atomic unit. 8KB pages also "play nice" with pg.

At 1KB per row, this table takes up ~2.1GB and should fit into RAM fairly easily on a decently configured DB server (my _laptop_ has 2GB of RAM after all...) Since you are using ~2.1GB for 2 years worth of data, 15 years worth should take no more than 2.1GB*7.5= 15.75GB.

If you replace some of those 100 char fields with integers for code numbers and have an auxiliary table for each of those fields mapping the code numbers to the associated 100 char string, you should be able to shrink a row considerably (a minimal sketch of the idea follows below). Your target is to have each row take <= 512B. Once a row fits into one 512B sector on HD, there's no point in making it smaller unless you can shrink it enough to fit 2 rows into one sector (<= 256B). Once two rows fit into one sector, there's no point shrinking a row unless you can make 3 rows fit into a sector. Etc.

Assuming each 100 char (e.g. 100B) field can be replaced with a 4B int, each row could be as small as 76B. That makes 85B per row the goal as it would allow you to fit 6 rows per 512B HD sector. So in the best case your table will be 12x smaller in terms of real HD space. Fitting one (or more) row(s) into one sector will cut down the real space used on HD for the table to ~7.88GB (or 1.32GB in the best case). Any such streamlining will make it faster to load, make the working set that needs to be in RAM for best performance smaller, etc, etc.

This is with postgresql 7.4 running on linux 2.6.11 with a 3GHz P4 and a SATA harddrive.

Upgrade pg to 8.0.3 and make sure you have enough RAM for your real day to day load. Frankly, RAM is so cheap ($75-$150/GB), I'd just upgrade the machine to 4GB as a matter of course. P4's have PAE, so if your mainboard can hold it, put more than 4GB of RAM in if you find you need it.
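A minimal sketch of the normalization suggested above, using the city column as the example (the lookup table and column names here are mine, not part of the OP's schema; repeat the pattern for street, locality_1, locality_2, county, etc.):

-- one small lookup table per big repeated varchar column
CREATE TABLE city_codes (
    city_id serial PRIMARY KEY,
    city    varchar(100) NOT NULL UNIQUE
);

INSERT INTO city_codes (city) SELECT DISTINCT city FROM address;

-- add the 4B integer column, fill it, then drop the wide text column
ALTER TABLE address ADD COLUMN city_id integer;
UPDATE address SET city_id = c.city_id FROM city_codes c WHERE c.city = address.city;
ALTER TABLE address DROP COLUMN city;
CREATE INDEX address_city_id_index ON address (city_id);

Queries then go through the lookup table, e.g. WHERE city_id = (SELECT city_id FROM city_codes WHERE city = 'Nottingham'), and the comparisons become cheap integer compares instead of long string compares.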
Since you are describing your workload as being predominantly reads, you can get away with far less HD capability as long as you crank up RAM high enough to hold the working set of the DB. The indications from the OP are that you may very well be able to hold the entire DB in RAM. That's a big win whenever you can achieve it. After these steps, there may still be performance issues that need attention, but the DBMS should be _much_ faster. Ron Peacetree ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Massive performance issues
At 04:25 PM 9/1/2005, Tom Lane wrote: Ron <[EMAIL PROTECTED]> writes: > ... Your target is to have each row take <= 512B. Ron, are you assuming that the varchar fields are blank-padded or something? I think it's highly unlikely that he's got more than a couple hundred bytes per row right now --- at least if the data is what it sounds like. As it stands, each row will take 55B - 748B and each field is variable in size up to the maximums given in the OP's schema. Since pg uses an underlying OS FS, and not a native one, there will be extra FS overhead no matter what we do, particularly to accommodate such flexibility... The goal is to minimize overhead and maximize regularity in layout. The recipe I know for HD IO speed is in keeping the data small, regular, and as simple as possible. Even better, if the table(s) can be made RAM resident, then searches, even random ones, can be very fast. He wants a 1000x performance improvement. Going from disk resident to RAM resident should help greatly in attaining that goal. In addition, by replacing as many variable sized text strings as possible with ints, the actual compare functions he used as examples should run faster as well. The upthread comment about strcoll() set off some alarm bells in my head. If the database wasn't initdb'd in C locale already, try making it so. Also, use a single-byte encoding if you can (LatinX is fine, Unicode not). Good thoughts I hadn't had. > Upgrade pg to 8.0.3 and make sure you have enough RAM for your real > day to day load. Newer PG definitely better. Some attention to the configuration parameters might also be called for. I fear though that these things are probably just chipping at the margins ... I don't expect 8.0.3 to be a major performance improvement. I do expect it to be a major _maintenance_ improvement for both him and those of us trying to help him ;-) The performance difference between not having the working set of the DB fit into RAM during ordinary operation vs having it be so (or better, having the whole DB fit into RAM during ordinary operation) has been considerably more effective than "chipping at the margins" IME. Especially so if the HD IO subsystem is wimpy. Ron Peacetree ---(end of broadcast)--- TIP 6: explain analyze is your friend
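Checking what Tom describes takes a few seconds from psql; locale and encoding are fixed at initdb/createdb time, so if these come back as anything other than C and a single-byte encoding, the fix is a dump and reload into a cluster initdb'd with --locale=C:

SHOW lc_collate;       -- want 'C' here for cheap text comparisons
SHOW lc_ctype;
SHOW server_encoding;  -- a single-byte encoding such as LATIN1 rather than UNICODE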
Re: [PERFORM] Massive performance issues
don't think I've set anything to UTF8 or such like. > > I need to get to the stage where I can run queries such as: > > select street, locality_1, locality_2, city from address > > where (city = 'Nottingham' or locality_2 = 'Nottingham' > >or locality_1 = 'Nottingham') > > and upper(substring(street from 1 for 1)) = 'A' > > group by street, locality_1, locality_2, city > > order by street > > limit 20 offset 0 > > This might be a lot quicker than pulling all the records like in your example > queries... Yes, that certainly does seem to be the case - around 4 seconds. But I need it to be 10 times faster (or thereabouts) otherwise I have big problems! *beats drum* Get it in RAM, Get it in RAM, ... Ron Peacetree ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] Massive performance issues
At 06:22 PM 9/1/2005, Matthew Sackman wrote: On Thu, Sep 01, 2005 at 06:05:43PM -0400, Ron wrote: > > Since I assume you are not going to run anything with the string > "unstable" in its name in production (?!), why not try a decent > production ready distro like SUSE 9.x and see how pg 8.0.3 runs on a > OS more representative of what you are likely (or at least what is > safe...) to run in production? Well, you see, as ever, it's a bit complicated. The company I'm doing the development for has been subcontracted to do it and the contractor was contracted by the actual "client". So there are two companies involved in addition to the "client". Sadly, the "client" actually has dictated things like "it will be deployed on FreeBSD and thou shall not argue". At least get them to promise they will use a release the BSD folks mark "stable"! At this point in time, I actually have very little information about the specification of the boxen that'll be running this application. This is something I'm hoping to solve very soon. The worst part of it is that I'm not going have direct (ssh) access to the box and all configuration changes will most likely have to be relayed through techies at the "client" so fine tuning this is going to be a veritable nightmare. IME, what you have actually just said is "It will not be possible to safely fine tune the DB unless or until I have direct access; and/or someone who does have direct access is correctly trained." Ick. > >> > I need to get to the stage where I can run queries such as: > >> > select street, locality_1, locality_2, city from address > >> > where (city = 'Nottingham' or locality_2 = 'Nottingham' > >> >or locality_1 = 'Nottingham') > >> > and upper(substring(street from 1 for 1)) = 'A' > >> > group by street, locality_1, locality_2, city > >> > order by street > >> > limit 20 offset 0 > >> > >> This might be a lot quicker than pulling all the records like in > >your example > >> queries... > > > >Yes, that certainly does seem to be the case - around 4 seconds. But I > >need it to be 10 times faster (or thereabouts) otherwise I have big > >problems! > > *beats drum* Get it in RAM, Get it in RAM, ... Ok, but I currently have 2 million rows. When this launches in a couple of weeks, it'll launch with 5 million+ and then gain > a million a year. At my previously mentioned optimum of 85B per row, 2M rows is 170MB. 5M rows is 425MB. Assuming the gain of 1M rows per year, that's +85MB per year for this table. Up to 2GB DIMMs are currently standard, and 4GB DIMMs are just in the process of being introduced. Mainboards with anything from 4 to 16 DIMM slots are widely available. IOW, given the description you've provided this DB should _always_ fit in RAM. Size the production system such that the entire DB fits into RAM during ordinary operation with an extra 1GB of RAM initially tossed on as a safety measure and the client will be upgrading the HW because it's obsolete before they run out of room in RAM. I think the upshot of this all is 4GB RAM as a minimum and judicious use of normalization so as to avoid more expensive string comparisons and reduce table size is my immediate plan (along with proper configuration of pg). My suggestion is only slightly different. Reduce table size(s) and up the RAM to the point where the whole DB fits comfortably in RAM. You've got the rare opportunity to build a practical Memory Resident Database. It should run like a banshee when you're done. I'd love to see the benches on the final product. 
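As a sanity check on the bytes-per-row arithmetic above, the system catalogs already hold the numbers needed (they are only as fresh as the last VACUUM/ANALYZE, and this assumes the default 8KB block size):

SELECT relname,
       relpages::bigint * 8192 AS approx_bytes_on_disk,
       reltuples,
       CASE WHEN reltuples > 0
            THEN (relpages::bigint * 8192) / reltuples
            ELSE 0
       END AS approx_bytes_per_row
FROM pg_class
WHERE relname = 'address';

If approx_bytes_per_row comes back near 1KB before the normalization and drops substantially afterwards, the table really will fit in a very modest amount of RAM.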
Ron Peacetree ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] 8.x index insert performance
At 09:43 AM 11/14/2005, Kelly Burkhart wrote: On Fri, 2005-11-11 at 18:02 -0500, Tom Lane wrote: > > There very well could be a pattern in the data which could affect > > things, however, I'm not sure how to identify it in 100K rows out of > > 100M. > > I conjecture that the problem areas represent places where the key > sequence is significantly "more random" than it is elsewhere. Hard > to be more specific than that though. > OK, I understand the pattern now. My two tables hold orders, and order state transitions. Most orders have two transitions: creation and termination. The problem happens when there is a significant number of orders where termination is happening a long time after creation, causing order_transition rows with old ord_id values to be inserted. This is valid, so I have to figure out a way to accommodate it.

Perhaps a small schema change would help? Instead of having the order state transitions explicitly listed in the table, why not create two new tables: 1 for created orders and 1 for terminated orders. When an order is created, its ord_id goes into the CreatedOrders table. When an order is terminated, its ord_id is added to the TerminatedOrders table and then deleted from the CreatedOrders table (a rough sketch follows below).

Downsides to this approach are some extra complexity and that you will have to make sure that system disaster recovery includes making sure that no ord_id appears in both the CreatedOrders and TerminatedOrders tables. Upsides are that the insert problem goes away and certain kinds of accounting and inventory reports are now easier to create.

Ron ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
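A rough sketch of the two table variant described above (table and column names are illustrative only, not Kelly's actual schema):

CREATE TABLE created_orders (
    ord_id     bigint PRIMARY KEY,
    created_at timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE terminated_orders (
    ord_id        bigint PRIMARY KEY,
    terminated_at timestamptz NOT NULL DEFAULT now()
);

-- terminating an order moves its id from one table to the other atomically
BEGIN;
INSERT INTO terminated_orders (ord_id) VALUES (12345);
DELETE FROM created_orders WHERE ord_id = 12345;
COMMIT;

Because created_orders only ever holds the currently open orders, its primary key index stays small and hot, which is the point of the exercise.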
Re: [PERFORM] Hardware/OS recommendations for large databases
Got some hard numbers to back your statement up? IME, the Areca 1160's with >= 1GB of cache beat any other commodity RAID controller. This seems to be in agreement with at least one independent testing source: http://print.tweakers.net/?reviews/557 RAID HW from Xyratex, Engino, or Dot Hill will _destroy_ any commodity HW solution, but their price point is considerably higher. ...on another note, I completely agree with the poster who says we need more cache on RAID controllers. We should all be beating on the RAID HW manufacturers to use standard DIMMs for their caches and to provide 2 standard DIMM slots in their full height cards (allowing for up to 8GB of cache using 2 4GB DIMMs as of this writing). It should also be noted that 64 drive chassis' are going to become possible once 2.5" 10Krpm SATA II and FC HDs become the standard next year (48's are the TOTL now). We need controller technology to keep up. Ron At 12:16 AM 11/16/2005, Alex Turner wrote: Not at random access in RAID 10 they aren't, and anyone with their head screwed on right is using RAID 10. The 9500S will still beat the Areca cards at RAID 10 database access patern. Alex. On 11/15/05, Dave Cramer <[EMAIL PROTECTED]> wrote: > Luke, > > Have you tried the areca cards, they are slightly faster yet. > > Dave > > On 15-Nov-05, at 7:09 AM, Luke Lonergan wrote: > > > > > > I agree - you can get a very good one from www.acmemicro.com or > > www.rackable.com with 8x 400GB SATA disks and the new 3Ware 9550SX SATA > > RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM > > on a Tyan 2882 motherboard. We get about 400MB/s sustained disk read > > performance on these (with tuning) on Linux using the xfs filesystem, > > which is one of the most critical factors for large databases. > > > > > Note that you want to have your DBMS use all of the CPU and disk channel > > bandwidth you have on each query, which takes a parallel database like > > Bizgres MPP to achieve. > > > > > Regards, > ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Hardware/OS recommendations for large databases
You _ARE_ kidding right? In what hallucination? The performance numbers for the 1GB cache version of the Areca 1160 are the _grey_ line in the figures, and were added after the original article was published: "Note: Since the original Dutch article was published in late January, we have finished tests of the 16-port Areca ARC-1160 using 128MB, 512MB and 1GB cache configurations and RAID 5 arrays of up to 12 drives. The ARC-1160 was using the latest 1.35 beta firmware. The performance graphs have been updated to include the ARC-1160 results. Discussions of the results have not been updated, however. " With 1GB of cache, the 1160's beat everything else in almost all of the tests they participated in. For the few where they do not win hands down, the Escalade's (very occasionally) essentially tie. These are very easy to read full color graphs where higher is better and the grey line representing the 1GB 1160's is almost always higher on the graph than anything else. Granted the Escalades seem to give them the overall best run for their money, but they still are clearly second best when looking at all the graphs and the CPU utilization numbers in aggregate. Ron At 12:08 PM 11/16/2005, Alex Turner wrote: Yes - that very benchmark shows that for a MySQL Datadrive in RAID 10, the 3ware controllers beat the Areca card. Alex. On 11/16/05, Ron <[EMAIL PROTECTED]> wrote: > Got some hard numbers to back your statement up? IME, the Areca > 1160's with >= 1GB of cache beat any other commodity RAID > controller. This seems to be in agreement with at least one > independent testing source: > > http://print.tweakers.net/?reviews/557 > > RAID HW from Xyratex, Engino, or Dot Hill will _destroy_ any > commodity HW solution, but their price point is considerably higher. > > ...on another note, I completely agree with the poster who says we > need more cache on RAID controllers. We should all be beating on the > RAID HW manufacturers to use standard DIMMs for their caches and to > provide 2 standard DIMM slots in their full height cards (allowing > for up to 8GB of cache using 2 4GB DIMMs as of this writing). > > It should also be noted that 64 drive chassis' are going to become > possible once 2.5" 10Krpm SATA II and FC HDs become the standard next > year (48's are the TOTL now). We need controller technology to keep up. > > Ron > > At 12:16 AM 11/16/2005, Alex Turner wrote: > >Not at random access in RAID 10 they aren't, and anyone with their > >head screwed on right is using RAID 10. The 9500S will still beat the > >Areca cards at RAID 10 database access patern. > > > >Alex. > > > >On 11/15/05, Dave Cramer <[EMAIL PROTECTED]> wrote: > > > Luke, > > > > > > Have you tried the areca cards, they are slightly faster yet. > > > > > > Dave > > > > > > On 15-Nov-05, at 7:09 AM, Luke Lonergan wrote: > > > > > > > > > > > > > > > > > > I agree - you can get a very good one from www.acmemicro.com or > > > > > > www.rackable.com with 8x 400GB SATA disks and the new 3Ware 9550SX SATA > > > > > > RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM > > > > > > on a Tyan 2882 motherboard. We get about 400MB/s sustained disk read > > > > > > performance on these (with tuning) on Linux using the xfs filesystem, > > > > > > which is one of the most critical factors for large databases. > > > > > > > > > > > > > > > Note that you want to have your DBMS use all of the CPU and disk channel > > > > > > bandwidth you have on each query, which takes a parallel database like > > > > > > Bizgres MPP to achieve. 
> > > > > > > > > > > > > > > Regards, > > > > > > >---(end of broadcast)--- > >TIP 2: Don't 'kill -9' the postmaster > > > > ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Hardware/OS recommendations for large databases
Amendment: there are graphs where the 1GB Areca 1160's do not do as well. Given that they are mySQL specific and that similar usage scenarios not involving mySQL (as well as most of the usage scenarios involving mySQL; as I said these did not follow the pattern of the rest of the benchmarks) show the usual pattern of the 1GB 1160's in 1st place or tied for 1st place, it seems reasonable that mySQL has something to due with the aberrant results in those 2 (IIRC) cases. Ron At 03:57 PM 11/16/2005, Ron wrote: You _ARE_ kidding right? In what hallucination? The performance numbers for the 1GB cache version of the Areca 1160 are the _grey_ line in the figures, and were added after the original article was published: "Note: Since the original Dutch article was published in late January, we have finished tests of the 16-port Areca ARC-1160 using 128MB, 512MB and 1GB cache configurations and RAID 5 arrays of up to 12 drives. The ARC-1160 was using the latest 1.35 beta firmware. The performance graphs have been updated to include the ARC-1160 results. Discussions of the results have not been updated, however. " With 1GB of cache, the 1160's beat everything else in almost all of the tests they participated in. For the few where they do not win hands down, the Escalade's (very occasionally) essentially tie. These are very easy to read full color graphs where higher is better and the grey line representing the 1GB 1160's is almost always higher on the graph than anything else. Granted the Escalades seem to give them the overall best run for their money, but they still are clearly second best when looking at all the graphs and the CPU utilization numbers in aggregate. Ron At 12:08 PM 11/16/2005, Alex Turner wrote: Yes - that very benchmark shows that for a MySQL Datadrive in RAID 10, the 3ware controllers beat the Areca card. Alex. On 11/16/05, Ron <[EMAIL PROTECTED]> wrote: > Got some hard numbers to back your statement up? IME, the Areca > 1160's with >= 1GB of cache beat any other commodity RAID > controller. This seems to be in agreement with at least one > independent testing source: > > http://print.tweakers.net/?reviews/557 > > RAID HW from Xyratex, Engino, or Dot Hill will _destroy_ any > commodity HW solution, but their price point is considerably higher. > > ...on another note, I completely agree with the poster who says we > need more cache on RAID controllers. We should all be beating on the > RAID HW manufacturers to use standard DIMMs for their caches and to > provide 2 standard DIMM slots in their full height cards (allowing > for up to 8GB of cache using 2 4GB DIMMs as of this writing). > > It should also be noted that 64 drive chassis' are going to become > possible once 2.5" 10Krpm SATA II and FC HDs become the standard next > year (48's are the TOTL now). We need controller technology to keep up. > > Ron > > At 12:16 AM 11/16/2005, Alex Turner wrote: > >Not at random access in RAID 10 they aren't, and anyone with their > >head screwed on right is using RAID 10. The 9500S will still beat the > >Areca cards at RAID 10 database access patern. > > > >Alex. > > > >On 11/15/05, Dave Cramer <[EMAIL PROTECTED]> wrote: > > > Luke, > > > > > > Have you tried the areca cards, they are slightly faster yet. 
> > > > > > Dave > > > > > > On 15-Nov-05, at 7:09 AM, Luke Lonergan wrote: > > > > > > > > > > > > > > > > > > I agree - you can get a very good one from www.acmemicro.com or > > > > > > www.rackable.com with 8x 400GB SATA disks and the new 3Ware 9550SX SATA > > > > > > RAID controller for about $6K with two Opteron 272 CPUs and 8GB of RAM > > > > > > on a Tyan 2882 motherboard. We get about 400MB/s sustained disk read > > > > > > performance on these (with tuning) on Linux using the xfs filesystem, > > > > > > which is one of the most critical factors for large databases. > > > > > > > > > > > > > > > Note that you want to have your DBMS use all of the CPU and disk channel > > > > > > bandwidth you have on each query, which takes a parallel database like > > > > > > Bizgres MPP to achieve. > > > > > > > > > > > > > > > Regards, > > > > > > >---(end of broadcast)--- > >TIP 2: Don't 'kill -9' the postmaster > > > > ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Hardware/OS recommendations for large databases
While I agree with you in principle that pg becomes CPU bound relatively easily compared to other DB products (at ~110-120MBps according to a recent thread), there's a bit of hyperbole in your post.

a. There's a big difference between the worst performing 1C x86 ISA CPU available and the best performing 2C one (IIRC, that's the 2.4GHz, 1MB L2 cache AMDx2 4800+ as of this writing).

b. Two 2C CPU's vs one 1C CPU means that a pg process will almost never be waiting on other non pg processes. It also means that 3-4 pg processes, CPU bound or not, can execute in parallel. Not an option with one 1C CPU.

c. Mainboards with support for multiple CPUs and lots of RAM are _not_ the cheap ones.

d. No one should ever use RAID 0 for valuable data. Ever. So at the least you need 4 HD's for a RAID 10 set (RAID 5 is not a good option unless write performance is unimportant. 4HD RAID 5 is particularly not a good option.)

e. The server usually needs to talk to things over a network connection. Often performance here matters. Mainboards with 2 1GbE NICs and/or PCI-X (or PCI-E) slots for 10GbE cards are not the cheap ones.

f. Trash HDs mean poor IO performance and lower reliability. While TOTL 15Krpm 4Gb FC HDs are usually overkill (not always; it depends on context), you at least want SATA II HDs with NCQ or TCQ support. And you want them to have a decent media warranty - preferably a 5 year one if you can get it. Again, these are not the cheapest HD's available.

g. Throughput limitations say nothing about latency considerations. OLTP-like systems _want_ HD spindles. AMAP. Even non OLTP-like systems need a fair number of spindles to optimize HD IO: dedicated WAL set, multiple dedicated DB sets, dedicated OS and swap space set, etc, etc. At 50MBps ASTR, you need 16 HD's operating in parallel to saturate the bandwidth of a PCI-X channel. That's ~8 independent pg tasks (queries using different tables, dedicated WAL IO, etc) running in parallel. Regardless of application domain.

h. Decent RAID controllers and HBAs are not cheap either. Even SW RAID benefits from having a big dedicated RAM buffer to talk to.

While the above may not cost you $80K, it sure isn't costing you $1K either. Maybe ~$15-$20K, but not $1K.

Ron

At 01:07 AM 11/18/2005, Luke Lonergan wrote: Greg, On 11/17/05 9:17 PM, "Greg Stark" <[EMAIL PROTECTED]> wrote: > Ok, a more productive point: it's not really the size of the database that > controls whether you're I/O bound or CPU bound. It's the available I/O > bandwidth versus your CPU speed. Postgres + Any x86 CPU from 2.4GHz up to Opteron 280 is CPU bound after 110MB/s of I/O. This is true of Postgres 7.4, 8.0 and 8.1. A $1,000 system with one CPU and two SATA disks in a software RAID0 will perform exactly the same as a $80,000 system with 8 dual core CPUs and the world's best SCSI RAID hardware on a large database for decision support (what the poster asked about). Regards, - Luke ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Hardware/OS recommendations for large databases
Breaking the ~120MBps pg IO ceiling by any means is an important result. Particularly when you get a ~2x improvement. I'm curious how far we can get using simple approaches like this.

At 10:13 AM 11/18/2005, Luke Lonergan wrote: Dave, On 11/18/05 5:00 AM, "Dave Cramer" <[EMAIL PROTECTED]> wrote: > > Now there's an interesting line drawn in the sand. I presume you have > numbers to back this up ? > > This should draw some interesting posts. Part 2: The answer

System A: This system is running RedHat 3 Update 4, with a Fedora 2.6.10 Linux kernel. On a single table with 15 columns (the Bizgres IVP) at a size double memory (2.12GB), Postgres 8.0.3 with Bizgres enhancements takes 32 seconds to scan the table: that's 66 MB/s. Not the efficiency I'd like from the onboard SATA controller; I would have expected to get 85% of the 100MB/s raw read performance.

Have you tried the large read ahead trick with this system? It would be interesting to see how much it would help. It might even be worth it to do the experiment at all of [default, 2x default, 4x default, 8x default, etc] read ahead until either a) you run out of resources to support the desired read ahead, or b) performance levels off. I can imagine the results being very enlightening.

System B: This system is running an XFS filesystem, and has been tuned to use very large (16MB) readahead. It's running the Centos 4.1 distro, which uses a Linux 2.6.9 kernel. Same test as above, but with 17GB of data takes 69.7 seconds to scan (!) That's 244.2MB/s, which is obviously double my earlier point of 110-120MB/s. This system is running with a 16MB Linux readahead setting, let's try it with the default (I think) setting of 256KB - AHA! Now we get 171.4 seconds or 99.3MB/s.

The above experiment would seem useful here as well (a quick way to measure the scan rate from inside psql is sketched below).

Summary: OK - you can get more I/O bandwidth out of the current I/O path for sequential scan if you tune the filesystem for large readahead. This is a cheap alternative to overhauling the executor to use asynch I/O. Still, there is a CPU limit here - this is not I/O bound, it is CPU limited as evidenced by the sensitivity to readahead settings. If the filesystem could do 1GB/s, you wouldn't go any faster than 244MB/s. - Luke

I respect your honesty in reporting results that were different from your expectations or previously taken stance. Alan Stange's comment re: the use of direct IO along with your comments re: async IO and mem copies plus the results of these experiments could very well point us directly at how to most easily solve pg's CPU boundness during IO. [HACKERS] are you watching this?

Ron ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
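For anyone repeating the readahead sweep, the scan rate is easy to measure from inside psql as well. A hedged sketch against a hypothetical table named bigtable; remember to flush or overflow the OS cache between runs, otherwise you are measuring RAM rather than the IO path:

\timing
-- table size in MB, assuming the default 8KB block size (128 blocks = 1MB)
SELECT relpages / 128 AS size_mb FROM pg_class WHERE relname = 'bigtable';

-- force a full sequential scan; MB/s = size_mb / elapsed seconds
SELECT count(*) FROM bigtable;

Re-run the count(*) after each readahead change and plot MB/s against the readahead setting; the knee of that curve is where the CPU, not the IO path, becomes the limit.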
Re: [PERFORM] weird performances problem
If I understand your HW config correctly, all of the pg stuff is on the same RAID 10 set? If so, give WAL its own dedicated RAID 10 set. This looks like the old problem of everything stalling while WAL is being committed to HD.

This concept works for other tables as well. If you have tables that both want service at the same time, disk arm contention will drag performance through the floor when they are on the same HW set. Profile your HD access (one way to see which tables actually hit the disk is sketched below) and put tables that want to be accessed at the same time on different HD sets. Even if you have to buy more HW to do it.

Ron

At 04:56 AM 11/22/2005, Guillaume Smet wrote: Qingqing Zhou wrote: Someone is doing a massive *write* at this time, which makes your query *read* quite slow. Can you find out which process is doing the write? Indexes should be in memory so I don't expect a massive write to slow down the select queries. sdb is the RAID10 array dedicated to our data so the postgresql process is the only one to write on it. I'll check which write queries are running because there should really be only a few updates/inserts on our db during the day. Over a four day log analysis, I have the following:
SELECT 403,964
INSERT 574
UPDATE 393
DELETE 26
So it's not really normal to have a massive write during the day. Thanks for your help -- Guillaume ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
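To see which tables (and indexes) are actually going to disk rather than being served from cache - and so are the candidates for their own spindles - the block level statistics views help. This assumes the stats collector is running with stats_block_level = on in postgresql.conf:

SELECT relname, heap_blks_read, heap_blks_hit, idx_blks_read, idx_blks_hit
FROM pg_statio_user_tables
ORDER BY heap_blks_read + idx_blks_read DESC
LIMIT 10;

Tables that show up at the top of this list at the same times of day are the ones fighting each other for the disk arms.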
Re: [PERFORM] weird performances problem
At 09:26 AM 11/22/2005, Guillaume Smet wrote: Ron wrote: If I understand your HW config correctly, all of the pg stuff is on the same RAID 10 set? No, the system and the WAL are on a RAID 1 array and the data on their own RAID 10 array.

As has been noted many times around here, put the WAL on its own dedicated HD's. You don't want any head movement on those HD's.

As I said earlier, there's only a few writes in the database so I'm not really sure the WAL can be a limitation: IIRC, it's only used for writes isn't it?

When a WAL checkpoint is reached, pg flushes all the dirty pages it has accumulated to HD... ...and anything else contending for those spindles suffers until that flush is done.

Don't you think we should have some io wait if the database was waiting for the WAL? We _never_ have any io wait on this server but our CPUs are still 30-40% idle.

_Something_ is doing long bursts of write IO on sdb and sdb1 every 30 minutes or so according to your previous posts. Profile your DBMS and find out what (one quick checkpoint test is sketched below).

A typical top we have on this server is:

15:22:39 up 24 days, 13:30, 2 users, load average: 3.86, 3.96, 3.99
156 processes: 153 sleeping, 3 running, 0 zombie, 0 stopped
CPU states:  cpu    user  nice  system  irq  softirq  iowait  idle
             total  50.6%  0.0%   4.7%  0.0%    0.6%    0.0%  43.8%
             cpu00  47.4%  0.0%   3.1%  0.3%    1.5%    0.0%  47.4%
             cpu01  43.7%  0.0%   3.7%  0.0%    0.5%    0.0%  51.8%
             cpu02  58.9%  0.0%   7.7%  0.0%    0.1%    0.0%  33.0%
             cpu03  52.5%  0.0%   4.1%  0.0%    0.1%    0.0%  43.0%
Mem: 3857224k av, 3307416k used, 549808k free, 0k shrd, 80640k buff
     2224424k actv, 482552k in_d, 49416k in_c
Swap: 4281272k av, 10032k used, 4271240k free, 2602424k cached

As you can see, we don't swap, we have free memory, we have all our data cached (our database size is 1.5 GB). Context switches are between 10,000 and 20,000 per second.

That's actually a reasonably high CS rate. Again, why?

This concept works for other tables as well. If you have tables that both want service at the same time, disk arm contention will drag performance through the floor when they are on the same HW set. Profile your HD access and put tables that want to be accessed at the same time on different HD sets. Even if you have to buy more HW to do it.

I use iostat and I can only see a little write activity and no read activity on both raid arrays.

Remember it's not just the overall amount, it's _when_ and _where_ the write activity takes place. If you have almost no write activity, but whenever it happens it all happens to the same place by multiple things contending for the same HDs, your performance during that time will be poor. Since the behavior you are describing fits that cause very well, I'd see if you can verify that's what's going on.

Ron ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
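One cheap way to test whether checkpoints are behind those bursts: force one by hand while watching the arrays, and see if the same write pattern shows up on sdb. CHECKPOINT is superuser only:

-- run from psql as a superuser while iostat is running in another window;
-- if sdb lights up the same way it does during the slowdowns,
-- checkpoints are the culprit
CHECKPOINT;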
Re: [PERFORM] weird performances problem
At 10:26 AM 11/22/2005, Guillaume Smet wrote: Ron, First of all, thanks for your time.

Happy to help.

As has been noted many times around here, put the WAL on its own dedicated HD's. You don't want any head movement on those HD's.

Yep, I know that. It's just that we supposed it was not so important for a nearly readonly database, which is wrong according to you.

It's just good practice with pg that pg_xlog should always get its own dedicated HD set. OTOH, I'm not at all convinced given the scant evidence so far that it is the primary problem here; particularly since, if I understand you correctly, pg_xlog is not on sdb or sdb1 where you are having the write storm.

_Something_ is doing long bursts of write IO on sdb and sdb1 every 30 minutes or so according to your previous posts.

It's not every 30 minutes. It's a 20-30 minute slowdown 3-4 times a day when we have a high load.

Thanks for the correction, and I apologize for the misunderstanding. Clearly the first step is to instrument sdb and sdb1 so that you understand exactly what is being accessed and written on them.

Possibilities that come to mind:
a) Are some of your sorts requiring more than 32MB during high load? If sorts do not usually require HD temp files and suddenly do, you could get behavior like yours (a quick work_mem experiment is sketched below).
b) Are you doing too many 32MB sorts during high load? Same comment as above.
c) Are you doing some sort of data updates or appends during high load that do not normally occur?
d) Are you constantly doing "a little" write IO that turns into a write storm under high load because of queuing issues?

Put in the scaffolding needed to trace _exactly_ what's happening on sdb and sdb1 throughout the day and then examine the traces over a day, a few days, and a week. I'll bet you will notice some patterns that will be helpful in identifying what's going on.

Ron ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
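For points (a) and (b), the current sort memory setting is easy to check and to raise for a single session as an experiment (the parameter is work_mem on 8.0 and later, sort_mem on older releases; the value is in KB on these versions):

SHOW work_mem;          -- 32768 here would match the 32MB mentioned above

-- raise it for this session only, then re-run one of the slow reports
SET work_mem = 131072;  -- 128MB, purely as a test

If the slowdown shrinks when the big reporting queries get more sort memory, the write bursts on sdb are almost certainly sort/temp file spill.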
Re: [PERFORM] Open request for benchmarking input
At 03:15 PM 11/26/2005, Luke Lonergan wrote: I suggest specifying a set of basic system / HW benchmarks to baseline the hardware before each benchmark is run. This has proven to be a major issue with most performance tests. My pick for I/O is bonnie++. Your equipment allows you the opportunity to benchmark all 5 machines running together as a cluster - this is important to measure maturity of solutions for high performance warehousing. Greenplum can provide you a license for Bizgres MPP for this purpose.

...and detailed config / tuning specs as well for it, or everyone is probably wasting their time. For instance, it seems fairly clear that the default 8KB block size and default read ahead size are both pessimal, at least for non OLTP-like apps. In addition, there's been a reasonable amount of evidence that xfs should be the file system of choice for pg. Things like optimal RAID stripe size, how to allocate tables to various IO HW, and what levels of RAID to use for each RAID set also have to be defined.

The 16x SATA drives should be great, provided you have a high performance RAID adapter configured properly. You should be able to get 800MB/s of sequential scan performance by using a card like the 3Ware 9550SX. I've also heard that the Areca cards are good (how good?). Configuration of the I/O must be validated though - I've seen as low as 25MB/s from a misconfigured system.

The Areca cards, particularly with 1-2GB of buffer cache, are the current commodity RAID controller performance leader. Better performance can be gotten out of HW from vendors like Xyratex, but it will cost much more.

Ron ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] Hardware/OS recommendations for large databases
At 01:18 AM 11/27/2005, Luke Lonergan wrote: For data warehousing it's pretty well open and shut. To use all cpus and io channels on each query you will need mpp. Has anyone done the math on the original post? 5TB takes how long to scan once? If you want to wait less than a couple of days just for a seq scan, you'd better be in the multi-gb per second range.

More than a bit of hyperbole there Luke. Some common RW scenarios:

Dual 1GbE NICs => 200MBps => 5TB in 5x10^12/2x10^8= 25000secs= ~6hrs57mins. Network stuff like re-transmits of dropped packets can increase this, so network SLA's are critical.

Dual 10GbE NICs => ~1.6GBps (10GbE NICs can't yet do over ~800MBps apiece) => 5x10^12/1.6x10^9= 3125secs= ~52mins. SLA's are even more critical here.

If you are pushing 5TB around on a regular basis, you are not wasting your time & money on commodity <= 300MBps RAID HW. You'll be using 800MBps and 1600MBps high end stuff, which means you'll need ~1-2hrs to sequentially scan 5TB on physical media. Clever use of RAM can get a 5TB sequential scan down to ~17mins.

Yes, it's a lot of data. But sequential scan times should be in the mins or low single digit hours, not days. Particularly if you use RAM to maximum advantage.

Ron ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Hardware/OS recommendations for large databases
At 02:11 PM 11/27/2005, Luke Lonergan wrote: Ron, On 11/27/05 9:10 AM, "Ron" <[EMAIL PROTECTED]> wrote: > Clever use of RAM can get a 5TB sequential scan down to ~17mins. > > Yes, it's a lot of data. But sequential scan times should be in the > mins or low single digit hours, not days. Particularly if you use > RAM to maximum advantage. Unfortunately, RAM doesn't help with scanning from disk at all. I agree with you if you are scanning a table "cold", having never loaded it before, or if the system is not (or can't be) set up properly with appropriate buffers. However, outside of those 2 cases there are often tricks you can use with enough RAM (and no, you don't need RAM equal to the size of the item(s) being scanned) to substantially speed things up. Best case, you can get performance approximately equal to that of a RAM resident scan. WRT using network interfaces to help - it's interesting, but I think what you'd want to connect to is other machines with storage on them. Maybe. Or maybe you want to concentrate your storage in a farm that is connected by network or Fiber Channel to the rest of your HW. That's what a NAS or SAN is after all. "The rest of your HW" nowadays is often a cluster of RAM rich hosts. Assuming 64GB per host, 5TB can be split across ~79 hosts if you want to make it all RAM resident. Most don't have that kind of budget, but thankfully it is not usually necessary to make all of the data RAM resident in order to obtain if not all of the performance benefits you'd get if all of the data was. Ron ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
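The "~79 hosts" figure is just a ceiling division. A two-line check, ignoring the RAM each host needs for its own OS and database processes (which pushes the real number somewhat higher):

# hosts needed to hold 5TB entirely in RAM at 64GB usable per host
import math
print(math.ceil(5e12 / 64e9))   # 78.125 rounds up to 79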
Re: [PERFORM] pg_dump slow
At 08:35 AM 11/30/2005, Franklin Haut wrote: Hi, I'm using PostgreSQL on Windows 2000; pg_dump takes around 50 minutes to back up 200MB of data (with no compression, and 15MB with compression),

Compression is reducing the data to 15/200 = 3/40 = 7.5% of original size?

but on Windows XP it takes no more than 40 seconds... :(

You mean that pg_dump dies after 40 secs under WinXP, and therefore you have a WinXP problem? Or do you mean that pg_dump takes 40 secs to complete under WinXP and 50 minutes under W2K, and therefore you have a W2K problem? In fact, either 15MB/40secs = 375KBps or 200MB/40secs = 5MBps is _slow_, so there's a problem under either platform!

This happens with both 8.1 and 8.0 -- has anybody else run into the same situation? Could it be a configuration issue in the priorities of the existing processes? On Windows XP the process runs at about 70% CPU with constant accesses to the hard disk, while on Windows 2000 it does not exceed 3%.

Assuming Win XP completes the dump, the first thing to do is *don't use W2K*. M$ has stopped supporting it in anything but absolutely minimum fashion anyway. _If_ you are going to use an M$ OS you should be using WinXP. (You want to pay licensing fees for your OS, but you are using free DB SW? Huh? If you are trying to save $$$, use Open Source SW like Linux or *BSD. pg will perform better under it, and it's cheaper!)

Assuming that for some reason you can't/won't migrate to a non-M$ OS, the next problem is the slow HD IO you are getting under WinXP. What is the HW involved here? Particularly the HD subsystem and the IO bus(es) it is plugged into? For some perspective, raw HD average IO rates for even reasonably modern 7200rpm HD's are in the ~50MBps per HD range. Top of the line 15Krpm SCSI and FC HD's have raw average IO rates of just under 80MBps per HD as of this post. Given that most DB's are not on 1 HD (if your DB _is_ on only 1 HD, change that ASAP before you lose data...), for anything other than a 2 HD RAID 1 set I'd expect raw HD average IO rates to be at least 100MBps. If you are getting >= 100MBps of average HD IO, you should be getting > 5MBps during pg_dump, and certainly > 375KBps!

Ron ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
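When comparing the two platforms, it helps to measure the dump the same way on both. A minimal sketch that times pg_dump and reports effective throughput is below; the database name, output file, and the assumption that pg_dump is on the PATH are all examples.

# dump_rate.py -- time pg_dump and report effective MB/s so the W2K and
# WinXP numbers can be compared on equal terms. DBNAME and OUTFILE are
# placeholders; add -Fc to the command to measure the compressed case.
import os, subprocess, time

DBNAME, OUTFILE = "mydb", "mydb.dump"

t0 = time.time()
subprocess.check_call(["pg_dump", "-f", OUTFILE, DBNAME])
elapsed = time.time() - t0
mb = os.path.getsize(OUTFILE) / (1024.0 * 1024.0)
print("%.1f MB in %.1f secs = %.2f MB/s" % (mb, elapsed, mb / elapsed))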
Re: RES: [PERFORM] pg_dump slow
At 12:27 PM 11/30/2005, Richard Huxton wrote: Franklin Haut wrote: Hi, Yes, my problem is that the pg_dump takes 40 secs to complete under WinXP and 50 minutes under W2K! The same database, the same hardware!, only diferrent Operational Systems. The hardware is:Pentium4 HT 3.2 GHz 1024 MB Memory Get the RAM up to at least 4096MB= 4GB for a DB server. 4 1GB DIMMs or 2 2GB DIMMS are ~ the same $$ as a HD (~$250-$300 US) and well worth the expense. HD 120GB SATA "b" is "bit". "B" is "Byte". I made the correction. You have =1= HD? and you are using it for everything: OS, pq, swap, etc? Very Bad Idea. At the very least, a DB server should have the OS on separate spindles from pg, and pg tables should be on something like a 4 HD RAID 10. At the very least. DB servers are about HDs. Lots and lots of HDs compared to anything outside the DB realm. Start thinking in terms of at least 6+ HD's attached to the system in question (I've worked on system with literally 100's). Usually only a few of these are directly attached to the DB server and most are attached by LAN or FC. But the point remains: DBs and DB servers eat HDs in prodigious quantities. There have been reports of very slow network performance on Win2k systems with the default configuration. You'll have to check the archives for details I'm afraid. This might apply to you. Unless you are doing IO across a network, this issue will not apply to you. By default W2K systems often had a default TCP/IP packet size of 576B and a tiny RWIN. Optimal for analog modems talking over noisy POTS lines, but horrible for everything else Packet size needs to be boosted to 1500B, the maximum. RWIN should be boosted to _at least_ the largest number <= 2^16 that you can use without TCP scaling. Benchmark network IO rates. Then TCP scaling should be turned on and RWIN doubled and network IO benched again. Repeat until there is no performance benefit to doubling RWIN or you run out of RAM that you can afford to toss at the problem or you hit the max for RWIN (very doubtful). If you're happy that doesn't affect you then I'd look at the disk system - perhaps XP has newer drivers than Win2k. I'll reiterate: Do _not_ run a production DB server on W2K. M$ has obsoleted the platform and that it is not supported _nor_ any of reliable, secure, etc. etc. A W2K based DB server, particularly one with a connection to the Internet, is a ticking time bomb at this point. Get off W2K as a production platform ASAP. Take to your CEO/Dean/whatever you call your Fearless Leader if you have to. Economically and probably performance wise, it's best to use an Open Source OS like Linux or *BSD. However, if you must use M$, at least use OS's that M$ is actively supporting. Despite M$ marketing propaganda and a post in this thread to the contrary, you =CAN= often run a production DB server under WinXP and not pay M$ their usurious licensing fees for W2003 Server or any of their other products with "server" in the title. How much RAM and how many CPUs you want in your DB server is the main issue. For a 1P, <= 4GB RAM vanilla box, WinXp will work just fine. What do the MS performance-charts show is happening? Specifically, CPU and disk I/O. His original post said ~3% CPU under W2K and ~70% CPU under WinXP Ron ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: RES: [PERFORM] pg_dump slow
ral or illegal for anyone to use the industry standard layered architecture of having a DB connection layer separate from a Queuing system. M$MQ is provided _specifically_ for that use. Certainly "twiddling the bits" inside a M$ OS violates the EULA, and I'm not advocating anything of the kind. OTOH, that Draconian EULA is yet _another_ reason to get rid of M$ OS's in one's organization. When I buy something, it is _mine_. You can tell me you won't support it if I modify it, but it's the height of Hubris to tell me that I'm not allowed to modify SW I paid for and own. Tell your managers/employers at M$ that Customer Service and Respecting Customers =keeps= customers. The reverse loses them. Period. > how many CPUs you want in your DB server is the main issue. For a > 1P, <= 4GB RAM vanilla box, WinXp will work just fine. Now, who is guilty of propaganda here? There is no propaganda here. The statement is accurate in terms of the information given. The biggest differentiations among M$ licenses is the CPU and RAM limit. Also, your comments regarding hard disks while correct in the general sense are not helpful. This is clearly not a disk bandwidth problem. As Evidenced By? His IO numbers are p*ss poor for any reasonable RAID setup, and 375KBps is bad even for a single HD. He's claiming this is local IO, not network, so that possibility is out. If you feel this is "clearly not a disk bandwidth problem", I fail to see your evidence or your alternative hypothesis. > >What do the MS performance-charts show is happening? Specifically, > >CPU and disk I/O. > His original post said ~3% CPU under W2K and ~70% CPU under WinXP Slow performance in extraction of bytea column strongly suggests tcp/ip. issue. I bet if you blanked out bytea column pg_dump will be fast. Franlin: are you making pg_dump from local or remote box and is this a clean install? Try fresh patched win2k install and see what happens. He claimed this was local, not network. It is certainly an intriguing possibility that W2K and WinXP handle bytea differently. I'm not competent to comment on that however. Ron ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] 15,000 tables
Agreed. Also, the odds of fs corruption or data loss are higher in a non-journaling fs. Best practice seems to be to use a journaling fs but to put the fs log on dedicated spindles separate from the actual fs or pg_xlog.

Ron

At 01:40 PM 12/1/2005, Tino Wildenhain wrote: On Thursday, 2005-12-01 at 10:07 -0800, Gavin M. Roy wrote: > Hi Michael, > > I'm a fan of ReiserFS, and I could be wrong, but I believe using a > journaling filesystem for the PgSQL database could be slowing things > down. Have a 200G+ database, someone pulling the power plug or a regular reboot after a year or so. Wait for the fsck to finish. Now think again :-) ++Tino

---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] 15,000 tables
Agreed, and I apologize for the imprecision of my post below. I should have written: "Best practice seems to be to use a journaling fs and log metadata only and put it on separate dedicated spindles." I've seen enough HD failures that I tend to be paranoid and log the metadata of fs dedicated to WAL as well, but that may very well be overkill. Ron At 01:57 PM 12/1/2005, Tom Lane wrote: Ron <[EMAIL PROTECTED]> writes: > Agreed. Also the odds of fs corruption or data loss are higher in a > non journaling fs. Best practice seems to be to use a journaling fs > but to put the fs log on dedicated spindles separate from the actual > fs or pg_xlog. I think we've determined that best practice is to journal metadata only (not file contents) on PG data filesystems. PG does expect the filesystem to remember where the files are, so you need metadata protection, but journalling file content updates is redundant with PG's own WAL logging. On a filesystem dedicated to WAL, you probably do not need any filesystem journalling at all --- we manage the WAL files in a way that avoids changing metadata for a WAL file that's in active use. A conservative approach would be to journal metadata here too, though. regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] two disks - best way to use them?
At 01:58 PM 12/2/2005, Rick Schumeyer wrote: I installed another drive in my linux pc in an attempt to improve performance on a large COPY to a table with a geometry index. Based on previous discussion, it seems there are three things competing for the hard drive: 1) the input data file 2) the pg table 3) the WAL What is the best way to distribute these among two drives? From Tom's comments I would think that the pg table and the WAL should be separate. Does it matter where the input data is? Best is to have 3 HD or HD sets, one for each of the above. With only 2, and assuming the input file is too large to fit completely into RAM at once, I'd test to see whether: a= input on one + pg table & WAL on the other, or b= WAL on one + pg table & input file on the other is best. If the input file can be made 100% RAM resident, then use c= pg table on one + WAL and input file on the other. The big goal here is to minimize HD head seeks. Ron ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
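For option b (WAL on one drive, pg table plus input file on the other), the usual way to move the WAL with this vintage of PostgreSQL is to relocate pg_xlog and leave a symlink behind. A sketch, with hypothetical paths and the obvious caveat that the postmaster must be stopped first:

# move_xlog.py -- put pg_xlog on the second drive and symlink it back
# into the data directory. Run only with the postmaster shut down.
# DATA_DIR and NEW_HOME are assumptions; NEW_HOME must not already exist.
import os, shutil

DATA_DIR = "/var/lib/pgsql/data"       # hypothetical data directory
NEW_HOME = "/mnt/disk2/pg_xlog"        # location on the second drive

old = os.path.join(DATA_DIR, "pg_xlog")
shutil.move(old, NEW_HOME)             # moves the WAL segments across
os.symlink(NEW_HOME, old)              # the postmaster follows the symlink
print("pg_xlog now lives on", NEW_HOME)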
Re: [PERFORM] BLCKSZ
At 04:32 PM 12/5/2005, Olleg wrote: Tom Lane wrote: Olleg Samoylov <[EMAIL PROTECTED]> writes: I tried to test this. Linux, memory page 4kb, disk page 4kb. I set BLCKSZ to 4kb. I got some performance improvement, but not big, maybe because I have 4Gb on the test server (amd64).

It's highly unlikely that reducing BLCKSZ is a good idea. There are bad side-effects on the maximum index entry size, maximum number of tuple fields, etc.

Yes, when I set BLCKSZ=512 the database doesn't work. With BLCKSZ=1024 the database is very slow. (This surprised me. I expected an 8x performance increase with 1024 BLCKSZ. :) )

No wonder pg did not work or was very slow. BLCKSZ= 512 or 1024 means 512 or 1024 *Bytes* respectively. That's 1/16 and 1/8 the default 8KB BLCKSZ.

As I have already seen on this mailing list, increasing BLCKSZ can reduce performance too.

Where? BLCKSZ as large as the 32KB compile-time maximum has been shown to improve performance. If running a RAID, BLCKSZ of ~1/2 the RAID stripe size seems to be a good value.

Maybe an optimum value exists? Theoretically, BLCKSZ equal to the memory/disk page/block size may reduce the defragmentation drawback of memory and disk.

Of course there's an optimal value... ...and of course it is dependent on your HW, OS, and DB application. In general, and in a very fuzzy sense, "bigger is better". pg files are laid down in 1GB chunks, so there's probably one limitation. Given the HW you have mentioned, I'd try BLCKSZ= 32768 (the maximum; you have to recompile PostgreSQL to change it) and a RAID stripe of 64KB or 128KB as a first guess.

In any case, when you didn't say *what* you tested, it's impossible to judge the usefulness of the change. regards, tom lane

I tested performance on a database test server. This is a copy of a working billing system used to test new features and experiments. The test task was one day's traffic log. Average time of one test was 260 minutes.

How large is a record in your billing system? You want it to be an integer divisor of BLCKSZ (so, for instance, odd sizes in Bytes are BAD). Beyond that, your application domain matters. OLTP-like systems need low latency access for frequent small transactions. Data mining-like systems need to do IO in as big a chunk as the HW and OS will allow. Probably a good idea for BLCKSZ to be _at least_ max(8KB, 2x record size).

Postgresql 7.4.8. Server dual Opteron 240, 4Gb RAM.

_Especially_ with that HW, upgrade to at least 8.0.x ASAP. It's a good idea to not be running pg 7.x anymore anyway, but it's particularly so if you are running 64b SMP boxes.

Ron ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
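To see how a given record size packs into candidate block sizes, a rough calculation like the one below is enough for a first pass. It ignores the page header, item pointers, and per-tuple header overhead (which all shrink the usable space), and the 613-byte record size is purely an example.

# blcksz_fit.py -- records per block and bytes left over per block for a
# few candidate BLCKSZ values. Overheads are ignored, so real packing is
# somewhat worse than shown.
def fit(blcksz, record_bytes):
    per_block = blcksz // record_bytes
    return per_block, blcksz - per_block * record_bytes

RECORD = 613   # hypothetical billing record size in bytes
for blcksz in (8192, 16384, 32768):
    n, waste = fit(blcksz, RECORD)
    print("BLCKSZ %5d: %3d records/block, %4d bytes left over" % (blcksz, n, waste))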
Re: [PERFORM] two disks - best way to use them?
At 12:52 AM 12/6/2005, Thomas Harold wrote: David Lang wrote: in that case you logicly have two disks, so see the post from Ron earlier in this thread. And it's a very nice performance gain. Percent spent waiting according to "top" is down around 10-20% instead of 80-90%. While I'm not prepared to benchmark, database performance is way up. The client machines that are writing the data are running closer to 100% CPU (before they were well below 50% CPU utilization). For accuracy's sake, which exact config did you finally use? How did you choose the config you finally used? Did you test the three options or just pick one? Ron ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Can this query go faster???
At 04:43 AM 12/6/2005, Joost Kraaijeveld wrote: Hi, Is it possible to get this query run faster than it does now, by adding indexes, changing the query? SELECT customers.objectid FROM prototype.customers, prototype.addresses WHERE customers.contactaddress = addresses.objectid ORDER BY zipCode asc, housenumber asc LIMIT 1 OFFSET 283745 Explain: Limit (cost=90956.71..90956.71 rows=1 width=55) -> Sort (cost=90247.34..91169.63 rows=368915 width=55) Sort Key: addresses.zipcode, addresses.housenumber -> Hash Join (cost=14598.44..56135.75 rows=368915 width=55) Hash Cond: ("outer".contactaddress = "inner".objectid) -> Seq Scan on customers (cost=0.00..31392.15 rows=368915 width=80) -> Hash (cost=13675.15..13675.15 rows=369315 width=55) -> Seq Scan on addresses (cost=0.00..13675.15 rows=369315 width=55) The customers table has an index on contactaddress and objectid. The addresses table has an index on zipcode+housenumber and objectid. TIA customer names, customers.objectid, addresses, and addresses.objectid should all be static (addresses do not change, just the customers associated with them; and once a customer has been assigned an id that better never change...). To me, this sounds like the addresses and customers tables should be duplicated and then physically laid out in sorted order by .objectid in one set and by the "human friendly" associated string in the other set. Then a finding a specific .objectid or it's associated string can be done in at worse O(lgn) time assuming binary search instead of O(n) time for a sequential scan. If pg is clever enough, it might be able to do better than that. IOW, I'd try duplicating the addresses and customers tables and using the appropriate CLUSTERed Index on each. I know this breaks Normal Form. OTOH, this kind of thing is common practice for data mining problems on static or almost static data. Hope this is helpful, Ron ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] What's the best hardver for PostgreSQL 8.1?
At 02:50 PM 12/24/2005, Frank Wiles wrote: On Wed, 21 Dec 2005 22:31:54 -0500 Juan Casero <[EMAIL PROTECTED]> wrote: > Sorry folks. I had a couple of glasses of wine as I wrote this. > Anyway I originally wanted the box to have more than two drives so I > could do RAID 5 but that is going to cost too much. Also, contrary > to my statement below it seems to me I should run the 32 bit > postgresql server on the 64 bit kernel. Would you agree this will > probably yield the best performance?I know it depends alot on the > system but for now this database is about 20 gigabytes. Not too large > right now but it may grow 5x in the next year. You definitely DO NOT want to do RAID 5 on a database server. That is probably the worst setup you could have, I've seen it have lower performance than just a single hard disk. RAID 1 and RAID 1+0 are optimal, but you want to stay far away from RAID 5. IMHO RAID 5 is only useful on near line backup servers or Samba file servers where space is more important than speed. That's a bit misleading. RAID 5 excels when you want read speed but don't care as much about write speed. Writes are typical ~2/3 the speed of reads on a typical decent RAID 5 set up. Side Note: Some years ago Mylex had a family of fast (for the time) RAID 5 HW controllers that actually read and wrote at the same speed. IBM bought them to kill them and protect LSI Logic. Mylex X24's (?IIRC the model number correctly?) are still reasonable HW. So if you have tables that are read often and written to rarely or not at all, putting them on RAID 5 is optimal. In both data mining like and OLTP like apps there are usually at least some such tables. RAID 1 is good for stuff where speed doesn't matter and all you are looking for is an insurance policy. RAID 10 is the best way to get high performance on both reads and writes, but it has a significantly greater cost for the same amount of usable physical media. If you've got the budget or are dealing with small enough physical storage needs, by all means use RAID 10. OTOH, if you are dealing with large enterprise class apps like Sarbanes Oxley compliance, medical and/or insurance, etc, etc, the storage needs can get so large that RAID 10 for everything or even most things is not possible. Even if economically feasible. RAID levels are like any other tool. Each is useful in the proper circumstances. Happy holidays, Ron Peacetree ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] What's the best hardver for PostgreSQL 8.1?
At 04:42 PM 12/24/2005, Joshua D. Drake wrote: If you've got the budget or are dealing with small enough physical storage needs, by all means use RAID 10. OTOH, if you are dealing with large enterprise class apps like Sarbanes Oxley compliance, medical and/or insurance, etc, etc, the storage needs can get so large that RAID 10 for everything or even most things is not possible. Even if economically feasible. RAID levels are like any other tool. Each is useful in the proper circumstances. There is also RAID 50 which is quite nice. The "quite nice" part that Joshua is referring to is that RAID 50 gets most of the write performance of RAID 10 w/o using nearly as many HD's as RAID 10. OTOH, there still is a significant increase in the number of HD's used, and that means MBTF's become more frequent but you are not getting protection levels you would with RAID 10. IME RAID 50 gets mixed reviews. My two biggest issues are a= Admin of RAID 50 is more complex than the other commonly used versions (1, 10, 5, and 6) b= Once a HD failure takes place, you suffer a _permenent_ performance drop, even after the automatic volume rebuild, until you take the entire RAID 50 array off line, reinitialize it, and rebuild it from scratch. IME "a" and "b" make RAID 50 inappropriate for any but the biggest and most dedicated of DB admin groups. YMMV, Ron ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] What's the best hardver for PostgreSQL 8.1?
At 04:54 PM 12/24/2005, David Lang wrote: raid 5 is bad for random writes as you state, but how does it do for sequential writes (for example data mining where you do a large import at one time, but seldom do other updates). I'm assuming a controller with a reasonable amount of battery-backed cache. The issue with RAID 5 writes centers on the need to recalculate checksums for the ECC blocks distributed across the array and then write the new ones to physical media. Caches help, and the bigger the cache the better, but once you are doing enough writes fast enough (and that doesn't take much even with a few GBs of cache) the recalculate-checksums-and-write-new-ones overhead will decrease the write speed of real data. Bear in mind that the HD's _raw_ write speed hasn't been decreased. Those HD's are pounding away as fast as they can for you. Your _effective_ or _data level_ write speed is what decreases due to overhead. Side Note: people often forget the other big reason to use RAID 10 over RAID 5. RAID 5 is always only 2 HD failures from data loss. RAID 10 can lose up to 1/2 the HD's in the array w/o data loss unless you get unlucky and lose both members of a RAID 1 set. This can be seen as an example of the classic space vs. time trade off in performance tuning. You can use 2x the HDs you need and implement RAID 10 for best performance and reliability or you can dedicate less HD's to RAID and implement RAID 5 for less (write) performance and lower reliability. TANSTAAFL. Ron Peacetree ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
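The read-modify-write overhead described above is easy to put numbers on. For random writes too small to fill a stripe, the textbook accounting is 4 physical I/Os per logical write on RAID 5 (read old data, read old parity, write both back) versus 2 on RAID 10 (one write per mirror half). A sketch, assuming ~150 random IOPS per drive and no controller cache; caches and full-stripe writes soften the RAID 5 penalty but do not remove it under sustained load.

# raid_write_iops.py -- small-random-write throughput for RAID 10 vs
# RAID 5 on the same number of spindles. 150 IOPS/drive is a rough
# figure for a 10K rpm disk.
def raid10_writes(disks, iops):
    return disks * iops / 2.0   # 2 physical writes per logical write

def raid5_writes(disks, iops):
    return disks * iops / 4.0   # read data + read parity + 2 writes back

for n in (6, 8, 14):
    print("%2d disks: RAID 10 ~%4.0f writes/s, RAID 5 ~%4.0f writes/s"
          % (n, raid10_writes(n, 150), raid5_writes(n, 150)))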
Re: [PERFORM] What's the best hardver for PostgreSQL 8.1?
At 08:35 AM 12/27/2005, Michael Stone wrote: On Mon, Dec 26, 2005 at 10:11:00AM -0800, David Lang wrote: what slows down raid 5 is that to modify a block you have to read blocks from all your drives to re-calculate the parity. this interleaving of reads and writes when all you are logicly doing is writes can really hurt. (this is why I asked the question that got us off on this tangent, when doing new writes to an array you don't have to read the blocks as they are blank, assuming your cacheing is enough so that you can write blocksize*n before the system starts actually writing the data) Correct; there's no reason for the controller to read anything back if your write will fill a complete stripe. That's why I said that there isn't a "RAID 5 penalty" assuming you've got a reasonably fast controller and you're doing large sequential writes (or have enough cache that random writes can be batched as large sequential writes). Sorry. A decade+ RWE in production with RAID 5 using controllers as bad as Adaptec and as good as Mylex, Chaparral, LSI Logic (including their Engino stuff), and Xyratex under 5 different OS's (Sun, Linux, M$, DEC, HP) on each of Oracle, SQL Server, DB2, mySQL, and pg shows that RAID 5 writes are slower than RAID 5 reads With the one notable exception of the Mylex controller that was so good IBM bought Mylex to put them out of business. Enough IO load, random or sequential, will cause the effect no matter how much cache you have or how fast the controller is. The even bigger problem that everyone is ignoring here is that large RAID 5's spend increasingly larger percentages of their time with 1 failed HD in them. The math of having that many HDs operating simultaneously 24x7 makes it inevitable. This means you are operating in degraded mode an increasingly larger percentage of the time under exactly the circumstance you least want to be. In addition, you are =one= HD failure from data loss on that array an increasingly larger percentage of the time under exactly the least circumstances you want to be. RAID 5 is not a silver bullet. On Mon, Dec 26, 2005 at 06:04:40PM -0500, Alex Turner wrote: Yes, but those blocks in RAID 10 are largely irrelevant as they are to independant disks. In RAID 5 you have to write parity to an 'active' drive that is part of the stripe. Once again, this doesn't make any sense. Can you explain which parts of a RAID 10 array are inactive? I agree totally that the read+parity-calc+write in the worst case is totaly bad, which is why I alway recommend people should _never ever_ use RAID 5. In this day and age of large capacity chassis, and large capacity SATA drives, RAID 5 is totally inapropriate IMHO for _any_ application least of all databases. I vote with Michael here. This is an extreme position to take that can't be followed under many circumstances ITRW. So I've got a 14 drive chassis full of 300G SATA disks and need at least 3.5TB of data storage. In your mind the only possible solution is to buy another 14 drive chassis? Must be nice to never have a budget. I think you mean an infinite budget. That's even assuming it's possible to get the HD's you need. I've had arrays that used all the space I could give them in 160 HD cabinets. Two 160 HD cabinets was neither within the budget nor going to perform well. I =had= to use RAID 5. RAID 10 was just not usage efficient enough. 
Must be a hard sell if you've bought decent enough hardware that your benchmarks can't demonstrate a difference between a RAID 5 and a RAID 10 configuration on that chassis except in degraded mode (and the customer doesn't want to pay double for degraded mode performance) I have =never= had this situation. RAID 10 latency is better than RAID 5 latency. RAID 10 write speed under heavy enough load, of any type, is faster than RAID 5 write speed under the same circumstances. RAID 10 robustness is better as well. Problem is that sometimes budget limits or number of HDs needed limits mean you can't use RAID 10. In reality I have yet to benchmark a system where RAID 5 on the same number of drives with 8 drives or less in a single array beat a RAID 10 with the same number of drives. Well, those are frankly little arrays, probably on lousy controllers... Nah. Regardless of controller I can take any RAID 5 and any RAID 10 built on the same HW under the same OS running the same DBMS and =guarantee= there is an IO load above which it can be shown that the RAID 10 will do writes faster than the RAID 5. The only exception in my career thus far has been the aforementioned Mylex controller. OTOH, sometimes you have no choice but to "take the hit" and use RAID 5. cheers, Ron ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] What's the best hardver for PostgreSQL 8.1?
At 02:05 PM 12/27/2005, Michael Stone wrote: On Tue, Dec 27, 2005 at 11:50:16AM -0500, Ron wrote: Sorry. A decade+ RWE in production with RAID 5 using controllers as bad as Adaptec and as good as Mylex, Chaparral, LSI Logic (including their Engino stuff), and Xyratex under 5 different OS's (Sun, Linux, M$, DEC, HP) on each of Oracle, SQL Server, DB2, mySQL, and pg shows that RAID 5 writes are slower than RAID 5 reads What does that have to do with anything? That wasn't the question... Your quoted position is "there isn't a 'RAID 5 penalty' assuming you've got a reasonably fast controller and you're doing large sequential writes (or have enough cache that random writes can be batched as large sequential writes)." My experience across a wide range of HW, OSs, DBMS, and applications says you are wrong. Given enough IO, RAID 5 takes a bigger performance hit for writes than RAID 10 does. Enough IO, sequential or otherwise, will result in a situation where a RAID 10 array using the same number of HDs (and therefore of ~1/2 the usable capacity) will have better write performance than the equivalent RAID 5 built using the same number of HDs. There is a 'RAID 5 write penalty'. Said RAID 10 array will also be more robust than a RAID 5 built using the same number of HDs. OTOH, that does not make RAID 5 "bad". Nor are statements like "Never use RAID 5!" realistic or reasonable. Also, performance is not the only or even most important reason for choosing RAID 10 or RAID 50 over RAID 5. Robustness considerations can be more important than performance ones. cheers, Ron ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] What's the best hardver for PostgreSQL 8.1?
At 04:15 PM 12/27/2005, Michael Stone wrote: I don't understand why you keep using the pejorative term "performance hit". Try describing the "performance characteristics" instead. pe·jor·a·tive( P ) Pronunciation Key (p-jôr-tv, -jr-, pj-rtv, pj-) adj. Tending to make or become worse. Disparaging; belittling. RAID 5 write performance is significantly enough less than RAID 5 read performance as to be a matter of professional note and concern. That's not "disparaging or belittling" nor is it "tending to make or become worse". It's measurable fact that has an adverse impact on capacity planning, budgeting, HW deployment, etc. If you consider calling a provable decrease in performance while doing a certain task that has such effects "a hit" or "bad" pejorative, you are using a definition for the word that is different than the standard one. Also, claims about performance claims based on experience are fairly useless. Either you have data to provide (in which case claiming vast experience is unnecessary) or you don't. My experience _is_ the data provided. Isn't it convenient for you that I don't have the records for every job I've done in 20 years, nor do I necessarily have the right to release some specifics for some of what I do have. I've said what I can as a service to the community. Including to you. Your reaction implies that I and others with perhaps equally or more valuable experience to share shouldn't bother. "One of the major differences between Man and Beast is that Man learns from others experience." It's also impressive that you evidently seem to be implying that you do such records for your own job experience _and_ that you have the legal right to publish them. In which case, please feel free to impress me further by doing so. Said RAID 10 array will also be more robust than a RAID 5 built using the same number of HDs. And a RAID 6 will be more robust than either. Basing reliability on "hopefully you wont have both disks in a mirror fail" is just silly. Either you need double disk failure protection or you don't. That statement is incorrect and ignores both probability and real world statistical failure patterns. The odds of a RAID 10 array of n HDs suffering a failure that loses data are less than the odds of it happening in a RAID 6 array of n HDs. You are correct that RAID 6 is more robust than RAID 5. cheers, Ron ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] How import big amounts of data?
At 04:48 AM 12/29/2005, Arnau wrote: Hi all, Which is the best way to import data to tables? I have to import 9 rows into a column and doing it as inserts takes ages. Would be faster with copy? is there any other alternative to insert/copy? Compared to some imports, 90K rows is not that large. Assuming you want the table(s) to be in some sorted order when you are done, the fastest way to import a large enough amount of data is: -put the new data into a temp table (works best if temp table fits into RAM) -merge the rows from the original table and the temp table into a new table -create the indexes you want on the new table -DROP the old table and its indexes -rename the new table and its indexes to replace the old ones. If you _don't_ care about having the table in some sorted order, -put the new data into a new table -COPY the old data to the new table -create the indexes you want on the new table -DROP the old table and its indexes -rename the new table and its indexes to replace the old ones Either of these procedures will also minimize your downtime while you are importing. If one doesn't want to go to all of the trouble of either of the above, at least DROP your indexes, do your INSERTs in batches, and rebuild your indexes. Doing 90K individual INSERTs should usually be avoided. cheers, Ron ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
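A sketch of the second recipe (new data into a staging table via COPY, then swap the rebuilt table in) is below, written against psycopg2; COPY ... CSV needs PostgreSQL 8.0 or later, and copy_expert is the driver-side hook for it. Every table, column, and file name is made up; the point is the sequence of steps and the fact that the rows arrive as one COPY rather than 90K individual INSERTs.

# bulk_import.py -- load new rows with COPY, rebuild, index, and swap.
# Names (items, item_id, new_rows.csv) are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()

cur.execute("CREATE TEMP TABLE staging (LIKE items)")
with open("new_rows.csv") as f:
    cur.copy_expert("COPY staging FROM STDIN WITH CSV", f)   # one bulk load

cur.execute("""CREATE TABLE items_new AS
               SELECT * FROM items
               UNION ALL
               SELECT * FROM staging
               ORDER BY item_id""")                          # merged, in sorted order
cur.execute("CREATE INDEX items_new_item_id_idx ON items_new (item_id)")
cur.execute("DROP TABLE items")
cur.execute("ALTER TABLE items_new RENAME TO items")
conn.commit()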
Re: [PERFORM] Invulnerable VACUUM process thrashing everything
Ick. Can you get users and foreign connections off that machine, lock them out for some period, and renice the VACUUM? Shedding load and keeping it off while VACUUM runs high priority might allow it to finish in a reasonable amount of time. Or Shedding load and dropping the VACUUM priority might allow a kill signal to get through. Hope this helps, Ron At 05:09 PM 12/29/2005, Jeffrey W. Baker wrote: A few WEEKS ago, the autovacuum on my instance of pg 7.4 unilaterally decided to VACUUM a table which has not been updated in over a year and is more than one terabyte on the disk. Because of the very high transaction load on this database, this VACUUM has been ruining performance for almost a month. Unfortunately is seems invulnerable to killing by signals: # ps ax | grep VACUUM 15308 ?D588:00 postgres: postgres skunk [local] VACUUM # kill -HUP 15308 # ps ax | grep VACUUM 15308 ?D588:00 postgres: postgres skunk [local] VACUUM # kill -INT 15308 # ps ax | grep VACUUM 15308 ?D588:00 postgres: postgres skunk [local] VACUUM # kill -PIPE 15308 # ps ax | grep VACUUM 15308 ?D588:00 postgres: postgres skunk [local] VACUUM o/~ But the cat came back, the very next day ... I assume that if I kill this with SIGKILL, that will bring down every other postgres process, so that should be avoided. But surely there is a way to interrupt this. If I had some reason to shut down the instance, I'd be screwed, it seems. ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
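If dropping the VACUUM's priority is the route taken, it can be done for that one backend without touching the rest of the cluster. A tiny sketch (Python 3.3+ for os.setpriority; the PID is the one from the ps output above and is obviously an example):

# renice_vacuum.py -- lower the scheduling priority of one VACUUM backend.
# Root is only needed if you later want to raise the priority back up.
import os

VACUUM_PID = 15308                                # from: ps ax | grep VACUUM
os.setpriority(os.PRIO_PROCESS, VACUUM_PID, 19)   # 19 = nicest / lowest priority
print("reniced backend", VACUUM_PID)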
Re: [PERFORM] improving write performance for logging
2B is a lot of inserts. If you had to guess, what do you think is the maximum number of inserts you could do in a day? How large is each record being inserted? How much can you put in a COPY and how many COPYs can you put into a transactions? What values are you using for bgwriter* and checkpoint*? What HW on you running on and what kind of performance do you typically get? Inquiring minds definitely want to know ;-) Ron At 08:54 AM 1/4/2006, Ian Westmacott wrote: We have a similar application thats doing upwards of 2B inserts per day. We have spent a lot of time optimizing this, and found the following to be most beneficial: 1) use COPY (BINARY if possible) 2) don't use triggers or foreign keys 3) put WAL and tables on different spindles (channels if possible) 4) put as much as you can in each COPY, and put as many COPYs as you can in a single transaction. 5) watch out for XID wraparound 6) tune checkpoint* and bgwriter* parameters for your I/O system On Tue, 2006-01-03 at 16:44 -0700, Steve Eckmann wrote: > I have questions about how to improve the write performance of PostgreSQL for logging data from a real-time simulation. We found that MySQL 4.1.3 could log about 1480 objects/second using MyISAM tables or about 1225 objects/second using InnoDB tables, but PostgreSQL 8.0.3 could log only about 540 objects/second. (test system: quad-Itanium2, 8GB memory, SCSI RAID, GigE connection from simulation server, nothing running except system processes and database system under test) > > We also found that we could improve MySQL performance significantly using MySQL's "INSERT" command extension allowing multiple value-list tuples in a single command; the rate for MyISAM tables improved to about 2600 objects/second. PostgreSQL doesn't support that language extension. Using the COPY command instead of INSERT might help, but since rows are being generated on the fly, I don't see how to use COPY without running a separate process that reads rows from the application and uses COPY to write to the database. The application currently has two processes: the simulation and a data collector that reads events from the sim (queued in shared memory) and writes them as rows to the database, buffering as needed to avoid lost data during periods of high activity. To use COPY I think we would have to split our data collector into two processes communicating via a pipe. > > Query performance is not an issue: we found that when suitable indexes are added PostgreSQL is fast enough on the kinds of queries our users make. The crux is writing rows to the database fast enough to keep up with the simulation. > > Are there general guidelines for tuning the PostgreSQL server for this kind of application? The suggestions I've found include disabling fsync (done), increasing the value of wal_buffers, and moving the WAL to a different disk, but these aren't likely to produce the 3x improvement that we need. On the client side I've found only two suggestions: disable autocommit and use COPY instead of INSERT. I think I've effectively disabled autocommit by batching up to several hundred INSERT commands in each PQexec() call, and it isn’t clear that COPY is worth the effort in our application. > > Thanks. > > > ---(end of broadcast)--- > TIP 2: Don't 'kill -9' the postmaster -- Ian Westmacott <[EMAIL PROTECTED]> Intellivid Corp. ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org ---(end of broadcast)--- TIP 6: explain analyze is your friend
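To put those questions in perspective, the arithmetic on what 2B rows/day implies is below; it is the sustained rate the COPY batching and checkpoint tuning have to keep up with.

# insert_rate.py -- what "2 billion inserts per day" means per second and
# per COPY batch. Straight division; bursts will run higher than average.
ROWS_PER_DAY = 2e9
per_sec = ROWS_PER_DAY / 86400.0
print("sustained average: %.0f rows/sec" % per_sec)     # ~23,148 rows/sec
for batches_per_sec in (1, 5, 10):
    print("%3d COPY batches/sec -> %8.0f rows per COPY"
          % (batches_per_sec, per_sec / batches_per_sec))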
Re: [PERFORM] help tuning queries on large database
I'll second all of Luke Lonergan's comments and add these. You should be able to increase both "cold" and "warm" performance (as well as data integrity. read below.) considerably. Ron At 05:59 PM 1/6/2006, peter royal wrote: Howdy. I'm running into scaling problems when testing with a 16gb (data +indexes) database. I can run a query, and it returns in a few seconds. If I run it again, it returns in a few milliseconds. I realize this is because during subsequent runs, the necessary disk pages have been cached by the OS. I have experimented with having all 8 disks in a single RAID0 set, a single RAID10 set, and currently 4 RAID0 sets of 2 disks each. There hasn't been an appreciable difference in the overall performance of my test suite (which randomly generates queries like the samples below as well as a few other types. this problem manifests itself on other queries in the test suite as well). So, my question is, is there anything I can do to boost performance with what I've got, or am I in a position where the only 'fix' is more faster disks? I can't think of any schema/index changes that would help, since everything looks pretty optimal from the 'explain analyze' output. I'd like to get a 10x improvement when querying from the 'cold' state. Thanks for any assistance. The advice from reading this list to getting to where I am now has been invaluable. -peter Configuration: PostgreSQL 8.1.1 shared_buffers = 1 # (It was higher, 50k, but didn't help any, so brought down to free ram for disk cache) work_mem = 8196 random_page_cost = 3 effective_cache_size = 25 Hardware: CentOS 4.2 (Linux 2.6.9-22.0.1.ELsmp) Upgrade your kernel to at least 2.6.12 There's a known issue with earlier versions of the 2.6.x kernel and 64b CPUs like the Opteron. See kernel.org for details. Areca ARC-1220 8-port PCI-E controller Make sure you have 1GB or 2GB of cache. Get the battery backup and set the cache for write back rather than write through. 8 x Hitachi Deskstar 7K80 (SATA2) (7200rpm) 2 x Opteron 242 @ 1.6ghz 3gb RAM (should be 4gb, but separate Linux issue preventing us from getting it to see all of it) Tyan Thunder K8WE The K8WE has 8 DIMM slots. That should be good for 16 or 32 GB of RAM (Depending on whether the mainboard recognizes 4GB DIMMs or not. Ask Tyan about the latest K8WE firmare.). If nothing else, 1GB DIMMs are now so cheap that you should have no problems having 8GB on the K8WE. A 2.6.12 or later based Linux distro should have NO problems using more than 4GB or RAM. Among the other tricks having lots of RAM allows: If some of your tables are Read Only or VERY rarely written to, you can preload them at boot time and make them RAM resident using the /etc/tmpfs trick. In addition there is at least one company making a cheap battery backed PCI-X card that can hold up to 4GB of RAM and pretend to be a small HD to the OS. I don't remember any names at the moment, but there have been posts here and at storage.review.com on such products. RAID Layout: 4 2-disk RAID0 sets created You do know that a RAID 0 set provides _worse_ data protection than a single HD? Don't use RAID 0 for any data you want kept reliably. With 8 HDs, the best config is probably 1 2HD RAID 1 + 1 6HD RAID 10 or 2 4HD RAID 10's It is certainly true that once you have done everything you can with RAM, the next set of HW optimizations is to add HDs. The more the better up to a the limits of your available PCI-X bandwidth. In short, a 2nd RAID fully populated controller is not unreasonable. 
---(end of broadcast)--- TIP 6: explain analyze is your friend
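On the "cold vs. warm" gap specifically: short of actually hosting files on a tmpfs, much of it can be closed by pulling the hot relations' files through the OS page cache at boot. The sketch below just reads the underlying files and throws the data away; the data directory path and OIDs are placeholders, and the relfilenode to look for comes from pg_class.

# prewarm.py -- warm the OS page cache for a table's underlying files so
# the first "cold" queries do not pay the full disk penalty.
# Find the file name with:
#   SELECT relfilenode FROM pg_class WHERE relname = 'yourtable';
import glob

relfiles = glob.glob("/var/lib/pgsql/data/base/16385/17001*")  # hypothetical OIDs
for path in relfiles:
    with open(path, "rb") as f:
        while f.read(8 * 1024 * 1024):   # read and discard, 8MB at a time
            pass
    print("warmed", path)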
Re: [PERFORM] File Systems Compared
At 10:40 AM 12/6/2006, Brian Wipf wrote: All tests are with bonnie++ 1.03a Main components of system: 16 WD Raptor 150GB 1 RPM drives all in a RAID 10 ARECA 1280 PCI-Express RAID adapter with 1GB BB Cache (Thanks for the recommendation, Ron!) 32 GB RAM Dual Intel 5160 Xeon Woodcrest 3.0 GHz processors OS: SUSE Linux 10.1 xfs (with write cache disabled on disks): /usr/local/sbin/bonnie++ -d bonnie/ -s 64368:8k Version 1.03 --Sequential Output-- --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- -- Block-- --Seeks-- MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec % CP /sec %CP hulk464368M 90621 99 283916 35 105871 11 88569 97 433890 23 644.5 0 --Sequential Create-- Random Create -Create-- --Read--- -Delete-- -Create-- -- Read--- -Delete-- files /sec %CP /sec %CP /sec %CP /sec %CP /sec % CP /sec %CP 16 28435 95 + +++ 28895 82 28523 91 + ++ + 24369 86 hulk4,64368M, 90621,99,283916,35,105871,11,88569,97,433890,23,644.5,0,16,28435,95,++ +++,+++,28895,82,28523,91,+,+++,24369,86 xfs (with write cache enabled on disks): /usr/local/sbin/bonnie++ -d bonnie -s 64368:8k Version 1.03 --Sequential Output-- --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- -- Block-- --Seeks-- MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec % CP /sec %CP hulk464368M 90861 99 348401 43 131887 14 89412 97 432964 23 658.7 0 --Sequential Create-- Random Create -Create-- --Read--- -Delete-- -Create-- -- Read--- -Delete-- files /sec %CP /sec %CP /sec %CP /sec %CP /sec % CP /sec %CP 16 28871 90 + +++ 28923 91 30879 93 + ++ + 28012 94 hulk4,64368M, 90861,99,348401,43,131887,14,89412,97,432964,23,658.7,0,16,28871,90,++ +++,+++,28923,91,30879,93,+,+++,28012,94 Hmmm. Something is not right. With a 16 HD RAID 10 based on 10K rpm HDs, you should be seeing higher absolute performance numbers. Find out what HW the Areca guys and Tweakers guys used to test the 1280s. At LW2006, Areca was demonstrating all-in-cache reads and writes of ~1600MBps and ~1300MBps respectively along with RAID 0 Sustained Rates of ~900MBps read, and ~850MBps write. Luke, I know you've managed to get higher IO rates than this with this class of HW. Is there a OS or SW config issue Brian should closely investigate? Ron Peacetree ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [PERFORM] Areca 1260 Performance (was: File Systems
The 1100 series is PCI-X based. The 1200 series is PCI-E x8 based. Apples and oranges. I still think Luke Lonergan or Josh Berkus may have some interesting ideas regarding possible OS and SW optimizations. WD1500ADFDs are each good for ~90MBps read and ~60MBps write ASTR. That means your 16 HD RAID 10 should be sequentially transferring ~720MBps read and ~480MBps write. Clearly more HDs will be required to allow a ARC-12xx to attain its peak performance. One thing that occurs to me with your present HW is that your CPU utilization numbers are relatively high. Since 5160s are clocked about as high as is available, that leaves trying CPUs with more cores and trying more CPUs. You've got basically got 4 HW threads at the moment. If you can, evaluate CPUs and mainboards that allow for 8 or 16 HW threads. Intel-wise, that's the new Kentfields. AMD-wise, you have lot's of 4S mainboard options, but the AMD 4C CPUs won't be available until sometime late in 2007. I've got other ideas, but this list is not the appropriate venue for the level of detail required. Ron Peacetree At 05:30 PM 12/6/2006, Brian Wipf wrote: On 6-Dec-06, at 2:47 PM, Brian Wipf wrote: Hmmm. Something is not right. With a 16 HD RAID 10 based on 10K rpm HDs, you should be seeing higher absolute performance numbers. Find out what HW the Areca guys and Tweakers guys used to test the 1280s. At LW2006, Areca was demonstrating all-in-cache reads and writes of ~1600MBps and ~1300MBps respectively along with RAID 0 Sustained Rates of ~900MBps read, and ~850MBps write. Luke, I know you've managed to get higher IO rates than this with this class of HW. Is there a OS or SW config issue Brian should closely investigate? I wrote 1280 by a mistake. It's actually a 1260. Sorry about that. The IOP341 class of cards weren't available when we ordered the parts for the box, so we had to go with the 1260. The box(es) we build next month will either have the 1261ML or 1280 depending on whether we go 16 or 24 disk. I noticed Bucky got almost 800 random seeks per second on her 6 disk 1 RPM SAS drive Dell PowerEdge 2950. The random seek performance of this box disappointed me the most. Even running 2 concurrent bonnies, the random seek performance only increased from 644 seeks/sec to 813 seeks/sec. Maybe there is some setting I'm missing? This card looked pretty impressive on tweakers.net. Areca has some performance numbers in a downloadable PDF for the Areca ARC-1120, which is in the same class as the ARC-1260, except with 8 ports. With all 8 drives in a RAID 0 the card gets the following performance numbers: Card single thread write20 thread write single thread read20 thread read ARC-1120 321.26 MB/s404.76 MB/s 412.55 MB/ s 672.45 MB/s My numbers for sequential i/o for the ARC-1260 in a 16 disk RAID 10 are slightly better than the ARC-1120 in an 8 disk RAID 0 for a single thread. I guess this means my numbers are reasonable. ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
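The 720/480MBps estimate above comes from counting half the spindles (one data stream per mirror pair) times the per-drive sustained rate. As a reusable check against whatever bonnie++ reports:

# raid10_astr.py -- expected aggregate sequential transfer rate for a
# RAID 10, using the same accounting as the estimate above (half the
# spindles carry unique data). Controller, bus, and filesystem limits
# are ignored, so treat the result as an upper bound.
def raid10_astr(disks, per_disk_mbps):
    return (disks // 2) * per_disk_mbps

print("16x WD1500ADFD read : ~%d MBps" % raid10_astr(16, 90))   # ~720
print("16x WD1500ADFD write: ~%d MBps" % raid10_astr(16, 60))   # ~480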
Re: [PERFORM] Areca 1260 Performance
At 06:40 PM 12/6/2006, Brian Wipf wrote: I appreciate your suggestions, Ron. And that helps answer my question on processor selection for our next box; I wasn't sure whether the lower clock speed of the Kentsfield compared to the Woodcrest, but with double the cores, would be better for us overall or not.

Please do not misunderstand me. I am not endorsing the use of Kentsfield. I am recommending =evaluating= Kentsfield. I am also recommending the evaluation of 2C 4S AMD solutions. All this stuff is so leading edge that it is far from clear what the RW performance of a DBMS based on these components will be without extensive testing of =your= app under =your= workload.

One thing that is clear from what you've posted thus far is that you are going to need more HDs if you want to have any chance of fully utilizing your Areca HW.

Out of curiosity, where are you geographically?

Hoping I'm being helpful, Ron ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Areca 1260 Performance
At 03:37 AM 12/7/2006, Brian Wipf wrote: On 6-Dec-06, at 5:26 PM, Ron wrote: All this stuff is so leading edge that it is far from clear what the RW performance of DBMS based on these components will be without extensive testing of =your= app under =your= workload. I want the best performance for the dollar, so I can't rule anything out. Right now I'm leaning towards Kentsfield, but I will do some more research before I make a decision. We probably won't wait much past January though. Kentsfield's outrageously high pricing and operating costs (power and cooling) are not likely to make it the cost/performance winner. OTOH, 1= ATM it is the way to throw the most cache per socket at a DBMS within the Core2 CPU line (Tulsa has even more at 16MB per CPU). 2= SSSE3 and other Core2 optimizations have led to some impressive performance numbers- unless raw clock rate is the thing that can help you the most. If what you need for highest performance is the absolute highest clock rate or most cache per core, then bench some Intel Tulsa's. Apps with memory footprints too large for on die or in socket caches or that require extreme memory subsystem performance are still best served by AMD CPUs. If you are getting the impression that it is presently complicated deciding which CPU is best for any specific pg app, then I am making the impression I intend to. One thing that is clear from what you've posted thus far is that you are going to needmore HDs if you want to have any chance of fully utilizing your Areca HW. Do you know off hand where I might find a chassis that can fit 24[+] drives? The last chassis we ordered was through Supermicro, and the largest they carry fits 16 drives. www.pogolinux.com has 24 and 48 bay 3.5" HD chassis'; and a 64 bay 2.5" chassis. Tell them I sent you. www.impediment.com are folks I trust regarding all things storage (and RAM). Again, tell them I sent you. www.aberdeeninc.com is also a vendor I've had luck with, but try Pogo and Impediment first. Good luck and please post what happens, Ron Peacetree ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Areca 1260 Performance
At 11:02 AM 12/7/2006, Gene wrote: I'm building a SuperServer 6035B server (16 scsi drives). My schema has basically two large tables (million+ per day) each which are partitioned daily, and queried independently of each other. Would you recommend a raid1 system partition and 14 drives in a raid 10 or should i create separate partitions/tablespaces for the two large tables and indexes? Not an easy question to answer w/o knowing more about your actual queries and workload. To keep the math simple, let's assume each SCSI HD has and ASTR of 75MBps. A 14 HD RAID 10 therefore has an ASTR of 7* 75= 525MBps. If the rest of your system can handle this much or more bandwidth, then this is most probably the best config. Dedicating spindles to specific tables is usually best done when there is HD bandwidth that can't be utilized if the HDs are in a larger set +and+ there is a significant hot spot that can use dedicated resources. My first attempt would be to use other internal HDs for a RAID 1 systems volume and use all 16 of your HBA HDs for a 16 HD RAID 10 array. Then I'd bench the config to see if it had acceptable performance. If yes, stop. Else start considering the more complicated alternatives. Remember that adding HDs and RAM is far cheaper than even a few hours of skilled technical labor. Ron Peacetree ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [PERFORM] New to PostgreSQL, performance considerations
This definitely is the correct approach. Actually, Daniel van Ham Colchete may not be as "all wet" as some around here think. We've had previous data that shows that pg can become CPU bound (see previous posts by Josh Berkus and others regarding CPU overhead in what should be IO bound tasks). In addition, we know that x86 compatible 64b implementations differ enough between AMD and Intel products that it sometimes shows on benches. Evidence outside the DBMS arena supports the hypothesis that recent CPUs are needing more hand-holding and product specific compiling, not less, compared to their previous versions. Side Note: I wonder what if anything pg could gain from using SWAR instructions (SSE*, MMX, etc)? I'd say the fairest attitude is to do everything we can to support having the proper experiments done w/o presuming the results. Ron Peacetree At 10:47 AM 12/11/2006, Chris Browne wrote: In order to prove what you want to prove, you need to run the benchmarks all on Gentoo, where you run with 4 categorizations: 1. Where you run PostgreSQL and GLIBC without any processor-specific optimizations 2. Where you run PostgreSQL and GLIBC with all relevant processor-specific optimizations 3. Where you run PostgreSQL with, and GLIBC without processor-specific optimizations 4. Where you run PostgreSQL without, and GLIBC with processor-specific optimizations That would allow one to clearly distinguish which optimizations are particularly relevant. > I'm not saying that Gentoo is faster than FC6. I just want to prove > that if you compile your software to make better use of your > processor, it will run faster. > > It might take a few days because I'm pretty busy right now at my > job. I expect that you'll discover, if you actually do these tests, that this belief is fairly much nonsense. - Modern CPUs do a huge amount of on-CPU self-tuning. - CPU features that could have a material effect tend to be unusable when compiling general purpose libraries and applications. GCC doesn't generate MMX-like instructions. - Database application performance tends to be I/O driven. - When database application performance *isn't* I/O driven, it is likely to be driven by cache management, which compiler options won't affect. ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] New to PostgreSQL, performance considerations
Statements like these can not be reasonably interpreted in any manner _except_ that of presuming the results: "I expect that you'll discover, if you actually do these tests, that this belief (that using arch specific compiler options lead to better performing SW) is fairly much nonsense." "...IMO a waste of time..." etc The correct objective response to claims w/o evidence is to request evidence, and to do everything we can to support it being properly gathered. Not to try to discourage the claimant from even trying by ganging up on them with multiple instances of Argument From Authority or variations of Ad Hominem attacks. (The validity of the claim has nothing to do with the skills or experience of the claimant or anyone else in the discussion. Only on the evidence.) It is a tad unfair and prejudicial to call claims that CPU optimizations matter to the performance of DB product "extraordinary". Evidence outside the DBMS field exists; and previous posts here show that pg can indeed become CPU-bound during what should be IO bound tasks. At the moment, Daniel's claims are not well supported. That is far different from being "extraordinary" given the current circumstantial evidence. Let's also bear in mind that as a community project, we can use all the help we can get. Driving potential resources away is in opposition to that goal. [1] The evidence that arch specific flags matter to performance can be found as easily as recompiling your kernel or your compiler. While it certainly could be argued how "general purpose" such SW is, the same could be said for just about any SW at some level of abstraction. Ron Peacetree At 12:31 PM 12/11/2006, Michael Stone wrote: On Mon, Dec 11, 2006 at 12:15:51PM -0500, Ron wrote: I'd say the fairest attitude is to do everything we can to support having the proper experiments done w/o presuming the results. Who's presuming results?[1] It is fair to say that making extraordinary claims without any evidence should be discouraged. It's also fair to say that if there are specific things that need cpu-specific tuning they'll be fairly limited critical areas (e.g., locks) which would probably be better implemented with a hand-tuned code and runtime cpu detection than by magical mystical compiler invocations. Mike Stone [1] I will say that I have never seen a realistic benchmark of general code where the compiler flags made a statistically significant difference in the runtime. There are some particularly cpu-intensive codes, like some science simulations or encoding routines where they matter, but that's not the norm--and many of those algorithms already have hand-tuned versions which will outperform autogenerated code. You'd think that with all the talk that the users of certain OS's generate about CFLAG settings, there'd be some well-published numbers backing up the hype. At any rate if there were numbers to back the claim then I think they could certainly be considered without prejudice. ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] New to PostgreSQL, performance considerations
At 01:47 PM 12/11/2006, Michael Stone wrote: On Mon, Dec 11, 2006 at 01:20:50PM -0500, Ron wrote: (The validity of the claim has nothing to do with the skills or experience of the claimant or anyone else in the discussion. Only on the evidence.) Please go back and reread the original post. I don't think the response was unwarranted. So he's evidently young and perhaps a trifle over-enthusiastic. We were once too. ;-) We are not going to get valuable contributions nor help people become more valuable to the community by "flaming them into submission". ...and who knows, a =properly= done experiment may provide both surprises and unexpected insights/benefits. I agree completely with telling him he needs to get better evidence and even with helping him understand how he should go about getting it. It should be noted that his opposition has not yet done these experiments either. (Else they could simply point to the results that refute Daniel's hypothesis.) The reality is that a new CPU architecture and multiple new memory technologies are part of this discussion. I certainly do not expect them to change the fundamental thinking regarding how to get best performance for a DBMS. OTOH, there are multiple valid reasons to give such new stuff a thorough and rigorous experimental shake-down. ATM, =both= sides of this debate are lacking evidence for their POV. Let's support getting definitive evidence. No matter who brings it to the table ;-) Ron Peacetree ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] New to PostgreSQL, performance considerations
At 02:28 PM 12/11/2006, Merlin Moncure wrote: also, some people posting here, not necessarily me, are authority figures. :-) merlin Noam Chomsky was one of the most influential thinkers Linguistics has yet produced. He was proven wrong a number of times, even within Linguistics. There are plenty of other historical examples. As others have said, opinion without evidence and logic is just that: opinion. And even Expert Opinion has been known to be wrong, sometimes very much so. Part of what makes an expert an expert is that they can back up their statements with evidence and logic that are compelling even to the non-expert when asked to do so. All I'm saying is let's all remember how "assume" is spelled and support the getting of some hard data. Ron Peacetree ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] New to PostgreSQL, performance considerations
1= In all these results I'm seeing, no one has yet reported what their physical IO subsystem is... ...when we are benching a DB. 2= So far we've got ~ a factor of 4 performance difference between Michael Stone's 1S 1C Netburst era 2.5GHz P4 PC and Guido Neitzer's 1S 2C MacBook Pro 2.33GHz C2D. If the physical IO subsystems are even close to equivalent across the systems benched so far, we've clearly established that pg performance is more sensitive to factors outside the physical IO subsystem than might usually be thought with regard to a DBMS. (At least for this benchmark SW.) 3= Daniel van Ham Colchete is running Gentoo. That means every SW component on his box has been compiled to be optimized for the HW it is running on. There may be a combination of effects going on for him that others not running a system optimized from the ground up for its HW do not see. 4= If we are testing arch specific compiler options and only arch specific compiler options, we should remove the OS as a variable. Since Daniel has presented evidence in support of his hypothesis, the first step should be to duplicate his environment as =exactly= as possible and see if someone can independently reproduce the results when the only significant difference is the human involved. This will guard against procedural error in the experiment. Possible Outcomes A= Daniel made a procedural error. We all learn what it is and how to avoid it. B= The Gentoo results are confirmed but no other OS shows this effect. Much digging ensues ;-) C= Daniel's results are confirmed as platform independent once we take all factors into account properly. We all learn more re: how to best set up pg for highest performance. Ron Peacetree At 01:35 AM 12/12/2006, Greg Smith wrote: On Mon, 11 Dec 2006, Michael Stone wrote: Can anyone else reproduce these results? I'm on similar hardware (2.5GHz P4, 1.5G RAM)... There are two likely candidates for why Daniel's P4 3.0GHz significantly outperforms your 2.5GHz system. 1) Most 2.5GHz P4 processors use a 533MHz front-side bus (FSB); most 3.0GHz ones use an 800MHz bus. 2) A typical motherboard paired with a 2.5GHz era processor will have a single-channel memory interface; a typical 3.0GHz era board supports dual-channel DDR. These changes could easily explain the magnitude of difference in results you're seeing, especially when combined with a 20% greater raw CPU clock. ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] New to PostgreSQL, performance considerations
At 10:47 AM 12/12/2006, Tom Lane wrote: It's notoriously hard to get repeatable numbers out of pgbench :-( That's not a good characteristic in benchmarking SW... Does the OSDL stuff have an easier time getting reproducible results? A couple of tips: * don't put any faith in short runs. I usually use -t 1000 plus -c whatever. * make sure you loaded the database (pgbench -i) with a scale factor (-s) at least equal to the maximum -c you want to test. Otherwise you're mostly measuring update contention. * pay attention to when checkpoints occur. You probably need to increase checkpoint_segments if you want pgbench not to be checkpoint-bound. This all looks very useful. Can you give some guidance as to what checkpoint_segments should be increased to? Do the values you are running pgbench with suggest what value checkpoint_segments should be? Ron Peacetree ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] New to PostgreSQL, performance considerations
At 01:35 PM 12/12/2006, Daniel van Ham Colchete wrote: I just made another test with a second Gentoo machine: The results showed no significant change. The conclusion of today's test would be that there is no improvement in PostgreSQL when using -march=prescott. I only see 3 differences between yesterday's server and today's: the kernel version (y: 2.6.18, t: 2.6.17), the server uses an IDE hard drive (yesterday was SATA), and the gcc version (3.4.6 -> 4.1.1). I don't know why we saw an improvement yesterday and not today. SATA HDs, particularly SATA II HDs and _especially_ 10Krpm 150GB SATA II Raptors, are going to have far better performance than older IDE HDs. Do some raw bonnie++ benches on the two systems. If the numbers from bonnie++ are close to those obtained during the pgbench runs, then the HDs are limiting pgbench. Best would be to use the exact same HD IO subsystem on both boxes, but that may not be feasible. In general, it would be helpful if the entire config, HW + OS + pg stuff, was documented when submitting benchmark results. (For instance, it would not be outside the realm of plausibility for Guido's C2D laptop to be HD IO limited and for Michael's 2.5GHz P4 PC to be CPU limited during pgbench runs.) Ron Peacetree ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] New to PostgreSQL, performance considerations
At 11:11 AM 12/13/2006, Cosimo Streppone wrote: Interesting, eh? Cosimo What I find interesting is that so far Guido's C2D Mac laptop has gotten the highest values by far in this set of experiments, and no one else is even close. The slowest results, Michael's, are on the system with what appears to be the slowest CPU of the bunch; and the ranking of the rest of the results seem to similarly depend on relative CPU performance. This is not what one would naively expect when benching a IO intensive app like a DBMS. Given that the typical laptop usually has 1 HD, and a relatively modest one at that (the fastest available are SATA 7200rpm or Seagate's perpendicular recording 5400rpm) in terms of performance, this feels very much like other factors are bottlenecking the experiments to the point where Daniel's results regarding compiler options are not actually being tested. Anyone got a 2.33 GHz C2D box with a decent HD IO subsystem more representative of a typical DB server hooked up to it? Again, the best way to confirm/deny Daniel's results is to duplicate the environment he obtained those results with as closely as possible (preferably exactly) and then have someone else try to duplicate his results. Also, I think the warnings regarding proper configuration of pgbench and which version of pgbench to use are worthy of note. Do we have guidance yet as to what checkpoint_segments should be set to? Should we be considering using something other than pgbench for such experiments? Ron Peacetree ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] New to PostgreSQL, performance considerations
At 01:49 PM 12/13/2006, Bucky Jordan wrote: I've only seen pg_bench numbers > 2,000 tps on either really large hardware (none of the above mentioned comes close) or the results are in memory due to a small database size (aka measuring update contention). Which makes a laptop achieving such numbers all the more interesting IMHO. Just a guess, but these tests (compiler opts.) seem like they sometimes show a benefit where the database is mostly in RAM (which I'd guess many people have) since that would cause more work to be put on the CPU/Memory subsystems. The cases where the working set, or the performance critical part of the working set, of the DB is RAM resident are very important ones ITRW. Other people on the list hinted at this, but I share their hypothesis that once you get IO involved as a bottleneck (which is a more typical DB situation) you won't notice compiler options. Certainly makes intuitive sense. OTOH, this list has seen discussion of what should be IO bound operations being CPU bound. Evidently due to the expense of processing pg datastructures. Only objective benches are going to tell us where the various limitations on pg performance really are. I've got a 2 socket x 2 core woodcrest poweredge 2950 with a BBC 6 disk RAID I'll run some tests on as soon as I get a chance. I'm also thinking for this test, there's no need to tweak the default config other than maybe checkpoint_segments, since I don't really want postgres using large amounts of RAM (all that does is require me to build a larger test DB). Daniel's orginal system had 512MB RAM. This suggests to me that tests involving 256MB of pg memory should be plenty big enough. Thoughts? Hope they are useful. Ron Peacetree ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] New to PostgreSQL, performance considerations
Benchmarks, like any other SW, need modernizing and updating from time to time. Given the multi-core CPU approach to higher performance as the current fad in CPU architecture, we need a benchmark that is appropriate. If SPEC feels it is appropriate to rev their benchmark suite regularly, we probably should as well. Ron Peacetree At 12:44 AM 12/14/2006, Tom Lane wrote: "Joshua D. Drake" <[EMAIL PROTECTED]> writes: > On Wed, 2006-12-13 at 18:36 -0800, Josh Berkus wrote: >> Mostly, though, pgbench just gives the I/O system a workout. It's not a >> really good general workload. > It also will not utilize all cpus on a many cpu machine. We recently > found that the only way to *really* test with pgbench was to actually > run 4+ copies of pgbench at the same time. The pgbench app itself becomes the bottleneck at high transaction rates. Awhile back I rewrote it to improve its ability to issue commands concurrently, but then desisted from submitting the changes --- if we change the app like that, future numbers would be incomparable to past ones, which sort of defeats the purpose of a benchmark no? ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] New to PostgreSQL, performance considerations
At 10:00 AM 12/14/2006, Greg Smith wrote: On Wed, 13 Dec 2006, Ron wrote: The slowest results, Michael's, are on the system with what appears to be the slowest CPU of the bunch; and the ranking of the rest of the results seem to similarly depend on relative CPU performance. This is not what one would naively expect when benching a IO intensive app like a DBMS. pgbench with 3000 total transactions and fsync off is barely doing I/O to disk; it's writing a bunch of data to the filesystem cache and ending the benchmark before the data even makes it to the hard drive. This is why his results become completely different as soon as the number of transactions increases. With little or no actual disk writes, you should expect results to be ranked by CPU speed. I of course agree with you in the general sense. OTOH, I'm fairly sure the exact point where this cross-over occurs is dependent on the components and configuration of the system involved. (Nor do I want to dismiss this scenario as irrelevant or unimportant. There are plenty of RW situations where this takes place or where the primary goal of a tuning effort is to make it take place. Multi-GB BB RAID caches anyone?) In addition, let's keep in mind that we all know that overall system performance is limited by whatever component hits its limits first. Local pg performance has been known to be limited by any of : CPUs, memory subsystems, or physical IO subsystems. Intuitively, one tends to expect only the later to be a limiting factor in the vast majority of DBMS tasks. pg has a history of regularly surprising such intuition in many cases. IMO, this makes good bench marking tools and procedures more important to have. (If nothing else, knowing what component is likely to be the bottleneck in system "X" made of components "x1, x2, x3, " for task "Y" is valuable lore for the pg community to have as preexisting data when first asked any given question on this list! ) One plausible positive effect of tuning like that Daniel advocates is to "move" the level of system activity where the physical IO subsystem becomes the limiting factor on overall system performance. We are not going to know definitively if such an effect exists, or to what degree, or how to achieve it, if we don't have appropriately rigorous and reproducible experiments (and documentation of them) in place to test for it. So it seem to make sense that the community should have a discussion about the proper bench marking of pg and to get some results based on said. Ron Peacetree ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [PERFORM] New to PostgreSQL, performance considerations
Alexander, Good stuff. Can you do runs with just CFLAGS="-O3" and just CFLAGS="-msse2 -mfpmath=sse -funroll-loops -m64 - march=opteron -pipe" as well ? As it is, you've given a good lower and upper bound on your performance obtainable using compiler options, but you've given no data to show what effect arch specific compiler options have by themselves. Also, what HDs are you using? How many in what config? Thanks in Advance, Ron Peacetree At 02:14 PM 12/14/2006, Alexander Staubo wrote: My PostgreSQL config overrides, then, are: shared_buffers = 1024MB work_mem = 1MB maintenance_work_mem = 16MB fsync = off Environment: Linux 2.6.15-23-amd64-generic on Ubuntu. Dual-core AMD Opteron 280 with 4GB of RAM. LSI PCI-X Fusion-MPT SAS. Running with: pgbench -S -v -n -t 5000 -c 5. Results as a graph: http://purefiction.net/paste/pgbench.pdf Stats for CFLAGS="-O0": 18440.181894 19207.882300 19894.432185 19635.625622 19876.858884 20032.597042 19683.597973 20370.19 19989.157881 20207.343510 19993.745956 20081.353580 20356.416424 20047.810017 20319.834190 19417.807528 19906.788454 20536.039929 19491.308046 20002.144230 Stats for CFLAGS="-O3 -msse2 -mfpmath=sse -funroll-loops -m64 - march=opteron -pipe": 23830.358351 26162.203569 25569.091264 26762.755665 26590.822550 26864.908197 26608.029665 26796.116921 26323.742015 26692.576261 26878.859132 26106.770425 26328.371664 26755.595130 25488.304946 26635.527959 26377.485023 24817.590708 26480.245737 26223.427801 ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] New to PostgreSQL, performance considerations
At 05:39 PM 12/14/2006, Alexander Staubo wrote: On Dec 14, 2006, at 20:28 , Ron wrote: Can you do runs with just CFLAGS="-O3" and just CFLAGS="-msse2 -mfpmath=sse -funroll-loops -m64 -march=opteron -pipe" as well ? All right. From my perspective, the effect of -O3 is significant, whereas architecture-related optimizations have no statistically significant effect. Is this opinion? Or have you rerun the tests using the flags I suggested? If so, can you post the results? If "-O3 -msse2 -mfpmath=sse -funroll-loops -m64 -march=opteron -pipe" results in a 30-40% speed up over "-O0", and "-msse2 -mfpmath=sse -funroll-loops -m64 -march=opteron -pipe" results in a 5-10% speedup, then ~ 1/8 - 1/3 of the total possible speedup is due to arch specific optimizations. (Testing "-O3" in isolation in addition tests for independence of factors as well as showing what "plain" "-O3" can accomplish.) Some might argue that a 5-10% speedup which represents 1/8 - 1/3 of the total speedup is significant... But enough speculating. I look forward to seeing your data. Ron Peacetree ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] New to PostgreSQL, performance considerations
At 07:27 PM 12/14/2006, Alexander Staubo wrote: Sorry, I neglected to include the pertinent graph: http://purefiction.net/paste/pgbench2.pdf In fact, your graph suggests that using arch specific options in addition to -O3 actually =hurts= performance. ...that seems unexpected... Ron Peacetree ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] New to PostgreSQL, performance considerations
At 04:54 AM 12/15/2006, Alexander Staubo wrote: On Dec 15, 2006, at 04:09 , Ron wrote: At 07:27 PM 12/14/2006, Alexander Staubo wrote: Sorry, I neglected to include the pertinent graph: http://purefiction.net/paste/pgbench2.pdf In fact, your graph suggests that using arch specific options in addition to -O3 actually =hurts= performance. According to the tech staff, this is a Sun X4100 with a two-drive RAID 1 volume. No idea about the make of the hard drives. Alexander. http://www.sun.com/servers/entry/x4100/features.xml So we are dealing with a 1U 1-4S (which means 1-8C) AMD Kx box with up to 32GB of ECC RAM (DDR2 ?) and 2 Seagate 2.5" SAS HDs. http://www.seagate.com/cda/products/discsales/index/1,,,00.html?interface=SAS My bet is the X4100 contains one of the 3 models of Cheetah 15K.4's. A simple du, dkinfo, whatever, will tell you which. I'm looking more closely into exactly what the various gcc -O optimizations do on Kx's as well. 64b vs 32b gets x86 compatible code access to ~ 2x as many registers; and MMX or SSE instructions get you access to not only more registers, but wider ones as well. As one wit has noted, "all optimization is an exercise in caching." (Terje Mathisen- one of the better assembler coders on the planet.) It seems unusual that code generation options which give access to more registers would ever result in slower code... Ron Peacetree ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] New to PostgreSQL, performance considerations
At 09:23 AM 12/15/2006, Merlin Moncure wrote: On 12/15/06, Ron <[EMAIL PROTECTED]> wrote: It seems unusual that code generation options which give access to more registers would ever result in slower code... The slower is probably due to the unroll loops switch which can actually hurt code due to the larger footprint (less cache coherency). I have seen that effect as well occasionally in the last few decades ;-) OTOH, suspicion is not _proof_; and I've seen other "optimizations" turn out to be "pessimizations" over the years as well. The extra registers are not all that important because of pipelining and other hardware tricks. No. Whoever told you this or gave you such an impression was mistaken. There are many instances of x86 compatible code that get 30-40% speedups just because they get access to 16 rather than 8 GPRs when recompiled for x86-64. Pretty much all the old assembly strategies such as forcing local variables to registers are basically obsolete...especially with regards to integer math. Again, not true. OTOH, humans are unlikely at this point to be able to duplicate the accuracy of the compiler's register coloring algorithms. Especially on x86 compatibles. (The flip side is that _expert_ humans can often put the quirky register set and instruction pipelines of x86 compatibles to more effective use for a specific chunk of code than even the best compilers can.) As I said before, modern CPUs are essentially RISC engines with a CISC preprocessing engine laid on top. I'm sure you meant modern =x86 compatible= CPUs are essentially RISC engines with a CISC engine on top. Just as "all the world's not a VAX", "all CPUs are not x86 compatibles". Forgetting this has occasionally cost folks I know... Things are much more complicated than they were in the old days where you could count instructions for the assembly optimization process. Those were the =very= old days in computer time... I suspect that there is little or no difference between -march=686 and the various specific architectures. There should be. The FSF compiler folks (and the rest of the industry compiler folks for that matter) are far from stupid. They are not just adding compiler switches because they randomly feel like it. Evidence suggests that the most recent CPUs are in need of =more= arch specific TLC compared to their ancestors, and that this trend is not only going to continue, it is going to accelerate. Did anybody think to look at the binaries and look for the amount of differences? I bet you code compiled for march=opteron will run just fine on a Pentium 2 if compiled for 32 bit. Sucker bet, given that the whole point of a 32b x86 compatible is to be able to run code on any IA32 ISA CPU. OTOH, I bet that code optimized for best performance on a P2 is not getting best performance on a P4. Or vice versa. ;-) The big arch specific differences in Kx's are in 64b mode. Not 32b. Ron Peacetree ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] New to PostgreSQL, performance considerations
At 09:50 AM 12/15/2006, Greg Smith wrote: On Fri, 15 Dec 2006, Merlin Moncure wrote: The slower is probably due to the unroll loops switch which can actually hurt code due to the larger footprint (less cache coherency). The cache issues are so important with current processors that I'd suggest throwing -Os (optimize for size) into the mix people test. That one may stack usefully with -O2, but probably not with -O3 (3 includes optimizations that increase code size). -Os Optimize for size. -Os enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size. -Os disables the following optimization flags: -falign-functions -falign-jumps -falign-loops -falign-labels -freorder-blocks -freorder-blocks-and-partition -fprefetch-loop-arrays -ftree-vect-loop-version Hmmm. That list of disabled flags bears thought. -falign-functions -falign-jumps -falign-loops -falign-labels 1= Most RISC CPUs performance is very sensitive to misalignment issues. Not recommended to turn these off. -freorder-blocks Reorder basic blocks in the compiled function in order to reduce number of taken branches and improve code locality. Enabled at levels -O2, -O3. -freorder-blocks-and-partition In addition to reordering basic blocks in the compiled function, in order to reduce number of taken branches, partitions hot and cold basic blocks into separate sections of the assembly and .o files, to improve paging and cache locality performance. This optimization is automatically turned off in the presence of exception handling, for link once sections, for functions with a user-defined section attribute and on any architecture that does not support named sections. 2= Most RISC CPUs are cranky about branchy code and (lack of) cache locality. Wouldn't suggest punting these either. -fprefetch-loop-arrays If supported by the target machine, generate instructions to prefetch memory to improve the performance of loops that access large arrays. This option may generate better or worse code; results are highly dependent on the structure of loops within the source code. 3= OTOH, This one looks worth experimenting with turning off. -ftree-vect-loop-version Perform loop versioning when doing loop vectorization on trees. When a loop appears to be vectorizable except that data alignment or data dependence cannot be determined at compile time then vectorized and non-vectorized versions of the loop are generated along with runtime checks for alignment or dependence to control which version is executed. This option is enabled by default except at level -Os where it is disabled. 4= ...and this one looks like a 50/50 shot. Ron Peacetree ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] [PERFORM] EXPLAIN ANALYZE on 8.2
At 10:45 AM 12/15/2006, Tom Lane wrote: Gregory Stark <[EMAIL PROTECTED]> writes: > There are various attempts at providing better timing infrastructure at low > overhead but I'm not sure what's out there currently. I expect to do this what > we'll have to do is invent a pg_* abstraction that has various implementations > on different architectures. You've got to be kidding. Surely it's glibc's responsibility, not ours, to implement gettimeofday correctly for the hardware. regards, tom lane I agree with Tom on this. Perhaps the best compromise is for the pg community to make thoughtful suggestions to the glibc community? Ron Peacetree ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] New to PostgreSQL, performance considerations
At 10:55 AM 12/15/2006, Merlin Moncure wrote: On 12/15/06, Ron <[EMAIL PROTECTED]> wrote: There are many instances of x86 compatible code that get 30-40% speedups just because they get access to 16 rather than 8 GPRs when recompiled for x86-64. ...We benchmarked PostgreSQL internally here and found it to be fastest in 32 bit mode running on a 64 bit platform -- this was on a quad opteron 870 running our specific software stack, your results might be different of course. On AMD Kx's, you probably will get best performance in 64b mode (so you get all those extra registers and other stuff) while using 32b pointers (to keep code size and memory footprint down). On Intel C2's, things are more complicated since Intel's x86-64 implementation and memory IO architecture are both different enough from AMD's to have caused some consternation on occasion when Intel's 64b performance did not match AMD's. The big arch specific differences in Kx's are in 64b mode. Not 32b I don't think so. IMO all the processor specific instruction sets were hacks of 32 bit mode to optimize specific tasks. Except for certain things, these instructions are rarely, if ever, used in 64 bit mode, especially in integer math (i.e. database binaries). Since Intel and AMD 64 bit are virtually identical, I submit that -march is not really important anymore except for very, very specific (but important) cases like spinlocks. Take a good look at the processor specific manuals and the x86-64 benches around the net. The evidence outside the DBMS domain is pretty solidly in contrast to your statement and position. Admittedly, DBMS are not web servers or FPS games or ... That's why we need to do our own rigorous study of the subject. This thread is about how much architecture dependent binaries can beat standard ones. I say they don't very much at all, and with the specific exception of Daniel's benchmarking the results posted to this list bear that out. ...and IMHO the issue is still very much undecided given that we don't have enough properly obtained and documented evidence. ATM, the most we can say is that in a number of systems with modest physical IO subsystems that are not running Gentoo Linux we have not been able to duplicate the results. (We've also gotten some interesting results like yours suggesting the arch specific optimizations are bad for pg performance in your environment.) In the process, questions have been asked and issues raised regarding both the tools involved and the proper way to use them. We really do need to have someone other than Daniel duplicate his Gentoo environment and independently try to duplicate his results. ...and let us bear in mind that this is not just intellectual curiosity. The less mysterious pg is, the better the odds pg will be adopted in any specific case. Ron Peacetree ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] New to PostgreSQL, performance considerations
At 07:06 PM 12/15/2006, Michael Stone wrote: On Fri, Dec 15, 2006 at 12:24:46PM -0500, Ron wrote: ATM, the most we can say is that in a number of systems with modest physical IO subsystems So I reran it on a 3.2GHz Xeon with 6G RAM off a ramdisk; I/O ain't the bottleneck on that one. Results didn't show any significant gains regardless of compilation options (results hovered around 12k tps). If people want to continue this, I will point out that they should make sure they're linked against the optimized libpq rather than an existing one elsewhere in the library path. Beyond that, I'm done with this thread. Maybe there are some gains to be found somewhere, but the testing done thus far (while limited) is sufficient, IMO, to demonstrate that compiler options aren't going to provide a blow-your-socks-off dramatic performance improvement. AFAICT, no one has stated there would be a "blow-your-socks-off dramatic performance improvement" for pg due to compilation options. Just that there might be some, and there might be some that are arch specific. So far these experiments have shown = multiple instances of a ~30-35% performance improvement going from -O0 to -O3 = 1 instance of arch specific options hurting performance when combined with -O3 = 1 instance of arch specific options helping performance on an OS that only one person has tested (Gentoo Linux) = that a 2.33 GHz C2D Mac laptop (under what OS?) with a typical laptop's modest physical IO subsystem can do ~2100 tps = that pg has a "speed limit" on a 3.2GHz Xeon (which kind?) with 6G RAM off a ramdisk (under what OS?) of ~12K tps (I'd be curious to see what this limit is with better CPUs and memory subsystems) Note that except for the first point, all the other results are singletons that as yet have not been reproduced. The most important "gain" IMO is Knowledge, and I'd say there is still more to learn and/or verify IMHO. YMMV. Ron Peacetree ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] New to PostgreSQL, performance considerations
Sorry for the delay in responding. I had familial obligations. As a matter of fact, I am spending a decent amount of time on this. I don't usually pore through documentation for compilers and OS's to the degree I've been since this thread started. Nor do I usually try and get access to the HW I'm presently tracking down. I'll post my thoughts re: detailed analysis of gcc/g++ compiler options later today or tomorrow as work schedule allows. Why this is worth it: 1= Any gains from setup and configuration are the cheapest ones available once we codify how to obtain them. 2= any public database or knowledge about how to best setup, configure, and test pg is very good for the community. 3= developers need to know and agree on proper procedure and technique for generating results for discussion or we end up wasting a lot of time. 4= measuring and documenting pg performance means we know where best to allocate resources for improving pg. Or where using pg is (in)appropriate compared to competitors. Potential performance gains are not the only value of this thread. Ron Peacetree At 12:33 PM 12/16/2006, Michael Stone wrote: On Sat, Dec 16, 2006 at 10:53:21AM -0500, Ron wrote: The most important "gain" IMO is Knowledge, and I'd say there is still more to learn and/or verify IMHO. YMMV. Well, I think there are other areas where I can spend my time where potential gains are more likely. YMMV (although, I note, you don't seem to be spending much of your own time testing this) ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] High update activity, PostgreSQL vs BigDBMS
At 12:46 AM 12/28/2006, Guy Rouillier wrote: I don't want to violate any license agreement by discussing performance, so I'll refer to a large, commercial PostgreSQL-compatible DBMS only as BigDBMS here. I'm trying to convince my employer to replace BigDBMS with PostgreSQL for at least some of our Java applications. As a proof of concept, I started with a high-volume (but conceptually simple) network data collection application. This application collects files of 5-minute usage statistics from our network devices, and stores a raw form of these stats into one table and a normalized form into a second table. We are currently storing about 12 million rows a day in the normalized table, and each month we start new tables. For the normalized data, the app inserts rows initialized to zero for the entire current day first thing in the morning, then throughout the day as stats are received, executes updates against existing rows. So the app has very high update activity. In my test environment, I have a dual-x86 Linux platform running the application, and an old 4-CPU Sun Enterprise 4500 running BigDBMS and PostgreSQL 8.2.0 (only one at a time.) The Sun box has 4 disk arrays attached, each with 12 SCSI hard disks (a D1000 and 3 A1000, for those familiar with these devices.) The arrays are set up with RAID5. So I'm working with a consistent hardware platform for this comparison. I'm only processing a small subset of files (144.) BigDBMS processed this set of data in 2 seconds, with all foreign keys in place. With all foreign keys in place, PG took 54000 seconds to complete the same job. I've tried various approaches to autovacuum (none, 30-seconds) and it doesn't seem to make much difference. What does seem to make a difference is eliminating all the foreign keys; in that configuration, PG takes about 3 seconds. Better, but BigDBMS still has it beat significantly. If you are using pg configured as default installed, you are not getting pg's best performance. Ditto using data structures optimized for BigDBMS. A= go through each query and see what work_mem needs to be for that query to be as RAM resident as possible. If you have enough RAM, set work_mem for that query that large. Remember that work_mem is =per query=, so queries running in parallel eat the sum of each of their work_mem's. B= Make sure shared buffers is set reasonably. A good rule of thumb for 8.x is that shared buffers should be at least ~1/4 your RAM. If your E4500 is maxed with RAM, there's a good chance shared buffers should be considerably more than 1/4 of RAM. C= What file system are you using? Unlike BigDBMS, pg does not have its own native one, so you have to choose the one that best suits your needs. For update heavy applications involving lots of small updates jfs and XFS should both be seriously considered. D= Your table schema and physical table layout probably needs to change. What BigDBMS likes here is most likely different from what pg likes. E= pg does not actually update records in place. It appends new records to the table and marks the old version invalid. This means that things like pages size, RAID stripe size, etc etc may need to have different values than they do for BigDBMS. Another consequence is that pg likes RAID 10 even more than most of its competitors. F= This may seem obvious, but how many of the foreign keys and other overhead do you actually need? Get rid of the unnecessary. 
G= Bother the folks at Sun, like Josh Berkus, who know pg inside and out +and+ know your HW inside and out (or have access to those that do ;-) ). I'll bet they'll have ideas I'm not thinking of. H= Explain Analyze is your friend. Slow queries may need better table statistics, or better SQL, or may be symptoms of issues "C" or "D" above or ... I've got PG configured so that the system database is on disk array 2, as are the transaction log files. The default table space for the test database is disk array 3. I've got all the reference tables (the tables to which the foreign keys in the stats tables refer) on this array. I also store the stats tables on this array. Finally, I put the indexes for the stats tables on disk array 4. I don't use disk array 1 because I believe it is a software array. I= With 4 arrays of 12 HDs each, you definitely have enough spindles to place pg_xlog somewhere separate from all the other pg tables. In addition, you should analyze your table access patterns and then scatter them across your 4 arrays in such a way as to minimize head contention. I'm out of ideas how to improve this picture any further. I'd appreciate some suggestions. Thanks. Hope this helps, Ron Peacetree ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
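A minimal sketch of point I using tablespaces (available in 8.0 and later); the tablespace names, directory paths, and table/index names here are invented for illustration and would need to match the actual arrays and schema:

CREATE TABLESPACE stats_data  LOCATION '/array3/pgdata';
CREATE TABLESPACE stats_index LOCATION '/array4/pgdata';
ALTER TABLE stats_200612 SET TABLESPACE stats_data;
ALTER INDEX stats_200612_pkey SET TABLESPACE stats_index;
-- pg_xlog itself is relocated at the filesystem level (e.g. a symlink), not via SQL.

Separating heap, index, and WAL traffic this way is what lets each array service one access pattern instead of all of them fighting over the same heads.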
Re: [PERFORM] slow result
At 07:34 AM 1/23/2007, Laurent Manchon wrote: Hi, I have a slow response of my PostgreSQL database 7.4 using this query below on a table with 80 rows: 1= Upgrade to the latest stable version of pg. That would be 8.2.x You are very much in the Dark Ages pg version wise. pg 8.x has significant IO enhancements. Especially compared to 7.4. select count(*)from tbl; PostgreSQL return result in 28 sec every time. although MS-SQL return result in 0.02 sec every time. 2= pg actually counts how many rows there are in a table. MS-SQL looks up a count value from a internal data table... which can be wrong in extraordinarily rare circumstances in a MVCC DBMS (which MS-SQL is !not!. MS-SQL uses the older hierarchical locking strategy for data protection.) Since pg actually scans the table for the count, pg's count will always be correct. No matter what. Since MS-SQL does not use MVCC, it does not have to worry about the corner MVCC cases that pg does. OTOH, MVCC _greatly_ reduces the number of cases where one transaction can block another compared to the locking strategy used in MS-SQL. This means in real day to day operation, pg is very likely to handle OLTP loads and heavy loads better than MS-SQL will. In addition, MS-SQL is a traditional Codd & Date table oriented DBMS. pg is an object oriented DBMS. Two very different products with very different considerations and goals (and initially designed at very different times historically.) Compare them under real loads using real queries if you are going to compare them. Comparing pg and MS-SQL using "fluff" queries like count(*) is both misleading and a waste of effort. My server is a DELL PowerEdge 2600 with bi-processor Xeon at 3.2 Ghz with 3GBytes RAM My PostgreSQL Conf is * log_connections = yes syslog = 2 effective_cache_size = 5 sort_mem = 1 max_connections = 200 shared_buffers = 3000 vacuum_mem = 32000 wal_buffers = 8 max_fsm_pages = 2000 max_fsm_relations = 100 Can you tell me is there a way to enhence performance ? There are extensive FAQs on what the above values should be for pg. The lore is very different for pg 8.x vs pg 7.x Thank you You're welcome. Ron Peacetree ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
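If an approximate count is acceptable for this kind of "fluff" query, the planner's own estimate can be read back instantly from the catalog. A sketch only; the figure is only as fresh as the last VACUUM/ANALYZE of the table:

SELECT reltuples::bigint AS approx_rows
FROM pg_class
WHERE relname = 'tbl';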
Re: [PERFORM] Tuning
At 06:24 PM 1/28/2007, Josh Berkus wrote: John, > -work_mem Depends on the number of concurrent queries you expect to run and what size sorts you expect them to do. EXPLAIN ANALYZE is your friend. It will tell you how much data each query is manipulating and therefore how much memory each query will chew. The next step is to figure out how many of each query will be running concurrently. Summing those will tell you the maximum work_mem each kind of query will be capable of using. If you have a deep enough understanding of how your pg system is working, then you can set work_mem on a per query basis to get the most efficient use of the RAM in your system. > -maintenance_work_mem - 50% of the largest table? Actually, in current code I've found that anything over 256mb doesn't actually get used. Is this considered a bug? When will this limit go away? Does work_mem have a similar limit? > -shared_buffers - max value 5 Actually, I need to update that. On newer faster multi-core machines you may want to allocate up to 1GB of shared buffers. > -effective_cache_size - max 2/3 of available ram, ie 24GB on the > hardware described above Yes. Why? "max of 2/3 of available RAM" sounds a bit hand-wavy. Especially with 32gb, 64GB, and 128GB systems available. Is there are hidden effective or hard limit here as well? For a dedicated pg machine, I'd assume one would want to be very aggressive about configuring the kernel, minimizing superfluous services, and configuring memory use so that absolutely as much as possible is being used by pg and in the most intelligent way given one's specific pg usage scenario. > -shmmax - how large dare I set this value on dedicated postgres servers? Set it to 2GB and you'll be covered. I thought that on 32b systems the 2GB shmmax limit had been raised to 4GB? and that there essentially is no limit to shmmax on 64b systems? What are Oracle and EnterpriseDB recommending for shmmax these days? My random thoughts, Ron Peacetree ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
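A sketch of the per-query work_mem approach described above; the query and the 128MB figure are invented and should be sized against what EXPLAIN ANALYZE reports (the unit suffix needs 8.2; earlier releases take the value in kB):

BEGIN;
SET LOCAL work_mem = '128MB';   -- applies to this transaction only
EXPLAIN ANALYZE
SELECT customer_id, sum(amount)
FROM big_sales_table
GROUP BY customer_id
ORDER BY sum(amount) DESC;
COMMIT;

Raising work_mem only where a specific sort or hash needs it avoids multiplying a large global setting by every concurrent connection.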
Re: [PERFORM] Querying distinct values from a large table
I strongly encourage anyone who is interested in the general external sorting problem peruse Jim Gray's site: http://research.microsoft.com/barc/SortBenchmark/ Ron Peacetree At 08:24 AM 1/31/2007, Gregory Stark wrote: Tom Lane <[EMAIL PROTECTED]> writes: > Alvaro Herrera <[EMAIL PROTECTED]> writes: > > Gregory Stark wrote: > >> (Incidentally I'm not sure where 2-5x comes from. It's entirely dependant on > >> your data distribution. It's not hard to come up with distributions where it's > >> 1000x as fast and others where there's no speed difference.) > > > So the figure is really "1-1000x"? I bet this one is more impressive in > > PHB terms. > > Luke has a bad habit of quoting numbers that are obviously derived from > narrow benchmarking scenarios as Universal Truths, rather than providing > the context they were derived in. I wish he'd stop doing that... In fairness I may have exaggerated a bit there. There is a limit to how much of a speedup you can get in valid benchmarking situations. A single sequential scan is always going to be necessary so you're only saving the cost of writing out the temporary file and subsequent merge passes. It's hard to generate lots of intermediate merge passes since there are only O(log(n)) of them. So to get 1000x speedup on a large I/O bound sort you would have to be sorting something on order of 2^1000 records which is ridiculous. Realistically you should only be able to save 2-5 intermediate merge passes. On the other there are some common situations where you could see atypical increases. Consider joining a bunch of small tables to generate a large result set. The small tables are probably all in memory and the result set may only have a small number of distinct values. If you throw out the duplicates early you save *all* the I/O. If you have to do a disk sort it could be many orders slower. This is actually not an uncommon coding idiom for MySQL programmers accustomed to fast DISTINCT working around the lack of subqueries and poor performance of IN and EXISTS. They often just join together all the tables in a big cross join and then toss in a DISTINCT at the top to get rid of the duplicates. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
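A small sketch of the idiom Gregory describes and a rewrite that never produces the duplicates in the first place (table names are invented, and customer_id is assumed unique in customers):

-- the "join everything, then DISTINCT away the duplicates" idiom
SELECT DISTINCT c.customer_id, c.name
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id;

-- the equivalent existence test, which skips the big sort/unique step
SELECT c.customer_id, c.name
FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id);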
Re: [PERFORM] Vacuumdb - Max_FSM_Pages Problem.
At 10:53 AM 2/26/2007, Peter Childs wrote: On 26/02/07, Pallav Kalva <[EMAIL PROTECTED]> wrote: Hi, I am in the process of cleaning up one of our big table, this table has 187 million records and we need to delete around 100 million of them. I am deleting around 4-5 million of them daily in order to catchup with vacuum and also with the archive logs space. So far I have deleted around 15million in past few days. max_fsm_pages value is set to 120. Vacuumdb runs once daily, here is the output from last night's vacuum job === INFO: free space map: 999 relations, 798572 pages stored; 755424 total pages needed DETAIL: Allocated FSM size: 1000 relations + 120 pages = 7096 kB shared memory. VACUUM From the output it says 755424 total pages needed , this number keeps growing daily even after vacuums are done daily. This was around 350K pages before the delete process started. I am afraid that this number will reach the max_fsm_pages limit soon and vacuums thereafter will never catch up . Can anyone please explain this behavior ? What should I do to catch up with vacuumdb daily ? Vacuum adds to free pages to the fsm so that they can be reused. If you don't fill up those free pages the fsm will fill up. Once the fsm is full no more pages can be added to the fsm. If you start writing to the free pages via inserts when vacuum next runs more free pages will be added that did not fit previously in the free space map due to it being full. If you are really deleting that many records you may be better coping those you want to a new table and dropping the old one. To actually recover space you need to either run vacuum full or cluster. This ought to be in the manual somewhere as this question gets asked about once a week. Peter. In fact , a= copying data to a new table and dropping the original table rather than b= updating the original table is a "standard best DBA practice" regardless of DB product. The only thing that changes from DB product to DB product is the exact point where the copy is large enough to make "copy, replace" better than "update in place". Rule of Thumb: No matter what DB product you are using, if it's more than 1/2 of any table or more than 1/4 of any table that does not fit into memory, it's usually better to copy replace rather then update in place. ...and I completely agree that we should document this sort of Industry Best Practice in a way that is easily usable by the pg community. Cheers, Ron ---(end of broadcast)--- TIP 6: explain analyze is your friend
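A minimal sketch of the copy-and-replace approach described above; the table names, date cutoff, and index are all invented for illustration, and any other indexes, constraints, and grants must be recreated on the new table before the swap:

BEGIN;
CREATE TABLE big_table_new AS
    SELECT * FROM big_table WHERE stamp >= '2006-06-01';  -- the rows being kept
CREATE INDEX big_table_new_stamp ON big_table_new (stamp);
ALTER TABLE big_table RENAME TO big_table_old;
ALTER TABLE big_table_new RENAME TO big_table;
COMMIT;
DROP TABLE big_table_old;  -- reclaims the space immediately, no VACUUM FULL needed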
Re: [PERFORM] Opinions on Raid
At 08:12 AM 2/27/2007, Joe Uhl wrote: We have been running Postgres on a 2U server with 2 disks configured in raid 1 for the os and logs and 4 disks configured in raid 10 for the data. I have since been told raid 5 would have been a better option given our usage of Dell equipment and the way they handle raid 10. I have just a few general questions about raid with respect to Postgres: [1] What is the performance penalty of software raid over hardware raid? Is it truly significant? We will be working with 100s of GB to 1-2 TB of data eventually. The real CPU overhead when using SW RAID is when using any form of SW RAID that does XOR operations as part of writes (RAID 5, 6, 50, ..., etc). At that point, you are essentially hammering on the CPU just as hard as you would on a dedicated RAID controller... ...and the dedicated RAID controller probably has custom HW helping it do this sort of thing more efficiently. That being said, SW RAID 5 in this sort of scenario can be reasonable if you =dedicate= a CPU core to it. So in such a system, your "n" core box is essentially a "n-1" core box because you have to lock a core to doing nothing but RAID management. Religious wars aside, this actually can work well. You just have to understand and accept what needs to be done. SW RAID 1, or 10, or etc should not impose a great deal of CPU overhead, and often can be =faster= than a dedicated RAID controller. SW RAID 5 etc in usage scenarios involving far more reads than writes and light write loads can work quite well even if you don't dedicate a core to RAID management, but you must be careful about workloads that are, or that contain parts that are, examples of the first scenario I gave. If you have any doubts about whether you are doing too many writes, dedicate a core to RAID stuff as in the first scenario. [2] How do people on this list monitor their hardware raid? Thus far we have used Dell and the only way to easily monitor disk status is to use their openmanage application. Do other controllers offer easier means of monitoring individual disks in a raid configuration? It seems one advantage software raid has is the ease of monitoring. Many RAID controller manufacturers and storage product companies offer reasonable monitoring / management tools. 3ware AKA AMCC has a good reputation in this area for their cards. So does Areca. I personally do not like Adaptec's SW for this purpose, but YMMV. LSI Logic has had both good and bad SW in this area over the years. Dell, HP, IBM, etc's offerings in this area tend to be product line specific. I'd insist on some sort of "try before you buy" if the ease of use / quality of the SW matters to your overall purchase decision. Then there are the various CSSW and OSSW packages that contain this functionality or are dedicated to it. Go find some reputable reviews. (HEY LURKERS FROM Tweakers.net: ^^^ THAT"S AN ARTICLE IDEA ;-) ) Cheers, Ron ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] increasing database connections
At 01:18 AM 3/1/2007, Joshua D. Drake wrote: Jonah H. Harris wrote: > On 3/1/07, Shiva Sarna <[EMAIL PROTECTED]> wrote: >> I am sorry if it is a repeat question but I want to know if database >> performance will decrease if I increase the max-connections to 2000. At >> present it is 100. > > Most certainly. Adding connections over 200 will degrade performance > dramatically. You should look into pgpool or connection pooling from > the application. huh? That is certainly not my experience. I have systems that show no depreciable performance hit on even 1000+ connections. To be fair to the discussion, these are on systems with 4+ cores. Usually 8+ and significant ram 16/32 gig fo ram. Sincerely, Joshua D. Drake Some caveats. Keeping a DB connection around is relatively inexpensive. OTOH, building and tearing down a DB connection =can be= expensive. Expensive or not, connection build and tear down are pure overhead activities. Any overhead you remove from the system is extra capacity that the system can use in actually answering DB queries (...at least until the physical IO system is running flat out...) So having 1000+ DB connections open should not be a problem in and of itself (but you probably do not want 1000+ queries worth of simultaneous HD IO!...). OTOH, you probably do !not! want to be constantly creating and destroying 1000+ DB connections. Better to open 1000+ DB connections once at system start up time and use them as a connection pool. The potential =really= big performance hit in having lots of connections around is in lots of connections doing simultaneous heavy, especially seek heavy, HD IO. Once you have enough open connections that your physical IO subsystem tends to be maxed out performance wise on the typical workload being handled, it is counter productive to allow any more concurrent DB connections. So the issue is not "how high a max-connections is too high?". It's "how high a max connections is too high for =my= HW running =my= query mix?" The traditional advice is to be conservative and start with a relatively small number of connections and increase that number only as long as doing so results in increased system performance on your job mix. Once you hit the performance plateau, stop increasing max-connections and let connection caching and pooling handle things. If that does not result in enough performance, it's time to initiate the traditional optimization hunt. Also, note Josh's deployed HW for systems that can handle 1000+ connections. ...and you can bet the IO subsystems on those boxes are similarly "beefy". Don't expect miracles out of modest HW. Ron ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
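One quick way to see how much of max_connections is actually in use before deciding to raise it; plain catalog/stats queries, nothing beyond the standard 8.x views:

SELECT count(*) AS connections_in_use,
       current_setting('max_connections') AS max_connections
FROM pg_stat_activity;

If the in-use number sits far below the limit even under peak load, raising max_connections buys nothing; pooling is the better lever.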
Re: [PERFORM] strange performance regression between 7.4 and 8.1
At 07:36 PM 3/1/2007, Jeff Frost wrote: On Thu, 1 Mar 2007, Alex Deucher wrote: >> Postgresql might be choosing a bad plan because your effective_cache_size >> is >> way off (it's the default now right?). Also, what was the block read/write > > yes it's set to the default. > >> speed of the SAN from your bonnie tests? Probably want to tune >> random_page_cost as well if it's also at the default. >> > > --Sequential Output-- --Sequential Input- > --Random- > -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- > --Seeks-- > MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec > %CP > luna12-san 16000M 58896 91 62931 9 35870 5 54869 82 145504 13 397.7 > 0 > So, you're getting 62MB/s writes and 145MB/s reads. Just FYI, that write speed is about the same as my single SATA drive write speed on my workstation, so not that great. The read speed is decent, though and with that sort of read performance, you might want to lower random_page_cost to something like 2.5 or 2 so the planner will tend to prefer index scans. Right, but the old box was getting ~45MBps on both reads and writes, so it's an improvement for me :) Thanks for the advice, I'll let you know how it goes. Do you think that is because you have a different interface between you and the SAN? ~45MBps is pretty slow - your average 7200RPM ATA133 drive can do that and costs quite a bit less than a SAN. Is the SAN being shared between the database servers and other servers? Maybe it was just random timing that gave you the poor write performance on the old server which might be also yielding occassional poor performance on the new one. Remember that pg, even pg 8.2.3, has a known history of very poor insert speed (see comments on this point by Josh Berkus, Luke Lonergan, etc) For some reason, the code changes that have resulted in dramatic improvements in pg's read speed have not had nearly the same efficacy for writes. Bottom line: pg presently has a fairly low and fairly harsh upper bound on write performance. What exactly that bound is has been the subject of some discussion, but IIUC the fact of its existence is well established. Various proposals for improving the situation exist, I've even made some of them, but AFAIK this is currently considered one of the "tough pg problems". Cheers, Ron Peacetree ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
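The random_page_cost suggestion above can be trialed per-session before touching postgresql.conf; the query below is only an invented stand-in for one of the affected queries:

SET random_page_cost = 2.5;
EXPLAIN ANALYZE
SELECT * FROM some_big_table WHERE indexed_col = 42;
-- compare plan choice and runtime against the default (4.0) in another session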
Re: [PERFORM] strange performance regression between 7.4 and 8.1
At 08:56 AM 3/2/2007, Carlos Moreno wrote: Florian Weimer wrote: * Alex Deucher: I have noticed a strange performance regression and I'm at a loss as to what's happening. We have a fairly large database (~16 GB). Sorry for asking, but is this a typo? Do you mean 16 *TB* instead of 16 *GB*? If it's really 16 GB, you should check if it's cheaper to buy more RAM than to fiddle with the existing infrastructure. This brings me to a related question: Do I need to specifically configure something to take advantage of such increase of RAM? In particular, is the amount of things that postgres can do with RAM limited by the amount of shared_buffers or some other parameter? Should shared_buffers be a fixed fraction of the total amount of physical RAM, or should it be the total amount minus half a gigabyte or so? As an example, if one upgrades a host from 1GB to 4GB, what would be the right thing to do in the configuration, assuming 8.1 or 8.2? (at least what would be the critical aspects?) Thanks, Carlos Unfortunately, pg does not (yet! ;-) ) treat all available RAM as a common pool and dynamically allocate it intelligently to each of the various memory data structures. So if you increase your RAM, you will have to manually change the entries in the pg config file to take advantage of it. (and start pg after changing it for the new config values to take effect) The pertinent values are all those listed under "Memory" in the annotated pg conf file: shared_buffers, work_mem, maintenance_work_mem, etc. http://www.powerpostgresql.com/Downloads/annotated_conf_80.html Cheers, Ron Peacetree ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
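After editing postgresql.conf for the larger RAM and restarting, it is worth confirming the server actually picked up the new values; a minimal check:

SHOW shared_buffers;
SHOW work_mem;
SHOW maintenance_work_mem;
SHOW effective_cache_size;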
Re: [PERFORM] strange performance regression between 7.4 and 8.1
At 10:16 AM 3/2/2007, Alex Deucher wrote:

On 3/2/07, Florian Weimer <[EMAIL PROTECTED]> wrote:

* Alex Deucher:

> I have noticed a strange performance regression and I'm at a loss as
> to what's happening. We have a fairly large database (~16 GB).

Sorry for asking, but is this a typo? Do you mean 16 *TB* instead of 16 *GB*? If it's really 16 GB, you should check if it's cheaper to buy more RAM than to fiddle with the existing infrastructure.

Yes, 16 GB. I'd rather not shell out for more ram if I'm not even sure that will help. The new system should be faster, or at least as fast, so I'd like to sort out what's going on before I buy more ram.

OK. You
a= went from pg 7.4.x to 8.1.4, AND
b= changed from 4 SPARC CPUs (how many cores? If this is > 4...) to 2 2C Opterons (SPEC and TPC bench differences between these CPUs?), AND
c= went from a Sun box to a "white box" (memory subsystem differences? other differences?), AND
d= went from local HD IO to a SAN (many differences hidden in that one line... ...and is the physical layout of tables and things like pg_xlog sane on the SAN?)
...and you did this by just pulling over the old DB onto the new HW?

May I suggest that it is possible that your schema, queries, etc were all optimized for pg 7.x running on the old HW? (explain analyze shows the old system taking ~1/10 the time per row as well as estimating the number of rows more accurately)

RAM is =cheap=. Much cheaper than the cost of a detective hunt followed by rework to queries, schema, etc. Fitting the entire DB into RAM is guaranteed to help unless this is an OLTP-like application where HD IO is required to be synchronous. If you can fit the entire DB comfortably into RAM, do it and buy yourself the time to figure out the rest of the story w/o impacting production performance.

Cheers,
Ron Peacetree
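As a quick sanity check on the "fit it all in RAM" math, the server itself can report how big the database actually is on disk; these functions exist from 8.1 on, so they are available on the new box:

  -- Total on-disk size of the current database, human readable.
  SELECT pg_size_pretty(pg_database_size(current_database()));

  -- Or per database, to see everything in the cluster.
  SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size
  FROM pg_database
  ORDER BY pg_database_size(datname) DESC;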
Re: [PERFORM] strange performance regression between 7.4 and 8.1
At 11:03 AM 3/2/2007, Alex Deucher wrote:

On 3/2/07, Ron <[EMAIL PROTECTED]> wrote:

May I suggest that it is possible that your schema, queries, etc were all optimized for pg 7.x running on the old HW? (explain analyze shows the old system taking ~1/10 the time per row as well as estimating the number of rows more accurately)

RAM is =cheap=. Much cheaper than the cost of a detective hunt followed by rework to queries, schema, etc. Fitting the entire DB into RAM is guaranteed to help unless this is an OLTP-like application where HD IO is required to be synchronous. If you can fit the entire DB comfortably into RAM, do it and buy yourself the time to figure out the rest of the story w/o impacting production performance.

Perhaps so. I just don't want to spend $1000 on ram and have it only marginally improve performance, if at all. The old DB works, so we can keep using that until we sort this out.

Alex

1= $1000 worth of RAM is very likely less than the $ worth of, say, 10 hours of your time to your company. Perhaps much less. (Your =worth=, not your pay or even your fully loaded cost. This number tends to be >= 4x what you are paid unless the organization you are working for is in imminent financial danger.) You've already put considerably more than 10 hours of your time into this...

2= If the DB goes from not fitting completely into RAM to being completely RAM resident, you are almost 100% guaranteed a big performance boost. The exception is an OLTP-like app where DB writes can't be done asynchronously (financial transactions, real-time control systems, etc). Data mines should never have this issue.

3= Whether adding enough RAM to make the DB RAM resident (and re-configuring conf, etc, appropriately) solves the problem or not, you will have gotten a serious lead as to what's wrong.

...and I still think looking closely at the actual physical layout of the tables in the SAN is likely to be worth it.

Cheers,
Ron Peacetree
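If it helps to start that inspection from the database side, a catalog query along these lines lists the largest relations and the tablespace each one currently lives in (a rough sketch; pg_relation_size is available from 8.1 on):

  -- Twenty largest user tables and indexes and where they currently live.
  SELECT n.nspname,
         c.relname,
         c.relkind,
         pg_size_pretty(pg_relation_size(c.oid))   AS size,
         COALESCE(t.spcname, '<database default>') AS tablespace
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
  LEFT JOIN pg_tablespace t ON t.oid = c.reltablespace
  WHERE c.relkind IN ('r', 'i')
    AND n.nspname NOT IN ('pg_catalog', 'information_schema')
  ORDER BY pg_relation_size(c.oid) DESC
  LIMIT 20;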
Re: [PERFORM] strange performance regression between 7.4 and 8.1
At 02:43 PM 3/2/2007, Alex Deucher wrote:

On 3/2/07, Ron <[EMAIL PROTECTED]> wrote:

...and I still think looking closely at the actual physical layout of the tables in the SAN is likely to be worth it.

How would I go about doing that?

Alex

Hard for me to give specific advice when I don't know what SAN product we are talking about, nor what kind of HDs are in it, nor how those HDs are presently configured...

I quote you in an earlier post: "The RAID groups on the SAN were set up for maximum capacity rather than for performance. Using it for the databases just came up recently."

That implies to me that the SAN is more or less set up as a huge 105-HD (assuming this number is correct? We all know how "assume" is spelled...) JBOD or RAID 5 (or 6, or 5*, or 6*) set.

=IF= that is true, tables are not being given dedicated RAID groups. That implies that traditional lore like having pg_xlog on dedicated spindles is being ignored. Nor is the more general Best Practice of putting the most heavily used tables onto dedicated spindles being followed. In addition, the most space-efficient RAID levels, 5* or 6*, are not the best-performing one; that is RAID 10 (striping your mirrors).

In short, configuring a SAN for maximum capacity is exactly the wrong thing to do if one is planning to use it in the best way to support DB performance.

I assume (there's that word again...) that there is someone in your organization who understands how the SAN is configured and administered. You need to talk to them about these issues.

Cheers,
Ron
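On the database side, the mechanism for giving a table or index its own RAID group is a tablespace (available since 8.0). A minimal sketch follows; the mount points, table, and index names are purely hypothetical, and the actual volumes would come out of the conversation with the SAN admin:

  -- Each LOCATION should be a directory on its own dedicated RAID group,
  -- owned by the postgres OS user.
  CREATE TABLESPACE busy_tables  LOCATION '/mnt/raid10_a/pgdata';
  CREATE TABLESPACE busy_indexes LOCATION '/mnt/raid10_b/pgdata';

  -- Move a hot table and its index onto the dedicated volumes.
  -- (This takes an exclusive lock and rewrites the relation's files,
  -- so do it during a maintenance window.)
  ALTER TABLE orders      SET TABLESPACE busy_tables;
  ALTER INDEX orders_pkey SET TABLESPACE busy_indexes;

pg_xlog has no SQL-level knob in 8.x; it is traditionally relocated by stopping the server and replacing the pg_xlog directory with a symlink to a directory on its own volume.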
Re: [PERFORM] strange performance regression between 7.4 and 8.1
At 10:25 AM 3/6/2007, Alex Deucher wrote:

On 3/5/07, Guido Neitzer <[EMAIL PROTECTED]> wrote:

On 05.03.2007, at 19:56, Alex Deucher wrote:

> Yes, I started setting that up this afternoon. I'm going to test that
> tomorrow and post the results.

Good - that may or may not give some insight in the actual bottleneck. You never know, but it seems to be one of the easiest to find out ...

Well, the SAN appears to be the limiting factor. I set up the DB on the local scsi discs (software RAID 1) and performance is excellent (better than the old server). Thanks for everyone's help.

Alex

What kind of SAN is it and how many + what kind of HDs are in it?

Assuming the answers are reasonable...

Profile the table IO pattern your workload generates and start allocating RAID sets to tables or groups of tables based on IO pattern.

For any table or group of tables that has a significant level of write IO, say >= ~25% of the IO mix, try RAID 5 or 6 first, but be prepared to go RAID 10 if performance is not acceptable.

Don't believe any of the standard "lore" regarding what tables to put where or what tables to give dedicated spindles to. Profile, benchmark, and only then start allocating dedicated resources. For instance, I've seen situations where putting pg_xlog on its own spindles was !not! the right thing to do.

Best Wishes,
Ron Peacetree
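One way to get such a profile without external tools is the statistics collector's own views (stats_block_level and stats_row_level need to be enabled in 8.1/8.2 for the block and row counters to be populated). A rough sketch of a per-table read/write mix query, using only columns present in those releases:

  -- Rows written vs. rows read per table, plus blocks actually fetched
  -- from disk, as a rough proxy for each table's IO mix.  Counters
  -- accumulate since the last stats reset.
  SELECT s.relname,
         s.n_tup_ins + s.n_tup_upd + s.n_tup_del           AS rows_written,
         s.seq_tup_read + COALESCE(s.idx_tup_fetch, 0)     AS rows_read,
         io.heap_blks_read + COALESCE(io.idx_blks_read, 0) AS blks_read_from_disk
  FROM pg_stat_user_tables s
  JOIN pg_statio_user_tables io USING (relid)
  ORDER BY rows_written DESC;

Tables near the top of that list are the candidates for their own RAID sets; tables whose reads dwarf their writes can usually share.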