Re: [PERFORM] DRBD and Postgres: how to improve the performance?
Hi,

Decibel! wrote:
> Actually, in this case, I suspect that latency will be far more critical
> than overall bandwidth. I don't know if it's inherent to Gig-E, but my
> limited experience has been that Gig-E has higher latency than 100mb.

I've been looking for some benchmarks, but they're rather hard to find. It
looks like people are much more concerned about throughput ?!?

However, I'd like to share some of the sites I've found, especially
regarding Fast Ethernet vs. Gigabit Ethernet:

- Ashford Computer Consulting Service benchmarked five different gigabit
  ethernet adapters [1], back in 2004. For most cards they measured between
  ca. 100 - 150 microseconds for a UDP round trip of a token, a so-called
  hot potato benchmark. Unfortunately they didn't compare with Fast
  Ethernet.

- The NetPIPE project has some of its measurements at the very bottom of
  its website [2], mostly for high-speed and low-latency links. Again, Fast
  Ethernet is missing. The diagram shows the following latencies (in
  microseconds):

    75    10 Gigabit Ethernet
    62    Gigabit Ethernet
     8    Myrinet
     7.5  InfiniBand
     4.7  Atoll
     4.2  SCI

  I have no explanation for the significantly better gigabit ethernet
  figure compared with the above benchmark. From their description I
  conclude that they also measured a round trip, but not via UDP. The bad
  value for 10 Gigabit Ethernet is due to a poor Intel adapter, which also
  has poor throughput. They claim that newer adapters are better.

- Finally, I've found a latency comparison between Fast and Gigabit
  Ethernet here [3]. Figure 6, in the second third of the page, shows a
  NetPIPE latency benchmark between Ethernet, Fast Ethernet and Gigabit
  Ethernet (additionally ATM and FDDI). It looks like Gigabit Ethernet
  features slightly better latency.

From these findings I conclude that commodity Ethernet hardware has quite
similar latencies, no matter whether you are using Fast, Gigabit or 10
Gigabit Ethernet. If you really want a low-latency interconnect, you need
to pay the extra bucks for specialized, low-latency networking hardware
(which may still be based on 10GE, see Myrinet's 10GE adapter).

If you know of other resources, I'd be curious to hear about them.

Regards

Markus

[1]: Ashford Computer Consulting Service, GigE benchmarks:
     http://www.accs.com/p_and_p/GigaBit/conclusion.html
[2]: NetPIPE website:
     http://www.scl.ameslab.gov/netpipe/
[3]: Gigabit Ethernet and Low-Cost Supercomputing
     http://www.scl.ameslab.gov/Publications/Gigabit/tr5126.html
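In case someone wants to reproduce such a "hot potato" number on their own
pair of boxes, a minimal round-trip test is easy to sketch in Python (port
number and iteration count below are arbitrary; this measures
application-to-application latency over UDP, so expect somewhat higher
numbers than the raw NIC figures above):

#!/usr/bin/python
# Minimal UDP "hot potato" round-trip benchmark (sketch only).
# Start with --server on one box, then run "hotpotato.py <serverhost>" on the other.
import socket, sys, time

PORT = 5123      # arbitrary test port
COUNT = 1000     # round trips to average over

def server():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("", PORT))
    while True:                       # bounce every datagram straight back
        data, addr = s.recvfrom(1024)
        s.sendto(data, addr)

def client(host):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(2.0)
    start = time.time()
    for i in range(COUNT):
        s.sendto("x", (host, PORT))   # token goes out...
        s.recvfrom(1024)              # ...and comes back
    print "%.1f microseconds per round trip" % ((time.time() - start) / COUNT * 1e6)

if __name__ == "__main__":
    if sys.argv[1] == "--server":
        server()
    else:
        client(sys.argv[1])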
Re: [PERFORM] Replication
Hi,

Andrew Sullivan wrote:
> This isn't quite true. Slony-II was originally conceived by Jan as an
> attempt to implement some of the Postgres-R ideas.

Oh, right, thanks for that correction.

> Part of the problem, as near as I could tell, was that we had no group
> communication protocol that would really work. Spread needed a _lot_ of
> work (where "lot of work" may mean "rewrite"), and I just didn't have the
> humans to put on that problem. Another part of the problem was that, for
> high-contention workloads like the ones we happened to be working on, an
> optimistic approach like Postgres-R is probably always going to be a
> loser.

Hm.. for high contention on single rows, sure, yes - you would mostly get
rollbacks for conflicting transactions. But the optimism there is
justified, as I think most real-world transactions don't conflict (or else
you can work around such high single-row contention). You are right in
that the serialization done by the GCS can be a bottleneck. However,
there's lots of research going on in that area and I'm convinced that
Postgres-R has its value.

Regards

Markus
Re: [PERFORM] Replication
Hi,

Joshua D. Drake wrote:
>> Slony-II seems brilliant, a solid theoretical foundation, at the
>> forefront of computer science. But can't find project status -- when
>> will it be available? Is it a pipe dream, or a nearly-ready reality?
>
> Dead

Not quite... there's still Postgres-R, see www.postgres-r.org

And I'm continuously working on it, despite not having updated the website
for almost a year now... I planned on releasing the next development
snapshot together with 8.3; as that seems to be delayed, that seems
realistic ;-)

Regards

Markus
Re: [PERFORM] Replication
Hi,

Craig James wrote:
> Is Postgres-R the same thing as Slony-II? There's a lot of info and news
> around about Slony-II, but your web page doesn't seem to mention it.

Hm... true. Good point. Maybe I should add a FAQ: Postgres-R has been the
name of the research project by Bettina Kemme et al. Slony-II was the name
Neil and Gavin gave their attempt to continue that project. I've based my
work on the old (6.4.2) Postgres-R source code - and I'm still calling it
Postgres-R, probably Postgres-R (8) to distinguish it from the original
one. But I'm thinking about changing the name completely... however, I'm a
developer, not a marketing guru.

> While researching replication solutions, I had a heck of a time sorting
> out the dead or outdated web pages (like the stuff on gborg) from the
> active projects.

Yeah, that's one of the main problems with replication for PostgreSQL. I
hope Postgres-R (or whatever name I'll come up with in the future) can
change that.

> Either way, it's great to know you're working on it.

Maybe you want to join its mailing list [1]? I'll try to get some
discussion going there in the near future.

Regards

Markus

[1]: Postgres-R on gborg:
     http://pgfoundry.org/projects/postgres-r/
Re: [PERFORM] dbt2 NOTPM numbers
Hi,

Mark Wong wrote:
> Yeah, I ran with 500+ warehouses, but I had 6 14-disk arrays of 15K RPM
> scsi drives and 6 dual-channel controllers... :)

Lucky you!

In the meantime, I've figured out that the box in question peaked at about
1450 NOTPMs with 120 warehouses on RAID 1+0. I'll try to compare again to
RAID 6.

Is there any place where such results are collected?

Regards

Markus
Re: [PERFORM] dbt2 NOTPM numbers
Hi,

Jim Nasby wrote:
> I don't think that kind of testing is useful for good raid controllers
> on RAID5/6, because the controller will just be streaming the data out;
> it'll compute the parity blocks on the fly and just stream data to the
> drives as fast as possible.

That's why I called it 'simplistic throughput testing'...

> But that's not how writes in the database work (except for WAL); you're
> writing stuff all over the place, none of which is streamed. So in the
> best case (the entire stripe being updated is in the controller's cache),
> at a minimum it's going to have to write data + parity (* 2 for RAID 6,
> IIRC) for every write. But any real-sized database is going to be far
> larger than your raid cache, which means there's a good chance a block
> being written will no longer have its stripe in cache. In that case, the
> controller is going to have to read a bunch of data back off the drive,
> which is going to clobber performance.

I'm well aware. Our workload (hopefully) has a much lower write/read ratio
than dbt2, so RAID 6 might work anyway.

> Now, add that performance bottleneck on top of your WAL writes and you're
> in real trouble.

Well, I'm basically surprised by the low NOTPM numbers compared to my
desktop system, which also does around 200 NOTPMs with only two platters
in a RAID 1 config... How can a server with four cores and 8 platters be
equally slow?

Anyway, I've now reconfigured the system with RAID 1+0 and got more than
twice the NOTPMs:

                          Response Time (s)
 Transaction      %    Average :    90th %        Total   Rollbacks      %
------------  -----  ---------------------  -----------  ----------  -----
    Delivery   3.84      204.733 : 241.998          704           0   0.00
   New Order  45.77      203.651 : 242.847         8382          75   0.90
Order Status   4.32      199.184 : 238.081          792           0   0.00
     Payment  42.02      198.969 : 236.549         7695           0   0.00
 Stock Level   4.04      198.668 : 236.113          740           0   0.00
------------  -----  ---------------------  -----------  ----------  -----

567.72 new-order transactions per minute (NOTPM)
14.5 minute duration
0 total unknown errors
529 second(s) ramping up

I'm still feeling that 550 is pretty low. The response times are beyond
good and evil. As vmstat.out tells us, the CPUs are still pretty much idle
or waiting most of the time:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd    free  buff   cache  si  so    bi    bo    in    cs us sy id wa
 0  3 494560 2181964 8 1787680 13 15 317 242 1402 4 1 72 23
 0  9 494560 1558892 8 229834800 2973 2018 584 1114 2 1 76 21
 1 14 494496 424116 8 331600020 5613 9293 935 2943 5 1 29 65
 0 15 452840 150148 8 3487160 7383 5662 8709 925 3444 5 2 21 73
 0 11 439172 151052 8 3386556 2630 5690 8293 969 4145 5 2 23 70
 0 17 438996 149748 8 3308184 576 5036 7174 902 4104 5 2 25 69
 1 25 439940 150344 8 32283049 28 4757 7479 922 4269 5 2 26 67

For everybody interested, these settings differ from the Pg 8.2 default
postgresql.conf:

listen_addresses = '*'
port = 54321
shared_buffers = 2048MB
work_mem = 10MB
maintenance_work_mem = 64MB
#max_stack_depth = 4MB
max_fsm_pages = 409600
checkpoint_segments = 6
checkpoint_timeout = 1h
effective_cache_size = 3800MB
log_min_duration_statement = 500

For dbt2, I've used 500 warehouses and 90 concurrent connections, default
values for everything else.

Do I simply have to put more RAM (currently 4 GB) into that machine? Or
what else can be wrong? Is anybody else seeing low performance with the
Areca SATA controllers? (In my case: Areca Technology Corp. ARC-1260
16-Port PCI-Express to SATA RAID Controller, according to lspci.)

Then again, maybe I'm just expecting too much...

Regards

Markus
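As a back-of-the-envelope sanity check on why RAID 1+0 helps so much here,
one can compare the classic random-write penalties of the two levels. The
per-disk IOPS figure and the penalty factors below (2 for RAID 1+0, 6 for
a RAID 6 read-modify-write) are textbook rules of thumb, not measurements
of this particular Areca setup:

#!/usr/bin/python
# Rule-of-thumb random write capacity of an 8-disk array (all figures assumed).

DISKS = 8
IOPS_PER_DISK = 100          # rough ballpark for a single SATA drive
WRITE_PENALTY = {
    "RAID 1+0": 2,           # every logical write goes to two mirrored disks
    "RAID 6":   6,           # read data+P+Q, then write data+P+Q (worst case)
}

for level, penalty in WRITE_PENALTY.items():
    print "%-8s: ~%d random writes/sec" % (level, DISKS * IOPS_PER_DISK / penalty)

That crude 3:1 ratio is at least in the same ballpark as the jump from
~240 to ~570 NOTPM, even though the controller cache blurs the picture
considerably.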
Re: [PERFORM] dbt2 NOTPM numbers
Hi,

Heikki Linnakangas wrote:
> I still suspect there's something wrong with plans, I doubt you can get
> that bad performance unless it's doing something really stupid.

Agreed, but I'm still looking for that really stupid thing... AFAICT,
there are really no seqscans - see the pg_stat_user_tables below.

> I'd suggest setting log_min_duration_statement = 5000, and seeing what
> you get. Also check pg_stat_user_table.seq_scan just to be extra sure
> there's no seq scans.

I've also added some of the log messages for log_min_duration_statement
below. Both were taken after two or three test runs.

I'm really wondering if the RAID 6 of the ARECA 1260 hurts so badly. That
would be disappointing, IMO. I'll see if I can reconfigure it to do
RAID 1+0, and then test again. (Unfortunately the box has already been
shipped to the customer, so that's getting tricky to do via ssh... :-( )

Regards

Markus

*** pg_stat_user_tables ***

 relid | schemaname |  relname   | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | last_vacuum | last_autovacuum               | last_analyze | last_autoanalyze
-------+------------+------------+----------+--------------+----------+---------------+-----------+-----------+-----------+-------------+-------------------------------+--------------+-------------------------------
 16390 | public     | district   |        0 |            0 |   206335 |        206335 |         0 |    100771 |         0 |             | 2007-06-05 15:40:44.39573+02  |              | 2007-06-05 15:39:41.636736+02
 16396 | public     | new_order  |        0 |            0 |    91860 |      41402317 |     51372 |         0 |     45844 |             |                               |              |
 16400 | public     | order_line |        0 |            0 |   101195 |        933197 |    538442 |    436140 |         0 |             |                               |              |
 16402 | public     | item       |        0 |            0 |   538942 |        538442 |         0 |         0 |         0 |             |                               |              |
 16404 | public     | stock      |        0 |            0 |  1093528 |       1077782 |         0 |    538442 |         0 |             |                               |              |
 16394 | public     | history    |        0 |            0 |          |               |     49399 |         0 |         0 |             |                               |              |
 16388 | public     | warehouse  |        0 |            0 |   150170 |        150170 |         0 |     49399 |         0 |             | 2007-06-05 15:39:41.059572+02 |              | 2007-06-05 15:38:39.976122+02
 16398 | public     | orders     |        0 |            0 |    96490 |         96519 |     51372 |     45930 |         0 |             |                               |              |
 16392 | public     | customer   |        0 |            0 |   233263 |        599917 |         0 |     95329 |         0 |             |                               |              |

*** database log snippet ***

2007-06-05 15:42:09 CEST LOG:  duration: 6020.820 ms  statement: SELECT * FROM order_status(1747, 291, 3, '')
2007-06-05 15:42:09 CEST LOG:  duration: 688.730 ms  statement: SELECT payment(47, 2, 1533, 47, 2, '', 4295.46)
2007-06-05 15:42:09 CEST LOG:  duration: 5923.518 ms  statement: SELECT payment(319, 8, 0, 319, 8, 'OUGHTATIONEING', 2331.47)
2007-06-05 15:42:09 CEST LOG:  duration: 6370.433 ms  statement: FETCH ALL IN mycursor
2007-06-05 15:42:09 CEST LOG:  duration: 6463.583 ms  statement: FETCH ALL IN mycursor
2007-06-05 15:42:09 CEST LOG:  duration: 6358.047 ms  statement: FETCH ALL IN mycursor
2007-06-05 15:42:09 CEST LOG:  duration: 6114.972 ms  statement: FETCH ALL IN mycursor
2007-06-05 15:42:09 CEST LOG:  duration: 6193.684 ms  statement: SELECT payment(96, 10, 0, 96, 10, 'ESEOUGHTBAR', 997.05)
2007-06-05 15:42:09 CEST LOG:  duration: 6375.163 ms  statement: FETCH ALL IN mycursor
2007-06-05 15:42:09 CEST LOG:  duration: 6139.613 ms  statement: SELECT payment(454, 8, 0, 454, 8, 'ANTIOUGHTEING', 1575.11)
2007-06-05 15:42:09 CEST LOG:  duration: 6336.462 ms  statement: FETCH ALL IN mycursor
2007-06-05 15:42:09 CEST LOG:  duration: 6420.227 ms  statement: FETCH ALL IN mycursor
2007-06-05 15:42:09 CEST LOG:  duration: 6447.025 ms  statement: FETCH ALL IN mycursor
2007-06-05 15:42:09 CEST LOG:  duration: 15549.277 ms  statement: SELECT delivery(124, 7)
2007-06-05 15:42:09 CEST LOG:  duration: 1432.199 ms  statement: FETCH ALL IN mycursor
2007-06-05 15:42:09 CEST LOG:  duration: 6478.086 ms  statement: FETCH ALL IN mycursor
2007-06-05 15:42:09 CEST LOG:  duration: 1405.925 ms  statement: FETCH ALL IN mycursor
2007-06-05 15:42:09 CEST LOG:  duration: 8399.567 ms  statement: SELECT delivery(374, 4)
2007-06-05 15:42:10 CEST LOG:  duration: 657.939 ms  statement: FETCH ALL IN mycursor
2007-06-05 15:42:10 CEST LOG:  duration:
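For completeness, the seq_scan check suggested above boils down to a few
lines; here's a minimal sketch (it assumes psycopg2 is installed and that
the dbt2 database is simply called "dbt2" and reachable locally - adjust
the DSN to your setup):

#!/usr/bin/python
# Quick check for sequential scans after a test run (sketch; DSN is an assumption).
import psycopg2

db = psycopg2.connect("dbname=dbt2")
dbc = db.cursor()
dbc.execute("SELECT relname, seq_scan, idx_scan FROM pg_stat_user_tables "
            "ORDER BY seq_scan DESC;")
for relname, seq_scan, idx_scan in dbc.fetchall():
    marker = ""
    if seq_scan > 0:
        marker = "  <-- worth an EXPLAIN"
    print "%-12s seq_scan=%-8s idx_scan=%s%s" % (relname, seq_scan, idx_scan, marker)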
Re: [PERFORM] dbt2 NOTPM numbers
Hi,

Heikki Linnakangas wrote:
> Maybe, TPC-C is very write-intensive. I don't know much about RAID
> stuff, but I think you'd really benefit from a separate WAL drive. You
> could try turning fsync=off to see if that makes a difference.

Hm.. good idea, I'll try that.

> Oh, and how many connections are you using? DBT-2 can be quite sensitive
> to that. 30 seems to work pretty well for me.

I've been using between 2 and 90, but that made pretty much no difference
at all. I'm not getting anything more than some 300 NOTPM.

Regards

Markus
[PERFORM] dbt2 NOTPM numbers
Hi,

I'm currently playing with dbt2 and am wondering if the results I'm
getting are reasonable. I'm testing a 2x Dual Core Xeon system with 4 GB
of RAM and 8 SATA HDDs attached via an Areca RAID controller w/ battery
backed write cache. Seven of the eight platters are configured as one
RAID 6, plus one spare drive. That should leave five platters for
distributing read-only accesses.

The NOTPM numbers I'm getting are suspiciously low, IMO, but maybe I'm
expecting too much. What do you think, is this reasonable or do I have to
twiddle with the configuration somewhat more?

Regards

Markus

Here are my results:

                          Response Time (s)
 Transaction      %    Average :    90th %        Total   Rollbacks      %
------------  -----  ---------------------  -----------  ----------  -----
    Delivery   3.83      549.046 : 595.280         1212           0   0.00
   New Order  45.79      524.659 : 562.016        14494         151   1.05
Order Status   3.98      517.497 : 551.552         1261           0   0.00
     Payment  42.50      514.562 : 550.383        13452           0   0.00
 Stock Level   3.90      510.610 : 546.957         1236           0   0.00
------------  -----  ---------------------  -----------  ----------  -----

238.39 new-order transactions per minute (NOTPM)
59.5 minute duration
0 total unknown errors
529 second(s) ramping up
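One more thing worth sanity-checking is plain database size versus RAM.
Assuming the commonly quoted ballpark of very roughly 100 MB per TPC-C
warehouse including indexes (an assumption, not a measurement of this
particular build), 500 warehouses dwarf the 4 GB of RAM:

#!/usr/bin/python
# Rough working-set estimate for a 500-warehouse dbt2 run (all figures assumed).

WAREHOUSES = 500
MB_PER_WAREHOUSE = 100       # ballpark incl. indexes, not measured
RAM_MB = 4 * 1024

db_gb = WAREHOUSES * MB_PER_WAREHOUSE / 1024.0
print "database ~%.0f GB vs. %.0f GB RAM" % (db_gb, RAM_MB / 1024.0)
# With ~50 GB of data and only 4 GB of cache, nearly every stock/customer
# lookup is a cold random read, so heavy iowait would not be surprising.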
Re: [PERFORM] dbt2 NOTPM numbers
Hi,

Heikki Linnakangas wrote:
> There's clearly something wrong. The response times are ridiculously
> high, they should be < 5 seconds (except for the stock level
> transaction) to pass a TPC-C test. I wonder if you built any indexes at
> all?

Hm.. according to the output/5/db/plan0.out, all queries use index scans,
so that doesn't seem to be the problem.

> The configuration I'm running here has 3 data drives, and I'm getting
> reasonable results with ~100 warehouses, at ~1200 noTPM.

Thanks, that's exactly the one simple and very raw comparison value I've
been looking for (since most of the results pages of the (former?) OSDL
are down).

I'll run a bonnie++ first. As the CPUs seem to be idle most of the time
(see the vmstat.out below), I'm suspecting the RAID or the disks.

Regards

Markus

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd    free  buff   cache  si  so    bi    bo    in    cs us sy id wa
 2  2 329512 1289384 8 268070400 4 112 0 0 100 0
 0 48 329512 1733016 8 208340000 1789 2265 553 1278 1 1 63 35
 2 81 329512 492052 8 319453200 6007 10135 1025 3291 6 1 21 71
 0  8 329512 153360 8 345793600 6321 11919 1093 4581 7 2 12 79
 0  9 329512 150188 8 343338000 2083 5078 707 2197 2 1 35 62
 0  6 329512 148412 8 340874800 1001 2888 526 1203 1 0 34 64
 0 27 329512 152212 8 337973600 2281 5166 733 2320 3 1 18 79
 0 11 329512 152560 8 335594000 1837 4028 626 1738 2 1 35 63
 0 14 329512 149268 8 333491200 1674 3836 630 1619 2 1 31 67
 0  6 329512 152916 8 331155200 1404 3017 568 1372 1 0 57 41
 0 13 329688 149492 8 331320000 1687 4178 650 1644 2 1 29 69
 0 84 329688 153480 8 330968400 812 3790 641 2669 1 1 22 76
 0 18 329688 149232 8 33140320087 2147 511 2414 0 0 16 83
 3 20 329688 149196 8 331464800 756 1854 496 1044 1 0 52 47
Re: [PERFORM] dbt2 NOTPM numbers
Hi,

PFC wrote:
> You have a huge amount of iowait!

Yup.

> Did you put the xlog on a separate disk?

No, it's all one big RAID 6 for the sake of simplicity (plus I somewhat
doubt that 2 disks for WAL + 5 for data + 1 spare would be much faster
than 7 disks for WAL and data + 1 spare - considering that RAID 6 needs
two parity disks, that's 3 vs 5 disks for data...).

> What filesystem do you use?

XFS

> Did you check that your BBU cache works?

Thanks to your hint, yes. I've attached the small python script, in case
it might help someone else, too.

> For that, run a dumb script which does INSERTs in a test table in
> autocommit mode; if you get (7200 rpm / 60) = 120 inserts/sec or less,
> the good news is that your drives don't lie about fsync, the bad news is
> that your BBU cache isn't working...

According to my little script, I constantly get somewhere around 6000
inserts per second, so I guess either my BBU works, or the drives are
lying ;-)

Simplistic throughput testing with dd gives 200 MB/s, which also seems
fine. Obviously there's something else I'm doing wrong. I didn't really
care much about postgresql.conf, except setting a larger shared_buffers
and a reasonable effective_cache_size.

Oh, something else that's probably worth thinking about (and just came to
my mind again): the XFS is on lvm2, on that RAID 6.

Regards

Markus


Simplistic throughput testing with dd:

dd of=test if=/dev/zero bs=10K count=800000
800000+0 records in
800000+0 records out
8192000000 bytes (8.2 GB) copied, 37.3552 seconds, 219 MB/s

pamonth:/opt/dbt2/bb# dd if=test of=/dev/zero bs=10K count=800000
800000+0 records in
800000+0 records out
8192000000 bytes (8.2 GB) copied, 27.6856 seconds, 296 MB/s


#!/usr/bin/python

import sys, time
import psycopg

count = 50

db = psycopg.connect("user=postgres dbname=test")
db.autocommit(True)
dbc = db.cursor()

dbc.execute("CREATE TABLE test (data TEXT);")

sys.stdout.flush()
start_t = time.time()

for i in range(count):
    dbc.execute("INSERT INTO test VALUES('insert no. %d');" % i)

diff_t = time.time() - start_t
print "%d inserts in %0.3f seconds, %f inserts/sec" % (count, diff_t, count / diff_t)

dbc.execute("DROP TABLE test;")


Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
pamonth          8G 48173  99 214016  33 93972  20 49244  92 266763  32 615.3   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 20510  82 +++++ +++ 25655  98 23954  99 +++++ +++ 25441  99
pamonth,8G,48173,99,214016,33,93972,20,49244,92,266763,32,615.3,0,16,20510,82,+++++,+++,25655,98,23954,99,+++++,+++,25441,99
Re: [GENERAL] [PERFORM] Locking in PostgreSQL?
Hi,

Dave Cramer wrote:
> Apparently I've completely misunderstood MVCC then

Probably not. You are both somewhat right.

Jens Schipkowski wrote:
> That's not right. UPDATE will force a RowExclusiveLock to rows matching
> the WHERE clause, or all if none is specified.

That's almost right: RowExclusiveLock is a table-level lock. An UPDATE
acquires that, yes. Additionally there are row-level locks, which is what
you're speaking about. An UPDATE gets an exclusive row-level lock on the
rows it updates. Please note, however, that these row-level locks only
block concurrent writers, not readers (MVCC lets the readers see the old,
unmodified row).

> My understanding is that unless you do a select ... for update then
> update, the rows will not be locked.

Also almost right, depending on what you mean by 'locked'. A plain SELECT
acquires an ACCESS SHARE lock on the table, but no row-level locks. Only
SELECT ... FOR UPDATE (or FOR SHARE) acquires row-level locks (exclusive
or shared, respectively).

The very fine documentation covers that in [1].

Regards

Markus

[1]: PostgreSQL Documentation, Explicit Locking:
     http://www.postgresql.org/docs/8.2/interactive/explicit-locking.html
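To make the writer-vs-reader distinction concrete, here's a tiny sketch
using psycopg2 (the table name, the "dbname=test" DSN and the use of
psycopg2 itself are just assumptions for the example). It shows that an
uncommitted UPDATE does not block a plain SELECT, while a SELECT ... FOR
UPDATE on the same row needs the row-level lock:

#!/usr/bin/python
# Demonstration of row-level locking vs. MVCC reads (sketch only).
import psycopg2

a = psycopg2.connect("dbname=test")   # session A: the writer
b = psycopg2.connect("dbname=test")   # session B: the reader
ca, cb = a.cursor(), b.cursor()

ca.execute("CREATE TABLE lock_demo (id int PRIMARY KEY, val text)")
ca.execute("INSERT INTO lock_demo VALUES (1, 'old')")
a.commit()

# A updates the row but does not commit: it now holds an exclusive
# row-level lock on id = 1 (plus RowExclusiveLock on the table).
ca.execute("UPDATE lock_demo SET val = 'new' WHERE id = 1")

# B can still read the row - MVCC serves the old, unmodified version.
cb.execute("SELECT val FROM lock_demo WHERE id = 1")
print "plain SELECT sees:", cb.fetchone()[0]          # -> 'old'

# B's SELECT ... FOR UPDATE needs the row lock; NOWAIT makes it fail
# immediately instead of waiting for A to commit or abort.
try:
    cb.execute("SELECT val FROM lock_demo WHERE id = 1 FOR UPDATE NOWAIT")
except psycopg2.Error, e:
    print "FOR UPDATE NOWAIT failed:", str(e).strip()

b.rollback()
a.commit()
ca.execute("DROP TABLE lock_demo")
a.commit()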
Re: [PERFORM] File Systems Compared
Hi,

Alexander Staubo wrote:
> Care to post these numbers *without* word wrapping? Thanks.

How is one supposed to do that? Care to give an example?

Markus
Re: [PERFORM] File Systems Compared
Hi,

Steinar H. Gunderson wrote:
> This is a rather long sentence without any kind of word wrapping except
> what would be imposed on your own side -- how to set that up properly
> depends on the sending e-mail client, but in mine it's just a matter of
> turning off the word wrapping in your editor :-)

Duh! Cool, thank you for the example :-)

I thought the MTA or at least the mailing list would wrap mails at some
limit. I've now set word-wrap to characters (it seems not possible to turn
it off completely in Thunderbird). But when writing, I'm now getting one
long line. What's common practice? What's it on the pgsql mailing lists?

Regards

Markus