Re: [PERFORM] Hardware/OS recommendations for large databases (
It certainly makes quite a difference as I measure it: doing select(1) from a 181000 page table (completely uncached) on my PIII: 8.0 : 32 s 8.1 : 25 s Note that the 'fastcount()' function takes 21 s in both cases - so all the improvement seems to be from the count overhead reduction. Are you running windows? There is a big performance improvement in count(*) on pg 8.0-8.1 on win32 that is not relevant to this debate... Merlin ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Hardware/OS recommendations for large databases (
Merlin Moncure wrote: It certainly makes quite a difference as I measure it: doing select(1) from a 181000 page table (completely uncached) on my PIII: 8.0 : 32 s 8.1 : 25 s Note that the 'fastcount()' function takes 21 s in both cases - so all the improvement seems to be from the count overhead reduction. Are you running windows? There is a big performance improvement in count(*) on pg 8.0-8.1 on win32 that is not relevant to this debate... No - FreeBSD 6.0 on a dual PIII 1 Ghz. The slow cpu means that the 8.1 improvements are very noticeable! A point of interest - applying Niels palloc - avoiding changes to NodeAgg.c and int8.c in 8.0 changes those results to: 8.0 + palloc avoiding patch : 27 s (I am guessing the remaining 2 s could be shaved off if I backported 8.1's virtual tuples - however that looked like a lot of work) Cheers Mark ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Hardware/OS recommendations for large databases (
Forgive my ignorance, but what is MPP? Is that part of Bizgres? Is it possible to upgrade from Postgres 8.1 to Bizgres? Thanks, Brendan Duddridge | CTO | 403-277-5591 x24 | [EMAIL PROTECTED] ClickSpace Interactive Inc. Suite L100, 239 - 10th Ave. SE Calgary, AB T2G 0V9 http://www.clickspace.com On Nov 28, 2005, at 3:05 PM, Luke Lonergan wrote: Mark, On 11/28/05 1:45 PM, Mark Kirkwood [EMAIL PROTECTED] wrote: 8.0 : 32 s 8.1 : 25 s A 22% reduction. select count(1) on 12,900MB = 1617125 pages fully cached: MPP based on 8.0 : 6.06s MPP based on 8.1 : 4.45s A 26% reduction. I'll take it! I am looking to back-port Tom's pre-8.2 changes and test again, maybe tonight. - Luke ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings smime.p7s Description: S/MIME cryptographic signature
Re: [PERFORM] Hardware/OS recommendations for large databases (
Brendan Duddridge wrote: Thanks for your reply. So how is that different than something like Slony2 or pgcluster with multi-master replication? Is it similar technology? We're currently looking for a good clustering solution that will work on our Apple Xserves and Xserve RAIDs. I think you need to be more specific about what you're trying to do. 'clustering' encompasses so many things that it means almost nothing by itself. slony provides facilities for replicating data. Its primary purpose is to improve reliability. MPP distributes both data and queries. Its primary purpose is to improve performance for a subset of all query types. ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Hardware/OS recommendations for large databases (
On Mon, 28 Nov 2005, Brendan Duddridge wrote: Hi David, Thanks for your reply. So how is that different than something like Slony2 or pgcluster with multi-master replication? Is it similar technology? We're currently looking for a good clustering solution that will work on our Apple Xserves and Xserve RAIDs. MPP doesn't just split up the data, it splits up the processing as well, so if you have a 5 machine cluster, each machine holds 1/5 of your data (plus a backup for one of the other machines) and when you do a query MPP slices and dices the query to send a subset of the query to each machine, it then gets the responses from all the machines and combines them if you ahve to do a full table scan for example, wach machine would only have to go through 20% of the data a Slony of pgcluster setup has each machine with a full copy of all the data, only one machine can work on a given query at a time, and if you have to do a full table scan one machine needs to read 100% of the data. in many ways this is the holy grail of databases. almost all other areas of computing can now be scaled by throwing more machines at the problem in a cluster, with each machine just working on it's piece of the problem, but databases have had serious trouble doing the same and so have been ruled by the 'big monster machine'. Oracle has been selling Oracle Rac for a few years, and reports from people who have used it range drasticly (from it works great, to it's a total disaster), in part depending on the types of queries that have been made. Greenplum thinks that they have licked the problems for the more general case (and that commodity networks are now fast enough to match disk speeds in processing the data) if they are right then when they hit full release with the new version they should be cracking a lot of the price/performance records on the big database benchmarks (TPC and similar), and if their pricing is reasonable, they may be breaking them by an order of magnatude or more (it's not unusual for the top machines to spend more then $1,000,000 on just their disk arrays for those systems, MPP could conceivably put togeather a cluster of $5K machines that runs rings around them (and probably will for at least some of the subtests, the big question is if they can sweep the board and take the top spots outright) they have more details (and marketing stuff) on their site at http://www.greenplum.com/prod_deepgreen_cluster.html don't get me wrong, I am very impressed with their stuff, but (haveing ranted a little here on the list about them) I think MPP and it's performace is a bit off topic for the postgres performance list (at least until the postgres project itself starts implementing similar features :-) David Lang Thanks, Brendan Duddridge | CTO | 403-277-5591 x24 | [EMAIL PROTECTED] ClickSpace Interactive Inc. Suite L100, 239 - 10th Ave. SE Calgary, AB T2G 0V9 http://www.clickspace.com On Nov 27, 2005, at 8:09 PM, David Lang wrote: On Mon, 28 Nov 2005, Brendan Duddridge wrote: Forgive my ignorance, but what is MPP? Is that part of Bizgres? Is it possible to upgrade from Postgres 8.1 to Bizgres? MPP is the Greenplum propriatary extention to postgres that spreads the data over multiple machines, (raid, but with entire machines not just drives, complete with data replication within the cluster to survive a machine failing) for some types of queries they can definantly scale lineraly with the number of machines (other queries are far more difficult and the overhead of coordinating the machines shows more. this is one of the key things that the new version they recently announced the beta for is supposed to be drasticly improving) early in the year when I first looked at them their prices were exorbadent, but Luke says I'm wildly mistake on their current prices so call them for details it uses the same interfaces as postgres so it should be a drop in replacement to replace a single server with a cluster. it's facinating technology to read about. I seem to remember reading that one of the other postgres companies is also producing a clustered version of postgres, but I don't remember who and know nothing about them. David Lang ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Hardware/OS recommendations for large databases (
Have you factored in how long it takes to build an index on 5TB? And the index size? Really, it's a whole different world at multi-TB, everything has to scale. Btw we don't just scan in parallel, we do all in parallel, check the sort number on this thread. Mpp is for the god box too. And your price is wrong - but if you want free then you'll have to find another way to get your work done. - Luke - Luke -- Sent from my BlackBerry Wireless Device -Original Message- From: David Lang [EMAIL PROTECTED] To: Luke Lonergan [EMAIL PROTECTED] CC: pgsql-performance@postgresql.org pgsql-performance@postgresql.org Sent: Sat Nov 26 14:34:14 2005 Subject: Re: [PERFORM] Hardware/OS recommendations for large databases ( On Sun, 27 Nov 2005, Luke Lonergan wrote: For data warehousing its pretty well open and shut. To use all cpus and io channels on each query you will need mpp. Has anyone done the math.on the original post? 5TB takes how long to scan once? If you want to wait less than a couple of days just for a seq scan, you'd better be in the multi-gb per second range. if you truely need to scan the entire database then you are right, however indexes should be able to cut the amount you need to scan drasticly. David Lang ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Hardware/OS recommendations for large databases (
On Sun, 27 Nov 2005, Luke Lonergan wrote: Has anyone done the math.on the original post? 5TB takes how long to scan once? If you want to wait less than a couple of days just for a seq scan, you'd better be in the multi-gb per second range. Err, I get about 31 megabytes/second to do 5TB in 170,000 seconds. I think perhaps you were exaggerating a bit or adding additional overhead not obvious from the above. ;) --- At 1 gigabyte per second, 1 terrabyte should take about 1000 seconds (between 16 and 17 minutes). The impressive 3.2 gigabytes per second listed before (if it actually scans consistently at that rate), puts it at a little over 5 minutes I believe for 1, so about 26 for 5 terrabytes. The 200 megabyte per second number puts it about 7 hours for 5 terrabytes AFAICS. ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Hardware/OS recommendations for large databases
At 01:18 AM 11/27/2005, Luke Lonergan wrote: For data warehousing its pretty well open and shut. To use all cpus and io channels on each query you will need mpp. Has anyone done the math.on the original post? 5TB takes how long to scan once? If you want to wait less than a couple of days just for a seq scan, you'd better be in the multi-gb per second range. More than a bit of hyperbole there Luke. Some common RW scenarios: Dual 1GbE NICs = 200MBps = 5TB in 5x10^12/2x10^8= 25000secs= ~6hrs57mins. Network stuff like re-transmits of dropped packets can increase this, so network SLA's are critical. Dual 10GbE NICs = ~1.6GBps (10GbE NICs can't yet do over ~800MBps apiece) = 5x10^12/1.6x10^9= 3125secs= ~52mins. SLA's are even moire critical here. If you are pushing 5TB around on a regular basis, you are not wasting your time money on commodity = 300MBps RAID HW. You'll be using 800MBps and 1600MBps high end stuff, which means you'll need ~1-2hrs to sequentially scan 5TB on physical media. Clever use of RAM can get a 5TB sequential scan down to ~17mins. Yes, it's a lot of data. But sequential scan times should be in the mins or low single digit hours, not days. Particularly if you use RAM to maximum advantage. Ron ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Hardware/OS recommendations for large databases
Ron, On 11/27/05 9:10 AM, Ron [EMAIL PROTECTED] wrote: Clever use of RAM can get a 5TB sequential scan down to ~17mins. Yes, it's a lot of data. But sequential scan times should be in the mins or low single digit hours, not days. Particularly if you use RAM to maximum advantage. Unfortunately, RAM doesn't help with scanning from disk at all. WRT using network interfaces to help - it's interesting, but I think what you'd want to connect to is other machines with storage on them. - Luke ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Hardware/OS recommendations for large databases (
Stephan, On 11/27/05 7:48 AM, Stephan Szabo [EMAIL PROTECTED] wrote: On Sun, 27 Nov 2005, Luke Lonergan wrote: Has anyone done the math.on the original post? 5TB takes how long to scan once? If you want to wait less than a couple of days just for a seq scan, you'd better be in the multi-gb per second range. Err, I get about 31 megabytes/second to do 5TB in 170,000 seconds. I think perhaps you were exaggerating a bit or adding additional overhead not obvious from the above. ;) Thanks - the calculator on my blackberry was broken ;-) At 1 gigabyte per second, 1 terrabyte should take about 1000 seconds (between 16 and 17 minutes). The impressive 3.2 gigabytes per second listed before (if it actually scans consistently at that rate), puts it at a little over 5 minutes I believe for 1, so about 26 for 5 terrabytes. The 200 megabyte per second number puts it about 7 hours for 5 terrabytes AFAICS. 7 hours, days, same thing ;-) On the reality of sustained scan rates like that: We're getting 2.5GB/s sustained on a 2 year old machine with 16 hosts and 96 disks. We run them in RAID0, which is only OK because MPP has built-in host to host mirroring for fault management. We just purchased a 4-way cluster with 8 drives each using the 3Ware 9550SX. Our thought was to try the simplest approach first, which is a single RAID5, which gets us 7 drives worth of capacity and performance. As I posted earlier, we get about 400MB/s seq scan rate on the RAID, but the Postgres 8.0 current scan rate limit is 64% of 400MB/s or 256MB/s per host. The 8.1 mods (thanks Qingqing and Tom!) may increase that significantly toward the 400 max - we've already merged the 8.1 codebase into MPP so we'll also feature the same enhancements. Our next approach is to run these machines in a split RAID0 configuration, or RAID0 on 4 and 4 drives. We then run an MPP segment instance bound to each CPU and I/O channel. At that point, we'll have all 8 drives of performance and capacity per host and we should get 333MB/s with current MPP and perhaps over 400MB/s with MPP/8.1. That would get us up to the 3.2GB/s for 8 hosts. Even better, all operators are executed on all CPUs for each query, so sorting, hashing, agg, etc etc are run on all CPUs in the cluster. - Luke ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Hardware/OS recommendations for large databases
At 02:11 PM 11/27/2005, Luke Lonergan wrote: Ron, On 11/27/05 9:10 AM, Ron [EMAIL PROTECTED] wrote: Clever use of RAM can get a 5TB sequential scan down to ~17mins. Yes, it's a lot of data. But sequential scan times should be in the mins or low single digit hours, not days. Particularly if you use RAM to maximum advantage. Unfortunately, RAM doesn't help with scanning from disk at all. I agree with you if you are scanning a table cold, having never loaded it before, or if the system is not (or can't be) set up properly with appropriate buffers. However, outside of those 2 cases there are often tricks you can use with enough RAM (and no, you don't need RAM equal to the size of the item(s) being scanned) to substantially speed things up. Best case, you can get performance approximately equal to that of a RAM resident scan. WRT using network interfaces to help - it's interesting, but I think what you'd want to connect to is other machines with storage on them. Maybe. Or maybe you want to concentrate your storage in a farm that is connected by network or Fiber Channel to the rest of your HW. That's what a NAS or SAN is after all. The rest of your HW nowadays is often a cluster of RAM rich hosts. Assuming 64GB per host, 5TB can be split across ~79 hosts if you want to make it all RAM resident. Most don't have that kind of budget, but thankfully it is not usually necessary to make all of the data RAM resident in order to obtain if not all of the performance benefits you'd get if all of the data was. Ron ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Hardware/OS recommendations for large databases (
On Sun, 27 Nov 2005, Luke Lonergan wrote: Stephan, On 11/27/05 7:48 AM, Stephan Szabo [EMAIL PROTECTED] wrote: On Sun, 27 Nov 2005, Luke Lonergan wrote: Has anyone done the math.on the original post? 5TB takes how long to scan once? If you want to wait less than a couple of days just for a seq scan, you'd better be in the multi-gb per second range. Err, I get about 31 megabytes/second to do 5TB in 170,000 seconds. I think perhaps you were exaggerating a bit or adding additional overhead not obvious from the above. ;) Thanks - the calculator on my blackberry was broken ;-) Well, it was suspiciously close to a factor of 60 off, which when working in time could have just been a simple math error. At 1 gigabyte per second, 1 terrabyte should take about 1000 seconds (between 16 and 17 minutes). The impressive 3.2 gigabytes per second listed before (if it actually scans consistently at that rate), puts it at a little over 5 minutes I believe for 1, so about 26 for 5 terrabytes. The 200 megabyte per second number puts it about 7 hours for 5 terrabytes AFAICS. 7 hours, days, same thing ;-) On the reality of sustained scan rates like that: Well, the reason I asked was that IIRC the 3.2 used earlier in the discussion was exactly multiplying scanners and base rate (ie, no additional overhead). I couldn't tell if that was back of the envelope or if the overhead was in fact negligible. (Or I could be misremembering the conversation). I don't doubt that it's possible to get the rate, just wasn't sure if the rate was actually applicable to the ongoing discussion of the comparison. ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Hardware/OS recommendations for large databases (
Another thought - I priced out a maxed out machine with 16 cores and 128GB of RAM and 1.5TB of usable disk - $71,000. You could instead buy 8 machines that total 16 cores, 128GB RAM and 28TB of disk for $48,000, and it would be 16 times faster in scan rate, which is the most important factor for large databases. The size would be 16 rack units instead of 5, and you'd have to add a GigE switch for $1500. Scan rate for above SMP: 200MB/s Scan rate for above cluster: 3,200Mb/s You could even go dual core and double the memory on the cluster and you'd about match the price of the god box. - Luke Luke, I assume you are talking about useing the Greenplum MPP for this (otherwise I don't know how you are combining all the different systems). If you are, then you are overlooking one very significant factor, the cost of the MPP software, at $10/cpu the cluster has an extra $160K in software costs, which is double the hardware costs. if money is no object then go for it, but if it is then you comparison would be (ignoring software maintinance costs) the 16 core 128G ram system vs ~3xsmall systems totaling 6 cores and 48G ram. yes if scan speed is the bottleneck you still win with the small systems, but for most other uses the large system would win easily. and in any case it's not the open and shut case that you keep presenting it as. David Lang ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Hardware/OS recommendations for large databases (
For data warehousing its pretty well open and shut. To use all cpus and io channels on each query you will need mpp. Has anyone done the math.on the original post? 5TB takes how long to scan once? If you want to wait less than a couple of days just for a seq scan, you'd better be in the multi-gb per second range. - Luke -- Sent from my BlackBerry Wireless Device -Original Message- From: [EMAIL PROTECTED] [EMAIL PROTECTED] To: pgsql-performance@postgresql.org pgsql-performance@postgresql.org Sent: Sat Nov 26 13:51:18 2005 Subject: Re: [PERFORM] Hardware/OS recommendations for large databases ( Another thought - I priced out a maxed out machine with 16 cores and 128GB of RAM and 1.5TB of usable disk - $71,000. You could instead buy 8 machines that total 16 cores, 128GB RAM and 28TB of disk for $48,000, and it would be 16 times faster in scan rate, which is the most important factor for large databases. The size would be 16 rack units instead of 5, and you'd have to add a GigE switch for $1500. Scan rate for above SMP: 200MB/s Scan rate for above cluster: 3,200Mb/s You could even go dual core and double the memory on the cluster and you'd about match the price of the god box. - Luke Luke, I assume you are talking about useing the Greenplum MPP for this (otherwise I don't know how you are combining all the different systems). If you are, then you are overlooking one very significant factor, the cost of the MPP software, at $10/cpu the cluster has an extra $160K in software costs, which is double the hardware costs. if money is no object then go for it, but if it is then you comparison would be (ignoring software maintinance costs) the 16 core 128G ram system vs ~3xsmall systems totaling 6 cores and 48G ram. yes if scan speed is the bottleneck you still win with the small systems, but for most other uses the large system would win easily. and in any case it's not the open and shut case that you keep presenting it as. David Lang ---(end of broadcast)--- TIP 6: explain analyze is your friend ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Hardware/OS recommendations for large databases (
Mark, See the results below and analysis - the pure HeapScan gets 94.1% of the max available read bandwidth (cool!). Nothing wrong with heapscan in the presence of large readahead, which is good news. That says it's something else in the path. As you probably know there is a page lock taken, a copy of the tuple from the page, lock removed, count incremented for every iteration of the agg node on a count(*). Is the same true of a count(1)? I recall that the profile is full of memcpy and memory context calls. It would be nice to put some tracers into the executor and see where the time is going. I'm also curious about the impact of the new 8.1 virtual tuples in reducing the executor overhead. In this case my bet's on the agg node itself, what do you think? - Luke On 11/21/05 9:10 PM, Mark Kirkwood [EMAIL PROTECTED] wrote: Luke Lonergan wrote: So that leaves the question - why not more than 64% of the I/O scan rate? And why is it a flat 64% as the I/O subsystem increases in speed from 333-400MB/s? It might be interesting to see what effect reducing the cpu consumption entailed by the count aggregation has - by (say) writing a little bit of code to heap scan the desired relation (sample attached). OK - here are results for a slightly smaller (still bigger than RAM) lineitem on the same machine, using the same xfs filesystem that achieved 407MB/s: 12.9GB of DBT-3 data from the lineitem table llonergan=# select relpages from pg_class where relname='lineitem'; relpages -- 1579270 (1 row) 1579270*8192/100 12937 Million Bytes or 12.9GB llonergan=# \timing Timing is on. llonergan=# select count(1) from lineitem; count -- 59986052 (1 row) Time: 197870.105 ms llonergan=# select count(1) from lineitem; count -- 59986052 (1 row) Time: 49912.164 ms llonergan=# select count(1) from lineitem; count -- 59986052 (1 row) Time: 49218.739 ms llonergan=# select fastcount('lineitem'); fastcount --- 59986052 (1 row) Time: 33752.778 ms llonergan=# select fastcount('lineitem'); fastcount --- 59986052 (1 row) Time: 34543.646 ms llonergan=# select fastcount('lineitem'); fastcount --- 59986052 (1 row) Time: 34528.053 ms Analysis: Bandwidth Percent of max dd Read 407MB/s 100% Count(1)263MB/s 64.6% HeapScan383MB/s 94.1% Wow - looks like the HeapScan gets almost all of the available bandwidth! - Luke ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke Lonergan wrote: 12.9GB of DBT-3 data from the lineitem table llonergan=# select relpages from pg_class where relname='lineitem'; relpages -- 1579270 (1 row) 1579270*8192/100 12937 Million Bytes or 12.9GB llonergan=# \timing Timing is on. llonergan=# select count(1) from lineitem; count -- 59986052 (1 row) Time: 197870.105 ms So 198 seconds is the uncached read time with count (Just for clarity, did you clear the Pg and filesystem caches or unmount / remount the filesystem?) llonergan=# select count(1) from lineitem; count -- 59986052 (1 row) Time: 49912.164 ms llonergan=# select count(1) from lineitem; count -- 59986052 (1 row) Time: 49218.739 ms and ~50 seconds is the (partially) cached read time with count llonergan=# select fastcount('lineitem'); fastcount --- 59986052 (1 row) Time: 33752.778 ms llonergan=# select fastcount('lineitem'); fastcount --- 59986052 (1 row) Time: 34543.646 ms llonergan=# select fastcount('lineitem'); fastcount --- 59986052 (1 row) Time: 34528.053 ms so ~34 seconds is the (partially) cached read time for fastcount - I calculate this to give ~362Mb/s effective IO rate (I'm doing / by 1024*1024 not 1000*1000) FWIW. While this is interesting, you probably want to stop Pg, unmount the filesystem, and restart Pg to get the uncached time for fastcount too (and how does this compare to uncached read with dd using the same block size?). But at this stage it certainly looks the the heapscan code is pretty efficient - great! Oh - and do you want to try out 32K block size, I'm interested to see what level of improvement you get (as my system is hopelessly cpu bound...)! Analysis: Bandwidth Percent of max dd Read 407MB/s 100% Count(1)263MB/s 64.6% HeapScan383MB/s 94.1% Cheers Mark ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke Lonergan wrote: Mark, It would be nice to put some tracers into the executor and see where the time is going. I'm also curious about the impact of the new 8.1 virtual tuples in reducing the executor overhead. In this case my bet's on the agg node itself, what do you think? Yeah - it's pretty clear that the count aggregate is fairly expensive wrt cpu - However, I am not sure if all agg nodes suffer this way (guess we could try a trivial aggregate that does nothing for all tuples bar the last and just reports the final value it sees). Cheers Mark ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] Hardware/OS recommendations for large databases (
Mark, Time: 197870.105 ms So 198 seconds is the uncached read time with count (Just for clarity, did you clear the Pg and filesystem caches or unmount / remount the filesystem?) Nope - the longer time is due to the second write known issue with Postgres - it writes the data to the table, but all of the pages are marked dirty? So, always on the first scan after loading they are written again. This is clear as you watch vmstat - the pattern on the first seq scan is half read / half write. Time: 49218.739 ms and ~50 seconds is the (partially) cached read time with count Again - the pattern here is pure read and completely non-cached. You see a very nearly constant I/O rate when watching vmstat for the entire scan. Time: 34528.053 ms so ~34 seconds is the (partially) cached read time for fastcount - I calculate this to give ~362Mb/s effective IO rate (I'm doing / by 1024*1024 not 1000*1000) FWIW. The dd number uses 1000*1000, so I maintained it for the percentage of max. While this is interesting, you probably want to stop Pg, unmount the filesystem, and restart Pg to get the uncached time for fastcount too (and how does this compare to uncached read with dd using the same block size?). I'll do it again sometime, but I've already deleted the file. I've done the following in the past to validate this though: - Reboot machine - Rerun scan And we get identical results. But at this stage it certainly looks the the heapscan code is pretty efficient - great! Yep. Oh - and do you want to try out 32K block size, I'm interested to see what level of improvement you get (as my system is hopelessly cpu bound...)! Yah - done so in the past and not seen any - was waiting for Alan to post his results. Analysis: Bandwidth Percent of max dd Read 407MB/s 100% Count(1)263MB/s 64.6% HeapScan383MB/s 94.1% Note these are all in consistent 1000x1000 units. Thanks for the test - neat trick! We'll use it to do some more profiling some time soon... - Luke ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke Lonergan wrote: That says it's something else in the path. As you probably know there is a page lock taken, a copy of the tuple from the page, lock removed, count incremented for every iteration of the agg node on a count(*). Is the same true of a count(1)? Sorry Luke - message 3 - I seem to be suffering from a very small working memory buffer myself right now, I think it's after a day of working with DB2 ... :-) Anyway, as I read src/backend/parser/gram.y:6542 - count(*) is transformed into count(1), so these two are identical. Cheers (last time tonight, promise!) Mark ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke Lonergan wrote: Mark, Time: 197870.105 ms So 198 seconds is the uncached read time with count (Just for clarity, did you clear the Pg and filesystem caches or unmount / remount the filesystem?) Nope - the longer time is due to the second write known issue with Postgres - it writes the data to the table, but all of the pages are marked dirty? So, always on the first scan after loading they are written again. This is clear as you watch vmstat - the pattern on the first seq scan is half read / half write. Ah - indeed - first access after a COPY no? I should have thought of that, sorry! ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Hardware/OS recommendations for large databases (
Mark Kirkwood [EMAIL PROTECTED] writes: Yeah - it's pretty clear that the count aggregate is fairly expensive wrt cpu - However, I am not sure if all agg nodes suffer this way (guess we could try a trivial aggregate that does nothing for all tuples bar the last and just reports the final value it sees). As you mention count(*) and count(1) are the same thing. Last I heard the reason count(*) was so expensive was because its state variable was a bigint. That means it doesn't fit in a Datum and has to be alloced and stored as a pointer. And because of the Aggregate API that means it has to be allocated and freed for every tuple processed. There was some talk of having a special case API for count(*) and maybe sum(...) to avoid having to do this. There was also some talk of making Datum 8 bytes wide on platforms where that was natural (I guess AMD64, Sparc64, Alpha, Itanic). Afaik none of these items have happened but I don't know for sure. -- greg ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Hardware/OS recommendations for large databases (
Greg Stark [EMAIL PROTECTED] writes: Last I heard the reason count(*) was so expensive was because its state variable was a bigint. That means it doesn't fit in a Datum and has to be alloced and stored as a pointer. And because of the Aggregate API that means it has to be allocated and freed for every tuple processed. There's a hack in 8.1 to avoid the palloc overhead (courtesy of Neil Conway IIRC). regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Hardware/OS recommendations for large databases (
Tom Lane [EMAIL PROTECTED] writes: Greg Stark [EMAIL PROTECTED] writes: Last I heard the reason count(*) was so expensive was because its state variable was a bigint. That means it doesn't fit in a Datum and has to be alloced and stored as a pointer. And because of the Aggregate API that means it has to be allocated and freed for every tuple processed. There's a hack in 8.1 to avoid the palloc overhead (courtesy of Neil Conway IIRC). ah, cool, missed that. -- greg ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Hardware/OS recommendations for large databases (
The same 12.9GB distributed across 4 machines using Bizgres MPP fits into I/O cache. The interesting result is that the query select count(1) is limited in speed to 280 MB/s per CPU when run on the lineitem table. So when I run it spread over 4 machines, one CPU per machine I get this: == Bizgres MPP, 4 data segments, 1 per 2 CPUs == llonergan=# explain select count(1) from lineitem; QUERY PLAN -- Aggregate (cost=582452.00..582452.00 rows=1 width=0) - Gather Motion (cost=582452.00..582452.00 rows=1 width=0) - Aggregate (cost=582452.00..582452.00 rows=1 width=0) - Seq Scan on lineitem (cost=0.00..544945.00 rows=15002800 width=0) (4 rows) llonergan=# \timing Timing is on. llonergan=# select count(1) from lineitem; count -- 59986052 (1 row) Time: 12191.435 ms llonergan=# select count(1) from lineitem; count -- 59986052 (1 row) Time: 11986.109 ms llonergan=# select count(1) from lineitem; count -- 59986052 (1 row) Time: 11448.941 ms == That's 12,937 MB in 11.45 seconds, or 1,130 MB/s. When you divide out the number of Postgres instances (4), that's 283MB/s per Postgres instance. To verify that this has nothing to do with MPP, I ran it in a special internal mode on one instance and got the same result. So - we should be able to double this rate by running one segment per CPU, or two per host: == Bizgres MPP, 8 data segments, 1 per CPU == llonergan=# select count(1) from lineitem; count -- 59986052 (1 row) Time: 6484.594 ms llonergan=# select count(1) from lineitem; count -- 59986052 (1 row) Time: 6156.729 ms llonergan=# select count(1) from lineitem; count -- 59986052 (1 row) Time: 6063.416 ms == That's 12,937 MB in 11.45 seconds, or 2,134 MB/s. When you divide out the number of Postgres instances (8), that's 267MB/s per Postgres instance. So, if you want to select count(1), using more CPUs is a good idea! For most complex queries, having lots of CPUs + MPP is a good combo. Here is an example of a sorting plan - this should probably be done with a hash aggregation, but using 8 CPUs makes it go 8x faster: - Luke ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Hardware/OS recommendations for large databases (
Tom Lane wrote: Greg Stark [EMAIL PROTECTED] writes: Last I heard the reason count(*) was so expensive was because its state variable was a bigint. That means it doesn't fit in a Datum and has to be alloced and stored as a pointer. And because of the Aggregate API that means it has to be allocated and freed for every tuple processed. There's a hack in 8.1 to avoid the palloc overhead (courtesy of Neil Conway IIRC). It certainly makes quite a difference as I measure it: doing select(1) from a 181000 page table (completely uncached) on my PIII: 8.0 : 32 s 8.1 : 25 s Note that the 'fastcount()' function takes 21 s in both cases - so all the improvement seems to be from the count overhead reduction. Cheers Mark ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] Hardware/OS recommendations for large databases (
Bruce, On 11/22/05 4:13 PM, Bruce Momjian pgman@candle.pha.pa.us wrote: Perfect summary. We have a background writer now. Ideally we would have a background reader, that reads-ahead blocks into the buffer cache. The problem is that while there is a relatively long time between a buffer being dirtied and the time it must be on disk (checkpoint time), the read-ahead time is much shorter, requiring some kind of quick create a thread approach that could easily bog us down as outlined above. Yes, the question is how much read-ahead buffer is needed to equate to the 38% of I/O wait time in the current executor profile? The idea of asynchronous buffering would seem appropriate if the executor would use the 38% of time as useful work. A background reader is an interesting approach - it would require admin management of buffers where AIO would leave that in the kernel. The advantage over AIO would be more universal platform support I suppose? Right now the file system will do read-ahead for a heap scan (but not an index scan), but even then, there is time required to get that kernel block into the PostgreSQL shared buffers, backing up Luke's observation of heavy memcpy() usage. As evidenced by the 16MB readahead setting still resulting in only 36% IO wait. So what are our options? mmap()? I have no idea. Seems larger page size does help. Not sure about that, we used to run with 32KB page size and I didn't see a benefit on seq scan at all. I haven't seen tests in this thread that compare 8K to 32K. - Luke ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] Hardware/OS recommendations for large databases (
Title: Re: [PERFORM] Hardware/OS recommendations for large databases ( Alan, Why not contribute something - put up proof of your stated 8KB versus 32KB page size improvement. - Luke
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke Lonergan wrote: Why not contribute something - put up proof of your stated 8KB versus 32KB page size improvement. I did observe that 32KB block sizes were a significant win for our usage patterns. It might be a win for any of the following reasons: 0) The preliminaries: ~300GB database with about ~50GB daily turnover. Our data is fairly reasonably grouped. If we're getting one item on a page we're usually looking at the other items as well. 1) we can live with a smaller FSM size. We were often leaking pages with a 10M page FSM setting. With 32K pages, a 10M FSM size is sufficient. Yes, the solution to this is run vacuum more often, but when the vacuum was taking 10 hours at a time, that was hard to do. 2) The typical datum size in our largest table is about 2.8KB, which is more than 1/4 page size thus resulting in the use of a toast table. Switching to 32KB pages allows us to get a decent storage of this data into the main tables, thus avoiding another table and associated large index. Not having the extra index in memory for a table with 90M rows is probably beneficial. 3) vacuum time has been substantially reduced. Vacuum analyze now run in the 2 to 3 hour range depending on load. 4) less cpu time spent in the kernel. We're basically doing 1/4 as many system calls. Overall the system has now been working well. We used to see the database being a bottleneck at times, but now it's keeping up nicely. Hope this helps. Happy Thanksgiving! -- Alan ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Hardware/OS recommendations for large databases (
Alan, On 11/23/05 2:00 PM, Alan Stange [EMAIL PROTECTED] wrote: Luke Lonergan wrote: Why not contribute something - put up proof of your stated 8KB versus 32KB page size improvement. I did observe that 32KB block sizes were a significant win for our usage patterns. It might be a win for any of the following reasons: (* big snip *) Though all of what you relate is interesting, it seems irrelevant to your earlier statement here: Alan Stange [EMAIL PROTECTED] writes: If your goal is sequential IO, then one must use larger block sizes. No one would use 8KB IO for achieving high sequential IO rates. Simply put, read() is about the slowest way to get 8KB of data. Switching to 32KB blocks reduces all the system call overhead by a large margin. Larger blocks would be better still, up to the stripe size of your mirror. (Of course, you're using a mirror and not raid5 if you care about performance.) And I am interested in seeing if your statement is correct. Do you have any proof of this to share? - Luke ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] Hardware/OS recommendations for large databases (
Title: Re: [PERFORM] Hardware/OS recommendations for large databases ( Mark, This is an excellent idea unfortunately Im in Maui right now (Mahalo!) and Im not getting to testing with this. My first try was with 8.0.3 and its an 8.1 function I presume. Not to be lazy but any hint as to how to do the same thing for 8.0? - Luke On 11/21/05 9:10 PM, Mark Kirkwood [EMAIL PROTECTED] wrote: Luke Lonergan wrote: So that leaves the question - why not more than 64% of the I/O scan rate? And why is it a flat 64% as the I/O subsystem increases in speed from 333-400MB/s? It might be interesting to see what effect reducing the cpu consumption entailed by the count aggregation has - by (say) writing a little bit of code to heap scan the desired relation (sample attached). Cheers Mark /* * fastcount.c * * Do a count that uses considerably less CPU time than an aggregate. */ #include postgres.h #include funcapi.h #include access/heapam.h #include catalog/namespace.h #include utils/builtins.h extern Datum fastcount(PG_FUNCTION_ARGS); PG_FUNCTION_INFO_V1(fastcount); Datum fastcount(PG_FUNCTION_ARGS) { text *relname = PG_GETARG_TEXT_P(0); RangeVar *relrv; Relationrel; HeapScanDesc scan; HeapTupletuple; int64result = 0; /* Use the name to get a suitable range variable and open the relation. */ relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname)); rel = heap_openrv(relrv, AccessShareLock); /* Start a heap scan on the relation. */ scan = heap_beginscan(rel, SnapshotNow, 0, NULL); while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) { result++; } /* End the scan and close up the relation. */ heap_endscan(scan); heap_close(rel, AccessShareLock); PG_RETURN_INT64(result); }
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke Lonergan wrote: Mark, This is an excellent idea – unfortunately I’m in Maui right now (Mahalo!) and I’m not getting to testing with this. My first try was with 8.0.3 and it’s an 8.1 function I presume. Not to be lazy – but any hint as to how to do the same thing for 8.0? Yeah, it's 8.1 - I didn't think to check against 8.0. The attached variant works with 8.0.4 (textToQualifiedNameList needs 2 args) cheers Mark P.s. Maui eh, sounds real nice. /* * fastcount.c * * Do a count that uses considerably less CPU time than an aggregate. * * (Variant for 8.0.x - textToQualifiedNameList needs 2 args) */ #include postgres.h #include funcapi.h #include access/heapam.h #include catalog/namespace.h #include utils/builtins.h extern Datum fastcount(PG_FUNCTION_ARGS); PG_FUNCTION_INFO_V1(fastcount); Datum fastcount(PG_FUNCTION_ARGS) { text *relname = PG_GETARG_TEXT_P(0); RangeVar *relrv; Relationrel; HeapScanDesc scan; HeapTuple tuple; int64 result = 0; /* Use the name to get a suitable range variable and open the relation. */ relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname, )); rel = heap_openrv(relrv, AccessShareLock); /* Start a heap scan on the relation. */ scan = heap_beginscan(rel, SnapshotNow, 0, NULL); while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) { result++; } /* End the scan and close up the relation. */ heap_endscan(scan); heap_close(rel, AccessShareLock); PG_RETURN_INT64(result); } ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke, - XFS will probably generate better data rates with larger files. You really need to use the same file size as does postgresql. Why compare the speed to reading a 16G file and the speed to reading a 1G file. They won't be the same. If need be, write some code that does the test or modify lmdd to read a sequence of 1G files. Will this make a difference? You don't know until you do it. Any time you cross a couple of 2^ powers in computing, you should expect some differences. - you did umount the file system before reading the 16G file back in? Because if you didn't then your read numbers are possibly garbage. When the read began, 8G of the file was in memory. You'd be very naive to think that somehow the read of the first 8GB somehow flushed that cached data out of memory. After all, why would the kernel flush pages from file X when you're in the middle of a sequential read of...file X? I'm not sure how Linux handles this, but Solaris would've found the 8G still in memory. - What was the hardware and disk configuration on which these numbers were generated? For example, if you have a U320 controller, how did the read rate become larger than 320MB/s? - how did the results change from before? Just posting the new results is misleading given all the boasting we've had to read about your past results. - there are two results below for writing to ext2: one at 209 MB/s and one at 113MB/s. Why are they different? - what was the cpu usage during these tests? We see postgresql doing 200+MB/s of IO. You've claimed many times that the machine would be compute bound at lower IO rates, so how much idle time does the cpu still have? - You wrote: We'll do a 16GB table size to ensure that we aren't reading from the read cache. Do you really believe that?? You have to umount the file system before each test to ensure you're really measuring the disk IO rate. If I'm reading your results correctly, it looks like you have three results for ext and xfs, each of which is faster than the prior one. If I'm reading this correctly, then it looks like one is clearly reading from the read cache. - Gee, it's so nice of you to drop your 120MB/s observation. I guess my reading at 300MB/s wasn't convincing enough. Yeah, I think it was the cpus too... - I wouldn't focus on the flat 64% of the data rate number. It'll probably be different on other systems. I'm all for testing and testing. It seems you still cut a corner without umounting the file system first. Maybe I'm a little too old school on this, but I wouldn't spend a dime until you've done the measurements correctly. Good Luck. -- Alan Luke Lonergan wrote: Alan, Looks like Postgres gets sensible scan rate scaling as the filesystem speed increases, as shown below. I'll drop my 120MB/s observation - perhaps CPUs got faster since I last tested this. The scaling looks like 64% of the I/O subsystem speed is available to the executor - so as the I/O subsystem increases in scan rate, so does Postgres' executor scan speed. So that leaves the question - why not more than 64% of the I/O scan rate? And why is it a flat 64% as the I/O subsystem increases in speed from 333-400MB/s? - Luke = Results === Unless noted otherwise all results posted are for block device readahead set to 16M using blockdev --setra=16384 block_device. All are using the 2.6.9-11 Centos 4.1 kernel. For those who don't have lmdd, here is a comparison of two results on an ext2 filesystem: [EMAIL PROTECTED] dbfast1]# time bash -c (dd if=/dev/zero of=/dbfast1/bigfile bs=8k count=80 sync) 80+0 records in 80+0 records out real0m33.057s user0m0.116s sys 0m13.577s [EMAIL PROTECTED] dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k count=80 sync=1 6553.6000 MB in 31.2957 secs, 209.4092 MB/sec real0m33.032s user0m0.087s sys 0m13.129s So lmdd with sync=1 is equivalent to a sync after a dd. I use 2x memory with dd for the *READ* performance testing, but let's make sure things are synced on both write and read for this set of comparisons. First, let's test ext2 versus ext3, data=ordered, versus xfs: 16GB write, then read --- ext2: --- [EMAIL PROTECTED] dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k count=200 sync=1 16384. MB in 144.2670 secs, 113.5672 MB/sec [EMAIL PROTECTED] dbfast1]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k count=200 sync=1 16384. MB in 49.3766 secs, 331.8170 MB/sec --- ext3, data=ordered:
Re: [PERFORM] Hardware/OS recommendations for large databases (
Greg Stark wrote: Alan Stange [EMAIL PROTECTED] writes: The point your making doesn't match my experience with *any* storage or program I've ever used, including postgresql. Your point suggests that the storage system is idle and that postgresql is broken because it isn't able to use the resources available...even when the cpu is very idle. How can that make sense? Well I think what he's saying is that Postgres is issuing a read, then waiting for the data to return. Then it does some processing, and goes back to issue another read. The CPU is idle half the time because Postgres isn't capable of doing any work while waiting for i/o, and the i/o system is idle half the time while the CPU intensive part happens. (Consider as a pathological example a program that reads 8k then sleeps for 10ms, and loops doing that 1,000 times. Now consider the same program optimized to read 8M asynchronously and sleep for 10s. By the time it's finished sleeping it has probably read in all 8M. Whereas the program that read 8k in little chunks interleaved with small sleeps would probably take twice as long and appear to be entirely i/o-bound with 50% iowait and 50% idle.) It's a reasonable theory and it's not inconsistent with the results you sent. But it's not exactly proven either. Nor is it clear how to improve matters. Adding additional threads to handle the i/o adds an enormous amount of complexity and creates lots of opportunity for other contention that could easily eat all of the gains. Perfect summary. We have a background writer now. Ideally we would have a background reader, that reads-ahead blocks into the buffer cache. The problem is that while there is a relatively long time between a buffer being dirtied and the time it must be on disk (checkpoint time), the read-ahead time is much shorter, requiring some kind of quick create a thread approach that could easily bog us down as outlined above. Right now the file system will do read-ahead for a heap scan (but not an index scan), but even then, there is time required to get that kernel block into the PostgreSQL shared buffers, backing up Luke's observation of heavy memcpy() usage. So what are our options? mmap()? I have no idea. Seems larger page size does help. -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] Hardware/OS recommendations for large databases (
Bruce Momjian wrote: Greg Stark wrote: Alan Stange [EMAIL PROTECTED] writes: The point your making doesn't match my experience with *any* storage or program I've ever used, including postgresql. Your point suggests that the storage system is idle and that postgresql is broken because it isn't able to use the resources available...even when the cpu is very idle. How can that make sense? Well I think what he's saying is that Postgres is issuing a read, then waiting for the data to return. Then it does some processing, and goes back to issue another read. The CPU is idle half the time because Postgres isn't capable of doing any work while waiting for i/o, and the i/o system is idle half the time while the CPU intensive part happens. (Consider as a pathological example a program that reads 8k then sleeps for 10ms, and loops doing that 1,000 times. Now consider the same program optimized to read 8M asynchronously and sleep for 10s. By the time it's finished sleeping it has probably read in all 8M. Whereas the program that read 8k in little chunks interleaved with small sleeps would probably take twice as long and appear to be entirely i/o-bound with 50% iowait and 50% idle.) It's a reasonable theory and it's not inconsistent with the results you sent. But it's not exactly proven either. Nor is it clear how to improve matters. Adding additional threads to handle the i/o adds an enormous amount of complexity and creates lots of opportunity for other contention that could easily eat all of the gains. Perfect summary. We have a background writer now. Ideally we would have a background reader, that reads-ahead blocks into the buffer cache. The problem is that while there is a relatively long time between a buffer being dirtied and the time it must be on disk (checkpoint time), the read-ahead time is much shorter, requiring some kind of quick create a thread approach that could easily bog us down as outlined above. Right now the file system will do read-ahead for a heap scan (but not an index scan), but even then, there is time required to get that kernel block into the PostgreSQL shared buffers, backing up Luke's observation of heavy memcpy() usage. So what are our options? mmap()? I have no idea. Seems larger page size does help. For sequential scans, you do have a background reader. It's the kernel. As long as you don't issue a seek() between read() calls, the kernel will get the hint about sequential IO and begin to perform a read ahead for you. This is where the above analysis isn't quite right: while postgresql is processing the returned data from the read() call, the kernel has also issued reads as part of the read ahead, keeping the device busy while the cpu is busy. (I'm assuming these details for Linux; Solaris/UFS does work this way). Issue one seek on the file and the read ahead algorithm will back off for a while. This was my point about some descriptions of how the system works not being sensible. If your goal is sequential IO, then one must use larger block sizes. No one would use 8KB IO for achieving high sequential IO rates. Simply put, read() is about the slowest way to get 8KB of data. Switching to 32KB blocks reduces all the system call overhead by a large margin. Larger blocks would be better still, up to the stripe size of your mirror. (Of course, you're using a mirror and not raid5 if you care about performance.) I don't think the memcpy of data from the kernel to userspace is that big of an issue right now. dd and all the high end network interfaces manage OK doing it, so I'd expect postgresql to do all right with it now yet too. Direct IO will avoid that memcpy, but then you also don't get any caching of the files in memory. I'd be more concerned about any memcpy calls or general data management within postgresql.Does postgresql use the platform specific memcpy() in libc? Some care might be needed to ensure that the memory blocks within postgresql are all properly aligned to make sure that one isn't ping-ponging cache lines around (usually done by padding the buffer sizes by an extra 32 bytes or L1 line size). Whatever you do, all the usual high performance computing tricks should be used prior to considering any rewriting of major code sections. Personally, I'd like to see some detailed profiling being done using hardware counters for cpu cycles and cache misses, etc. Given the poor quality of work that has been discussed here in this thread, I don't have much confidence in any other additional results at this time. None of the analysis would be acceptable in any environment in which I've worked. Be sure to take a look at Sun's free Workshop tools as they are excellent for this sort of profiling and one doesn't need to recompile to use them.If I get a little time in the next week or two I might take a crack at this. Cheers, -- Alan ---(end of
Re: [PERFORM] Hardware/OS recommendations for large databases (
Alan Stange [EMAIL PROTECTED] writes: For sequential scans, you do have a background reader. It's the kernel. As long as you don't issue a seek() between read() calls, the kernel will get the hint about sequential IO and begin to perform a read ahead for you. This is where the above analysis isn't quite right: while postgresql is processing the returned data from the read() call, the kernel has also issued reads as part of the read ahead, keeping the device busy while the cpu is busy. (I'm assuming these details for Linux; Solaris/UFS does work this way). Issue one seek on the file and the read ahead algorithm will back off for a while. This was my point about some descriptions of how the system works not being sensible. Well that's certainly the hope. But we don't know that this is actually as effective as you assume it is. It's awfully hard in the kernel to make much more than a vague educated guess about what kind of readahead would actually help. This is especially true when a file isn't really being accessed in a sequential fashion as Postgres may well do if, for example, multiple backends are reading the same file. And as you pointed out it doesn't help at all for random access index scans. If your goal is sequential IO, then one must use larger block sizes. No one would use 8KB IO for achieving high sequential IO rates. Simply put, read() is about the slowest way to get 8KB of data. Switching to 32KB blocks reduces all the system call overhead by a large margin. Larger blocks would be better still, up to the stripe size of your mirror. (Of course, you're using a mirror and not raid5 if you care about performance.) Switching to 32kB blocks throughout Postgres has pros but also major cons, not the least is *extra* i/o for random access read patterns. One of the possible advantages of the suggestions that were made, the ones you're shouting down, would actually be the ability to use 32kB scatter/gather reads without necessarily switching block sizes. (Incidentally, your parenthetical comment is a bit confused. By mirror I imagine you're referring to raid1+0 since mirrors alone, aka raid1, aren't a popular way to improve performance. But raid5 actually performs better than raid1+0 for sequential reads.) Does postgresql use the platform specific memcpy() in libc? Some care might be needed to ensure that the memory blocks within postgresql are all properly aligned to make sure that one isn't ping-ponging cache lines around (usually done by padding the buffer sizes by an extra 32 bytes or L1 line size). Whatever you do, all the usual high performance computing tricks should be used prior to considering any rewriting of major code sections. So your philosophy is to worry about microoptimizations before worrying about architectural issues? -- greg ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Hardware/OS recommendations for large databases (
Alan Stange wrote: Bruce Momjian wrote: Right now the file system will do read-ahead for a heap scan (but not an index scan), but even then, there is time required to get that kernel block into the PostgreSQL shared buffers, backing up Luke's observation of heavy memcpy() usage. So what are our options? mmap()? I have no idea. Seems larger page size does help. For sequential scans, you do have a background reader. It's the kernel. As long as you don't issue a seek() between read() calls, the I guess you missed my text of Right now the file system will do read-ahead, meaning the kernel. I don't think the memcpy of data from the kernel to userspace is that big of an issue right now. dd and all the high end network interfaces manage OK doing it, so I'd expect postgresql to do all right with it now yet too. Direct IO will avoid that memcpy, but then you also don't get any caching of the files in memory. I'd be more concerned about any memcpy calls or general data management within postgresql.Does postgresql use the platform specific memcpy() in libc? Some care might be needed to ensure that the memory blocks within postgresql are all properly aligned to make sure that one isn't ping-ponging cache lines around (usually done by padding the buffer sizes by an extra 32 bytes or L1 line size). Whatever you do, all the usual high performance computing tricks should be used prior to considering any rewriting of major code sections. We have dealt with alignment and MemCpy is what we used for small-sized copies to reduce function call overhead. If you want to improve it, feel free to take a look. -- Bruce Momjian| http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup.| Newtown Square, Pennsylvania 19073 ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Hardware/OS recommendations for large databases (
Alan, On 11/19/05 8:43 PM, Alan Stange [EMAIL PROTECTED] wrote: Device:tpskB_read/skB_wrtn/skB_readkB_wrtn sdd 343.73175035.73 277.555251072 8326 while doing a select count(1) on the same large table as before. Subsequent iostat output all showed that this data rate was being maintained. The system is otherwise mostly idle during this measurement. Yes - interesting. Note the other result using XFS that I posted earlier where I got 240+MB/s. XFS has more aggressive readahead, which is why I used it. Can you explain again why you think there's an IO ceiling of 120MB/s because I really don't understand? OK - slower this time: We've seen between 110MB/s and 120MB/s on a wide variety of fast CPU machines with fast I/O subsystems that can sustain 250MB/s+ using dd, but which all are capped at 120MB/s when doing sequential scans with different versions of Postgres. Understand my point: It doesn't matter that there is idle or iowait on the CPU, the postgres executor is not able to drive the I/O rate for two reasons: there is a lot of CPU used for the scan (the 40% you reported) and a lack of asynchrony (the iowait time). That means that by speeding up the CPU you only reduce the first part, but you don't fix the second and v.v. With more aggressive readahead, the second problem (the I/O asynchrony) is handled better by the Linux kernel and filesystem. That's what we're seeing with XFS. - Luke ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke Lonergan [EMAIL PROTECTED] writes: OK - slower this time: We've seen between 110MB/s and 120MB/s on a wide variety of fast CPU machines with fast I/O subsystems that can sustain 250MB/s+ using dd, but which all are capped at 120MB/s when doing sequential scans with different versions of Postgres. Luke, sometime it would be nice if you would post your raw evidence and let other people do their own analysis. I for one have gotten tired of reading sweeping generalizations unbacked by any data. I find the notion of a magic 120MB/s barrier, independent of either CPU or disk speed, to be pretty dubious to say the least. I would like to know exactly what the wide variety of data points you haven't shown us are. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke Lonergan wrote: OK - slower this time: We've seen between 110MB/s and 120MB/s on a wide variety of fast CPU machines with fast I/O subsystems that can sustain 250MB/s+ using dd, but which all are capped at 120MB/s when doing sequential scans with different versions of Postgres. Postgresql issues the exact same sequence of read() calls as does dd. So why is dd so much faster? I'd be careful with the dd read of a 16GB file on an 8GB system. Make sure you umount the file system first, to make sure all of the file is flushed from memory. Some systems use a freebehind on sequential reads to avoid flushing memory...and you'd find that 1/2 of your 16GB file is still in memory. The same point also holds for the writes: when dd finishes not all the data is on disk. You need to issue a sync() call to make that happen. Use lmdd to ensure that the data is actually all written. In other words, I think your dd results are possibly misleading. It's trivial to demonstrate: $ time dd if=/dev/zero of=/fidb1/bigfile bs=8k count=80 80+0 records in 80+0 records out real0m13.780s user0m0.134s sys 0m13.510s Oops. I just wrote 470MB/s to a file system that has peak write speed of 200MB/s peak. Now, you might say that you wrote a 16GB file on an 8 GB machine so this isn't an issue. It does make your dd numbers look fast as some of the data will be unwritten. I'd also suggest running dd on the same files as postgresql. I suspect you'd find that the layout of the postgresql files isn't that good as they are grown bit by bit, unlike the file created by simply dd'ing a large file. Understand my point: It doesn't matter that there is idle or iowait on the CPU, the postgres executor is not able to drive the I/O rate for two reasons: there is a lot of CPU used for the scan (the 40% you reported) and a lack of asynchrony (the iowait time). That means that by speeding up the CPU you only reduce the first part, but you don't fix the second and v.v. With more aggressive readahead, the second problem (the I/O asynchrony) is handled better by the Linux kernel and filesystem. That's what we're seeing with XFS. I think your point doesn't hold up. Every time you make it, I come away posting another result showing it to be incorrect. The point your making doesn't match my experience with *any* storage or program I've ever used, including postgresql. Your point suggests that the storage system is idle and that postgresql is broken because it isn't able to use the resources available...even when the cpu is very idle. How can that make sense? The issue here is that the storage system is very active doing reads on the files...which might be somewhat poorly allocated on disk because postgresql grows the tables bit by bit. I had the same readahead in Reiser and in XFS. The XFS performance was better because XFS does a better job of large file allocation on disk, thus resulting in many fewer seeks (generated by the file system itself) to read the files back in. As an example, some file systems like UFS purposely scatter large files across cylinder groups to avoid forcing large seeks on small files; one can tune this behavior so that large files are more tightly allocated. Of course, because this is engineering, I have another obligatory data point: This time it's a 4.2GB table using 137,138 32KB pages with nearly 41 million rows. A select count(1) on the table completes in 14.6 seconds, for an average read rate of 320 MB/s. One cpu was idle, the other averaged 32% system time and 68 user time for the 14 second period. This is on a 2.2Ghz Opteron. A faster cpu would show increased performance as I really am cpu bound finally. Postgresql is clearly able to issue the relevant sequential read() system calls and sink the resulting data without a problem if the file system is capable of providing the data. It can do this up to a speed of ~300MB/s on this class of system. Now it should be fairly simple to tweak the few spots where some excess memory copies are being done and up this result substantially. I hope postgresql is always using the libc memcpy as that's going to be a lot faster then some private routine. -- Alan ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Hardware/OS recommendations for large databases (
Alan Stange [EMAIL PROTECTED] writes: The point your making doesn't match my experience with *any* storage or program I've ever used, including postgresql. Your point suggests that the storage system is idle and that postgresql is broken because it isn't able to use the resources available...even when the cpu is very idle. How can that make sense? Well I think what he's saying is that Postgres is issuing a read, then waiting for the data to return. Then it does some processing, and goes back to issue another read. The CPU is idle half the time because Postgres isn't capable of doing any work while waiting for i/o, and the i/o system is idle half the time while the CPU intensive part happens. (Consider as a pathological example a program that reads 8k then sleeps for 10ms, and loops doing that 1,000 times. Now consider the same program optimized to read 8M asynchronously and sleep for 10s. By the time it's finished sleeping it has probably read in all 8M. Whereas the program that read 8k in little chunks interleaved with small sleeps would probably take twice as long and appear to be entirely i/o-bound with 50% iowait and 50% idle.) It's a reasonable theory and it's not inconsistent with the results you sent. But it's not exactly proven either. Nor is it clear how to improve matters. Adding additional threads to handle the i/o adds an enormous amount of complexity and creates lots of opportunity for other contention that could easily eat all of the gains. I also fear that heading in that direction could push Postgres even further from the niche of software that works fine even on low end hardware into the realm of software that only works on high end hardware. It's already suffering a bit from that. -- greg ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] Hardware/OS recommendations for large databases (
Greg Stark wrote: I also fear that heading in that direction could push Postgres even further from the niche of software that works fine even on low end hardware into the realm of software that only works on high end hardware. It's already suffering a bit from that. What's high end hardware for you? I do development on a Celeron 533 machine with 448 MB of RAM and I find it to work well (for a slow value of well, certainly.) If you're talking about embedded hardware, that's another matter entirely and I don't think we really support the idea of running Postgres on one of those things. There's certainly true in that the memory requirements have increased a bit, but I don't think it really qualifies as high end even on 8.1. -- Alvaro Herrera Developer, http://www.PostgreSQL.org Jude: I wish humans laid eggs Ringlord: Why would you want humans to lay eggs? Jude: So I can eat them ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Hardware/OS recommendations for large databases (
On Mon, Nov 21, 2005 at 02:01:26PM -0500, Greg Stark wrote: I also fear that heading in that direction could push Postgres even further from the niche of software that works fine even on low end hardware into the realm of software that only works on high end hardware. It's already suffering a bit from that. Well, there are are alread a bunch of open source DB's that can handle the low end. postgres is the closest thing to being able to handle the high end. Mike Stone ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] Hardware/OS recommendations for large databases (
Would it be worth first agreeing on a common set of criteria to measure? I see many data points going back and forth but not much agreement on what's worth measuring and how to measure. I'm not necessarily trying to herd cats, but it sure would be swell to have the several knowledgeable minds here come up with something that could uniformly tested on a range of machines, possibly even integrated into pg_bench or something. Disagreements on criteria or methodology should be methodically testable. Then I have dreams of a new pg_autotune that would know about these kinds of system-level settings. I haven't been on this list for long, and only using postgres for a handful of years, so forgive it if this has been hashed out before. -Bill - Bill McGonigle, Owner Work: 603.448.4440 BFC Computing, LLC Home: 603.448.1668 [EMAIL PROTECTED] Mobile: 603.252.2606 http://www.bfccomputing.com/Pager: 603.442.1833 Jabber: [EMAIL PROTECTED] Text: [EMAIL PROTECTED] Blog: http://blog.bfccomputing.com/ ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] Hardware/OS recommendations for large databases (
Alan, On 11/21/05 6:57 AM, Alan Stange [EMAIL PROTECTED] wrote: $ time dd if=/dev/zero of=/fidb1/bigfile bs=8k count=80 80+0 records in 80+0 records out real0m13.780s user0m0.134s sys 0m13.510s Oops. I just wrote 470MB/s to a file system that has peak write speed of 200MB/s peak. How much RAM on this machine? Now, you might say that you wrote a 16GB file on an 8 GB machine so this isn't an issue. It does make your dd numbers look fast as some of the data will be unwritten. This simple test, at 2x memory correlates very closely to Bonnie++ numbers for sequential scan. What's more, we see close to the same peak in practice with multiple scanners. Furthermore, if you run two of them simultaneously (on two filesystems), you can also see the I/O limited. I'd also suggest running dd on the same files as postgresql. I suspect you'd find that the layout of the postgresql files isn't that good as they are grown bit by bit, unlike the file created by simply dd'ing a large file. Can happen if you're not careful with filesystems (see above). There's nothing wrong with the dd test. I think your point doesn't hold up. Every time you make it, I come away posting another result showing it to be incorrect. Prove it - your Reiserfs number was about the same. I also posted an XFS number that was substantially higher than 110-120. The point your making doesn't match my experience with *any* storage or program I've ever used, including postgresql. Your point suggests that the storage system is idle and that postgresql is broken because it isn't able to use the resources available...even when the cpu is very idle. How can that make sense? The issue here is that the storage system is very active doing reads on the files...which might be somewhat poorly allocated on disk because postgresql grows the tables bit by bit. Then you've made my point - if the problem is contiguity of files on disk, then larger allocation blocks would help on the CPU side. The objective is clear: given a high performance filesystem, how much of the available bandwidth can Postgres achieve? I think what we're seeing is that XFS is dramatically improving that objective. I had the same readahead in Reiser and in XFS. The XFS performance was better because XFS does a better job of large file allocation on disk, thus resulting in many fewer seeks (generated by the file system itself) to read the files back in. As an example, some file systems like UFS purposely scatter large files across cylinder groups to avoid forcing large seeks on small files; one can tune this behavior so that large files are more tightly allocated. Our other tests have used ext3, reiser and Solaris 10 UFS, so this might make some sense. Of course, because this is engineering, I have another obligatory data point: This time it's a 4.2GB table using 137,138 32KB pages with nearly 41 million rows. A select count(1) on the table completes in 14.6 seconds, for an average read rate of 320 MB/s. So, assuming that the net memory scan rate is about 2GB/s, and two copies (one from FS cache to buffer cache, one from buffer cache to the agg node), you have a 700MB/s filesystem with the equivalent of DirectIO (no FS cache) because you are reading directly from the I/O cache. You got half of that because the I/O processing in the executor is limited to 320MB/s on that fast CPU. My point is this: if you were to decrease the filesystem speed to say 400MB/s and still use the equivalent of DirectIO, I thinkPostgres would not deliver 320MB/s, but rather something like 220MB/s due to the producer/consumer arch of the executor. If you get that part, then we're on the same track, otherwise we disagree. One cpu was idle, the other averaged 32% system time and 68 user time for the 14 second period. This is on a 2.2Ghz Opteron. A faster cpu would show increased performance as I really am cpu bound finally. Yep, with the equivalent of DirectIO you are. - Luke ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Hardware/OS recommendations for large databases (
Tom, On 11/21/05 6:56 AM, Tom Lane [EMAIL PROTECTED] wrote: Luke Lonergan [EMAIL PROTECTED] writes: OK - slower this time: We've seen between 110MB/s and 120MB/s on a wide variety of fast CPU machines with fast I/O subsystems that can sustain 250MB/s+ using dd, but which all are capped at 120MB/s when doing sequential scans with different versions of Postgres. Luke, sometime it would be nice if you would post your raw evidence and let other people do their own analysis. I for one have gotten tired of reading sweeping generalizations unbacked by any data. This has partly been a challenge to get others to post their results. I find the notion of a magic 120MB/s barrier, independent of either CPU or disk speed, to be pretty dubious to say the least. I would like to know exactly what the wide variety of data points you haven't shown us are. I'll try to put up some of them, they've occurred over the last 3 years on various platforms including: - Dual 3.2GHz Xeon, 2 x Adaptec U320 SCSI attached to 6 x 10K RPM disks, Linux 2.6.4(?) - 2.6.10 kernel, ext2/3 and Reiser filesystems 120-130MB/s Postgres seq scan rate on 7.4 and 8.0. - Dual 1.8 GHz Opteron, 2 x LSI U320 SCSI attached to 6 x 10K RPM disks, Linux 2.6.10 kernel, ext2/3 and Reiser filesystems 110-120MB/s Postgres seq scan rate on 8.0 - Same machine as above running Solaris 10, with UFS filesystem. When I/O caching is tuned, we reach the same 110-120MB/s Postgres seq scan rate - Sam machine as above with 7 x 15K RPM 144GB disks in an external disk tray, same scan rate Only when we got these new SATA systems and tried out XFS with large readahead have we been able to break past the 120-130MB/s. After Alan's post, it seems that XFS might be a big part of that. I think we'll test ext2/3 against XFS on the same machine to find out. It may have to wait a week, as many of us are on vacation. - Luke ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke, it's time to back yourself up with some numbers. You're claiming the need for a significant rewrite of portions of postgresql and you haven't done the work to make that case. You've apparently made some mistakes on the use of dd to benchmark a storage system. Use lmdd and umount the file system before the read and post your results. Using a file 2x the size of memory doesn't work corectly. You can quote any other numbers you want, but until you use lmdd correctly you should be ignored. Ideally, since postgresql uses 1GB files, you'll want to use 1GB files for dd as well. Luke Lonergan wrote: Alan, On 11/21/05 6:57 AM, Alan Stange [EMAIL PROTECTED] wrote: $ time dd if=/dev/zero of=/fidb1/bigfile bs=8k count=80 80+0 records in 80+0 records out real0m13.780s user0m0.134s sys 0m13.510s Oops. I just wrote 470MB/s to a file system that has peak write speed of 200MB/s peak. How much RAM on this machine? Doesn't matter. The result will always be wrong without a call to sync() or fsync() before the close() if you're trying to measure the speed of the disk subsystem. Add that sync() and the result will be correct for any memory size. Just for completeness: Solaris implicitly calls sync() as part of close. Bonnie used to get this wrong, so quoting Bonnie isn't any good. Note that on some systems using 2x memory for these tests is almost OK. For example, Solaris used to have a hiwater mark that would throttle processes and not allow more than a few 100K of writes to be outstanding on a file. Linux/XFS clearly allows a lot of write data to be outstanding. It's best to understand the tools and know what they do and why they can be wrong than simply quoting some other tool that makes the same mistakes. I find that postgresql is able to achieve about 175MB/s on average from a system capable of delivering 200MB/s peak and it does this with a lot of cpu time to spare. Maybe dd can do a little better and deliver 185MB/s.If I were to double the speed of my IO system, I might find that a single postgresql instance can sink about 300MB/s of data (based on the last numbers I posted). That's why I have multi-cpu opterons and more than one query/client as they soak up the remaining IO capacity. It is guaranteed that postgresql will hit some threshold of performance in the future and possible rewrites of some core functionality will be needed, but no numbers posted here so far have made the case that postgresql is in trouble now. In the mean time, build balanced systems with cpus that match the capabilities of the storage subsystems, use 32KB block sizes for large memory databases that are doing lots of sequential scans, use file systems tuned for large files, use opterons, etc. As always, one has to post some numbers. Here's an example of how dd doesn't do what you might expect: mite02:~ # lmdd if=internal of=/fidb2/bigfile bs=8k count=2k 16.7772 MB in 0.0235 secs, 714.5931 MB/sec mite02:~ # lmdd if=internal of=/fidb2/bigfile bs=8k count=2k sync=1 16.7772 MB in 0.1410 secs, 118.9696 MB/sec Both numbers are correct. But one measures the kernels ability to absorb 2000 8KB writes with no guarantee that the data is on disk and the second measures the disk subsystems ability to write 16MB of data. dd is equivalent to the first result. You can't use the first type of result and complain that postgresql is slow. If you wrote 16G of data on a machine with 8G memory then your dd result is possibly too fast by a factor of two as 8G of the data might not be on disk yet. We won't know until you post some results. Cheers, -- Alan ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Hardware/OS recommendations for large databases (
Alan, Looks like Postgres gets sensible scan rate scaling as the filesystem speed increases, as shown below. I'll drop my 120MB/s observation - perhaps CPUs got faster since I last tested this. The scaling looks like 64% of the I/O subsystem speed is available to the executor - so as the I/O subsystem increases in scan rate, so does Postgres' executor scan speed. So that leaves the question - why not more than 64% of the I/O scan rate? And why is it a flat 64% as the I/O subsystem increases in speed from 333-400MB/s? - Luke = Results === Unless noted otherwise all results posted are for block device readahead set to 16M using blockdev --setra=16384 block_device. All are using the 2.6.9-11 Centos 4.1 kernel. For those who don't have lmdd, here is a comparison of two results on an ext2 filesystem: [EMAIL PROTECTED] dbfast1]# time bash -c (dd if=/dev/zero of=/dbfast1/bigfile bs=8k count=80 sync) 80+0 records in 80+0 records out real0m33.057s user0m0.116s sys 0m13.577s [EMAIL PROTECTED] dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k count=80 sync=1 6553.6000 MB in 31.2957 secs, 209.4092 MB/sec real0m33.032s user0m0.087s sys 0m13.129s So lmdd with sync=1 is equivalent to a sync after a dd. I use 2x memory with dd for the *READ* performance testing, but let's make sure things are synced on both write and read for this set of comparisons. First, let's test ext2 versus ext3, data=ordered, versus xfs: 16GB write, then read --- ext2: --- [EMAIL PROTECTED] dbfast1]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k count=200 sync=1 16384. MB in 144.2670 secs, 113.5672 MB/sec [EMAIL PROTECTED] dbfast1]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k count=200 sync=1 16384. MB in 49.3766 secs, 331.8170 MB/sec --- ext3, data=ordered: --- [EMAIL PROTECTED] ~]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k count=200 sync=1 16384. MB in 137.1607 secs, 119.4511 MB/sec [EMAIL PROTECTED] ~]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k count=200 sync=1 16384. MB in 48.7398 secs, 336.1527 MB/sec --- xfs: --- [EMAIL PROTECTED] ~]# time lmdd if=/dev/zero of=/dbfast1/bigfile bs=8k count=200 sync=1 16384. MB in 52.6141 secs, 311.3994 MB/sec [EMAIL PROTECTED] ~]# time lmdd if=/dbfast1/bigfile of=/dev/null bs=8k count=200 sync=1 16384. MB in 40.2807 secs, 406.7453 MB/sec I'm liking xfs! Something about the way files are layed out, as Alan suggested seems to dramatically improve write performance and perhaps consequently the read also improves. There doesn't seem to be a difference between ext3 and ext2, as expected. Now on to the Postgres 8 tests. We'll do a 16GB table size to ensure that we aren't reading from the read cache. I'll write this file through Postgres COPY to be sure that the file layout is as Postgres creates it. The alternative would be to use COPY once, then tar/untar onto different filesystems, but that may not duplicate the real world results. These tests will use Bizgres 0_8_1, which is an augmented 8.0.3. None of the augmentations act to improve the executor I/O though, so for these purposes it should be the same as 8.0.3. 26GB of DBT-3 data from the lineitem table llonergan=# select relpages from pg_class where relname='lineitem'; relpages -- 3159138 (1 row) 3159138*8192/100 25879 Million Bytes, or 25.9GB --- xfs: --- llonergan=# \timing Timing is on. llonergan=# select count(1) from lineitem; count --- 119994608 (1 row) Time: 394908.501 ms llonergan=# select count(1) from lineitem; count --- 119994608 (1 row) Time: 99425.223 ms llonergan=# select count(1) from lineitem; count --- 119994608 (1 row) Time: 99187.205 ms --- ext2: --- llonergan=# select relpages from pg_class where relname='lineitem'; relpages -- 3159138 (1 row) llonergan=# \timing Timing is on. llonergan=# select count(1) from lineitem; count --- 119994608 (1 row) Time: 395286.475 ms llonergan=# select count(1) from lineitem; count --- 119994608 (1 row) Time: 195756.381 ms llonergan=# select count(1) from lineitem; count ---
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke Lonergan wrote: So that leaves the question - why not more than 64% of the I/O scan rate? And why is it a flat 64% as the I/O subsystem increases in speed from 333-400MB/s? It might be interesting to see what effect reducing the cpu consumption entailed by the count aggregation has - by (say) writing a little bit of code to heap scan the desired relation (sample attached). Cheers Mark /* * fastcount.c * * Do a count that uses considerably less CPU time than an aggregate. */ #include postgres.h #include funcapi.h #include access/heapam.h #include catalog/namespace.h #include utils/builtins.h extern Datum fastcount(PG_FUNCTION_ARGS); PG_FUNCTION_INFO_V1(fastcount); Datum fastcount(PG_FUNCTION_ARGS) { text *relname = PG_GETARG_TEXT_P(0); RangeVar *relrv; Relationrel; HeapScanDesc scan; HeapTuple tuple; int64 result = 0; /* Use the name to get a suitable range variable and open the relation. */ relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname)); rel = heap_openrv(relrv, AccessShareLock); /* Start a heap scan on the relation. */ scan = heap_beginscan(rel, SnapshotNow, 0, NULL); while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) { result++; } /* End the scan and close up the relation. */ heap_endscan(scan); heap_close(rel, AccessShareLock); PG_RETURN_INT64(result); } ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Hardware/OS recommendations for large databases (
Mark Kirkwood wrote: - I am happy that seqscan is cpu bound after ~110M/s (It's cpu bound on my old P3 system even earlier than that) Ahem - after reading Alan's postings I am not so sure, ISTM that there is some more investigation required here too :-). ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Hardware/OS recommendations for large databases (
Alan Stange wrote: Luke Lonergan wrote: The aka iowait is the problem here - iowait is not idle (otherwise it would be in the idle column). Iowait is time spent waiting on blocking io calls. As another poster pointed out, you have a two CPU system, and during your scan, as iowait time is idle time. Period. This point has been debated endlessly for Solaris and other OS's as well. I'm sure the the theory is nice but here's my experience with iowait just a minute ago. I run Linux/XFce as my desktop -- decided I wanted to lookup some stuff in Wikipedia under Mozilla and my computer system became completely unusable for nearly a minute while who knows what Mozilla was doing. (Probably loading all the language packs.) I could not even switch to IRC (already loaded) to chat with other people while Mozilla was chewing up all my disk I/O. So I went to another computer, connected to mine remotely (slow...) and checked top. 90% in the wa column which I assume is the iowait column. It may be idle in theory but it's not a very useful idle -- wasn't able to switch to any programs already running, couldn't click on the XFce launchbar to run any new programs. ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Hardware/OS recommendations for large databases (
On Sat, Nov 19, 2005 at 08:13:09AM -0800, Luke Lonergan wrote: Iowait is time spent waiting on blocking io calls. To be picky, iowait is time spent in the idle task while the I/O queue is not empty. It does not matter if the I/O is blocking or not (from userspace's point of view), and if the I/O was blocking (say, PIO) from the kernel's point of view, it would be counted in system. /* Steinar */ -- Homepage: http://www.sesse.net/ ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Hardware/OS recommendations for large databases (
William Yu wrote: Alan Stange wrote: Luke Lonergan wrote: The aka iowait is the problem here - iowait is not idle (otherwise it would be in the idle column). Iowait is time spent waiting on blocking io calls. As another poster pointed out, you have a two CPU system, and during your scan, as iowait time is idle time. Period. This point has been debated endlessly for Solaris and other OS's as well. I'm sure the the theory is nice but here's my experience with iowait just a minute ago. I run Linux/XFce as my desktop -- decided I wanted to lookup some stuff in Wikipedia under Mozilla and my computer system became completely unusable for nearly a minute while who knows what Mozilla was doing. (Probably loading all the language packs.) I could not even switch to IRC (already loaded) to chat with other people while Mozilla was chewing up all my disk I/O. So I went to another computer, connected to mine remotely (slow...) and checked top. 90% in the wa column which I assume is the iowait column. It may be idle in theory but it's not a very useful idle -- wasn't able to switch to any programs already running, couldn't click on the XFce launchbar to run any new programs. So, you have a sucky computer.I'm sorry, but iowait is still idle time, whether you believe it or not. -- Alan ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Hardware/OS recommendations for large databases (
Alan Stange [EMAIL PROTECTED] writes: Iowait is time spent waiting on blocking io calls. As another poster pointed out, you have a two CPU system, and during your scan, as predicted, one CPU went 100% busy on the seq scan. During iowait periods, the CPU can be context switched to other users, but as I pointed out earlier, that's not useful for getting response on decision support queries. I don't think that's true. If the syscall was preemptable then it wouldn't show up under iowait, but rather idle. The time spent in iowait is time in uninterruptable sleeps where no other process can be scheduled. iowait time is idle time. Period. This point has been debated endlessly for Solaris and other OS's as well. Here's the man page: %iowait Show the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request. If the system had some other cpu bound work to perform you wouldn't ever see any iowait time. Anyone claiming the cpu was 100% busy on the sequential scan using the one set of numbers I posted is misunderstanding the actual metrics. That's easy to test. rerun the test with another process running a simple C program like main() {while(1);} (or two invocations of that on your system because of the extra processor). I bet you'll see about half the percentage of iowait because postres will get half as much opportunity to schedule i/o. If what you are saying were true then you should get 0% iowait. -- greg ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Hardware/OS recommendations for large databases (
On Sun, Nov 20, 2005 at 09:22:41AM -0500, Greg Stark wrote: I don't think that's true. If the syscall was preemptable then it wouldn't show up under iowait, but rather idle. The time spent in iowait is time in uninterruptable sleeps where no other process can be scheduled. You are confusing userspace with kernel space. When a process is stuck in uninterruptable sleep, it means _that process_ can't be interrupted (say, by a signal). The kernel can preempt it without problems. /* Steinar */ -- Homepage: http://www.sesse.net/ ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Hardware/OS recommendations for large databases (
Greg Stark wrote: Alan Stange [EMAIL PROTECTED] writes: Iowait is time spent waiting on blocking io calls. As another poster pointed out, you have a two CPU system, and during your scan, as predicted, one CPU went 100% busy on the seq scan. During iowait periods, the CPU can be context switched to other users, but as I pointed out earlier, that's not useful for getting response on decision support queries. I don't think that's true. If the syscall was preemptable then it wouldn't show up under iowait, but rather idle. The time spent in iowait is time in uninterruptable sleeps where no other process can be scheduled. That would be wrong. The time spent in iowait is idle time. The iowait stat would be 0 on a machine with a compute bound runnable process available for each cpu. Come on people, read the man page or look at the source code. Just stop making stuff up. iowait time is idle time. Period. This point has been debated endlessly for Solaris and other OS's as well. Here's the man page: %iowait Show the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request. If the system had some other cpu bound work to perform you wouldn't ever see any iowait time. Anyone claiming the cpu was 100% busy on the sequential scan using the one set of numbers I posted is misunderstanding the actual metrics. That's easy to test. rerun the test with another process running a simple C program like main() {while(1);} (or two invocations of that on your system because of the extra processor). I bet you'll see about half the percentage of iowait because postres will get half as much opportunity to schedule i/o. If what you are saying were true then you should get 0% iowait. Yes, I did this once about 10 years ago. But instead of saying I bet and guessing at the result, you should try it yourself. Without guessing, I can tell you that the iowait time will go to 0%. You can do this loop in the shell, so there's no code to write. Also, it helps to do this with the shell running at a lower priority. -- Alan ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Hardware/OS recommendations for large databases (
Alan Stange wrote: Another data point. We had some down time on our system today to complete some maintenance work. It took the opportunity to rebuild the 700GB file system using XFS instead of Reiser. One iostat output for 30 seconds is avg-cpu: %user %nice%sys %iowait %idle 1.580.00 19.69 31.94 46.78 Device:tpskB_read/skB_wrtn/skB_readkB_wrtn sdd 343.73175035.73 277.555251072 8326 while doing a select count(1) on the same large table as before. Subsequent iostat output all showed that this data rate was being maintained. The system is otherwise mostly idle during this measurement. The sequential read rate is 175MB/s. The system is the same as earlier, one cpu is idle and the second is ~40% busy doing the scan and ~60% idle. This is postgresql 8.1rc1, 32KB block size. No tuning except for using a 1024KB read ahead. The peak speed of the attached storage is 200MB/s (a 2Gb/s fiber channel controller). I see no reason why this configuration wouldn't generate higher IO rates if a faster IO connection were available. Can you explain again why you think there's an IO ceiling of 120MB/s because I really don't understand? I think what is going on here is that Luke's observation of the 120 Mb/s rate is taken from data using 8K block size - it looks like we can get higher rates with 32K. A quick test on my P3 system seems to support this (the numbers are a bit feeble, but the difference is interesting): The test is SELECT 1 FROM table, stopping Pg and unmounting the file system after each test. 8K blocksize: 25 s elapsed 48 % idle from vmstat (dual cpu system) 70 % busy from gstat (Freebsd GEOM io monitor) 181819 pages in relation 56 Mb/s effective IO throughput 32K blocksize: 23 s elapsed 44 % idle from vmstat 80 % busy from gstat 45249 pages in relation 60 Mb/s effective IO throughput I re-ran these several times - very repeatable (+/- 0.25 seconds). This is Freebsd 6.0 with the readahead set to 16 blocks, UFS2 filesystem created with 32K blocksize (both cases). It might be interesting to see the effect of using 16K (the default) with the 8K Pg block size, I would expect this to widen the gap. Cheers Mark ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] Hardware/OS recommendations for large databases (
Mark Kirkwood wrote: The test is SELECT 1 FROM table That should read The test is SELECT count(1) FROM table ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Hardware/OS recommendations for large databases (
Mark, On 11/18/05 6:27 PM, Mark Kirkwood [EMAIL PROTECTED] wrote: That too, meaning the business of 1 executor random reading a given relation file whilst another is sequentially scanning (some other) part of it I think it should actually improve things - each I/O will read 16MB into the I/O cache, then the next scanner will seek for 10ms to get the next 16MB into cache, etc. It should minimize the seek/data ratio nicely. As long as the tables are contiguous it should rock and roll. - Luke ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Hardware/OS recommendations for large databases (
Alan, On 11/18/05 11:39 AM, Alan Stange [EMAIL PROTECTED] wrote: Yes and no. The one cpu is clearly idle. The second cpu is 40% busy and 60% idle (aka iowait in the above numbers). The aka iowait is the problem here - iowait is not idle (otherwise it would be in the idle column). Iowait is time spent waiting on blocking io calls. As another poster pointed out, you have a two CPU system, and during your scan, as predicted, one CPU went 100% busy on the seq scan. During iowait periods, the CPU can be context switched to other users, but as I pointed out earlier, that's not useful for getting response on decision support queries. Thanks for your data, it exemplifies many of the points brought up: - Lots of disks and expensive I/O hardware does not help improve performance on large table queries because I/O bandwidth does not scale beyond 110-120MB/s on the fastest CPUs - OLTP performance optimizations are different than decision support Regards, - Luke ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke Lonergan wrote: Mark, On 11/18/05 3:46 PM, Mark Kirkwood [EMAIL PROTECTED] wrote: If you alter this to involve more complex joins (e.g 4. way star) and (maybe add a small number of concurrent executors too) - is it still the case? 4-way star, same result, that's part of my point. With Bizgres MPP, the 4-way star uses 4 concurrent scanners, though not all are active all the time. And that's per segment instance - we normally use one segment instance per CPU, so our concurrency is NCPUs plus some. Luke - I don't think I was clear enough about what I was asking, sorry. I added the more complex joins comment because: - I am happy that seqscan is cpu bound after ~110M/s (It's cpu bound on my old P3 system even earlier than that) - I am curious if the *other* access methods (indexscan, nested loop, hash, merge, bitmap) also suffer then same fate. I'm guessing from your comment that you have tested this too, but I think its worth clarifying! With respect to Bizgres MPP, scan parallelism is a great addition... very nice! (BTW - is that in 0.8, or are we talking a new product variant?) regards Mark ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke Lonergan wrote: Alan, On 11/18/05 11:39 AM, Alan Stange [EMAIL PROTECTED] wrote: Yes and no. The one cpu is clearly idle. The second cpu is 40% busy and 60% idle (aka iowait in the above numbers). The aka iowait is the problem here - iowait is not idle (otherwise it would be in the idle column). Iowait is time spent waiting on blocking io calls. As another poster pointed out, you have a two CPU system, and during your scan, as predicted, one CPU went 100% busy on the seq scan. During iowait periods, the CPU can be context switched to other users, but as I pointed out earlier, that's not useful for getting response on decision support queries. iowait time is idle time. Period. This point has been debated endlessly for Solaris and other OS's as well. Here's the man page: %iowait Show the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request. If the system had some other cpu bound work to perform you wouldn't ever see any iowait time. Anyone claiming the cpu was 100% busy on the sequential scan using the one set of numbers I posted is misunderstanding the actual metrics. Thanks for your data, it exemplifies many of the points brought up: - Lots of disks and expensive I/O hardware does not help improve performance on large table queries because I/O bandwidth does not scale beyond 110-120MB/s on the fastest CPUs I don't think that is the conclusion from anecdotal numbers I posted. This file subsystem doesn't perform as well as expected for any tool. Bonnie, dd, star, etc., don't get a better data rate either. In fact, the storage system wasn't built for performance; it was build to reliably hold a big chunk of data. Even so, postgresql is reading at 130MB/s on it, using about 30% of a single cpu, almost all of which was system time. I would get the same 130MB/s on a system with cpus that were substantially slower; the limitation isn't the cpus, or postgresql. It's the IO system that is poorly configured for this test, not postgresqls ability to use it. In fact, given the numbers I posted, it's clear this system could handily generate more than 120 MB/s using a single cpu given a better IO subsystem; it has cpu time to spare. A simple test can be done: build the database in /dev/shm and time the scans. It's the same read() system call being used and now one has made the IO system infinitely fast. The claim is being made that standard postgresql is unable to generate more than 120MB/s of IO on any IO system due to an inefficient use of the kernel API and excessive memory copies, etc. Having the database be on a ram based file system is an example of expensive IO hardware and all else would be the same. Hmmm, now that I think about this, I could throw a medium sized table onto /dev/shm using tablespaces on one of our 8GB linux boxes.So why is this experiment not valid, or what is it about the above assertion that I am missing? Anyway, if one cares about high speed sequential IO, then one should use a much larger block size to start. Using 8KB IOs is inappropriate for such a configuration. We happen to be using 32KB blocks on our largest database and it's been the best move for us. -- Alan ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Hardware/OS recommendations for large databases (
Another data point. We had some down time on our system today to complete some maintenance work. It took the opportunity to rebuild the 700GB file system using XFS instead of Reiser. One iostat output for 30 seconds is avg-cpu: %user %nice%sys %iowait %idle 1.580.00 19.69 31.94 46.78 Device:tpskB_read/skB_wrtn/skB_readkB_wrtn sdd 343.73175035.73 277.555251072 8326 while doing a select count(1) on the same large table as before. Subsequent iostat output all showed that this data rate was being maintained. The system is otherwise mostly idle during this measurement. The sequential read rate is 175MB/s. The system is the same as earlier, one cpu is idle and the second is ~40% busy doing the scan and ~60% idle. This is postgresql 8.1rc1, 32KB block size. No tuning except for using a 1024KB read ahead. The peak speed of the attached storage is 200MB/s (a 2Gb/s fiber channel controller). I see no reason why this configuration wouldn't generate higher IO rates if a faster IO connection were available. Can you explain again why you think there's an IO ceiling of 120MB/s because I really don't understand? -- Alan ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Hardware/OS recommendations for large databases (
On 17-Nov-05, at 2:50 PM, Alex Turner wrote: Just pick up a SCSI drive and a consumer ATA drive. Feel their weight. You don't have to look inside to tell the difference. At one point stereo manufacturers put weights in the case just to make them heavier. The older ones weighed more and the consumer liked heavy stereos. Be careful what you measure. Dave Alex On 11/16/05, David Boreham [EMAIL PROTECTED] wrote: I suggest you read this on the difference between enterprise/SCSI and desktop/IDE drives: http://www.seagate.com/content/docs/pdf/whitepaper/ D2c_More_than_Interface_ATA_vs_SCSI_042003.pdf This is exactly the kind of vendor propaganda I was talking about and it proves my point quite well : that there's nothing specific relating to reliability that is different between SCSI and SATA drives cited in that paper. It does have a bunch of FUD such as 'oh yeah we do a lot more drive characterization during manufacturing'. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] Hardware/OS recommendations for large databases (
On 18-Nov-05, at 1:07 AM, Luke Lonergan wrote: Greg, On 11/17/05 9:17 PM, Greg Stark [EMAIL PROTECTED] wrote: Ok, a more productive point: it's not really the size of the database that controls whether you're I/O bound or CPU bound. It's the available I/O bandwidth versus your CPU speed. Postgres + Any x86 CPU from 2.4GHz up to Opteron 280 is CPU bound after 110MB/s of I/O. This is true of Postgres 7.4, 8.0 and 8.1. A $1,000 system with one CPU and two SATA disks in a software RAID0 will perform exactly the same as a $80,000 system with 8 dual core CPUs and the world's best SCSI RAID hardware on a large database for decision support (what the poster asked about). Now there's an interesting line drawn in the sand. I presume you have numbers to back this up ? This should draw some interesting posts. Dave Regards, - Luke ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Hardware/OS recommendations for large databases (
Dave Cramer wrote: On 18-Nov-05, at 1:07 AM, Luke Lonergan wrote: Postgres + Any x86 CPU from 2.4GHz up to Opteron 280 is CPU bound after 110MB/s of I/O. This is true of Postgres 7.4, 8.0 and 8.1. A $1,000 system with one CPU and two SATA disks in a software RAID0 will perform exactly the same as a $80,000 system with 8 dual core CPUs and the world's best SCSI RAID hardware on a large database for decision support (what the poster asked about). Now there's an interesting line drawn in the sand. I presume you have numbers to back this up ? This should draw some interesting posts. Well, I'm prepared to swap Luke *TWO* $1000 systems for one $80,000 system if he's got one going :-) -- Richard Huxton Archonet Ltd ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke Lonergan wrote: Alan, On 11/18/05 5:41 AM, Alan Stange [EMAIL PROTECTED] wrote: That's interesting, as I occasionally see more than 110MB/s of postgresql IO on our system. I'm using a 32KB block size, which has been a huge win in performance for our usage patterns. 300GB database with a lot of turnover. A vacuum analyze now takes about 3 hours, which is much shorter than before. Postgresql 8.1, dual opteron, 8GB memory, Linux 2.6.11, FC drives. 300GB / 3 hours = 27MB/s. That's 3 hours under load, with 80 compute clients beating on the database at the same time. We have the stats turned way up, so the analyze tends to read a big chunk of the tables a second time as well.We typically don't have three hours a day of idle time. -- Alan ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Hardware/OS recommendations for large databases (
Title: Re: [PERFORM] Hardware/OS recommendations for large databases ( Dave, On 11/18/05 5:00 AM, Dave Cramer [EMAIL PROTECTED] wrote: Now there's an interesting line drawn in the sand. I presume you have numbers to back this up ? This should draw some interesting posts. Part 2: The answer System A: This system is running RedHat 3 Update 4, with a Fedora 2.6.10 Linux kernel. On a single table with 15 columns (the Bizgres IVP) at a size double memory (2.12GB), Postgres 8.0.3 with Bizgres enhancements takes 32 seconds to scan the table: thats 66 MB/s. Not the efficiency Id hope from the onboard SATA controller that Id like, I would have expected to get 85% of the 100MB/s raw read performance. So thats $1,200 / 66 MB/s (without adjusting for 2003 price versus now) = 18.2 $/MB/s Raw data: [EMAIL PROTECTED] IVP]$ cat scan.sh #!/bin/bash time psql -c select count(*) from ivp.bigtable1 dgtestdb [EMAIL PROTECTED] IVP]$ cat sysout1 count -- 1000 (1 row) real 0m32.565s user 0m0.002s sys 0m0.003s Size of the table data: [EMAIL PROTECTED] IVP]$ du -sk dgtestdb/base 2121648 dgtestdb/base System B: This system is running an XFS filesystem, and has been tuned to use very large (16MB) readahead. Its running the Centos 4.1 distro, which uses a Linux 2.6.9 kernel. Same test as above, but with 17GB of data takes 69.7 seconds to scan (!) Thats 244.2MB/s, which is obviously double my earlier point of 110-120MB/s. This system is running with a 16MB Linux readahead setting, lets try it with the default (I think) setting of 256KB AHA! Now we get 171.4 seconds or 99.3MB/s. So, using the tuned setting of blockdev setra 16384 we get $6,000 / 244MB/s = 24.6 $/MB/s If we use the default Linux setting its 2.5x worse. Raw data: [EMAIL PROTECTED] IVP]$ cat scan.sh #!/bin/bash time psql -c select count(*) from ivp.bigtable1 dgtestdb [EMAIL PROTECTED] IVP]$ cat sysout3 count -- 8000 (1 row) real 1m9.875s user 0m0.000s sys 0m0.004s [EMAIL PROTECTED] IVP]$ !du du -sk dgtestdb/base 17021260 dgtestdb/base Summary: cough, cough OK you can get more I/O bandwidth out of the current I/O path for sequential scan if you tune the filesystem for large readahead. This is a cheap alternative to overhauling the executor to use asynch I/O. Still, there is a CPU limit here this is not I/O bound, it is CPU limited as evidenced by the sensitivity to readahead settings. If the filesystem could do 1GB/s, you wouldnt go any faster than 244MB/s. - Luke
Re: [PERFORM] Hardware/OS recommendations for large databases
While I agree with you in principle that pg becomes CPU bound relatively easily compared to other DB products (at ~110-120MBps according to a recent thread), there's a bit of hyperbole in your post. a. There's a big difference between the worst performing 1C x86 ISA CPU available and the best performing 2C one (IIRC, that's the 2.4GHz, 1MB L2 cache AMDx2 4800+ as of this writing) b. Two 2C CPU's vs one 1C CPU means that a pg process will almost never be waiting on other non pg processes. It also means that 3-4 pg processes, CPU bound or not, can execute in parallel. Not an option with one 1C CPU. c. Mainboards with support for multiple CPUs and lots' of RAM are _not_ the cheap ones. d. No one should ever use RAID 0 for valuable data. Ever. So at the least you need 4 HD's for a RAID 10 set (RAID 5 is not a good option unless write performance is unimportant. 4HD RAID 5 is particularly not a good option.) e. The server usually needs to talk to things over a network connection. Often performance here matters. Mainboards with 2 1GbE NICs and/or PCI-X (or PCI-E) slots for 10GbE cards are not the cheap ones. f. Trash HDs mean poor IO performance and lower reliability. While TOTL 15Krpm 4Gb FC HDs are usually overkill (Not always. It depends on context.), you at least want SATA II HDs with NCQ or TCQ support. And you want them to have a decent media warranty- preferably a 5 year one if you can get it. Again, these are not the cheapest HD's available. g. Throughput limitations say nothing about latency considerations. OLTP-like systems _want_ HD spindles. AMAP. Even non OLTP-like systems need a fair number of spindles to optimize HD IO: dedicated WAL set, multiple dedicated DB sets, dedicated OS and swap space set, etc, etc. At 50MBps ASTR, you need 16 HD's operating in parallel to saturate the bandwidth of a PCI-X channel. That's ~8 independent pg tasks (queries using different tables, dedicated WAL IO, etc) running in parallel. Regardless of application domain. h. Decent RAID controllers and HBAs are not cheap either. Even SW RAID benefits from having a big dedicated RAM buffer to talk to. While the above may not cost you $80K, it sure isn't costing you $1K either. Maybe ~$15-$20K, but not $1K. Ron At 01:07 AM 11/18/2005, Luke Lonergan wrote: Greg, On 11/17/05 9:17 PM, Greg Stark [EMAIL PROTECTED] wrote: Ok, a more productive point: it's not really the size of the database that controls whether you're I/O bound or CPU bound. It's the available I/O bandwidth versus your CPU speed. Postgres + Any x86 CPU from 2.4GHz up to Opteron 280 is CPU bound after 110MB/s of I/O. This is true of Postgres 7.4, 8.0 and 8.1. A $1,000 system with one CPU and two SATA disks in a software RAID0 will perform exactly the same as a $80,000 system with 8 dual core CPUs and the world's best SCSI RAID hardware on a large database for decision support (what the poster asked about). Regards, - Luke ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Hardware/OS recommendations for large databases (
Title: Re: [PERFORM] Hardware/OS recommendations for large databases ( Dave, On 11/18/05 5:00 AM, Dave Cramer [EMAIL PROTECTED] wrote: Now there's an interesting line drawn in the sand. I presume you have numbers to back this up ? This should draw some interesting posts. OK, here we go: The $1,000 system (System A): - I bought 16 of these in 2003 for $1,200 each. They have Intel or Asus motherboards, Intel P4 3.0GHz CPUs with an 800MHz FSB. They have a system drive and two RAID0 SATA drives, the Western Digital 74GB Raptor (10K RPM). They have 1GB of RAM. A test of write and read performance on the RAID0: [EMAIL PROTECTED] raid0]$ time dd if=/dev/zero of=bigfile bs=8k count=25 25+0 records in 25+0 records out real 0m17.453s user 0m0.249s sys 0m10.246s [EMAIL PROTECTED] raid0]$ time dd if=bigfile of=/dev/null bs=8k 25+0 records in 25+0 records out real 0m18.930s user 0m0.130s sys 0m3.590s So, the write performance is 114MB/s and read performance is 106MB/s. The $6,000 system (System B): I just bought 5 of these systems for $6,000 each. They are dual Opteron systems with 8GB of RAM and 2x 250 model CPUs, which are close to the fastest. They have the new 3Ware 9550SX SATA RAID adapters coupled to Western Digital 400GB RE2 model hard drives. They are organized as a RAID5. A test of write and read performance on the RAID5: [EMAIL PROTECTED] dbfast1]# time dd if=/dev/zero of=bigfile bs=8k count=200 200+0 records in 200+0 records out real 0m51.441s user 0m0.288s sys 0m29.119s [EMAIL PROTECTED] dbfast1]# time dd if=bigfile of=/dev/null bs=8k 200+0 records in 200+0 records out real 0m39.605s user 0m0.244s sys 0m19.207s So, the write performance is 314MB/s and read performance is 404MB/s (!) This is the fastest Ive seen 8 disk drives perform. So, the question is: which of these systems (A or B) can scan a large table faster using non-MPP postgres? How much faster would you wager? Send your answer, and Ill post the result. Regards, - Luke
Re: [PERFORM] Hardware/OS recommendations for large databases (
On 18-Nov-05, at 8:30 AM, Luke Lonergan wrote: Richard, On 11/18/05 5:22 AM, "Richard Huxton" dev@archonet.com wrote: Well, I'm prepared to swap Luke *TWO* $1000 systems for one $80,000 system if he's got one going :-) Finally, a game worth playing! Except it’s backward – I’ll show you 80 $1,000 systems performing 80 times faster than one $80,000 system.Now you wouldn't happen to be selling a system that would enable this for postgres, now would ya ? On your proposition – I don’t have any $80,000 systems for trade, do you? - Luke
Re: [PERFORM] Hardware/OS recommendations for large databases (
On Nov 18, 2005, at 10:13 AM, Luke Lonergan wrote:Still, there is a CPU limit here – this is not I/O bound, it is CPU limited as evidenced by the sensitivity to readahead settings. If the filesystem could do 1GB/s, you wouldn’t go any faster than 244MB/s.Yeah, and mysql would probably be faster on your trivial queries. Try concurrent large joins and updates and see which system is faster.
Re: [PERFORM] Hardware/OS recommendations for large databases (
On Nov 18, 2005, at 1:07 AM, Luke Lonergan wrote: A $1,000 system with one CPU and two SATA disks in a software RAID0 will perform exactly the same as a $80,000 system with 8 dual core CPUs and the world's best SCSI RAID hardware on a large database for decision support (what the poster asked about). Hahahahahahahahahahahahaha! Whooo... needed to fall out of my chair laughing this morning. I can tell you from direct personal experience that you're just plain wrong. I've had to move my primary DB server from a dual P3 1GHz with 4-disk RAID10 SCSI, to Dual P3 2GHz with 14-disk RAID10 and faster drives, to Dual Opteron 2GHz with 8-disk RAID10 and even faster disks to keep up with my load on a 60+ GB database. The Dual opteron system has just a little bit of extra capacity if I offload some of the reporting operations to a replicated copy (via slony1). If I run all the queries on the one DB it can't keep up. One most telling point about the difference in speed is that the 14- disk array system cannot keep up with the replication being generated by the dual opteron, even when it is no doing any other queries of its own. The I/O system just ain't fast enough. ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Hardware/OS recommendations for large databases (
Alan, On 11/18/05 5:41 AM, Alan Stange [EMAIL PROTECTED] wrote: That's interesting, as I occasionally see more than 110MB/s of postgresql IO on our system. I'm using a 32KB block size, which has been a huge win in performance for our usage patterns. 300GB database with a lot of turnover. A vacuum analyze now takes about 3 hours, which is much shorter than before. Postgresql 8.1, dual opteron, 8GB memory, Linux 2.6.11, FC drives. 300GB / 3 hours = 27MB/s. If you are using the 2.6 linux kernel, you may be fooled into thinking you burst more than you actually get in net I/O because the I/O stats changed in tools like iostat and vmstat. The only meaningful stats are (size of data) / (time to process data). Do a sequential scan of one of your large tables that you know the size of, then divide by the run time and report it. I'm compiling some new test data to make my point now. Regards, - Luke ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] Hardware/OS recommendations for large databases (
Richard Huxton wrote: Dave Cramer wrote: On 18-Nov-05, at 1:07 AM, Luke Lonergan wrote: Postgres + Any x86 CPU from 2.4GHz up to Opteron 280 is CPU bound after 110MB/s of I/O. This is true of Postgres 7.4, 8.0 and 8.1. A $1,000 system with one CPU and two SATA disks in a software RAID0 will perform exactly the same as a $80,000 system with 8 dual core CPUs and the world's best SCSI RAID hardware on a large database for decision support (what the poster asked about). Now there's an interesting line drawn in the sand. I presume you have numbers to back this up ? This should draw some interesting posts. That's interesting, as I occasionally see more than 110MB/s of postgresql IO on our system. I'm using a 32KB block size, which has been a huge win in performance for our usage patterns. 300GB database with a lot of turnover. A vacuum analyze now takes about 3 hours, which is much shorter than before. Postgresql 8.1, dual opteron, 8GB memory, Linux 2.6.11, FC drives. -- Alan ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Hardware/OS recommendations for large databases (
Title: Re: [PERFORM] Hardware/OS recommendations for large databases ( Richard, On 11/18/05 5:22 AM, Richard Huxton dev@archonet.com wrote: Well, I'm prepared to swap Luke *TWO* $1000 systems for one $80,000 system if he's got one going :-) Finally, a game worth playing! Except its backward Ill show you 80 $1,000 systems performing 80 times faster than one $80,000 system. On your proposition I dont have any $80,000 systems for trade, do you? - Luke
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke Lonergan wrote: Alan, On 11/18/05 8:13 AM, Alan Stange [EMAIL PROTECTED] wrote: I told you in my initial post that I was observing numbers in excess of what you claiming, but you seemed to think I didn't know how to measure an IO rate. Prove me wrong, post your data. I should note too that our system uses about 20% of a single cpu when performing a table scan at 100MB/s of IO. I think you claimed the system would be cpu bound at this low IO rate. See above. Here's the output from one iteration of iostat -k 60 while the box is doing a select count(1) on a 238GB table. avg-cpu: %user %nice%sys %iowait %idle 0.990.00 17.97 32.40 48.64 Device:tpskB_read/skB_wrtn/skB_readkB_wrtn sdd 345.95130732.53 0.007843952 0 We're reading 130MB/s for a full minute. About 20% of a single cpu was being used. The remainder being idle. We've done nothing fancy and achieved results you claim shouldn't be possible. This is a system that was re-installed yesterday, no tuning was done to the file systems, kernel or storage array. What am I doing wrong? 9 years ago I co-designed a petabyte data store with a goal of 1GB/s IO (for a DOE lab). And now I don't know what I'm doing, Cheers, -- Alan ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Hardware/OS recommendations for large databases (
Alex, On 11/18/05 8:28 AM, Alex Turner [EMAIL PROTECTED] wrote: Ok - so I ran the same test on my system and get a total speed of 113MB/sec. Why is this? Why is the system so limited to around just 110MB/sec? I tuned read ahead up a bit, and my results improve a bit.. OK! Now we're on the same page. Finally someone who actually tests! Check the CPU usage while it's doing the scan. Know what it's doing? Memory copies. We've profiled it extensively. So - that's the suckage - throwing more CPU power helps a bit, but the underlying issue is poorly optimized code in the Postgres executor and lack of I/O asynchrony. - Luke ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Hardware/OS recommendations for large databases (
Bill, On 11/18/05 7:55 AM, Bill McGonigle [EMAIL PROTECTED] wrote: There is some truth to it. For an app I'm currently running (full-text search using tsearch2 on ~100MB of data) on: Do you mean 100GB? Sounds like you are more like a decision support /warehousing application. Dev System: Asus bare-bones bookshelf case/mobo 3GHz P4 w/ HT 800MHz memory Bus Fedora Core 3 (nightly update) 1GB RAM 1 SATA Seagate disk (7200RPM, 8MB Cache) $800 worst-case query: 7.2 seconds About the same machine I posted results for, except I had two faster disks. now, the machine I'm deploying to: Dell SomthingOrOther (4) 2.4GHz Xeons 533MHz memory bus RedHat Enterprise 3.6 1GB RAM (5) 15 RPM Ultra SCSI 320 on an Adaptec RAID 5 controller $1 same worst-case query: 9.6 seconds Your problem here is the HW RAID controller - if you dump it and use the onboard SCSI channels and Linux RAID you will see a jump from 40MB/s to about 220MB/s in read performance and from 20MB/s to 110MB/s write performance. It will use less CPU too. Now it's not apples-to-apples. There's a kernel 2.4 vs. 2.6 difference and the memory bus is much faster and I'm not sure what kind of context switching hit you get with the Xeon MP memory controller. On a previous postgresql app I did I ran nearly identically spec'ed machines except for the memory bus and saw about a 30% boost in performance just with the 800MHz bus. I imagine the Opteron bus does even better. Memory bandwidth is so high on both that it's not a factor. Context switching / memory bus contention isn't either. So the small machine is probably slower on disk but makes up for it in single-threaded access to CPU and memory speed. But if this app were to be scaled it would make much more sense to cluster several $800 machines than it would to buy 'big-iron'. Yes it does - by a lot too. Also, having a multiprocessing executor gets all of each machine by having multiple CPUs scan simultaneously. - Luke ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Hardware/OS recommendations for large databases (
Ok - so I ran the same test on my system and get a total speed of 113MB/sec. Why is this? Why is the system so limited to around just 110MB/sec? I tuned read ahead up a bit, and my results improve a bit.. Alex On 11/18/05, Luke Lonergan [EMAIL PROTECTED] wrote: Dave, On 11/18/05 5:00 AM, Dave Cramer [EMAIL PROTECTED] wrote: Now there's an interesting line drawn in the sand. I presume you have numbers to back this up ? This should draw some interesting posts. Part 2: The answer System A: This system is running RedHat 3 Update 4, with a Fedora 2.6.10 Linux kernel. On a single table with 15 columns (the Bizgres IVP) at a size double memory (2.12GB), Postgres 8.0.3 with Bizgres enhancements takes 32 seconds to scan the table: that's 66 MB/s. Not the efficiency I'd hope from the onboard SATA controller that I'd like, I would have expected to get 85% of the 100MB/s raw read performance. So that's $1,200 / 66 MB/s (without adjusting for 2003 price versus now) = 18.2 $/MB/s Raw data: [EMAIL PROTECTED] IVP]$ cat scan.sh #!/bin/bash time psql -c select count(*) from ivp.bigtable1 dgtestdb [EMAIL PROTECTED] IVP]$ cat sysout1 count -- 1000 (1 row) real0m32.565s user0m0.002s sys 0m0.003s Size of the table data: [EMAIL PROTECTED] IVP]$ du -sk dgtestdb/base 2121648 dgtestdb/base System B: This system is running an XFS filesystem, and has been tuned to use very large (16MB) readahead. It's running the Centos 4.1 distro, which uses a Linux 2.6.9 kernel. Same test as above, but with 17GB of data takes 69.7 seconds to scan (!) That's 244.2MB/s, which is obviously double my earlier point of 110-120MB/s. This system is running with a 16MB Linux readahead setting, let's try it with the default (I think) setting of 256KB – AHA! Now we get 171.4 seconds or 99.3MB/s. So, using the tuned setting of blockdev —setra 16384 we get $6,000 / 244MB/s = 24.6 $/MB/s If we use the default Linux setting it's 2.5x worse. Raw data: [EMAIL PROTECTED] IVP]$ cat scan.sh #!/bin/bash time psql -c select count(*) from ivp.bigtable1 dgtestdb [EMAIL PROTECTED] IVP]$ cat sysout3 count -- 8000 (1 row) real1m9.875s user0m0.000s sys 0m0.004s [EMAIL PROTECTED] IVP]$ !du du -sk dgtestdb/base 17021260dgtestdb/base Summary: cough, cough OK – you can get more I/O bandwidth out of the current I/O path for sequential scan if you tune the filesystem for large readahead. This is a cheap alternative to overhauling the executor to use asynch I/O. Still, there is a CPU limit here – this is not I/O bound, it is CPU limited as evidenced by the sensitivity to readahead settings. If the filesystem could do 1GB/s, you wouldn't go any faster than 244MB/s. - Luke ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke Lonergan wrote: Alan, On 11/18/05 9:31 AM, Alan Stange [EMAIL PROTECTED] wrote: Here's the output from one iteration of iostat -k 60 while the box is doing a select count(1) on a 238GB table. avg-cpu: %user %nice%sys %iowait %idle 0.990.00 17.97 32.40 48.64 Device:tpskB_read/skB_wrtn/skB_readkB_wrtn sdd 345.95130732.53 0.007843952 0 We're reading 130MB/s for a full minute. About 20% of a single cpu was being used. The remainder being idle. Cool - thanks for the results. Is that % of one CPU, or of 2? Was the system otherwise idle? Actually, this was dual cpu and there was other activity during the full minute, but it was on other file devices, which I didn't include in the above output. Given that, and given what I see on the box now I'd raise the 20% to 30% just to be more conservative. It's all in the kernel either way; using a different scheduler or file system would change that result. Even better would be using direct IO to not flush everything else from memory and avoid some memory copies from kernel to user space. Note that almost none of the time is user time. Changing postgresql won't change the cpu useage. One IMHO obvious improvement would be to have vacuum and analyze only do direct IO. Now they appear to be very effective memory flushing tools. Table scans on tables larger than say 4x memory should probably also use direct IO for reads. We've done nothing fancy and achieved results you claim shouldn't be possible. This is a system that was re-installed yesterday, no tuning was done to the file systems, kernel or storage array. Are you happy with 130MB/s? How much did you pay for that? Is it more than $2,000, or double my 2003 PC? I don't know what the system cost. It was part of block of dual opterons from Sun that we got some time ago. I think the 130MB/s is slow given the hardware, but it's acceptable. I'm not too price sensitive; I care much more about reliability, uptime, etc. What am I doing wrong? 9 years ago I co-designed a petabyte data store with a goal of 1GB/s IO (for a DOE lab). And now I don't know what I'm doing, Cool. Would that be Sandia? We routinely sustain 2,000 MB/s from disk on 16x 2003 era machines on complex queries. Disk?! 4 StorageTek tape silos. That would be .002 TB/s. One has to change how you think when you have that much data. And hope you don't have a fire, because there's no backup. That work was while I was at BNL. I believe they are now at 4PB of tape and 150TB of disk. -- Alan ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Hardware/OS recommendations for large databases (
Alan, On 11/18/05 10:30 AM, Alan Stange [EMAIL PROTECTED] wrote: Actually, this was dual cpu and there was other activity during the full minute, but it was on other file devices, which I didn't include in the above output. Given that, and given what I see on the box now I'd raise the 20% to 30% just to be more conservative. It's all in the kernel either way; using a different scheduler or file system would change that result. Even better would be using direct IO to not flush everything else from memory and avoid some memory copies from kernel to user space. Note that almost none of the time is user time. Changing postgresql won't change the cpu useage. These are all things that help on the IO wait side possibly, however, there is a producer/consumer problem in postgres that goes something like this: - Read some (small number of, sometimes 1) 8k pages - Do some work on those pages, including lots of copies - repeat This back and forth without threading (like AIO, or a multiprocessing executor) causes cycling and inefficiency that limits throughput. Optimizing some of the memcopies and other garbage out, plus increasing the internal (postgres) readahead would probably double the disk bandwidth. But to be disk-bound (meaning that the disk subsystem is running at full speed), requires asynchronous I/O. We do this now with Bizgres MPP, and we get fully saturated disk channels on every machine. That means that even on one machine, we run many times faster than non-MPP postgres. One IMHO obvious improvement would be to have vacuum and analyze only do direct IO. Now they appear to be very effective memory flushing tools. Table scans on tables larger than say 4x memory should probably also use direct IO for reads. That's been suggested many times prior - I agree, but this also needs AIO to be maximally effective. I don't know what the system cost. It was part of block of dual opterons from Sun that we got some time ago. I think the 130MB/s is slow given the hardware, but it's acceptable. I'm not too price sensitive; I care much more about reliability, uptime, etc. Then I know what they cost - we have them too (V20z and V40z). You should be getting 400MB/s+ with external RAID. What am I doing wrong? 9 years ago I co-designed a petabyte data store with a goal of 1GB/s IO (for a DOE lab). And now I don't know what I'm doing, Cool. Would that be Sandia? We routinely sustain 2,000 MB/s from disk on 16x 2003 era machines on complex queries. Disk?! 4 StorageTek tape silos. That would be .002 TB/s. One has to change how you think when you have that much data. And hope you don't have a fire, because there's no backup. That work was while I was at BNL. I believe they are now at 4PB of tape and 150TB of disk. We had 1.5 Petabytes on 2 STK Silos at NAVO from 1996-1998 where I ran RD. We also had a Cray T932 an SGI Origin 3000 with 256 CPUs, a Cray T3E with 1280 CPUs, 2 Cray J916s with 1 TB of shared disk, a Cray C90-16, a Sun E10K, etc etc, along with clusters of Alpha machines and lots of SGIs. It's nice to work with a $40M annual budget. Later, working with FSL we implemented a weather forecasting cluster that ultimately became the #5 fastest computer on the TOP500 supercomputing list from 512 Alpha cluster nodes. That machine had a 10-way shared SAN, tape robotics and a Myrinet interconnect and ran 64-bit Linux (in 1998). - Luke ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Hardware/OS recommendations for large databases (
Alan Stange [EMAIL PROTECTED] writes: Luke Lonergan wrote: Alan, On 11/18/05 9:31 AM, Alan Stange [EMAIL PROTECTED] wrote: Here's the output from one iteration of iostat -k 60 while the box is doing a select count(1) on a 238GB table. avg-cpu: %user %nice%sys %iowait %idle 0.990.00 17.97 32.40 48.64 Device:tpskB_read/skB_wrtn/skB_readkB_wrtn sdd 345.95130732.53 0.007843952 0 We're reading 130MB/s for a full minute. About 20% of a single cpu was being used. The remainder being idle. Cool - thanks for the results. Is that % of one CPU, or of 2? Was the system otherwise idle? Actually, this was dual cpu I hate to agree with him but that looks like a dual machine with one CPU pegged. Yes most of the time is being spent in the kernel, but you're still basically cpu limited. That said, 130MB/s is nothing to sneeze at, that's maxing out two high end drives and quite respectable for a 3-disk stripe set, even reasonable for a 4-disk stripe set. If you're using 5 or more disks in RAID-0 or RAID 1+0 and only getting 130MB/s then it does seem likely the cpu is actually holding you back here. Still it doesn't show Postgres being nearly so CPU wasteful as the original poster claimed. It's all in the kernel either way; using a different scheduler or file system would change that result. Even better would be using direct IO to not flush everything else from memory and avoid some memory copies from kernel to user space. Note that almost none of the time is user time. Changing postgresql won't change the cpu useage. Well changing to direct i/o would still be changing Postgres so that's unclear. And there are plenty of more mundane ways that Postgres is responsible for how efficiently or not the kernel is used. Just using fewer syscalls to do the same amount of reading would reduce cpu consumption. One IMHO obvious improvement would be to have vacuum and analyze only do direct IO. Now they appear to be very effective memory flushing tools. Table scans on tables larger than say 4x memory should probably also use direct IO for reads. -- greg ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Hardware/OS recommendations for large databases (
Greg, On 11/18/05 11:07 AM, Greg Stark [EMAIL PROTECTED] wrote: That said, 130MB/s is nothing to sneeze at, that's maxing out two high end drives and quite respectable for a 3-disk stripe set, even reasonable for a 4-disk stripe set. If you're using 5 or more disks in RAID-0 or RAID 1+0 and only getting 130MB/s then it does seem likely the cpu is actually holding you back here. With an FC array, it's undoubtedly more like 14 drives, in which case 130MB/s is laughable. On the other hand, I wouldn't be surprised if it were a single 200MB/s Fibre Channel attachment. It does make you wonder why people keep recommending 15K RPM drives, like it would help *not*. Still it doesn't show Postgres being nearly so CPU wasteful as the original poster claimed. It's partly about waste, and partly about lack of a concurrent I/O mechanism. We've profiled it for the waste, we've implemented concurrent I/O to prove the other point. It's all in the kernel either way; using a different scheduler or file system would change that result. Even better would be using direct IO to not flush everything else from memory and avoid some memory copies from kernel to user space. Note that almost none of the time is user time. Changing postgresql won't change the cpu useage. Well changing to direct i/o would still be changing Postgres so that's unclear. And there are plenty of more mundane ways that Postgres is responsible for how efficiently or not the kernel is used. Just using fewer syscalls to do the same amount of reading would reduce cpu consumption. Bingo. - Luke ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke Lonergan wrote: opterons from Sun that we got some time ago. I think the 130MB/s is slow given the hardware, but it's acceptable. I'm not too price sensitive; I care much more about reliability, uptime, etc. I don't know what the system cost. It was part of block of dual Then I know what they cost - we have them too (V20z and V40z). You should be getting 400MB/s+ with external RAID. Yes, but we don't. This is where I would normally begin a rant on how craptacular Linux can be at times. But, for the sake of this discussion, postgresql isn't reading the data any more slowly than does any other program. And we don't have the time to experiment with the box. I know it should be better, but it's good enough for our purposes at this time. -- Alan ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] Hardware/OS recommendations for large databases (
Greg Stark wrote: Alan Stange [EMAIL PROTECTED] writes: Luke Lonergan wrote: Alan, On 11/18/05 9:31 AM, Alan Stange [EMAIL PROTECTED] wrote: Here's the output from one iteration of iostat -k 60 while the box is doing a select count(1) on a 238GB table. avg-cpu: %user %nice%sys %iowait %idle 0.990.00 17.97 32.40 48.64 Device:tpskB_read/skB_wrtn/skB_readkB_wrtn sdd 345.95130732.53 0.007843952 0 We're reading 130MB/s for a full minute. About 20% of a single cpu was being used. The remainder being idle. Cool - thanks for the results. Is that % of one CPU, or of 2? Was the system otherwise idle? Actually, this was dual cpu I hate to agree with him but that looks like a dual machine with one CPU pegged. Yes most of the time is being spent in the kernel, but you're still basically cpu limited. That said, 130MB/s is nothing to sneeze at, that's maxing out two high end drives and quite respectable for a 3-disk stripe set, even reasonable for a 4-disk stripe set. If you're using 5 or more disks in RAID-0 or RAID 1+0 and only getting 130MB/s then it does seem likely the cpu is actually holding you back here. Still it doesn't show Postgres being nearly so CPU wasteful as the original poster claimed. Yes and no. The one cpu is clearly idle. The second cpu is 40% busy and 60% idle (aka iowait in the above numbers). Of that 40%, other things were happening as well during the 1 minute snapshot. During some iostat outputs that I didn't post the cpu time was ~ 20%. So, you can take your pick. The single cpu usage is somewhere between 20% and 40%. As I can't remove other users of the system, it's the best measurement that I can make right now. Either way, it's not close to being cpu bound. This is with Opteron 248, 2.2Ghz cpus. Note that the storage system has been a bit disappointing: it's an IBM Fast T600 with a 200MB/s fiber attachment. It could be better, but it's not been the bottleneck in our work, so we haven't put any energy into it. It's all in the kernel either way; using a different scheduler or file system would change that result. Even better would be using direct IO to not flush everything else from memory and avoid some memory copies from kernel to user space. Note that almost none of the time is user time. Changing postgresql won't change the cpu useage. Well changing to direct i/o would still be changing Postgres so that's unclear. And there are plenty of more mundane ways that Postgres is responsible for how efficiently or not the kernel is used. Just using fewer syscalls to do the same amount of reading would reduce cpu consumption. Absolutely. This is why we're using a 32KB block size and also switched to using O_SYNC for the WAL syncing method. That's many MB/s that don't need to be cached in the kernel (thus evicting other data), and we avoid all the fysnc/fdatasync syscalls. The purpose of direct IO isn't to make the vacuum or analyze faster, but to lessen their impact on queries with someone waiting for the results. That's our biggest hit: running a sequential scan on 240GB of data and flushing everything else out of memory. Now that I'm think about this a bit, a big chunk of time is probably being lost in TLB misses and other virtual memory events that would be avoided if a larger page size was being used. -- Alan ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Hardware/OS recommendations for large databases (
Title: Re: [PERFORM] Hardware/OS recommendations for large databases ( Vivek, On 11/18/05 8:07 AM, Vivek Khera [EMAIL PROTECTED] wrote: On Nov 18, 2005, at 10:13 AM, Luke Lonergan wrote: Still, there is a CPU limit here this is not I/O bound, it is CPU limited as evidenced by the sensitivity to readahead settings. If the filesystem could do 1GB/s, you wouldnt go any faster than 244MB/s. Yeah, and mysql would probably be faster on your trivial queries. Try concurrent large joins and updates and see which system is faster. Thats what we do to make a living. And its Oracle that a lot faster because they implemented a much tighter, optimized I/O path to disk than Postgres. Since you asked, we bought the 5 systems as a cluster and with Bizgres MPP we get close to 400MB/s per machine on complex queries. - Luke
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke,Interesting numbers. I'm a little concerned about the use of blockdev —setra 16384. If I understand this correctly it assumes that the table is contiguous on the disk does it not ?DaveOn 18-Nov-05, at 10:13 AM, Luke Lonergan wrote: Dave, On 11/18/05 5:00 AM, "Dave Cramer" [EMAIL PROTECTED] wrote:Now there's an interesting line drawn in the sand. I presume you have numbers to back this up ?This should draw some interesting posts. Part 2: The answer System A: This system is running RedHat 3 Update 4, with a Fedora 2.6.10 Linux kernel. On a single table with 15 columns (the Bizgres IVP) at a size double memory (2.12GB), Postgres 8.0.3 with Bizgres enhancements takes 32 seconds to scan the table: that’s 66 MB/s. Not the efficiency I’d hope from the onboard SATA controller that I’d like, I would have expected to get 85% of the 100MB/s raw read performance. So that’s $1,200 / 66 MB/s (without adjusting for 2003 price versus now) = 18.2 $/MB/s Raw data: [EMAIL PROTECTED] IVP]$ cat scan.sh #!/bin/bash time psql -c "select count(*) from ivp.bigtable1" dgtestdb [EMAIL PROTECTED] IVP]$ cat sysout1 count -- 1000 (1 row) real 0m32.565s user 0m0.002s sys 0m0.003s Size of the table data: [EMAIL PROTECTED] IVP]$ du -sk dgtestdb/base 2121648 dgtestdb/base System B: This system is running an XFS filesystem, and has been tuned to use very large (16MB) readahead. It’s running the Centos 4.1 distro, which uses a Linux 2.6.9 kernel. Same test as above, but with 17GB of data takes 69.7 seconds to scan (!) That’s 244.2MB/s, which is obviously double my earlier point of 110-120MB/s. This system is running with a 16MB Linux readahead setting, let’s try it with the default (I think) setting of 256KB – AHA! Now we get 171.4 seconds or 99.3MB/s. So, using the tuned setting of “blockdev —setra 16384” we get $6,000 / 244MB/s = 24.6 $/MB/s If we use the default Linux setting it’s 2.5x worse. Raw data: [EMAIL PROTECTED] IVP]$ cat scan.sh #!/bin/bash time psql -c "select count(*) from ivp.bigtable1" dgtestdb [EMAIL PROTECTED] IVP]$ cat sysout3 count -- 8000 (1 row) real 1m9.875s user 0m0.000s sys 0m0.004s [EMAIL PROTECTED] IVP]$ !du du -sk dgtestdb/base 17021260 dgtestdb/base Summary: cough, cough OK – you can get more I/O bandwidth out of the current I/O path for sequential scan if you tune the filesystem for large readahead. This is a cheap alternative to overhauling the executor to use asynch I/O. Still, there is a CPU limit here – this is not I/O bound, it is CPU limited as evidenced by the sensitivity to readahead settings. If the filesystem could do 1GB/s, you wouldn’t go any faster than 244MB/s. - Luke
Re: [PERFORM] Hardware/OS recommendations for large databases
Breaking the ~120MBps pg IO ceiling by any means is an important result. Particularly when you get a ~2x improvement. I'm curious how far we can get using simple approaches like this. At 10:13 AM 11/18/2005, Luke Lonergan wrote: Dave, On 11/18/05 5:00 AM, Dave Cramer [EMAIL PROTECTED] wrote: Now there's an interesting line drawn in the sand. I presume you have numbers to back this up ? This should draw some interesting posts. Part 2: The answer System A: This system is running RedHat 3 Update 4, with a Fedora 2.6.10 Linux kernel. On a single table with 15 columns (the Bizgres IVP) at a size double memory (2.12GB), Postgres 8.0.3 with Bizgres enhancements takes 32 seconds to scan the table: thats 66 MB/s. Not the efficiency Id hope from the onboard SATA controller that Id like, I would have expected to get 85% of the 100MB/s raw read performance. Have you tried the large read ahead trick with this system? It would be interesting to see how much it would help. It might even be worth it to do the experiment at all of [default, 2x default, 4x default, 8x default, etc] read ahead until either a) you run out of resources to support the desired read ahead, or b) performance levels off. I can imagine the results being very enlightening. System B: This system is running an XFS filesystem, and has been tuned to use very large (16MB) readahead. Its running the Centos 4.1 distro, which uses a Linux 2.6.9 kernel. Same test as above, but with 17GB of data takes 69.7 seconds to scan (!) Thats 244.2MB/s, which is obviously double my earlier point of 110-120MB/s. This system is running with a 16MB Linux readahead setting, lets try it with the default (I think) setting of 256KB AHA! Now we get 171.4 seconds or 99.3MB/s. The above experiment would seem useful here as well. Summary: cough, cough OK you can get more I/O bandwidth out of the current I/O path for sequential scan if you tune the filesystem for large readahead. This is a cheap alternative to overhauling the executor to use asynch I/O. Still, there is a CPU limit here this is not I/O bound, it is CPU limited as evidenced by the sensitivity to readahead settings. If the filesystem could do 1GB/s, you wouldnt go any faster than 244MB/s. - Luke I respect your honesty in reporting results that were different then your expectations or previously taken stance. Alan Stange's comment re: the use of direct IO along with your comments re: async IO and mem copies plus the results of these experiments could very well point us directly at how to most easily solve pg's CPU boundness during IO. [HACKERS] are you watching this? Ron ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke Lonergan wrote: (mass snippage) time psql -c select count(*) from ivp.bigtable1 dgtestdb [EMAIL PROTECTED] IVP]$ cat sysout3 count -- 8000 (1 row) real1m9.875s user0m0.000s sys 0m0.004s [EMAIL PROTECTED] IVP]$ !du du -sk dgtestdb/base 17021260dgtestdb/base Summary: cough, cough OK – you can get more I/O bandwidth out of the current I/O path for sequential scan if you tune the filesystem for large readahead. This is a cheap alternative to overhauling the executor to use asynch I/O. Still, there is a CPU limit here – this is not I/O bound, it is CPU limited as evidenced by the sensitivity to readahead settings. If the filesystem could do 1GB/s, you wouldn’t go any faster than 244MB/s. Luke, Interesting - but possibly only representative for a workload consisting entirely of one executor doing SELECT ... FROM my_single_table. If you alter this to involve more complex joins (e.g 4. way star) and (maybe add a small number of concurrent executors too) - is it still the case? Cheers Mark ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] Hardware/OS recommendations for large databases (
Mark, On 11/18/05 3:46 PM, Mark Kirkwood [EMAIL PROTECTED] wrote: If you alter this to involve more complex joins (e.g 4. way star) and (maybe add a small number of concurrent executors too) - is it still the case? 4-way star, same result, that's part of my point. With Bizgres MPP, the 4-way star uses 4 concurrent scanners, though not all are active all the time. And that's per segment instance - we normally use one segment instance per CPU, so our concurrency is NCPUs plus some. The trick is the small number of concurrent executors part. The only way to get this with normal postgres is to have concurrent users, and normally they are doing different things, scanning different parts of the disk. These are competing things, and for concurrency enhancement something like sync scan would be an effective optimization. But in reporting, business analytics and warehousing in general, there are reports that take hours to run. If you can knock that down by factors of 10 using parallelism, it's a big win. That's the reason that Teradata did $1.5 Billion in business last year. More importantly - that's the kind of work that everyone using internet data for analytics wants right now. - Luke ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Hardware/OS recommendations for large databases (
Title: Re: [PERFORM] Hardware/OS recommendations for large databases ( Mark, On 11/18/05 3:46 PM, Mark Kirkwood [EMAIL PROTECTED] wrote: If you alter this to involve more complex joins (e.g 4. way star) and (maybe add a small number of concurrent executors too) - is it still the case? I may not have listened to you - are you asking about whether the readahead works for these cases? Ill be running some massive TPC-H benchmarks on these machines soon well see then. - Luke
Re: [PERFORM] Hardware/OS recommendations for large databases (
Luke Lonergan wrote: Mark, On 11/18/05 3:46 PM, Mark Kirkwood [EMAIL PROTECTED] wrote: If you alter this to involve more complex joins (e.g 4. way star) and (maybe add a small number of concurrent executors too) - is it still the case? I may not have listened to you - are you asking about whether the readahead works for these cases? I’ll be running some massive TPC-H benchmarks on these machines soon – we’ll see then. That too, meaning the business of 1 executor random reading a given relation file whilst another is sequentially scanning (some other) part of it Cheers Mark ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] Hardware/OS recommendations for large databases (
Welty, Richard wrote: David Boreham wrote: I guess I've never bought into the vendor story that there are two reliability grades. Why would they bother making two different kinds of bearing, motor etc ? Seems like it's more likely an excuse to justify higher prices. then how to account for the fact that bleeding edge SCSI drives turn at twice the rpms of bleeding edge consumer drives? The motors spin twice as fast? I'm pretty sure the original comment was based on drives w/ similar specs. E.g. 7200RPM enterprise drives versus 7200RPM consumer drives. Next time one of my 7200RPM SCSIs fail, I'll take it apart and compare the insides to an older 7200RPM IDE from roughly the same era. ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Hardware/OS recommendations for large databases (5TB)
On Nov 16, 2005, at 4:50 PM, Claus Guttesen wrote: I'm (also) FreeBSD-biased but I'm not shure whether the 5 TB fs will work so well if tools like fsck are needed. Gvinum could be one option but I don't have any experience in that area. Then look into an external filer and mount via NFS. Then it is not FreeBSD's responsibility to manage the volume. ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Hardware/OS recommendations for large databases (
On 11/16/05, William Yu [EMAIL PROTECTED] wrote: Alex Turner wrote: Spend a fortune on dual core CPUs and then buy crappy disks... I bet for most applications this system will be IO bound, and you will see a nice lot of drive failures in the first year of operation with consumer grade drives. Spend your money on better Disks, and don't bother with Dual Core IMHO unless you can prove the need for it. I would say the opposite -- you always want Dual Core nowadays. DC Opterons simply give you better bang for the buck than single core Opterons. Price out a 1xDC system against a 2x1P system -- the 1xDC will be cheaper. Do the same for 2xDC versus 4x1P, 4xDC versus 8x1P, 8xDC versus 16x1P, etc. -- DC gets cheaper by wider and wider margins because those mega-CPU motherboards are astronomically expensive. Opteron 242 - $178.00 Opteron 242 - $178.00 Tyan S2882 - $377.50 Total: $733.50 Opteron 265 - $719.00 Tyan K8E - $169.00 Total: $888.00 Tyan K8E - doesn't have any PCI-X, so it's also not apples to apples. Infact I couldn't find a single CPU slot board that did, so you pretty much have to buy a dual CPU board to get PCI-X. 1xDC is _not_ cheaper. Our DB application does about 5 queries/second peak, plus a heavy insert job once per day. We only _need_ two CPUs, which is true for a great many DB applications. Unless you like EJB of course, which will thrash the crap out of your system. Consider the two most used regions for DBs: a) OLTP - probably IO bound, large number of queries/sec updating info on _disks_, not requiring much CPU activity except to retrieve item infomration which is well indexed and normalized. b) Data wharehouse - needs CPU, but probably still IO bound, large data set that won't fit in RAM will required large amounts of disk reads. CPU can easily keep up with disk reads. I have yet to come across a DB system that wasn't IO bound. DC also gives you a better upgrade path. Let's say you do testing and figure 2x246 is the right setup to handle the load. Well instead of getting 2x1P, use the same 2P motherboard but only populate 1 CPU w/ a DC/270. Now you have a server that can be upgraded to +80% more CPU by popping in another DC/270 versus throwing out the entire thing to get a 4x1P setup. No argument there. But it's pointless if you are IO bound. The only questions would be: (1) Do you need a SMP server at all? I'd claim yes -- you always need 2+ cores whether it's DC or 2P to avoid IO interrupts blocking other processes from running. At least 2CPUs is always good for precisely those reasons. More than 2CPUs gives diminishing returns. (2) Does a DC system perform better than it's Nx1P cousin? My experience is yes. Did some rough tests in a drop-in-replacement 1x265 versus 2x244 and saw about +10% for DC. All the official benchmarks (Spec, Java, SAP, etc) from AMD/Sun/HP/IBM show DCs outperforming the Nx1P setups. Maybe true, but the 265 does have a 25% faster FSB than the 244, which might perhaps play a role. (3) Do you need an insane amount of memory? Well here's the case where the more expensive motherboard will serve you better since each CPU slot has its own bank of memory. Spend more money on memory, get cheaper single-core CPUs. Remember - large DB is going to be IO bound. Memory will get thrashed for file block buffers, even if you have large amounts, it's all gonna be cycled in and out again. Of course, this doesn't apply if you are an Intel/Dell-only shop. Xeon DCs, while cheaper than their corresponding single-core SMPs, don't have the same performance profile of Opteron DCs. Basically, you're paying a bit extra so your server can generate a ton more heat. Dell/Xeon/Postgres is just a bad combination any day of the week ;) Alex. ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Hardware/OS recommendations for large databases (
Just pick up a SCSI drive and a consumer ATA drive. Feel their weight. You don't have to look inside to tell the difference. Alex On 11/16/05, David Boreham [EMAIL PROTECTED] wrote: I suggest you read this on the difference between enterprise/SCSI and desktop/IDE drives: http://www.seagate.com/content/docs/pdf/whitepaper/D2c_More_than_Interface_ATA_vs_SCSI_042003.pdf This is exactly the kind of vendor propaganda I was talking about and it proves my point quite well : that there's nothing specific relating to reliability that is different between SCSI and SATA drives cited in that paper. It does have a bunch of FUD such as 'oh yeah we do a lot more drive characterization during manufacturing'. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Hardware/OS recommendations for large databases (
On 11/16/05, Joshua D. Drake [EMAIL PROTECTED] wrote: The only questions would be: (1) Do you need a SMP server at all? I'd claim yes -- you always need 2+ cores whether it's DC or 2P to avoid IO interrupts blocking other processes from running. I would back this up. Even for smaller installations (single raid 1, 1 gig of ram). Why? Well because many applications are going to be CPU bound. For example we have a PHP application that is a CMS. On a single CPU machine, RAID 1 it takes about 300ms to deliver a single page, point to point. We are not IO bound. So what happens is that under reasonable load we are actually waiting for the CPU to process the code. This is the performance profile for PHP, not for Postgresql. This is the postgresql mailing list. A simple upgrade to an SMP machine literally doubles our performance because we are still not IO bound. I strongly suggest that everyone use at least a single dual core because of this experience. Performance of PHP, not postgresql. (3) Do you need an insane amount of memory? Well here's the case where the more expensive motherboard will serve you better since each CPU slot has its own bank of memory. Spend more money on memory, get cheaper single-core CPUs. Agreed. A lot of times the slowest dual-core is 5x what you actually need. So get the slowest, and bulk up on memory. If nothing else memory is cheap today and it might not be tomorrow. [snip] Running postgresql on a single drive RAID 1 with PHP on the same machine is not a typical installation. 300ms for PHP in CPU time? wow dude - that's quite a page. PHP typical can handle up to 30-50 pages per second for a typical OLTP application on a single CPU box. Something is really wrong with that system if it takes 300ms per page. Alex. ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Hardware/OS recommendations for large databases (
Remember - large DB is going to be IO bound. Memory will get thrashed for file block buffers, even if you have large amounts, it's all gonna be cycled in and out again. 'fraid I have to disagree here. I manage ERP systems for manufacturing companies of various sizes. My systems are all completely cpu bound...even though the larger database are well into two digit gigabyte sizes, the data turnover while huge is relatively constrained and well served by the O/S cache. OTOH, query latency is a *huge* factor and we do everything possible to lower it. Even if the cpu is not 100% loaded, faster processors make the application 'feel' faster to the client. Merlin ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] Hardware/OS recommendations for large databases (
Alex Turner wrote: Opteron 242 - $178.00 Opteron 242 - $178.00 Tyan S2882 - $377.50 Total: $733.50 Opteron 265 - $719.00 Tyan K8E - $169.00 Total: $888.00 You're comparing the wrong CPUs. The 265 is the 2x of the 244 so you'll have to bump up the price more although not enough to make a difference. Looks like the price of the 2X MBs have dropped since I last looked at it. Just a few months back, Tyan duals were $450-$500 which is what I was basing my priced less statement from. Tyan K8E - doesn't have any PCI-X, so it's also not apples to apples. Infact I couldn't find a single CPU slot board that did, so you pretty much have to buy a dual CPU board to get PCI-X. You can get single CPU boards w/ PCIe and use PCIe controller cards. Probably expensive right now because they're so bleeding-edge new but definitely on the downswing. a) OLTP - probably IO bound, large number of queries/sec updating info on _disks_, not requiring much CPU activity except to retrieve item infomration which is well indexed and normalized. Not in my experience. I find on our OLTP servers, we run 98% in RAM and hence are 100% CPU-bound. Our DB is about 50GB in size now, servers run w/ 8GB of RAM. We were *very* CPU limited running 2x244. During busy hours of the day, our avg user transaction time were jumping from 0.8sec to 1.3+sec. Did the 2x265 and now we're always in the 0.7sec to 0.9sec range. DC also gives you a better upgrade path. Let's say you do testing and figure 2x246 is the right setup to handle the load. Well instead of getting 2x1P, use the same 2P motherboard but only populate 1 CPU w/ a DC/270. Now you have a server that can be upgraded to +80% more CPU by popping in another DC/270 versus throwing out the entire thing to get a 4x1P setup. No argument there. But it's pointless if you are IO bound. Why would you just accept we're IO bound, nothing we can do? I'd do everything in my power to make my app go from IO bound to CPU bound -- whether by optimizing my code or buying more hardware. I can tell you if our OLTP servers were IO bound, it would run like crap. Instead of 1 sec, we'd be looking at 5-10 seconds per user transaction and our users would be screaming bloody murder. In theory, you can always convert your IO bound DB to CPU bound by stuffing more and more RAM into your server. (Or partitioning the DB across multiple servers.) Whether it's cost effective depends on the DB and how much your users are paying you -- and that's a case-by-case analysis. Not a global statement of IO-bound, pointless. (2) Does a DC system perform better than it's Nx1P cousin? My experience is yes. Did some rough tests in a drop-in-replacement 1x265 versus 2x244 and saw about +10% for DC. All the official benchmarks (Spec, Java, SAP, etc) from AMD/Sun/HP/IBM show DCs outperforming the Nx1P setups. Maybe true, but the 265 does have a 25% faster FSB than the 244, which might perhaps play a role. Nope. There's no such thing as FSB on Opterons. On-die memory controller runs @ CPU speed and hence connects at whatever the memory runs at (rounded off to some multiplier math). There's the HT speed that controls the max IO bandwidth but that's based on the motherboard, not the CPU. Plus the 265 and 244 both run at 1.8Ghz so the memory multiplier HT IO are both the same. ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Hardware/OS recommendations for large databases (
On 11/17/05, William Yu [EMAIL PROTECTED] wrote: No argument there.But it's pointless if you are IO bound.Why would you just accept we're IO bound, nothing we can do? I'd doeverything in my power to make my app go from IO bound to CPU bound -- whether by optimizing my code or buying more hardware. I can tell you ifour OLTP servers were IO bound, it would run like crap. Instead of 1sec, we'd be looking at 5-10 seconds per user transaction and our users would be screaming bloody murder.In theory, you can always convert your IO bound DB to CPU bound bystuffing more and more RAM into your server. (Or partitioning the DBacross multiple servers.) Whether it's cost effective depends on the DB and how much your users are paying you -- and that's a case-by-caseanalysis. Not a global statement of IO-bound, pointless. We all want our systems to be CPU bound, but it's not always possible. Remember, he is managing a 5 TB Databse. That's quite a bit different than a 100 GB or even 500 GB database.