Re: [PERFORM] RAID stripe size question
WRT seek performance, we're doing 2500 seeks per second on the Sun/Thumper on 36 disks. Luke, Have you had time to run benchmarksql against it yet? I'm just curious about the IO seeks/s vs. transactions/minute correlation... /Mikael ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] RAID stripe size question
On 8/3/06, Luke Lonergan [EMAIL PROTECTED] wrote:

Merlin: moving a gigabyte around/sec on the server, attached or no, is pretty heavy lifting on x86 hardware.

Maybe so, but we're doing 2GB/s plus on Sun/Thumper with software RAID and 36 disks, and 1GB/s on a HW RAID with 16 disks, all SATA.

That is pretty amazing; it works out to 55 MB/sec/drive, close to theoretical maximums. Are you using a PCI-E SATA controller and Raptors, I'm guessing? This is doubly impressive if we are talking RAID 5 here. Do you find that software RAID is generally better than hardware at the high end? How much does this tax the CPU?

WRT seek performance, we're doing 2500 seeks per second on the Sun/Thumper on 36 disks. You might do better with 15K RPM disks and great controllers, but I haven't seen it reported yet.

That's pretty amazing too. Only a highly optimized RAID system can pull this off.

BTW - I'm curious about the HP P600 SAS host based RAID controller - it has very good specs, but is the Linux driver solid?

No clue. I sure hope I don't go through the same headaches as with the IBM SCSI drivers (rebranded Adaptec, btw). SAS looks really promising, however. The Adaptec SAS gear is so cheap it might be worth it to just buy some and see what it can do.

merlin
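Merlin's 55 MB/sec/drive figure can be checked with a quick back-of-the-envelope sketch (assuming decimal units, 1 GB = 1000 MB; the function name is just for illustration):

```python
# Sanity check of the throughput figures quoted above: aggregate array
# throughput divided evenly across spindles, assuming 1 GB = 1000 MB.
def per_drive_mb_s(total_mb_s, drives):
    return total_mb_s / drives

thumper = per_drive_mb_s(2000, 36)   # software RAID, 36 SATA disks
hw_raid = per_drive_mb_s(1000, 16)   # hardware RAID, 16 SATA disks
print(f"{thumper:.1f} MB/s/drive, {hw_raid:.1f} MB/s/drive")
# prints: 55.6 MB/s/drive, 62.5 MB/s/drive
```

Both per-drive rates sit near the sequential maximum of SATA drives of that era, which is what makes the aggregate numbers notable.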
Re: [PERFORM] RAID stripe size question
On 7/18/06, Alex Turner [EMAIL PROTECTED] wrote:

Remember, when it comes to OLTP, massive serial throughput is not gonna help you; it's low seek times, which is why people still buy 15k RPM drives, and why you don't necessarily need a honking SAS/SATA controller which can harness the full 1066MB/sec of your PCI-X bus, or more for PCIe. Of course,

Hm. I'm starting to look seriously at SAS to take things to the next level. It's really not all that expensive, cheaper than SCSI even, and you can mix/match SATA/SAS drives in the better enclosures. The real wild card here is the RAID controller. I still think Raptors are the best bang for the buck, and SAS gives me everything I like about SATA and SCSI in one package. Moving a gigabyte around/sec on the server, attached or no, is pretty heavy lifting on x86 hardware.

merlin
Re: [PERFORM] RAID stripe size question
Merlin: moving a gigabyte around/sec on the server, attached or no, is pretty heavy lifting on x86 hardware.

Maybe so, but we're doing 2GB/s plus on Sun/Thumper with software RAID and 36 disks, and 1GB/s on a HW RAID with 16 disks, all SATA.

WRT seek performance, we're doing 2500 seeks per second on the Sun/Thumper on 36 disks. You might do better with 15K RPM disks and great controllers, but I haven't seen it reported yet.

BTW - I'm curious about the HP P600 SAS host based RAID controller - it has very good specs, but is the Linux driver solid?

- Luke
Re: [PERFORM] RAID stripe size question
From: Alex Turner [EMAIL PROTECTED]
Sent: Jul 18, 2006 12:21 AM
To: Ron Peacetree [EMAIL PROTECTED]
Cc: Mikael Carneholm [EMAIL PROTECTED], pgsql-performance@postgresql.org
Subject: Re: [PERFORM] RAID stripe size question

On 7/17/06, Ron Peacetree [EMAIL PROTECTED] wrote:

-Original Message-
From: Mikael Carneholm [EMAIL PROTECTED]
Sent: Jul 17, 2006 5:16 PM
To: Ron Peacetree [EMAIL PROTECTED], pgsql-performance@postgresql.org
Subject: RE: [PERFORM] RAID stripe size question

I use 90% of the raid cache for writes; don't think I could go higher than that. Too bad the Emulex only has 256MB though :/

If your RAID cache hit rates are in the 90+% range, you probably would find it profitable to make it greater. I've definitely seen access patterns that benefitted from increased RAID cache for any size I could actually install. For those access patterns, no amount of RAID cache commercially available was enough to find the flattening point of the cache percentage curve. 256MB of BB RAID cache per HBA is just not that much for many IO patterns.

90% as in 90% of the RAM, not 90% hit rate, I'm imagining.

Either way, =particularly= for OLTP-like I/O patterns, the more RAID cache the better, unless the IO pattern is completely random. In that case the best you can do is cache the entire sector map of the RAID set and use as many spindles as possible for the tables involved. I've seen high end setups in Fortune 2000 organizations that look like some of the things you read about on tpc.org: =hundreds= of HDs are used.

Clearly, completely random IO patterns are to be avoided whenever and however possible. Thankfully, most things can be designed to not have completely random IO, and stuff like WAL IO is definitely not random.

The important point here about cache size is that unless you make the cache large enough that you see a flattening in the cache behavior, you probably can still use more cache. Working sets are often very large for DB applications.
The controller is a FC2143 (http://h71016.www7.hp.com/dstore/MiddleFrame.asp?page=configProductLineId=450FamilyId=1449BaseId=17621oi=E9CEDBEID=19701SBLID=), which uses PCI-E. Don't know how it compares to other controllers; haven't had the time to search for / read any reviews yet.

This is a relatively low end HBA with 1 4Gb FC port on it. Max sustained IO on it is going to be ~320MBps, or ~enough for an 8 HD RAID 10 set made of 75MBps ASTR HDs. 28 such HDs are =definitely= IO choked on this HBA.

No, they aren't. This is OLTP, not data warehousing. I already posted the math for OLTP throughput, which is on the order of 8-80MB/second actual data throughput based on maximum theoretical seeks/second.

WAL IO patterns are not OLTP-like. Neither are most support or decision support IO patterns. Even in an OLTP system, there are usually only a few scenarios and tables where the IO pattern is pessimal. Alex is quite correct that those few will be the bottleneck on overall system performance if the system's primary function is OLTP-like. For those few, you dedicate as many spindles and as much RAID cache as you can afford and as show any performance benefit. I've seen an entire HBA maxed out with cache and as many HDs as would saturate the attainable IO rate dedicated to =1= table (unfortunately SSD was not a viable option in this case). The arithmetic suggests you need a better HBA, or more HBAs, or both.

WALs are basically appends that are written in bursts of your chosen log chunk size and that are almost never read afterwards. Big DB pages and big RAID stripes make sense for WALs.

Unless of course you are running OLTP, in which case a big stripe isn't necessary; spend the disks on your data partition, because your WAL activity is going to be small compared with your random IO. Or to put it another way, the scenarios and tables that have the most random-looking IO patterns are going to be the performance bottleneck on the whole system.
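One plausible reading of the 8-80 MB/s OLTP figure quoted above (this reconstruction is an assumption, not Alex's original posting): purely random I/O can move at most seeks-per-second times the block size read or written per seek.

```python
# Reconstruction (assumed) of the 8-80 MB/s OLTP range: random I/O
# throughput is bounded by seeks/second times the block size per seek.
def oltp_throughput_mb_s(seeks_per_s, block_kb):
    return seeks_per_s * block_kb / 1024

low = oltp_throughput_mb_s(1000, 8)     # 1000 seeks/s on 8 KB Postgres pages
high = oltp_throughput_mb_s(10000, 8)   # 10x the seek rate
print(low, high)  # ~7.8 to ~78 MB/s, i.e. roughly the 8-80 MB/s range
```

Either endpoint is far below the ~320 MB/s the HBA can sustain, which is the substance of the "No, they aren't" rebuttal.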
In an OLTP-like system, WAL IO is unlikely to be your biggest performance issue. As in any other performance tuning effort, you only gain by speeding up the current bottleneck.

According to http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html, it seems to be the other way around? ("As stripe size is decreased, files are broken into smaller and smaller pieces. This increases the number of drives that an average file will use to hold all the blocks containing the data of that file, theoretically increasing transfer performance, but decreasing positioning performance.") I guess I'll have to find out which theory holds by good ol' trial and error... :)

IME, stripe sizes of 64, 128, or 256 are the ones most commonly found to be optimal for most access patterns + SW + FS + OS + HW. New records will be posted at the end of a file, and will only increase the file by the number of blocks in the transactions posted at write time. Updated records
Re: [PERFORM] RAID stripe size question
This is a relatively low end HBA with 1 4Gb FC port on it. Max sustained IO on it is going to be ~320MBps, or ~enough for an 8 HD RAID 10 set made of 75MBps ASTR HDs.

Looking at http://h30094.www3.hp.com/product.asp?sku=2260908extended=1, I notice that the controller has an Ultra160 SCSI interface, which implies that the theoretical max throughput is 160MB/s. Ouch.

However, what's more important is the seeks/s - ~530/s on a 28 disk array is quite lousy compared to the 1400/s on a 12 x 15K disk array as mentioned by Mark here: http://archives.postgresql.org/pgsql-performance/2006-07/msg00170.php. Could be the disk RPM (10K vs 15K) that makes the difference here...

I will test another stripe size (128K) for the DATA lun (28 disks) to see what difference that makes. I think I read somewhere that Linux flushes blocks of 128K at a time, so it might be worth evaluating.

/Mikael
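Normalizing the two quoted figures per spindle makes the gap starker (the array names below are just labels; per-disk rates are derived, not from the original messages):

```python
# Per-spindle seek rates derived from the figures quoted above.
arrays = {
    "MSA1500, 28 x 10K": (530, 28),   # seeks/s, disk count
    "12 x 15K array": (1400, 12),
}
for name, (seeks, disks) in arrays.items():
    print(name, round(seeks / disks, 1), "seeks/s per disk")
# MSA1500: ~18.9 per disk; 15K array: ~116.7 per disk
```

A 10K-vs-15K RPM difference alone cannot account for a ~6x per-disk gap, which supports the suspicion that the controller, not the spindles, is the limiting factor here.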
Re: [PERFORM] RAID stripe size question
Mikael,

On 7/18/06 6:34 AM, Mikael Carneholm [EMAIL PROTECTED] wrote:

However, what's more important is the seeks/s - ~530/s on a 28 disk array is quite lousy compared to the 1400/s on a 12 x 15K disk array

I'm getting 2500 seeks/second on a 36 disk SATA software RAID (ZFS, Solaris 10) on a Sun X4500:

=== Single Stream

With a very recent update to the zfs module that improves I/O scheduling and prefetching, I get the following bonnie++ 1.03a results with a 36 drive RAID10, Solaris 10 U2 on an X4500 with 500GB Hitachi drives (zfs checksumming is off):

Version  1.03      --Sequential Output-- --Sequential Input- --Random-
                   -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine       Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
thumperdw-i-1  32G 120453  99 467814  98 290391  58 109371  99 993344  94  1801   4

                   --Sequential Create-- --Random Create--
                   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
             files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                16 +++++ +++ +++++ +++ +++++ +++ 30850  99 +++++ +++ +++++ +++

=== Two Streams

Bumping up the number of concurrent processes to 2, we get about 1.5x the read speed of RAID10 with a concurrent workload (you have to add the rates together):

Version  1.03      --Sequential Output-- --Sequential Input- --Random-
                   -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine       Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
thumperdw-i-1  32G 111441  95 212536  54 171798  51 106184  98 719472  88  1233   2

                   --Sequential Create-- --Random Create--
                   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
             files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                16 26085  90 +++++ +++  5700  98 21448  97 +++++ +++  4381  97

Version  1.03      --Sequential Output-- --Sequential Input- --Random-
                   -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine       Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
thumperdw-i-1  32G 116355  99 212509  54 171647  50 106112  98 715030  87  1274   3

                   --Sequential Create-- --Random Create--
                   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
             files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                16 26082  99 +++++ +++  5588  98 21399  88 +++++ +++  4272  97

So that's 2500 seeks per second, 1440MB/s sequential block read, 212MB/s per character sequential read.

===

- Luke
Re: [PERFORM] RAID stripe size question
This is a great testament to the fact that very often software RAID will seriously outperform hardware RAID, because the OS guys who implemented it took the time to do it right, as compared with some controller manufacturers who seem to think it's okay to provide sub-standard performance.

Based on the bonnie++ numbers coming back from your array, I would also encourage you to evaluate software RAID, as you might see significantly better performance as a result. RAID 10 is also a good candidate, as it's not so heavy on the cache and CPU as RAID 5.

Alex.

On 7/18/06, Luke Lonergan [EMAIL PROTECTED] wrote:

Mikael,

On 7/18/06 6:34 AM, Mikael Carneholm [EMAIL PROTECTED] wrote: However, what's more important is the seeks/s - ~530/s on a 28 disk array is quite lousy compared to the 1400/s on a 12 x 15K disk array

I'm getting 2500 seeks/second on a 36 disk SATA software RAID (ZFS, Solaris 10) on a Sun X4500:

[Luke's bonnie++ results, quoted in full in his message above]

So that's 2500 seeks per second, 1440MB/s sequential block read, 212MB/s per character sequential read.

- Luke
Re: [PERFORM] RAID stripe size question
On Tue, 2006-07-18 at 14:27, Alex Turner wrote:

This is a great testament to the fact that very often software RAID will seriously outperform hardware RAID, because the OS guys who implemented it took the time to do it right, as compared with some controller manufacturers who seem to think it's okay to provide sub-standard performance. Based on the bonnie++ numbers coming back from your array, I would also encourage you to evaluate software RAID, as you might see significantly better performance as a result. RAID 10 is also a good candidate, as it's not so heavy on the cache and CPU as RAID 5.

Also, consider testing a mix, where your hardware RAID controller does the mirroring and the OS stripes (RAID 0) over the top of it. I've gotten good performance from mediocre hardware cards doing this. It has the advantage of still being able to use the battery backed cache and its instant fsync, while not relying on cards that have issues layering RAID levels one atop the other.
Re: [PERFORM] RAID stripe size question
Have you done any experiments implementing RAID 50 this way (HBA does RAID 5, OS does RAID 0)? If so, what were the results?

Ron

-Original Message-
From: Scott Marlowe [EMAIL PROTECTED]
Sent: Jul 18, 2006 3:37 PM
To: Alex Turner [EMAIL PROTECTED]
Cc: Luke Lonergan [EMAIL PROTECTED], Mikael Carneholm [EMAIL PROTECTED], Ron Peacetree [EMAIL PROTECTED], pgsql-performance@postgresql.org
Subject: Re: [PERFORM] RAID stripe size question

Also, consider testing a mix, where your hardware RAID controller does the mirroring and the OS stripes (RAID 0) over the top of it. I've gotten good performance from mediocre hardware cards doing this. It has the advantage of still being able to use the battery backed cache and its instant fsync, while not relying on cards that have issues layering RAID levels one atop the other.
Re: [PERFORM] RAID stripe size question
Nope, haven't tried that. At the time I was testing this I didn't even think of trying it. I'm not even sure I'd heard of RAID 50 at the time... :)

I basically had an old MegaRAID 4xx series card in a dual PPro 200 and a stack of 6 9-gig hard drives. Spare parts. And even though the RAID 1+0 was relatively much faster on this hardware, the dual P IV 2800 with a pair of 15k USCSI drives and a much later model MegaRAID ate it for lunch with a single mirror set, and was plenty fast for our use at the time, so I never really had call to test it in production. But it definitely made our test server, the aforementioned PPro 200 machine, more livable.

On Tue, 2006-07-18 at 14:43, Ron Peacetree wrote:

Have you done any experiments implementing RAID 50 this way (HBA does RAID 5, OS does RAID 0)? If so, what were the results?

Ron
Re: [PERFORM] RAID stripe size question
According to http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html, it seems to be the other way around? ("As stripe size is decreased, files are broken into smaller and smaller pieces. This increases the number of drives that an average file will use to hold all the blocks containing the data of that file, theoretically increasing transfer performance, but decreasing positioning performance.")

Mikael,

In OLTP you utterly need the best possible latency. If you decompose the response time of a physical request, you will see that positioning performance plays the dominant role in the response time (ignore for a moment caches and their effects). So, if you need really good response times for your SQL queries, choose 15K rpm disks (and add as much cache as possible to magnify the effect ;) )

Best Regards,
Milen
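Milen's point about positioning dominating latency can be made concrete: average rotational latency is half a revolution, so it falls directly with spindle speed (seek time improves alongside it, as Ron notes later in the thread).

```python
# Average rotational latency is half a revolution, in milliseconds --
# one reason 15K rpm disks win on OLTP response time.
def avg_rotational_latency_ms(rpm):
    return 0.5 * 60_000 / rpm   # 60,000 ms per minute / revolutions per minute

for rpm in (7200, 10_000, 15_000):
    print(rpm, "rpm:", round(avg_rotational_latency_ms(rpm), 2), "ms")
# 7200 rpm: 4.17 ms, 10000 rpm: 3.0 ms, 15000 rpm: 2.0 ms
```

Add a 3-4 ms seek and the 10K-to-15K step shaves roughly a quarter off every random I/O, which compounds across the thousands of seeks/second an OLTP box performs.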
Re: [PERFORM] RAID stripe size question
With 18 disks dedicated to data, you could make 100/7*9 seeks/second (7ms avg seek time, 9 independent units), which is 128 seeks/second, writing on average 64kB of data, which is 4.1MB/sec throughput worst case, probably 10x best case, so ~40MB/sec - you might want to take more disks for your data and less for your WAL. Someone check my math here...

And as always - run benchmarks with your app to verify.

Alex.

On 7/16/06, Mikael Carneholm [EMAIL PROTECTED] wrote:

I have finally gotten my hands on the MSA1500 that we ordered some time ago. It has 28 x 10K 146GB drives, currently grouped as 10 (for WAL) + 18 (for data). There's only one controller (an Emulex), but I hope performance won't suffer too much from that. Raid level is 0+1, filesystem is ext3.

Now to the interesting part: would it make sense to use different stripe sizes on the separate disk arrays? In theory, a smaller stripe size (8-32K) should increase sequential write throughput at the cost of decreased positioning performance, which sounds good for WAL (assuming WAL is never searched during normal operation). And for disks holding the data, a larger stripe size (32K+) should provide for more concurrent (small) reads/writes at the cost of decreased raw throughput. This is with an OLTP type application in mind, so I'd rather have high transaction throughput than high sequential read speed. The interface is a 2Gb FC, so I'm throttled to (theoretically) 192MB/s anyway.

So, does this make sense? Has anyone tried it and seen any performance gains from it?

Regards,
Mikael.
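Taking up Alex's "check my math" invitation with a sketch: under his own assumptions (7 ms average seek, 9 independent mirrored units) the arithmetic comes out ten times higher than the posted 128 seeks/second — 100/7*9 looks like a slip for 1000/7*9 — though the conclusion about allocating spindles to data stands either way.

```python
# Redoing the arithmetic above: a 7 ms average seek allows 1000/7 ~ 143
# random operations/s per spindle, and 9 independent mirrored pairs
# can seek in parallel. 64 KB written per seek bounds the worst case.
def array_seeks_per_s(avg_seek_ms, independent_units):
    return 1000 / avg_seek_ms * independent_units

seeks = array_seeks_per_s(7, 9)
worst_case_mb_s = seeks * 64 / 1024
print(round(seeks), "seeks/s,", round(worst_case_mb_s, 1), "MB/s worst case")
# prints: 1286 seeks/s, 80.4 MB/s worst case
```

This ignores rotational latency and mirror-write overhead, both of which pull the real number back down, so treat it as an upper bound rather than a correction.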
Re: [PERFORM] RAID stripe size question
Yeah, it seems to be a waste of disk space (spindles as well?). I was unsure how much activity the WAL disks would have compared to the data disks, so I created an array from 10 disks, as the application is very write intense (many spindles / high throughput is crucial). I guess that a mirror of two disks is enough from a disk space perspective, but from a throughput perspective it will limit me to ~25MB/s (roughly calculated). A 0+1 array of 4 disks *could* be enough, but I'm still unsure how WAL activity correlates to normal data activity (is it 1:1, 1:2, 1:4, ...?)

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Michael Stone
Sent: den 17 juli 2006 02:04
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] RAID stripe size question

On Mon, Jul 17, 2006 at 12:52:17AM +0200, Mikael Carneholm wrote:

I have finally gotten my hands on the MSA1500 that we ordered some time ago. It has 28 x 10K 146GB drives, currently grouped as 10 (for wal) + 18 (for data). There's only one controller (an emulex), but I hope

You've got 1.4TB assigned to the WAL, which doesn't normally have more than a couple of gigs?

Mike Stone
Re: [PERFORM] RAID stripe size question
Hi, Mikael,

Mikael Carneholm wrote:

An 0+1 array of 4 disks *could* be enough, but I'm still unsure how WAL activity correlates to normal data activity (is it 1:1, 1:2, 1:4, ...?)

I think the main difference is that the WAL activity is mostly linear, whereas the normal data activity is rather random access. Thus, a mirror of a few disks (or, with good controller hardware, RAID 6 on 4 disks or so) for WAL should be enough to cope with a large set of data and index disks, which have a lot more time spent in seeking.

Btw, it may make sense to spread different tables, or tables and indices, onto different RAID sets, as you seem to have enough spindles.

And look into the commit_delay/commit_siblings settings; they allow you to trade latency for throughput (meaning a little more latency per transaction, but many more transactions per second throughput for the whole system).

HTH,
Markus

--
Markus Schaber | Logical Tracking & Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org
Re: [PERFORM] RAID stripe size question
I think the main difference is that the WAL activity is mostly linear, whereas the normal data activity is rather random access.

That was what I was expecting, and after reading http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html I figured that a different stripe size for the WAL set could be worth investigating. I have now dropped the old sets (10+18) and created two new RAID 1+0 sets (4 for WAL, 24 for data) instead. Bonnie++ is still running, but I'll post the numbers as soon as it has finished. I did actually use different stripe sizes for the sets as well: 8K for the WAL disks and 64K for the data. It's quite painless to do these things with HBAnywhere, so it's no big deal if I have to go back to another configuration. The battery backed cache only has 256MB though, and that bothers me; I assume a larger (512MB - 1GB) cache would make quite a difference. Oh well.

Btw, it may make sense to spread different tables, or tables and indices, onto different RAID sets, as you seem to have enough spindles.

This is something I'd also like to test, as a common best practice these days is to go for a SAME (stripe all, mirror everything) setup. From a development perspective it's easier to use SAME, as the developers won't have to think about physical location for new tables/indices, so if there's no performance penalty with SAME I'll gladly keep it that way.

And look into the commit_delay/commit_siblings settings; they allow you to trade latency for throughput (meaning a little more latency per transaction, but many more transactions per second throughput for the whole system).

In a previous test, using cd=5000 and cs=20 increased transaction throughput by ~20%, so I'll definitely fiddle with that in the coming tests as well.

Regards,
Mikael.
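For reference, the settings being discussed live in postgresql.conf; a fragment matching the test above (the values are the ones Mikael reports for his workload, not general recommendations):

```ini
# postgresql.conf -- group-commit settings from the test above
commit_delay = 5000       # microseconds to wait before flushing WAL,
                          # hoping to batch several commits into one fsync
commit_siblings = 20      # only delay when at least this many other
                          # transactions are concurrently active
```

The trade-off is exactly as Markus describes: each committing backend may wait up to commit_delay microseconds, but several commits can then share one WAL flush.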
Re: [PERFORM] RAID stripe size question
Hi, Mikael,

Mikael Carneholm wrote:

This is something I'd also like to test, as a common best practice these days is to go for a SAME (stripe all, mirror everything) setup. From a development perspective it's easier to use SAME, as the developers won't have to think about physical location for new tables/indices, so if there's no performance penalty with SAME I'll gladly keep it that way.

Usually it's not the developer's task to care about that, but the DBA's responsibility.

In a previous test, using cd=5000 and cs=20 increased transaction throughput by ~20%, so I'll definitely fiddle with that in the coming tests as well.

How many parallel transactions do you have?

Markus
Re: [PERFORM] RAID stripe size question
This is something I'd also like to test, as a common best practice these days is to go for a SAME (stripe all, mirror everything) setup. From a development perspective it's easier to use SAME, as the developers won't have to think about physical location for new tables/indices, so if there's no performance penalty with SAME I'll gladly keep it that way.

Usually it's not the developer's task to care about that, but the DBA's responsibility.

As we don't have a full-time dedicated DBA (although I'm the one who does most DBA-related tasks) I would aim for making physical location as transparent as possible; otherwise I'm afraid I won't be doing anything other than supporting developers with that - and I *do* have other things to do as well :)

In a previous test, using cd=5000 and cs=20 increased transaction throughput by ~20%.

How many parallel transactions do you have?

That was when running BenchmarkSQL (http://sourceforge.net/projects/benchmarksql) with 100 concurrent users (terminals), which I assume means 100 parallel transactions at most. The target application for this DB has 3-4 times as many concurrent connections, so it's possible that one would have to find other cs/cd numbers better suited for that scenario. Tweaking bgwriter is another task I'll look into as well.
Btw, here's the bonnie++ results from two different array sets (10+18, 4+24) on the MSA1500:

LUN: WAL, 10 disks, stripe size 32K
-----------------------------------
Version  1.03      --Sequential Output-- --Sequential Input- --Random-
                   -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine       Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
sesell01       32G  56139  93  73250  22  16530   3  30488  45  57489   5 477.3   1

                   --Sequential Create-- --Random Create--
                   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
             files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                16  2458  90 +++++ +++ +++++ +++  3121  99 +++++ +++ 10469  98

LUN: WAL, 4 disks, stripe size 8K
---------------------------------
Version  1.03      --Sequential Output-- --Sequential Input- --Random-
                   -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine       Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
sesell01       32G  49170  82  60108  19  13325   2  15778  24  21489   2 266.4   0

                   --Sequential Create-- --Random Create--
                   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
             files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                16  2432  86 +++++ +++ +++++ +++  3106  99 +++++ +++ 10248  98

LUN: DATA, 18 disks, stripe size 32K
------------------------------------
Version  1.03      --Sequential Output-- --Sequential Input- --Random-
                   -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine       Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
sesell01       32G  59990  97  87341  28  19158   4  30200  46  57556   6 495.4   1

                   --Sequential Create-- --Random Create--
                   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
             files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                16  1640  92 +++++ +++ +++++ +++  1736  99 +++++ +++ 10919  99

LUN: DATA, 24 disks, stripe size 64K
------------------------------------
Version  1.03      --Sequential Output-- --Sequential Input- --Random-
                   -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine       Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
sesell01       32G  59443  97 118515  39  25023   5  30926  49  60835   6 531.8   1

                   --Sequential Create-- --Random Create--
                   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
             files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                16  2499  90 +++++ +++ +++++ +++  2817  99 +++++ +++ 10971 100

Regards,
Mikael
Re: [PERFORM] RAID stripe size question
From: Mikael Carneholm [EMAIL PROTECTED]
Sent: Jul 16, 2006 6:52 PM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] RAID stripe size question

I have finally gotten my hands on the MSA1500 that we ordered some time ago. It has 28 x 10K 146GB drives,

Unless I'm missing something, the only FC or SCSI HDs of ~147GB capacity are 15K, not 10K (unless they are old?). I'm not just being pedantic. The correct, let alone optimal, answer to your question depends on your exact HW characteristics as well as your SW config and your usage pattern.

15Krpm HDs will have average access times of 5-6ms; 10Krpm ones, 7-8ms. Most modern HDs in this class will do ~60MB/s on inner tracks, ~75MB/s average, and ~90MB/s on outer tracks. If you are doing OLTP-like things, you are more sensitive to latency than most and should use the absolute lowest latency HDs available within your budget. The current latency best case is 15Krpm FC HDs.

currently grouped as 10 (for wal) + 18 (for data). There's only one controller (an emulex), but I hope performance won't suffer too much from that. Raid level is 0+1, filesystem is ext3.

I strongly suspect having only 1 controller is an I/O choke with 28 HDs. 28 HDs set up as 2 RAID 10s as above = ~75MBps*5 = ~375MB/s and ~75*9 = ~675MB/s. If both sets are to run at peak average speed, the Emulex would have to be able to handle ~1050MBps on average. It is doubtful the 1 Emulex can do this. In order to handle this level of bandwidth, a RAID controller must aggregate multiple FC, SCSI, or SATA streams as well as do any RAID 5 checksumming etc. that is required. Very, very few RAID controllers can do >= 1GBps.

One thing that helps greatly with bursty IO patterns is to up your battery backed RAID cache as high as you possibly can. Even multiple GBs of BBC can be worth it. Another reason to have multiple controllers ;-)

Then there is the question of the BW of the bus that the controller is plugged into. ~800MB/s is the real-world max to be gotten from a 64b 133MHz PCI-X channel.
PCI-E channels are usually good for 1/10 their rated speed in bps as Bps. So a PCI-Ex4 10Gbps bus can be counted on for 1GBps, PCI-Ex8 for 2GBps, etc. At present I know of no RAID controllers that can singlely saturate a PCI-Ex4 or greater bus. ...and we haven't even touched on OS, SW, and usage pattern issues. Bottom line is that the IO chain is only as fast as its slowest component. Now to the interesting part: would it make sense to use different stripe sizes on the separate disk arrays? The short answer is Yes. WAL's are basically appends that are written in bursts of your chosen log chunk size and that are almost never read afterwards. Big DB pages and big RAID stripes makes sense for WALs. Tables with OLTP-like characteristics need smaller DB pages and stripes to minimize latency issues (although locality of reference can make the optimum stripe size larger). Tables with Data Mining like characteristics usually work best with larger DB pages sizes and RAID stripe sizes. OS and FS overhead can make things more complicated. So can DB layout and access pattern issues. Side note: a 10 HD RAID 10 seems a bit much for WAL. Do you really need 375MBps IO on average to your WAL more than you need IO capacity for other tables? If WAL IO needs to be very high, I'd suggest getting a SSD or SSD-like device that fits your budget and having said device async mirror to HD. Bottom line is to optimize your RAID stripe sizes =after= you optimize your OS, FS, and pg design for best IO for your usage pattern(s). Hope this helps, Ron ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
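Ron's bottleneck arithmetic can be sanity-checked in a few lines of Python. This is a rough sketch, not a measurement: the ~75MB/s per-drive figure and the bus/link limits are the assumptions from his post.

```python
# Rough IO-chain bottleneck check using the figures from the post above.
# Assumption: ~75 MB/s average sustained transfer rate (ASTR) per drive.

def raid10_stream_mbps(n_disks, astr_mbps=75.0):
    """A RAID 10 of n_disks has n_disks/2 mirror pairs streaming in parallel."""
    return (n_disks // 2) * astr_mbps

arrays = {
    "WAL array (10 disks)":  raid10_stream_mbps(10),   # ~375 MB/s
    "DATA array (18 disks)": raid10_stream_mbps(18),   # ~675 MB/s
}
demand = sum(arrays.values())                          # ~1050 MB/s

links = {  # rough real-world limits quoted in the post
    "single 2Gb FC link": 192.0,
    "PCI-X 64b/133MHz":   800.0,
    "PCI-E x4":          1000.0,
}
print(f"disks can source ~{demand:.0f} MB/s in aggregate")
for name, limit in links.items():
    verdict = "bottleneck" if limit < demand else "adequate"
    print(f"  {name}: {limit:.0f} MB/s -> {verdict}")
```

Run against the 28-disk layout above, every single-path option comes out as the slowest component, which is exactly Ron's point about needing multiple controllers.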
Re: [PERFORM] RAID stripe size question
On Mon, Jul 17, 2006 at 09:40:30AM -0400, Ron Peacetree wrote:

> Unless I'm missing something, the only FC or SCSI HDs of ~147GB capacity are 15K, not 10K. (unless they are old?)

There are still 146GB SCSI 10Krpm disks being sold here, at least.

/* Steinar */ -- Homepage: http://www.sesse.net/ ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] RAID stripe size question
On 7/17/06, Mikael Carneholm [EMAIL PROTECTED] wrote:

> This is something I'd also would like to test, as a common best-practice these days is to go for a SAME (stripe all, mirror everything) setup. From a development perspective it's easier to use SAME as the developers won't have to think about physical location for new tables/indices, so if there's no performance penalty with SAME I'll gladly keep it that way.
>
>> Usually, it's not the developers task to care about that, but the DBAs responsibility.
>
> As we don't have a full-time dedicated DBA (although I'm the one who does most DBA related tasks) I would aim for making physical location as transparent as possible, otherwise I'm afraid I won't be doing anything else than supporting developers with that - and I *do* have other things to do as well :)
>
> In a previous test, using cd=5000 and cs=20 increased transaction throughput by ~20% so I'll definitely fiddle with that in the coming tests as well.
>
>> How many parallel transactions do you have?
>
> That was when running BenchmarkSQL (http://sourceforge.net/projects/benchmarksql) with 100 concurrent users (terminals), which I assume means 100 parallel transactions at most. The target application for this DB has 3-4 times as many concurrent connections so it's possible that one would have to find other cs/cd numbers better suited for that scenario. Tweaking bgwriter is another task I'll look into as well.
>
> Btw, here's the bonnie++ results from two different array sets (10+18, 4+24) on the MSA1500:
>
> LUN: WAL, 10 disks, stripe size 32K
> Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
>                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> sesell01        32G 56139  93 73250  22 16530   3 30488  45 57489   5 477.3   1
>                     ------Sequential Create------ --------Random Create--------
>                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16  2458  90 +++++ +++ +++++ +++  3121  99 +++++ +++ 10469  98
>
> LUN: WAL, 4 disks, stripe size 8K
> Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
>                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> sesell01        32G 49170  82 60108  19 13325   2 15778  24 21489   2 266.4   0
>                     ------Sequential Create------ --------Random Create--------
>                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16  2432  86 +++++ +++ +++++ +++  3106  99 +++++ +++ 10248  98
>
> LUN: DATA, 18 disks, stripe size 32K
> Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
>                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> sesell01        32G 59990  97 87341  28 19158   4 30200  46 57556   6 495.4   1
>                     ------Sequential Create------ --------Random Create--------
>                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16  1640  92 +++++ +++ +++++ +++  1736  99 +++++ +++ 10919  99
>
> LUN: DATA, 24 disks, stripe size 64K
> Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
>                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> sesell01        32G 59443  97 118515 39 25023   5 30926  49 60835   6 531.8   1
>                     ------Sequential Create------ --------Random Create--------
>                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16  2499  90 +++++ +++ +++++ +++  2817  99 +++++ +++ 10971 100

These bonnie++ numbers are very worrying. Your controller should easily max out your FC interface on these tests, passing 192MB/sec with ease on anything more than a 6 drive RAID 10. This is a bad omen if you want high performance... Each mirror pair can do 60-80MB/sec. A 24 disk RAID 10 can do 12*60MB/sec, which is 720MB/sec - I have seen this performance, it's not unreachable, but time and again we see these bad perf numbers from FC and SCSI systems alike. Consider a different controller, because this one is not up to snuff. A single drive would get better numbers than your 4 disk RAID 10; 21MB/sec read speed is really pretty sorry, it should be closer to 120MB/sec. If you can't swap out, software RAID may turn out to be your friend. The only saving grace is that this is OLTP, and perhaps, just maybe, the controller will be better at ordering IOs, but I highly doubt it.

Please people, do the numbers, benchmark before you buy: many, many HBAs really suck under Linux/FreeBSD, and you may end up paying vast sums of money for very sub-optimal performance (I'd say sub-standard, but alas, it seems that this kind of poor performance is tolerated, even though it's way off where it should be). There's no point having a 40disk cab, if your controller can't handle
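The gap Alex describes can be quantified by dividing each measured block-read rate by a naive per-spindle expectation. A sketch under his rule of thumb (~60MB/s per mirror pair, the low end of his 60-80MB/s range); the K/sec values are the block-read figures from Mikael's bonnie++ runs:

```python
# Compare measured bonnie++ block-read rates against ~60 MB/s per mirror pair.
measured_block_read_kps = {
    "WAL, 10 disks":  57489,
    "WAL, 4 disks":   21489,
    "DATA, 18 disks": 57556,
    "DATA, 24 disks": 60835,
}
mirror_pairs = {"WAL, 10 disks": 5, "WAL, 4 disks": 2,
                "DATA, 18 disks": 9, "DATA, 24 disks": 12}

for lun, kps in measured_block_read_kps.items():
    measured_mbps = kps / 1024
    expected_mbps = mirror_pairs[lun] * 60
    pct = 100 * measured_mbps / expected_mbps
    print(f"{lun}: {measured_mbps:5.1f} MB/s measured, "
          f"~{expected_mbps} MB/s expected ({pct:.0f}%)")
```

Every LUN lands well under half of the naive spindle-count expectation, and the 24-disk set barely outruns the 10-disk one, which supports the "controller, not disks" diagnosis.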
Re: [PERFORM] RAID stripe size question
> Unless I'm missing something, the only FC or SCSI HDs of ~147GB capacity are 15K, not 10K.

In the spec we got from HP, they are listed as model 286716-B22 (http://www.dealtime.com/xPF-Compaq_HP_146_8_GB_286716_B22) which seems to run at 10K. Don't know how old those are, but that's what we got from HP anyway.

> 15Krpm HDs will have average access times of 5-6ms. 10Krpm ones of 7-8ms.

Average seek time for that disk is listed as 4.9ms, maybe sounds a bit optimistic?

> 28HDs as above setup as 2 RAID 10's = ~75MBps*5 = ~375MB/s, ~75*9 = ~675MB/s.

I guess it's still limited by the 2Gbit FC (~192MB/s), right?

> Very, very few RAID controllers can do >= 1GBps. One thing that helps greatly with bursty IO patterns is to up your battery backed RAID cache as high as you possibly can. Even multiple GBs of BBC can be worth it. Another reason to have multiple controllers ;-)

I use 90% of the raid cache for writes, don't think I could go higher than that. Too bad the emulex only has 256Mb though :/

> Then there is the question of the BW of the bus that the controller is plugged into. ~800MB/s is the real-world max to be gotten from a 64b 133MHz PCI-X channel. PCI-E channels are usually good for 1/10 their rated speed in bps as Bps. So a PCI-Ex4 10Gbps bus can be counted on for 1GBps, PCI-Ex8 for 2GBps, etc. At present I know of no RAID controllers that can singly saturate a PCI-Ex4 or greater bus.

The controller is a FC2143 (http://h71016.www7.hp.com/dstore/MiddleFrame.asp?page=configProductLineId=450FamilyId=1449BaseId=17621oi=E9CEDBEID=19701SBLID=), which uses PCI-E. Don't know how it compares to other controllers, haven't had the time to search for / read any reviews yet.

> Now to the interesting part: would it make sense to use different stripe sizes on the separate disk arrays? The short answer is Yes.

Ok

> WAL's are basically appends that are written in bursts of your chosen log chunk size and that are almost never read afterwards. Big DB pages and big RAID stripes make sense for WALs.

According to http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html, it seems to be the other way around? ("As stripe size is decreased, files are broken into smaller and smaller pieces. This increases the number of drives that an average file will use to hold all the blocks containing the data of that file, theoretically increasing transfer performance, but decreasing positioning performance.") I guess I'll have to find out which theory holds by good ol' trial and error... :)

- Mikael ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] RAID stripe size question
Mikael Carneholm wrote:

> Btw, here's the bonnie++ results from two different array sets (10+18, 4+24) on the MSA1500:
>
> LUN: DATA, 24 disks, stripe size 64K
> Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
>                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> sesell01        32G 59443  97 118515 39 25023   5 30926  49 60835   6 531.8   1
>                     ------Sequential Create------ --------Random Create--------
>                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16  2499  90 +++++ +++ +++++ +++  2817  99 +++++ +++ 10971 100

It might be interesting to see if 128K or 256K stripe size gives better sequential throughput, while still leaving the random performance ok. Having said that, the seeks/s figure of 531 is not that great - for instance I've seen a 12 disk (15K SCSI) system report about 1400 seeks/s in this test. Sorry if you mentioned this already - but what OS and filesystem are you using? (If Linux and ext3, it might be worth experimenting with xfs or jfs.)

Cheers, Mark ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
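Mark's "531 seeks/s is not that great" can be ballparked against the spindle count. A sketch under assumptions from earlier in the thread: ~7.9ms average access per 10K drive (4.9ms seek plus ~3ms rotational latency) and random reads spread across all 24 spindles:

```python
# Theoretical random-seek ceiling for the 24-disk DATA array.
avg_access_ms = 7.9                      # ~4.9 ms seek + ~3 ms half-revolution
per_disk_iops = 1000 / avg_access_ms     # ~127 random IOs/s per spindle
ceiling = 24 * per_disk_iops             # ~3040 seeks/s if all spindles help
measured = 531.8                         # bonnie++ --Seeks-- figure above
print(f"ceiling ~{ceiling:.0f}/s, measured {measured}/s "
      f"= {100 * measured / ceiling:.0f}% of ceiling")
```

Even granting that bonnie++'s single-threaded seek test can't drive every spindle at once, landing under a fifth of the ceiling fits Mark's comparison with the 12-disk system.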
Re: [PERFORM] RAID stripe size question
-Original Message- From: Mikael Carneholm [EMAIL PROTECTED] Sent: Jul 17, 2006 5:16 PM To: Ron Peacetree [EMAIL PROTECTED], pgsql-performance@postgresql.org Subject: RE: [PERFORM] RAID stripe size question

>> 15Krpm HDs will have average access times of 5-6ms. 10Krpm ones of 7-8ms.
>
> Average seek time for that disk is listed as 4.9ms, maybe sounds a bit optimistic?

Ah, the games vendors play. Average seek time for a 10Krpm HD may very well be 4.9ms. However, what matters to you the user is average =access= time. The 1st is how long it takes to position the heads to the correct track. The 2nd is how long it takes to actually find and get data from a specified HD sector.

>> 28HDs as above setup as 2 RAID 10's = ~75MBps*5 = ~375MB/s, ~75*9 = ~675MB/s.
>
> I guess it's still limited by the 2Gbit FC (~192MB/s), right?

No. A decent HBA has multiple IO channels on it. So for instance Areca's ARC-6080 (8/12/16-port 4Gbps Fibre-to-SATA ll Controller) has 2 4Gbps FCs in it (...and can support up to 4GB of BB cache!). Nominally, this card can push 8Gbps = 800MBps. ~600-700MBps is the real-world number. Assuming ~75MBps ASTR per HD, that's ~ enough bandwidth for a 16 HD RAID 10 set per ARC-6080.

>> Very, very few RAID controllers can do >= 1GBps. One thing that helps greatly with bursty IO patterns is to up your battery backed RAID cache as high as you possibly can. Even multiple GBs of BBC can be worth it. Another reason to have multiple controllers ;-)
>
> I use 90% of the raid cache for writes, don't think I could go higher than that. Too bad the emulex only has 256Mb though :/

If your RAID cache hit rates are in the 90+% range, you probably would find it profitable to make it greater. I've definitely seen access patterns that benefitted from increased RAID cache for any size I could actually install. For those access patterns, no amount of RAID cache commercially available was enough to find the flattening point of the cache percentage curve. 256MB of BB RAID cache per HBA is just not that much for many IO patterns.

> The controller is a FC2143 (http://h71016.www7.hp.com/dstore/MiddleFrame.asp?page=configProductLineId=450FamilyId=1449BaseId=17621oi=E9CEDBEID=19701SBLID=), which uses PCI-E. Don't know how it compares to other controllers, haven't had the time to search for / read any reviews yet.

This is a relatively low end HBA with 1 4Gb FC on it. Max sustained IO on it is going to be ~320MBps, or ~ enough for an 8 HD RAID 10 set made of 75MBps ASTR HD's. 28 such HDs are =definitely= IO choked on this HBA. The arithmetic suggests you need a better HBA or more HBAs or both.

>> WAL's are basically appends that are written in bursts of your chosen log chunk size and that are almost never read afterwards. Big DB pages and big RAID stripes make sense for WALs.
>
> According to http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html, it seems to be the other way around? ("As stripe size is decreased, files are broken into smaller and smaller pieces. This increases the number of drives that an average file will use to hold all the blocks containing the data of that file, theoretically increasing transfer performance, but decreasing positioning performance.") I guess I'll have to find out which theory holds by good ol' trial and error... :)

IME, stripe sizes of 64, 128, or 256 are the most common found to be optimal for most access patterns + SW + FS + OS + HW. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
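Ron's seek-vs-access distinction is easy to put in numbers: average access time is average seek time plus average rotational latency (half a revolution). A sketch; the 4.9ms seek is the spec figure Mikael quoted, while the 3.8ms figure for a 15K drive is an illustrative assumption:

```python
# Average access time = average seek time + average rotational latency,
# where rotational latency averages half a revolution at the given RPM.
def access_time_ms(seek_ms, rpm):
    half_rev_ms = 0.5 * 60_000 / rpm    # ms per half revolution
    return seek_ms + half_rev_ms

print(f"10Krpm, 4.9 ms seek -> {access_time_ms(4.9, 10_000):.1f} ms access")
print(f"15Krpm, 3.8 ms seek -> {access_time_ms(3.8, 15_000):.1f} ms access")
```

The results (~7.9ms and ~5.8ms) land squarely in Ron's quoted 7-8ms and 5-6ms ranges, so the vendor's 4.9ms isn't optimistic, it's just answering a different question.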
Re: [PERFORM] RAID stripe size question
On 7/17/06, Ron Peacetree [EMAIL PROTECTED] wrote:

>>> 15Krpm HDs will have average access times of 5-6ms. 10Krpm ones of 7-8ms.
>>
>> Average seek time for that disk is listed as 4.9ms, maybe sounds a bit optimistic?
>
> Ah, the games vendors play. Average seek time for a 10Krpm HD may very well be 4.9ms. However, what matters to you the user is average =access= time. The 1st is how long it takes to position the heads to the correct track. The 2nd is how long it takes to actually find and get data from a specified HD sector.
>
>>> 28HDs as above setup as 2 RAID 10's = ~75MBps*5 = ~375MB/s, ~75*9 = ~675MB/s.
>>
>> I guess it's still limited by the 2Gbit FC (~192MB/s), right?
>
> No. A decent HBA has multiple IO channels on it. So for instance Areca's ARC-6080 (8/12/16-port 4Gbps Fibre-to-SATA ll Controller) has 2 4Gbps FCs in it (...and can support up to 4GB of BB cache!). Nominally, this card can push 8Gbps = 800MBps. ~600-700MBps is the real-world number. Assuming ~75MBps ASTR per HD, that's ~ enough bandwidth for a 16 HD RAID 10 set per ARC-6080.
>
>>> Very, very few RAID controllers can do >= 1GBps. One thing that helps greatly with bursty IO patterns is to up your battery backed RAID cache as high as you possibly can. Even multiple GBs of BBC can be worth it. Another reason to have multiple controllers ;-)
>>
>> I use 90% of the raid cache for writes, don't think I could go higher than that. Too bad the emulex only has 256Mb though :/
>
> If your RAID cache hit rates are in the 90+% range, you probably would find it profitable to make it greater. I've definitely seen access patterns that benefitted from increased RAID cache for any size I could actually install. For those access patterns, no amount of RAID cache commercially available was enough to find the flattening point of the cache percentage curve. 256MB of BB RAID cache per HBA is just not that much for many IO patterns.

90% as in 90% of the RAM, not 90% hit rate I'm imagining.

>> The controller is a FC2143 (http://h71016.www7.hp.com/dstore/MiddleFrame.asp?page=configProductLineId=450FamilyId=1449BaseId=17621oi=E9CEDBEID=19701SBLID=), which uses PCI-E. Don't know how it compares to other controllers, haven't had the time to search for / read any reviews yet.
>
> This is a relatively low end HBA with 1 4Gb FC on it. Max sustained IO on it is going to be ~320MBps, or ~ enough for an 8 HD RAID 10 set made of 75MBps ASTR HD's. 28 such HDs are =definitely= IO choked on this HBA.

No they aren't. This is OLTP, not data warehousing. I already posted math for OLTP throughput, which is in the order of 8-80MB/second actual data throughput based on maximum theoretical seeks/second.

> The arithmetic suggests you need a better HBA or more HBAs or both.
>
>>> WAL's are basically appends that are written in bursts of your chosen log chunk size and that are almost never read afterwards. Big DB pages and big RAID stripes make sense for WALs.

Unless of course you are running OLTP, in which case a big stripe isn't necessary; spend the disks on your data partition, because your WAL activity is going to be small compared with your random IO.

>> According to http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html, it seems to be the other way around? ("As stripe size is decreased, files are broken into smaller and smaller pieces. This increases the number of drives that an average file will use to hold all the blocks containing the data of that file, theoretically increasing transfer performance, but decreasing positioning performance.") I guess I'll have to find out which theory holds by good ol' trial and error... :)
>
> IME, stripe sizes of 64, 128, or 256 are the most common found to be optimal for most access patterns + SW + FS + OS + HW.

New records will be posted at the end of a file, and will only increase the file by the number of blocks in the transactions posted at write time. Updated records are modified in place unless they have grown too big to be in place. If you are updating multiple tables on each transaction, a 64kb stripe size or lower is probably going to be best, as block sizes are just 8kb. How much data does your average transaction write? How many xacts per second? This will help determine how many writes your cache will queue up before it flushes, and therefore what the optimal stripe size will be. Of course, the fastest and most accurate way is probably just to try different settings and see how it works. Alas, some controllers seem to handle some stripe sizes more efficiently in defiance of any logic.

Work out how big your xacts are, how many xacts/second you can post, and you will figure out how fast WAL will be written. Allocate enough disk for peak load plus planned expansion on WAL and then put the rest to tablespace. You may well find that a single RAID 1 is enough for WAL (if you achieve theoretical performance levels,
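Alex's "work out how fast WAL will be written" arithmetic can be sketched in a couple of lines. The transaction rate and per-transaction WAL volume below are made-up illustrative numbers, not figures from the thread:

```python
# WAL write bandwidth ~= transactions/sec * WAL bytes logged per transaction.
def wal_mbps(xacts_per_sec, wal_bytes_per_xact):
    return xacts_per_sec * wal_bytes_per_xact / (1024 * 1024)

# e.g. 2000 small OLTP xacts/s, each logging ~2 KB of WAL:
rate = wal_mbps(2000, 2 * 1024)
print(f"~{rate:.1f} MB/s of WAL")   # ~3.9 MB/s
```

At a few MB/s of sequential appends, a single mirrored pair has plenty of headroom, which is Alex's point about not dedicating ten spindles to WAL.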
Re: [PERFORM] RAID stripe size question
On Mon, Jul 17, 2006 at 12:52:17AM +0200, Mikael Carneholm wrote:

> Now to the interesting part: would it make sense to use different stripe sizes on the separate disk arrays? In theory, a smaller stripe size (8-32K) should increase sequential write throughput at the cost of decreased positioning performance, which sounds good for WAL (assuming WAL is never searched during normal operation).

For large writes (i.e. sequential write throughput), it doesn't really matter what the stripe size is; all the disks will have to both seek and write anyhow.

/* Steinar */ -- Homepage: http://www.sesse.net/ ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] RAID stripe size question
On Mon, Jul 17, 2006 at 12:52:17AM +0200, Mikael Carneholm wrote:

> I have finally gotten my hands on the MSA1500 that we ordered some time ago. It has 28 x 10K 146Gb drives, currently grouped as 10 (for wal) + 18 (for data). There's only one controller (an emulex), but I hope

You've got 1.4TB assigned to the WAL, which doesn't normally have more than a couple of gigs?

Mike Stone ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match