Re: [PERFORM] Scaling further up
Can you describe the vendors/components of a cheap SAN setup? Thanks, Anjan -Original Message- From: Rod Taylor [mailto:[EMAIL PROTECTED] Sent: Tuesday, March 02, 2004 5:57 PM To: Scott Marlowe Cc: Anjan Dave; Chris Ruprecht; [EMAIL PROTECTED]; William Yu; Postgresql Performance Subject: Re: [PERFORM] Scaling further up For speed, the X86 32 and 64 bit architectures seem to be noticeably faster than Sparc. However, running Linux or BSD on Sparc makes them pretty fast too, but you lose the fault-tolerant support for things like hot-swappable CPUs or memory. Agreed.. You can get a Quad Opteron with 16GB memory for around 20K. Grab 3, a cheap SAN, and set up a little master/slave replication with failover (how is Slony coming?), and you're all set. ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [PERFORM] Scaling further up
For the disks part - I am looking at a SAN implementation, and I will be planning a separate RAID group for the WALs. The controller is a PERC, with 128MB cache, and I think it is write-back. Other than the disks, I am curious what other people are using in terms of the horsepower needed. The Quad server has been keeping up, but we are expecting quite high loads in the near future, and I am not sure if just having the disks on high-end storage will do it. Thanks, Anjan -Original Message- From: Magnus Hagander [mailto:[EMAIL PROTECTED] Sent: Monday, March 01, 2004 3:54 PM To: Anjan Dave; [EMAIL PROTECTED] Subject: RE: [PERFORM] Scaling further up All: We have a Quad-Intel XEON 2.0GHz (1MB cache), 12GB memory, running RH9, PG 7.4.0. There's an internal U320, 10K RPM RAID-10 setup on 4 drives. We are expecting a pretty high load, a few thousand 'concurrent' users executing select, insert, or update statements. What is the next step up in terms of handling very heavy loads? Clustering? I'd look at adding more disks first. Depending on what type of query load you get, that box sounds like it will be very much I/O bound. More spindles = more parallel operations = faster under load. Consider adding 15K RPM disks as well; they're not all that much more expensive, and should give you better performance than 10K RPM. Also, make sure you put your WAL disks on a separate RAID set if possible (not just a separate partition on an existing RAID set). Finally, if you don't already have it, look for a battery-backed RAID controller that can do write-back caching, and enable that. (Don't even think about enabling it unless it's battery backed!) And add as much RAM as you can to that controller. //Magnus
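As a rough illustration of the "more spindles = more parallel operations" point (my own back-of-the-envelope sketch, not figures from this post): each drive can service roughly 1/(average access time) random I/Os per second, and a RAID-10 array scales reads with the spindle count while each logical write costs two member writes. The 8 ms seek / 2 ms settle / 3 ms latency numbers are the typical 10K RPM figures quoted later in this thread; treat them as assumptions.

```python
# Back-of-the-envelope random-I/O estimate for a RAID-10 array.
# All figures are illustrative assumptions (typical 10K RPM numbers
# quoted elsewhere in this thread), not measurements.

def iops_per_drive(seek_ms=8.0, settle_ms=2.0, half_rotation_ms=3.0):
    access_ms = seek_ms + settle_ms + half_rotation_ms
    return 1000.0 / access_ms

def raid10_iops(n_drives, write_fraction=0.5):
    per_drive = iops_per_drive()
    reads = n_drives * per_drive * (1 - write_fraction)
    # each logical write hits both mirror halves, so write capacity is halved
    writes = (n_drives / 2) * per_drive * write_fraction
    return reads + writes

print(round(iops_per_drive()))   # roughly 77 random IOPS per drive
print(round(raid10_iops(4)))     # the 4-drive RAID-10 in the post, 50/50 mix
print(round(raid10_iops(12)))    # a dozen spindles
```

The point of the sketch is only the scaling: going from 4 to 12 spindles roughly triples the random-I/O budget, which is why adding disks usually beats faster CPUs for an I/O-bound box.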
Re: [PERFORM] Scaling further up
On Fri, Mar 12, 2004 at 06:25:48PM -0500, Anjan Dave wrote: We upgraded from 8GB to 12GB RAM a month or so ago, but even in the past, I've never seen the system exhaust its system cache (~6GB, in 'top') while it's swapping. Someone had mentioned why not have the entire DB in memory? How do I configure that, just for my own knowledge? You don't. It'll automatically be in memory if (a) you have enough memory, (b) you don't have anything else on the machine using the memory, and (c) it's been read at least one time. A -- Andrew Sullivan | [EMAIL PROTECTED]
Re: [PERFORM] Scaling further up
On Tue, 2004-03-16 at 07:28, Matt Davies wrote: This is the preferred method, but you could create a memory disk if running Linux. This has several caveats, though. 1. You may have to recompile the kernel for support. 2. You must store the database on a hard drive partition during reboots. 3. Because of #2, this option is generally useful only if you have static content that is loaded to the memory disk upon startup of the system. And 4. You use twice as much memory - one lot for the FS, the second for buffer cache. It's generally going to be slower than simply doing some typical queries to preload the data into buffer cache, I think. Stephen
Re: [PERFORM] Scaling further up
Sorry about not chiming in before - I've been too swamped to think. I agree with most of the points, but a lot of these posts are interesting and seem to describe systems from an SA perspective to my DBA-centric view. - Original Message - From: Marty Scholes [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Wednesday, March 10, 2004 6:29 PM Subject: Re: [PERFORM] Scaling further up I have some suggestions based on my anecdotal experience. 1. This is a relatively small DB -- the working set will likely be in RAM at any moment in time, making read I/O time mostly irrelevant. 2. The killer will be write times -- specifically log writes. Small and heavily synchronized writes, log and data writes, will drag down an impressive hardware RAID setup. We run mirrored hardware RAID 5 arrays with write-back cache and are constantly seeking ways to improve write performance. We do a lot of batch processing, though, so we do a lot of write I/Os. My experience with RAID 5 for streaming sequential writes is bad. This is sometimes helped by the hardware caching covering the cost of the additional I/Os for striping (write-through RAID 5 + big cache acts like RAID 1+0 until you run out of cache). Batch processing is different from high-concurrency transactions because it needs faster volume streaming, while TP is dependent on the speed of ack'ing (a few big writes with fewer synchronous waits vs. lots of small writes which serialize everyone). (RAID 3 worked for me in the past for logging, but I haven't used it in years.) 3. Be very careful with battery-backed write cache. It usually works as advertised. More than once in the past decade I have seen spontaneous cache corruption after power loss. The corruption usually happens when some admin, including me, has assumed that the cache will ALWAYS survive a power failure unblemished and has no plan B. Make sure you have a contingency plan for corruption, or don't enable the cache. I agree strongly.
There is also the same problem with disk write-back cache, and even with SCSI controllers with write-through enabled. PITR would help here. A lot of these problems are due to procedural error post-crash. 4. RAID 10 will likely have bigger stripe sizes on the RAID 0 portion of the setup, and might hinder, not help, small write I/O performance. In a high-volume system without write caching you are almost always going to see queuing, which can make the larger buffer mostly irrelevant, if it's not huge. Write caching thrives on big block sizes (which is a key reason why Symmetrix doesn't do worse than it does) by reducing I/O counts. Most shops I've set up or seen use mirroring or RAID 10 for logs. Note also that many RAID 10 controllers in a non-write-cached setup allow a race between the two writers, acknowledging when the first of the two completes - increasing throughput by about 1/4. 5. Most (almost all) of the I/O time will be due to the access time (head seek + head settle + rotational latency) and very little of the I/O time will be due to data transfer time. In other words, getting drives that provide faster transfer rates will barely improve performance. The secret is lowering the access time. True. This is very much a latency story. Even in volume batch, you can see access times that clearly show some other system configuration bottleneck that happens elsewhere before hitting I/O capacity. 6. A relatively cheap way to drastically drop the access time is to get large drive(s) and only use a portion of them for storage. The less space used on the drive, the less area the heads need to cover for seeks. At one extreme, you could make the partition the size of a single cylinder. This would make access time (ignoring OS and controller overhead) identical to rotational latency, which is as low as 4.2 ms for a cheap 7200 RPM drive. This is a good strategy for VLDB, and may not be relevant in this case.
Also - big sequential writes and 15K RPM drives, in the case of write-through, are a beautiful thing - they look like a manufacturer's demo. A primary performance role of an RDBMS is to convert random I/O to sequential (by buffering reads and using a streaming log to defer random writes to checkpoints). RDBMSs are the prime beneficiaries of the drive speed improvements - since logging, backups, and copies are about the only things (ignoring bad perl scripts and find commands) that generate loads of 50+ MB/sec. /Aaron
Re: [PERFORM] Scaling further up
I have some suggestions based on my anecdotal experience. 1. This is a relatively small DB -- the working set will likely be in RAM at any moment in time, making read I/O time mostly irrelevant. 2. The killer will be write times -- specifically log writes. Small and heavily synchronized writes, log and data writes, will drag down an impressive hardware RAID setup. We run mirrored hardware RAID 5 arrays with write-back cache and are constantly seeking ways to improve write performance. We do a lot of batch processing, though, so we do a lot of write I/Os. 3. Be very careful with battery-backed write cache. It usually works as advertised. More than once in the past decade I have seen spontaneous cache corruption after power loss. The corruption usually happens when some admin, including me, has assumed that the cache will ALWAYS survive a power failure unblemished and has no plan B. Make sure you have a contingency plan for corruption, or don't enable the cache. 4. RAID 10 will likely have bigger stripe sizes on the RAID 0 portion of the setup, and might hinder, not help, small write I/O performance. 5. Most (almost all) of the I/O time will be due to the access time (head seek + head settle + rotational latency) and very little of the I/O time will be due to data transfer time. In other words, getting drives that provide faster transfer rates will barely improve performance. The secret is lowering the access time. 6. A relatively cheap way to drastically drop the access time is to get large drive(s) and only use a portion of them for storage. The less space used on the drive, the less area the heads need to cover for seeks. At one extreme, you could make the partition the size of a single cylinder. This would make access time (ignoring OS and controller overhead) identical to rotational latency, which is as low as 4.2 ms for a cheap 7200 RPM drive. 7.
A drive with a 5 ms average service time, servicing 8 KB blocks, will yield as much as 1.6 MB/s sustained write throughput. Not bad for a cheap uncached solution. Any OS aggregation of writes during the fsync() call will further improve this number -- it is basically a lower bound for throughput. 8. Many people, especially managers, cannot stomach buying disk space and only using a portion of it. In many cases, it seems more palatable to purchase a much more expensive solution to get to the same speeds. Good luck. scott.marlowe wrote: On Wed, 3 Mar 2004, Paul Thomas wrote: On 02/03/2004 23:25 johnn wrote: [snip] random_page_cost should be set with the following things taken into account: - seek speed Which is not exactly the same thing as spindle speed, as it's a combination of spindle speed and track-to-track speed. I think you'll find that a 15K RPM disk, whilst it will probably have a lower seek time than a 10K RPM disk, won't have a proportionately (i.e., 2/3rds) lower seek time. There are three factors that affect how fast you can get to the next sector: seek time, settle time, and rotational latency. Most drives only list the first, and don't bother to mention the other two. On many modern drives, the seek times are around 5 to 10 milliseconds. The settle time varies as well. The longer the seek, the longer the settle, generally. This is the time it takes for the head to stop shaking and rest quietly over a particular track. Rotational latency is the amount of time you have to wait, on average, for the sector you want to come under the heads. Assuming an 8 ms seek and 2 ms settle (typical numbers), and that the rotational latency on average is 1/2 of a rotation: at 10K RPM, a rotation takes 1/166.667 of a second, or 6 ms. So half a rotation is approximately 3 ms. By going to a 15K RPM drive, the latency drops to 2 ms.
So, if we add them up, on the same basic drive, one being 10K and one being 15K, we get: 10K RPM: 8+2+3 = 13 ms; 15K RPM: 8+2+2 = 12 ms. So, based on the decrease in rotational latency being the only advantage the 15K RPM drive has over the 10K RPM drive, we get a decrease in access time of only 1 ms, or only about an 8% decrease in actual seek time. So, if your random_page_cost on 10K RPM drives was 1.7, you'd need to drop it to 1.57 or so to reflect the speed increase from 15K RPM drives. I.e. it's much more likely that going from 1 gig to 2 gigs of RAM will make a noticeable difference than going from 10K to 15K drives.
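The arithmetic in the last two posts is easy to check with a short script (the 8 ms seek and 2 ms settle figures are the ones assumed above):

```python
# Reproduce the access-time and throughput arithmetic from the posts above.
# Assumed figures: 8 ms seek, 2 ms settle, half a rotation of latency.

def access_time_ms(rpm, seek_ms=8.0, settle_ms=2.0):
    rotation_ms = 60_000.0 / rpm      # one full rotation in ms
    return seek_ms + settle_ms + rotation_ms / 2

t10k = access_time_ms(10_000)         # 8 + 2 + 3 = 13 ms
t15k = access_time_ms(15_000)         # 8 + 2 + 2 = 12 ms
print(t10k, t15k)

# scale an existing random_page_cost by the access-time ratio
print(round(1.7 * t15k / t10k, 2))    # 1.7 on 10K drives -> about 1.57

# point 7 above: 5 ms service time, 8 KB blocks, lower-bound write throughput
print(8 / 5 * 1000, "KB/s")           # 1600 KB/s = 1.6 MB/s
```

Both of the thread's headline numbers fall out directly: a 15K RPM drive buys only ~8% on access time, while the 1.6 MB/s figure is just block size divided by service time.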
Re: [PERFORM] Scaling further up
On Mon, 2004-03-08 at 11:40, William Yu wrote: Anjan Dave wrote: Great response, thanks. Regarding 12GB memory and a 13G db, and almost no I/O, one thing I don't understand is that even though the OS caches most of the memory and PG can use it if it needs it, why would the system swap (not much, only during peak times)? SHMMAX is set to 512MB, shared_buffers is 150MB, effective cache size is 2GB, sort mem is 2MB, the rest are default values. It also happens that a large query (reporting type) can hold up the other queries, and the load averages shoot up during peak times. In regards to your system going to swap, the only item I see is sort_mem at 2MB. How many simultaneous transactions do you get? If you get hundreds or thousands like your first message stated, every select sort would take up 2MB of memory regardless of whether it needed it or not. That could cause your swap activity during peak traffic. The only other item to bump up is the effective cache size -- I'd set it to 12GB. I was surprised that no one corrected this bit of erroneous info (or at least I didn't see it), so I thought I would for completeness. A basic explanation is that sort_mem controls how much memory a given query is allowed to use before spilling to disk, but it will not grab that much memory if it doesn't need it. See the docs for a more detailed explanation: http://www.postgresql.org/docs/7.4/interactive/runtime-config.html#RUNTIME-CONFIG-RESOURCE Robert Treat -- Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
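To make the sort_mem point concrete (a hypothetical sketch; the 2 MB setting is from this thread, and the backend counts are illustrative): the worst case is every active backend running a sort at the same time, not a constant per-connection reservation, and a single complex plan can run several sorts at once.

```python
# Worst-case sort memory: every active backend sorting simultaneously.
# sort_mem is the 7.4-era parameter discussed above (later renamed
# work_mem); 2 MB is the setting mentioned in the thread. Backend
# counts here are illustrative assumptions.

def worst_case_sort_mb(active_backends, sorts_per_query=1, sort_mem_mb=2):
    return active_backends * sorts_per_query * sort_mem_mb

print(worst_case_sort_mb(500))      # 500 backends, one sort each -> 1000 MB
print(worst_case_sort_mb(500, 3))   # plans with 3 sort nodes -> 3000 MB
```

This is why a seemingly small sort_mem can still push a 12GB box into swap during peak concurrency, even though idle backends consume none of it.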
Re: [PERFORM] Scaling further up
I'd look at adding more disks first. Depending on what type of query load you get, that box sounds like it will be very much I/O bound. Given a 13G database on a 12G system, with a low growth rate, it is likely that there is almost no I/O for most activities. The exception is checkpointing. The first thing I'd do is try to build a spreadsheet model of: - select frequency, and # logical and physical reads involved - insert/delete/update frequency, and # logical and physical reads and writes involved - commit frequency, etc. (start out with simplistic assumptions, and do it for peak load) - system events (checkpoints, vacuum) I assume that the only high I/O you will see will be for logging. The RAID issue there is basically obviated by the sequential write nature of WAL. If that is the case, EMC is not the most cost-effective or obvious solution - since the value they provide is mostly manageability for disaster recovery. The goal in this case is to write at the application's max speed, and with minimal latency. Any responsible battery-backed write-through (mirrored) cached controller can do that for you. On the other hand, if your requests are not *all* trivial, you are going to test the hardware and scheduling algorithms of the OS and pg. Even if 0.1% of 3,000 tps take a second - that ends up generating 3 seconds of load. Any, even slightly, slow transactions will generate enormous queues which slow down everything. In most systems of this volume I've seen, the mix of activities is constantly invalidating cache, making L2 caching less important. Memory-to-CPU bus speed is a limiting factor, as well as raw CPU speed in processing the requests. Xeon is not a great architecture for this because of FSB contention; I suspect a 4-way will be completely FSB-bottlenecked, so more than 4-way would likely not change performance. I would try to get a simple model/benchmark going and test against it.
You should be talking to the big iron vendors for their take on your issues and get their capacity benchmarks. __ Do you Yahoo!? Yahoo! Search - Find what you're looking for faster http://search.yahoo.com
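The "3 seconds of load" remark above can be quantified (a rough sketch; the 3,000 tps, 0.1%, and 1-second figures come from the post itself): a small fraction of slow transactions contributes its own offered load on top of the fast ones, and that extra concurrency is what builds the queues.

```python
# Offered load contributed by a small fraction of slow transactions:
# at 3,000 tps, even 0.1% of transactions taking 1 second each adds
# 3 concurrent seconds of work per wall-clock second.

def slow_txn_load(tps, slow_fraction, slow_seconds):
    return tps * slow_fraction * slow_seconds

print(slow_txn_load(3000, 0.001, 1.0))   # about 3 server-seconds per second
```

In queueing terms, that is 3 extra "busy servers" permanently occupied by the slow tail, which is why even rare slow transactions can dominate behavior at this volume.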
Re: [PERFORM] Scaling further up
Great response, thanks. Regarding 12GB memory and a 13G db, and almost no I/O, one thing I don't understand is that even though the OS caches most of the memory and PG can use it if it needs it, why would the system swap (not much, only during peak times)? SHMMAX is set to 512MB, shared_buffers is 150MB, effective cache size is 2GB, sort mem is 2MB, the rest are default values. It also happens that a large query (reporting type) can hold up the other queries, and the load averages shoot up during peak times. Regarding a baseline: - We have docs and monitor for frequency of SQL statements, most expensive ones, etc. (IronEye) - I am monitoring disk reads/writes using iostat - How do I measure commit frequency, and system events like checkpoints? (vacuum is done nightly during less or no load) Thanks, Anjan -Original Message- From: Aaron W [mailto:[EMAIL PROTECTED] Sent: Thursday, March 04, 2004 8:58 AM To: [EMAIL PROTECTED]; Anjan Dave Subject: Re: Scaling further up I'd look at adding more disks first. Depending on what type of query load you get, that box sounds like it will be very much I/O bound. Given a 13G database on a 12G system, with a low growth rate, it is likely that there is almost no I/O for most activities. The exception is checkpointing. The first thing I'd do is try to build a spreadsheet model of: - select frequency, and # logical and physical reads involved - insert/delete/update frequency, and # logical and physical reads and writes involved - commit frequency, etc. (start out with simplistic assumptions, and do it for peak load) - system events (checkpoints, vacuum) I assume that the only high I/O you will see will be for logging. The RAID issue there is basically obviated by the sequential write nature of WAL. If that is the case, EMC is not the most cost-effective or obvious solution - since the value they provide is mostly manageability for disaster recovery. The goal in this case is to write at the application's max speed, and with minimal latency.
Any responsible battery-backed write-through (mirrored) cached controller can do that for you. On the other hand, if your requests are not *all* trivial, you are going to test the hardware and scheduling algorithms of the OS and pg. Even if 0.1% of 3,000 tps take a second - that ends up generating 3 seconds of load. Any, even slightly, slow transactions will generate enormous queues which slow down everything. In most systems of this volume I've seen, the mix of activities is constantly invalidating cache, making L2 caching less important. Memory-to-CPU bus speed is a limiting factor, as well as raw CPU speed in processing the requests. Xeon is not a great architecture for this because of FSB contention; I suspect a 4-way will be completely FSB-bottlenecked, so more than 4-way would likely not change performance. I would try to get a simple model/benchmark going and test against it. You should be talking to the big iron vendors for their take on your issues and get their capacity benchmarks.
Re: [PERFORM] Scaling further up
On Thu, 4 Mar 2004, Paul Thomas wrote: On 03/03/2004 18:23 scott.marlowe wrote: [snip] There are three factors that affect how fast you can get to the next sector: seek time, settle time, and rotational latency. Most drives only list the first, and don't bother to mention the other two. Ah yes, one of my (very) few still functioning brain cells was nagging about another bit of time in the equation :) On many modern drives, the seek times are around 5 to 10 milliseconds. [snip] Going back to the OP's posting about random_page_cost, imagine I have 2 servers identical in every way except the disk drive. Server A has a 10K RPM drive and server B has a 15K RPM drive. Seek/settle times aren't spectacularly different between the 2 drives. I'm wondering if drive B might actually merit a _higher_ random_page_cost than drive A as, once it gets settled on a disk track, it can suck the data off a lot faster. Opinions/experiences, anyone? It might well be that you have higher settle times that offset the small gain in rotational latency. I haven't looked into it, so I don't know one way or the other, but it seems a reasonable assumption. However, a common misconception is that the higher angular velocity of the 15K RPM drives would allow you to read data faster. In fact, the limit of how fast you can read is set by the head. There's a maximum frequency that it can read, and the areal density / rpm have to be such that you don't exceed that frequency. Often, the speed at which you read off the platters is exactly the same between a 10K and a 15K of the same family. The required lower areal density is the reason 15K RPM drives show up in the lower capacities first.
Re: [PERFORM] Scaling further up
On 02/03/2004 23:25 johnn wrote: [snip] random_page_cost should be set with the following things taken into account: - seek speed Which is not exactly the same thing as spindle speed, as it's a combination of spindle speed and track-to-track speed. I think you'll find that a 15K RPM disk, whilst it will probably have a lower seek time than a 10K RPM disk, won't have a proportionately (i.e., 2/3rds) lower seek time. - likelihood of page to be cached in memory by the kernel That's effective_cache_size. -- Paul Thomas +--+-+ | Thomas Micro Systems Limited | Software Solutions for the Smaller Business | | Computer Consultants | http://www.thomas-micro-systems-ltd.co.uk | +--+-+
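One hedged way to combine those two factors into a starting value (my own sketch, not something from this thread): treat random_page_cost as the ratio of random to sequential page-fetch cost, discounted by the likelihood a page is already cached. The 13 ms access time and 40 MB/s streaming rate below are illustrative assumptions, not measurements.

```python
# Illustrative random_page_cost estimate: ratio of random to sequential
# page-fetch cost, discounted by the fraction of pages served from cache.
# All input figures are assumptions for illustration only.

PAGE_KB = 8  # PostgreSQL's default page size

def page_cost_ratio(random_access_ms, seq_mb_per_s):
    seq_ms = PAGE_KB / 1024 / seq_mb_per_s * 1000   # ms per sequential page
    return random_access_ms / seq_ms

def effective_rpc(raw_ratio, cache_hit_fraction):
    # cached pages cost roughly the same whether fetched randomly or in order
    return 1 + (raw_ratio - 1) * (1 - cache_hit_fraction)

raw = page_cost_ratio(13.0, 40.0)          # raw ratio on a bare 10K RPM disk
print(round(raw))
print(round(effective_rpc(raw, 0.98), 1))  # heavy caching -> low single digits
```

The raw disk ratio is huge, which is why the kernel-cache term matters: with a mostly cached working set the effective value collapses toward the small numbers (1.5-4) people actually use.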
Re: [PERFORM] Scaling further up
On Tue, Mar 02, 2004 at 04:50:04PM -0500, Anjan Dave wrote: time/resources to do extensive testing, I am not sure if Postgres/Solaris9 is really suggested by the community for high-performance, as opposed to a XEON/Linux setup. Storage being a separate discussion. I can tell you from experience that performance on Solaris is nowhere close to what you'd expect, given the coin you're forking over for it. I think the reason to use Solaris is its support for all the nifty hot-swappable hardware, and not for its speed or any putative benefit you might get from having 64 bits at your disposal. A -- Andrew Sullivan | [EMAIL PROTECTED] The fact that technology doesn't work is no bar to success in the marketplace. --Philip Greenspun
Re: [PERFORM] Scaling further up
On 03/03/2004 18:23 scott.marlowe wrote: [snip] There are three factors that affect how fast you can get to the next sector: seek time settle time rotational latency Most drives only list the first, and don't bother to mention the other two. Ah yes, one of my (very) few still functioning brain cells was nagging about another bit of time in the equation :) On many modern drives, the seek times are around 5 to 10 milliseconds. [snip] Going back to the OPs posting about random_page_cost, imagine I have 2 servers identical in every way except the disk drive. Server A has a 10K rpm drive and server B has a 15K rpm drive. Seek/settle times aren't spectacularly different between the 2 drives. I'm wondering if drive B might actually merit a _higher_ random_page_cost than drive A as, once it gets settled on a disk track, it can suck the data off a lot faster. opinions/experiences anyone? -- Paul Thomas +--+-+ | Thomas Micro Systems Limited | Software Solutions for the Smaller Business | | Computer Consultants | http://www.thomas-micro-systems-ltd.co.uk | +--+-+
Re: [PERFORM] Scaling further up
Anjan Dave wrote: We have a Quad-Intel XEON 2.0GHz (1MB cache), 12GB memory, running RH9, PG 7.4.0. There's an internal U320, 10K RPM RAID-10 setup on 4 drives. We are expecting a pretty high load, a few thousand 'concurrent' users executing select, insert, or update statements. The quick and dirty method would be to upgrade to the recently announced 3GHz Xeon MPs with 4MB of L3. My semi-educated guess is that you'd get another +60% there due to the huge L3 hiding the Xeon's shared bus penalty.
Re: [PERFORM] Scaling further up
On Tue, 2004-03-02 at 17:42, William Yu wrote: Anjan Dave wrote: We have a Quad-Intel XEON 2.0GHz (1MB cache), 12GB memory, running RH9, PG 7.4.0. There's an internal U320, 10K RPM RAID-10 setup on 4 drives. We are expecting a pretty high load, a few thousand 'concurrent' users executing select, insert, or update statements. The quick and dirty method would be to upgrade to the recently announced 3GHz Xeon MPs with 4MB of L3. My semi-educated guess is that you'd get another +60% there due to the huge L3 hiding the Xeon's shared bus penalty. If you are going to have thousands of 'concurrent' users you should seriously consider the 2.6 kernel if you are running Linux, or as an alternative going with FreeBSD. You will need to load test your system and become an expert on tuning Postgres to get the absolute maximum performance from each and every query you have. And you will need lots of hard drives. By lots I mean dozen(s) in a RAID 10 array with a good controller. Thousands of concurrent users means hundreds or thousands of transactions per second. I've personally seen it scale that far, but in my opinion you will need a lot more hard drives and RAM than CPU.
Re: [PERFORM] Scaling further up
By lots I mean dozen(s) in a RAID 10 array with a good controller. I believe, for RAID-10, I will need an even number of drives. Currently, the size of the database is about 13GB, and it is not expected to grow exponentially with thousands of concurrent users, so total space is not of paramount importance compared to performance. Does this sound like a reasonable setup? 10x36GB FC drives on RAID-10 4x36GB FC drives for the logs on RAID-10 (not sure if this is the correct ratio)? 1 hotspare Total=15 drives per enclosure. Tentatively, I am looking at an entry-level EMC CX300 product with 2GB RAID cache, etc. Question - Are 73GB drives supposed to give better performance because of a higher number of platters? Thanks, Anjan -Original Message- From: Fred Moyer [mailto:[EMAIL PROTECTED] Sent: Tuesday, March 02, 2004 5:57 AM To: William Yu; Anjan Dave Cc: [EMAIL PROTECTED] Subject: Re: [PERFORM] Scaling further up On Tue, 2004-03-02 at 17:42, William Yu wrote: Anjan Dave wrote: We have a Quad-Intel XEON 2.0GHz (1MB cache), 12GB memory, running RH9, PG 7.4.0. There's an internal U320, 10K RPM RAID-10 setup on 4 drives. We are expecting a pretty high load, a few thousand 'concurrent' users executing select, insert, or update statements. The quick and dirty method would be to upgrade to the recently announced 3GHz Xeon MPs with 4MB of L3. My semi-educated guess is that you'd get another +60% there due to the huge L3 hiding the Xeon's shared bus penalty. If you are going to have thousands of 'concurrent' users you should seriously consider the 2.6 kernel if you are running Linux, or as an alternative going with FreeBSD. You will need to load test your system and become an expert on tuning Postgres to get the absolute maximum performance from each and every query you have. And you will need lots of hard drives. By lots I mean dozen(s) in a RAID 10 array with a good controller. Thousands of concurrent users means hundreds or thousands of transactions per second.
I've personally seen it scale that far, but in my opinion you will need a lot more hard drives and RAM than CPU.
Re: [PERFORM] Scaling further up
On Tue, 2 Mar 2004, Anjan Dave wrote: By lots I mean dozen(s) in a RAID 10 array with a good controller. I believe, for RAID-10, I will need an even number of drives. Correct. Currently, the size of the database is about 13GB, and it is not expected to grow exponentially with thousands of concurrent users, so total space is not of paramount importance compared to performance. Does this sound like a reasonable setup? 10x36GB FC drives on RAID-10 4x36GB FC drives for the logs on RAID-10 (not sure if this is the correct ratio)? 1 hotspare Total=15 drives per enclosure. Putting the logs on RAID-10 is likely to be slower than, or no faster than, putting them on RAID-1, since the RAID-10 will have to write to 4 drives, while the RAID-1 will only have to write to two drives. Now, if you were reading the logs a lot, it might help to have the RAID-10. Tentatively, I am looking at an entry-level EMC CX300 product with 2GB RAID cache, etc. Pick up a spare, I'll get you my home address, etc... :-) Seriously, that's huge. At that point you may well find that putting EVERYTHING on a big old RAID-5 performs best, since you've got lots of caching / write buffering going on. Question - Are 73GB drives supposed to give better performance because of a higher number of platters? Generally, larger hard drives perform better than smaller hard drives because they a: have more heads and / or b: have a higher areal density. It's a common misconception that faster-RPM drives are a lot faster, when, in fact, their only speed advantage is slightly faster seeks. The areal density of faster-spinning hard drives tends to be somewhat less than that of the slower-spinning drives, since the maximum frequency the heads can work at, on both drives, assuming the same technology, is the same. I.e. the speed at which you can read data off of the platter doesn't usually go up with a higher RPM drive, only the speed with which you can get to the first sector.
Re: [PERFORM] Scaling further up
That was part of my original question - whether it makes sense to go for a mid-range SunFire machine (64-bit HW, 64-bit OS), which is scalable to high amounts of memory, and shouldn't have any issues addressing it all. I've had that kind of setup once temporarily on a V480 (quad UltraSparc, 16GB RAM) machine, and it did well in production use. Without having the time/resources to do extensive testing, I am not sure if Postgres/Solaris9 is really suggested by the community for high performance, as opposed to a XEON/Linux setup. Storage being a separate discussion. Thanks, Anjan -Original Message- From: Chris Ruprecht [mailto:[EMAIL PROTECTED] Sent: Tuesday, March 02, 2004 4:17 PM To: Anjan Dave; [EMAIL PROTECTED]; William Yu Cc: [EMAIL PROTECTED] Subject: Re: [PERFORM] Scaling further up Hi all, If you have a DB of 'only' 13 GB and you do not expect it to grow much, it might be advisable to have enough memory (RAM) to hold the entire DB in shared memory (everything is cached). If you have a server with say 24 GB of memory and can allocate 20 GB for cache, you don't care about the speed of disks any more - all you worry about is the speed of your memory and your network connection. I believe this is not possible using 32-bit technology; you would have to go to some 64-bit platform, but if it's speed you want ... You can also try solid-state hard disk drives. These are actually just memory; there are no moving parts, but they look and behave like very, very fast disk drives. I have seen them at capacities of 73 GB - but they didn't mention the price (I'd probably have a heart attack when I look at the price tag). Best regards, Chris On Tuesday 02 March 2004 14:41, Anjan Dave wrote: By lots I mean dozen(s) in a RAID 10 array with a good controller. I believe, for RAID-10, I will need an even number of drives.
Currently, the size of the database is about 13GB, and it is not expected to grow exponentially with thousands of concurrent users, so total space is not of paramount importance compared to performance.

Does this sound like a reasonable setup?
10x36GB FC drives on RAID-10
4x36GB FC drives for the logs on RAID-10 (not sure if this is the correct ratio)?
1 hotspare
Total = 15 drives per enclosure.

Tentatively, I am looking at an entry-level EMC CX300 product with 2GB RAID cache, etc.

Question - Are 73GB drives supposed to give better performance because of a higher number of platters?

Thanks,
Anjan

-----Original Message-----
From: Fred Moyer [mailto:[EMAIL PROTECTED]
Sent: Tuesday, March 02, 2004 5:57 AM
To: William Yu; Anjan Dave
Cc: [EMAIL PROTECTED]
Subject: Re: [PERFORM] Scaling further up

On Tue, 2004-03-02 at 17:42, William Yu wrote:

Anjan Dave wrote: We have a Quad-Intel XEON 2.0GHz (1MB cache), 12GB memory, running RH9, PG 7.4.0. There's an internal U320, 10K RPM RAID-10 setup on 4 drives. We are expecting a pretty high load, a few thousands of 'concurrent' users executing either select, insert, or update statements.

The quick and dirty method would be to upgrade to the recently announced 3GHz Xeon MPs with 4MB of L3. My semi-educated guess is that you'd get another +60% there due to the huge L3 hiding the Xeon's shared bus penalty.

If you are going to have thousands of 'concurrent' users you should seriously consider the 2.6 kernel if you are running Linux, or, as an alternative, going with FreeBSD. You will need to load test your system and become an expert on tuning Postgres to get the absolute maximum performance from each and every query you have. And you will need lots of hard drives. By lots I mean dozen(s) in a raid 10 array with a good controller. Thousands of concurrent users means hundreds or thousands of transactions per second. I've personally seen it scale that far, but in my opinion you will need a lot more hard drives and RAM than CPU.
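The proposed enclosure layout can be checked with the same kind of quick arithmetic; a sketch using the drive counts above (RAID-10 halves raw capacity because every drive is mirrored):

```shell
# Usable capacity of the proposed 15-drive enclosure (figures from the
# thread: 10 data + 4 log drives in RAID-10, 1 hot spare, 36 GB each).
data_drives=10; log_drives=4; spares=1; drive_gb=36

echo "data_gb = $(( data_drives * drive_gb / 2 ))"              # 180 GB usable for data
echo "log_gb  = $(( log_drives * drive_gb / 2 ))"               # 72 GB usable for WAL
echo "total_drives = $(( data_drives + log_drives + spares ))"  # 15
```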
Re: [PERFORM] Scaling further up
Here's what I recorded today from iostat (linux, iostat -x -k, sda3 is the pg slice, logs included) during peak time on the RAID-10 array - what I see is mostly writes, and sometimes quite a bit of writing, during which the average wait times shoot up.

Device:    rrqm/s wrqm/s  r/s    w/s    rsec/s   wsec/s  rkB/s  wkB/s  avgrq-sz avgqu-sz await  svctm %util
/dev/sda3  18.81  113.21  3.90   36.33  181.54   1207.75 90.77  603.88 34.54    0.49     0.73   0.22  0.87
/dev/sda3  0.00   208.00  0.00   150.00 0.00     2884.00 0.00   1442.00 19.23   0.75     0.50   0.33  5.00
/dev/sda3  0.00   239.00  0.00   169.00 0.00     3264.00 0.00   1632.00 19.31   2.15     1.27   0.33  5.50
/dev/sda3  0.00   224.50  0.00   158.00 0.00     3060.00 0.00   1530.00 19.37   1.90     1.20   0.28  4.50
/dev/sda3  0.00   157.00  0.00   117.00 0.00     2192.00 0.00   1096.00 18.74   0.40     0.34   0.30  3.50
/dev/sda3  0.00   249.50  0.00   179.00 0.00     3596.00 0.00   1798.00 20.09   21.40    10.78  0.39  7.00
/dev/sda3  0.00   637.50  0.00   620.50 0.00     9936.00 0.00   4968.00 16.01   1137.15  183.55 1.85  115.00
/dev/sda3  0.00   690.00  0.00   548.50 0.00     9924.00 0.00   4962.00 18.09   43.10    7.82   0.46  25.50
/dev/sda3  0.00   485.00  0.00   392.00 0.00     7028.00 0.00   3514.00 17.93   86.90    22.21  1.14  44.50
/dev/sda3  0.00   312.50  0.00   206.50 0.00     4156.00 0.00   2078.00 20.13   3.50     1.69   0.53  11.00
/dev/sda3  0.00   386.50  0.00   275.50 0.00     5336.00 0.00   2668.00 19.37   16.80    6.10   0.60  16.50
/dev/sda3  0.00   259.00  0.00   176.50 0.00     3492.00 0.00   1746.00 19.78   3.25     1.84   0.40  7.00
/dev/sda3  0.00   196.00  0.00   99.00  0.00     2360.00 0.00   1180.00 23.84   0.10     0.10   0.10  1.00
/dev/sda3  0.00   147.00  0.00   100.00 0.00     1976.00 0.00   988.00  19.76   0.50     0.50   0.45  4.50
/dev/sda3  0.00   126.50  0.00   94.50  0.00     1768.00 0.00   884.00  18.71   0.20     0.21   0.21  2.00
/dev/sda3  0.00   133.50  0.00   106.50 0.00     1920.00 0.00   960.00  18.03   0.50     0.47   0.47  5.00
/dev/sda3  0.00   146.50  0.00   118.00 0.00     2116.00 0.00   1058.00 17.93   0.20     0.21   0.17  2.00
/dev/sda3  0.00   156.00  0.00   128.50 0.00     2276.00 0.00   1138.00 17.71   0.35     0.27   0.27  3.50
/dev/sda3  0.00   145.00  0.00   105.00 0.00     2000.00 0.00   1000.00 19.05   0.25     0.24   0.24  2.50
/dev/sda3  0.00   72.96   0.00   54.51  0.00     1019.74 0.00   509.87  18.71   0.17     0.31   0.31  1.72
/dev/sda3  0.00   168.50  0.00   139.50 0.00     2464.00 0.00   1232.00 17.66   0.65     0.47   0.39  5.50
/dev/sda3  0.00   130.50  0.00   100.00 0.00     1844.00 0.00   922.00  18.44   0.00     0.00   0.00  0.00
/dev/sda3  0.00   122.00  0.00   101.00 0.00     1784.00 0.00   892.00  17.66   0.25     0.25   0.25  2.50
/dev/sda3  0.00   143.00  0.00   121.50 0.00     2116.00 0.00   1058.00 17.42   0.25     0.21   0.21  2.50
/dev/sda3  0.00   134.50  0.00   96.50  0.00     1848.00 0.00   924.00  19.15   0.35     0.36   0.36  3.50
/dev/sda3  0.00   153.50  0.00   115.00 0.00     2148.00 0.00   1074.00 18.68   0.35     0.30   0.30  3.50
/dev/sda3  0.00   101.50  0.00   80.00  0.00     1452.00 0.00   726.00  18.15   0.20     0.25   0.25  2.00
/dev/sda3  0.00   108.50  0.00   92.50  0.00     1608.00 0.00   804.00  17.38   0.25     0.27   0.27  2.50
/dev/sda3  0.00   179.00  0.00   132.50 0.00     2492.00 0.00   1246.00 18.81   0.55     0.42   0.42  5.50
/dev/sda3  1.00   113.00  1.00   83.00  16.00    1568.00 8.00   784.00  18.86   0.15     0.18   0.12  1.00
/dev/sda3  0.00   117.00  0.00   97.50  0.00     1716.00 0.00   858.00  17.60   0.20     0.21   0.21  2.00
/dev/sda3  0.00   541.00  0.00   415.50 0.00     7696.00 0.00   3848.00 18.52   146.50   35.09  1.37  57.00
/dev/sda3  0.00   535.00  0.00   392.50 0.00     7404.00 0.00   3702.00 18.86   123.70   31.67  1.31  51.50
/dev/sda3  0.00   993.50  0.00   697.50 0.00     13544.00 0.00  6772.00 19.42   174.25   24.98  1.25  87.00
/dev/sda3  0.00   245.00  0.00   108.50 0.00     2832.00 0.00   1416.00 26.10   0.55     0.51   0.51  5.50

-----Original Message-----
From: scott.marlowe [mailto:[EMAIL PROTECTED]
Sent: Tuesday, March 02, 2004 4:16 PM
To: Anjan Dave
Cc: [EMAIL PROTECTED]; William Yu; [EMAIL PROTECTED]
Subject: Re: [PERFORM] Scaling further up

On Tue, 2 Mar 2004, Anjan Dave wrote:

By lots I mean dozen(s) in a raid 10 array with a good controller.

I believe, for RAID-10, I will need an even number of drives.

Correct.

Currently, the size of the database is about 13GB, and it is not expected to grow exponentially with thousands of concurrent users, so total space is not of paramount importance compared to performance.

Does this sound like a reasonable setup?
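The await spikes Anjan points out can be flagged mechanically. A rough sketch, assuming the iostat -x -k column layout shown above (await is the 12th field) and an arbitrary 10 ms threshold; in live use you would pipe `iostat -x -k 30` into the same awk filter instead of the canned sample row:

```shell
# Flag samples whose average wait (await) crosses 10 ms. The sample row is
# one of the spike lines from the trace above; field 12 is await in this
# iostat layout (adjust for other iostat versions).
printf '/dev/sda3 0.00 637.50 0.00 620.50 0.00 9936.00 0.00 4968.00 16.01 1137.15 183.55 1.85 115.00\n' |
awk '$1 == "/dev/sda3" && $12 + 0 > 10 { print "high await:", $12, "ms" }'
# prints: high await: 183.55 ms
```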
10x36GB FC drives on RAID-10
4x36GB FC drives for the logs on RAID-10 (not sure if this is the correct ratio)?
1 hotspare
Total = 15 drives per enclosure.

Putting the Logs
Re: [PERFORM] Scaling further up
On Tue, 2 Mar 2004, Anjan Dave wrote:

That was part of my original question - whether it makes sense to go for a mid-range SunFire machine (64-bit HW, 64-bit OS), which is scalable to high amounts of memory and shouldn't have any issues addressing it all. I've had that kind of setup once, temporarily, on a V480 (quad UltraSparc, 16GB RAM) machine, and it did well in production use. Without having the time/resources to do extensive testing, I am not sure if Postgres/Solaris9 is really suggested by the community for high performance, as opposed to a XEON/Linux setup. Storage being a separate discussion.

Some folks on the list have experience with Postgresql on Solaris, and they generally say they use Solaris not for performance reasons, but for reliability reasons. I.e. the bigger Sun hardware is fault tolerant.

For speed, the X86 32- and 64-bit architectures seem to be noticeably faster than Sparc. However, running Linux or BSD on Sparc makes them pretty fast too, but you lose the fault-tolerant support for things like hot-swappable CPUs or memory.
Re: [PERFORM] Scaling further up
On Tue, Mar 02, 2004 at 02:16:24PM -0700, scott.marlowe wrote:

It's a common misconception that faster-RPM drives are a lot faster, when, in fact, their only speed advantage is slightly faster seeks. The areal density of faster-spinning hard drives tends to be somewhat less than that of slower-spinning drives, since the maximum frequency the heads can work at, assuming the same technology, is the same on both drives. I.e. the speed at which you can read data off of the platter doesn't usually go up with a higher-RPM drive, only the speed with which you can get to the first sector.

This would imply that an upgrade in drive RPM should be accompanied by a decrease in random_page_cost, correct?

random_page_cost should be set with the following things taken into account:
- seek speed
- likelihood of a page being cached in memory by the kernel
- anything else?

Sorry, I realize this pulls the thread a bit off-topic, but I've heard that about RPM speeds before, and I just want some confirmation that my deductions are reasonable.

-johnnn
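For reference, both of johnnn's factors map onto planner settings: random_page_cost covers the seek-speed side, while the likelihood of a page already being cached is conventionally expressed through effective_cache_size. A postgresql.conf sketch (illustrative values only, not recommendations; PG 7.4 defaults random_page_cost to 4 and measures effective_cache_size in 8 KB disk pages):

```
# postgresql.conf (PG 7.4 era) - illustrative values, not tuned advice
random_page_cost = 2.5        # below the default 4 if seeks are cheap
                              # or most pages are expected to be cached
effective_cache_size = 65536  # 8 KB pages (= 512 MB) of expected kernel cache
```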
Re: [PERFORM] Scaling further up
On Mar 2, 2004, at 5:36 PM, scott.marlowe wrote:

Some folks on the list have experience with Postgresql on Solaris, and they generally say they use Solaris not for performance reasons, but for reliability reasons. I.e. the bigger Sun hardware is fault tolerant.

Solaris isn't nearly as bad for PG as it used to be. But as you say - the #1 reason to use Sun is reliability. (In my case, it was because we had a giant Sun lying around :)

I'm trying to remember exactly what happens... but I know on Sun, if the machine has a severe memory error, it kills off processes with data on that DIMM (since it has no idea whether the data is bad or not; thanks to ECC this is very rare, but it can happen). I want to say that if a CPU dies, any processes running on it at that moment are also killed, but the more I think about that, the more I don't think that is the case.

As for x86: if RAM or a CPU goes bad, you're SOL. Although Opterons are sexy, you need to remember they really are brand-new CPUs - I'm sure AMD has done tons of testing, but Sun UltraSparcs have been in every situation conceivable in production. If you are going to really have thousands of users, you probably want to bet the farm on something proven.

Lots and lots of spindles, lots and lots of RAM. You may also want to look into a replication solution as a hot backup.

--
Jeff Trout [EMAIL PROTECTED]
http://www.jefftrout.com/
http://www.stuarthamm.net/
Re: [PERFORM] Scaling further up
On Tue, 2004-03-02 at 18:24, Anjan Dave wrote:

Can you describe the vendors/components of a cheap SAN setup?

heh.. Excellent point.

My point was that you can get away with a smaller setup (fewer disks) when reads and writes are not time-dependent than you will need if you attempt to pull 500MB/sec off the disks.

If it is foreseeable that the database can be held in RAM, that is a much easier and cheaper way to get high I/O than with physical disks.
Re: [PERFORM] Scaling further up
Anjan,

Other than the disks, I am curious what other people are using in terms of the horsepower needed. The Quad server has been keeping up, but we are expecting quite high loads in the near future, and I am not sure if just having the disks on high-end storage will do it.

Do a performance analysis of RH9. My experience with RH on Xeon has been quite discouraging lately, and I've been recommending swapping in stock kernels in place of the RH kernel. Of course, if this is RHES rather than the standard distribution, then talk to RH instead.

--
-Josh Berkus
Aglio Database Solutions
San Francisco