Re: [PERFORM] Optimal settings for RAID controller - optimized for writes
On Tue, Feb 18, 2014 at 2:41 PM, Tomas Vondra wrote:
> On 18.2.2014 02:23, KONDO Mitsumasa wrote:
>> Hi,
>>
>> I don't have a PERC H710 raid controller, but I think he would like to know which raid striping/chunk size or read/write cache ratio in the writeback-cache setting is best. I'd like to know it, too :)
>
> We do have dozens of H710 controllers, but not with SSDs. I've been unable to find reliable answers on how it handles TRIM, and how that works with wearout reporting (using SMART).

AFAIK (I haven't looked for a few months), they don't support TRIM. The only hardware RAID vendor that has even basic TRIM support is Intel, and that's no accident; I have a theory that enterprise storage vendors are deliberately holding back SSDs: SSDs (at least the newer, better ones) destroy the business model for "enterprise storage equipment" in a large percentage of applications. A 2U server with, say, 10 S3700 drives gives *far* superior performance to most SANs that cost under $100k. For about 1/10th of the price.

If you have a server that is I/O constrained as opposed to storage constrained (AKA: a database), hard drives make zero economic sense. If your vendor is jerking you around by charging large multiples of market rates for storage and/or disallowing drives that actually perform well in their storage gear, choose a new vendor. And consider using software raid.

merlin
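[For anyone who wants to try the software-raid route mentioned above, a minimal sketch with Linux mdadm might look like the following. The device names, filesystem and mount point are hypothetical placeholders, and md's discard pass-through for RAID-0/1/10 reportedly requires a reasonably recent kernel - verify on your own setup.]

  # Hypothetical sketch: software RAID-10 over 10 SSDs (/dev/sdb..sdk are placeholders)
  mdadm --create /dev/md0 --level=10 --raid-devices=10 /dev/sd[b-k]

  # Filesystem for the data directory; many setups prefer a periodic fstrim
  # from cron over the 'discard' mount option.
  mkfs.xfs /dev/md0
  mount -o noatime /dev/md0 /srv/pgdata

  # Periodic TRIM, if the kernel/md stack passes discards through:
  fstrim -v /srv/pgdata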
Re: [PERFORM] Optimal settings for RAID controller - optimized for writes
On Wed, Feb 19, 2014 at 8:13 AM, Merlin Moncure wrote:
> On Tue, Feb 18, 2014 at 2:41 PM, Tomas Vondra wrote:
>> On 18.2.2014 02:23, KONDO Mitsumasa wrote:
>>> Hi,
>>>
>>> I don't have a PERC H710 raid controller, but I think he would like to know which raid striping/chunk size or read/write cache ratio in the writeback-cache setting is best. I'd like to know it, too :)
>>
>> We do have dozens of H710 controllers, but not with SSDs. I've been unable to find reliable answers on how it handles TRIM, and how that works with wearout reporting (using SMART).
>
> AFAIK (I haven't looked for a few months), they don't support TRIM. The only hardware RAID vendor that has even basic TRIM support is Intel, and that's no accident; I have a theory that enterprise storage vendors are deliberately holding back SSDs: SSDs (at least the newer, better ones) destroy the business model for "enterprise storage equipment" in a large percentage of applications. A 2U server with, say, 10 S3700 drives gives *far* superior performance to most SANs that cost under $100k. For about 1/10th of the price.
>
> If you have a server that is I/O constrained as opposed to storage constrained (AKA: a database), hard drives make zero economic sense. If your vendor is jerking you around by charging large multiples of market rates for storage and/or disallowing drives that actually perform well in their storage gear, choose a new vendor. And consider using software raid.

You can also do the old trick of underprovisioning and/or underutilizing all the space on SSDs. I.e. put 10 600GB SSDs under a HW RAID controller in RAID-10, then only partition out half the storage you get from that. So you get 1.5TB of storage and the drives are underutilized enough to have spare space.

Right now I'm testing on a machine with 2x Intel E5-2690s (http://ark.intel.com/products/64596/intel-xeon-processor-e5-2690-20m-cache-2_90-ghz-8_00-gts-intel-qpi), 512GB RAM and 6x 600GB Intel SSDs (not sure which ones) under an LSI MegaRAID 9266. I'm able to crank out 6500 to 7200 TPS under pgbench on a scale 1000 db at 8 to 60 clients on that machine. It's not cheap, but storage-wise it's WAY cheaper than most SANs and very fast. pg_xlog is on a pair of nondescript SATA spinners btw.
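[A run resembling the pgbench test described above could be reproduced roughly as follows; the database name, client counts and durations are illustrative guesses, not the exact invocation used.]

  # Initialize a scale-1000 pgbench database (roughly 15 GB of data).
  createdb bench
  pgbench -i -s 1000 bench

  # Run the default TPC-B-like workload at several client counts.
  for c in 8 16 32 60; do
      pgbench -c "$c" -j "$c" -T 300 bench
  done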
Re: [PERFORM] Optimal settings for RAID controller - optimized for writes
Hi,

On 19.2.2014 03:45, KONDO Mitsumasa wrote:
> (2014/02/19 5:41), Tomas Vondra wrote:
>> On 18.2.2014 02:23, KONDO Mitsumasa wrote:
>>> Hi,
>>>
>>> I don't have a PERC H710 raid controller, but I think he would like to know which raid striping/chunk size or read/write cache ratio in the writeback-cache setting is best. I'd like to know it, too :)
>>
>> The stripe size is actually a very good question. On spinning drives it usually does not matter too much - unless you have a very specialized workload, the 'medium size' is the right choice (AFAIK we're using 64kB on the H710, which is the default).
>
> It is interesting that the raid stripe size of the PERC H710 is 64kB. On HP raid cards, the default chunk size is 256kB. If we use two disks with raid 0, the stripe size will be 512kB. I think that might be too big, but it might be optimized in the raid card... Actually, it isn't bad with those settings.

With HP controllers this depends on the RAID level (and maybe even the controller). Which HP controller are you talking about? I have some basic experience with P400/P800, and those have 16kB (RAID6), 64kB (RAID5) or 128kB (RAID10) defaults. None of them has 256kB.

See http://bit.ly/1bN3gIs (P800) and http://bit.ly/MdsEKN (P400).

> I'm interested in the raid card's internal behavior. Fortunately, the linux raid card driver is open source, so we might look at the source code when we have time.

What do you mean by "linux raid card driver"? AFAIK the admin tools may be available, but the interesting stuff happens inside the controller, and that's still proprietary.

>> With SSDs this might actually matter much more, as SSDs work with "erase blocks" (mostly 512kB), and I suspect using a small stripe might result in repeated writes to the same block - overwriting one block repeatedly and thus increased wearout. But maybe the controller will handle that just fine, e.g. by coalescing the writes and sending them to the drive as a single write. Or maybe the drive can do that in its local write cache (all SSDs have that).
>
> I have heard that genuine raid cards with genuine ssds are optimized for those ssds. It is important for performance to use ssds that are compatible with the card. In the worst case, the lifetime of the ssd will be short and performance will be bad.

Well, that's the main question here, right? Because if the "worst case" actually happens to be true, then what's the point of SSDs? You have a disk that does not provide the performance you expected, died much sooner than you expected, and maybe so suddenly that it interrupted operation. So instead of paying more for higher performance, you paid more for bad performance and a much shorter life of the disk.

Coincidentally, we're currently trying to find the answer to this question too. That is - how long will the SSD endure in that particular RAID level? Does that pay off?

BTW what do you mean by "genuine raid card" and "genuine ssds"?

> I'm wondering about the effectiveness of readahead in the OS and the raid card. In general, data read ahead by the raid card is stored in the raid cache, and not in the OS cache. Data read ahead by the OS is stored in the OS cache. I'd like to use all of the raid cache for write caching only, because fsync() becomes faster. But then the raid card cannot do much readahead.. If we hope to use it more effectively, we have to clear it, but it seems difficult :(

I've done a lot of testing of this on the H710 in 2012 (~18 months ago), measuring combinations of

* read-ahead on the controller (adaptive, enabled, disabled)
* read-ahead in the kernel (with various sizes)
* scheduler

The test was the simplest and most suitable workload for this - just "dd" with 1MB block size (AFAIK - I would have to check the scripts).

In short, my findings are that:

* read-ahead in the kernel matters - tweak this
* read-ahead on the controller sucks - it either makes no difference, or actually harms performance (adaptive, with small values set for kernel read-ahead)
* the scheduler made no difference (at least for this workload)

So we disable readahead on the controller, use 24576 for the kernel, and it works fine. I've done the same test with a fusionio iodrive (attached to PCIe, not through the controller) - absolutely no difference.

Tomas
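[A minimal version of that kind of sequential-read test could look roughly like the sketch below. The device name and read-ahead values are placeholders, the original scripts may have differed, and on a production box you would not want to drop the page cache casually.]

  DEV=/dev/sdb                               # placeholder device behind the controller

  for ra in 256 4096 24576; do
      blockdev --setra "$ra" "$DEV"          # kernel read-ahead, in 512-byte sectors
      sync
      echo 3 > /proc/sys/vm/drop_caches      # start from a cold page cache
      dd if="$DEV" of=/dev/null bs=1M count=16384   # ~16 GB sequential read
  done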
Re: [PERFORM] Optimal settings for RAID controller - optimized for writes
On Wed, Feb 19, 2014 at 12:09 PM, Scott Marlowe wrote:
> On Wed, Feb 19, 2014 at 8:13 AM, Merlin Moncure wrote:
>> On Tue, Feb 18, 2014 at 2:41 PM, Tomas Vondra wrote:
>>> On 18.2.2014 02:23, KONDO Mitsumasa wrote:
>>>> Hi,
>>>>
>>>> I don't have a PERC H710 raid controller, but I think he would like to know which raid striping/chunk size or read/write cache ratio in the writeback-cache setting is best. I'd like to know it, too :)
>>>
>>> We do have dozens of H710 controllers, but not with SSDs. I've been unable to find reliable answers on how it handles TRIM, and how that works with wearout reporting (using SMART).
>>
>> AFAIK (I haven't looked for a few months), they don't support TRIM. The only hardware RAID vendor that has even basic TRIM support is Intel, and that's no accident; I have a theory that enterprise storage vendors are deliberately holding back SSDs: SSDs (at least the newer, better ones) destroy the business model for "enterprise storage equipment" in a large percentage of applications. A 2U server with, say, 10 S3700 drives gives *far* superior performance to most SANs that cost under $100k. For about 1/10th of the price.
>>
>> If you have a server that is I/O constrained as opposed to storage constrained (AKA: a database), hard drives make zero economic sense. If your vendor is jerking you around by charging large multiples of market rates for storage and/or disallowing drives that actually perform well in their storage gear, choose a new vendor. And consider using software raid.
>
> You can also do the old trick of underprovisioning and/or underutilizing all the space on SSDs. I.e. put 10 600GB SSDs under a HW RAID controller in RAID-10, then only partition out half the storage you get from that. So you get 1.5TB of storage and the drives are underutilized enough to have spare space.
>
> Right now I'm testing on a machine with 2x Intel E5-2690s (http://ark.intel.com/products/64596/intel-xeon-processor-e5-2690-20m-cache-2_90-ghz-8_00-gts-intel-qpi), 512GB RAM and 6x 600GB Intel SSDs (not sure which ones) under an LSI MegaRAID 9266. I'm able to crank out 6500 to 7200 TPS under pgbench on a scale 1000 db at 8 to 60 clients on that machine. It's not cheap, but storage-wise it's WAY cheaper than most SANs and very fast. pg_xlog is on a pair of nondescript SATA spinners btw.

Yeah -- underprovisioning certainly helps, but for any write-heavy configuration, all else being equal, TRIM support will perform faster and cause less wear. Those drives are likely the older 320-series 600GB. The newer S3700 are much faster, although they cost around twice as much.

merlin
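[As a rough way to check what the drives themselves report - assuming they are visible to the OS directly rather than hidden behind RAID firmware, which is exactly the problem discussed in this thread - something like the following could be used; the device name and mount point are placeholders.]

  DEV=/dev/sda        # placeholder; behind a HW RAID controller this often fails

  # Does the drive advertise TRIM support?
  hdparm -I "$DEV" | grep -i trim
  lsblk --discard "$DEV"           # non-zero DISC-GRAN/DISC-MAX means discards are usable

  # Wear reporting via SMART (Intel drives expose a Media_Wearout_Indicator attribute):
  smartctl -A "$DEV" | grep -i -e wear -e media

  # If discards reach the filesystem, trim it manually / from cron:
  fstrim -v /var/lib/pgsql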
Re: [PERFORM] Optimal settings for RAID controller - optimized for writes
On 19.2.2014 16:13, Merlin Moncure wrote:
> On Tue, Feb 18, 2014 at 2:41 PM, Tomas Vondra wrote:
>> On 18.2.2014 02:23, KONDO Mitsumasa wrote:
>>> Hi,
>>>
>>> I don't have a PERC H710 raid controller, but I think he would like to know which raid striping/chunk size or read/write cache ratio in the writeback-cache setting is best. I'd like to know it, too :)
>>
>> We do have dozens of H710 controllers, but not with SSDs. I've been unable to find reliable answers on how it handles TRIM, and how that works with wearout reporting (using SMART).
>
> AFAIK (I haven't looked for a few months), they don't support TRIM. The only hardware RAID vendor that has even basic TRIM support is Intel, and that's no accident; I have a theory that enterprise storage vendors are deliberately holding back SSDs: SSDs (at least the newer, better ones) destroy the business model for "enterprise storage equipment" in a large percentage of applications. A 2U server with, say, 10 S3700 drives gives *far* superior performance to most SANs that cost under $100k. For about 1/10th of the price.

Yeah, maybe. I'm generally a bit skeptical when it comes to conspiracy theories like this, but over the past year or so we've all learned that they can easily turn out to be true. So maybe ...

Nevertheless, I'd guess this is another case of "Nobody ever got fired for buying X", where X is a storage product based on spinning drives, proven to be reliable, with known operational statistics and a pretty good understanding of how it works. While "Y" is a new thing based on SSDs, which initially got a rather bad reputation because of hype and premature use of consumer-grade products for unsuitable workloads. Also, each vendor of Y uses different tricks, which makes applying experience across vendors (or even across generations of drives from the same vendor) very difficult.

Factor in how conservative DBAs happen to be, and I think it might be this particular feedback loop that keeps the vendors from pushing this.

> If you have a server that is I/O constrained as opposed to storage constrained (AKA: a database), hard drives make zero economic sense. If your vendor is jerking you around by charging large multiples of market rates for storage and/or disallowing drives that actually perform well in their storage gear, choose a new vendor. And consider using software raid.

Yeah, exactly.

Tomas
Re: [PERFORM] Optimal settings for RAID controller - optimized for writes
On 19.2.2014 19:09, Scott Marlowe wrote:
>
> You can also do the old trick of underprovisioning and/or underutilizing all the space on SSDs. I.e. put 10 600GB SSDs under a HW RAID controller in RAID-10, then only partition out half the storage you get from that. So you get 1.5TB of storage and the drives are underutilized enough to have spare space.

Yeah. AFAIK that's basically what Intel did with S3500 -> S3700.

What I'm trying to find is the 'sweet spot' considering lifespan, capacity, performance and price. That's why I'm still wondering if there are some experiences with the current generation of SSDs and RAID controllers, with RAID levels other than RAID-10.

Say I have 8x 400GB SSDs, 75k/32k read/write IOPS each (i.e. basically the S3700 from Intel). Assuming writes are ~25% of the workload, this is what I get for RAID10 vs. RAID6 (math done using http://www.wmarow.com/strcalc/):

          | capacity GB | bandwidth MB/s | IOPS
  --------+-------------+----------------+------
  RAID-10 |        1490 |           2370 | 300k
  RAID-6  |        2230 |           1070 | 130k

Let's say the app can't really generate 130k IOPS (we'll saturate the CPU way before that), so even if the real-world numbers are less than 50% of this, we're not going to hit the disks as the main bottleneck. So let's assume there's no observable performance difference between RAID10 and RAID6 in our case. But we could put 1.5x the amount of data on the RAID6, making it much cheaper (we're talking about non-trivial numbers of such machines).

The question is - how long will it last before the SSDs die because of wearout? Will the RAID controller make it worse by (not) handling TRIM? Will we know how much time we have left, i.e. will the controller provide the info the drives expose through SMART?

> Right now I'm testing on a machine with 2x Intel E5-2690s (http://ark.intel.com/products/64596/intel-xeon-processor-e5-2690-20m-cache-2_90-ghz-8_00-gts-intel-qpi), 512GB RAM and 6x 600GB Intel SSDs (not sure which ones) under an LSI

Most likely S3500. S3700 are not offered with 600GB capacity.

> MegaRAID 9266. I'm able to crank out 6500 to 7200 TPS under pgbench on a scale 1000 db at 8 to 60 clients on that machine. It's not cheap, but storage-wise it's WAY cheaper than most SANs and very fast. pg_xlog is on a pair of nondescript SATA spinners btw.

Nice. I've done some testing with a fusionio iodrive duo (2 devices in RAID0) about a year ago, and I got 12k TPS (or ~15k with WAL on a SAS RAID). So considering the price, the 7.2k TPS is really good IMHO.

regards
Tomas
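[Figures like these can be sanity-checked without the online calculator using a back-of-the-envelope script. This is only an approximation - a simple write-penalty model of 2 back-end writes per front-end write for RAID-10 and 6 for RAID-6 - so it lands near, not exactly on, the numbers above; the drive count and per-drive IOPS are taken from the example.]

  #!/bin/bash
  # Rough RAID-10 vs RAID-6 capacity and IOPS estimate (simple write-penalty model).
  N=8; SIZE_GB=400; R_IOPS=75000; W_IOPS=32000; WRITE_FRAC=0.25

  awk -v n="$N" -v sz="$SIZE_GB" -v r="$R_IOPS" -v w="$W_IOPS" -v wf="$WRITE_FRAC" 'BEGIN {
      gib = 1024 * 1024 * 1024
      rf  = 1 - wf
      printf "RAID-10 capacity: %.0f GiB\n", (n / 2) * sz * 1e9 / gib      # ~1490
      printf "RAID-6  capacity: %.0f GiB\n", (n - 2) * sz * 1e9 / gib      # ~2235
      # effective front-end IOPS: each write costs <penalty> back-end writes
      printf "RAID-10 IOPS: ~%.0fk\n", 1 / (rf / (n * r) + wf * 2 / (n * w)) / 1000   # ~312k
      printf "RAID-6  IOPS: ~%.0fk\n", 1 / (rf / (n * r) + wf * 6 / (n * w)) / 1000   # ~141k
  }'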
Re: [PERFORM] Optimal settings for RAID controller - optimized for writes
On Wed, Feb 19, 2014 at 6:10 PM, Tomas Vondra wrote:
> On 19.2.2014 19:09, Scott Marlowe wrote:
>> Right now I'm testing on a machine with 2x Intel E5-2690s (http://ark.intel.com/products/64596/intel-xeon-processor-e5-2690-20m-cache-2_90-ghz-8_00-gts-intel-qpi), 512GB RAM and 6x 600GB Intel SSDs (not sure which ones) under an LSI
>
> Most likely S3500. S3700 are not offered with 600GB capacity.
>
>> MegaRAID 9266. I'm able to crank out 6500 to 7200 TPS under pgbench on a scale 1000 db at 8 to 60 clients on that machine. It's not cheap, but storage-wise it's WAY cheaper than most SANs and very fast. pg_xlog is on a pair of nondescript SATA spinners btw.
>
> Nice. I've done some testing with a fusionio iodrive duo (2 devices in RAID0) about a year ago, and I got 12k TPS (or ~15k with WAL on a SAS RAID). So considering the price, the 7.2k TPS is really good IMHO.

The part number reported by the LSI is SSDSC2BB600G4, which corresponds to the Intel DC S3500 600GB (an MLC drive). Having done some further testing, I keep well over 6k tps right up to 128 clients. At no time is there any I/O wait under vmstat, and if I turn off fsync the speed goes up by only a tiny amount, so I'm guessing I'm CPU bound at this point. This machine has dual 8-core HT Intel CPUs.

We have another class of machine running on FusionIO IODrive2 MLC cards in RAID-1 and 4x 6-core non-HT CPUs. It's a bit slower (1366 versus 1600MHz memory, slower CPU clocks and interconnects etc.) and it can do about 5k tps, and again, like the other machine, no I/O wait, all CPU bound. I'd say once you get to a certain level of I/O subsystem it gets harder and harder to max it out.

I'd love to have a 64-core 4-socket AMD top-of-the-line system to compare here. But honestly both classes of machines are more than fast enough for what we need, and our major load is from select statements, so fitting the db into RAM is more important than IOPS for what we do.
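[A simple way to confirm the CPU-bound vs. I/O-bound impression during such a run is to watch the system with the stock tools; the database name, client count and duration below are illustrative, not the exact run described above.]

  # Run the benchmark in the background, then watch the system while it runs.
  pgbench -c 128 -j 16 -T 600 bench &

  vmstat 5 5          # 'wa' column = % CPU time spent waiting on I/O
  iostat -x 5 5       # per-device %util and await (from the sysstat package)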
Re: [PERFORM] Problem with ExclusiveLock on inserts
Vladimir,

pgbouncer works with pl/proxy in transaction pooling mode. The widespread statement that statement pooling mode is the one for plproxy does not imply any limitation on transaction pooling mode, unless you have autocommit on the client. Anyway, try to reduce connections.

Try to make your autovacuum a bit more aggressive:

autovacuum_analyze_scale_factor=0.05   # or something like that
autovacuum_analyze_threshold=5
autovacuum_freeze_max_age=2
autovacuum_max_workers=20              # that is fine for slow disks
autovacuum_naptime=1
autovacuum_vacuum_cost_delay=5         # or at least 10
autovacuum_vacuum_cost_limit=-1
autovacuum_vacuum_scale_factor=0.01    # this setting needs to be really aggressive, otherwise you simply postpone huge vacuums and the related disk io; smaller portions are better
autovacuum_vacuum_threshold=20

Probably you will also need some ionice for the autovacuum workers.

On Thu, Feb 13, 2014 at 11:26 AM, Бородин Владимир wrote:
>
> On 13.02.2014, at 13:29, Ilya Kosmodemiansky wrote:
>
> Vladimir,
>
> And, any effect on your problem?
>
> It worked without problems longer than with the previous configuration, but it happened again several minutes ago :(
>
> On Thu, Feb 13, 2014 at 9:35 AM, Бородин Владимир wrote:
>
> I have limited max connections to 1000, reduced shared buffers to 8G and restarted postgres.
>
> 1000 is still too much in most cases. With pgbouncer in transaction pooling mode, a pool size of 8-32, max_connections = 100 (the default value) and 500-1500 client connections normally look more reasonable.
>
> Clients for this db are plproxy hosts. As far as I know plproxy can work only with statement pooling.
>
> I have also noticed that these big tables stopped being vacuumed automatically a couple of weeks ago. It could be the reason of the problem, so I will now try to tune the autovacuum parameters to turn it back on. But yesterday I ran "vacuum analyze" for all relations manually and that did not help.
>
> How do your autovacuum parameters look now?
>
> They were all default except for vacuum_defer_cleanup_age = 10. I have increased autovacuum_max_workers = 20 because I have 10 databases with about 10 tables each. That did not make it better (I haven't seen more than two autovacuum workers simultaneously). Then I tried to set vacuum_cost_limit = 1000. Still not vacuuming the big tables. Right now the parameters look like this:
>
> root@rpopdb01e ~ # fgrep vacuum /var/lib/pgsql/9.3/data/conf.d/postgresql.conf
> #vacuum_cost_delay = 0                  # 0-100 milliseconds
> #vacuum_cost_page_hit = 1               # 0-10000 credits
> #vacuum_cost_page_miss = 10             # 0-10000 credits
> #vacuum_cost_page_dirty = 20            # 0-10000 credits
> vacuum_cost_limit = 1000                # 1-10000 credits
> vacuum_defer_cleanup_age = 10           # number of xacts by which cleanup is delayed
> autovacuum = on                         # Enable autovacuum subprocess?  'on'
>                                         # requires track_counts to also be on.
> log_autovacuum_min_duration = 0         # -1 disables, 0 logs all actions and
>                                         # their durations
> autovacuum_max_workers = 20             # max number of autovacuum subprocesses
> #autovacuum_naptime = 1min              # time between autovacuum runs
> #autovacuum_vacuum_threshold = 50       # min number of row updates before
>                                         # vacuum
> #autovacuum_analyze_threshold = 50      # min number of row updates before
>                                         # analyze
> #autovacuum_vacuum_scale_factor = 0.2   # fraction of table size before vacuum
> #autovacuum_analyze_scale_factor = 0.1  # fraction of table size before analyze
> #autovacuum_freeze_max_age = 200000000  # maximum XID age before forced vacuum
> #autovacuum_vacuum_cost_delay = 20ms    # default vacuum cost delay for
>                                         # autovacuum, in milliseconds;
>                                         # -1 means use vacuum_cost_delay
> #autovacuum_vacuum_cost_limit = -1      # default vacuum cost limit for
>                                         # autovacuum, -1 means use
>                                         # vacuum_cost_limit
> #vacuum_freeze_min_age = 50000000
> #vacuum_freeze_table_age = 150000000
> root@rpopdb01e ~ #
>
> On 13.02.2014, at 0:14, Ilya Kosmodemiansky wrote:
>
> On Wed, Feb 12, 2014 at 8:57 PM, Бородин Владимир wrote:
>
> Yes, this is legacy, I will fix it. We had lots of inactive connections, but right now we use pgbouncer for this. When the workload is normal we have something like 80-120 backends. Less than 10 of them are in active state. When we have the problem with locks we get lots of sessions (sometimes more than 1000 of them are in active state). According to vmstat the number of context switches is not so big (less than 20k), so I don't think it is the main reason. Yes, it can aggravate the problem, but imho not create it.
>
> I'm afraid that is the problem. More than 1000 backends, most of them simply waiting.
>
> I don't understand the correlation of shared buffers size and synchronous_commit.
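[As a quick way to see whether autovacuum is in fact skipping the big tables, and how much dead-tuple bloat has accumulated, a query along these lines can be run against each database; the database name is only a placeholder.]

  psql -d rpopdb -c "
    SELECT relname, n_live_tup, n_dead_tup,
           last_vacuum, last_autovacuum, last_autoanalyze
      FROM pg_stat_user_tables
     ORDER BY n_dead_tup DESC
     LIMIT 10;"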