Re: [PERFORM] New server: SSD/RAID recommendations?
Hi Wes,

1. The first interesting thing is that prior to my mentioning this problem to C_ a year or two back, the power loss protection was advertised everywhere as simply that, without qualifiers about 'not in-flight data'. Check out the marketing of the M500 for the first year or so and try to find an example where they say 'but in-flight data isn't protected!'.

2. The second (and more important) interesting thing is that this is irrelevant! Fsync'd data is BY DEFINITION not data in flight. Fsync means 'this data is secure on the disk'. However, the drives corrupt it. Postgres's sanity depends on a reliable fsync. That's why we see posts on the performance list saying 'fsync=off makes your postgres faster, but really, don't do it in production'. We are talking about internal DB corruption, not just a crash and a few lost transactions.

These drives return from fsync while data is still in volatile cache. That's breaking the spec, and it's why they are not OK for postgres by themselves. This is not about 'in-flight' data, it's about fsync'd WAL data.

Graeme.

On 07 Jul 2015, at 16:15, Wes Vaske (wvaske) wva...@micron.com wrote:

The M500/M550/M600 are consumer-class drives that don't have power protection for all in-flight data* (like the Samsung 8x0 series and the Intel 3x0/5x0 series). The M500DC has full power protection for in-flight data and is an enterprise-class drive (like the Samsung 845DC or Intel S3500/S3700 series). So any drive without the capacitors to protect in-flight data will suffer from data loss if you're using disk write cache and you pull the power.

*Big addendum: There are two issues on power loss that will mess with Postgres: Data Loss and Data Corruption. The Micron consumer drives will have power loss protection against Data Corruption and the enterprise drive will have power loss protection against BOTH. https://www.micron.com/~/media/documents/products/white-paper/wp_ssd_power_loss_protection.pdf

The Data Corruption problem is only an issue in non-SLC NAND, but it's industry-wide. And even though some drives will protect against that, the protection of in-flight data that's been fsync'd is more important and should disqualify *any* consumer drives from *any* company from consideration for use with Postgres.

Wes Vaske | Senior Storage Solutions Engineer Micron Technology

-----Original Message-----
From: Graeme B. Bell [mailto:graeme.b...@nibio.no]
Sent: Tuesday, July 07, 2015 8:26 AM
To: Merlin Moncure
Cc: Wes Vaske (wvaske); Craig James; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] New server: SSD/RAID recommendations?

As I have warned elsewhere, the M500/M550 from $SOME_COMPANY is NOT SUITABLE for postgres unless you have a RAID controller with BBU to protect yourself. The M500/M550 are NOT plug-pull safe despite the 'power loss protection' claimed on the packaging. Not all fsync'd data is preserved in the event of a power loss, which completely undermines postgres's sanity. I would be extremely skeptical about the M500DC given the name and manufacturer.

I went to quite a lot of trouble to provide $SOME_COMPANY's engineers with the full details of this fault after extensive testing (we have e.g. 20-25 of these disks) on multiple machines and controllers, at their request. Result: they stopped replying to me, and soon after I saw their PR reps talking about how 'power loss protection isn't about protecting all data during a power loss'.
The only safe way to use an M500/M550 with postgres is:

a) disable the disk cache, which will cripple performance to about 3-5% of normal.
b) use a battery-backed or cap-backed RAID controller, which will generally hurt performance by limiting you to the peak performance of the flash on the raid controller.

If you are buying such a drive, I strongly recommend buying only one and doing extensive plug-pull testing before committing to several. For myself, my time is valuable enough that it will be cheaper to buy Intel in future.

Graeme.

On 07 Jul 2015, at 15:12, Merlin Moncure mmonc...@gmail.com wrote:

On Thu, Jul 2, 2015 at 1:00 PM, Wes Vaske (wvaske) wva...@micron.com wrote:

Storage Review has a pretty good process and reviewed the M500DC when it released last year. http://www.storagereview.com/micron_m500dc_enterprise_ssd_review

The only database-specific info we have available is for Cassandra and MSSQL:
http://www.micron.com/~/media/documents/products/technical-marketing-brief/cassandra_and_m500dc_enterprise_ssd_tech_brief.pdf
http://www.micron.com/~/media/documents/products/technical-marketing-brief/sql_server_2014_and_m500dc_raid_configuration_tech_brief.pdf
(some of that info might be relevant)

In terms of endurance, the M500DC is rated to 2 Drive Writes Per Day (DWPD) for 5 years. For comparison: Micron M500DC (20nm) - 2 DWPD Intel S3500 (20nm)
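[Editor's note: as a rough sketch of option (a) on Linux with SATA drives, hdparm's -W flag toggles the drive's volatile write cache. The device name is illustrative; SCSI/SAS drives need sdparm instead, and on many drives the setting does not survive a power cycle:

  # check whether the volatile write cache is currently enabled
  hdparm -W /dev/sdX
  # disable it
  hdparm -W 0 /dev/sdX

Expect the kind of write-performance collapse described above once the cache is off.]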
Re: [PERFORM] wildcard text filter switched to boolean column, performance is way worse
Mike Broers mbro...@gmail.com writes: I had a query that was filtering with a wildcard search of a text field for %SUCCESS%. The query took about 5 seconds and was running often so I wanted to improve it. I suggested that the engineers include a new boolean column for successful status. They implemented the requested field, but the query that filters on that new column runs very long (I kill it after letting it run for about an hour). Can someone help me understand why that is the case and how to resolve it?

It's hashing the subplan output in the first case and not the second:

Seq Scan on lead (cost=130951.81..158059.21 rows=139957 width=369) (actual time=4699.619..4699.869 rows=1 loops=1)
  Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text <> ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (hashed SubPlan 3))))

vs

Seq Scan on lead (cost=85775.78..9005687281.12 rows=139957 width=369)
  Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text <> ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (SubPlan 3))))

(note: no "hashed" before SubPlan 3 in the second plan)

Presumably, the new more-accurate rows count causes the planner to realize that the hash table will exceed work_mem so it doesn't choose to hash ... but for your situation, you'd rather it did, because what you're getting instead is a Materialize node that spills to disk (again, because the data involved exceeds work_mem) and that's a killer for this query. You should be able to get back the old behavior if you raise work_mem enough.

Another idea you might think about is changing the OR'd IN conditions to a single IN over a UNION ALL of the subselects. I'm not really sure if that would produce a better plan, but it's worth trying if it wouldn't require too much app-side contortion.

regards, tom lane
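[Editor's note: an untested sketch of that UNION ALL idea, applied to the query from Mike's original post later in this digest. It assumes event.lead_id is never NULL, since NOT IN has surprising semantics when the subselect can return NULLs:

  SELECT *
  FROM lead
  WHERE lead.id NOT IN (
          SELECT U1.lead_id FROM event U1 WHERE U1.event_type = 'type_1'
          UNION ALL
          SELECT U1.lead_id FROM event U1
          WHERE U1.event_type = 'type_2'
            AND U1.lead_id IN (SELECT U2.lead_id FROM event U2 WHERE U2.successful)
        )
    AND NOT (lead.ReferenceNumber = '');

This collapses the OR of the two IN tests into a single NOT IN over the combined id set; whether the planner hashes it any better is exactly what Tom says is worth testing.]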
Re: [PERFORM] New server: SSD/RAID recommendations?
After a plug-pull during the create, reboot and here is the verify:

root@Dbms2:/var/tmp # ./diskchecker.pl -s newfs verify /test/biteme
 verifying: 0.00% ... 3.81% ... 10.91% ... 18.71% ... 26.46% ... 33.95% ... 41.20% ... 49.48% ... 57.23% ... 64.89% ... 72.54% ... 80.04% ... 87.96% ... 95.15% ... 100.00%
Total errors: 0

da6 at mps0 bus 0 scbus0 target 17 lun 0
da6: ATA INTEL SSDSC2BP24 0420 Fixed Direct Access SPC-4 SCSI device
da6: Serial Number BTJR446401KW240AGN
da6: 600.000MB/s transfers
da6: Command Queueing enabled
da6: 228936MB (468862128 512 byte sectors: 255H 63S/T 29185C)

# smartctl -a /dev/da6
=== START OF INFORMATION SECTION ===
Model Family:     Intel 730 and DC S3500/S3700 Series SSDs
Device Model:     INTEL SSDSC2BP240G4
Serial Number:    BTJR446401KW240AGN
LU WWN Device Id: 5 5cd2e4 04b71afc7
Firmware Version: L2010420
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jul 7 17:01:36 2015 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Note -- same firmware between all three series of Intel devices.. :-) Yes, I like these SSDs -- they don't lie and they don't lose data on a power-pull.

On 7/7/2015 08:08, Graeme B. Bell wrote: Thanks, this is very useful to know about the 730. When you say 'tested it with plug-pulls', you were using diskchecker.pl, right? Graeme.

On 07 Jul 2015, at 14:39, Karl Denninger k...@denninger.net wrote: Incidentally, while there are people who have questioned the 730 series power loss protection, I've tested it with plug-pulls and in addition it watchdogs its internal power loss capacitors -- from the smartctl -a display of one of them on an in-service machine here:

175 Power_Loss_Cap_Test 0x0033 100 100 010 Pre-fail Always - 643 (4 6868)

-- Karl Denninger k...@denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/
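[Editor's note: for anyone who wants to reproduce this kind of test, a diskchecker.pl session looks roughly like the following. This is from memory of the tool's documentation, so treat the exact arguments as approximate; 'otherhost' and the 500 (MB) file size are illustrative:

  # on a second machine that stays powered on:
  ./diskchecker.pl -l
  # on the machine under test:
  ./diskchecker.pl -s otherhost create /test/biteme 500
  # pull the plug mid-run, reboot, then:
  ./diskchecker.pl -s otherhost verify /test/biteme]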
Re: [PERFORM] New server: SSD/RAID recommendations?
Why would you think that you don't need RAID for ZFS? Reason I'm asking is because we are moving to ZFS on FreeBSD for our future projects.

Because you have raidz. :-) https://blogs.oracle.com/bonwick/entry/raid_z

General points:

1. It's my understanding that ZFS is designed to talk to the hardware directly, and so it would be bad to hide the physical layer from ZFS unless you had to. After all, I don't think they implemented a raid-like system inside ZFS just for the fun of it.
2. You have raidz built in and easy to manage within ZFS - and well tested compared to NewRaidController (TM) - why add another layer of management to your disk storage?
3. You reintroduce the raid write hole.
4. There might be some argument for hardware raid (existing system), but with software raid (the point I was addressing) it makes little sense at all.
5. If you're on hardware raid and your controller dies, you're screwed in several ways. It's harder to get a new raid controller than a new PC. Your chances of recovery are lower than with zfs. IMHO more scary to recover from a failed raid controller, too.
6. Recovery is faster if the disks aren't full, e.g. ZFS resilvers only the data that is actually there. This might not seem a big deal, but chances are it would save you 50% of your downtime in a crisis.

However, I think with Linux you might want to use RAID for the boot disk. I don't know if Linux can boot from ZFS yet. I would (and am) using FreeBSD with zfs.

Graeme.

On 07 Jul 2015, at 18:56, Wei Shan weishan@gmail.com wrote: Hi Graeme, Why would you think that you don't need RAID for ZFS? Reason I'm asking is because we are moving to ZFS on FreeBSD for our future projects. Regards, Wei Shan

On 8 July 2015 at 00:46, Graeme B. Bell graeme.b...@nibio.no wrote: RAID controllers are completely unnecessary for SSD as they currently exist. Agreed. The best solution is not to buy cheap disks and not to buy RAID controllers now, imho. In my own situation, I had a tight budget, high performance demand and a newish machine with RAID controller and HDDs in it as a starting point. So it was more a question of 'what can you do with a free raid controller and not much money' back in 2013. And it has worked very well. Still, I had hoped for a bit more from the cheaper SSDs though; I'd hoped to use fastpath on the controller and bypass the cache. The way NVMe prices are going though, I wouldn't do it again if I was doing it this year. I'd just go direct to nvme and trash the raid controller. These sammy and intel nvmes are basically enterprise hardware at consumer prices. Heck, I'll probably put one in my next gaming PC. Re: software raid. I agree, but once you accept that software raid is now pretty much superior to hardware raid, you start looking at ZFS and thinking 'why the heck am I even using software raid?' G

--
Regards, Ang Wei Shan
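[Editor's note: as a concrete illustration of points 1 and 2, ZFS's built-in redundancy is managed with one-line zpool commands; the pool and device names here are illustrative:

  # two-way mirror
  zpool create tank mirror da0 da1
  # single-parity raidz across three disks
  zpool create tank raidz da0 da1 da2
  # check pool health and resilver progress
  zpool status tank]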
[PERFORM] wildcard text filter switched to boolean column, performance is way worse
I had a query that was filtering with a wildcard search of a text field for %SUCCESS%. The query took about 5 seconds and was running often so I wanted to improve it. I suggested that the engineers include a new boolean column for successful status. They implemented the requested field, but the query that filters on that new column runs very long (I kill it after letting it run for about an hour). Can someone help me understand why that is the case and how to resolve it?

First query:

SELECT * FROM lead
WHERE (
  NOT (
    lead.id IN ( SELECT U1.lead_id AS lead_id FROM event U1 WHERE U1.event_type = 'type_1' )
    OR (
      lead.id IN ( SELECT U1.lead_id AS lead_id FROM event U1 WHERE U1.event_type = 'type_2' )
      AND lead.id IN ( SELECT U1.lead_id AS lead_id FROM event U1 WHERE UPPER(U1.response::text) LIKE UPPER('%success%') )
    )
  )
  AND NOT (lead.ReferenceNumber = '')
);

explain/analyze result:

Seq Scan on lead (cost=130951.81..158059.21 rows=139957 width=369) (actual time=4699.619..4699.869 rows=1 loops=1)
  Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text <> ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (hashed SubPlan 3))))
  Rows Removed by Filter: 375369
  SubPlan 1
    -> Seq Scan on event u1 (cost=0.00..42408.62 rows=7748 width=4) (actual time=0.005..171.350 rows=7414 loops=1)
         Filter: ((event_type)::text = 'type_1'::text)
         Rows Removed by Filter: 1099436
  SubPlan 2
    -> Seq Scan on event u1_1 (cost=0.00..42408.62 rows=375665 width=4) (actual time=0.006..219.092 rows=373298 loops=1)
         Filter: ((event_type)::text = 'type_2'::text)
         Rows Removed by Filter: 733552
  SubPlan 3
    -> Seq Scan on event u1_2 (cost=0.00..45175.75 rows=111 width=4) (actual time=0.040..3389.550 rows=712952 loops=1)
         Filter: (upper(response) ~~ '%SUCCESS%'::text)
         Rows Removed by Filter: 393898

The main thing that sticks out to me for this plan is the low estimate for the rows it will return on the %SUCCESS% filter.

Here is the second query with explain:

SELECT * FROM lead
WHERE (
  NOT (
    lead.id IN ( SELECT U1.lead_id AS lead_id FROM event U1 WHERE U1.event_type = 'type_1' )
    OR (
      lead.id IN ( SELECT U1.lead_id AS lead_id FROM event U1 WHERE U1.event_type = 'type_2' )
      AND lead.id IN ( SELECT U1.lead_id AS lead_id FROM event U1 WHERE successful )
    )
  )
  AND NOT (lead.ReferenceNumber = '')
);

explain result:

Seq Scan on lead (cost=85775.78..9005687281.12 rows=139957 width=369)
  Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text <> ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (SubPlan 3))))
  SubPlan 1
    -> Seq Scan on event u1 (cost=0.00..42408.62 rows=7748 width=4)
         Filter: ((event_type)::text = 'type_1'::text)
  SubPlan 2
    -> Seq Scan on event u1_1 (cost=0.00..42408.62 rows=375665 width=4)
         Filter: ((event_type)::text = 'type_2'::text)
  SubPlan 3
    -> Materialize (cost=0.00..46154.43 rows=731185 width=4)
         -> Seq Scan on event u1_2 (cost=0.00..39641.50 rows=731185 width=4)
              Filter: successful

Here it does a materialize and estimates rows properly, but as stated this query just hangs and pegs load. There are no locks and it's in an active state the whole time. I am running these queries in a test environment on a recently exported full schema from production, with a reindex and a vacuum/analyze. This is postgres 9.3.6 on RHEL6.

When I run just the different subquery element, SELECT U1.lead_id AS lead_id FROM event U1 WHERE successful; it returns in about 250ms; with the text field %SUCCESS% it runs in about 4 seconds.
This seemed like a low-hanging-fruit query improvement so I'm surprised it's not working; it seems like we are just lucky that the planner is estimating that filter incorrectly in the original form. I'm sure the query just needs to be completely overhauled and I am starting to pull it apart and work with the engineers to get something more efficient set up overall, but I am not sure how to answer the question as to why this original attempt at improving the query is not successful. Any guidance is greatly appreciated, thanks!
Re: [PERFORM] New server: SSD/RAID recommendations?
On 07 Jul 2015, at 16:59, Heikki Linnakangas hlinn...@iki.fi wrote: So it lies about fsync()... The next question is, does it nevertheless enforce the correct ordering of persisting fsync'd data? If you write to file A and fsync it, then write to another file B and fsync it too, is it guaranteed that if B is persisted, A is as well? Because if it isn't, you can end up with filesystem (or database) corruption anyway.

On Tue, Jul 7, 2015 at 10:58 AM, Graeme B. Bell graeme.b...@nibio.no wrote: Yikes. I would not be able to sleep tonight if it were not for the BBU cache in front of these disks... diskchecker.pl consistently reported several examples of corruption post-power-loss (usually 10-30) on unprotected M500s/M550s, so I think it's pretty much open to debate what types of madness and corruption you'll find if you look close enough.

100% agree with your sentiments. I do believe that there are other enterprise SSD vendors that offer reliable parts, but not at the price point Intel does for the cheaper drives. The consumer-grade vendors are simply not trustworthy unless proven otherwise (I had my own unpleasant experience with OCZ, for example). Intel played the same game with their early parts but have since become a model of how to ship drives to the market.

RAID controllers are completely unnecessary for SSD as they currently exist. Software raid is superior in every way; the hardware features of raid controllers (BBU, write caching, and write consolidation) are redundant to what the SSDs themselves do, being themselves RAID 0 basically. A hypothetical SSD-optimized raid controller is possible; it could do things like balance wear and optimize writes across multiple physical drives. This would require deep participation between the drive and the controller, and FWICT no such thing exists, excepting super-expensive SANs, which I don't recommend anyway.

merlin
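[Editor's note: for reference, the kind of Linux software raid Merlin is describing is set up with a single mdadm command; the device names are illustrative, not from this thread:

  # RAID10 across four SSDs
  mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
  # watch rebuild/sync state
  cat /proc/mdstat]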
Re: [PERFORM] New server: SSD/RAID recommendations?
RAID controllers are completely unnecessary for SSD as they currently exist. Agreed. The best solution is not to buy cheap disks and not to buy RAID controllers now, imho. In my own situation, I had a tight budget, high performance demand and a newish machine with RAID controller and HDDs in it as a starting point. So it was more a question of 'what can you do with a free raid controller and not much money' back in 2013. And it has worked very well. Still, I had hoped for a bit more from the cheaper SSDs though; I'd hoped to use fastpath on the controller and bypass the cache. The way NVMe prices are going though, I wouldn't do it again if I was doing it this year. I'd just go direct to nvme and trash the raid controller. These sammy and intel nvmes are basically enterprise hardware at consumer prices. Heck, I'll probably put one in my next gaming PC. Re: software raid. I agree, but once you accept that software raid is now pretty much superior to hardware raid, you start looking at ZFS and thinking 'why the heck am I even using software raid?' G
Re: [PERFORM] New server: SSD/RAID recommendations?
Hi. How would a BBU cache help you if the drive lies about fsync? I suppose any RAID controller removes data from the BBU cache after it was fsynced by the drive. As far as I know, there is no other magic command for the drive to tell the controller that the data is safe now and can be removed from the BBU cache.

Tue, 7 Jul 2015 11:59 Graeme B. Bell graeme.b...@nibio.no wrote: Yikes. I would not be able to sleep tonight if it were not for the BBU cache in front of these disks... diskchecker.pl consistently reported several examples of corruption post-power-loss (usually 10-30) on unprotected M500s/M550s, so I think it's pretty much open to debate what types of madness and corruption you'll find if you look close enough. G

On 07 Jul 2015, at 16:59, Heikki Linnakangas hlinn...@iki.fi wrote: So it lies about fsync()... The next question is, does it nevertheless enforce the correct ordering of persisting fsync'd data? If you write to file A and fsync it, then write to another file B and fsync it too, is it guaranteed that if B is persisted, A is as well? Because if it isn't, you can end up with filesystem (or database) corruption anyway. - Heikki
Re: [PERFORM] wildcard text filter switched to boolean column, performance is way worse
After bumping up work_mem from 12MB to 25MB that last materialize is indeed hashing, and this cut the query time by about 60%. Thanks, this was very helpful and gives me something else to look for when troubleshooting explains.

On Tue, Jul 7, 2015 at 11:10 AM, Mike Broers mbro...@gmail.com wrote: Thanks, very informative! I'll experiment with work_mem settings and report back.

On Tue, Jul 7, 2015 at 11:02 AM, Tom Lane t...@sss.pgh.pa.us wrote: Mike Broers mbro...@gmail.com writes: I had a query that was filtering with a wildcard search of a text field for %SUCCESS%. The query took about 5 seconds and was running often so I wanted to improve it. I suggested that the engineers include a new boolean column for successful status. They implemented the requested field, but the query that filters on that new column runs very long (I kill it after letting it run for about an hour). Can someone help me understand why that is the case and how to resolve it?

It's hashing the subplan output in the first case and not the second:

Seq Scan on lead (cost=130951.81..158059.21 rows=139957 width=369) (actual time=4699.619..4699.869 rows=1 loops=1)
  Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text <> ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (hashed SubPlan 3))))

vs

Seq Scan on lead (cost=85775.78..9005687281.12 rows=139957 width=369)
  Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text <> ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (SubPlan 3))))

(note: no "hashed" before SubPlan 3 in the second plan)

Presumably, the new more-accurate rows count causes the planner to realize that the hash table will exceed work_mem so it doesn't choose to hash ... but for your situation, you'd rather it did, because what you're getting instead is a Materialize node that spills to disk (again, because the data involved exceeds work_mem) and that's a killer for this query. You should be able to get back the old behavior if you raise work_mem enough. Another idea you might think about is changing the OR'd IN conditions to a single IN over a UNION ALL of the subselects. I'm not really sure if that would produce a better plan, but it's worth trying if it wouldn't require too much app-side contortion.

regards, tom lane
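[Editor's note: the change Mike describes can be trialled per session before touching postgresql.conf; the value shown is just the one that worked here:

  SET work_mem = '25MB';
  EXPLAIN (ANALYZE, BUFFERS) SELECT ... ;  -- re-run the problem query

Keep in mind work_mem applies per sort/hash node per backend, so raising it globally is riskier than raising it for one session.]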
Re: [PERFORM] New server: SSD/RAID recommendations?
The comment on HDDs is true and gave me another thought. These new 'shingled' HDDs (the 8TB ones) rely on rewriting all the data on tracks that overlap your data, any time you change the data. Result: disks 8-20x slower during writes, after they fill up. Do they have power loss protection for the data being rewritten during reshingling? You could have data committed at position X and accidentally nuke data at position Y.

[I know that using a shingled disk sounds crazy (it sounds crazy to me) but you can bet there are people that just want to max out the disk bays in their server... ]

Graeme.

On 07 Jul 2015, at 19:28, Michael Nolan htf...@gmail.com wrote: On Tue, Jul 7, 2015 at 10:59 AM, Heikki Linnakangas hlinn...@iki.fi wrote: On 07/07/2015 05:15 PM, Wes Vaske (wvaske) wrote: The M500/M550/M600 are consumer-class drives that don't have power protection for all in-flight data* (like the Samsung 8x0 series and the Intel 3x0/5x0 series). The M500DC has full power protection for in-flight data and is an enterprise-class drive (like the Samsung 845DC or Intel S3500/S3700 series). So any drive without the capacitors to protect in-flight data will suffer from data loss if you're using disk write cache and you pull the power.

Wow, I would be pretty angry if I installed an SSD in my desktop, and it loses a file that I saved just before pulling the power plug.

That can (and does) happen with spinning disks, too.

*Big addendum: There are two issues on power loss that will mess with Postgres: Data Loss and Data Corruption. The Micron consumer drives will have power loss protection against Data Corruption and the enterprise drive will have power loss protection against BOTH. https://www.micron.com/~/media/documents/products/white-paper/wp_ssd_power_loss_protection.pdf The Data Corruption problem is only an issue in non-SLC NAND, but it's industry-wide. And even though some drives will protect against that, the protection of in-flight data that's been fsync'd is more important and should disqualify *any* consumer drives from *any* company from consideration for use with Postgres.

So it lies about fsync()... The next question is, does it nevertheless enforce the correct ordering of persisting fsync'd data? If you write to file A and fsync it, then write to another file B and fsync it too, is it guaranteed that if B is persisted, A is as well? Because if it isn't, you can end up with filesystem (or database) corruption anyway.

- Heikki

The sad fact is that MANY drives (ssd as well as spinning) lie about their fsync status.
--
Mike Nolan
Re: [PERFORM] New server: SSD/RAID recommendations?
Yikes. I would not be able to sleep tonight if it were not for the BBU cache in front of these disks... diskchecker.pl consistently reported several examples of corruption post-power-loss (usually 10-30) on unprotected M500s/M550s, so I think it's pretty much open to debate what types of madness and corruption you'll find if you look close enough. G

On 07 Jul 2015, at 16:59, Heikki Linnakangas hlinn...@iki.fi wrote: So it lies about fsync()... The next question is, does it nevertheless enforce the correct ordering of persisting fsync'd data? If you write to file A and fsync it, then write to another file B and fsync it too, is it guaranteed that if B is persisted, A is as well? Because if it isn't, you can end up with filesystem (or database) corruption anyway. - Heikki
Re: [PERFORM] Sudden connection and load average spikes with postgresql 9.3
Note that if you still have the settings you showed in your original post, you're just moving the goal posts a few feet further back. Any heavy load can still trigger this kind of behaviour.

On Tue, Jul 7, 2015 at 5:29 AM, eudald_v reaven.galaeind...@gmail.com wrote: Hello guys! I finally got rid of it. It looks like in the end it was all due to the transparent_hugepage setting. I disabled it and the CPU spikes disappeared. I am sorry, because it's something I usually disable on postgresql servers, but I forgot to do so on this one and never thought about it. Thanks a lot for all your helpful messages! Eudald

--
To understand recursion, one must first understand recursion.
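[Editor's note: for reference, on most Linux distributions the runtime check and disable look like this. The sysfs path varies (some RHEL kernels use redhat_transparent_hugepage), and the setting must be reapplied at boot, e.g. from rc.local or a kernel command-line flag:

  # check current setting
  cat /sys/kernel/mm/transparent_hugepage/enabled
  # disable until next boot
  echo never > /sys/kernel/mm/transparent_hugepage/enabled
  echo never > /sys/kernel/mm/transparent_hugepage/defrag]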
Re: [PERFORM] wildcard text filter switched to boolean column, performance is way worse
Thanks, very informative! I'll experiment with work_mem settings and report back.

On Tue, Jul 7, 2015 at 11:02 AM, Tom Lane t...@sss.pgh.pa.us wrote: Mike Broers mbro...@gmail.com writes: I had a query that was filtering with a wildcard search of a text field for %SUCCESS%. The query took about 5 seconds and was running often so I wanted to improve it. I suggested that the engineers include a new boolean column for successful status. They implemented the requested field, but the query that filters on that new column runs very long (I kill it after letting it run for about an hour). Can someone help me understand why that is the case and how to resolve it?

It's hashing the subplan output in the first case and not the second:

Seq Scan on lead (cost=130951.81..158059.21 rows=139957 width=369) (actual time=4699.619..4699.869 rows=1 loops=1)
  Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text <> ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (hashed SubPlan 3))))

vs

Seq Scan on lead (cost=85775.78..9005687281.12 rows=139957 width=369)
  Filter: ((NOT (hashed SubPlan 1)) AND ((ReferenceNumber)::text <> ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (SubPlan 3))))

(note: no "hashed" before SubPlan 3 in the second plan)

Presumably, the new more-accurate rows count causes the planner to realize that the hash table will exceed work_mem so it doesn't choose to hash ... but for your situation, you'd rather it did, because what you're getting instead is a Materialize node that spills to disk (again, because the data involved exceeds work_mem) and that's a killer for this query. You should be able to get back the old behavior if you raise work_mem enough. Another idea you might think about is changing the OR'd IN conditions to a single IN over a UNION ALL of the subselects. I'm not really sure if that would produce a better plan, but it's worth trying if it wouldn't require too much app-side contortion.

regards, tom lane
Re: [PERFORM] New server: SSD/RAID recommendations?
Hi Graeme, Why would you think that you don't need RAID for ZFS? Reason I'm asking is because we are moving to ZFS on FreeBSD for our future projects. Regards, Wei Shan

On 8 July 2015 at 00:46, Graeme B. Bell graeme.b...@nibio.no wrote: RAID controllers are completely unnecessary for SSD as they currently exist. Agreed. The best solution is not to buy cheap disks and not to buy RAID controllers now, imho. In my own situation, I had a tight budget, high performance demand and a newish machine with RAID controller and HDDs in it as a starting point. So it was more a question of 'what can you do with a free raid controller and not much money' back in 2013. And it has worked very well. Still, I had hoped for a bit more from the cheaper SSDs though; I'd hoped to use fastpath on the controller and bypass the cache. The way NVMe prices are going though, I wouldn't do it again if I was doing it this year. I'd just go direct to nvme and trash the raid controller. These sammy and intel nvmes are basically enterprise hardware at consumer prices. Heck, I'll probably put one in my next gaming PC. Re: software raid. I agree, but once you accept that software raid is now pretty much superior to hardware raid, you start looking at ZFS and thinking 'why the heck am I even using software raid?' G

--
Regards, Ang Wei Shan
Re: [PERFORM] New server: SSD/RAID recommendations?
On Tue, Jul 7, 2015 at 11:43 AM, Graeme B. Bell graeme.b...@nibio.no wrote: The comment on HDDs is true and gave me another thought. These new 'shingled' HDDs (the 8TB ones) rely on rewriting all the data on tracks that overlap your data, any time you change the data. Result: disks 8-20x slower during writes, after they fill up. Do they have power loss protection for the data being rewritten during reshingling? You could have data committed at position X and accidentally nuke data at position Y. [I know that using a shingled disk sounds crazy (it sounds crazy to me) but you can bet there are people that just want to max out the disk bays in their server... ]

Let's just say no online backup companies are using those disks. :) Biggest current production spinners being used I know of are 4TB, non-shingled.
Re: [PERFORM] New server: SSD/RAID recommendations?
On 07 Jul 2015, at 19:47, Scott Marlowe scott.marl...@gmail.com wrote: [I know that using a shingled disk sounds crazy (it sounds crazy to me) but you can bet there are people that just want to max out the disk bays in their server... ] Let's just say no online backup companies are using those disks. :)

I'm not so sure. Literally the most famous online backup company is (or was planning to): https://www.backblaze.com/blog/6-tb-hard-drive-face-off/ But I think that a massive read-only archive really is the only use for these things. I hope they go out of fashion, soon. But I was thinking more of the 'small company postgres server' or 'charitable organisation postgres server'. Someone is going to make this mistake, you can bet. Probably not someone on THIS list, of course...

Biggest current production spinners being used I know of are 4TB, non-shingled.

I think we may have some 6TB WD reds around here. I'll need to look around. G
Re: [PERFORM] New server: SSD/RAID recommendations?
On Tue, Jul 7, 2015 at 11:46 AM, Graeme B. Bell graeme.b...@nibio.no wrote: RAID controllers are completely unnecessary for SSD as they currently exist. Agreed. The best solution is not to buy cheap disks and not to buy RAID controllers now, imho. In my own situation, I had a tight budget, high performance demand and a newish machine with RAID controller and HDDs in it as a starting point. So it was more a question of 'what can you do with a free raid controller and not much money' back in 2013. And it has worked very well. Still, I had hoped for a bit more from the cheaper SSDs though; I'd hoped to use fastpath on the controller and bypass the cache. The way NVMe prices are going though, I wouldn't do it again if I was doing it this year. I'd just go direct to nvme and trash the raid controller. These sammy and intel nvmes are basically enterprise hardware at consumer prices. Heck, I'll probably put one in my next gaming PC. Re: software raid. I agree, but once you accept that software raid is now pretty much superior to hardware raid, you start looking at ZFS and thinking 'why the heck am I even using software raid?'

Good point. At least for me, I've yet to jump on the ZFS bandwagon and so don't have an opinion on it. merlin
Re: [PERFORM] New server: SSD/RAID recommendations?
On 07/07/2015 10:59 PM, Graeme B. Bell wrote: Cache flushing isn't an atomic operation though. Even if the ordering is right, you are likely to have a partial fsync on the disk when the lights go out - isn't your FS still corrupt?

If the filesystem is worth its salt, no. Journaling filesystems for example rely on the journal to work around that problem, and there are other mechanisms. PostgreSQL has exactly the same problem and uses the WAL to solve it. - Heikki
Re: [PERFORM] New server: SSD/RAID recommendations?
Cache flushing isn't an atomic operation though. Even if the ordering is right, you are likely to have a partial fsync on the disk when the lights go out - isn't your FS still corrupt?

On 07 Jul 2015, at 21:53, Heikki Linnakangas hlinn...@iki.fi wrote: On 07/07/2015 09:01 PM, Wes Vaske (wvaske) wrote: Right, to be precise, the problem isn't that the drive lies about fsync(). It lies about FLUSH CACHE instead. Search and replace fsync() with FLUSH CACHE, and the same question remains: When the drive breaks its promise wrt. FLUSH CACHE, does it nevertheless guarantee that the order the data is eventually flushed to disk is consistent with the order in which the data and FLUSH CACHE commands were sent to the drive? That's an important distinction, because it makes the difference between "the most recent data the application saved might be lost even though the FLUSH CACHE command returned" and "your filesystem is corrupt".
Re: [PERFORM] New server: SSD/RAID recommendations?
Regarding: “lie about their fsync status.”

This is mostly semantics but it might help google searches on the issue. A drive doesn’t support fsync(); that’s a filesystem/kernel process. A drive will do a FLUSH CACHE. Before kernels 2.6.(low numbers), the fsync() call wouldn’t send any ATA or SCSI command to flush the disk cache. Whereas—AFAICT—modern kernels and file system versions *will* do this. When ‘sync’ is called, the filesystem will issue the appropriate command to the disk to flush the write cache. For ATA, this is “FLUSH CACHE” (E7h). To check support for the command use:

[root@postgres ~]# smartctl --identify /dev/sdu | grep "FLUSH CACHE"
  83 13 1 FLUSH CACHE EXT supported
  83 12 1 FLUSH CACHE supported
  86 13 1 FLUSH CACHE EXT supported
  86 12 1 FLUSH CACHE supported

The 1s in the 3rd column represent SUPPORTED for the feature listed in the last column.

Cheers, Wes Vaske

From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Michael Nolan
Sent: Tuesday, July 07, 2015 12:28 PM
To: hlinn...@iki.fi
Cc: Wes Vaske (wvaske); Graeme B. Bell; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] New server: SSD/RAID recommendations?

On Tue, Jul 7, 2015 at 10:59 AM, Heikki Linnakangas hlinn...@iki.fi wrote: On 07/07/2015 05:15 PM, Wes Vaske (wvaske) wrote: The M500/M550/M600 are consumer-class drives that don't have power protection for all in-flight data* (like the Samsung 8x0 series and the Intel 3x0/5x0 series). The M500DC has full power protection for in-flight data and is an enterprise-class drive (like the Samsung 845DC or Intel S3500/S3700 series). So any drive without the capacitors to protect in-flight data will suffer from data loss if you're using disk write cache and you pull the power.

Wow, I would be pretty angry if I installed an SSD in my desktop, and it loses a file that I saved just before pulling the power plug.

That can (and does) happen with spinning disks, too.

*Big addendum: There are two issues on power loss that will mess with Postgres: Data Loss and Data Corruption. The Micron consumer drives will have power loss protection against Data Corruption and the enterprise drive will have power loss protection against BOTH. https://www.micron.com/~/media/documents/products/white-paper/wp_ssd_power_loss_protection.pdf The Data Corruption problem is only an issue in non-SLC NAND, but it's industry-wide. And even though some drives will protect against that, the protection of in-flight data that's been fsync'd is more important and should disqualify *any* consumer drives from *any* company from consideration for use with Postgres.

So it lies about fsync()... The next question is, does it nevertheless enforce the correct ordering of persisting fsync'd data? If you write to file A and fsync it, then write to another file B and fsync it too, is it guaranteed that if B is persisted, A is as well? Because if it isn't, you can end up with filesystem (or database) corruption anyway.

- Heikki

The sad fact is that MANY drives (ssd as well as spinning) lie about their fsync status.
--
Mike Nolan
Re: [PERFORM] New server: SSD/RAID recommendations?
On 07/07/2015 09:01 PM, Wes Vaske (wvaske) wrote: Regarding: “lie about their fsync status.” This is mostly semantics but it might help google searches on the issue. A drive doesn’t support fsync(); that’s a filesystem/kernel process. A drive will do a FLUSH CACHE. Before kernels 2.6.(low numbers), the fsync() call wouldn’t send any ATA or SCSI command to flush the disk cache. Whereas—AFAICT—modern kernels and file system versions *will* do this. When ‘sync’ is called, the filesystem will issue the appropriate command to the disk to flush the write cache. For ATA, this is “FLUSH CACHE” (E7h). To check support for the command use:

[root@postgres ~]# smartctl --identify /dev/sdu | grep "FLUSH CACHE"
  83 13 1 FLUSH CACHE EXT supported
  83 12 1 FLUSH CACHE supported
  86 13 1 FLUSH CACHE EXT supported
  86 12 1 FLUSH CACHE supported

The 1s in the 3rd column represent SUPPORTED for the feature listed in the last column.

Right, to be precise, the problem isn't that the drive lies about fsync(). It lies about FLUSH CACHE instead. Search and replace fsync() with FLUSH CACHE, and the same question remains: When the drive breaks its promise wrt. FLUSH CACHE, does it nevertheless guarantee that the order the data is eventually flushed to disk is consistent with the order in which the data and FLUSH CACHE commands were sent to the drive? That's an important distinction, because it makes the difference between "the most recent data the application saved might be lost even though the FLUSH CACHE command returned" and "your filesystem is corrupt".

- Heikki
Re: [PERFORM] New server: SSD/RAID recommendations?
That is a very good question, which I have raised elsewhere on the postgresql lists previously.

In practice: I have *never* managed to make diskchecker fail with the BBU enabled in front of the drives, and I spent days trying with plug pulls till I reached the point where, as a statistical event, it just can't be that likely at all. That's not to say it can't ever happen, just that I've taken all reasonable measures that I can to find out on the time and money budget I had available.

In theory: It may be the fact that the BBU makes the drives run at about half speed, so that the capacitors go a good bit further to empty the cache; after all, without the BBU in the way, the drive manages to save everything but the last fragment of writes. But I also suspect that the controller itself may be replaying the last set of writes from around the time of power loss. Anyway, I'm 50/50 on those two explanations. Any other thoughts welcome.

This raises another interesting question. Does anyone here have a document explaining how their BBU cache works EXACTLY (at cache / sata level) on their server? Because I haven't been able to find any for mine (Dell PERC H710/H710P). Can anyone tell me with godlike authority and precision what exactly happens inside that BBU post-power failure? There is rather too much magic involved for me to be happy.

G

On 07 Jul 2015, at 18:27, Vitalii Tymchyshyn v...@tym.im wrote: Hi. How would a BBU cache help you if the drive lies about fsync? I suppose any RAID controller removes data from the BBU cache after it was fsynced by the drive. As far as I know, there is no other magic command for the drive to tell the controller that the data is safe now and can be removed from the BBU cache.

Tue, 7 Jul 2015 11:59 Graeme B. Bell graeme.b...@nibio.no wrote: Yikes. I would not be able to sleep tonight if it were not for the BBU cache in front of these disks... diskchecker.pl consistently reported several examples of corruption post-power-loss (usually 10-30) on unprotected M500s/M550s, so I think it's pretty much open to debate what types of madness and corruption you'll find if you look close enough. G

On 07 Jul 2015, at 16:59, Heikki Linnakangas hlinn...@iki.fi wrote: So it lies about fsync()... The next question is, does it nevertheless enforce the correct ordering of persisting fsync'd data? If you write to file A and fsync it, then write to another file B and fsync it too, is it guaranteed that if B is persisted, A is as well? Because if it isn't, you can end up with filesystem (or database) corruption anyway. - Heikki
Re: [PERFORM] New server: SSD/RAID recommendations?
This raises another interesting question. Does anyone here have a document explaining how their BBU cache works EXACTLY (at cache / sata level) on their server? Because I haven't been able to find any for mine (Dell PERC H710/H710P). Can anyone tell me with godlike authority and precision what exactly happens inside that BBU post-power failure?

(And if you have that manual - how can you know it's accurate? That the implementation matches the manual and is free of bugs? Because my M500s didn't match the packaging, and neither did an H710 we bought - Dell had advertised features in some marketing material that were only present on the H710P.)

And I see UBER (unrecoverable bit error) rates for SSDs and HDDs, but has anyone ever seen them for the flash-based cache on their raid controller?

Sleep well, friends.

Graeme.

On 07 Jul 2015, at 18:54, Graeme B. Bell graeme.b...@nibio.no wrote: That is a very good question, which I have raised elsewhere on the postgresql lists previously. In practice: I have *never* managed to make diskchecker fail with the BBU enabled in front of the drives, and I spent days trying with plug pulls till I reached the point where, as a statistical event, it just can't be that likely at all. That's not to say it can't ever happen, just that I've taken all reasonable measures that I can to find out on the time and money budget I had available. In theory: It may be the fact that the BBU makes the drives run at about half speed, so that the capacitors go a good bit further to empty the cache; after all, without the BBU in the way, the drive manages to save everything but the last fragment of writes. But I also suspect that the controller itself may be replaying the last set of writes from around the time of power loss. Anyway, I'm 50/50 on those two explanations. Any other thoughts welcome. This raises another interesting question. Does anyone here have a document explaining how their BBU cache works EXACTLY (at cache / sata level) on their server? Because I haven't been able to find any for mine (Dell PERC H710/H710P). Can anyone tell me with godlike authority and precision what exactly happens inside that BBU post-power failure? There is rather too much magic involved for me to be happy. G

On 07 Jul 2015, at 18:27, Vitalii Tymchyshyn v...@tym.im wrote: Hi. How would a BBU cache help you if the drive lies about fsync? I suppose any RAID controller removes data from the BBU cache after it was fsynced by the drive. As far as I know, there is no other magic command for the drive to tell the controller that the data is safe now and can be removed from the BBU cache.

Tue, 7 Jul 2015 11:59 Graeme B. Bell graeme.b...@nibio.no wrote: Yikes. I would not be able to sleep tonight if it were not for the BBU cache in front of these disks... diskchecker.pl consistently reported several examples of corruption post-power-loss (usually 10-30) on unprotected M500s/M550s, so I think it's pretty much open to debate what types of madness and corruption you'll find if you look close enough. G

On 07 Jul 2015, at 16:59, Heikki Linnakangas hlinn...@iki.fi wrote: So it lies about fsync()... The next question is, does it nevertheless enforce the correct ordering of persisting fsync'd data? If you write to file A and fsync it, then write to another file B and fsync it too, is it guaranteed that if B is persisted, A is as well? Because if it isn't, you can end up with filesystem (or database) corruption anyway.
- Heikki
Re: [PERFORM] New server: SSD/RAID recommendations?
On Tue, Jul 7, 2015 at 10:59 AM, Heikki Linnakangas hlinn...@iki.fi wrote: On 07/07/2015 05:15 PM, Wes Vaske (wvaske) wrote: The M500/M550/M600 are consumer-class drives that don't have power protection for all in-flight data* (like the Samsung 8x0 series and the Intel 3x0/5x0 series). The M500DC has full power protection for in-flight data and is an enterprise-class drive (like the Samsung 845DC or Intel S3500/S3700 series). So any drive without the capacitors to protect in-flight data will suffer from data loss if you're using disk write cache and you pull the power.

Wow, I would be pretty angry if I installed an SSD in my desktop, and it loses a file that I saved just before pulling the power plug.

That can (and does) happen with spinning disks, too.

*Big addendum: There are two issues on power loss that will mess with Postgres: Data Loss and Data Corruption. The Micron consumer drives will have power loss protection against Data Corruption and the enterprise drive will have power loss protection against BOTH. https://www.micron.com/~/media/documents/products/white-paper/wp_ssd_power_loss_protection.pdf The Data Corruption problem is only an issue in non-SLC NAND, but it's industry-wide. And even though some drives will protect against that, the protection of in-flight data that's been fsync'd is more important and should disqualify *any* consumer drives from *any* company from consideration for use with Postgres.

So it lies about fsync()... The next question is, does it nevertheless enforce the correct ordering of persisting fsync'd data? If you write to file A and fsync it, then write to another file B and fsync it too, is it guaranteed that if B is persisted, A is as well? Because if it isn't, you can end up with filesystem (or database) corruption anyway.

- Heikki

The sad fact is that MANY drives (ssd as well as spinning) lie about their fsync status.
--
Mike Nolan
Re: [PERFORM] Hmmm... why does pl/pgsql code parallelise so badly when queries parallelise fine? Anyone else seen this?
Hi Merlin,

Long story short - thanks for the reply, but you're not measuring anything about the parallelism of code running in a pl/pgsql environment here. You're just measuring whether postgres can parallelise entering that environment and getting back out. Don't get me wrong - it's great that this scales well, because it affects situations where you have lots of calls to trivial functions. However, it's not the problem I'm talking about. I mean 'real' pl/pgsql functions, e.g. things that you might find in postgis or similar. If you re-read my previous email or look at par_psql (http://parpsql.com) and the benchmarks there, you'll maybe see more about what I'm talking about.

To clear up the issue I built a little test harness around your comment below. If anyone was wondering whether it's par_psql itself that causes bad scaling in postgres: the answer is clearly no. :-)

What I found this evening is that there are several problems here. I did some testing using a machine with 16 physical cores and lots of memory/IO.

- Using a table as a source of input rather than a fixed parameter, e.g. 'select col1... ' vs. 'select 3'. Please note I am not talking about poor performance, I am talking about poor scaling of performance to multicore. There should be no reason for this when read-locks are being taken on the table, and no reason for this when it is combined with e.g. a bunch of pl/pgsql work in a function. However, the impact of this problem is only seen above 8 cores, where performance crashes.

- Using pl/pgsql itself intensively (e.g. anything non-trivial) causes horrifically bad scaling above 2 cores on the systems I've tested, and performance crashes very hard soon after. This matches what I've seen elsewhere in big projects and in par_psql's tests.

Of course, it could be some wacky postgresql.conf setting (I doubt it here), so I'd be glad if others could give it a try. If you're bored, set the time to 5s and run; from testing I can tell you it shouldn't alter the results. The repo will be up in around 30 minutes' time on http://github.com/gbb/t, and I'm going to submit it as a bug to the pg bugs list.

Graeme.

On 06 Jul 2015, at 18:40, Merlin Moncure mmonc...@gmail.com wrote: On Fri, Jul 3, 2015 at 9:48 AM, Graeme B. Bell graeme.b...@nibio.no wrote: Hi everyone, I've written a new open source tool for easily parallelising SQL scripts in postgres. [obligatory plug: https://github.com/gbb/par_psql ] Using it, I'm seeing a problem that I've also seen in other postgres projects involving high degrees of parallelisation in the last 12 months.

Basically:
- I have machines here with up to 16 CPU cores and 128GB memory, very fast SSDs and controller etc, carefully configured kernel/postgresql.conf for high performance.
- Ordinary queries parallelise nearly perfectly (e.g. SELECT some_stuff ...), e.g. almost up to 16x performance improvement.
- Non-DB stuff like GDAL, python etc. parallelise nearly perfectly.
- HOWEVER calls to CPU-intensive user-defined pl/pgsql functions (e.g. SELECT myfunction(some_stuff)) do not parallelise well, even when they are independently defined functions, or accessing tables in a read-only way. They hit a limit of 2.5x performance improvement relative to single-CPU performance (pg9.4) and merely 2x performance (pg9.3), regardless of how many CPU cores I throw at them. This is about 6 times slower than I'm expecting.

I can't see what would be locking. It seems like it's the pl/pgsql environment itself that is somehow locking or incurring some huge frictional costs.
Whether I use independently defined functions, independent source tables, or independent output tables makes no difference whatsoever, so it doesn't feel 'lock-related'. It also doesn't seem to be WAL/synchronisation related, as the machines I'm using can hit absurdly high pgbench rates, and I'm using unlogged tables for output. Take a quick peek here: https://github.com/gbb/par_psql/blob/master/BENCHMARKS.md I'm wondering what I'm missing here. Any ideas?

I'm not necessarily seeing your results. Via pgbench:

mmoncure@mernix2 11:34 AM ~$ ~/pgdev/bin/pgbench -n -T 60 -f b.sql
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 658833
latency average: 0.091 ms
tps = 10980.538470 (including connections establishing)
tps = 10980.994547 (excluding connections establishing)

mmoncure@mernix2 11:35 AM ~$ ~/pgdev/bin/pgbench -n -T 60 -c4 -j4 -f b.sql
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 4
duration: 60 s
number of transactions actually processed: 2847631
latency average: 0.084 ms
tps = 47460.430447 (including connections establishing)
tps = 47463.702074 (excluding connections establishing)

b.sql: select
Re: [PERFORM] Hmmm... why does pl/pgsql code parallelise so badly when queries parallelise fine? Anyone else seen this?
On Tue, Jul 7, 2015 at 3:33 PM, Graeme B. Bell graeme.b...@nibio.no wrote: Hi Merlin, Long story short - thanks for the reply, but you're not measuring anything about the parallelism of code running in a pl/pgsql environment here. You're just measuring whether postgres can parallelise entering that environment and getting back out. Don't get me wrong - it's great that this scales well, because it affects situations where you have lots of calls to trivial functions. However, it's not the problem I'm talking about. I mean 'real' pl/pgsql functions, e.g. things that you might find in postgis or similar.

Maybe so. But it will be a lot easier for me (and others on this list) if you submit a self-contained test case that runs via pgbench. From there it's a simple matter of a perf top and other standard locking diagnostic tests, and it also rules out any suspicion of 3rd party issues. This will also get better feedback on -bugs. merlin
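[Editor's note: to make concrete what such a test case might look like, here is a hypothetical sketch; the function name, loop count and client counts are invented, not taken from Graeme's repo. It is a CPU-bound pl/pgsql function driven by pgbench at increasing client counts, so the scaling curve is directly visible:

  CREATE OR REPLACE FUNCTION busy_work(n int) RETURNS numeric AS $$
  DECLARE
    acc numeric := 0;
  BEGIN
    -- pure pl/pgsql CPU work: no table access, no locks
    FOR i IN 1..n LOOP
      acc := acc + sqrt(i);
    END LOOP;
    RETURN acc;
  END;
  $$ LANGUAGE plpgsql;

  -- b.sql contains just:  SELECT busy_work(100000);
  -- then compare, e.g.:
  --   pgbench -n -T 60 -c1  -j1  -f b.sql
  --   pgbench -n -T 60 -c16 -j16 -f b.sql]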
Re: [PERFORM] New server: SSD/RAID recommendations?
1. Does the sammy nvme have *complete* power loss protection though, for all fsync'd data? I have been very badly burned by my experiences with Crucial SSDs and their 'power loss protection', which doesn't actually ensure all fsync'd data gets into flash. It certainly looks pretty with all those capacitors on top in the photos, but we need some plug-pull tests to be sure. 2. Apologies for the typo in the previous post, raidz5 should have been raidz1. 3. Also, something to think about when you start using single-disk solutions (or non-ZFS RAID, for that matter): SSDs are very unlike HDDs. The Samsung NVMe drive has a UBER (uncorrectable bit error rate) measured at 1 in 10^17 bits read. That's one bad bit in roughly 12500 TB read, a good number. Chances are the drive fails before you hit a bit error, and if not, ZFS would catch it. Current HDDs, by contrast, are at the 1 in 10^14 level. That means an error roughly every 12.5 TB read, by the specs. In other words, every time you fill your cheap 6-8TB Seagate drive, it has likely corrupted some of your data *even if it performed according to the spec*. (That's also why RAID5 isn't viable for rebuilding large arrays, incidentally.) Graeme Bell On 07 Jul 2015, at 12:56, Mkrtchyan, Tigran tigran.mkrtch...@desy.de wrote: - Original Message - From: Graeme B. Bell graeme.b...@nibio.no To: Mkrtchyan, Tigran tigran.mkrtch...@desy.de Cc: Graeme B. Bell graeme.b...@nibio.no, Steve Crawford scrawf...@pinpointresearch.com, Wes Vaske (wvaske) wva...@micron.com, pgsql-performance pgsql-performance@postgresql.org Sent: Tuesday, July 7, 2015 12:38:10 PM Subject: Re: [PERFORM] New server: SSD/RAID recommendations? I am unsure about the performance side, but ZFS is generally very attractive to me. Key advantages: 1) Checksumming and automatic fixing-of-broken-things on every file (not just postgres pages, but your scripts, O/S, program files). 2) Built-in lightweight compression (doesn't help with TOAST tables, in fact may slow them down, but helpful for other things). This may actually be a net negative for pg, so maybe turn it off. 3) ZRAID mirroring or ZRAID5/6. If you have trouble persuading someone that it's safe to replace a RAID array with a single drive... you can use a couple of NVMe SSDs with a ZFS mirror or zraid, and get the same availability you'd get from a RAID controller. Slightly better, arguably, since they claim to have fixed the RAID write-hole problem. 4) Filesystem snapshotting. Despite the costs of checksumming etc., I suspect ZRAID running on a fast CPU with multiple NVMe drives will outperform quite a lot of the alternatives, with great data integrity guarantees. We are planning to have a test setup as well. For now I have a single NVMe SSD on my test system: # lspci | grep NVM 85:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 171X (rev 03) # mount | grep nvm /dev/nvme0n1p1 on /var/lib/pgsql/9.5 type ext4 (rw,noatime,nodiratime,data=ordered) and am quite happy with it. We have a write-heavy workload on it to see when it will break. Postgres performs very well: about 2.5x faster than with regular disks with a single client, and almost linear with multiple clients (picture attached; Y is the number of high-level op/s our application does, X the number of clients). The setup has been in use for the last 3 months. Looks promising, but for production we need disks twice as big as on the test system. Until today, I was planning to use a RAID10 with a HW controller... Related to ZFS: we use ZFSonLinux and behaviour is not as good as with Solaris.
Let's re-phrase it: performance is unpredictable. We run RAIDZ2 with 30x3TB disks. Tigran. Haven't built one yet. Hope to, later this year. Steve, I would love to know more about how you're getting on with your NVMe disk in postgres! Graeme. On 07 Jul 2015, at 12:28, Mkrtchyan, Tigran tigran.mkrtch...@desy.de wrote: Thanks for the Info. So if RAID controllers are not an option, what should one use to build big databases? LVM with xfs? BtrFs? Zfs? Tigran. - Original Message - From: Graeme B. Bell graeme.b...@nibio.no To: Steve Crawford scrawf...@pinpointresearch.com Cc: Wes Vaske (wvaske) wva...@micron.com, pgsql-performance pgsql-performance@postgresql.org Sent: Tuesday, July 7, 2015 12:22:00 PM Subject: Re: [PERFORM] New server: SSD/RAID recommendations? Completely agree with Steve. 1. Intel NVMe looks like the best bet if you have modern enough hardware for NVMe. Otherwise e.g. the S3700 mentioned elsewhere. 2. RAID controllers. We have e.g. 10-12 of these here and e.g. 25-30 SSDs, among various machines. This might give people an idea about where the risk lies in the path from disk to CPU. We've had 2 RAID card failures in the last 12 months that nuked the array with days of downtime, and 2 problems with batteries suddenly becoming useless or suddenly reporting wildly varying
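To spell out the UBER arithmetic from the top of this message (a quick sanity check, runnable in any psql session):

    -- 1 uncorrectable bit error per 10^17 bits read (NVMe SSD spec)
    -- vs. 1 per 10^14 (consumer HDD spec), converted to TB read per error:
    SELECT round((10^17 / 8 / 10^12)::numeric)    AS tb_per_error_ssd,  -- 12500 TB
           round((10^14 / 8 / 10^12)::numeric, 1) AS tb_per_error_hdd;  -- 12.5 TB

So a 6TB HDD filled twice has, on average, already passed its expected error interval, while the SSD figure is three orders of magnitude better.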
Re: [PERFORM] Sudden connection and load average spikes with postgresql 9.3
Hello guys! I finally got rid of it. It looks like, in the end, it was all down to the transparent hugepages setting. I disabled them and the CPU spikes disappeared. I'm sorry, because it's something I usually disable on postgresql servers, but I forgot to do so on this one and never thought about it. Thanks a lot for all your helpful messages! Eudald
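For anyone who finds this thread later: disabling transparent hugepages typically looks like the following (a sketch; the exact sysfs path varies by distribution and kernel - older RHEL kernels use /sys/kernel/mm/redhat_transparent_hugepage - and you will want to make the change persistent via your boot configuration as well):

    # Check the current setting (the bracketed value is the active one):
    cat /sys/kernel/mm/transparent_hugepage/enabled

    # Disable THP and its defrag behaviour for the running system:
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag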
Re: [PERFORM] New server: SSD/RAID recommendations?
Hi Karl, Great post, thanks. Though I don't think it's against conventional wisdom to aggregate writes into larger blocks rather than rely on 4k performance on SSDs :-) 128kb blocks + compression certainly makes sense. It might make less sense, I suppose, if you had some incredibly high rate of churn in your rows. But for the work we do here, we could use 16MB blocks for all the difference it would make. (Tip to others: don't do that. 128kb blocks are already enough to max out the IO bus to most SSDs.) Do you have your WAL log on a compressed zfs fs? Graeme Bell On 07 Jul 2015, at 13:28, Karl Denninger k...@denninger.net wrote: LZ4 compression and the standard 128kb block size have shown to be materially faster here than 8kb blocks and no compression, both with rotating disks and SSDs. This is workload-dependent in my experience, but in the applications we put Postgres to there is a very material improvement in throughput using compression and the larger blocksize, which is counter-intuitive and also opposite the conventional wisdom. For best throughput we use mirrored vdev sets.
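For reference, the dataset settings being discussed would look something like this (a sketch only - the pool/dataset names are hypothetical; recordsize=128k is in fact ZFS's default, and lz4 is the compression Karl mentions):

    # Create a postgres data filesystem with 128k records and LZ4 compression:
    zfs create -o recordsize=128k -o compression=lz4 -o atime=off tank/pgdata
    zfs get recordsize,compression tank/pgdata   # verify the settings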
Re: [PERFORM] New server: SSD/RAID recommendations?
On 7/7/2015 06:52, Graeme B. Bell wrote: Hi Karl, Great post, thanks. Though I don't think it's against conventional wisdom to aggregate writes into larger blocks rather than rely on 4k performance on SSDs :-) 128kb blocks + compression certainly makes sense. It might make less sense, I suppose, if you had some incredibly high rate of churn in your rows. But for the work we do here, we could use 16MB blocks for all the difference it would make. (Tip to others: don't do that. 128kb blocks are already enough to max out the IO bus to most SSDs.) Do you have your WAL log on a compressed zfs fs? Graeme Bell Yes. Data goes on one mirrored set of vdevs, pg_xlog goes on a second, separate pool. Archived WAL goes on a third pool on RaidZ2. Archived WAL typically goes on rotating storage, since I use it (and a basebackup) as disaster recovery (and, in hot-spare applications, as the source for syncing the hot standbys), and it's nearly a big-block-write-only data stream. Rotating media is fine for that in most applications. I take a new basebackup at reasonable intervals and rotate the WAL logs to keep that from growing without boundary. I use LSI host adapters for the drives themselves (no hardware RAID); I'm currently running on FreeBSD 10.1. Be aware that ZFS on FreeBSD has some fairly nasty issues that I developed (and publish) a patch for; without it, some workloads can result in very undesirable behavior where the working set gets paged out in favor of ZFS ARC; if that happens your performance will go straight into the toilet. Back before FreeBSD 9, when ZFS was simply not stable enough for me, I used ARECA hardware RAID adapters and rotating media with BBUs and large cache memory installed on them, with UFS filesystems. Hardware adapters are, however, a net loss in a ZFS environment even when they nominally work well (and they frequently interact very badly with ZFS during certain operations, making them just flat-out unsuitable). All in, I far prefer ZFS on a host adapter to UFS on a RAID adapter, from both a data integrity and a performance standpoint. My SSD drives of choice are all Intel; for lower-end requirements the 730s work very well; the S3500 is next, and if your write volume is high enough the S3700 has much greater endurance (but at a correspondingly higher price). All three are properly power-fail protected. All three are much, much faster than rotating storage. If you can saturate the SATA channels and need still more I/O throughput, NVMe drives are the next quantum up in performance; I'm not there with our application at the present time. Incidentally, while there are people who have questioned the 730 series power loss protection, I've tested it with plug-pulls, and in addition it watchdogs its internal power loss capacitors -- from the smartctl -a display of one of them on an in-service machine here:

175 Power_Loss_Cap_Test 0x0033 100 100 010 Pre-fail Always - 643 (4 6868)

-- Karl Denninger k...@denninger.net /The Market Ticker/ /[S/MIME encrypted email preferred]/
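To check the same attribute on one of your own drives (smartctl ships with smartmontools; the device path below is hypothetical):

    # List all SMART attributes and pick out the capacitor self-test result:
    smartctl -a /dev/sda | grep Power_Loss_Cap_Test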
Re: [PERFORM] New server: SSD/RAID recommendations?
Thanks for the Info. So if RAID controllers are not an option, what should one use to build big databases? LVM with xfs? BtrFs? Zfs? Tigran. - Original Message - From: Graeme B. Bell graeme.b...@nibio.no To: Steve Crawford scrawf...@pinpointresearch.com Cc: Wes Vaske (wvaske) wva...@micron.com, pgsql-performance pgsql-performance@postgresql.org Sent: Tuesday, July 7, 2015 12:22:00 PM Subject: Re: [PERFORM] New server: SSD/RAID recommendations? Completely agree with Steve. 1. Intel NVMe looks like the best bet if you have modern enough hardware for NVMe. Otherwise e.g. the S3700 mentioned elsewhere. 2. RAID controllers. We have e.g. 10-12 of these here and e.g. 25-30 SSDs, among various machines. This might give people an idea about where the risk lies in the path from disk to CPU. We've had 2 RAID card failures in the last 12 months that nuked the array with days of downtime, and 2 problems with batteries suddenly becoming useless or suddenly reporting wildly varying temperatures/overheating. There may have been other RAID problems I don't know about. Our IT dept were replacing Seagate HDDs last year at a rate of 2-3 per week (I guess they have 100-200 disks?). We also have about 25-30 Hitachi/HGST HDDs. So by my estimates: 30% annual problem rate with RAID controllers; 30-50% failure rate with Seagate HDDs (Backblaze saw similar results); 0% failure rate with HGST HDDs; 0% failure in our SSDs (to be fair, our one Samsung SSD apparently has a TRIM bug under Linux, which I'll need to investigate to see if we have been affected). Also, RAID controllers aren't free - not just the money but also the management of them (ever tried writing a complex install script that interacts with MegaCLI? It can be done, but it's not much fun). Just take a look at the MegaCLI manual and ask yourself... is this even worth it, given a good MTBF on an enterprise SSD? RAID was meant to be about ensuring availability of data. I have trouble believing that these days. Graeme Bell On 06 Jul 2015, at 18:56, Steve Crawford scrawf...@pinpointresearch.com wrote: 2. We don't typically have redundant electronic components in our servers. Sure, we have dual power supplies and dual NICs (though generally to handle external failures) and ECC-RAM, but no hot-backup CPU or redundant RAM banks and... no backup RAID card. Intel Enterprise SSDs already have power-fail protection, so I don't need a RAID card to give me a BBU. Given the MTBF of good enterprise SSDs, I'm left to wonder if placing a RAID card in front merely adds a new point of failure and scheduled-downtime-inducing hands-on maintenance (I'm looking at you, RAID backup battery).
Re: [PERFORM] New server: SSD/RAID recommendations?
I am unsure about the performance side, but ZFS is generally very attractive to me. Key advantages: 1) Checksumming and automatic fixing-of-broken-things on every file (not just postgres pages, but your scripts, O/S, program files). 2) Built-in lightweight compression (doesn't help with TOAST tables, in fact may slow them down, but helpful for other things). This may actually be a net negative for pg, so maybe turn it off. 3) ZRAID mirroring or ZRAID5/6. If you have trouble persuading someone that it's safe to replace a RAID array with a single drive... you can use a couple of NVMe SSDs with a ZFS mirror or zraid, and get the same availability you'd get from a RAID controller (a sketch of this follows at the end of this message). Slightly better, arguably, since they claim to have fixed the RAID write-hole problem. 4) Filesystem snapshotting. Despite the costs of checksumming etc., I suspect ZRAID running on a fast CPU with multiple NVMe drives will outperform quite a lot of the alternatives, with great data integrity guarantees. Haven't built one yet. Hope to, later this year. Steve, I would love to know more about how you're getting on with your NVMe disk in postgres! Graeme. On 07 Jul 2015, at 12:28, Mkrtchyan, Tigran tigran.mkrtch...@desy.de wrote: Thanks for the Info. So if RAID controllers are not an option, what should one use to build big databases? LVM with xfs? BtrFs? Zfs? Tigran. - Original Message - From: Graeme B. Bell graeme.b...@nibio.no To: Steve Crawford scrawf...@pinpointresearch.com Cc: Wes Vaske (wvaske) wva...@micron.com, pgsql-performance pgsql-performance@postgresql.org Sent: Tuesday, July 7, 2015 12:22:00 PM Subject: Re: [PERFORM] New server: SSD/RAID recommendations? Completely agree with Steve. 1. Intel NVMe looks like the best bet if you have modern enough hardware for NVMe. Otherwise e.g. the S3700 mentioned elsewhere. 2. RAID controllers. We have e.g. 10-12 of these here and e.g. 25-30 SSDs, among various machines. This might give people an idea about where the risk lies in the path from disk to CPU. We've had 2 RAID card failures in the last 12 months that nuked the array with days of downtime, and 2 problems with batteries suddenly becoming useless or suddenly reporting wildly varying temperatures/overheating. There may have been other RAID problems I don't know about. Our IT dept were replacing Seagate HDDs last year at a rate of 2-3 per week (I guess they have 100-200 disks?). We also have about 25-30 Hitachi/HGST HDDs. So by my estimates: 30% annual problem rate with RAID controllers; 30-50% failure rate with Seagate HDDs (Backblaze saw similar results); 0% failure rate with HGST HDDs; 0% failure in our SSDs (to be fair, our one Samsung SSD apparently has a TRIM bug under Linux, which I'll need to investigate to see if we have been affected). Also, RAID controllers aren't free - not just the money but also the management of them (ever tried writing a complex install script that interacts with MegaCLI? It can be done, but it's not much fun). Just take a look at the MegaCLI manual and ask yourself... is this even worth it, given a good MTBF on an enterprise SSD? RAID was meant to be about ensuring availability of data. I have trouble believing that these days. Graeme Bell On 06 Jul 2015, at 18:56, Steve Crawford scrawf...@pinpointresearch.com wrote: 2. We don't typically have redundant electronic components in our servers. Sure, we have dual power supplies and dual NICs (though generally to handle external failures) and ECC-RAM, but no hot-backup CPU or redundant RAM banks and... no backup RAID card.
Intel Enterprise SSDs already have power-fail protection, so I don't need a RAID card to give me a BBU. Given the MTBF of good enterprise SSDs, I'm left to wonder if placing a RAID card in front merely adds a new point of failure and scheduled-downtime-inducing hands-on maintenance (I'm looking at you, RAID backup battery).
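As a concrete illustration of the mirroring mentioned in point 3 of this message, creating a mirrored ZFS pool over two NVMe drives is a one-liner (a sketch; the pool name and device paths are hypothetical):

    # Mirror two NVMe drives; ashift=12 forces 4k sector alignment.
    zpool create -o ashift=12 pgpool mirror /dev/nvme0n1 /dev/nvme1n1
    zpool status pgpool   # shows the mirror vdev and both devices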
Re: [PERFORM] New server: SSD/RAID recommendations?
- Original Message - From: Graeme B. Bell graeme.b...@nibio.no To: Mkrtchyan, Tigran tigran.mkrtch...@desy.de Cc: Graeme B. Bell graeme.b...@nibio.no, Steve Crawford scrawf...@pinpointresearch.com, Wes Vaske (wvaske) wva...@micron.com, pgsql-performance pgsql-performance@postgresql.org Sent: Tuesday, July 7, 2015 12:38:10 PM Subject: Re: [PERFORM] New server: SSD/RAID recommendations? I am unsure about the performance side, but ZFS is generally very attractive to me. Key advantages: 1) Checksumming and automatic fixing-of-broken-things on every file (not just postgres pages, but your scripts, O/S, program files). 2) Built-in lightweight compression (doesn't help with TOAST tables, in fact may slow them down, but helpful for other things). This may actually be a net negative for pg, so maybe turn it off. 3) ZRAID mirroring or ZRAID5/6. If you have trouble persuading someone that it's safe to replace a RAID array with a single drive... you can use a couple of NVMe SSDs with a ZFS mirror or zraid, and get the same availability you'd get from a RAID controller. Slightly better, arguably, since they claim to have fixed the RAID write-hole problem. 4) Filesystem snapshotting. Despite the costs of checksumming etc., I suspect ZRAID running on a fast CPU with multiple NVMe drives will outperform quite a lot of the alternatives, with great data integrity guarantees. We are planning to have a test setup as well. For now I have a single NVMe SSD on my test system: # lspci | grep NVM 85:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 171X (rev 03) # mount | grep nvm /dev/nvme0n1p1 on /var/lib/pgsql/9.5 type ext4 (rw,noatime,nodiratime,data=ordered) and am quite happy with it. We have a write-heavy workload on it to see when it will break. Postgres performs very well: about 2.5x faster than with regular disks with a single client, and almost linear with multiple clients (picture attached; Y is the number of high-level op/s our application does, X the number of clients). The setup has been in use for the last 3 months. Looks promising, but for production we need disks twice as big as on the test system. Until today, I was planning to use a RAID10 with a HW controller... Related to ZFS: we use ZFSonLinux and behaviour is not as good as with Solaris. Let's re-phrase it: performance is unpredictable. We run RAIDZ2 with 30x3TB disks. Tigran. Haven't built one yet. Hope to, later this year. Steve, I would love to know more about how you're getting on with your NVMe disk in postgres! Graeme. On 07 Jul 2015, at 12:28, Mkrtchyan, Tigran tigran.mkrtch...@desy.de wrote: Thanks for the Info. So if RAID controllers are not an option, what should one use to build big databases? LVM with xfs? BtrFs? Zfs? Tigran. - Original Message - From: Graeme B. Bell graeme.b...@nibio.no To: Steve Crawford scrawf...@pinpointresearch.com Cc: Wes Vaske (wvaske) wva...@micron.com, pgsql-performance pgsql-performance@postgresql.org Sent: Tuesday, July 7, 2015 12:22:00 PM Subject: Re: [PERFORM] New server: SSD/RAID recommendations? Completely agree with Steve. 1. Intel NVMe looks like the best bet if you have modern enough hardware for NVMe. Otherwise e.g. the S3700 mentioned elsewhere. 2. RAID controllers. We have e.g. 10-12 of these here and e.g. 25-30 SSDs, among various machines. This might give people an idea about where the risk lies in the path from disk to CPU.
We've had 2 RAID card failures in the last 12 months that nuked the array with days of downtime, and 2 problems with batteries suddenly becoming useless or suddenly reporting wildly varying temperatures/overheating. There may have been other RAID problems I don't know about. Our IT dept were replacing Seagate HDDs last year at a rate of 2-3 per week (I guess they have 100-200 disks?). We also have about 25-30 Hitachi/HGST HDDs. So by my estimates: 30% annual problem rate with RAID controllers; 30-50% failure rate with Seagate HDDs (Backblaze saw similar results); 0% failure rate with HGST HDDs; 0% failure in our SSDs (to be fair, our one Samsung SSD apparently has a TRIM bug under Linux, which I'll need to investigate to see if we have been affected). Also, RAID controllers aren't free - not just the money but also the management of them (ever tried writing a complex install script that interacts with MegaCLI? It can be done, but it's not much fun). Just take a look at the MegaCLI manual and ask yourself... is this even worth it, given a good MTBF on an enterprise SSD? RAID was meant to be about ensuring availability of data. I have trouble believing that these days. Graeme Bell On 06 Jul 2015, at 18:56, Steve Crawford scrawf...@pinpointresearch.com wrote: 2. We don't typically have redundant electronic components in our servers. Sure, we have dual power supplies and dual NICs
Re: [PERFORM] New server: SSD/RAID recommendations?
Completely agree with Steve. 1. Intel NVMe looks like the best bet if you have modern enough hardware for NVMe. Otherwise e.g. the S3700 mentioned elsewhere. 2. RAID controllers. We have e.g. 10-12 of these here and e.g. 25-30 SSDs, among various machines. This might give people an idea about where the risk lies in the path from disk to CPU. We've had 2 RAID card failures in the last 12 months that nuked the array with days of downtime, and 2 problems with batteries suddenly becoming useless or suddenly reporting wildly varying temperatures/overheating. There may have been other RAID problems I don't know about. Our IT dept were replacing Seagate HDDs last year at a rate of 2-3 per week (I guess they have 100-200 disks?). We also have about 25-30 Hitachi/HGST HDDs. So by my estimates: 30% annual problem rate with RAID controllers; 30-50% failure rate with Seagate HDDs (Backblaze saw similar results); 0% failure rate with HGST HDDs; 0% failure in our SSDs (to be fair, our one Samsung SSD apparently has a TRIM bug under Linux, which I'll need to investigate to see if we have been affected). Also, RAID controllers aren't free - not just the money but also the management of them (ever tried writing a complex install script that interacts with MegaCLI? It can be done, but it's not much fun). Just take a look at the MegaCLI manual and ask yourself... is this even worth it, given a good MTBF on an enterprise SSD? RAID was meant to be about ensuring availability of data. I have trouble believing that these days. Graeme Bell On 06 Jul 2015, at 18:56, Steve Crawford scrawf...@pinpointresearch.com wrote: 2. We don't typically have redundant electronic components in our servers. Sure, we have dual power supplies and dual NICs (though generally to handle external failures) and ECC-RAM, but no hot-backup CPU or redundant RAM banks and... no backup RAID card. Intel Enterprise SSDs already have power-fail protection, so I don't need a RAID card to give me a BBU. Given the MTBF of good enterprise SSDs, I'm left to wonder if placing a RAID card in front merely adds a new point of failure and scheduled-downtime-inducing hands-on maintenance (I'm looking at you, RAID backup battery).
Re: [PERFORM] New server: SSD/RAID recommendations?
On 7/7/2015 05:56, Mkrtchyan, Tigran wrote: - Original Message - From: Graeme B. Bell graeme.b...@nibio.no To: Mkrtchyan, Tigran tigran.mkrtch...@desy.de Cc: Graeme B. Bell graeme.b...@nibio.no, Steve Crawford scrawf...@pinpointresearch.com, Wes Vaske (wvaske) wva...@micron.com, pgsql-performance pgsql-performance@postgresql.org Sent: Tuesday, July 7, 2015 12:38:10 PM Subject: Re: [PERFORM] New server: SSD/RAID recommendations? I am unsure about the performance side, but ZFS is generally very attractive to me. Key advantages: 1) Checksumming and automatic fixing-of-broken-things on every file (not just postgres pages, but your scripts, O/S, program files). 2) Built-in lightweight compression (doesn't help with TOAST tables, in fact may slow them down, but helpful for other things). This may actually be a net negative for pg, so maybe turn it off. 3) ZRAID mirroring or ZRAID5/6. If you have trouble persuading someone that it's safe to replace a RAID array with a single drive... you can use a couple of NVMe SSDs with a ZFS mirror or zraid, and get the same availability you'd get from a RAID controller. Slightly better, arguably, since they claim to have fixed the RAID write-hole problem. 4) Filesystem snapshotting. Despite the costs of checksumming etc., I suspect ZRAID running on a fast CPU with multiple NVMe drives will outperform quite a lot of the alternatives, with great data integrity guarantees. LZ4 compression and the standard 128kb block size have shown to be materially faster here than 8kb blocks and no compression, both with rotating disks and SSDs. This is workload-dependent in my experience, but in the applications we put Postgres to there is a very material improvement in throughput using compression and the larger blocksize, which is counter-intuitive and also opposite the conventional wisdom. For best throughput we use mirrored vdev sets. -- Karl Denninger k...@denninger.net /The Market Ticker/ /[S/MIME encrypted email preferred]/
[PERFORM] Hmmm... why does CPU-intensive pl/pgsql code parallelise so badly when queries parallelise fine? Anyone else seen this?
Hi everyone, I've written a new open source tool for easily parallelising SQL scripts in postgres. [obligatory plug: https://github.com/gbb/par_psql ] Using it, I'm seeing a problem I've also seen in other postgres projects involving parallelisation in the last 12 months. Basically: - I have machines here with up to 16 CPUs and 128GB memory, very fast SSDs and controller etc, and a carefully configured kernel/postgresql.conf for high performance. - Ordinary queries parallelise nearly perfectly (e.g. SELECT some_stuff ...), e.g. almost up to a 16x performance improvement. - Calls to CPU-intensive user-defined pl/pgsql functions (e.g. SELECT myfunction(some_stuff)) do not parallelise well, even when they are independent or accessing tables in a read-only way. They hit a limit at 2.5x performance improvement relative to single-CPU performance (pg9.4) and 2x performance (pg9.3). This is about 6 times slower than I'm expecting. - I can't see what would be locking. It seems like it's the pl/pgsql environment itself that is somehow locking or incurring some huge frictional costs. Whether I use independently defined functions, independent source tables, or independent output tables makes no difference whatsoever, so it doesn't feel 'locky'. It also doesn't seem to be WAL/synchronisation related, as the machines I'm using can hit absurdly high pgbench rates, and I'm using unlogged tables. Curious? Take a quick peek here: https://github.com/gbb/par_psql/blob/master/BENCHMARKS.md Wondering what I'm missing here. Any ideas? Graeme.
Re: [PERFORM] Hmmm... why does CPU-intensive pl/pgsql code parallelise so badly when queries parallelise fine? Anyone else seen this?
On Thu, Jul 2, 2015 at 9:15 AM, Graeme B. Bell graeme.b...@nibio.no wrote: Hi everyone, I've written a new open source tool for easily parallelising SQL scripts in postgres. [obligatory plug: https://github.com/gbb/par_psql ] Using it, I'm seeing a problem I've also seen in other postgres projects involving parallelisation in the last 12 months. Basically: - I have machines here with up to 16 CPUs and 128GB memory, very fast SSDs and controller etc, and a carefully configured kernel/postgresql.conf for high performance. - Ordinary queries parallelise nearly perfectly (e.g. SELECT some_stuff ...), e.g. almost up to a 16x performance improvement. - Calls to CPU-intensive user-defined pl/pgsql functions (e.g. SELECT myfunction(some_stuff)) do not parallelise well, even when they are independent or accessing tables in a read-only way. They hit a limit at 2.5x performance improvement relative to single-CPU performance (pg9.4) and 2x performance (pg9.3). This is about 6 times slower than I'm expecting. - I can't see what would be locking. It seems like it's the pl/pgsql environment itself that is somehow locking or incurring some huge frictional costs. Whether I use independently defined functions, independent source tables, or independent output tables makes no difference whatsoever, so it doesn't feel 'locky'. It also doesn't seem to be WAL/synchronisation related, as the machines I'm using can hit absurdly high pgbench rates, and I'm using unlogged tables. Curious? Take a quick peek here: https://github.com/gbb/par_psql/blob/master/BENCHMARKS.md Wondering what I'm missing here. Any ideas? Graeme. No ideas, but I ran into the same thing. I have a set of C/C++ functions that put some chemistry calculations into Postgres as extensions (things like 'calculate the molecular weight of this molecule'). As SQL functions, the whole thing bogged down, and we never got the scalability we needed. On our 8-CPU setup, we couldn't get more than 2 CPUs busy at the same time, even with dozens of clients. When I moved these same functions into an Apache fast-CGI HTTP service (exact same code, same network overhead), I could easily scale up and use the full 100% of all eight CPUs. I have no idea why, and never investigated further. The convenience of having the functions in SQL wasn't that important. Craig -- Craig A. James, Chief Technology Officer, eMolecules, Inc.
Re: [PERFORM] New server: SSD/RAID recommendations?
The M500/M550/M600 are consumer-class drives that don't have power protection for all inflight data* (like the Samsung 8x0 series and the Intel 3x0 and 5x0 series). The M500DC has full power protection for inflight data and is an enterprise-class drive (like the Samsung 845DC or Intel S3500/S3700 series). So any drive without the capacitors to protect inflight data will suffer from data loss if you're using disk write cache and you pull the power. *Big addendum: There are two issues on power loss that will mess with Postgres: Data Loss and Data Corruption. The Micron consumer drives will have power loss protection against Data Corruption, and the enterprise drives will have power loss protection against BOTH. https://www.micron.com/~/media/documents/products/white-paper/wp_ssd_power_loss_protection.pdf The Data Corruption problem is only an issue in non-SLC NAND, but it's industry-wide. And even though some drives will protect against that, the protection of inflight data that's been fsync'd is more important and should disqualify *any* consumer drives from *any* company from consideration for use with Postgres. Wes Vaske | Senior Storage Solutions Engineer Micron Technology -Original Message- From: Graeme B. Bell [mailto:graeme.b...@nibio.no] Sent: Tuesday, July 07, 2015 8:26 AM To: Merlin Moncure Cc: Wes Vaske (wvaske); Craig James; pgsql-performance@postgresql.org Subject: Re: [PERFORM] New server: SSD/RAID recommendations? As I have warned elsewhere, the M500/M550 from $SOME_COMPANY is NOT SUITABLE for postgres unless you have a RAID controller with BBU to protect yourself. The M500/M550 are NOT plug-pull safe despite the 'power loss protection' claimed on the packaging. Not all fsync'd data is preserved in the event of a power loss, which completely undermines postgres's sanity. I would be extremely skeptical about the M500DC given the name and manufacturer. I went to quite a lot of trouble to provide $SOME_COMPANY's engineers with the full details of this fault after extensive testing (we have e.g. 20-25 of these disks) on multiple machines and controllers, at their request. Result: they stopped replying to me, and soon after I saw their PR reps talking about how 'power loss protection isn't about protecting all data during a power loss'. The only safe way to use an M500/M550 with postgres is: a) disable the disk cache, which will cripple performance to about 3-5% of normal, or b) use a battery-backed or cap-backed RAID controller, which will generally hurt performance by limiting you to the peak performance of the flash on the RAID controller. If you are buying such a drive, I strongly recommend buying only one and doing extensive plug-pull testing before committing to several. For myself, my time is valuable enough that it will be cheaper to buy Intel in future. Graeme. On 07 Jul 2015, at 15:12, Merlin Moncure mmonc...@gmail.com wrote: On Thu, Jul 2, 2015 at 1:00 PM, Wes Vaske (wvaske) wva...@micron.com wrote: Storage Review has a pretty good process and reviewed the M500DC when it was released last year.
http://www.storagereview.com/micron_m500dc_enterprise_ssd_review The only database-specific info we have available is for Cassandra and MSSQL: http://www.micron.com/~/media/documents/products/technical-marketing-brief/cassandra_and_m500dc_enterprise_ssd_tech_brief.pdf http://www.micron.com/~/media/documents/products/technical-marketing-brief/sql_server_2014_and_m500dc_raid_configuration_tech_brief.pdf (some of that info might be relevant) In terms of endurance, the M500DC is rated to 2 Drive Writes Per Day (DWPD) for 5 years. For comparison:

Micron M500DC (20nm) - 2 DWPD
Intel S3500 (20nm) - 0.3 DWPD
Intel S3510 (16nm) - 0.3 DWPD
Intel S3710 (20nm) - 10 DWPD

They're all great drives; the question is how write-intensive the workload is. Intel added a new product, the 3610, that is rated for 3 DWPD. Pricing looks to be around $1.20/GB. merlin
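To put DWPD in absolute terms, total bytes written over the warranty period is capacity × DWPD × days. A worked example (the 480GB capacity is hypothetical, chosen only for illustration; runnable in psql):

    -- A 480 GB drive rated at 2 DWPD over a 5-year warranty:
    SELECT 480 * 2 * 365 * 5 / 1000.0 AS total_writes_tb;  -- 1752 TB, i.e. ~1.75 PB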
Re: [PERFORM] New server: SSD/RAID recommendations?
Thanks, this is very useful to know about the 730. When you say 'tested it with plug-pulls', you were using diskchecker.pl, right? Graeme. On 07 Jul 2015, at 14:39, Karl Denninger k...@denninger.net wrote: Incidentally, while there are people who have questioned the 730 series power loss protection, I've tested it with plug-pulls, and in addition it watchdogs its internal power loss capacitors -- from the smartctl -a display of one of them on an in-service machine here:

175 Power_Loss_Cap_Test 0x0033 100 100 010 Pre-fail Always - 643 (4 6868)
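For readers who haven't used it, diskchecker.pl is the usual tool for this kind of plug-pull test. A typical session looks roughly like this (from memory, so treat it as a sketch and check the script's usage text; the hostname and file path are hypothetical, and the listener must run on a second machine that keeps its power):

    # On a second machine that will NOT lose power:
    diskchecker.pl -l

    # On the machine under test, writing to the drive being checked:
    diskchecker.pl -s observer.example.com create /mnt/ssd/test_file 500
    # ...pull the plug on the test machine mid-run, boot it back up, then:
    diskchecker.pl -s observer.example.com verify /mnt/ssd/test_file

If fsync is honest, verify reports no errors; drives that acknowledge fsync from volatile cache show lost or corrupt sectors here.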
Re: [PERFORM] New server: SSD/RAID recommendations?
On Thu, Jul 2, 2015 at 1:00 PM, Wes Vaske (wvaske) wva...@micron.com wrote: Storage Review has a pretty good process and reviewed the M500DC when it was released last year. http://www.storagereview.com/micron_m500dc_enterprise_ssd_review The only database-specific info we have available is for Cassandra and MSSQL: http://www.micron.com/~/media/documents/products/technical-marketing-brief/cassandra_and_m500dc_enterprise_ssd_tech_brief.pdf http://www.micron.com/~/media/documents/products/technical-marketing-brief/sql_server_2014_and_m500dc_raid_configuration_tech_brief.pdf (some of that info might be relevant) In terms of endurance, the M500DC is rated to 2 Drive Writes Per Day (DWPD) for 5 years. For comparison:

Micron M500DC (20nm) - 2 DWPD
Intel S3500 (20nm) - 0.3 DWPD
Intel S3510 (16nm) - 0.3 DWPD
Intel S3710 (20nm) - 10 DWPD

They're all great drives; the question is how write-intensive the workload is. Intel added a new product, the 3610, that is rated for 3 DWPD. Pricing looks to be around $1.20/GB. merlin
Re: [PERFORM] New server: SSD/RAID recommendations?
As I have warned elsewhere, the M500/M550 from $SOME_COMPANY is NOT SUITABLE for postgres unless you have a RAID controller with BBU to protect yourself. The M500/M550 are NOT plug-pull safe despite the 'power loss protection' claimed on the packaging. Not all fsync'd data is preserved in the event of a power loss, which completely undermines postgres's sanity. I would be extremely skeptical about the M500DC given the name and manufacturer. I went to quite a lot of trouble to provide $SOME_COMPANY's engineers with the full details of this fault after extensive testing (we have e.g. 20-25 of these disks) on multiple machines and controllers, at their request. Result: they stopped replying to me, and soon after I saw their PR reps talking about how 'power loss protection isn't about protecting all data during a power loss'. The only safe way to use an M500/M550 with postgres is: a) disable the disk cache, which will cripple performance to about 3-5% of normal, or b) use a battery-backed or cap-backed RAID controller, which will generally hurt performance by limiting you to the peak performance of the flash on the RAID controller. If you are buying such a drive, I strongly recommend buying only one and doing extensive plug-pull testing before committing to several. For myself, my time is valuable enough that it will be cheaper to buy Intel in future. Graeme. On 07 Jul 2015, at 15:12, Merlin Moncure mmonc...@gmail.com wrote: On Thu, Jul 2, 2015 at 1:00 PM, Wes Vaske (wvaske) wva...@micron.com wrote: Storage Review has a pretty good process and reviewed the M500DC when it was released last year. http://www.storagereview.com/micron_m500dc_enterprise_ssd_review The only database-specific info we have available is for Cassandra and MSSQL: http://www.micron.com/~/media/documents/products/technical-marketing-brief/cassandra_and_m500dc_enterprise_ssd_tech_brief.pdf http://www.micron.com/~/media/documents/products/technical-marketing-brief/sql_server_2014_and_m500dc_raid_configuration_tech_brief.pdf (some of that info might be relevant) In terms of endurance, the M500DC is rated to 2 Drive Writes Per Day (DWPD) for 5 years. For comparison:

Micron M500DC (20nm) - 2 DWPD
Intel S3500 (20nm) - 0.3 DWPD
Intel S3510 (16nm) - 0.3 DWPD
Intel S3710 (20nm) - 10 DWPD

They're all great drives; the question is how write-intensive the workload is. Intel added a new product, the 3610, that is rated for 3 DWPD. Pricing looks to be around $1.20/GB. merlin
Re: [PERFORM] Hmmm... why does CPU-intensive pl/pgsql code parallelise so badly when queries parallelise fine? Anyone else seen this?
On 07/07/2015 08:05 PM, Craig James wrote: No ideas, but I ran into the same thing. I have a set of C/C++ functions that put some chemistry calculations into Postgres as extensions (things like 'calculate the molecular weight of this molecule'). As SQL functions, the whole thing bogged down, and we never got the scalability we needed. On our 8-CPU setup, we couldn't get more than 2 CPUs busy at the same time, even with dozens of clients. When I moved these same functions into an Apache fast-CGI HTTP service (exact same code, same network overhead), I could easily scale up and use the full 100% of all eight CPUs. I have no idea why, and never investigated further. The convenience of having the functions in SQL wasn't that important. I admit that I haven't read this whole thread, but: using Apache fast-CGI, you are going to fork a process for each instance of the function being executed, and that in turn will use all CPUs up to the maximum available resources. With PostgreSQL, that isn't going to happen unless you are running (at least) 8 functions across 8 connections. JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564 PostgreSQL Centered full stack support, consulting and development. Announcing I'm offended is basically telling the world you can't control your own emotions, so everyone else should do it for you.
Re: [PERFORM] New server: SSD/RAID recommendations?
On 07/07/2015 05:15 PM, Wes Vaske (wvaske) wrote: The M500/M550/M600 are consumer-class drives that don't have power protection for all inflight data* (like the Samsung 8x0 series and the Intel 3x0 and 5x0 series). The M500DC has full power protection for inflight data and is an enterprise-class drive (like the Samsung 845DC or Intel S3500/S3700 series). So any drive without the capacitors to protect inflight data will suffer from data loss if you're using disk write cache and you pull the power. Wow, I would be pretty angry if I installed an SSD in my desktop and it loses a file that I saved just before pulling the power plug. *Big addendum: There are two issues on power loss that will mess with Postgres: Data Loss and Data Corruption. The Micron consumer drives will have power loss protection against Data Corruption, and the enterprise drives will have power loss protection against BOTH. https://www.micron.com/~/media/documents/products/white-paper/wp_ssd_power_loss_protection.pdf The Data Corruption problem is only an issue in non-SLC NAND, but it's industry-wide. And even though some drives will protect against that, the protection of inflight data that's been fsync'd is more important and should disqualify *any* consumer drives from *any* company from consideration for use with Postgres. So it lies about fsync()... The next question is: does it nevertheless enforce the correct ordering when persisting fsync'd data? If you write to file A and fsync it, then write to another file B and fsync it too, is it guaranteed that if B is persisted, A is as well? Because if it isn't, you can end up with filesystem (or database) corruption anyway. - Heikki