Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
Hi Dave, Ted, All,

On 2014-05-23 16:42:47 +1000, Dave Chinner wrote:
> On Tue, Apr 29, 2014 at 01:57:14AM +0200, Andres Freund wrote:
> > Hi Dave,
> >
> > On 2014-04-29 09:47:56 +1000, Dave Chinner wrote:
> > > ping?
> >
> > I'd replied at http://marc.info/?l=linux-mm&m=139730910307321&w=2
>
> I missed it, sorry.

No worries. As you can see, I'm not quick at answering either :/

> I've had a bit more time to look at this behaviour now and tweaked
> it as you suggested, but I simply can't get XFS to misbehave in the
> manner you demonstrated. However, I can reproduce major read latency
> changes and writeback flush storms with ext4. I originally only
> tested on XFS.

That's interesting. I know that the problem was reproducible on xfs at
some point, but that was on 2.6.18 or so... I'll try whether I can make
it perform badly on the measly hardware I have available.

> I'm using the no-op IO scheduler everywhere, too.

And will check whether it's potentially related to that.

> ext4, OTOH, generated a much, much higher periodic write IO load and
> it's regularly causing read IO latencies in the hundreds of
> milliseconds. Every so often this occurred on ext4 (5s sample rate)
>
> Device: rrqm/s wrqm/s     r/s      w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await svctm %util
> vdc       0.00   3.00 3142.20   219.20  34.11  19.10    32.42     1.11   0.33    0.33    0.31  0.27 91.92
> vdc       0.00   0.80 3311.60   216.20  35.86  18.90    31.79     1.17   0.33    0.33    0.39  0.26 92.56
> vdc       0.00   0.80 2919.80  2750.60  31.67  48.36    28.90    20.05   3.50    0.36    6.83  0.16 92.96
> vdc       0.00   0.80  435.00 15689.80   4.96 198.10    25.79   113.21   7.03    2.32    7.16  0.06 99.20
> vdc       0.00   0.80 2683.80   216.20  29.72  18.98    34.39     1.13   0.39    0.39    0.40  0.32 91.92
> vdc       0.00   0.80 2853.00   218.20  31.29  19.06    33.57     1.14   0.37    0.37    0.36  0.30 92.56
>
> Which is, I think, signs of what you'd been trying to demonstrate -
> a major dip in read performance when writeback is flushing.

I've seen *much* worse cases than this, but it's what we're seeing in
production.

> What is interesting here is the difference in IO patterns. ext4 is
> doing much larger IOs than XFS - its average IO size is 16k, while
> XFS's is a bit over 8k. So while the read and background write IOPS
> rates are similar, ext4 is moving a lot more data to/from disk in
> larger chunks.
>
> This seems also to translate to much larger writeback IO peaks in
> ext4. I have no idea what this means in terms of actual application
> throughput, but it looks very much to me like the nasty read
> latencies are much more pronounced on ext4 because of the higher
> read bandwidths and write IOPS being seen.

I'll try starting a benchmark of actual postgres showing the different
peak/average throughput and latencies.

> So, seeing the differences in behaviour just by changing
> filesystems, I just ran the workload on btrfs. Ouch - it was
> even worse than ext4 in terms of read latencies - they were highly
> unpredictable, and massively variable even within a read group:

I've essentially given up on btrfs for the foreseeable future :(.

> That means it isn't clear that there's any generic infrastructure
> problem here, and it certainly isn't clear that each filesystem has
> the same problem or the issues can be solved by a generic mechanism.
> I think you probably need to engage the ext4 developers directly to
> understand what ext4 is doing in detail, or work out how to prod XFS
> into displaying that extremely bad read latency behaviour

I've CCed the ext4 list and Ted. Maybe that'll bring some insight...

> > > On Wed, Apr 09, 2014 at 07:20:09PM +1000, Dave Chinner wrote:
> > > > I'm not sure how you were generating the behaviour you reported, but
> > > > the test program as it stands does not appear to be causing any
> > > > problems at all on the sort of storage I'd expect large databases to
> > > > be hosted on
> >
> > A really, really large number of databases aren't stored on big
> > enterprise rigs...
>
> I'm not using a big enterprise rig. I've reproduced these results on
> a low end Dell server with the internal H710 SAS RAID and a pair of
> consumer SSDs in RAID0, as well as via a 4 year old Perc/6e SAS RAID
> HBA with 12 2T nearline SAS drives in RAID0.

There's a *lot* of busy postgres installations out there running on a
single disk of spinning rust. Hopefully replicating to another piece of
spinning rust... In comparison to that, that's enterprise hardware ;)

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
On Tue, Apr 29, 2014 at 01:57:14AM +0200, Andres Freund wrote:
> Hi Dave,
>
> On 2014-04-29 09:47:56 +1000, Dave Chinner wrote:
> > ping?
>
> I'd replied at http://marc.info/?l=linux-mm&m=139730910307321&w=2

I missed it, sorry.

I've had a bit more time to look at this behaviour now and tweaked it
as you suggested, but I simply can't get XFS to misbehave in the manner
you demonstrated. However, I can reproduce major read latency changes
and writeback flush storms with ext4. I originally only tested on XFS.
I'm using the no-op IO scheduler everywhere, too.

I ran the tweaked version I have for a couple of hours on XFS, and only
saw a handful of abnormal writeback events where the write IOPS spiked
above the normal periodic peaks and was sufficient to cause any
noticeable increase in read latency. Even then the maximums were in the
40ms range, nothing much higher.

ext4, OTOH, generated a much, much higher periodic write IO load and
it's regularly causing read IO latencies in the hundreds of
milliseconds. Every so often this occurred on ext4 (5s sample rate):

Device: rrqm/s wrqm/s     r/s      w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await svctm %util
vdc       0.00   3.00 3142.20   219.20  34.11  19.10    32.42     1.11   0.33    0.33    0.31  0.27 91.92
vdc       0.00   0.80 3311.60   216.20  35.86  18.90    31.79     1.17   0.33    0.33    0.39  0.26 92.56
vdc       0.00   0.80 2919.80  2750.60  31.67  48.36    28.90    20.05   3.50    0.36    6.83  0.16 92.96
vdc       0.00   0.80  435.00 15689.80   4.96 198.10    25.79   113.21   7.03    2.32    7.16  0.06 99.20
vdc       0.00   0.80 2683.80   216.20  29.72  18.98    34.39     1.13   0.39    0.39    0.40  0.32 91.92
vdc       0.00   0.80 2853.00   218.20  31.29  19.06    33.57     1.14   0.37    0.37    0.36  0.30 92.56

Which is, I think, signs of what you'd been trying to demonstrate - a
major dip in read performance when writeback is flushing.

In comparison, this is from XFS:

Device: rrqm/s wrqm/s     r/s      w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await svctm %util
vdc       0.00   0.00 2416.40   335.00  21.02   7.85    21.49     0.78   0.28    0.30    0.19  0.24 65.28
vdc       0.00   0.00 2575.80   336.00  22.68   7.88    21.49     0.81   0.28    0.29    0.16  0.23 66.32
vdc       0.00   0.00 1740.20  4645.20  15.60  58.22    23.68    21.21   3.32    0.41    4.41  0.11 68.56
vdc       0.00   0.00 2082.80   329.00  18.28   7.71    22.07     0.81   0.34    0.35    0.26  0.28 67.44
vdc       0.00   0.00 2347.80   333.20  19.53   7.80    20.88     0.83   0.31    0.32    0.25  0.25 67.52

You can see how much less load XFS is putting on the storage - it's
only 65-70% utilised compared to the 90-100% load that ext4 is
generating.

What is interesting here is the difference in IO patterns. ext4 is
doing much larger IOs than XFS - its average IO size is 16k, while
XFS's is a bit over 8k. So while the read and background write IOPS
rates are similar, ext4 is moving a lot more data to/from disk in
larger chunks.

This seems also to translate to much larger writeback IO peaks in ext4.
I have no idea what this means in terms of actual application
throughput, but it looks very much to me like the nasty read latencies
are much more pronounced on ext4 because of the higher read bandwidths
and write IOPS being seen.

The screen shot of the recorded behaviour is attached - the left hand
side is the tail end (~30min) of the 2 hour long XFS run, and the first
half an hour of ext4 running. The difference in IO behaviour is quite
obvious.

What is interesting is that CPU usage is not very much different
between the two filesystems, but IOWait is much, much higher for ext4.
That indicates that ext4 is definitely loading the storage more, and so
is much more likely to have IO load related latencies.

So, seeing the differences in behaviour just by changing filesystems, I
just ran the workload on btrfs.

Ouch - it was even worse than ext4 in terms of read latencies - they
were highly unpredictable, and massively variable even within a read
group:

read[11331]: avg: 0.3 msec; max: 7.0 msec
read[11340]: avg: 0.3 msec; max: 7.1 msec
read[11334]: avg: 0.3 msec; max: 7.0 msec
read[11329]: avg: 0.3 msec; max: 7.0 msec
read[11328]: avg: 0.3 msec; max: 7.0 msec
read[11332]: avg: 0.6 msec; max: 4481.2 msec
read[11342]: avg: 0.6 msec; max: 4480.6 msec
read[11332]: avg: 0.0 msec; max: 0.7 msec
read[11342]: avg: 0.0 msec; max: 1.6 msec
wal[11326]: avg: 0.0 msec; max: 0.1 msec
.

It was also not uncommon to see major commit latencies:

read[11335]: avg: 0.2 msec; max: 8.3 msec
read[11341]: avg: 0.2 msec; max: 8.5 msec
wal[11326]: avg: 0.0 msec; max: 0.1 msec
commit[11326]: avg: 0.7 msec; max: 5302.3 msec
wal[11326]: avg: 0.0 msec;
Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
Hi Dave,

On 2014-04-29 09:47:56 +1000, Dave Chinner wrote:
> ping?

I'd replied at http://marc.info/?l=linux-mm&m=139730910307321&w=2

As an additional note:

> On Wed, Apr 09, 2014 at 07:20:09PM +1000, Dave Chinner wrote:
> > I'm not sure how you were generating the behaviour you reported, but
> > the test program as it stands does not appear to be causing any
> > problems at all on the sort of storage I'd expect large databases to
> > be hosted on

A really, really large number of databases aren't stored on big
enterprise rigs...

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
ping?

On Wed, Apr 09, 2014 at 07:20:09PM +1000, Dave Chinner wrote:
> On Wed, Mar 26, 2014 at 08:11:13PM +0100, Andres Freund wrote:
> > Hi,
> >
> > At LSF/MM there was a slot about postgres' problems with the kernel.
> > Our top#1 concern is frequent slow read()s that happen while another
> > process calls fsync(), even though we'd be perfectly fine if that
> > fsync() took ages.
> > The "conclusion" of that part was that it'd be very useful to have a
> > demonstration of the problem without needing a full blown postgres
> > setup. I've quickly hacked something together that seems to show the
> > problem nicely.
> >
> > For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
> > and the "IO Scheduling" bit in
> > http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de
> >
> > The tool's output looks like this:
> > gcc -std=c99 -Wall -ggdb ~/tmp/ioperf.c -o ioperf && ./ioperf
> > ...
> > wal[12155]: avg: 0.0 msec; max: 0.0 msec
> > commit[12155]: avg: 0.2 msec; max: 15.4 msec
> > wal[12155]: avg: 0.0 msec; max: 0.0 msec
> > read[12157]: avg: 0.2 msec; max: 9.4 msec
> > ...
> > read[12165]: avg: 0.2 msec; max: 9.4 msec
> > wal[12155]: avg: 0.0 msec; max: 0.0 msec
> > starting fsync() of files
> > finished fsync() of files
> > read[12162]: avg: 0.6 msec; max: 2765.5 msec
> >
> > So, the average read time is less than one ms (SSD, and about 50%
> > cached workload). But once another backend does the fsync(), read
> > latency skyrockets.
> >
> > A concurrent iostat shows the problem pretty clearly:
> > Device: rrqm/s wrqm/s     r/s      w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await svctm  %util
> > sda       1.00   0.00 6322.00   337.00  51.73   4.38    17.26     2.09   0.32    0.19    2.59  0.14  90.00
> > sda       0.00   0.00 6016.00   303.00  47.18   3.95    16.57     2.30   0.36    0.23    3.12  0.15  94.40
> > sda       0.00   0.00 6236.00  1059.00  49.52  12.88    17.52     5.91   0.64    0.20    3.23  0.12  88.40
> > sda       0.00   0.00  105.00 26173.00   0.89 311.39    24.34   142.37   5.42   27.73    5.33  0.04 100.00
> > sda       0.00   0.00   78.00 27199.00   0.87 324.06    24.40   142.30   5.25   11.08    5.23  0.04 100.00
> > sda       0.00   0.00   10.00 33488.00   0.11 399.05    24.40   136.41   4.07  100.40    4.04  0.03 100.00
> > sda       0.00   0.00 3819.00 10096.00  31.14 120.47    22.31    42.80   3.10    0.32    4.15  0.07  96.00
> > sda       0.00   0.00 6482.00   346.00  52.98   4.53    17.25     1.93   0.28    0.20    1.80  0.14  93.20
> >
> > While the fsync() is going on (or the kernel decides to start writing
> > out aggressively for some other reason) the amount of writes to the
> > disk is increased by two orders of magnitude. Unsurprisingly with
> > disastrous consequences for read() performance. We really want a way
> > to pace the writes issued to the disk more regularly.
>
> Hi Andres,
>
> I've finally dug myself out from under the backlog from LSFMM far
> enough to start testing this on my local IO performance test rig.
>
> tl;dr: I can't reproduce this peaky behaviour on my test rig.
>
> I'm running in a 16p VM with 16GB RAM (in 4 nodes via fake-numa) and
> an unmodified benchmark on a current 3.15-linus tree. All storage
> (guest and host) is XFS based, guest VMs use virtio and direct IO to
> the backing storage. The host is using noop IO scheduling.
>
> The first IO setup I ran was a 100TB XFS filesystem in the guest.
> The backing file is a sparse file on an XFS filesystem on a pair of
> 240GB SSDs (Samsung 840 EVO) in RAID 0 via DM. The SSDs are
> exported as JBOD from a RAID controller which has 1GB of FBWC. The
> guest is capable of sustaining around 65,000 random read IOPS and
> 40,000 write IOPS through this filesystem depending on the tests
> being run.
>
> The iostat output looks like this:
>
> Device: rrqm/s wrqm/s     r/s      w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await svctm %util
> vdc       0.00   0.00 1817.00   315.40  18.80   6.93    24.71     0.80   0.38    0.38    0.37  0.31 66.24
> vdc       0.00   0.00 2094.20   323.20  21.82   7.10    24.50     0.81   0.33    0.34    0.27  0.28 68.48
> vdc       0.00   0.00 1989.00  4500.20  20.50  56.64    24.34    24.82   3.82    0.43    5.32  0.12 80.16
> vdc       0.00   0.00 2019.80   320.80  20.83   7.05    24.39     0.83   0.35    0.36    0.32  0.29 69.04
> vdc       0.00   0.00 2206.60   323.20  22.57   7.10    24.02     0.87   0.34    0.34    0.33  0.28 71.92
> vdc       0.00   0.00 2437.20   329.60  25.79   7.24    24.45     0.83   0.30    0.30    0.27  0.26 71.76
> vdc       0.00   0.00 1224.40 11263.80  12.88 136.38    24.48    64.90   5.20    0.69    5.69  0.07 84.96
> vdc       0.00   0.00 2074.60   319.40  21.03   7.01    23.99     0.84   0.35
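To make the workload under discussion easier to picture, here is a minimal, hypothetical sketch of an ioperf-style reproducer: a few readers time random pread()s on one file while another process dirties a second file and periodically fsync()s it. The file names, sizes, reader count and the 100ms reporting threshold are placeholder assumptions; the real ioperf.c attached to the thread differs (it also tracks wal/commit latencies and paces its writes with nsleep()s).

/*
 * Hypothetical sketch of an ioperf-style reproducer (NOT the actual
 * ioperf.c attached to the thread): a few child processes time random
 * 8k reads on one pre-allocated file while the parent keeps dirtying a
 * second file and periodically calls fsync() on it.
 */
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define DATA_SIZE   (4ULL * 1024 * 1024 * 1024)  /* assumption: ~2x RAM */
#define BLOCK_SIZE  8192
#define NUM_READERS 4

static double now_msec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

static void reader(const char *path)
{
	char buf[BLOCK_SIZE];
	int fd = open(path, O_RDONLY);

	if (fd < 0) { perror("open data file"); _exit(1); }
	srandom(getpid());
	for (;;) {
		off_t blk = random() % (DATA_SIZE / BLOCK_SIZE);
		double start = now_msec(), elapsed;

		if (pread(fd, buf, BLOCK_SIZE, blk * BLOCK_SIZE) < 0)
			perror("pread");
		elapsed = now_msec() - start;
		if (elapsed > 100.0)	/* report only the stalls */
			printf("read[%d]: %.1f msec\n", (int)getpid(), elapsed);
	}
}

int main(void)
{
	/* both files are assumed to be pre-allocated to DATA_SIZE */
	const char *data_file = "data.bin", *dirty_file = "dirty.bin";
	char buf[BLOCK_SIZE];
	int fd;

	for (int i = 0; i < NUM_READERS; i++)
		if (fork() == 0)
			reader(data_file);

	fd = open(dirty_file, O_WRONLY);
	if (fd < 0) { perror("open dirty file"); return 1; }
	memset(buf, 'x', sizeof(buf));
	srandom(getpid() ^ 0x5ca1ab1e);

	for (;;) {
		/* dirty pages all over the file, then flush them in one go */
		for (int i = 0; i < 100000; i++) {
			off_t blk = random() % (DATA_SIZE / BLOCK_SIZE);

			pwrite(fd, buf, BLOCK_SIZE, blk * BLOCK_SIZE);
		}
		puts("starting fsync() of file");
		fsync(fd);
		puts("finished fsync() of file");
	}
}

Run against two pre-allocated files on the filesystem under test, the read latency reports cluster around the fsync() calls, which is the shape of the problem the thread is about.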
Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
(2014/03/28 0:50), Jan Kara wrote:
> On Wed 26-03-14 22:55:18, Andres Freund wrote:
>> On 2014-03-26 14:41:31 -0700, Andy Lutomirski wrote:
>>> On Wed, Mar 26, 2014 at 12:11 PM, Andres Freund wrote:
>>>> Hi,
>>>>
>>>> At LSF/MM there was a slot about postgres' problems with the kernel.
>>>> Our top#1 concern is frequent slow read()s that happen while another
>>>> process calls fsync(), even though we'd be perfectly fine if that
>>>> fsync() took ages.
>>>> The "conclusion" of that part was that it'd be very useful to have a
>>>> demonstration of the problem without needing a full blown postgres
>>>> setup. I've quickly hacked something together that seems to show the
>>>> problem nicely.
>>>>
>>>> For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
>>>> and the "IO Scheduling" bit in
>>>> http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de
>>>
>>> For your amusement: running this program in KVM on a 2GB disk image
>>> failed, but it caused the *host* to go out to lunch for several
>>> seconds while failing. In fact, it seems to have caused the host to
>>> fall over so badly that the guest decided that the disk controller
>>> was timing out. The host is btrfs, and I think that btrfs is *really*
>>> bad at this kind of workload.
>>
>> Also, unless you changed the parameters, it's a) using a 48GB disk
>> file, and writes really rather fast ;)
>>
>>> Even using ext4 is no good. I think that dm-crypt is dying under the
>>> load. So I won't test your program for real :/
>>
>> Try to reduce data_size to RAM * 2, NUM_RANDOM_READERS to something
>> smaller. If it still doesn't work consider increasing the two
>> nsleep()s...
>>
>> I didn't have a good idea how to scale those to the current machine in
>> a halfway automatic fashion.
>
> That's not necessary. If we have guidance like the above, we can figure
> it out ourselves (I hope ;).
>
>>>> Possible solutions:
>>>> * Add a fadvise(UNDIRTY), that doesn't stall on a full IO queue like
>>>>   sync_file_range() does.
>>>> * Make IO triggered by writeback regard IO priorities and add it to
>>>>   schedulers other than CFQ
>>>> * Add a tunable that allows limiting the amount of dirty memory
>>>>   before writeback on a per process basis.
>>>> * ...?
>>>
>>> I thought the problem wasn't so much that priorities weren't
>>> respected but that the fsync call fills up the queue, so everything
>>> starts contending for the right to enqueue a new request.
>>
>> I think it's both actually. If I understand correctly there's not even
>> a correct association to the originator anymore during a fsync
>> triggered flush?
>
> There is. The association is lost for background writeback (and
> sync(2) for that matter) but IO from fsync(2) is submitted in the
> context of the process doing fsync.
>
> What I think happens is the problem with 'dependent sync IO' vs
> 'independent sync IO'. Reads are an example of dependent sync IO where
> you submit a read, need it to complete and then you submit another
> read. OTOH fsync is an example of independent sync IO where you fire
> off tons of IO to the drive and then wait for everything. Since we
> treat both these types of IO in the same way, it can easily happen
> that independent sync IO starves out the dependent one (you execute
> say 100 IO requests for fsync and 1 IO request for read). We've seen
> problems like this in the past.
>
> I'll have a look into your test program and if my feeling is indeed
> correct, I'll have a look into what we could do in the block layer to
> fix this (and poke block layer guys - they had some preliminary
> patches that tried to address this but they didn't go anywhere).

We have been using PostgreSQL in production for years so I am pretty
familiar with the symptoms described by the PostgreSQL guys.

In almost all cases the problem was the "'dependent sync IO' vs
'independent sync IO'" issue pointed out by Jan. However, as I
mentioned during the LSF session, the culprit was not the kernel but
the firmware of NCQ/TCQ capable storage that would keep read requests
queued forever, leaving tasks doing reads (dependent sync IO) waiting
for an interrupt that would not come. For the record, latencies of up
to 120 seconds in supposedly enterprise storage (I will not name names)
are relatively common.

This can be fixed by modifying drivers and/or the block layer to
dynamically adjust the queue depth when dumb scheduling by the firmware
is observed. If you do not want to make changes to the kernel you can
always try to change the default queue depth.

With this mechanism in place it is relatively easy to make ionice work,
since, as Jan mentioned, both reads and fsync writes are done in
process context.

- Fernando
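Two of the knobs Fernando mentions can be exercised without kernel changes: the per-device queue depth can be lowered through /sys/block/<dev>/device/queue_depth, and, because reads and fsync writes are issued in process context, the fsync()ing process can demote its own IO priority. A hedged sketch of the latter follows; the constants mirror the kernel's ioprio definitions, there is no glibc wrapper so the raw syscall is used, and only the CFQ scheduler honours IO priorities.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc provides no wrapper for ioprio_set(2); these values mirror the
 * kernel's ioprio definitions. */
#define IOPRIO_CLASS_SHIFT 13
#define IOPRIO_PRIO_VALUE(class, data) (((class) << IOPRIO_CLASS_SHIFT) | (data))
#define IOPRIO_WHO_PROCESS 1
#define IOPRIO_CLASS_IDLE  3

int main(void)
{
	/* Demote the calling process (e.g. a checkpointer about to issue a
	 * large fsync) to the idle IO class.  Only CFQ pays attention to
	 * this; noop and deadline ignore IO priorities entirely. */
	if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, getpid(),
		    IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)) < 0) {
		perror("ioprio_set");
		return 1;
	}
	/* ... heavy write + fsync() work would go here ... */
	return 0;
}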
Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
Hello,

On Wed 26-03-14 20:11:13, Andres Freund wrote:
> At LSF/MM there was a slot about postgres' problems with the kernel.
> Our top#1 concern is frequent slow read()s that happen while another
> process calls fsync(), even though we'd be perfectly fine if that
> fsync() took ages.
> The "conclusion" of that part was that it'd be very useful to have a
> demonstration of the problem without needing a full blown postgres
> setup. I've quickly hacked something together that seems to show the
> problem nicely.

Thanks a lot for the program!

> For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
> and the "IO Scheduling" bit in
> http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de
>
> The tool's output looks like this:
> gcc -std=c99 -Wall -ggdb ~/tmp/ioperf.c -o ioperf && ./ioperf
> ...
> wal[12155]: avg: 0.0 msec; max: 0.0 msec
> commit[12155]: avg: 0.2 msec; max: 15.4 msec
> wal[12155]: avg: 0.0 msec; max: 0.0 msec
> read[12157]: avg: 0.2 msec; max: 9.4 msec
> ...
> read[12165]: avg: 0.2 msec; max: 9.4 msec
> wal[12155]: avg: 0.0 msec; max: 0.0 msec
> starting fsync() of files
> finished fsync() of files
> read[12162]: avg: 0.6 msec; max: 2765.5 msec
>
> So, the average read time is less than one ms (SSD, and about 50%
> cached workload). But once another backend does the fsync(), read
> latency skyrockets.
>
> A concurrent iostat shows the problem pretty clearly:
> Device: rrqm/s wrqm/s     r/s      w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await svctm  %util
> sda       1.00   0.00 6322.00   337.00  51.73   4.38    17.26     2.09   0.32    0.19    2.59  0.14  90.00
> sda       0.00   0.00 6016.00   303.00  47.18   3.95    16.57     2.30   0.36    0.23    3.12  0.15  94.40
> sda       0.00   0.00 6236.00  1059.00  49.52  12.88    17.52     5.91   0.64    0.20    3.23  0.12  88.40
> sda       0.00   0.00  105.00 26173.00   0.89 311.39    24.34   142.37   5.42   27.73    5.33  0.04 100.00
> sda       0.00   0.00   78.00 27199.00   0.87 324.06    24.40   142.30   5.25   11.08    5.23  0.04 100.00
> sda       0.00   0.00   10.00 33488.00   0.11 399.05    24.40   136.41   4.07  100.40    4.04  0.03 100.00
> sda       0.00   0.00 3819.00 10096.00  31.14 120.47    22.31    42.80   3.10    0.32    4.15  0.07  96.00
> sda       0.00   0.00 6482.00   346.00  52.98   4.53    17.25     1.93   0.28    0.20    1.80  0.14  93.20
>
> While the fsync() is going on (or the kernel decides to start writing
> out aggressively for some other reason) the amount of writes to the
> disk is increased by two orders of magnitude. Unsurprisingly with
> disastrous consequences for read() performance. We really want a way
> to pace the writes issued to the disk more regularly.
>
> The attached program right now can only be configured by changing some
> details in the code itself, but I guess that's not a problem. It will
> upfront allocate two files, and then start testing. If the files
> already exist it will use them.
>
> Possible solutions:
> * Add a fadvise(UNDIRTY), that doesn't stall on a full IO queue like
>   sync_file_range() does.
> * Make IO triggered by writeback regard IO priorities and add it to
>   schedulers other than CFQ
> * Add a tunable that allows limiting the amount of dirty memory before
>   writeback on a per process basis.
> * ...?
>
> If somebody familiar with buffered IO writeback is around at LSF/MM, or
> rather collab, Robert and I will be around for the next days.

I guess I'm your guy, at least for the writeback part. I have some
insight in the block layer as well although there are better experts
around here. But I at least know whom to catch if there's some deeply
intricate problem ;)

								Honza
--
Jan Kara
SUSE Labs, CR
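For context on the first bullet in the quoted list, this is roughly what pacing writes with sync_file_range() looks like from an application. A hedged sketch, not code from postgres, with an arbitrary 128k chunk size:

/*
 * Sketch: pace writeback by asking the kernel to start writing each
 * chunk right after it is dirtied, instead of letting dirty pages pile
 * up until one huge fsync() flushes them all at once.
 *
 * The catch discussed above: SYNC_FILE_RANGE_WRITE only *initiates*
 * writeback, but if the device's request queue is already congested
 * the call itself blocks - which is why a non-blocking
 * fadvise(UNDIRTY)-style hint is being asked for.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (128 * 1024)	/* arbitrary pacing granularity */

static int write_paced(int fd, const char *buf, size_t len, off_t off)
{
	if (pwrite(fd, buf, len, off) != (ssize_t)len)
		return -1;
	/* kick off asynchronous writeback of just this range */
	return sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
}

int main(int argc, char **argv)
{
	static char buf[CHUNK];
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
	if (fd < 0) { perror("open"); return 1; }

	memset(buf, 'x', sizeof(buf));
	for (int i = 0; i < 256; i++)	/* write 32MB in paced chunks */
		if (write_paced(fd, buf, sizeof(buf), (off_t)i * CHUNK) < 0)
			perror("write_paced");

	fsync(fd);			/* final durability point */
	return close(fd);
}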
Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
On Wed 26-03-14 22:55:18, Andres Freund wrote:
> On 2014-03-26 14:41:31 -0700, Andy Lutomirski wrote:
> > On Wed, Mar 26, 2014 at 12:11 PM, Andres Freund wrote:
> > > Hi,
> > >
> > > At LSF/MM there was a slot about postgres' problems with the kernel.
> > > Our top#1 concern is frequent slow read()s that happen while another
> > > process calls fsync(), even though we'd be perfectly fine if that
> > > fsync() took ages.
> > > The "conclusion" of that part was that it'd be very useful to have a
> > > demonstration of the problem without needing a full blown postgres
> > > setup. I've quickly hacked something together that seems to show the
> > > problem nicely.
> > >
> > > For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
> > > and the "IO Scheduling" bit in
> > > http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de
> >
> > For your amusement: running this program in KVM on a 2GB disk image
> > failed, but it caused the *host* to go out to lunch for several
> > seconds while failing. In fact, it seems to have caused the host to
> > fall over so badly that the guest decided that the disk controller was
> > timing out. The host is btrfs, and I think that btrfs is *really* bad
> > at this kind of workload.
>
> Also, unless you changed the parameters, it's a) using a 48GB disk
> file, and writes really rather fast ;)
>
> > Even using ext4 is no good. I think that dm-crypt is dying under the
> > load. So I won't test your program for real :/
>
> Try to reduce data_size to RAM * 2, NUM_RANDOM_READERS to something
> smaller. If it still doesn't work consider increasing the two
> nsleep()s...
>
> I didn't have a good idea how to scale those to the current machine in
> a halfway automatic fashion.

That's not necessary. If we have guidance like the above, we can figure
it out ourselves (I hope ;).

> > > Possible solutions:
> > > * Add a fadvise(UNDIRTY), that doesn't stall on a full IO queue like
> > >   sync_file_range() does.
> > > * Make IO triggered by writeback regard IO priorities and add it to
> > >   schedulers other than CFQ
> > > * Add a tunable that allows limiting the amount of dirty memory
> > >   before writeback on a per process basis.
> > > * ...?
> >
> > I thought the problem wasn't so much that priorities weren't respected
> > but that the fsync call fills up the queue, so everything starts
> > contending for the right to enqueue a new request.
>
> I think it's both actually. If I understand correctly there's not even
> a correct association to the originator anymore during a fsync
> triggered flush?

There is. The association is lost for background writeback (and sync(2)
for that matter) but IO from fsync(2) is submitted in the context of
the process doing fsync.

What I think happens is the problem with 'dependent sync IO' vs
'independent sync IO'. Reads are an example of dependent sync IO where
you submit a read, need it to complete and then you submit another
read. OTOH fsync is an example of independent sync IO where you fire
off tons of IO to the drive and then wait for everything. Since we
treat both these types of IO in the same way, it can easily happen that
independent sync IO starves out the dependent one (you execute say 100
IO requests for fsync and 1 IO request for read). We've seen problems
like this in the past.

I'll have a look into your test program and if my feeling is indeed
correct, I'll have a look into what we could do in the block layer to
fix this (and poke block layer guys - they had some preliminary patches
that tried to address this but they didn't go anywhere).

> > Since fsync blocks until all of its IO finishes anyway, what if it
> > could just limit itself to a much smaller number of outstanding
> > requests?
>
> Yea, that could already help. If you remove the fsync()s, the problem
> will periodically appear anyway, because writeback is triggered with a
> vengeance. That'd need to be fixed in a similar way.

Actually, that might be triggered by a different problem because in
case of background writeback, the block layer knows the IO is
asynchronous and treats it in a different way.

								Honza
--
Jan Kara
SUSE Labs, CR
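A side note on the per-process dirty limit proposed above: no such tunable exists today. The nearest existing controls are the global vm.dirty_bytes and vm.dirty_background_bytes sysctls, which make background writeback start earlier and flush in smaller bursts. A hedged sketch of lowering the background threshold from C (the 256MB value is an arbitrary example, root is required, and this is a global workaround rather than the per-process knob being asked for):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write a byte threshold to /proc/sys/vm/dirty_background_bytes so that
 * background writeback kicks in at a lower amount of dirty memory. */
static int set_dirty_background_bytes(long long bytes)
{
	char buf[32];
	int fd, len, ret;

	fd = open("/proc/sys/vm/dirty_background_bytes", O_WRONLY);
	if (fd < 0)
		return -1;
	len = snprintf(buf, sizeof(buf), "%lld\n", bytes);
	ret = (write(fd, buf, len) == len) ? 0 : -1;
	close(fd);
	return ret;
}

int main(void)
{
	if (set_dirty_background_bytes(256LL * 1024 * 1024) < 0) {
		perror("dirty_background_bytes");
		return 1;
	}
	return 0;
}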
Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
On Wed, Mar 26, 2014 at 4:11 PM, Andy Lutomirski wrote:
> On Wed, Mar 26, 2014 at 3:35 PM, David Lang wrote:
>> On Wed, 26 Mar 2014, Andy Lutomirski wrote:
>>>>> I'm not sure I understand the request queue stuff, but here's an
>>>>> idea. The block core contains this little bit of code:
>>>>
>>>> I haven't read enough of the code yet, to comment intelligently ;)
>>>
>>> My little patch doesn't seem to help. I'm either changing the wrong
>>> piece of code entirely or I'm penalizing readers and writers too much.
>>>
>>> Hopefully some real block layer people can comment as to whether a
>>> refinement of this idea could work. The behavior I want is for
>>> writeback to be limited to using a smallish fraction of the total
>>> request queue size -- I think that writeback should be able to enqueue
>>> enough requests to get decent sorting performance but not enough
>>> requests to prevent the io scheduler from doing a good job on
>>> non-writeback I/O.
>>
>> The thing is that if there are no reads that are waiting, why not use
>> every bit of disk I/O available to write? If you can do that reliably
>> with only using part of the queue, fine, but aren't you getting fairly
>> close to just having separate queues for reading and writing with such
>> a restriction?
>
> Hmm.
>
> I wonder what the actual effect of queue length is on throughput. I
> suspect that using half the queue gives you well over half the
> throughput as long as the queue isn't tiny.
>
> I'm not so sure I'd go so far as having separate reader and writer
> queues -- I think that small synchronous writes should also not get
> stuck behind large writeback storms, but maybe that's something that
> can be a secondary goal. That being said, separate reader and writer
> queues might solve the immediate problem. It won't help for the case
> where a small fsync blocks behind writeback, though, and that seems to
> be a very common cause of Firefox freezing on my system.
>
> Is there an easy way to do a proof-of-concept? It would be great if
> there was a ten-line patch that implemented something like this
> correctly enough to see if it helps. I don't think I'm the right
> person to do it, because my knowledge of the block layer code is
> essentially nil.

I think it's at least a bit more subtle than this. cfq distinguishes
SYNC and ASYNC, but very large fsyncs are presumably SYNC. Deadline
pays no attention to rw flags.

Anyway, it seems like there's basically nothing prioritizing what
happens when the number of requests exceeds the congestion thresholds.
I'd happily bet a beverage* that Postgres's slow requests are spending
an excessive amount of time waiting to get into the queue in the first
place.

* Since I'm back home now, any actual beverage transaction will be
rather delayed.
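To make the "let writeback use only part of the queue" idea concrete, here is a toy userspace model of the admission policy being debated. This is emphatically not kernel code; the structure, function and numbers are invented for illustration only.

/*
 * Toy model of the proposal: cap how much of the request queue
 * background writeback may occupy so that a synchronous reader can
 * always get a slot, while writeback still gets enough requests queued
 * for the IO scheduler to sort effectively.
 */
#include <stdbool.h>
#include <stdio.h>

#define QUEUE_DEPTH         128
#define WRITEBACK_MAX_SHARE   4	/* writeback may hold at most 1/4 of the slots */

struct toy_queue {
	int in_flight;			/* all requests currently queued */
	int writeback_in_flight;	/* subset issued by background writeback */
};

static bool may_queue(const struct toy_queue *q, bool is_writeback)
{
	if (q->in_flight >= QUEUE_DEPTH)
		return false;		/* completely full: everyone waits */
	if (is_writeback &&
	    q->writeback_in_flight >= QUEUE_DEPTH / WRITEBACK_MAX_SHARE)
		return false;		/* writeback has used up its share */
	return true;
}

int main(void)
{
	struct toy_queue q = { 0, 0 };

	/* writeback slams the queue but is cut off at its share... */
	while (may_queue(&q, true)) {
		q.in_flight++;
		q.writeback_in_flight++;
	}
	/* ...so a synchronous read still finds a free slot */
	printf("writeback holds %d of %d slots, read admitted: %s\n",
	       q.writeback_in_flight, QUEUE_DEPTH,
	       may_queue(&q, false) ? "yes" : "no");
	return 0;
}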
Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
On Wed, Mar 26, 2014 at 3:35 PM, David Lang wrote: > On Wed, 26 Mar 2014, Andy Lutomirski wrote: > I'm not sure I understand the request queue stuff, but here's an idea. The block core contains this little bit of code: >>> >>> >>> I haven't read enough of the code yet, to comment intelligently ;) >> >> >> My little patch doesn't seem to help. I'm either changing the wrong >> piece of code entirely or I'm penalizing readers and writers too much. >> >> Hopefully some real block layer people can comment as to whether a >> refinement of this idea could work. The behavior I want is for >> writeback to be limited to using a smallish fraction of the total >> request queue size -- I think that writeback should be able to enqueue >> enough requests to get decent sorting performance but not enough >> requests to prevent the io scheduler from doing a good job on >> non-writeback I/O. > > > The thing is that if there are no reads that are waiting, why not use every > bit of disk I/O available to write? If you can do that reliably with only > using part of the queue, fine, but aren't you getting fairly close to just > having separate queues for reading and writing with such a restriction? > Hmm. I wonder what the actual effect of queue length is on throughput. I suspect that using half the queue gives you well over half the throughput as long as the queue isn't tiny. I'm not so sure I'd go so far as having separate reader and writer queues -- I think that small synchronous writes should also not get stuck behind large writeback storms, but maybe that's something that can be a secondary goal. That being said, separate reader and writer queues might solve the immediate problem. It won't help for the case where a small fsync blocks behind writeback, though, and that seems to be a very common cause of Firefox freezing on my system. Is there an easy way to do a proof-of-concept? It would be great if there was a ten-line patch that implemented something like this correctly enough to see if it helps. I don't think I'm the right person to do it, because my knowledge of the block layer code is essentially nil. > >> As an even more radical idea, what if there was a way to submit truly >> enormous numbers of lightweight requests, such that the queue will >> give the requester some kind of callback when the request is nearly >> ready for submission so the requester can finish filling in the >> request? This would allow things like dm-crypt to get the benefit of >> sorting without needing to encrypt hundreds of MB of data in advance >> of having that data actually be to the backing device. It might also >> allow writeback to submit multiple gigabytes of writes, in arbitrarily >> large pieces, but not to need to pin pages or do whatever expensive >> things are needed until the IO actually happens. > > > the problem with a callback is that you then need to wait for that source to > get the CPU and finish doing it's work. What happens if that takes long > enough for you to run out of data to write? And is it worth the extra > context switches to bounce around when the writing process was finished with > that block already. dm-crypt is so context-switch heavy that I doubt the context switches matter. And you'd need to give the callback early enough that there's a very good chance that the callback will finish in time. 
There might even need to be a way to let other non-callback-dependent IO pass by the callback-dependent stuff, although in the particular case of dm-crypt, dm-crypt is pretty much the only source of writes. (Reads don't have this problem for dm-crypt, I think.) --Andy
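For what it's worth, here is a purely hypothetical toy model of the "lightweight request with a late fill-in callback" idea (all names invented for illustration; nothing like this exists in the block layer): the queue can be built up and sorted first, and the expensive work -- encryption, in the dm-crypt case -- only happens in prepare() once dispatch is imminent:

#include <stdio.h>
#include <string.h>

/* A lightweight request: queued cheaply, filled in just before dispatch. */
struct light_req {
    size_t len;                              /* amount of data this request covers */
    char payload[64];                        /* filled in late, e.g. ciphertext */
    void (*prepare)(struct light_req *r);    /* called when dispatch is imminent */
};

/* dm-crypt-like owner: do the expensive work only now. */
static void encrypt_prepare(struct light_req *r)
{
    memset(r->payload, 'x', r->len);         /* stand-in for real encryption */
}

/* Queue side: sorting/merging could have happened before any prepare() call. */
static void dispatch(struct light_req *reqs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        reqs[i].prepare(&reqs[i]);           /* late fill-in */
        printf("dispatching %zu bytes\n", reqs[i].len);
    }
}

int main(void)
{
    struct light_req reqs[3] = {
        { .len = 16, .prepare = encrypt_prepare },
        { .len = 32, .prepare = encrypt_prepare },
        { .len = 64, .prepare = encrypt_prepare },
    };
    dispatch(reqs, 3);
    return 0;
}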
Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
On Wed, 26 Mar 2014, Andy Lutomirski wrote: I'm not sure I understand the request queue stuff, but here's an idea. The block core contains this little bit of code: I haven't read enough of the code yet, to comment intelligently ;) My little patch doesn't seem to help. I'm either changing the wrong piece of code entirely or I'm penalizing readers and writers too much. Hopefully some real block layer people can comment as to whether a refinement of this idea could work. The behavior I want is for writeback to be limited to using a smallish fraction of the total request queue size -- I think that writeback should be able to enqueue enough requests to get decent sorting performance but not enough requests to prevent the io scheduler from doing a good job on non-writeback I/O.

The thing is that if there are no reads that are waiting, why not use every bit of disk I/O available to write? If you can do that reliably with only using part of the queue, fine, but aren't you getting fairly close to just having separate queues for reading and writing with such a restriction?

As an even more radical idea, what if there was a way to submit truly enormous numbers of lightweight requests, such that the queue will give the requester some kind of callback when the request is nearly ready for submission so the requester can finish filling in the request? This would allow things like dm-crypt to get the benefit of sorting without needing to encrypt hundreds of MB of data in advance of having that data actually be written to the backing device. It might also allow writeback to submit multiple gigabytes of writes, in arbitrarily large pieces, but not to need to pin pages or do whatever expensive things are needed until the IO actually happens.

the problem with a callback is that you then need to wait for that source to get the CPU and finish doing its work. What happens if that takes long enough for you to run out of data to write? And is it worth the extra context switches to bounce around when the writing process was finished with that block already.

David Lang

For reference, here's my patch that doesn't work well:

diff --git a/block/blk-core.c b/block/blk-core.c
index 4cd5ffc..c0dedc3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -941,11 +941,11 @@ static struct request *__get_request(struct request_list *
 	}
 
 	/*
-	 * Only allow batching queuers to allocate up to 50% over the defined
-	 * limit of requests, otherwise we could have thousands of requests
-	 * allocated with any setting of ->nr_requests
+	 * Only allow batching queuers to allocate up to 50% of the
+	 * defined limit of requests, so that non-batching queuers can
+	 * get into the queue and thus be scheduled properly.
 	 */
-	if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
+	if (rl->count[is_sync] >= (q->nr_requests + 3) / 4)
 		return NULL;
 
 	q->nr_rqs[is_sync]++;
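(For a rough sense of scale -- my numbers, not from the thread: with the common default of nr_requests = 128 per queue, the original check lets a batching submitter such as writeback allocate up to 3 * 128 / 2 = 192 requests of one sync class, whereas the proposed check caps it at (128 + 3) / 4 = 32, leaving most of the pool for other submitters.)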
Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
On Wed, Mar 26, 2014 at 2:55 PM, Andres Freund wrote: > On 2014-03-26 14:41:31 -0700, Andy Lutomirski wrote: >> On Wed, Mar 26, 2014 at 12:11 PM, Andres Freund wrote: >> > Hi, >> > >> > At LSF/MM there was a slot about postgres' problems with the kernel. Our >> > top#1 concern is frequent slow read()s that happen while another process >> > calls fsync(), even though we'd be perfectly fine if that fsync() took >> > ages. >> > The "conclusion" of that part was that it'd be very useful to have a >> > demonstration of the problem without needing a full blown postgres >> > setup. I've quickly hacked something together, that seems to show the >> > problem nicely. >> > >> > For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/ >> > and the "IO Scheduling" bit in >> > http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de >> > >> >> For your amusement: running this program in KVM on a 2GB disk image >> failed, but it caused the *host* to go out to lunch for several >> seconds while failing. In fact, it seems to have caused the host to >> fall over so badly that the guest decided that the disk controller was >> timing out. The host is btrfs, and I think that btrfs is *really* bad >> at this kind of workload. > > Also, unless you changed the parameters, it's a) using a 48GB disk file, > and writes really rather fast ;) > >> Even using ext4 is no good. I think that dm-crypt is dying under the >> load. So I won't test your program for real :/ > > Try to reduce data_size to RAM * 2, NUM_RANDOM_READERS to something > smaller. If it still doesn't work consider increasing the two nsleep()s... > > I didn't have a good idea how to scale those to the current machine in a > halfway automatic fashion. OK, I think I'm getting reasonable bad behavior with these qemu options: -smp 2 -cpu host -m 600 -drive file=/var/lutotmp/test.img,cache=none and a 2GB test partition. > >> > Possible solutions: >> > * Add a fadvise(UNDIRTY), that doesn't stall on a full IO queue like >> > sync_file_range() does. >> > * Make IO triggered by writeback regard IO priorities and add it to >> > schedulers other than CFQ >> > * Add a tunable that allows limiting the amount of dirty memory before >> > writeback on a per process basis. >> > * ...? >> >> I thought the problem wasn't so much that priorities weren't respected >> but that the fsync call fills up the queue, so everything starts >> contending for the right to enqueue a new request. > > I think it's both actually. If I understand correctly there's not even a > correct association to the originator anymore during a fsync triggered > flush? > >> Since fsync blocks until all of its IO finishes anyway, what if it >> could just limit itself to a much smaller number of outstanding >> requests? > > Yea, that could already help. If you remove the fsync()s, the problem > will periodically appear anyway, because writeback is triggered with > vengeance. That'd need to be fixed in a similar way. > >> I'm not sure I understand the request queue stuff, but here's an idea. >> The block core contains this little bit of code: > > I haven't read enough of the code yet, to comment intelligently ;) My little patch doesn't seem to help. I'm either changing the wrong piece of code entirely or I'm penalizing readers and writers too much. Hopefully some real block layer people can comment as to whether a refinement of this idea could work. 
The behavior I want is for writeback to be limited to using a smallish fraction of the total request queue size -- I think that writeback should be able to enqueue enough requests to get decent sorting performance but not enough requests to prevent the io scheduler from doing a good job on non-writeback I/O. As an even more radical idea, what if there was a way to submit truly enormous numbers of lightweight requests, such that the queue will give the requester some kind of callback when the request is nearly ready for submission so the requester can finish filling in the request? This would allow things like dm-crypt to get the benefit of sorting without needing to encrypt hundreds of MB of data in advance of having that data actually be written to the backing device. It might also allow writeback to submit multiple gigabytes of writes, in arbitrarily large pieces, but not to need to pin pages or do whatever expensive things are needed until the IO actually happens.

For reference, here's my patch that doesn't work well:

diff --git a/block/blk-core.c b/block/blk-core.c
index 4cd5ffc..c0dedc3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -941,11 +941,11 @@ static struct request *__get_request(struct request_list *
 	}
 
 	/*
-	 * Only allow batching queuers to allocate up to 50% over the defined
-	 * limit of requests, otherwise we could have thousands of requests
-	 * allocated with any setting of ->nr_requests
+	 * Only allow batching queuers to allocate up to 50% of the
+	 * defined limit of requests, so that non-batching queuers can
+	 * get into the queue and thus be scheduled properly.
 	 */
-	if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
+	if (rl->count[is_sync] >= (q->nr_requests + 3) / 4)
 		return NULL;
 
 	q->nr_rqs[is_sync]++;
Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
On 2014-03-26 14:41:31 -0700, Andy Lutomirski wrote: > On Wed, Mar 26, 2014 at 12:11 PM, Andres Freund wrote: > > Hi, > > > > At LSF/MM there was a slot about postgres' problems with the kernel. Our > > top#1 concern is frequent slow read()s that happen while another process > > calls fsync(), even though we'd be perfectly fine if that fsync() took > > ages. > > The "conclusion" of that part was that it'd be very useful to have a > > demonstration of the problem without needing a full blown postgres > > setup. I've quickly hacked something together, that seems to show the > > problem nicely. > > > > For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/ > > and the "IO Scheduling" bit in > > http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de > > > > For your amusement: running this program in KVM on a 2GB disk image > failed, but it caused the *host* to go out to lunch for several > seconds while failing. In fact, it seems to have caused the host to > fall over so badly that the guest decided that the disk controller was > timing out. The host is btrfs, and I think that btrfs is *really* bad > at this kind of workload. Also, unless you changed the parameters, it's a) using a 48GB disk file, and writes really rather fast ;) > Even using ext4 is no good. I think that dm-crypt is dying under the > load. So I won't test your program for real :/ Try to reduce data_size to RAM * 2, NUM_RANDOM_READERS to something smaller. If it still doesn't work consider increasing the two nsleep()s... I didn't have a good idea how to scale those to the current machine in a halfway automatic fashion. > > Possible solutions: > > * Add a fadvise(UNDIRTY), that doesn't stall on a full IO queue like > > sync_file_range() does. > > * Make IO triggered by writeback regard IO priorities and add it to > > schedulers other than CFQ > > * Add a tunable that allows limiting the amount of dirty memory before > > writeback on a per process basis. > > * ...? > > I thought the problem wasn't so much that priorities weren't respected > but that the fsync call fills up the queue, so everything starts > contending for the right to enqueue a new request. I think it's both actually. If I understand correctly there's not even a correct association to the originator anymore during a fsync triggered flush? > Since fsync blocks until all of its IO finishes anyway, what if it > could just limit itself to a much smaller number of outstanding > requests? Yea, that could already help. If you remove the fsync()s, the problem will periodically appear anyway, because writeback is triggered with vengeance. That'd need to be fixed in a similar way. > I'm not sure I understand the request queue stuff, but here's an idea. > The block core contains this little bit of code: I haven't read enough of the code yet, to comment intelligently ;) Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
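A halfway-automatic scaling could start from the machine's RAM size, as suggested above. A small sketch of that (my addition, not part of the posted test program; the variable names follow the ones mentioned in the mail, and tying the reader count to the CPU count is just one arbitrary choice):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long pages = sysconf(_SC_PHYS_PAGES);
    long page_size = sysconf(_SC_PAGE_SIZE);
    unsigned long long ram = (unsigned long long)pages * (unsigned long long)page_size;

    unsigned long long data_size = 2ULL * ram;                 /* 2 x RAM, as suggested */
    long num_random_readers = sysconf(_SC_NPROCESSORS_ONLN);   /* assumption: one reader per CPU */

    printf("RAM: %llu MiB, data_size: %llu MiB, readers: %ld\n",
           ram >> 20, data_size >> 20, num_random_readers);
    return 0;
}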
Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()
On Wed, Mar 26, 2014 at 12:11 PM, Andres Freund wrote: > Hi, > > At LSF/MM there was a slot about postgres' problems with the kernel. Our > top#1 concern is frequent slow read()s that happen while another process > calls fsync(), even though we'd be perfectly fine if that fsync() took > ages. > The "conclusion" of that part was that it'd be very useful to have a > demonstration of the problem without needing a full blown postgres > setup. I've quickly hacked something together, that seems to show the > problem nicely. > > For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/ > and the "IO Scheduling" bit in > http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de > For your amusement: running this program in KVM on a 2GB disk image failed, but it caused the *host* to go out to lunch for several seconds while failing. In fact, it seems to have caused the host to fall over so badly that the guest decided that the disk controller was timing out. The host is btrfs, and I think that btrfs is *really* bad at this kind of workload. Even using ext4 is no good. I think that dm-crypt is dying under the load. So I won't test your program for real :/ [...] > Possible solutions: > * Add a fadvise(UNDIRTY), that doesn't stall on a full IO queue like > sync_file_range() does. > * Make IO triggered by writeback regard IO priorities and add it to > schedulers other than CFQ > * Add a tunable that allows limiting the amount of dirty memory before > writeback on a per process basis. > * ...? I thought the problem wasn't so much that priorities weren't respected but that the fsync call fills up the queue, so everything starts contending for the right to enqueue a new request. Since fsync blocks until all of its IO finishes anyway, what if it could just limit itself to a much smaller number of outstanding requests? I'm not sure I understand the request queue stuff, but here's an idea. The block core contains this little bit of code:

	/*
	 * Only allow batching queuers to allocate up to 50% over the defined
	 * limit of requests, otherwise we could have thousands of requests
	 * allocated with any setting of ->nr_requests
	 */
	if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
		return NULL;

What if this changed to:

	/*
	 * Only allow batching queuers to allocate up to 50% of the defined
	 * limit of requests, so that non-batching queuers can get into the queue
	 * and thus be scheduled properly.
	 */
	if (rl->count[is_sync] >= (q->nr_requests + 3) / 4)
		return NULL;

I suspect that doing this right would take a bit more care than that, but I wonder if this approach is any good.

--Andy
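For context on the sync_file_range() behaviour referred to in the quoted list of possible solutions, here is a minimal sketch (mine, not from the thread) of how an application can ask the kernel to start writeback of a dirty range without waiting for completion; as discussed above, even SYNC_FILE_RANGE_WRITE can stall the caller once the device's request queue is congested, which is why a non-stalling fadvise(UNDIRTY)-style hint was proposed. The file name and sizes are illustrative:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static char buf[1 << 20];   /* 1 MiB of data to dirty */

int main(void)
{
    int fd = open("datafile", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    memset(buf, 0, sizeof(buf));
    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
        perror("write");
        return EXIT_FAILURE;
    }

    /* Kick off writeback of the first MiB; do not wait for it to finish. */
    if (sync_file_range(fd, 0, sizeof(buf), SYNC_FILE_RANGE_WRITE) != 0)
        perror("sync_file_range");

    close(fd);
    return EXIT_SUCCESS;
}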