Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()

2014-06-04 Thread Andres Freund
Hi Dave, Ted, All,

On 2014-05-23 16:42:47 +1000, Dave Chinner wrote:
> On Tue, Apr 29, 2014 at 01:57:14AM +0200, Andres Freund wrote:
> > Hi Dave,
> > 
> > On 2014-04-29 09:47:56 +1000, Dave Chinner wrote:
> > > ping?
> > 
> > I'd replied at http://marc.info/?l=linux-mm&m=139730910307321&w=2
> 
> I missed it, sorry.

No worries. As you can see, I'm not quick at answering either :/

> I've had a bit more time to look at this behaviour now and tweaked
> it as you suggested, but I simply can't get XFS to misbehave in the
> manner you demonstrated. However, I can reproduce major read latency
> changes and writeback flush storms with ext4.  I originally only
> tested on XFS.

That's interesting. I know that the problem was reproducible on xfs at
some point, but that was on 2.6.18 or so...

I'll try whether I can make it perform badly on the measly hardware I
have available.

> I'm using the no-op IO scheduler everywhere, too.

And I'll check whether it's potentially related to that.

> ext4, OTOH, generated a much, much higher periodic write IO load and
> it's regularly causing read IO latencies in the hundreds of
> milliseconds. Every so often this occurred on ext4 (5s sample rate)
> 
> Device:         rrqm/s   wrqm/s       r/s        w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> vdc               0.00     3.00   3142.20     219.20    34.11    19.10    32.42     1.11    0.33    0.33    0.31   0.27  91.92
> vdc               0.00     0.80   3311.60     216.20    35.86    18.90    31.79     1.17    0.33    0.33    0.39   0.26  92.56
> vdc               0.00     0.80   2919.80    2750.60    31.67    48.36    28.90    20.05    3.50    0.36    6.83   0.16  92.96
> vdc               0.00     0.80    435.00   15689.80     4.96   198.10    25.79   113.21    7.03    2.32    7.16   0.06  99.20
> vdc               0.00     0.80   2683.80     216.20    29.72    18.98    34.39     1.13    0.39    0.39    0.40   0.32  91.92
> vdc               0.00     0.80   2853.00     218.20    31.29    19.06    33.57     1.14    0.37    0.37    0.36   0.30  92.56
> 
> Which is, I think, a sign of what you'd been trying to demonstrate -
> a major dip in read performance when writeback is flushing.

I've seen *much* worse cases than this, but it's what we're seeing in
production.

> What is interesting here is the difference in IO patterns. ext4 is
> doing much larger IOs than XFS - its average IO size is 16k, while
> XFS's is a bit over 8k. So while the read and background write IOPS
> rates are similar, ext4 is moving a lot more data to/from disk in
> larger chunks.
> 
> This seems also to translate to much larger writeback IO peaks in
> ext4.  I have no idea what this means in terms of actual application
> throughput, but it looks very much to me like the nasty read
> latencies are much more pronounced on ext4 because of the higher
> read bandwidths and write IOPS being seen.

I'll try starting a benchmark of actual postgres showing the different
peak/average throughput and latencies.

> So, seeing the differences in behaviour just by changing
> filesystems, I just ran the workload on btrfs. Ouch - it was
> even worse than ext4 in terms of read latencies - they were highly
> unpredictable, and massively variable even within a read group:

I've essentially given up on btrfs for the foreseeable future :(.

> That means it isn't clear that there's any generic infrastructure
> problem here, and it certainly isn't clear that each filesystem has
> the same problem or the issues can be solved by a generic mechanism.
> I think you probably need to engage the ext4 developers directly to
> understand what ext4 is doing in detail, or work out how to prod XFS
> into displaying that extremely bad read latency behaviour.

I've CCed the ext4 list and Ted. Maybe that'll bring some insight...

> > > On Wed, Apr 09, 2014 at 07:20:09PM +1000, Dave Chinner wrote:
> > > > I'm not sure how you were generating the behaviour you reported, but
> > > > the test program as it stands does not appear to be causing any
> > > > problems at all on the sort of storage I'd expect large databases to
> > > > be hosted on
> > 
> > A really really large number of databases aren't stored on big enterprise
> > rigs...
> 
> I'm not using a big enterprise rig. I've reproduced these results on
> a low end Dell server with the internal H710 SAS RAID and a pair of
> consumer SSDs in RAID0, as well as via a 4 year old Perc/6e SAS RAID
> HBA with 12 2T nearline SAS drives in RAID0.

There's a *lot* of busy postgres installations out there running on a
single disk of spinning rust. Hopefully replicating to another piece of
spinning rust... In comparison to that, that's enterprise hardware ;)

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()

2014-05-23 Thread Dave Chinner
On Tue, Apr 29, 2014 at 01:57:14AM +0200, Andres Freund wrote:
> Hi Dave,
> 
> On 2014-04-29 09:47:56 +1000, Dave Chinner wrote:
> > ping?
> 
> I'd replied at http://marc.info/?l=linux-mm&m=139730910307321&w=2

I missed it, sorry.

I've had a bit more time to look at this behaviour now and tweaked
it as you suggested, but I simply can't get XFS to misbehave in the
manner you demonstrated. However, I can reproduce major read latency
changes and writeback flush storms with ext4.  I originally only
tested on XFS. I'm using the no-op IO scheduler everywhere, too.

I ran the tweaked version I have for a couple of hours on XFS, and
only saw a handful of abnormal writeback events where the write IOPS
spiked above the normal periodic peaks and was sufficient to cause
any noticeable increase in read latency. Even then the maximums were
in the 40ms range, nothing much higher.

ext4, OTOH, generated a much, much higher periodic write IO load and
it's regularly causing read IO latencies in the hundreds of
milliseconds. Every so often this occurred on ext4 (5s sample rate)

Device:         rrqm/s   wrqm/s       r/s        w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vdc               0.00     3.00   3142.20     219.20    34.11    19.10    32.42     1.11    0.33    0.33    0.31   0.27  91.92
vdc               0.00     0.80   3311.60     216.20    35.86    18.90    31.79     1.17    0.33    0.33    0.39   0.26  92.56
vdc               0.00     0.80   2919.80    2750.60    31.67    48.36    28.90    20.05    3.50    0.36    6.83   0.16  92.96
vdc               0.00     0.80    435.00   15689.80     4.96   198.10    25.79   113.21    7.03    2.32    7.16   0.06  99.20
vdc               0.00     0.80   2683.80     216.20    29.72    18.98    34.39     1.13    0.39    0.39    0.40   0.32  91.92
vdc               0.00     0.80   2853.00     218.20    31.29    19.06    33.57     1.14    0.37    0.37    0.36   0.30  92.56

Which is, I think, a sign of what you'd been trying to demonstrate -
a major dip in read performance when writeback is flushing.

In comparison, this is from XFS:

Device:         rrqm/s   wrqm/s       r/s        w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vdc               0.00     0.00   2416.40     335.00    21.02     7.85    21.49     0.78    0.28    0.30    0.19   0.24  65.28
vdc               0.00     0.00   2575.80     336.00    22.68     7.88    21.49     0.81    0.28    0.29    0.16   0.23  66.32
vdc               0.00     0.00   1740.20    4645.20    15.60    58.22    23.68    21.21    3.32    0.41    4.41   0.11  68.56
vdc               0.00     0.00   2082.80     329.00    18.28     7.71    22.07     0.81    0.34    0.35    0.26   0.28  67.44
vdc               0.00     0.00   2347.80     333.20    19.53     7.80    20.88     0.83    0.31    0.32    0.25   0.25  67.52

You can see how much less load XFS is putting on the storage - it's
only 65-70% utilised compared to the 90-100% load that ext4 is
generating.

What is interesting here is the difference in IO patterns. ext4 is
doing much larger IOs than XFS - its average IO size is 16k, while
XFS's is a bit over 8k. So while the read and background write IOPS
rates are similar, ext4 is moving a lot more data to/from disk in
larger chunks.

This seems also to translate to much larger writeback IO peaks in
ext4.  I have no idea what this means in terms of actual application
throughput, but it looks very much to me like the nasty read
latencies are much more pronounced on ext4 because of the higher
read bandwidths and write IOPS being seen.

The screen shot of the recorded behaviour is attached - the left
hand side is the tail end (~30min) of the 2 hour long XFS run, and
the first half an hour of ext4 running. The difference in IO
behaviour is quite obvious

What is interesting is that CPU usage is not very much different
between the two filesystems, but IOWait is much, much higher for
ext4. That indicates that ext4 is definitely loading the storage
more, and so is much more likely to have IO load related
latencies.

So, seeing the differences in behaviour just by changing
filesystems, I just ran the workload on btrfs. Ouch - it was
even worse than ext4 in terms of read latencies - they were highly
unpredictable, and massively variable even within a read group:


read[11331]: avg: 0.3 msec; max: 7.0 msec
read[11340]: avg: 0.3 msec; max: 7.1 msec
read[11334]: avg: 0.3 msec; max: 7.0 msec
read[11329]: avg: 0.3 msec; max: 7.0 msec
read[11328]: avg: 0.3 msec; max: 7.0 msec
read[11332]: avg: 0.6 msec; max: 4481.2 msec
read[11342]: avg: 0.6 msec; max: 4480.6 msec
read[11332]: avg: 0.0 msec; max: 0.7 msec
read[11342]: avg: 0.0 msec; max: 1.6 msec
wal[11326]: avg: 0.0 msec; max: 0.1 msec
.

It was also not uncommon to see major commit latencies:

read[11335]: avg: 0.2 msec; max: 8.3 msec
read[11341]: avg: 0.2 msec; max: 8.5 msec
wal[11326]: avg: 0.0 msec; max: 0.1 msec
commit[11326]: avg: 0.7 msec; max: 5302.3 msec
wal[11326]: avg: 0.0 msec; 

Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()

2014-04-28 Thread Andres Freund
Hi Dave,

On 2014-04-29 09:47:56 +1000, Dave Chinner wrote:
> ping?

I'd replied at http://marc.info/?l=linux-mm&m=139730910307321&w=2

As an additional note:

> On Wed, Apr 09, 2014 at 07:20:09PM +1000, Dave Chinner wrote:
> > I'm not sure how you were generating the behaviour you reported, but
> > the test program as it stands does not appear to be causing any
> > problems at all on the sort of storage I'd expect large databases to
> > be hosted on

A really really large number of databases aren't stored on big enterprise
rigs...

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()

2014-04-28 Thread Dave Chinner
ping?

On Wed, Apr 09, 2014 at 07:20:09PM +1000, Dave Chinner wrote:
> On Wed, Mar 26, 2014 at 08:11:13PM +0100, Andres Freund wrote:
> > Hi,
> > 
> > At LSF/MM there was a slot about postgres' problems with the kernel. Our
> > top#1 concern is frequent slow read()s that happen while another process
> > calls fsync(), even though we'd be perfectly fine if that fsync() took
> > ages.
> > The "conclusion" of that part was that it'd be very useful to have a
> > demonstration of the problem without needing a full blown postgres
> > setup. I've quickly hacked something together, that seems to show the
> > problem nicely.
> > 
> > For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
> > and the "IO Scheduling" bit in
> > http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de
> > 
> > The tools output looks like this:
> > gcc -std=c99 -Wall -ggdb ~/tmp/ioperf.c -o ioperf && ./ioperf
> > ...
> > wal[12155]: avg: 0.0 msec; max: 0.0 msec
> > commit[12155]: avg: 0.2 msec; max: 15.4 msec
> > wal[12155]: avg: 0.0 msec; max: 0.0 msec
> > read[12157]: avg: 0.2 msec; max: 9.4 msec
> > ...
> > read[12165]: avg: 0.2 msec; max: 9.4 msec
> > wal[12155]: avg: 0.0 msec; max: 0.0 msec
> > starting fsync() of files
> > finished fsync() of files
> > read[12162]: avg: 0.6 msec; max: 2765.5 msec
> > 
> > So, the average read time is less than one ms (SSD, and about 50% cached
> > workload). But once another backend does the fsync(), read latency
> > skyrockets.
> > 
> > A concurrent iostat shows the problem pretty clearly:
> > Device:         rrqm/s   wrqm/s       r/s        w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > sda               1.00     0.00   6322.00     337.00    51.73     4.38    17.26     2.09    0.32    0.19    2.59   0.14  90.00
> > sda               0.00     0.00   6016.00     303.00    47.18     3.95    16.57     2.30    0.36    0.23    3.12   0.15  94.40
> > sda               0.00     0.00   6236.00    1059.00    49.52    12.88    17.52     5.91    0.64    0.20    3.23   0.12  88.40
> > sda               0.00     0.00    105.00   26173.00     0.89   311.39    24.34   142.37    5.42   27.73    5.33   0.04 100.00
> > sda               0.00     0.00     78.00   27199.00     0.87   324.06    24.40   142.30    5.25   11.08    5.23   0.04 100.00
> > sda               0.00     0.00     10.00   33488.00     0.11   399.05    24.40   136.41    4.07  100.40    4.04   0.03 100.00
> > sda               0.00     0.00   3819.00   10096.00    31.14   120.47    22.31    42.80    3.10    0.32    4.15   0.07  96.00
> > sda               0.00     0.00   6482.00     346.00    52.98     4.53    17.25     1.93    0.28    0.20    1.80   0.14  93.20
> > 
> > While the fsync() is going on (or the kernel decides to start writing
> > out aggressively for some other reason) the amount of writes to the disk
> > is increased by two orders of magnitude. Unsurprisingly with disastrous
> > consequences for read() performance. We really want a way to pace the
> > writes issued to the disk more regularly.
> 
> Hi Andres,
> 
> I've finally dug myself out from under the backlog from LSFMM far
> enough to start testing this on my local IO performance test rig.
> 
> tl;dr: I can't reproduce this peaky behaviour on my test rig.
> 
> I'm running in a 16p VM with 16GB RAM (in 4 nodes via fake-numa) and
> an unmodified benchmark on a current 3.15-linus tree. All storage
> (guest and host) is XFS based, guest VMs use virtio and direct IO to
> the backing storage.  The host is using noop IO scheduling.
> 
> The first IO setup I ran was a 100TB XFS filesystem in the guest.
> The backing file is a sparse file on an XFS filesystem on a pair of
> 240GB SSDs (Samsung 840 EVO) in RAID 0 via DM.  The SSDs are
> exported as JBOD from a RAID controller which has 1GB of FBWC.  The
> guest is capable of sustaining around 65,000 random read IOPS and
> 40,000 write IOPS through this filesystem depending on the tests
> being run.
> 
> The iostat output looks like this:
> 
> Device:         rrqm/s   wrqm/s       r/s        w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> vdc               0.00     0.00   1817.00     315.40    18.80     6.93    24.71     0.80    0.38    0.38    0.37   0.31  66.24
> vdc               0.00     0.00   2094.20     323.20    21.82     7.10    24.50     0.81    0.33    0.34    0.27   0.28  68.48
> vdc               0.00     0.00   1989.00    4500.20    20.50    56.64    24.34    24.82    3.82    0.43    5.32   0.12  80.16
> vdc               0.00     0.00   2019.80     320.80    20.83     7.05    24.39     0.83    0.35    0.36    0.32   0.29  69.04
> vdc               0.00     0.00   2206.60     323.20    22.57     7.10    24.02     0.87    0.34    0.34    0.33   0.28  71.92
> vdc               0.00     0.00   2437.20     329.60    25.79     7.24    24.45     0.83    0.30    0.30    0.27   0.26  71.76
> vdc               0.00     0.00   1224.40   11263.80    12.88   136.38    24.48    64.90    5.20    0.69    5.69   0.07  84.96
> vdc               0.00     0.00   2074.60     319.40    21.03     7.01    23.99     0.84    0.35


Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()

2014-03-27 Thread Fernando Luis Vazquez Cao

(2014/03/28 0:50), Jan Kara wrote:

On Wed 26-03-14 22:55:18, Andres Freund wrote:

On 2014-03-26 14:41:31 -0700, Andy Lutomirski wrote:

On Wed, Mar 26, 2014 at 12:11 PM, Andres Freund  wrote:

Hi,

At LSF/MM there was a slot about postgres' problems with the kernel. Our
top#1 concern is frequent slow read()s that happen while another process
calls fsync(), even though we'd be perfectly fine if that fsync() took
ages.
The "conclusion" of that part was that it'd be very useful to have a
demonstration of the problem without needing a full blown postgres
setup. I've quickly hacked something together, that seems to show the
problem nicely.

For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
and the "IO Scheduling" bit in
http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de


For your amusement: running this program in KVM on a 2GB disk image
failed, but it caused the *host* to go out to lunch for several
seconds while failing.  In fact, it seems to have caused the host to
fall over so badly that the guest decided that the disk controller was
timing out.  The host is btrfs, and I think that btrfs is *really* bad
at this kind of workload.

Also, unless you changed the parameters, it's a) using a 48GB disk file,
and writes really rather fast ;)


Even using ext4 is no good.  I think that dm-crypt is dying under the
load.  So I won't test your program for real :/

Try to reduce data_size to RAM * 2, NUM_RANDOM_READERS to something
smaller. If it still doesn't work consider increasing the two nsleep()s...

I didn't have a good idea how to scale those to the current machine in a
halfway automatic fashion.

   That's not necessary. If we have a guidance like above, we can figure it
out ourselves (I hope ;).


Possible solutions:
* Add a fadvise(UNDIRTY), that doesn't stall on a full IO queue like
   sync_file_range() does.
* Make IO triggered by writeback regard IO priorities and add it to
   schedulers other than CFQ
* Add a tunable that allows limiting the amount of dirty memory before
   writeback on a per process basis.
* ...?

I thought the problem wasn't so much that priorities weren't respected
but that the fsync call fills up the queue, so everything starts
contending for the right to enqueue a new request.

I think it's both actually. If I understand correctly there's not even a
correct association to the originator anymore during a fsync triggered
flush?

   There is. The association is lost for background writeback (and sync(2)
for that matter) but IO from fsync(2) is submitted in the context of the
process doing fsync.

What I think happens is the problem with 'dependent sync IO' vs
'independent sync IO'. Reads are an example of dependent sync IO where you
submit a read, need it to complete and then you submit another read. OTOH
fsync is an example of independent sync IO where you fire off tons of IO to
the drive and then wait for everything. Since we treat both these types of
IO in the same way, it can easily happen that independent sync IO starves
out the dependent one (you execute say 100 IO requests for fsync and 1 IO
request for read). We've seen problems like this in the past.

I'll have a look into your test program and if my feeling is indeed
correct, I'll have a look into what we could do in the block layer to fix
this (and poke block layer guys - they had some preliminary patches that
tried to address this but it didn't go anywhere).


We have been using PostgreSQL in production for years so I am pretty familiar
with the symptoms described by the PostgreSQL guys.

In almost all cases the problem was the 'dependent sync IO' vs 'independent
sync IO' issue pointed out by Jan. However, as I mentioned during the LSFMM
session, the culprit was not the kernel but the firmware of NCQ/TCQ capable
storage that would keep read requests queued forever, leaving tasks doing
reads (dependent sync IO) waiting for an interrupt that would not come. For
the record, latencies of up to 120 seconds in supposedly enterprise storage
(I will not name names) are relatively common. This can be fixed by modifying
drivers and/or the block layer to dynamically adjust the queue depth when dumb
scheduling by the firmware is observed. If you do not want to make changes to
the kernel you can always try to change the default queue depth. With this
mechanism in place it is relatively easy to make ionice work, since, as Jan
mentioned, both reads and fsync writes are done in process context.
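
A minimal sketch of that queue depth knob (assuming a SCSI-like device that
exposes /sys/block/<dev>/device/queue_depth; this is an illustration, not
part of the original mail):

/*
 * Sketch only: cap a device's queue depth from user space so a flood of
 * independent sync writes cannot keep the device queue saturated.
 * Assumes a SCSI-like device exposing /sys/block/<dev>/device/queue_depth.
 */
#include <stdio.h>
#include <stdlib.h>

static int set_queue_depth(const char *dev, int depth)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/%s/device/queue_depth", dev);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%d\n", depth);
	return fclose(f);
}

int main(int argc, char **argv)
{
	/* e.g. ./setqd sda 4 -- allow at most 4 outstanding commands */
	if (argc != 3)
		return 1;
	return set_queue_depth(argv[1], atoi(argv[2])) ? 1 : 0;
}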

- Fernando


Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()

2014-03-27 Thread Jan Kara
  Hello,

On Wed 26-03-14 20:11:13, Andres Freund wrote:
> At LSF/MM there was a slot about postgres' problems with the kernel. Our
> top#1 concern is frequent slow read()s that happen while another process
> calls fsync(), even though we'd be perfectly fine if that fsync() took
> ages.
> The "conclusion" of that part was that it'd be very useful to have a
> demonstration of the problem without needing a full blown postgres
> setup. I've quickly hacked something together, that seems to show the
> problem nicely.
  Thanks a lot for the program!

> For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
> and the "IO Scheduling" bit in
> http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de
> 
> The tools output looks like this:
> gcc -std=c99 -Wall -ggdb ~/tmp/ioperf.c -o ioperf && ./ioperf
> ...
> wal[12155]: avg: 0.0 msec; max: 0.0 msec
> commit[12155]: avg: 0.2 msec; max: 15.4 msec
> wal[12155]: avg: 0.0 msec; max: 0.0 msec
> read[12157]: avg: 0.2 msec; max: 9.4 msec
> ...
> read[12165]: avg: 0.2 msec; max: 9.4 msec
> wal[12155]: avg: 0.0 msec; max: 0.0 msec
> starting fsync() of files
> finished fsync() of files
> read[12162]: avg: 0.6 msec; max: 2765.5 msec
> 
> So, the average read time is less than one ms (SSD, and about 50% cached
> workload). But once another backend does the fsync(), read latency
> skyrockets.
> 
> A concurrent iostat shows the problem pretty clearly:
> Device:         rrqm/s   wrqm/s       r/s        w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               1.00     0.00   6322.00     337.00    51.73     4.38    17.26     2.09    0.32    0.19    2.59   0.14  90.00
> sda               0.00     0.00   6016.00     303.00    47.18     3.95    16.57     2.30    0.36    0.23    3.12   0.15  94.40
> sda               0.00     0.00   6236.00    1059.00    49.52    12.88    17.52     5.91    0.64    0.20    3.23   0.12  88.40
> sda               0.00     0.00    105.00   26173.00     0.89   311.39    24.34   142.37    5.42   27.73    5.33   0.04 100.00
> sda               0.00     0.00     78.00   27199.00     0.87   324.06    24.40   142.30    5.25   11.08    5.23   0.04 100.00
> sda               0.00     0.00     10.00   33488.00     0.11   399.05    24.40   136.41    4.07  100.40    4.04   0.03 100.00
> sda               0.00     0.00   3819.00   10096.00    31.14   120.47    22.31    42.80    3.10    0.32    4.15   0.07  96.00
> sda               0.00     0.00   6482.00     346.00    52.98     4.53    17.25     1.93    0.28    0.20    1.80   0.14  93.20
> 
> While the fsync() is going on (or the kernel decides to start writing
> out aggressively for some other reason) the amount of writes to the disk
> is increased by two orders of magnitude. Unsurprisingly with disastrous
> consequences for read() performance. We really want a way to pace the
> writes issued to the disk more regularly.
> 
> The attached program right now can only be configured by changing some
> details in the code itself, but I guess that's not a problem. It will
> upfront allocate two files, and then start testing. If the files already
> exist it will use them.
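
The attachment itself is not preserved in this archive. A minimal sketch of
a tester in the same spirit - a few processes doing small random reads while
another keeps dirtying a large pre-allocated file and then fsync()s it -
might look like the following (illustrative only, not Andres's actual
ioperf.c; the file name, sizes and loop counts are made up):

/* Sketch: random readers report their worst read latency while a
 * "checkpointer" dirties lots of pages and flushes them with fsync(). */
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

#define DATA_SIZE   (4LL * 1024 * 1024 * 1024)  /* pre-allocated data file */
#define BLOCK       8192
#define NUM_READERS 4

static double now_ms(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

static void reader(const char *path)
{
	char buf[BLOCK];
	int fd = open(path, O_RDONLY);
	double max = 0;
	srandom(getpid());
	for (long i = 0; ; i++) {
		off_t off = (random() % (DATA_SIZE / BLOCK)) * BLOCK;
		double t = now_ms();
		if (pread(fd, buf, BLOCK, off) != BLOCK)
			perror("pread");
		double d = now_ms() - t;
		if (d > max)
			max = d;
		if (i % 10000 == 0) {
			printf("read[%d]: max: %.1f msec\n", getpid(), max);
			max = 0;
		}
	}
}

static void checkpointer(const char *path)
{
	char buf[BLOCK];
	int fd = open(path, O_WRONLY);
	memset(buf, 'x', sizeof(buf));
	srandom(getpid());
	for (;;) {
		/* dirty lots of pages, then flush them all at once */
		for (int i = 0; i < 200000; i++) {
			off_t off = (random() % (DATA_SIZE / BLOCK)) * BLOCK;
			pwrite(fd, buf, BLOCK, off);
		}
		puts("starting fsync() of file");
		fsync(fd);
		puts("finished fsync() of file");
	}
}

int main(void)
{
	const char *path = "data"; /* pre-allocate to DATA_SIZE, e.g. with fallocate(1) */
	for (int i = 0; i < NUM_READERS; i++)
		if (fork() == 0)
			reader(path);
	if (fork() == 0)
		checkpointer(path);
	wait(NULL);
	return 0;
}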
> 
> Possible solutions:
> * Add a fadvise(UNDIRTY), that doesn't stall on a full IO queue like
>   sync_file_range() does.
> * Make IO triggered by writeback regard IO priorities and add it to
>   schedulers other than CFQ
> * Add a tunable that allows limiting the amount of dirty memory before
>   writeback on a per process basis.
> * ...?
> 
> If somebody familiar with buffered IO writeback is around at LSF/MM, or
> rather collab, Robert and I will be around for the next days.
  I guess I'm your guy, at least for the writeback part. I have some
insight in the block layer as well although there are better experts around
here. But I at least know whom to catch if there's some deeply intricate
problem ;)

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()

2014-03-27 Thread Jan Kara
On Wed 26-03-14 22:55:18, Andres Freund wrote:
> On 2014-03-26 14:41:31 -0700, Andy Lutomirski wrote:
> > On Wed, Mar 26, 2014 at 12:11 PM, Andres Freund  wrote:
> > > Hi,
> > >
> > > At LSF/MM there was a slot about postgres' problems with the kernel. Our
> > > top#1 concern is frequent slow read()s that happen while another process
> > > calls fsync(), even though we'd be perfectly fine if that fsync() took
> > > ages.
> > > The "conclusion" of that part was that it'd be very useful to have a
> > > demonstration of the problem without needing a full blown postgres
> > > setup. I've quickly hacked something together, that seems to show the
> > > problem nicely.
> > >
> > > For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
> > > and the "IO Scheduling" bit in
> > > http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de
> > >
> > 
> > For your amusement: running this program in KVM on a 2GB disk image
> > failed, but it caused the *host* to go out to lunch for several
> > seconds while failing.  In fact, it seems to have caused the host to
> > fall over so badly that the guest decided that the disk controller was
> > timing out.  The host is btrfs, and I think that btrfs is *really* bad
> > at this kind of workload.
> 
> Also, unless you changed the parameters, it's a) using a 48GB disk file,
> and writes really rather fast ;)
> 
> > Even using ext4 is no good.  I think that dm-crypt is dying under the
> > load.  So I won't test your program for real :/
> 
> Try to reduce data_size to RAM * 2, NUM_RANDOM_READERS to something
> smaller. If it still doesn't work consider increasing the two nsleep()s...
> 
> I didn't have a good idea how to scale those to the current machine in a
> halfway automatic fashion.
  That's not necessary. If we have a guidance like above, we can figure it
out ourselves (I hope ;).

> > > Possible solutions:
> > > * Add a fadvise(UNDIRTY), that doesn't stall on a full IO queue like
> > >   sync_file_range() does.
> > > * Make IO triggered by writeback regard IO priorities and add it to
> > >   schedulers other than CFQ
> > > * Add a tunable that allows limiting the amount of dirty memory before
> > >   writeback on a per process basis.
> > > * ...?
> > 
> > I thought the problem wasn't so much that priorities weren't respected
> > but that the fsync call fills up the queue, so everything starts
> > contending for the right to enqueue a new request.
> 
> I think it's both actually. If I understand correctly there's not even a
> correct association to the originator anymore during a fsync triggered
> flush?
  There is. The association is lost for background writeback (and sync(2)
for that matter) but IO from fsync(2) is submitted in the context of the
process doing fsync.

What I think happens is the problem with 'dependent sync IO' vs
'independent sync IO'. Reads are an example of dependent sync IO where you
submit a read, need it to complete and then you submit another read. OTOH
fsync is an example of independent sync IO where you fire off tons of IO to
the drive and then wait for everything. Since we treat both these types of
IO in the same way, it can easily happen that independent sync IO starves
out the dependent one (you execute say 100 IO requests for fsync and 1 IO
request for read). We've seen problems like this in the past.
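
A purely illustrative sketch of the two patterns (not from the thread): the
reader cannot issue its next request until the current one completes, while
the fsync() path dirties many pages first and only then lets the kernel queue
all of that writeback at once. Both functions assume an already-open fd:

/* Illustration: 'dependent' sync reads vs 'independent' fsync writeback. */
#define _POSIX_C_SOURCE 200809L
#include <unistd.h>

void dependent_reads(int fd, int n)
{
	char page[8192];
	off_t next = 0;
	for (int i = 0; i < n; i++) {
		/* must wait for this read before we even know the next offset */
		if (pread(fd, page, sizeof(page), next) <= 0)
			break;
		next = ((off_t)(unsigned char)page[0]) * 8192; /* toy pointer chase */
	}
}

void independent_writeback(int fd, int n)
{
	char page[8192] = { 0 };
	for (int i = 0; i < n; i++)
		pwrite(fd, page, sizeof(page), (off_t)i * 8192); /* only dirties cache */
	fsync(fd); /* now a large batch of writes can hit the device at once */
}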

I'll have a look into your test program and if my feeling is indeed
correct, I'll have a look into what we could do in the block layer to fix
this (and poke block layer guys - they had some preliminary patches that
tried to address this but it didn't go anywhere).

> > Since fsync blocks until all of its IO finishes anyway, what if it
> > could just limit itself to a much smaller number of outstanding
> > requests?
> 
> Yea, that could already help. If you remove the fsync()s, the problem
> will periodically appear anyway, because writeback is triggered with a
> vengeance. That'd need to be fixed in a similar way.
  Actually, that might be triggered by a different problem because in case
of background writeback, block layer knows the IO is asynchronous and
treats it in a different way.
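
For reference, a user-space approximation of "limit itself to a much smaller
number of outstanding requests" is to push writeback out in bounded chunks
with sync_file_range() before the final fsync(). A hedged sketch (with
Andres's caveat above that sync_file_range() itself can stall once the IO
queue is full; the chunk size is arbitrary):

/* Sketch only: flush a file in bounded chunks instead of letting fsync()
 * queue everything at once.  The WAIT flags make each chunk complete
 * before the next one is issued, so roughly CHUNK bytes of writeback are
 * in flight at a time; the final fsync() is still needed for durability. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define CHUNK (8 * 1024 * 1024)

int paced_fsync(int fd, off_t file_size)
{
	for (off_t off = 0; off < file_size; off += CHUNK) {
		if (sync_file_range(fd, off, CHUNK,
		                    SYNC_FILE_RANGE_WAIT_BEFORE |
		                    SYNC_FILE_RANGE_WRITE |
		                    SYNC_FILE_RANGE_WAIT_AFTER) < 0)
			return -1;
	}
	return fsync(fd);
}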

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()

2014-03-26 Thread Andy Lutomirski
On Wed, Mar 26, 2014 at 4:11 PM, Andy Lutomirski  wrote:
> On Wed, Mar 26, 2014 at 3:35 PM, David Lang  wrote:
>> On Wed, 26 Mar 2014, Andy Lutomirski wrote:
>>
> I'm not sure I understand the request queue stuff, but here's an idea.
>  The block core contains this little bit of code:


 I haven't read enough of the code yet, to comment intelligently ;)
>>>
>>>
>>> My little patch doesn't seem to help.  I'm either changing the wrong
>>> piece of code entirely or I'm penalizing readers and writers too much.
>>>
>>> Hopefully some real block layer people can comment as to whether a
>>> refinement of this idea could work.  The behavior I want is for
>>> writeback to be limited to using a smallish fraction of the total
>>> request queue size -- I think that writeback should be able to enqueue
>>> enough requests to get decent sorting performance but not enough
>>> requests to prevent the io scheduler from doing a good job on
>>> non-writeback I/O.
>>
>>
>> The thing is that if there are no reads that are waiting, why not use every
>> bit of disk I/O available to write? If you can do that reliably with only
>> using part of the queue, fine, but aren't you getting fairly close to just
>> having separate queues for reading and writing with such a restriction?
>>
>
> Hmm.
>
> I wonder what the actual effect of queue length is on throughput.  I
> suspect that using half the queue gives you well over half the
> throughput as long as the queue isn't tiny.
>
> I'm not so sure I'd go so far as having separate reader and writer
> queues -- I think that small synchronous writes should also not get
> stuck behind large writeback storms, but maybe that's something that
> can be a secondary goal.  That being said, separate reader and writer
> queues might solve the immediate problem.  It won't help for the case
> where a small fsync blocks behind writeback, though, and that seems to
> be a very common cause of Firefox freezing on my system.
>
> Is there an easy way to do a proof-of-concept?  It would be great if
> there was a ten-line patch that implemented something like this
> correctly enough to see if it helps.  I don't think I'm the right
> person to do it, because my knowledge of the block layer code is
> essentially nil.

I think it's at least a bit more subtle than this.  cfq distinguishes
SYNC and ASYNC, but very large fsyncs are presumably SYNC.  Deadline
pays no attention to rw flags.

Anyway, it seems like there's basically nothing prioritizing what
happens when the number of requests exceeds the congestion thresholds.
 I'd happily bet a beverage* that Postgres's slow requests are
spending an excessive amount of time waiting to get into the queue in
the first place.

* Since I'm back home now, any actual beverage transaction will be
rather delayed.


Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()

2014-03-26 Thread Andy Lutomirski
On Wed, Mar 26, 2014 at 3:35 PM, David Lang  wrote:
> On Wed, 26 Mar 2014, Andy Lutomirski wrote:
>
 I'm not sure I understand the request queue stuff, but here's an idea.
  The block core contains this little bit of code:
>>>
>>>
>>> I haven't read enough of the code yet, to comment intelligently ;)
>>
>>
>> My little patch doesn't seem to help.  I'm either changing the wrong
>> piece of code entirely or I'm penalizing readers and writers too much.
>>
>> Hopefully some real block layer people can comment as to whether a
>> refinement of this idea could work.  The behavior I want is for
>> writeback to be limited to using a smallish fraction of the total
>> request queue size -- I think that writeback should be able to enqueue
>> enough requests to get decent sorting performance but not enough
>> requests to prevent the io scheduler from doing a good job on
>> non-writeback I/O.
>
>
> The thing is that if there are no reads that are waiting, why not use every
> bit of disk I/O available to write? If you can do that reliably with only
> using part of the queue, fine, but aren't you getting fairly close to just
> having separate queues for reading and writing with such a restriction?
>

Hmm.

I wonder what the actual effect of queue length is on throughput.  I
suspect that using half the queue gives you well over half the
throughput as long as the queue isn't tiny.

I'm not so sure I'd go so far as having separate reader and writer
queues -- I think that small synchronous writes should also not get
stuck behind large writeback storms, but maybe that's something that
can be a secondary goal.  That being said, separate reader and writer
queues might solve the immediate problem.  It won't help for the case
where a small fsync blocks behind writeback, though, and that seems to
be a very common cause of Firefox freezing on my system.

Is there an easy way to do a proof-of-concept?  It would be great if
there was a ten-line patch that implemented something like this
correctly enough to see if it helps.  I don't think I'm the right
person to do it, because my knowledge of the block layer code is
essentially nil.

>
>> As an even more radical idea, what if there was a way to submit truly
>> enormous numbers of lightweight requests, such that the queue will
>> give the requester some kind of callback when the request is nearly
>> ready for submission so the requester can finish filling in the
>> request?  This would allow things like dm-crypt to get the benefit of
>> sorting without needing to encrypt hundreds of MB of data in advance
>> of having that data actually be to the backing device.  It might also
>> allow writeback to submit multiple gigabytes of writes, in arbitrarily
>> large pieces, but not to need to pin pages or do whatever expensive
>> things are needed until the IO actually happens.
>
>
> the problem with a callback is that you then need to wait for that source to
> get the CPU and finish doing its work. What happens if that takes long
> enough for you to run out of data to write? And is it worth the extra
> context switches to bounce around when the writing process was finished with
> that block already?

dm-crypt is so context-switch heavy that I doubt the context switches
matter.  And you'd need to give the callback early enough that there's
a very good chance that the callback will finish in time.  There might
even need to be a way to let other non-callback-dependent IO pass by
the callback-dependent stuff, although in the particular case of
dm-crypt, dm-crypt is pretty much the only source of writes.  (Reads
don't have this problem for dm-crypt, I think.)

--Andy


Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()

2014-03-26 Thread David Lang

On Wed, 26 Mar 2014, Andy Lutomirski wrote:


I'm not sure I understand the request queue stuff, but here's an idea.
 The block core contains this little bit of code:


I haven't read enough of the code yet, to comment intelligently ;)


My little patch doesn't seem to help.  I'm either changing the wrong
piece of code entirely or I'm penalizing readers and writers too much.

Hopefully some real block layer people can comment as to whether a
refinement of this idea could work.  The behavior I want is for
writeback to be limited to using a smallish fraction of the total
request queue size -- I think that writeback should be able to enqueue
enough requests to get decent sorting performance but not enough
requests to prevent the io scheduler from doing a good job on
non-writeback I/O.


The thing is that if there are no reads that are waiting, why not use every bit 
of disk I/O available to write? If you can do that reliably with only using part 
of the queue, fine, but aren't you getting fairly close to just having separate 
queues for reading and writing with such a restriction?



As an even more radical idea, what if there was a way to submit truly
enormous numbers of lightweight requests, such that the queue will
give the requester some kind of callback when the request is nearly
ready for submission so the requester can finish filling in the
request?  This would allow things like dm-crypt to get the benefit of
sorting without needing to encrypt hundreds of MB of data in advance
of having that data actually be to the backing device.  It might also
allow writeback to submit multiple gigabytes of writes, in arbitrarily
large pieces, but not to need to pin pages or do whatever expensive
things are needed until the IO actually happens.


the problem with a callback is that you then need to wait for that source to get
the CPU and finish doing its work. What happens if that takes long enough for
you to run out of data to write? And is it worth the extra context switches to
bounce around when the writing process was finished with that block already?


David Lang


For reference, here's my patch that doesn't work well:

diff --git a/block/blk-core.c b/block/blk-core.c
index 4cd5ffc..c0dedc3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -941,11 +941,11 @@ static struct request *__get_request(struct request_list *
   }

   /*
-* Only allow batching queuers to allocate up to 50% over the defined
-* limit of requests, otherwise we could have thousands of requests
-* allocated with any setting of ->nr_requests
+* Only allow batching queuers to allocate up to 50% of the
+* defined limit of requests, so that non-batching queuers can
+* get into the queue and thus be scheduled properly.
*/
-   if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
+   if (rl->count[is_sync] >= (q->nr_requests + 3) / 4)
   return NULL;

   q->nr_rqs[is_sync]++;


Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()

2014-03-26 Thread Andy Lutomirski
On Wed, Mar 26, 2014 at 2:55 PM, Andres Freund  wrote:
> On 2014-03-26 14:41:31 -0700, Andy Lutomirski wrote:
>> On Wed, Mar 26, 2014 at 12:11 PM, Andres Freund  wrote:
>> > Hi,
>> >
>> > At LSF/MM there was a slot about postgres' problems with the kernel. Our
>> > top#1 concern is frequent slow read()s that happen while another process
>> > calls fsync(), even though we'd be perfectly fine if that fsync() took
>> > ages.
>> > The "conclusion" of that part was that it'd be very useful to have a
>> > demonstration of the problem without needing a full blown postgres
>> > setup. I've quickly hacked something together, that seems to show the
>> > problem nicely.
>> >
>> > For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
>> > and the "IO Scheduling" bit in
>> > http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de
>> >
>>
>> For your amusement: running this program in KVM on a 2GB disk image
>> failed, but it caused the *host* to go out to lunch for several
>> seconds while failing.  In fact, it seems to have caused the host to
>> fall over so badly that the guest decided that the disk controller was
>> timing out.  The host is btrfs, and I think that btrfs is *really* bad
>> at this kind of workload.
>
> Also, unless you changed the parameters, it's a) using a 48GB disk file,
> and b) writing really rather fast ;)
>
>> Even using ext4 is no good.  I think that dm-crypt is dying under the
>> load.  So I won't test your program for real :/
>
> Try to reduce data_size to RAM * 2 and NUM_RANDOM_READERS to something
> smaller. If it still doesn't work, consider increasing the two nsleep()s...
>
> I didn't have a good idea how to scale those to the current machine in a
> halfway automatic fashion.

OK, I think I'm getting reasonably bad behavior with these qemu options:

-smp 2 -cpu host -m 600 -drive file=/var/lutotmp/test.img,cache=none

and a 2GB test partition.

>
>> > Possible solutions:
>> > * Add a fadvise(UNDIRTY), that doesn't stall on a full IO queue like
>> >   sync_file_range() does.
>> > * Make IO triggered by writeback regard IO priorities and add it to
>> >   schedulers other than CFQ
>> > * Add a tunable that allows limiting the amount of dirty memory before
>> >   writeback on a per process basis.
>> > * ...?
>>
>> I thought the problem wasn't so much that priorities weren't respected
>> but that the fsync call fills up the queue, so everything starts
>> contending for the right to enqueue a new request.
>
> I think it's both actually. If I understand correctly there's not even a
> correct association to the originator anymore during a fsync triggered
> flush?
>
>> Since fsync blocks until all of its IO finishes anyway, what if it
>> could just limit itself to a much smaller number of outstanding
>> requests?
>
> Yea, that could already help. If you remove the fsync()s, the problem
> will periodically appear anyway, because writeback is triggered with a
> vengeance. That'd need to be fixed in a similar way.
>
>> I'm not sure I understand the request queue stuff, but here's an idea.
>>  The block core contains this little bit of code:
>
> I haven't read enough of the code yet, to comment intelligently ;)

My little patch doesn't seem to help.  I'm either changing the wrong
piece of code entirely or I'm penalizing readers and writers too much.

Hopefully some real block layer people can comment as to whether a
refinement of this idea could work.  The behavior I want is for
writeback to be limited to using a smallish fraction of the total
request queue size -- I think that writeback should be able to enqueue
enough requests to get decent sorting performance but not enough
requests to prevent the io scheduler from doing a good job on
non-writeback I/O.

As an even more radical idea, what if there was a way to submit truly
enormous numbers of lightweight requests, such that the queue will
give the requester some kind of callback when the request is nearly
ready for submission so the requester can finish filling in the
request?  This would allow things like dm-crypt to get the benefit of
sorting without needing to encrypt hundreds of MB of data in advance
of having that data actually be to the backing device.  It might also
allow writeback to submit multiple gigabytes of writes, in arbitrarily
large pieces, but not to need to pin pages or do whatever expensive
things are needed until the IO actually happens.
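
Purely as illustration, the interface being hand-waved at might look roughly
like this (a userspace mock-up; nothing of the sort exists in the kernel and
every name below is invented):

#include <stdint.h>

struct lw_request;

/* Called by the queue shortly before the request would be dispatched, so
 * the submitter (dm-crypt, writeback, ...) does the expensive part, such
 * as encrypting or pinning pages, only for IO that is about to happen. */
typedef int (*lw_prepare_fn)(struct lw_request *rq, void *owner);

struct lw_request {
	uint64_t	sector;		/* destination, known at submit time */
	uint32_t	nr_sectors;	/* size, known at submit time */
	lw_prepare_fn	prepare;	/* fills in the payload just in time */
	void		*owner;		/* submitter's private context */
};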

For reference, here's my patch that doesn't work well:

diff --git a/block/blk-core.c b/block/blk-core.c
index 4cd5ffc..c0dedc3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -941,11 +941,11 @@ static struct request *__get_request(struct request_list *
}

/*
-* Only allow batching queuers to allocate up to 50% over the defined
-* limit of requests, otherwise we could have thousands of requests
-* allocated with any setting of ->nr_requests
+* Only allow batching queuers to allocate up to 50% of the
+* defined limit of requests, so that non-batching queuers can
+* get into the queue and thus be scheduled properly.
 */
-   if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
+   if (rl->count[is_sync] >= (q->nr_requests + 3) / 4)
   return NULL;

   q->nr_rqs[is_sync]++;

Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()

2014-03-26 Thread Andres Freund
On 2014-03-26 14:41:31 -0700, Andy Lutomirski wrote:
> On Wed, Mar 26, 2014 at 12:11 PM, Andres Freund  wrote:
> > Hi,
> >
> > At LSF/MM there was a slot about postgres' problems with the kernel. Our
> > top#1 concern is frequent slow read()s that happen while another process
> > calls fsync(), even though we'd be perfectly fine if that fsync() took
> > ages.
> > The "conclusion" of that part was that it'd be very useful to have a
> > demonstration of the problem without needing a full blown postgres
> > setup. I've quickly hacked something together, that seems to show the
> > problem nicely.
> >
> > For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
> > and the "IO Scheduling" bit in
> > http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de
> >
> 
> For your amusement: running this program in KVM on a 2GB disk image
> failed, but it caused the *host* to go out to lunch for several
> seconds while failing.  In fact, it seems to have caused the host to
> fall over so badly that the guest decided that the disk controller was
> timing out.  The host is btrfs, and I think that btrfs is *really* bad
> at this kind of workload.

Also, unless you changed the parameters, it's a) using a 48GB disk file,
and b) writing really rather fast ;)

> Even using ext4 is no good.  I think that dm-crypt is dying under the
> load.  So I won't test your program for real :/

Try to reduce data_size to RAM * 2 and NUM_RANDOM_READERS to something
smaller. If it still doesn't work, consider increasing the two nsleep()s...

I didn't have a good idea how to scale those to the current machine in a
halfway automatic fashion.

> > Possible solutions:
> > * Add a fadvise(UNDIRTY), that doesn't stall on a full IO queue like
> >   sync_file_range() does.
> > * Make IO triggered by writeback regard IO priorities and add it to
> >   schedulers other than CFQ
> > * Add a tunable that allows limiting the amount of dirty memory before
> >   writeback on a per process basis.
> > * ...?
> 
> I thought the problem wasn't so much that priorities weren't respected
> but that the fsync call fills up the queue, so everything starts
> contending for the right to enqueue a new request.

I think it's both actually. If I understand correctly there's not even a
correct association to the originator anymore during a fsync triggered
flush?

> Since fsync blocks until all of its IO finishes anyway, what if it
> could just limit itself to a much smaller number of outstanding
> requests?

Yea, that could already help. If you remove the fsync()s, the problem
will periodically appear anyway, because writeback is triggered with a
vengeance. That'd need to be fixed in a similar way.

> I'm not sure I understand the request queue stuff, but here's an idea.
>  The block core contains this little bit of code:

I haven't read enough of the code yet, to comment intelligently ;)

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [Lsf] Postgresql performance problems with IO latency, especially during fsync()

2014-03-26 Thread Andy Lutomirski
On Wed, Mar 26, 2014 at 12:11 PM, Andres Freund  wrote:
> Hi,
>
> At LSF/MM there was a slot about postgres' problems with the kernel. Our
> top#1 concern is frequent slow read()s that happen while another process
> calls fsync(), even though we'd be perfectly fine if that fsync() took
> ages.
> The "conclusion" of that part was that it'd be very useful to have a
> demonstration of the problem without needing a full blown postgres
> setup. I've quickly hacked something together, that seems to show the
> problem nicely.
>
> For a bit of context: lwn.net/SubscriberLink/591723/940134eb57fcc0b8/
> and the "IO Scheduling" bit in
> http://archives.postgresql.org/message-id/20140310101537.GC10663%40suse.de
>

For your amusement: running this program in KVM on a 2GB disk image
failed, but it caused the *host* to go out to lunch for several
seconds while failing.  In fact, it seems to have caused the host to
fall over so badly that the guest decided that the disk controller was
timing out.  The host is btrfs, and I think that btrfs is *really* bad
at this kind of workload.

Even using ext4 is no good.  I think that dm-crypt is dying under the
load.  So I won't test your program for real :/


[...]

> Possible solutions:
> * Add a fadvise(UNDIRTY), that doesn't stall on a full IO queue like
>   sync_file_range() does.
> * Make IO triggered by writeback regard IO priorities and add it to
>   schedulers other than CFQ
> * Add a tunable that allows limiting the amount of dirty memory before
>   writeback on a per process basis.
> * ...?

I thought the problem wasn't so much that priorities weren't respected
but that the fsync call fills up the queue, so everything starts
contending for the right to enqueue a new request.

Since fsync blocks until all of its IO finishes anyway, what if it
could just limit itself to a much smaller number of outstanding
requests?

I'm not sure I understand the request queue stuff, but here's an idea.
 The block core contains this little bit of code:

/*
 * Only allow batching queuers to allocate up to 50% over the defined
 * limit of requests, otherwise we could have thousands of requests
 * allocated with any setting of ->nr_requests
 */
if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
return NULL;

What if this changed to:

/*
 * Only allow batching queuers to allocate up to 50% of the defined
 * limit of requests, so that non-batching queuers can get into the queue
 * and thus be scheduled properly.
 */
if (rl->count[is_sync] >= (q->nr_requests + 3) / 4)
return NULL;

I suspect that doing this right would take a bit more care than that,
but I wonder if this approach is any good.
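
For a rough sense of scale, assuming the default nr_requests of 128: the
existing test lets a batching queuer allocate up to 3 * 128 / 2 = 192
requests per sync class, whereas the proposed one caps it at
(128 + 3) / 4 = 32.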

--Andy