Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-25 Thread Mel Gorman
On Wed, Apr 24, 2013 at 03:09:13PM -0400, Jeff Moyer wrote:
> Mel Gorman  writes:
> 
> >> I'll also note that even though your I/O is going all over the place
> >> (D2C is pretty bad, 14ms), most of the time is spent waiting for a
> >> struct request allocation or between Queue and Merge:
> >> 
> >>  All Devices 
> >> 
> >> ALL   MIN   AVG   MAX   N
> >> --- - - - ---
> >> 
> >> Q2Q   0.1   0.000992259   8.898375882 2300861
>> Q2G   0.00843  10.193261239 2064.079501935 1016463 <
> >
> > This is not normally my sandbox so do you mind spelling this out?
> >
> > IIUC, the time to allocate the struct request from the slab cache is just a
> > small portion of this time. The bulk of the time is spent in get_request()
> > waiting for congestion to clear on the request list for either the sync or
> > async queue. Once a process goes to sleep on that waitqueue, it has to wait
> > until enough requests on that queue have been serviced before it gets woken
> > again at which point it gets priority access to prevent further starvation.
> > This is the Queue To Get Request (Q2G) delay. What we may be seeing here
> > is that the async queue was congested and on average, we are waiting for
> > 10 seconds for it to clear. The maximum value may be bogus for reasons
> > explained later.
> >
> > Is that accurate?
> 
> Yes, without getting into excruciating detail.


Good enough, thanks.

> >> I'm not sure I believe that max value.  2064 seconds seems a bit high.
> >
> > It is, so I looked closer at the timestamps and there is a one-hour
> > correction about 4400 seconds into the test.  Daylight saving time kicked
> > in on March 31st and the machine is rarely rebooted until this test case
> > came along. It looks like there is a timezone or time misconfiguration
> > on the laptop that starts the machine with the wrong time. NTP must have
> > corrected the time which skewed the readings in that window severely :(
> 
> Not sure I'm buying that argument, as there are no gaps in the blkparse
> output.  The logging is not done using wallclock time.  I still haven't
> had sufficient time to dig into these numbers.
> 

Ok.

> > It's awkward but it's not like there are standard benchmarks lying around
> > and it seemed the best way to reproduce the problems I typically see early
> > in the lifetime of a system or when running a git checkout when the tree
> > has not been used in a few hours. Run the actual test with
> >
> > ./run-mmtests.sh --config 
> > configs/config-global-dhp__io-multiple-source-latency --run-monitor 
> > test-name-of-your-choice
> >
> > Results will be in work/log. You'll need to run this as root so it
> > can run blktrace and so it can drop_caches between git checkouts
> > (to force disk IO). If systemtap craps out on you, then edit
> > configs/config-global-dhp__io-multiple-source-latency and remove dstate
> > from MONITORS_GZIP
> 
> And how do I determine whether I've hit the problem?
> 

If systemtap is available then

cat work/log/dstate-TESTNAME-gitcheckout | subreport/stap-dstate-frequency

will give you a report on the worst stalls and the stack traces when those
stalls occurred. If the stalls are 10+ seconds then you've certainly hit
the problem.

Alternatively

cd work/log
../../compare-kernels.sh

Look at the average time it takes to run the git checkout. Is it
abnormally high in comparison to a run with no parallel IO? If you do
not know what the normal time is, run with

./run-mmtests.sh --config configs/config-global-dhp__io-multiple-source-latency 
--no-monitor test-name-no-monitor

The monitors are what open the maildir and generate the parallel
IO. If there is an excessive difference between the git checkout times,
then you've hit the problem.
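
If it helps to script that comparison, something like the following does
two runs back to back; the test names are arbitrary and everything else
mirrors the commands above:

# Run the workload once with the monitors generating the parallel IO
# and once without, then compare the git checkout times side by side.
./run-mmtests.sh --config configs/config-global-dhp__io-multiple-source-latency \
	--run-monitor with-monitor
./run-mmtests.sh --config configs/config-global-dhp__io-multiple-source-latency \
	--no-monitor no-monitor
cd work/log
../../compare-kernels.sh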

Furthermore, look at the await times. If they do not appear in the
compare-kernels.sh output then

../../bin/compare-mmtests.pl -d . -b gitcheckout -n test-name-of-your-choice 
--print-monitor iostat

and look at the await times. Are they very high? If so, you've hit the
problem. If you want a better look at the await figures over time,
either look at the iostat file directly or graph it with

../../bin/graph-mmtests.sh -d . -b gitcheckout -n test-name-of-your-choice 
--print-monitor iostat --sub-heading sda-await

where sda-await should be substituted with whatever disk you are running
the test on.
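
If you just want the raw numbers rather than a graph, a rough sketch like
this also works; it assumes the iostat log follows the same TESTNAME
naming pattern as the dstate file above and that await is the tenth
column of your iostat -x output, which varies between sysstat versions:

# Pull the per-sample await figures for sda out of the raw iostat log.
# Adjust the device name and the field number to match your system.
grep '^sda ' work/log/iostat-TESTNAME-gitcheckout | awk '{print $10}'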


> > If you have trouble getting this running, ping me on IRC.
> 
> Yes, I'm having issues getting things to go, but you didn't provide me a
> time zone, an irc server or a nick to help me find you.  Was that
> intentional?  ;-)
> 

Not consciously :) . I'm in the IST timezone (GMT+1), on the OFTC IRC network
in the #mm channel. If it's my evening I'm not always responsive, so send me
the error output by email and I'll try fixing any problems that way.

-- 
Mel Gorman
SUSE Labs

Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-24 Thread Jeff Moyer
Mel Gorman  writes:

>> I'll also note that even though your I/O is going all over the place
>> (D2C is pretty bad, 14ms), most of the time is spent waiting for a
>> struct request allocation or between Queue and Merge:
>> 
>>  All Devices 
>> 
>> ALL   MIN   AVG   MAX   N
>> --- - - - ---
>> 
>> Q2Q   0.1   0.000992259   8.898375882 2300861
>> Q2G   0.00843  10.193261239 2064.079501935 1016463 <
>
> This is not normally my sandbox so do you mind spelling this out?
>
> IIUC, the time to allocate the struct request from the slab cache is just a
> small portion of this time. The bulk of the time is spent in get_request()
> waiting for congestion to clear on the request list for either the sync or
> async queue. Once a process goes to sleep on that waitqueue, it has to wait
> until enough requests on that queue have been serviced before it gets woken
> again at which point it gets priority access to prevent further starvation.
> This is the Queue To Get Request (Q2G) delay. What we may be seeing here
> is that the async queue was congested and on average, we are waiting for
> 10 seconds for it to clear. The maximum value may be bogus for reasons
> explained later.
>
> Is that accurate?

Yes, without getting into excruciating detail.

>> G2I   0.00461   0.44702   3.237065090 1015803
>> Q2M   0.00101   8.203147238 2064.079367557 1311662
>> I2D   0.02012   1.476824812 2064.089774419 1014890
>> M2D   0.03283   6.994306138 283.573348664 1284872
>> D2C   0.61889   0.014438316   0.857811758 2291996
>> Q2C   0.72284  13.363007244 2064.092228625 2292191
>> 
>>  Device Overhead 
>> 
>>DEV |   Q2G   G2I   Q2M   I2D   D2C
>> -- | - - - - -
>>  (  8,  0) |  33.8259%   0.0001%  35.1275%   4.8932%   0.1080%
>> -- | - - - - -
>>Overall |  33.8259%   0.0001%  35.1275%   4.8932%   0.1080%
>> 
>> I'm not sure I believe that max value.  2064 seconds seems a bit high.
>
> It is, so I looked closer at the timestamps and there is a one-hour
> correction about 4400 seconds into the test.  Daylight saving time kicked
> in on March 31st and the machine is rarely rebooted until this test case
> came along. It looks like there is a timezone or time misconfiguration
> on the laptop that starts the machine with the wrong time. NTP must have
> corrected the time which skewed the readings in that window severely :(

Not sure I'm buying that argument, as there are no gaps in the blkparse
output.  The logging is not done using wallclock time.  I still haven't
had sufficient time to dig into these numbers.

>> Also, Q2M should not be anywhere near that big, so more investigation is
>> required there.  A quick look over the data doesn't show any such delays
>> (making me question the tools), but I'll write some code tomorrow to
>> verify the btt output.
>> 
>
> It might be a single set of readings during a time correction that
> screwed it.

Again, I don't think so.

> I can reproduce it at will. Due to the nature of the test, the test
> results are variable and unfortunately it is one of the trickier mmtests
> configurations to set up.
>
> 1. Get access to a webserver
> 2. Clone mmtests to your test machine
>git clone https://github.com/gormanm/mmtests.git
> 3. Edit shellpacks/common-config.sh and set WEBROOT to a webserver path
> 4. Create a tar.gz of a large git tree and place it at 
> $WEBROOT/linux-2.6.tar.gz
>Alternatively place a compressed git tree anywhere and edit
>configs/config-global-dhp__io-multiple-source-latency
>and update GITCHECKOUT_SOURCETAR
> 5. Create a tar.gz of a large maildir directory and place it at
>    $WEBROOT/maildir.tar.gz
>Alternatively, use an existing maildir folder and set
>MONITOR_INBOX_OPEN_MAILDIR in
>configs/config-global-dhp__io-multiple-source-latency
>
> It's awkward but it's not like there are standard benchmarks lying around
> and it seemed the best way to reproduce the problems I typically see early
> in the lifetime of a system or when running a git checkout when the tree
> has not been used in a few hours. Run the actual test with
>
> ./run-mmtests.sh --config 
> configs/config-global-dhp__io-multiple-source-latency --run-monitor 
> test-name-of-your-choice
>
> Results will be in work/log. You'll need to run this as root so it
> can run blktrace and so it can drop_caches between git checkouts
> (to force disk IO). If systemtap craps out on you, then edit
> configs/config-global-dhp__io-multiple-source-latency and remove dstate
> from MONITORS_GZIP

And how do I determine whether I've hit the problem?

Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-23 Thread Mel Gorman
On Tue, Apr 23, 2013 at 11:50:19AM -0400, Theodore Ts'o wrote:
> On Tue, Apr 23, 2013 at 04:33:05PM +0100, Mel Gorman wrote:
> > That's a pretty big drop but it gets bad again for the second worst stall --
> > wait_on_page_bit as a result of generic_file_buffered_write.
> > 
> > Vanilla kernel  1336064 ms stalled with 109 events
> > Patched kernel  2338781 ms stalled with 164 events
> 
> Do you have the stack trace for this stall?  I'm wondering if this is
> caused by waiting for stable pages in write_begin(), or something
> else.
> 

[] wait_on_page_bit+0x78/0x80
[] kretprobe_trampoline+0x0/0x4c
[] generic_file_buffered_write+0x114/0x2a0
[] __generic_file_aio_write+0x1bd/0x3c0
[] generic_file_aio_write+0x7a/0xf0
[] ext4_file_write+0x99/0x420
[] do_sync_write+0xa7/0xe0
[] vfs_write+0xa7/0x180
[] sys_write+0x4d/0x90
[] system_call_fastpath+0x1a/0x1f
[] 0x

The processes that stalled in this particular trace are wget, latency-output,
tar and tclsh. Most of these are sequential writers except for tar, which
is both a sequential reader and a sequential writer.

> If it is blocking caused by stable page writeback that's interesting,
> since it would imply that something in your workload is trying to
> write to a page that has already been modified (i.e., appending to a
> log file, or updating a database file).  Does that make sense given
> what your workload might be running?
> 

I doubt it is stable writes, considering the type of processes that are running. I
would expect the bulk of the activity to be sequential readers or writers
of multiple files. The summarised report from the raw data is now at

http://www.csn.ul.ie/~mel/postings/ext4tag-20130423/dstate-summary-vanilla
http://www.csn.ul.ie/~mel/postings/ext4tag-20130423/dstate-summary-ext4tag

It's an aside but the worst of the stalls are incurred by systemd-tmpfiles,
which was not a deliberate part of the test and is yet another thing that
I would not have caught unless I was running tests on my laptop. Looking
closer at that thing, the default configuration is to run the service 15
minutes after boot and once a day after that. It looks like the
bulk of the scanning would be in /var/tmp/ looking at systemd's own files
(over 3000 of them), which I'm a little amused by.

My normal test machines would not hit this because they are not systemd
based but the existence of this thing is worth noting. Any IO-based tests
run on systemd-based distributions may give different results depending
on whether this service triggered during the test or not.
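
On a systemd-based machine it is easy to check whether that timer fired
inside the test window; a quick sketch using the standard systemd tools:

# When did the tmpfiles cleanup last run and when is it due next?
systemctl list-timers systemd-tmpfiles-clean.timer
# Anything it logged in the last day
journalctl -u systemd-tmpfiles-clean.service --since yesterday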

-- 
Mel Gorman
SUSE Labs


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-23 Thread Theodore Ts'o
On Tue, Apr 23, 2013 at 04:33:05PM +0100, Mel Gorman wrote:
> That's a pretty big drop but it gets bad again for the second worst stall --
> wait_on_page_bit as a result of generic_file_buffered_write.
> 
> Vanilla kernel  1336064 ms stalled with 109 events
> Patched kernel  2338781 ms stalled with 164 events

Do you have the stack trace for this stall?  I'm wondering if this is
caused by waiting for stable pages in write_begin(), or something
else.

If it is blocking caused by stable page writeback that's interesting,
since it would imply that something in your workload is trying to
write to a page that has already been modified (i.e., appending to a
log file, or updating a database file).  Does that make sense given
what your workload might be running?

- Ted


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-23 Thread Mel Gorman
On Sat, Apr 20, 2013 at 08:05:22PM -0400, Theodore Ts'o wrote:
> An alternate solution which I've been playing around adds buffer_head
> flags so we can indicate that a buffer contains metadata and/or should
> have I/O submitted with the REQ_PRIO flag set.
> 

I beefed up the reporting slightly and tested the patches comparing
3.9-rc6 vanilla with your patches. The full report with graphs are at

http://www.csn.ul.ie/~mel/postings/ext4tag-20130423/report.html

                          3.9.0-rc6        3.9.0-rc6
                            vanilla          ext4tag
User    min        0.00 (  0.00%)    0.00 (  0.00%)
User    mean        nan (   nan%)     nan (   nan%)
User    stddev      nan (   nan%)     nan (   nan%)
User    max        0.00 (  0.00%)    0.00 (  0.00%)
User    range      0.00 (  0.00%)    0.00 (  0.00%)
System  min        9.14 (  0.00%)    9.13 (  0.11%)
System  mean       9.60 (  0.00%)    9.73 ( -1.33%)
System  stddev     0.39 (  0.00%)    0.94 (-142.69%)
System  max       10.31 (  0.00%)   11.58 (-12.32%)
System  range      1.17 (  0.00%)    2.45 (-109.40%)
Elapsed min      665.54 (  0.00%)  612.25 (  8.01%)
Elapsed mean     775.35 (  0.00%)  688.01 ( 11.26%)
Elapsed stddev    69.11 (  0.00%)   58.22 ( 15.75%)
Elapsed max      858.40 (  0.00%)  773.06 (  9.94%)
Elapsed range    192.86 (  0.00%)  160.81 ( 16.62%)
CPU     min        3.00 (  0.00%)    3.00 (  0.00%)
CPU     mean       3.60 (  0.00%)    4.20 (-16.67%)
CPU     stddev     0.49 (  0.00%)    0.75 (-52.75%)
CPU     max        4.00 (  0.00%)    5.00 (-25.00%)
CPU     range      1.00 (  0.00%)    2.00 (-100.00%)

The patches appear to improve the git checkout times slightly but this
test is quite variable.

The vmstat figures report some reclaim activity but if you look at the graphs
further down you will see that the bulk of the kswapd reclaim scan and
steal activity is at the start of the test when it's downloading and
untarring a git tree to work with. (I also note that the mouse-over
graph for direct reclaim efficiency is broken but it's not important
right now).

From iostat

                       3.9.0-rc6   3.9.0-rc6
                         vanilla     ext4tag
Mean dm-0-avgqz             1.18        1.19
Mean dm-0-await            17.30       16.50
Mean dm-0-r_await          17.30       16.50
Mean dm-0-w_await           0.94        0.48
Mean sda-avgqz            650.29      719.81
Mean sda-await           2501.33     2597.23
Mean sda-r_await           30.01       24.91
Mean sda-w_await        11228.80    11120.64
Max  dm-0-avgqz            12.30       10.14
Max  dm-0-await            42.65       52.23
Max  dm-0-r_await          42.65       52.23
Max  dm-0-w_await         541.00      263.83
Max  sda-avgqz           3811.93     3375.11
Max  sda-await           7178.61     7170.44
Max  sda-r_await          384.37      297.85
Max  sda-w_await        51353.93    50338.25

There are no really obvious massive advantages to me there and if you look
at the graphs for the avgqz, await etc. over time, the patched kernel is
not obviously better. The Wait CPU usage looks roughly the same too.

On the more positive side, the dstate systemtap monitor script tells me
that all processes were stalled for less time -- 9575 seconds versus
10910. The most severe event to stall on is sleep_on_buffer() as a
result of ext4_bread.

Vanilla kernel  3325677 ms stalled with 57 events
Patched kernel  2411471 ms stalled with 42 events

That's a pretty big drop but it gets bad again for the second worst stall --
wait_on_page_bit as a result of generic_file_buffered_write.

Vanilla kernel  1336064 ms stalled with 109 events
Patched kernel  2338781 ms stalled with 164 events

So conceptually the patches make sense but the first set of tests do
not indicate that they'll fix the problem and the stall times do not
indicate that interactivity will be any better. I'll still apply them
and boot them on my main work machine and see how they "feel" this
evening.

-- 
Mel Gorman
SUSE Labs


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-23 Thread Mel Gorman
On Mon, Apr 22, 2013 at 06:42:23PM -0400, Jeff Moyer wrote:
> > 3. The blktrace indicates that reads can starve writes from flusher
> >
> >While there are people that can look at a blktrace and find problems
> >like they are rain man, I'm more like an ADHD squirrel when looking at
> >a trace.  I wrote a script to look for what unrelated requests completed
> >while an request got stalled for over a second. It seemed like something
> >that a tool shoudl already exist for but I didn't find one unless btt
> >can give the information somehow.
> 
> Care to share that script?
> 

I would have preferred not to because it is an ugly hatchet job churned
out in a few minutes. It's written in perl and uses the text output from
blkparse, making it slow. It uses an excessive amount of memory because I was
taking shortcuts, so it is resource heavy. It ignores most of the information
from blkparse and so there are gaps in what it reports. Even though it's
dogshit, it was useful in this particular case so I added it to mmtests
anyway. Be aware that it takes ages to run and you might want to break
the blkparse output into pieces.

It's used something like

blkparse -i blktrace-log > blkparse-log
cat blkparse-log | $PATH_TO_MMTESTS/subreport/blktrace-queue-watch.pl
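
If the blkparse output is too large to process in one go, splitting it
first keeps the memory usage of the script down, at the cost of missing
any request whose queueing and completion land in different chunks. For
example (the chunk size is arbitrary):

# Break the text output into chunks and feed them to the script one at a time.
split -l 2000000 blkparse-log blkparse-chunk.
for chunk in blkparse-chunk.*; do
	cat $chunk | $PATH_TO_MMTESTS/subreport/blktrace-queue-watch.pl > watch-$chunk.log
done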

> [snip]
> 
> > I recognise that the output will have a WTF reaction but the key
> > observations to me are
> >
> > a) a single write request from flusher took over a second to complete
> > b) at the time it was queued, it was mostly other writes that were in
> >the queue at the same time
> > c) The write request and the parallel writes were all asynchronous write
> >requests
> > D) at the time the request completed, there were a LARGE number of
> >other requested queued and completed at the same time.
> >
> > Of the requests queued and completed in the meantime the breakdown was
> >
> >  22 RM
> >  31 RA
> >  82 W
> > 445 R
> >
> > If I'm reading this correctly, it is saying that 22 reads were merged (RM),
> > 31 reads were remapped to another device (RA) which is probably reads from
> > the dm-crypt partition, 82 were writes (W) which is not far off the number
> > of writes that were in the queue and 445 were other reads. The delay was
> > dominated by reads that were queued after the write request and completed
> > before it.
> 
> RM == Read Meta
> RA == Read Ahead  (remapping, by the way, does not happen across
>devices, just into partitions)
> W and R you understood correctly.
> 

Thanks for those corrections. I misread the meaning of the action
identifiers section of the blkparse manual. I should have double checked
the source.

> >> 
> >> The thing is, we do want to make ext4 work well with cfq, and
> >> prioritizing non-readahead read requests ahead of data writeback does
> >> make sense.  The issue is that metadata writes going through
> >> the block device could in some cases effectively cause a priority
> >> inversion when what had previously been an asynchronous writeback
> >> starts blocking a foreground, user-visible process.
> >> 
> >> At least, that's the theory;
> >
> > I *think* the data more or less confirms the theory but it'd be nice if
> > someone else double checked in case I'm seeing what I want to see
> > instead of what is actually there.
> 
> Looks sane.  You can also see a lot of "preempt"s in the blkparse
> output, which indicates exactly what you're saying.  Any sync request
> gets priority over the async requests.
> 

Good to know.

> I'll also note that even though your I/O is going all over the place
> (D2C is pretty bad, 14ms), most of the time is spent waiting for a
> struct request allocation or between Queue and Merge:
> 
>  All Devices 
> 
> ALL   MIN   AVG   MAX   N
> --- - - - ---
> 
> Q2Q   0.1   0.000992259   8.898375882 2300861
> Q2G   0.00843  10.193261239 2064.079501935 1016463 <

This is not normally my sandbox so do you mind spelling this out?

IIUC, the time to allocate the struct request from the slab cache is just a
small portion of this time. The bulk of the time is spent in get_request()
waiting for congestion to clear on the request list for either the sync or
async queue. Once a process goes to sleep on that waitqueue, it has to wait
until enough requests on that queue have been serviced before it gets woken
again at which point it gets priority access to prevent further starvation.
This is the Queue To Get Request (Q2G) delay. What we may be seeing here
is that the async queue was congested and on average, we are waiting for
10 seconds for it to clear. The maximum value may be bogus for reasons
explained later.

Is that accurate?
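
As an aside, the request pool that get_request() sleeps on is sized by
nr_requests and is visible in sysfs, so it is easy to inspect or enlarge
while reproducing this (sda is just an example device):

# Depth of the request pool; 128 is the usual default.
cat /sys/block/sda/queue/nr_requests
# Raising it (as root) delays the point at which queuers start to sleep.
echo 512 > /sys/block/sda/queue/nr_requests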

> G2I   0.00461   0.44702   3.237065090 1015803
> Q2M   0.00101   8.203147238 

Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-23 Thread Jan Kara
On Mon 22-04-13 18:42:23, Jeff Moyer wrote:
> Jan, if I were to come up with a way of promoting a particular async
> queue to the front of the line, where would I put such a call in the
> ext4/jbd2 code to be effective?
  As Ted wrote, the simplest might be to put this directly in
__lock_buffer(). Something like:

diff --git a/fs/buffer.c b/fs/buffer.c
index b4dcb34..e026a3e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -69,6 +69,12 @@ static int sleep_on_buffer(void *word)
 
 void __lock_buffer(struct buffer_head *bh)
 {
+   /*
+* Likely under async writeback? Tell io scheduler we are
+* now waiting for the IO...
+*/
+   if (PageWriteback(bh->b_page))
+   io_now_sync(bh->b_bdev, bh->b_blocknr);
	wait_on_bit_lock(&bh->b_state, BH_Lock, sleep_on_buffer,
							TASK_UNINTERRUPTIBLE);
}

Honza
-- 
Jan Kara 
SUSE Labs, CR



Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-22 Thread Theodore Ts'o
On Mon, Apr 22, 2013 at 06:42:23PM -0400, Jeff Moyer wrote:
> 
> Jan, if I were to come up with a way of promoting a particular async
> queue to the front of the line, where would I put such a call in the
> ext4/jbd2 code to be effective?

Well, I thought we had discussed trying to bump a pending I/O
automatically when there was an attempt to call lock_buffer() on the
bh?  That would be ideal, because we could keep the async writeback
low priority until someone is trying to wait upon it, at which point
obviously it should no longer be considered an async write call.

Failing that, this is something I've been toying with; what do you
think?

http://patchwork.ozlabs.org/patch/238192/
http://patchwork.ozlabs.org/patch/238257/

(The first patch in the series just makes sure that allocation bitmap
reads are marked with the META/PRIO flags.  It's not strictly speaking
related to the problem discussed here, but for completeness:
http://patchwork.ozlabs.org/patch/238193/)

- Ted


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-22 Thread Jeff Moyer
Mel Gorman  writes:

> (Adding Jeff Moyer to the cc as I'm told he is interested in the blktrace)

Thanks.  I've got a few comments and corrections for you below.

> TLDR: Flusher writes pages very quickly after processes dirty a buffer. Reads
> starve flusher writes.
[snip]

> 3. The blktrace indicates that reads can starve writes from flusher
>
>While there are people that can look at a blktrace and find problems
>like they are rain man, I'm more like an ADHD squirrel when looking at
>a trace.  I wrote a script to look for what unrelated requests completed
>while an request got stalled for over a second. It seemed like something
>that a tool shoudl already exist for but I didn't find one unless btt
>can give the information somehow.

Care to share that script?

[snip]

> I recognise that the output will have a WTF reaction but the key
> observations to me are
>
> a) a single write request from flusher took over a second to complete
> b) at the time it was queued, it was mostly other writes that were in
>the queue at the same time
> c) The write request and the parallel writes were all asynchronous write
>requests
> D) at the time the request completed, there were a LARGE number of
>other requested queued and completed at the same time.
>
> Of the requests queued and completed in the meantime the breakdown was
>
>  22 RM
>  31 RA
>  82 W
> 445 R
>
> If I'm reading this correctly, it is saying that 22 reads were merged (RM),
> 31 reads were remapped to another device (RA) which is probably reads from
> the dm-crypt partition, 82 were writes (W) which is not far off the number
> of writes that were in the queue and 445 were other reads. The delay was
> dominated by reads that were queued after the write request and completed
> before it.

RM == Read Meta
RA == Read Ahead  (remapping, by the way, does not happen across
   devices, just into partitions)
W and R you understood correctly.

> That's saying that the 27128th request in the trace took over 7 seconds
> to complete and was an asynchronous write from flusher. The contents of
> the queue are displayed at that time and the breakdown of requests is
>
>  23 WS  [JEM: write sync]
>  86 RM  [JEM: Read Meta]
> 124 RA  [JEM: Read Ahead]
> 442 W
>1931 R
>
> 7 seconds later when it was completed the breakdown of completed
> requests was
>
>  25 WS
> 114 RM
> 155 RA
> 408 W
>2457 R
>
> In combination, that confirms for me that asynchronous writes from flush
> are being starved by reads. When a process requires a buffer that is locked
> by that asynchronous write from flusher, it stalls.
>
>> The thing is, we do want to make ext4 work well with cfq, and
>> prioritizing non-readahead read requests ahead of data writeback does
>> make sense.  The issue is that metadata writes going through
>> the block device could in some cases effectively cause a priority
>> inversion when what had previously been an asynchronous writeback
>> starts blocking a foreground, user-visible process.
>> 
>> At least, that's the theory;
>
> I *think* the data more or less confirms the theory but it'd be nice if
> someone else double checked in case I'm seeing what I want to see
> instead of what is actually there.

Looks sane.  You can also see a lot of "preempt"s in the blkparse
output, which indicates exactly what you're saying.  Any sync request
gets priority over the async requests.

I'll also note that even though your I/O is going all over the place
(D2C is pretty bad, 14ms), most of the time is spent waiting for a
struct request allocation or between Queue and Merge:

 All Devices 

ALL   MIN   AVG   MAX   N
--- - - - ---

Q2Q   0.1   0.000992259   8.898375882 2300861
Q2G   0.00843  10.193261239 2064.079501935 1016463 <
G2I   0.00461   0.44702   3.237065090 1015803
Q2M   0.00101   8.203147238 2064.079367557 1311662
I2D   0.02012   1.476824812 2064.089774419 1014890
M2D   0.03283   6.994306138 283.573348664 1284872
D2C   0.61889   0.014438316   0.857811758 2291996
Q2C   0.72284  13.363007244 2064.092228625 2292191

 Device Overhead 

   DEV |   Q2G   G2I   Q2M   I2D   D2C
-- | - - - - -
 (  8,  0) |  33.8259%   0.0001%  35.1275%   4.8932%   0.1080%
-- | - - - - -
   Overall |  33.8259%   0.0001%  35.1275%   4.8932%   0.1080%

I'm not sure I believe that max value.  2064 seconds seems a bit high.
Also, Q2M should not be anywhere near that big, so more investigation is
required there.  A quick look over the data doesn't show any such delays
(making me question the tools), but I'll write some code tomorrow to
verify the btt output.
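
For reference, the summary above comes from btt and can be regenerated
from the raw trace with something like the following; the file names are
placeholders:

# Merge the per-CPU blktrace files into the binary stream btt consumes,
# then produce the latency breakdown.
blkparse -i blktrace-log -d blktrace-log.bin > /dev/null
btt -i blktrace-log.bin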

Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-22 Thread Mel Gorman
(Adding Jeff Moyer to the cc as I'm told he is interested in the blktrace)

On Fri, Apr 12, 2013 at 11:19:52AM -0400, Theodore Ts'o wrote:
> On Fri, Apr 12, 2013 at 02:50:42PM +1000, Dave Chinner wrote:
> > > If that is the case, one possible solution that comes to mind would be
> > > to mark buffer_heads that contain metadata with a flag, so that the
> > > flusher thread can write them back at the same priority as reads.
> > 
> > Ext4 is already using REQ_META for this purpose.
> 
> We're using REQ_META | REQ_PRIO for reads, not writes.
> 
> > I'm surprised that no-one has suggested "change the IO elevator"
> > yet.
> 
> Well, testing to see if the stalls go away with the noop scheduler is a
> good thing to try just to validate the theory.
> 

I still haven't tested with a different elevator. While this bug is
relatively high priority for me, there are still other issues in the way.

TLDR: Flusher writes pages very quickly after processes dirty a buffer. Reads
starve flusher writes.

Now the ugliness and being a windbag.

I collected blktrace and some other logs and they are available at
http://www.csn.ul.ie/~mel/postings/stalls-20130419/log.tar.gz  and there
is a lot of stuff in there.  The unix time the test started is in the
first line of the file tests-timestamp-bisect. This can help figure out
how far into the test some of the other timestamped logs are.

The kernel log with information from the lock_buffer debugging patch is
in dmesg-bisect-gitcheckout. The information in it is race-prone and
cannot be 100% trusted but it's still useful.

iostat is in iostat-bisect-gitcheckout 

Here are a few observations I got from the data.

1. flushers appear to acquire buffer locks *very* quickly after an
   application writes. Look for lines that look like

   "foo failed trylock without holder released 0 ms ago acquired 0 ms ago by 
bar"

   There are a lot of entries like this

jbd2 239 flush-8:0 failed trylock without holder released 0 ms ago 
acquired 0 ms ago by 2124 tar
jbd2 239 flush-8:0 failed trylock without holder released 0 ms ago 
acquired 0 ms ago by 2124 tar
jbd2 2124 tar failed trylock without holder released 0 ms ago acquired 
0 ms ago by 239 flush-8:0
jbd2 239 flush-8:0 failed trylock without holder released 0 ms ago 
acquired 0 ms ago by 2124 tar
jbd2 239 flush-8:0 failed trylock without holder released 0 ms ago 
acquired 0 ms ago by 2124 tar
jbd2 239 flush-8:0 failed trylock without holder released 0 ms ago 
acquired 0 ms ago by 2124 tar
jbd2 239 flush-8:0 failed trylock without holder released 0 ms ago 
acquired 0 ms ago by 2124 tar

   I expected flushers to be writing back the buffers just released in about
   5 seconds' time, not immediately.  It may indicate that when the flusher
   wakes to clean expired inodes, it keeps cleaning inodes as they are
   being dirtied.
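
A quick way to see which tasks hit that contended trylock most often is to
count the log lines; the field positions below assume the format of the
sample lines quoted above:

# Tally the tasks that most frequently failed the buffer trylock.
grep 'failed trylock' dmesg-bisect-gitcheckout | \
	awk '{print $2, $3}' | sort | uniq -c | sort -rn | head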

2. The flush thread can prevent a process making forward progress for
   a long time. Take this as an example

 jbd2 stalled dev 8,8 for 8168 ms lock holdtime 20692 ms
 Last Owner 239 flush-8:0 Acquired Stack
  [] save_stack_trace+0x2a/0x50
  [] set_lock_buffer_owner+0x86/0x90
  [] __block_write_full_page+0x16e/0x360
  [] block_write_full_page_endio+0xd3/0x110
  [] block_write_full_page+0x10/0x20
  [] blkdev_writepage+0x13/0x20
  [] __writepage+0x12/0x40
  [] write_cache_pages+0x206/0x460
  [] generic_writepages+0x45/0x70
  [] do_writepages+0x1b/0x30
  [] __writeback_single_inode+0x40/0x1b0
  [] writeback_sb_inodes+0x19a/0x350
  [] __writeback_inodes_wb+0x96/0xc0
  [] wb_writeback+0x27b/0x330
  [] wb_do_writeback+0x190/0x1d0
  [] bdi_writeback_thread+0x83/0x280
  [] kthread+0xbb/0xc0
  [] ret_from_fork+0x7c/0xb0
  [] 0x

   This part is saying that we locked the buffer due to blkdev_writepage,
   which I assume must be a metadata update. Based on where we lock the
   buffer, the only reason we would leave the buffer locked is if this
   was an asynchronous write request, leaving the buffer to be unlocked by
   end_buffer_async_write at some time in the future.

 Last Owner Activity Stack: 239 flush-8:0
  [] __blkdev_issue_zeroout+0x191/0x1a0
  [] blkdev_issue_zeroout+0xe1/0xf0
  [] ext4_ext_zeroout.isra.30+0x49/0x60
  [] ext4_ext_convert_to_initialized+0x227/0x5f0
  [] ext4_ext_handle_uninitialized_extents+0x2f3/0x3a0
  [] ext4_ext_map_blocks+0x5d7/0xa00
  [] ext4_map_blocks+0x2d5/0x470
  [] mpage_da_map_and_submit+0xba/0x2f0
  [] ext4_da_writepages+0x380/0x620
  [] do_writepages+0x1b/0x30
  [] __writeback_single_inode+0x40/0x1b0
  [] writeback_sb_inodes+0x19a/0x350
  [] __writeback_inodes_wb+0x96/0xc0
  [] wb_writeback+0x27b/0x330

   This part is indicating that 


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-22 Thread Jeff Moyer
Mel Gorman mgor...@suse.de writes:

 (Adding Jeff Moyer to the cc as I'm told he is interested in the blktrace)

Thanks.  I've got a few comments and corrections for you below.

 TLDR: Flusher writes pages very quickly after processes dirty a buffer. Reads
 starve flusher writes.
[snip]

 3. The blktrace indicates that reads can starve writes from flusher

While there are people that can look at a blktrace and find problems
like they are rain man, I'm more like an ADHD squirrel when looking at
a trace.  I wrote a script to look for what unrelated requests completed
while a request got stalled for over a second. It seemed like something
that a tool should already exist for but I didn't find one unless btt
can give the information somehow.

Care to share that script?

[snip]

 I recognise that the output will have a WTF reaction but the key
 observations to me are

 a) a single write request from flusher took over a second to complete
 b) at the time it was queued, it was mostly other writes that were in
the queue at the same time
 c) The write request and the parallel writes were all asynchronous write
requests
 d) at the time the request completed, there were a LARGE number of
other requested queued and completed at the same time.

 Of the requests queued and completed in the meantime the breakdown was

  22 RM
  31 RA
  82 W
 445 R

 If I'm reading this correctly, it is saying that 22 reads were merged (RM),
 31 reads were remapped to another device (RA) which is probably reads from
 the dm-crypt partition, 82 were writes (W) which is not far off the number
 of writes that were in the queue and 445 were other reads. The delay was
 dominated by reads that were queued after the write request and completed
 before it.

RM == Read Meta
RA == Read Ahead  (remapping, by the way, does not happen across
   devices, just into partitions)
W and R you understood correctly.

 That's saying that the 27128th request in the trace took over 7 seconds
 to complete and was an asynchronous write from flusher. The contents of
 the queue are displayed at that time and the breakdown of requests is

  23 WS  [JEM: write sync]
  86 RM  [JEM: Read Meta]
 124 RA  [JEM: Read Ahead]
 442 W
1931 R

 7 seconds later when it was completed the breakdown of completed
 requests was

  25 WS
 114 RM
 155 RA
 408 W
2457 R

 In combination, that confirms for me that asynchronous writes from flush
 are being starved by reads. When a process requires a buffer that is locked
 by that asynchronous write from flusher, it stalls.

 The thing is, we do want to make ext4 work well with cfq, and
 prioritizing non-readahead read requests ahead of data writeback does
 make sense.  The issue is that metadata writes going through
 the block device could in some cases effectively cause a priority
 inversion when what had previously been an asynchronous writeback
 starts blocking a foreground, user-visible process.
 
 At least, that's the theory;

 I *think* the data more or less confirms the theory but it'd be nice if
 someone else double checked in case I'm seeing what I want to see
 instead of what is actually there.

Looks sane.  You can also see a lot of preempts in the blkparse
output, which indicates exactly what you're saying.  Any sync request
gets priority over the async requests.

I'll also note that even though your I/O is going all over the place
(D2C is pretty bad, 14ms), most of the time is spent waiting for a
struct request allocation or between Queue and Merge:

 All Devices 

ALL   MIN   AVG   MAX   N
--- - - - ---

Q2Q   0.1   0.000992259   8.898375882 2300861
Q2G   0.00843  10.193261239 2064.079501935 1016463 
G2I   0.00461   0.44702   3.237065090 1015803
Q2M   0.00101   8.203147238 2064.079367557 1311662
I2D   0.02012   1.476824812 2064.089774419 1014890
M2D   0.03283   6.994306138 283.573348664 1284872
D2C   0.61889   0.014438316   0.857811758 2291996
Q2C   0.72284  13.363007244 2064.092228625 2292191

 Device Overhead 

   DEV |   Q2G   G2I   Q2M   I2D   D2C
-- | - - - - -
 (  8,  0) |  33.8259%   0.0001%  35.1275%   4.8932%   0.1080%
-- | - - - - -
   Overall |  33.8259%   0.0001%  35.1275%   4.8932%   0.1080%

I'm not sure I believe that max value.  2064 seconds seems a bit high.
Also, Q2M should not be anywhere near that big, so more investigation is
required there.  A quick look over the data doesn't show any such delays
(making me question the 

Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-22 Thread Theodore Ts'o
On Mon, Apr 22, 2013 at 06:42:23PM -0400, Jeff Moyer wrote:
 
 Jan, if I were to come up with a way of promoting a particular async
 queue to the front of the line, where would I put such a call in the
 ext4/jbd2 code to be effective?

Well, I thought we had discussed trying to bump a pending I/O
automatically when there was an attempt to call lock_buffer() on the
bh?  That would be ideal, because we could keep the async writeback
low priority until someone is trying to wait upon it, at which point
obviously it should no longer be considered an async write call.
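
A very rough sketch of that idea, just to make the discussion concrete:
only trylock_buffer(), __lock_buffer() and buffer_async_write() below are
real; blk_boost_bh_write() is a hypothetical helper standing in for
whatever block-layer mechanism would re-prioritise the pending write.

#include <linux/buffer_head.h>

/*
 * HYPOTHETICAL helper: promote an already-submitted async write of this
 * buffer to sync/priority treatment in the elevator.  No such API exists
 * today; it is the piece being discussed here.
 */
void blk_boost_bh_write(struct buffer_head *bh);

static void lock_buffer_or_boost(struct buffer_head *bh)
{
	if (trylock_buffer(bh))
		return;				/* uncontended, nothing to do */

	/*
	 * Someone (typically the flusher) holds the lock for a pending
	 * async write; bump its priority before sleeping on the lock.
	 */
	if (buffer_async_write(bh))
		blk_boost_bh_write(bh);		/* hypothetical */

	__lock_buffer(bh);			/* wait for the I/O to finish */
}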

Failing that, this is something I've been toying with; what do you
think?

http://patchwork.ozlabs.org/patch/238192/
http://patchwork.ozlabs.org/patch/238257/

(The first patch in the series just makes sure that allocation bitmap
reads are marked with the META/PRIO flags.  It's not strictly speaking
related to the problem discussed here, but for completeness:
http://patchwork.ozlabs.org/patch/238193/)

- Ted


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-20 Thread Theodore Ts'o
As an update to this thread, we brought up this issue at LSF/MM, and
there is a thought that we should be able to solve this problem by
having lock_buffer() check to see if the buffer is locked due to a
write being queued, to have the priority of the write bumped up in the
write queues to resolve the priority inversion.  I believe Jeff Moyer
was going to look into this, if I remember correctly.

An alternate solution which I've been playing around with adds buffer_head
flags so we can indicate that a buffer contains metadata and/or should
have I/O submitted with the REQ_PRIO flag set.

Adding a buffer_head flag for at least BH_Meta is probably a good
thing, since that way the blktrace will be properly annotated.
Whether we should keep the BH_Prio flag or rely on lock_buffer()
automatically raising the priority is an open question; my feeling is that
if lock_buffer() can do the right thing, we should probably do it via
lock_buffer().  I have a feeling this might be decidedly non-trivial,
though, so perhaps we should just do it via BH flags?
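
For reference, a rough sketch of what the BH-flag variant could look
like; the BH_Meta/BH_Prio names are the ones mentioned above, but the
flag placement and the wrapper are illustrative rather than the actual
patches linked elsewhere in the thread.

#include <linux/buffer_head.h>
#include <linux/blk_types.h>

/* Illustrative placement only; a real patch would add these to
 * enum bh_state_bits ahead of BH_PrivateStart. */
#define BH_Meta		BH_PrivateStart		/* buffer holds metadata */
#define BH_Prio		(BH_PrivateStart + 1)	/* submit with REQ_PRIO  */

BUFFER_FNS(Meta, meta)		/* buffer_meta(), set_buffer_meta(), ... */
BUFFER_FNS(Prio, prio)		/* buffer_prio(), set_buffer_prio(), ... */

/* Map the bh flags onto the request flags at submission time so the
 * annotation shows up in blktrace and CFQ can prioritise the write. */
static int submit_bh_annotated(int rw, struct buffer_head *bh)
{
	if (buffer_meta(bh))
		rw |= REQ_META;
	if (buffer_prio(bh))
		rw |= REQ_PRIO;
	return submit_bh(rw, bh);
}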

- Ted


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-12 Thread Dave Chinner
On Fri, Apr 12, 2013 at 11:19:52AM -0400, Theodore Ts'o wrote:
> On Fri, Apr 12, 2013 at 02:50:42PM +1000, Dave Chinner wrote:
> > > If that is the case, one possible solution that comes to mind would be
> > > to mark buffer_heads that contain metadata with a flag, so that the
> > > flusher thread can write them back at the same priority as reads.
> > 
> > Ext4 is already using REQ_META for this purpose.
> 
> We're using REQ_META | REQ_PRIO for reads, not writes.
> 
> > I'm surprised that no-one has suggested "change the IO elevator"
> > yet.
> 
> Well, testing to see if the stalls go away with the noop scheduler is a
> good thing to try just to validate the theory.

Exactly.

> The thing is, we do want to make ext4 work well with cfq, and
> prioritizing non-readahead read requests ahead of data writeback does
> make sense.  The issue is that metadata writes going through
> the block device could in some cases effectively cause a priority
> inversion when what had previously been an asynchronous writeback
> starts blocking a foreground, user-visible process.

Here's the historic problem with CFQ: its scheduling algorithms
change from release to release, and so what you tune the filesystem
to for this release is likely to cause different behaviour
in a few releases' time.

We've had this problem time and time again with CFQ+XFS, so we
stopped trying to "tune" to a particular elevator long ago.  The
best you can do is tag the IO as appropriately as possible (e.g.
metadata with REQ_META, sync IO with ?_SYNC, etc), and then hope CFQ
hasn't been broken since the last release.

> At least, that's the theory; we should confirm that this is indeed
> what is causing the data stalls which Mel is reporting on HDD's before
> we start figuring out how to fix this problem.

*nod*.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-12 Thread Theodore Ts'o
On Fri, Apr 12, 2013 at 02:50:42PM +1000, Dave Chinner wrote:
> > If that is the case, one possible solution that comes to mind would be
> > to mark buffer_heads that contain metadata with a flag, so that the
> > flusher thread can write them back at the same priority as reads.
> 
> Ext4 is already using REQ_META for this purpose.

We're using REQ_META | REQ_PRIO for reads, not writes.

> I'm surprised that no-one has suggested "change the IO elevator"
> yet.

Well, testing to see if the stalls go away with the noop scheduler is a
good thing to try just to validate the theory.

The thing is, we do want to make ext4 work well with cfq, and
prioritizing non-readahead read requests ahead of data writeback does
make sense.  The issue is that metadata writes going through
the block device could in some cases effectively cause a priority
inversion when what had previously been an asynchronous writeback
starts blocking a foreground, user-visible process.

At least, that's the theory; we should confirm that this is indeed
what is causing the data stalls which Mel is reporting on HDD's before
we start figuring out how to fix this problem.

 - Ted


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-12 Thread Tvrtko Ursulin

Hi all,

On Thursday 11 April 2013 22:57:08 Theodore Ts'o wrote:
> That's an interesting theory.  If the workload is one which is very
> heavy on reads and writes, that could explain the high latency.  That
> would explain why those of us who are using primarily SSD's are seeing
> the problems, because would be reads are nice and fast.
> 
> If that is the case, one possible solution that comes to mind would be
> to mark buffer_heads that contain metadata with a flag, so that the
> flusher thread can write them back at the same priority as reads.
> 
> The only problem I can see with this hypothesis is that if this is the
> explanation for what Mel and Jiri are seeing, it's something that
> would have been around for a long time, and would affect ext3 as well
> as ext4.  That isn't quite consistent, however, with Mel's observation
> that this is a problem which has gotten worse relatively
> recently.

Dropping in as a casual observer and having missed the start of the thread, 
risking that I will just muddle the water for you.

I had a similar problem for quite a while with ext4, at least that was my 
conclusion since the fix was to migrate one filesystem to xfs which fixed it 
for me. Time period when I observed this was between 3.5 and 3.7 kernels.

Situation was I had an ext4 filesystem (on top of LVM, which was on top of MD 
RAID 1, which was on top of two mechanical hard drives) which was dedicated to 
holding a large SVN check-out. Other filesystems were also ext4 on different 
logical volumes (but same spindles).

Symptoms were long stalls of everything (including window management!) on a 
relatively heavily loaded desktop (which was KDE). Stalls would last anything 
from five to maybe even 30 seconds. Not sure exactly but long enough that you 
think the system has actually crashed. I couldn't even switch away to a 
different virtual terminal during the stall, nothing.

Eventually I traced it down to kdesvn (subversion client) periodically 
refreshing (or something) its metadata and hence generating some IO on that
dedicated filesystem. That combined with some other desktop activity had an 
effect of stalling everything else. I thought it was very weird, but I suppose 
KDE and all the rest nowadays do too much IO in everything they do.

Following a hunch I reformatted that filesystem as XFS which fixed the 
problem.

I can't reproduce this now to run any tests so I know this is not very helpful 
now. But perhaps some of the info will be useful to someone.

Tvrtko



Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-12 Thread Mel Gorman
On Thu, Apr 11, 2013 at 10:57:08PM -0400, Theodore Ts'o wrote:
> On Thu, Apr 11, 2013 at 11:33:35PM +0200, Jan Kara wrote:
> >   I think it might be more enlightening if Mel traced which process in
> > which function is holding the buffer lock. I suspect we'll find out that
> > the flusher thread has submitted the buffer for IO as an async write and
> > thus it takes a long time to complete in presence of reads which have
> > higher priority.
> 
> That's an interesting theory.  If the workload is one which is very
> heavy on reads and writes, that could explain the high latency.  That
> would explain why those of us who are using primarily SSD's are seeing
> the problems, because would be reads are nice and fast.
> 
> If that is the case, one possible solution that comes to mind would be
> to mark buffer_heads that contain metadata with a flag, so that the
> flusher thread can write them back at the same priority as reads.
> 
> The only problem I can see with this hypothesis is that if this is the
> explanation for what Mel and Jiri are seeing, it's something that
> would have been around for a long time, and would affect ext3 as well
> as ext4.  That isn't quite consistent, however, with Mel's observation
> that this is a problem which has gotten worse relatively
> recently.
> 

According to the tests I've run, multi-second stalls have been a problem for
a while but never really bothered me. I'm not sure why it felt particularly
bad around -rc2 or why it seems to be better now. Maybe I just had my
cranky pants on.

-- 
Mel Gorman
SUSE Labs


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-12 Thread Mel Gorman
On Thu, Apr 11, 2013 at 02:35:12PM -0400, Theodore Ts'o wrote:
> On Thu, Apr 11, 2013 at 06:04:02PM +0100, Mel Gorman wrote:
> > > If we're stalling on lock_buffer(), that implies that buffer was being
> > > written, and for some reason it was taking a very long time to
> > > complete.
> > > 
> > 
> > Yes.
> > 
> > > It might be worthwhile to put a timestamp in struct dm_crypt_io, and
> > > record the time when a particular I/O encryption/decryption is getting
> > > queued to the kcryptd workqueues, and when they finally squirt out.
> > > 
> > 
> > That somewhat assumes that dm_crypt was at fault which is not unreasonable
> > but I was skeptical as the workload on dm_crypt was opening a maildir
> > and mostly reads.
> 
> Hmm... well, I've reviewed all of the places in the ext4 and jbd2
> layer where we call lock_buffer(), and with one exception[1] we're not
> holding the bh locked any longer than necessary.  There are a few
> places where we grab a spinlock or two before we can do what we need
> to do and then release the lock'ed buffer head, but the only time we
> hold the bh locked for long periods of time is when we submit metadata
> blocks for I/O.
> 

Yeah, ok. This is not the answer I was hoping for but it's the answer I
expected.

> Could you add code which checks the hold time of lock_buffer(), measuring
> from when the lock is successfully grabbed, to see if you can see if I
> missed some code path in ext4 or jbd2 where the bh is locked and then
> there is some call to some function which needs to block for some
> random reason?
>
> What I'd suggest is putting a timestamp in buffer_head
> structure, which is set by lock_buffer once it is successfully grabbed
> the lock, and then in unlock_buffer(), if it is held for more than a
> second or some such, to dump out the stack trace.
> 

I can do that but the results might lack meaning. What I could do instead
is use a variation of the page owner tracking patch (current iteration at
https://lkml.org/lkml/2012/12/7/487) to record a stack trace in lock_buffer
and dump it from jbd2/transaction.c if it stalls for too long. I'll report
if I find something useful.
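
In other words, something along these lines (a sketch only: the
bh_lock_owner structure and where it hangs off the buffer_head are
hypothetical, while save_stack_trace() and print_stack_trace() are the
real stacktrace API):

#include <linux/jiffies.h>
#include <linux/stacktrace.h>

#define LOCK_OWNER_DEPTH 16

struct bh_lock_owner {			/* hypothetically attached to the bh */
	unsigned long		entries[LOCK_OWNER_DEPTH];
	struct stack_trace	trace;
	unsigned long		acquired;	/* jiffies at lock_buffer() time */
};

/* Called after lock_buffer() succeeds. */
static void record_lock_owner(struct bh_lock_owner *owner)
{
	owner->trace.nr_entries	 = 0;
	owner->trace.max_entries = LOCK_OWNER_DEPTH;
	owner->trace.entries	 = owner->entries;
	owner->trace.skip	 = 1;		/* skip this helper itself */
	save_stack_trace(&owner->trace);
	owner->acquired = jiffies;
}

/* Called from jbd2 when its lock_buffer() call has stalled too long. */
static void report_lock_owner(struct bh_lock_owner *owner)
{
	pr_warn("buffer locked %u ms ago by:\n",
		jiffies_to_msecs(jiffies - owner->acquired));
	print_stack_trace(&owner->trace, 2);
}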

> Because at this point, either I'm missing something or I'm beginning
> to suspect that your hard drive (or maybe something the block layer?)
> is simply taking a long time to service an I/O request. 

It could be because the drive is a piece of crap.

-- 
Mel Gorman
SUSE Labs


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-11 Thread Dave Chinner
On Thu, Apr 11, 2013 at 10:57:08PM -0400, Theodore Ts'o wrote:
> On Thu, Apr 11, 2013 at 11:33:35PM +0200, Jan Kara wrote:
> >   I think it might be more enlightening if Mel traced which process in
> > which function is holding the buffer lock. I suspect we'll find out that
> > the flusher thread has submitted the buffer for IO as an async write and
> > thus it takes a long time to complete in presence of reads which have
> > higher priority.
> 
> That's an interesting theory.  If the workload is one which is very
> heavy on reads and writes, that could explain the high latency.  That
> would explain why those of us who are using primarily SSD's are seeing
> the problems, because would be reads are nice and fast.
> 
> If that is the case, one possible solution that comes to mind would be
> to mark buffer_heads that contain metadata with a flag, so that the
> flusher thread can write them back at the same priority as reads.

Ext4 is already using REQ_META for this purpose.

I'm surprised that no-one has suggested "change the IO elevator"
yet.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-11 Thread Theodore Ts'o
On Thu, Apr 11, 2013 at 11:33:35PM +0200, Jan Kara wrote:
>   I think it might be more enlightening if Mel traced which process in
> which function is holding the buffer lock. I suspect we'll find out that
> the flusher thread has submitted the buffer for IO as an async write and
> thus it takes a long time to complete in presence of reads which have
> higher priority.

That's an interesting theory.  If the workload is one which is very
heavy on reads and writes, that could explain the high latency.  That
would explain why those of us who are using primarily SSD's are seeing
the problems, because would be reads are nice and fast.

If that is the case, one possible solution that comes to mind would be
to mark buffer_heads that contain metadata with a flag, so that the
flusher thread can write them back at the same priority as reads.

The only problem I can see with this hypothesis is that if this is the
explanation for what Mel and Jiri are seeing, it's something that
would have been around for a long time, and would affect ext3 as well
as ext4.  That isn't quite consistent, however, with Mel's observation
that this is a problem which has gotten worse relatively
recently.

   - Ted


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-11 Thread Jan Kara
On Thu 11-04-13 14:35:12, Ted Tso wrote:
> On Thu, Apr 11, 2013 at 06:04:02PM +0100, Mel Gorman wrote:
> > > If we're stalling on lock_buffer(), that implies that buffer was being
> > > written, and for some reason it was taking a very long time to
> > > complete.
> > > 
> > 
> > Yes.
> > 
> > > It might be worthwhile to put a timestamp in struct dm_crypt_io, and
> > > record the time when a particular I/O encryption/decryption is getting
> > > queued to the kcryptd workqueues, and when they finally squirt out.
> > > 
> > 
> > That somewhat assumes that dm_crypt was at fault which is not unreasonable
> > but I was skeptical as the workload on dm_crypt was opening a maildir
> > and mostly reads.
> 
> Hmm... well, I've reviewed all of the places in the ext4 and jbd2
> layer where we call lock_buffer(), and with one exception[1] we're not
> holding the bh locked any longer than necessary.  There are a few
> places where we grab a spinlock or two before we can do what we need
> to do and then release the lock'ed buffer head, but the only time we
> hold the bh locked for long periods of time is when we submit metadata
> blocks for I/O.
> 
> [1] There is one exception in ext4_xattr_release_block() where I
> believe we should move the call to unlock_buffer(bh) before the call
> to ext4_free_blocks(), since we've already elevated the bh count and
> ext4_free_blocks() does not need to have the bh locked.  It's not
> related to any of the stalls you've reported, though, as near as I can
> tell (none of the stack traces include the ext4 xattr code, and this
> would only affect external extended attribute blocks).
> 
> 
> Could you add code which checks the hold time of lock_buffer(), measuring
> from when the lock is successfully grabbed, to see if you can see if I
> missed some code path in ext4 or jbd2 where the bh is locked and then
> there is some call to some function which needs to block for some
> random reason?  What I'd suggest is putting a timestamp in buffer_head
> structure, which is set by lock_buffer once it is successfully grabbed
> the lock, and then in unlock_buffer(), if it is held for more than a
> second or some such, to dump out the stack trace.
> 
> Because at this point, either I'm missing something or I'm beginning
> to suspect that your hard drive (or maybe something the block layer?)
> is simply taking a long time to service an I/O request.  Putting in
> this check should be able to very quickly determine what code path
> and/or which subsystem we should be focused upon.
  I think it might be more enlightening if Mel traced which process in
which function is holding the buffer lock. I suspect we'll find out that
the flusher thread has submitted the buffer for IO as an async write and
thus it takes a long time to complete in presence of reads which have
higher priority.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-11 Thread Theodore Ts'o
On Thu, Apr 11, 2013 at 06:04:02PM +0100, Mel Gorman wrote:
> > If we're stalling on lock_buffer(), that implies that buffer was being
> > written, and for some reason it was taking a very long time to
> > complete.
> > 
> 
> Yes.
> 
> > It might be worthwhile to put a timestamp in struct dm_crypt_io, and
> > record the time when a particular I/O encryption/decryption is getting
> > queued to the kcryptd workqueues, and when they finally squirt out.
> > 
> 
> That somewhat assumes that dm_crypt was at fault which is not unreasonable
> but I was skeptical as the workload on dm_crypt was opening a maildir
> and mostly reads.

Hmm... well, I've reviewed all of the places in the ext4 and jbd2
layer where we call lock_buffer(), and with one exception[1] we're not
holding the bh locked any longer than necessary.  There are a few
places where we grab a spinlock or two before we can do what we need
to do and then release the lock'ed buffer head, but the only time we
hold the bh locked for long periods of time is when we submit metadata
blocks for I/O.

[1] There is one exception in ext4_xattr_release_block() where I
believe we should move the call to unlock_buffer(bh) before the call
to ext4_free_blocks(), since we've already elevated the bh count and
ext4_free_blocks() does not need to have the bh locked.  It's not
related to any of the stalls you've reported, though, as near as I can
tell (none of the stack traces include the ext4 xattr code, and this
would only affect external extended attribute blocks).


Could you add code which checks the hold time of lock_buffer(), measuring
from when the lock is successfully grabbed, to see if you can see if I
missed some code path in ext4 or jbd2 where the bh is locked and then
there is some call to some function which needs to block for some
random reason?  What I'd suggest is putting a timestamp in buffer_head
structure, which is set by lock_buffer once it is successfully grabbed
the lock, and then in unlock_buffer(), if it is held for more than a
second or some such, to dump out the stack trace.
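
As a minimal sketch of that suggestion (b_lock_time is a hypothetical
field added to struct buffer_head, and wrappers are shown rather than
edits to lock_buffer()/unlock_buffer() themselves):

#include <linux/buffer_head.h>
#include <linux/jiffies.h>

static inline void lock_buffer_timed(struct buffer_head *bh)
{
	lock_buffer(bh);
	bh->b_lock_time = jiffies;	/* hypothetical extra field */
}

static inline void unlock_buffer_timed(struct buffer_head *bh)
{
	/* Dump a stack trace if the buffer was held locked for > 1 second */
	if (time_after(jiffies, bh->b_lock_time + HZ))
		WARN(1, "buffer_head %p was locked for %u ms\n", bh,
		     jiffies_to_msecs(jiffies - bh->b_lock_time));
	unlock_buffer(bh);
}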

Because at this point, either I'm missing something or I'm beginning
to suspect that your hard drive (or maybe something the block layer?)
is simply taking a long time to service an I/O request.  Putting in
this check should be able to very quickly determine what code path
and/or which subsystem we should be focused upon.

Thanks,

- Ted


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-11 Thread Mel Gorman
On Wed, Apr 10, 2013 at 09:12:45AM -0400, Theodore Ts'o wrote:
> On Wed, Apr 10, 2013 at 11:56:08AM +0100, Mel Gorman wrote:
> > During major activity there is likely to be "good" behaviour
> > with stalls roughly every 30 seconds roughly corresponding to
> > dirty_expire_centiseconds. As you'd expect, the flusher thread is stuck
> > when this happens.
> > 
> >   237 ?00:00:00 flush-8:0
> > [] sleep_on_buffer+0x9/0x10
> > [] __lock_buffer+0x2e/0x30
> > [] do_get_write_access+0x43f/0x4b0
> 
> If we're stalling on lock_buffer(), that implies that buffer was being
> written, and for some reason it was taking a very long time to
> complete.
> 

Yes.

> It might be worthwhile to put a timestamp in struct dm_crypt_io, and
> record the time when a particular I/O encryption/decryption is getting
> queued to the kcryptd workqueues, and when they finally squirt out.
> 

That somewhat assumes that dm_crypt was at fault which is not unreasonable
but I was skeptical as the workload on dm_crypt was opening a maildir
and mostly reads.

I used a tracepoint in jbd2 to get an idea of what device the buffer_head
was managing and dm did not show up on the list. This is what a trace-cmd
log of the test told me.

   flush-8:0-240   [005]   236.655363: jbd2_lock_buffer_stall: dev 8,8 
stall_ms 1096
 awesome-1364  [005]   290.594396: jbd2_lock_buffer_stall: dev 8,6 
stall_ms 7312
 gnome-pty-helpe-2256  [005]   290.836952: jbd2_lock_buffer_stall: dev 8,8 
stall_ms 7528
   flush-8:0-240   [003]   304.012424: jbd2_lock_buffer_stall: dev 8,8 
stall_ms 4472
  gnome-terminal-2332  [005]   308.290879: jbd2_lock_buffer_stall: dev 8,6 
stall_ms 3060
 awesome-1364  [006]   308.291318: jbd2_lock_buffer_stall: dev 8,6 
stall_ms 3048
   flush-8:0-240   [005]   331.525996: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 8732
   flush-8:0-240   [005]   332.353526: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 472
   flush-8:0-240   [005]   345.341547: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 10024
  gnome-terminal-2418  [005]   347.166876: jbd2_lock_buffer_stall: dev 8,6 
stall_ms 11852
 awesome-1364  [005]   347.167082: jbd2_lock_buffer_stall: dev 8,6 
stall_ms 11844
   flush-8:0-240   [005]   347.424520: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 2012
   flush-8:0-240   [005]   347.583752: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 156
   flush-8:0-240   [005]   390.079682: jbd2_lock_buffer_stall: dev 8,8 
stall_ms 396
   flush-8:0-240   [002]   407.882385: jbd2_lock_buffer_stall: dev 8,8 
stall_ms 12244
   flush-8:0-240   [005]   408.003976: jbd2_lock_buffer_stall: dev 8,8 
stall_ms 124
  gnome-terminal-2610  [005]   413.613365: jbd2_lock_buffer_stall: dev 8,6 
stall_ms 3400
 awesome-1364  [006]   413.613605: jbd2_lock_buffer_stall: dev 8,6 
stall_ms 3736
   flush-8:0-240   [002]   430.706475: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 9748
   flush-8:0-240   [005]   458.188896: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 7748
   flush-8:0-240   [005]   458.828143: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 348
   flush-8:0-240   [006]   459.163814: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 252
   flush-8:0-240   [005]   462.340173: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 3160
   flush-8:0-240   [005]   469.917705: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 6340
   flush-8:0-240   [005]   474.434206: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 4512
 tar-2315  [005]   510.043613: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 4316
   tclsh-1780  [005]   773.336488: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 736
 git-3100  [005]   775.933506: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 3664
 git-4763  [005]   864.093317: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 140
   flush-8:0-240   [005]   864.242068: jbd2_lock_buffer_stall: dev 8,6 
stall_ms 280
 git-4763  [005]   864.264157: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 148
   flush-8:0-240   [005]   865.24: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 464
 git-4763  [000]   865.602469: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 300
   flush-8:0-240   [005]   865.705448: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 500
   flush-8:0-240   [005]   885.367576: jbd2_lock_buffer_stall: dev 8,8 
stall_ms 11024
   flush-8:0-240   [005]   895.339697: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 120
   flush-8:0-240   [005]   895.765488: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 424
 systemd-journal-265   [005]   915.687201: jbd2_lock_buffer_stall: dev 8,8 
stall_ms 14844
   flush-8:0-240   [005]   915.690529: jbd2_lock_buffer_stall: dev 8,6 
stall_ms 19656
 git-5442  [005]  1034.845674: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 344
 git-5442  [005]  1035.157389: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 264
   flush-8:0-240   [005]  1035.875478: jbd2_lock_buffer_stall: dev 8,8 
stall_ms 1368
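
The tracepoint patch itself is not shown in the thread; a minimal
TRACE_EVENT that would emit records in the shape above, assuming the
tracepoint just takes the device and the stall time in milliseconds,
could look like the sketch below (a real definition would carry the
usual include/trace/events/ boilerplate around it):

#include <linux/tracepoint.h>

TRACE_EVENT(jbd2_lock_buffer_stall,

	TP_PROTO(dev_t dev, unsigned long stall_ms),

	TP_ARGS(dev, stall_ms),

	TP_STRUCT__entry(
		__field(dev_t,		dev)
		__field(unsigned long,	stall_ms)
	),

	TP_fast_assign(
		__entry->dev		= dev;
		__entry->stall_ms	= stall_ms;
	),

	/* matches the "dev 8,8 ... stall_ms 1096" lines above */
	TP_printk("dev %d,%d stall_ms %lu",
		  MAJOR(__entry->dev), MINOR(__entry->dev),
		  __entry->stall_ms)
);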
   

Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-11 Thread Mel Gorman
On Wed, Apr 10, 2013 at 09:12:45AM -0400, Theodore Ts'o wrote:
 On Wed, Apr 10, 2013 at 11:56:08AM +0100, Mel Gorman wrote:
  During major activity there is likely to be good behaviour
  with stalls roughly every 30 seconds roughly corresponding to
  dirty_expire_centiseconds. As you'd expect, the flusher thread is stuck
  when this happens.
  
237 ?00:00:00 flush-8:0
  [811a35b9] sleep_on_buffer+0x9/0x10
  [811a35ee] __lock_buffer+0x2e/0x30
  [8123a21f] do_get_write_access+0x43f/0x4b0
 
 If we're stalling on lock_buffer(), that implies that buffer was being
 written, and for some reason it was taking a very long time to
 complete.
 

Yes.

 It might be worthwhile to put a timestamp in struct dm_crypt_io, and
 record the time when a particular I/O encryption/decryption is getting
 queued to the kcryptd workqueues, and when they finally squirt out.
 

That somewhat assumes that dm_crypt was at fault which is not unreasonable
but I was skeptical as the workload on dm_crypt was opening a maildir
and mostly reads.

I used a tracepoint in jbd2 to get an idea of what device the buffer_head
was managing and dm did not show up on the list. This is what a trace-cmd
log of the test told me.

   flush-8:0-240   [005]   236.655363: jbd2_lock_buffer_stall: dev 8,8 
stall_ms 1096
 awesome-1364  [005]   290.594396: jbd2_lock_buffer_stall: dev 8,6 
stall_ms 7312
 gnome-pty-helpe-2256  [005]   290.836952: jbd2_lock_buffer_stall: dev 8,8 
stall_ms 7528
   flush-8:0-240   [003]   304.012424: jbd2_lock_buffer_stall: dev 8,8 
stall_ms 4472
  gnome-terminal-2332  [005]   308.290879: jbd2_lock_buffer_stall: dev 8,6 
stall_ms 3060
 awesome-1364  [006]   308.291318: jbd2_lock_buffer_stall: dev 8,6 
stall_ms 3048
   flush-8:0-240   [005]   331.525996: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 8732
   flush-8:0-240   [005]   332.353526: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 472
   flush-8:0-240   [005]   345.341547: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 10024
  gnome-terminal-2418  [005]   347.166876: jbd2_lock_buffer_stall: dev 8,6 
stall_ms 11852
 awesome-1364  [005]   347.167082: jbd2_lock_buffer_stall: dev 8,6 
stall_ms 11844
   flush-8:0-240   [005]   347.424520: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 2012
   flush-8:0-240   [005]   347.583752: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 156
   flush-8:0-240   [005]   390.079682: jbd2_lock_buffer_stall: dev 8,8 
stall_ms 396
   flush-8:0-240   [002]   407.882385: jbd2_lock_buffer_stall: dev 8,8 
stall_ms 12244
   flush-8:0-240   [005]   408.003976: jbd2_lock_buffer_stall: dev 8,8 
stall_ms 124
  gnome-terminal-2610  [005]   413.613365: jbd2_lock_buffer_stall: dev 8,6 
stall_ms 3400
 awesome-1364  [006]   413.613605: jbd2_lock_buffer_stall: dev 8,6 
stall_ms 3736
   flush-8:0-240   [002]   430.706475: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 9748
   flush-8:0-240   [005]   458.188896: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 7748
   flush-8:0-240   [005]   458.828143: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 348
   flush-8:0-240   [006]   459.163814: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 252
   flush-8:0-240   [005]   462.340173: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 3160
   flush-8:0-240   [005]   469.917705: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 6340
   flush-8:0-240   [005]   474.434206: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 4512
 tar-2315  [005]   510.043613: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 4316
   tclsh-1780  [005]   773.336488: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 736
 git-3100  [005]   775.933506: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 3664
 git-4763  [005]   864.093317: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 140
   flush-8:0-240   [005]   864.242068: jbd2_lock_buffer_stall: dev 8,6 
stall_ms 280
 git-4763  [005]   864.264157: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 148
   flush-8:0-240   [005]   865.24: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 464
 git-4763  [000]   865.602469: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 300
   flush-8:0-240   [005]   865.705448: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 500
   flush-8:0-240   [005]   885.367576: jbd2_lock_buffer_stall: dev 8,8 
stall_ms 11024
   flush-8:0-240   [005]   895.339697: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 120
   flush-8:0-240   [005]   895.765488: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 424
 systemd-journal-265   [005]   915.687201: jbd2_lock_buffer_stall: dev 8,8 
stall_ms 14844
   flush-8:0-240   [005]   915.690529: jbd2_lock_buffer_stall: dev 8,6 
stall_ms 19656
 git-5442  [005]  1034.845674: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 344
 git-5442  [005]  1035.157389: jbd2_lock_buffer_stall: dev 8,5 
stall_ms 264
   flush-8:0-240   [005]  1035.875478: jbd2_lock_buffer_stall: dev 8,8 

Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-11 Thread Theodore Ts'o
On Thu, Apr 11, 2013 at 06:04:02PM +0100, Mel Gorman wrote:
  If we're stalling on lock_buffer(), that implies that buffer was being
  written, and for some reason it was taking a very long time to
  complete.
  
 
 Yes.
 
  It might be worthwhile to put a timestamp in struct dm_crypt_io, and
  record the time when a particular I/O encryption/decryption is getting
  queued to the kcryptd workqueues, and when they finally squirt out.
  
 
 That somewhat assumes that dm_crypt was at fault which is not unreasonable
 but I was skeptical as the workload on dm_crypt was opening a maildir
 and mostly reads.

Hmm... well, I've reviewed all of the places in the ext4 and jbd2
layer where we call lock_buffer(), and with one exception[1] we're not
holding the the bh locked any longer than necessary.  There are a few
places where we grab a spinlock or two before we can do what we need
to do and then release the lock'ed buffer head, but the only time we
hold the bh locked for long periods of time is when we submit metadata
blocks for I/O.

[1] There is one exception in ext4_xattr_release_block() where I
believe we should move the call to unlock_buffer(bh) before the call
to ext4_free_blocks(), since we've already elevanted the bh count and
ext4_free_blocks() does not need to have the bh locked.  It's not
related to any of the stalls you've repored, though, as near as I can
tell (none of the stack traces include the ext4 xattr code, and this
would only affect external extended attribute blocks).


Could you code which checks the hold time of lock_buffer(), measuing
from when the lock is successfully grabbed, to see if you can see if I
missed some code path in ext4 or jbd2 where the bh is locked and then
there is some call to some function which needs to block for some
random reason?  What I'd suggest is putting a timestamp in buffer_head
structure, which is set by lock_buffer once it is successfully grabbed
the lock, and then in unlock_buffer(), if it is held for more than a
second or some such, to dump out the stack trace.

Because at this point, either I'm missing something or I'm beginning
to suspect that your hard drive (or maybe something the block layer?)
is simply taking a long time to service an I/O request.  Putting in
this check should be able to very quickly determine what code path
and/or which subsystem we should be focused upon.

Thanks,

- Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-11 Thread Jan Kara
On Thu 11-04-13 14:35:12, Ted Tso wrote:
 On Thu, Apr 11, 2013 at 06:04:02PM +0100, Mel Gorman wrote:
   If we're stalling on lock_buffer(), that implies that buffer was being
   written, and for some reason it was taking a very long time to
   complete.
   
  
  Yes.
  
   It might be worthwhile to put a timestamp in struct dm_crypt_io, and
   record the time when a particular I/O encryption/decryption is getting
   queued to the kcryptd workqueues, and when they finally squirt out.
   
  
  That somewhat assumes that dm_crypt was at fault which is not unreasonable
  but I was skeptical as the workload on dm_crypt was opening a maildir
  and mostly reads.
 
 Hmm... well, I've reviewed all of the places in the ext4 and jbd2
 layer where we call lock_buffer(), and with one exception[1] we're not
 holding the bh locked any longer than necessary.  There are a few
 places where we grab a spinlock or two before we can do what we need
 to do and then release the locked buffer head, but the only time we
 hold the bh locked for long periods of time is when we submit metadata
 blocks for I/O.
 
 [1] There is one exception in ext4_xattr_release_block() where I
 believe we should move the call to unlock_buffer(bh) before the call
 to ext4_free_blocks(), since we've already elevated the bh count and
 ext4_free_blocks() does not need to have the bh locked.  It's not
 related to any of the stalls you've reported, though, as near as I can
 tell (none of the stack traces include the ext4 xattr code, and this
 would only affect external extended attribute blocks).
 
 
 Could you add code which checks the hold time of lock_buffer(), measuring
 from when the lock is successfully grabbed, to see whether I missed some
 code path in ext4 or jbd2 where the bh is locked and then some function
 is called which needs to block for some random reason?  What I'd suggest
 is putting a timestamp in the buffer_head structure, which is set by
 lock_buffer() once it has successfully grabbed the lock, and then in
 unlock_buffer(), if the lock has been held for more than a second or
 some such, dump out the stack trace.
 
 Because at this point, either I'm missing something or I'm beginning
 to suspect that your hard drive (or maybe something in the block layer?)
 is simply taking a long time to service an I/O request.  Putting in
 this check should be able to very quickly determine what code path
 and/or which subsystem we should be focused upon.
  I think it might be more enlightening if Mel traced which process in
which function is holding the buffer lock. I suspect we'll find out that
the flusher thread has submitted the buffer for IO as an async write and
thus it takes a long time to complete in the presence of reads which have
higher priority.
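
As a rough illustration of the kind of tracing Jan means, b_lock_owner and
b_lock_func below are invented debug fields, not part of the real struct
buffer_head:

#include <linux/buffer_head.h>
#include <linux/printk.h>
#include <linux/sched.h>

/* Record who took the buffer lock and from where, so that a task which
 * later stalls waiting for the same buffer can report the holder. */
#define lock_buffer_traced(bh)					\
	do {							\
		lock_buffer(bh);				\
		(bh)->b_lock_owner = current;			\
		(bh)->b_lock_func  = __func__;			\
	} while (0)

static inline void report_buffer_holder(struct buffer_head *bh)
{
	pr_warn("bh %p held by %s/%d, locked from %s()\n", bh,
		bh->b_lock_owner->comm, bh->b_lock_owner->pid,
		bh->b_lock_func);
}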

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-11 Thread Theodore Ts'o
On Thu, Apr 11, 2013 at 11:33:35PM +0200, Jan Kara wrote:
   I think it might be more enlightening if Mel traced which process in
 which function is holding the buffer lock. I suspect we'll find out that
 the flusher thread has submitted the buffer for IO as an async write and
 thus it takes a long time to complete in the presence of reads which have
 higher priority.

That's an interesting theory.  If the workload is one which is very
heavy on reads and writes, that could explain the high latency.  That
would explain why those of us who are using primarily SSDs aren't seeing
the problems, because reads are nice and fast.

If that is the case, one possible solution that comes to mind would be
to mark buffer_heads that contain metadata with a flag, so that the
flusher thread can write them back at the same priority as reads.
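
For reference, a sketch of what submitting a metadata buffer with elevated
priority looks like; REQ_META and REQ_PRIO are real request flags, and (as
Dave points out below) ext4 already tags its own metadata submissions this
way. The open question is whether flusher-thread writeback can be made to do
the same, so the helper below is purely hypothetical:

#include <linux/blk_types.h>
#include <linux/buffer_head.h>
#include <linux/fs.h>

/* Hypothetical helper: write back one dirty metadata buffer with the
 * REQ_META/REQ_PRIO flags so the IO scheduler can favour it over plain
 * async writeback. */
static void write_meta_bh(struct buffer_head *bh)
{
	lock_buffer(bh);
	if (test_clear_buffer_dirty(bh)) {
		get_bh(bh);
		bh->b_end_io = end_buffer_write_sync;
		submit_bh(WRITE | REQ_META | REQ_PRIO, bh);
	} else {
		unlock_buffer(bh);
	}
}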

The only problem I can see with this hypothesis is that if this is the
explanation for what Mel and Jiri are seeing, it's something that
would have been around for a long time, and would affect ext3 as well
as ext4.  That isn't quite consistent, however, with Mel's observation
that this is a problem which has gotten worse relatively
recently.

   - Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-11 Thread Dave Chinner
On Thu, Apr 11, 2013 at 10:57:08PM -0400, Theodore Ts'o wrote:
 On Thu, Apr 11, 2013 at 11:33:35PM +0200, Jan Kara wrote:
I think it might be more enlightening if Mel traced which process in
  which function is holding the buffer lock. I suspect we'll find out that
  the flusher thread has submitted the buffer for IO as an async write and
  thus it takes a long time to complete in the presence of reads which have
  higher priority.
 
 That's an interesting theory.  If the workload is one which is very
 heavy on reads and writes, that could explain the high latency.  That
 would explain why those of us who are using primarily SSDs aren't seeing
 the problems, because reads are nice and fast.
 
 If that is the case, one possible solution that comes to mind would be
 to mark buffer_heads that contain metadata with a flag, so that the
 flusher thread can write them back at the same priority as reads.

Ext4 is already using REQ_META for this purpose.

I'm surprised that no-one has suggested changing the IO elevator
yet.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-10 Thread Theodore Ts'o
On Wed, Apr 10, 2013 at 11:56:08AM +0100, Mel Gorman wrote:
> During major activity there is likely to be "good" behaviour
> with stalls roughly every 30 seconds roughly corresponding to
> dirty_expire_centiseconds. As you'd expect, the flusher thread is stuck
> when this happens.
> 
>   237 ?00:00:00 flush-8:0
> [] sleep_on_buffer+0x9/0x10
> [] __lock_buffer+0x2e/0x30
> [] do_get_write_access+0x43f/0x4b0

If we're stalling on lock_buffer(), that implies that buffer was being
written, and for some reason it was taking a very long time to
complete.

It might be worthwhile to put a timestamp in struct dm_crypt_io, and
record the time when a particular I/O encryption/decryption is getting
queued to the kcryptd workqueues, and when they finally squirt out.
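
A sketch of that instrumentation might look like the following; the structure,
field and threshold are invented for illustration, and the real place for the
timestamp would be struct dm_crypt_io in drivers/md/dm-crypt.c:

#include <linux/jiffies.h>
#include <linux/printk.h>
#include <linux/workqueue.h>

/* Invented wrapper for illustration: stamp the request when it is handed
 * to kcryptd and report how long it waited once the work actually runs. */
struct crypt_io_stamp {
	unsigned long		queued;	/* jiffies at queue_work() time */
	struct work_struct	work;
};

static void crypt_io_queue_stamped(struct workqueue_struct *wq,
				   struct crypt_io_stamp *io)
{
	io->queued = jiffies;
	queue_work(wq, &io->work);
}

static void crypt_io_report_latency(struct crypt_io_stamp *io)
{
	unsigned long delay = jiffies - io->queued;

	if (delay > HZ)		/* arbitrary 1-second threshold */
		pr_warn("dm-crypt: request waited %u ms in kcryptd\n",
			jiffies_to_msecs(delay));
}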

Something else that might be worth trying is to add WQ_HIGHPRI to the
workqueue flags and see if that makes a difference.
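
That would be roughly a one-line change where dm-crypt allocates its
workqueue, along the lines of the sketch below (the flags other than
WQ_HIGHPRI are shown purely for illustration):

#include <linux/workqueue.h>

/* Illustrative only: put kcryptd work on the high-priority worker pool. */
static struct workqueue_struct *alloc_highpri_crypt_wq(void)
{
	return alloc_workqueue("kcryptd",
			       WQ_HIGHPRI | WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM,
			       1);
}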

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-10 Thread Mel Gorman
On Tue, Apr 02, 2013 at 11:06:51AM -0400, Theodore Ts'o wrote:
> On Tue, Apr 02, 2013 at 03:27:17PM +0100, Mel Gorman wrote:
> > I'm testing a page-reclaim-related series on my laptop that is partially
> > aimed at fixing long stalls when doing metadata-intensive operations on
> > low memory such as a git checkout. I've been running 3.9-rc2 with the
> > series applied but found that the interactive performance was awful even
> > when there was plenty of free memory.
> 
> Can you try 3.9-rc4 or later and see if the problem still persists?
> There were a number of ext4 issues especially around low memory
> performance which weren't resolved until -rc4.
> 

I experimented with this for a while. -rc6 "feels" much better, whereas
-rc2 felt like it would stall for prolonged periods of time, but it could be
my imagination too. It does appear that queue depth and await times are
slowly increasing for various reasons.

It's always been the case for me that metadata-intensive and write activities
in the background (opening a maildir + cache-cold git checkout) would stall
the machine for periods of time. This time around, I timed how long it
takes gnome-terminal to open, run find on a directory and exit again while
a cache-cold git checkout and a maildir open were running.

v3.0.66
  count time
471 5
 23 10
 11 15
 14 20
  4 25
  8 30
  3 35

v3.7
  count time
636 5
 20 10
 13 15
 11 20
  7 25
  1 30
  3 35
  1 40
  1 45

v3.8
  count time
394 5
 10 10
 12 15
  8 20
  9 25
  6 30
  2 35
  3 40

v3.9-rc6
  count time
481 5
 14 10
  9 15
 12 20
  8 25
  4 30
  2 35
  3 40
  1 45
  1 50
  1 140

This shows that kernel 3.7 was able to open the terminal in 5 seconds or
less 636 times during the test. Very broadly speaking, v3.0.66 is snappier
and generally able to open the terminal and do some work faster. v3.9-rc6 is
sometimes much slower, such as when it took 140 seconds to open the terminal,
but not consistently slow enough to allow it to be reliably bisected.

Further, whatever my perceptions are telling me, the fact is that git
checkouts are not obviously worse. However, queue depth and IO wait
times are higher, but the increase is gradual and does not obviously make a
very bad impression. See here:

 v3.0.66  checkout:278 depth:387.36 await: 878.97 launch:29.39 max_launch: 34.20
 v3.7     checkout:268 depth:439.96 await: 971.39 launch:29.46 max_launch: 40.42
 v3.8     checkout:275 depth:598.12 await:1280.62 launch:31.95 max_launch: 38.50
 v3.9-rc6 checkout:266 depth:540.74 await:1182.10 launch:45.39 max_launch:138.14

Cache cold git checkout times are roughly comparable but average queue depth
has been increasing and average wait times in v3.8 and v3.9-rc6 are higher
in comparison to v3.0.66. The average time it takes to launch a terminal
and do something with it is also increasing. Unfortunately, these results
are not always perfectly reproducible and it cannot be reliably bisected.

That said, the worst IO wait times (in milliseconds) are getting higher:

              await   r_await    w_await
v3.0.66     5811.24     39.19   28309.72
v3.7        7508.79     46.36   36318.96
v3.8        7083.35     47.55   35305.46
v3.9-rc2    9211.14     35.25   34560.08
v3.9-rc6    7499.53     95.21  122780.43

Worst-case small read times have almost doubled. A worst case write
delay was 122 seconds in v3.9-rc6!

The average wait times are also not painting a pretty picture

              await   r_await    w_await
v3.0.66      878.97      7.79    6975.51
v3.7         971.39      7.84    7745.57
v3.8        1280.63      7.75   10306.62
v3.9-rc2    1280.37      7.55    7687.20
v3.9-rc6    1182.11      8.11   13869.67

That indicates that average wait times have almost doubled since
v3.7. Even though -rc2 felt bad, it's not obviously reflected in the await
figures, which is partially what makes bisecting this difficult. At least
you can get an impression of the wait times from this smoothed graph
showing await times from iostat:

http://www.csn.ul.ie/~mel/postings/interactivity-20130410/await-times-smooth.png

Again, while one can see the wait times are worse, they are generally not
enough worse to pinpoint the regression to a single change.

Other observations

On my laptop, pm-utils was setting dirty_background_ratio to 5% and
dirty_ratio to 10%, away from the expected defaults of 10% and 20%. Any
of the changes related to dirty balancing could have affected how often
processes get dirty rate-limited.

During major activity there is likely to be "good" behaviour
with stalls roughly every 30 seconds roughly corresponding to
dirty_expire_centiseconds. As you'd expect, the flusher thread is stuck
when this happens.

  237 ?00:00:00 flush-8:0
[] sleep_on_buffer+0x9/0x10
[] __lock_buffer+0x2e/0x30
[] do_get_write_access+0x43f/0x4b0
[] 


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-08 Thread Theodore Ts'o
On Sun, Apr 07, 2013 at 05:59:06PM -0400, Frank Ch. Eigler wrote:
> > semantic error: while resolving probe point: identifier 'kprobe' at 
> > /tmp/stapdjN4_l:18:7
> > source: probe kprobe.function("get_request_wait")
> >   ^
> > Pass 2: analysis failed.  [man error::pass2]
> > Unexpected exit of STAP script at ./watch-dstate.pl line 296.
> > I have no clue what to do next.  Can you give me a hint?

Is there any reason why the error message couldn't be simplified to
something such as "kernel symbol not found"?  I wasn't sure if the problem
was that there was some incompatibility between a recent change with
kprobe and systemtap, or a parse failure in the systemtap script, etc.

> Systemtap could endeavour to list roughly-matching functions that do
> exist, if you think that'd be helpful.

If the goal is ease of use, I suspect the more important thing that
systemtap could do is to make its error messages more easily
understandable, instead of pointing the user to read a man page where
the user then has to figure out which one of a number of failure
scenarios was caused by a particularly opaque error message.  (The
man page doesn't even say that "semantic error while resolving probe
point" means that a kernel function doesn't exist -- especially
complaining about the kprobe identifier points the user in the wrong
direction.)

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-08 Thread Frank Ch. Eigler
Hi, Mel -

> > [...]  git kernel developers
> > should use git systemtap, as has always been the case.  [...]
> 
> At one point in the past this used to be the case but then systemtap had to
> be compiled as part of automated tests across different kernel versions. It
> could have been worked around in various ways or even installed manually
> when machines were deployed but stap-fix.sh generally took less time to
> keep working.

OK, if that works for you.  Keep in mind though that newer versions of
systemtap retain backward-compatibility for ancient versions of the
kernel, so git systemtap should work on those older versions just
fine.


> [...]
> Yes, this was indeed the problem. The next version of watch-dstate.pl
> treated get_request_wait() as a function that may or may not exist. It
> uses /proc/kallsyms to figure it out.

... or you can use the "?" punctuation in the script to have
systemtap adapt:

probe kprobe.function("get_request_wait") ?  { ... }


- FChE
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-08 Thread Mel Gorman
On Sun, Apr 07, 2013 at 05:59:06PM -0400, Frank Ch. Eigler wrote:
> 
> Hi -
> 
> 
> tytso wrote:
> 
> > So I tried to reproduce the problem, and so I installed systemtap
> > (bleeding edge, since otherwise it won't work with development
> > kernel), and then rebuilt a kernel with all of the necessary CONFIG
> > options enabled:
> >
> > CONFIG_DEBUG_INFO, CONFIG_KPROBES, CONFIG_RELAY, CONFIG_DEBUG_FS,
> > CONFIG_MODULES, CONFIG_MODULE_UNLOAD
> > [...]
> 
> That sounds about right.
> 
> 
> > I then pulled down mmtests, and tried running watch-dstate.pl, which
> > is what I assume you were using [...]
> 
> I just took a look at the mmtests, particularly the stap-fix.sh stuff.
> The heroics therein are really not called for.  git kernel developers
> should use git systemtap, as has always been the case.  All
> compatibility hacks in stap-fix.sh have already been merged, in many
> cases for months.
> 

At one point in the past this used to be the case but then systemtap had to
be compiled as part of automated tests across different kernel versions. It
could have been worked around in various ways or even installed manually
when machines were deployed but stap-fix.sh generally took less time to
keep working.

> 
> > [...]
> > semantic error: while resolving probe point: identifier 'kprobe' at 
> > /tmp/stapdjN4_l:18:7
> > source: probe kprobe.function("get_request_wait")
> >   ^
> > Pass 2: analysis failed.  [man error::pass2]
> > Unexpected exit of STAP script at ./watch-dstate.pl line 296.
> > I have no clue what to do next.  Can you give me a hint?
> 
> You should see the error::pass2 man page, which refers to
> error::reporting, which refers to involving stap folks and running
> stap-report to gather needed info.
> 
> But in this case, that's unnecessary: the problem is most likely that
> the get_request_wait function does not actually exist any longer, since
> 
> commit a06e05e6afab70b4b23c0a7975aaeae24b195cd6
> Author: Tejun Heo 
> Date:   Mon Jun 4 20:40:55 2012 -0700
> 
> block: refactor get_request[_wait]()
> 

Yes, this was indeed the problem. The next version of watch-dstate.pl
treats get_request_wait() as a function that may or may not exist. It
uses /proc/kallsyms to figure it out.
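
The check itself is simple. A minimal C equivalent of what the script does
(the real watch-dstate.pl is Perl, so this is just an illustration of the
/proc/kallsyms lookup):

#include <stdio.h>
#include <string.h>

/* Return 1 if the named symbol appears in /proc/kallsyms, 0 otherwise. */
static int kernel_symbol_exists(const char *name)
{
	char line[512], sym[256];
	int found = 0;
	FILE *f = fopen("/proc/kallsyms", "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		/* format: <address> <type> <symbol> [module] */
		if (sscanf(line, "%*s %*s %255s", sym) == 1 &&
		    strcmp(sym, name) == 0) {
			found = 1;
			break;
		}
	}
	fclose(f);
	return found;
}

A probe for get_request_wait() would then only be generated when
kernel_symbol_exists("get_request_wait") returns non-zero.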

Thanks.

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-07 Thread Frank Ch. Eigler

Hi -


tytso wrote:

> So I tried to reproduce the problem, and so I installed systemtap
> (bleeding edge, since otherwise it won't work with development
> kernel), and then rebuilt a kernel with all of the necessary CONFIG
> options enabled:
>
>   CONFIG_DEBUG_INFO, CONFIG_KPROBES, CONFIG_RELAY, CONFIG_DEBUG_FS,
>   CONFIG_MODULES, CONFIG_MODULE_UNLOAD
> [...]

That sounds about right.


> I then pulled down mmtests, and tried running watch-dstate.pl, which
> is what I assume you were using [...]

I just took a look at the mmtests, particularly the stap-fix.sh stuff.
The heroics therein are really not called for.  git kernel developers
should use git systemtap, as has always been the case.  All
compatibility hacks in stap-fix.sh have already been merged, in many
cases for months.


> [...]
> semantic error: while resolving probe point: identifier 'kprobe' at 
> /tmp/stapdjN4_l:18:7
> source: probe kprobe.function("get_request_wait")
>   ^
> Pass 2: analysis failed.  [man error::pass2]
> Unexpected exit of STAP script at ./watch-dstate.pl line 296.
> I have no clue what to do next.  Can you give me a hint?

You should see the error::pass2 man page, which refers to
error::reporting, which refers to involving stap folks and running
stap-report to gather needed info.

But in this case, that's unnecessary: the problem is most likely that
the get_request_wait function does not actually exist any longer, since

commit a06e05e6afab70b4b23c0a7975aaeae24b195cd6
Author: Tejun Heo 
Date:   Mon Jun 4 20:40:55 2012 -0700

block: refactor get_request[_wait]()


Systemtap could endeavour to list roughly-matching functions that do
exist, if you think that'd be helpful.


- FChE
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-06 Thread Theodore Ts'o
On Sat, Apr 06, 2013 at 09:29:48AM +0200, Jiri Slaby wrote:
> 
> I'm not sure, as I am using -next like forever. But sure, there was a
> kernel which didn't have this problem.

Any chance you could try rolling back to 3.2 or 3.5 to see if you can
get a starting point?  Even a high-level bisection search would be
helpful to give us a hint.

>Ok,
>  dd if=/dev/zero of=xxx
>is enough instead of "kernel update".

Was the dd running in the VM or in the host OS?  Basically, is running
the VM required?

> Nothing, just VM (kernel update from console) and mplayer2 on the host.
> This is more-or-less reproducible with these two.

No browser or anything else running that might be introducing a stream
of fsync()'s?

> jbd2/sda5-8-10969 [000]    403.679552: jbd2_run_stats: dev
>259,655360 tid 348895 wait 0 request_delay 0 running 5000 locked 1040
>flushing 0 logging 112 handle_count 148224 blocks 1 blocks_logged 2

>  jbd2/md2-8-959   [000]    408.111339: jbd2_run_stats: dev 9,2
>tid 382991 wait 0 request_delay 0 running 5156 locked 2268 flushing 0
>logging 124 handle_count 5 blocks 1 blocks_logged 2

OK, so this is interesting.  The commit is stalling for 1 second in
the transaction commit on sda5, and then very shortly thereafter
for 2.2 seconds on md2, while we are trying to lock down the
transaction.  What that means is that we are waiting for all of the
transaction handles opened against that particular commit to complete
before we can let the transaction commit proceed.

Is md2 sharing the same disk spindle as sda5?  And to which disk were
you doing the "dd if=/dev/zero of=/dev/XXX" command?

If I had to guess what's going on, the disk is accepting a huge amount
of writes to its track buffer, and then occasionally it is going out
to lunch trying to write all of this data to the disk platter.  This
is not (always) happening when we do the commit (with its attendant
cache flush command), but in a few cases, we are doing a read command
which is getting stalled.  There are a few cases where we start a
transaction handle, and then discover that we need to read in a disk
block, and if that read stalls for a long period of time, it will hold
the transaction handle open, and this will in turn stall the commit.
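
In code terms, the pattern being described is roughly the following (a
schematic example, not a real ext4 path; the point is that the read can
stall for seconds while the handle pins the running transaction):

#include <linux/buffer_head.h>
#include <linux/err.h>
#include <linux/jbd2.h>

static int read_block_under_handle(journal_t *journal,
				   struct super_block *sb, sector_t blocknr)
{
	handle_t *handle = jbd2_journal_start(journal, 1);
	struct buffer_head *bh;

	if (IS_ERR(handle))
		return PTR_ERR(handle);

	/* This read can block for a long time on a busy disk, and while it
	 * does, the handle keeps the current transaction from committing. */
	bh = sb_bread(sb, blocknr);
	/* ... dirty metadata under the handle here ... */
	brelse(bh);

	return jbd2_journal_stop(handle);
}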

If you were to grab a blktrace, I suspect that is what you will
find: that it's actually a read command which is stalling at some
point, correlated with when we are trying to start the transaction commit.

 - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-06 Thread Jiri Slaby
On 04/06/2013 09:37 AM, Jiri Slaby wrote:
> On 04/06/2013 09:29 AM, Jiri Slaby wrote:
>> On 04/06/2013 01:16 AM, Theodore Ts'o wrote:
>>> On Sat, Apr 06, 2013 at 12:18:11AM +0200, Jiri Slaby wrote:
 Ok, so now I'm running 3.9.0-rc5-next-20130404, it's not that bad, but
 it still sucks. Updating a kernel in a VM still results in "Your system
 is too SLOW to play this!" by mplayer and frame dropping.
>>>
>>> What was the first kernel where you didn't have the problem?  Were you
>>> using the 3.8 kernel earlier, and did you see the interactivity
>>> problems there?
>>
>> I'm not sure, as I am using -next like forever. But sure, there was a
>> kernel which didn't have this problem.
>>
>>> What else was running in on your desktop at the same time?
>>
>> Nothing, just VM (kernel update from console) and mplayer2 on the host.
>> This is more-or-less reproducible with these two.
> 
> Ok,
>   dd if=/dev/zero of=xxx
> is enough instead of "kernel update".
> 
> Writeback mount doesn't help.
> 
>>> How was
>>> the file system mounted,
>>
>> Both are actually a single device /dev/sda5:
>> /dev/sda5 on /win type ext4 (rw,noatime,data=ordered)
>>
>> Should I try writeback?
>>
>>> and can you send me the output of dumpe2fs -h
>>> /dev/XXX?
>>
>> dumpe2fs 1.42.7 (21-Jan-2013)
>> Filesystem volume name:   <none>
>> Last mounted on:  /win
>> Filesystem UUID:  cd4bf4d2-bc32-4777-a437-ee24c4ee5f1b
>> Filesystem magic number:  0xEF53
>> Filesystem revision #:1 (dynamic)
>> Filesystem features:  has_journal ext_attr resize_inode dir_index
>> filetype needs_recovery extent flex_bg sparse_super large_file huge_file
>> uninit_bg dir_nlink extra_isize
>> Filesystem flags: signed_directory_hash
>> Default mount options:user_xattr acl
>> Filesystem state: clean
>> Errors behavior:  Continue
>> Filesystem OS type:   Linux
>> Inode count:  30507008
>> Block count:  122012416
>> Reserved block count: 0
>> Free blocks:  72021328
>> Free inodes:  30474619
>> First block:  0
>> Block size:   4096
>> Fragment size:4096
>> Reserved GDT blocks:  994
>> Blocks per group: 32768
>> Fragments per group:  32768
>> Inodes per group: 8192
>> Inode blocks per group:   512
>> RAID stride:  32747
>> Flex block group size:16
>> Filesystem created:   Fri Sep  7 20:44:21 2012
>> Last mount time:  Thu Apr  4 12:22:01 2013
>> Last write time:  Thu Apr  4 12:22:01 2013
>> Mount count:  256
>> Maximum mount count:  -1
>> Last checked: Sat Sep  8 21:13:28 2012
>> Check interval:   0 (<none>)
>> Lifetime writes:  1011 GB
>> Reserved blocks uid:  0 (user root)
>> Reserved blocks gid:  0 (group root)
>> First inode:  11
>> Inode size:   256
>> Required extra isize: 28
>> Desired extra isize:  28
>> Journal inode:8
>> Default directory hash:   half_md4
>> Directory Hash Seed:  b6ad3f8b-72ce-49d6-92cb-abccd7dbe98e
>> Journal backup:   inode blocks
>> Journal features: journal_incompat_revoke
>> Journal size: 128M
>> Journal length:   32768
>> Journal sequence: 0x00054dc7
>> Journal start:8193
>>
>>> Oh, and what options were you using to when you kicked off
>>> the VM?
>>
>> qemu-kvm -k en-us -smp 2 -m 1200 -soundhw hda -usb -usbdevice tablet
>> -net user -net nic,model=e1000 -serial pty -balloon virtio -hda x.img
>>
>>> The other thing that would be useful was to enable the jbd2_run_stats
>>> tracepoint and to send the output of the trace log when you notice the
>>> interactivity problems.
>>
>> Ok, I will try.

Inline here, as well as attached:
# tracer: nop
#
# entries-in-buffer/entries-written: 46/46   #P:2
#
#  _-=> irqs-off
# / _=> need-resched
#| / _---=> hardirq/softirq
#|| / _--=> preempt-depth
#||| / delay
#   TASK-PID   CPU#  TIMESTAMP  FUNCTION
#  | |   |      | |
 jbd2/sda5-8-10969 [000]    387.054319: jbd2_run_stats: dev
259,655360 tid 348892 wait 0 request_delay 0 running 5728 locked 0
flushing 0 logging 28 handle_count 10 blocks 1 blocks_logged 2
 jbd2/sda5-8-10969 [000]    392.594132: jbd2_run_stats: dev
259,655360 tid 348893 wait 0 request_delay 0 running 5300 locked 0
flushing 0 logging 64 handle_count 75944 blocks 1 blocks_logged 2
  jbd2/md2-8-959   [000]    396.249990: jbd2_run_stats: dev 9,2
tid 382990 wait 0 request_delay 0 running 5500 locked 0 flushing 0
logging 220 handle_count 3 blocks 1 blocks_logged 2
  jbd2/md1-8-1826  [000]    397.205670: jbd2_run_stats: dev 9,1
tid 1081270 wait 0 request_delay 0 running 5760 locked 0 flushing 0
logging 200 handle_count 2 blocks 0 blocks_logged 0
 jbd2/sda5-8-10969 [000]    397.563660: jbd2_run_stats: dev
259,655360 tid 348894 wait 0 request_delay 0 running 5000 locked 0
flushing 0 logging 32 handle_count 89397 blocks 

Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-06 Thread Jiri Slaby
On 04/06/2013 09:29 AM, Jiri Slaby wrote:
> On 04/06/2013 01:16 AM, Theodore Ts'o wrote:
>> On Sat, Apr 06, 2013 at 12:18:11AM +0200, Jiri Slaby wrote:
>>> Ok, so now I'm running 3.9.0-rc5-next-20130404, it's not that bad, but
>>> it still sucks. Updating a kernel in a VM still results in "Your system
>>> is too SLOW to play this!" by mplayer and frame dropping.
>>
>> What was the first kernel where you didn't have the problem?  Were you
>> using the 3.8 kernel earlier, and did you see the interactivity
>> problems there?
> 
> I'm not sure, as I am using -next like forever. But sure, there was a
> kernel which didn't have this problem.
> 
>> What else was running in on your desktop at the same time?
> 
> Nothing, just VM (kernel update from console) and mplayer2 on the host.
> This is more-or-less reproducible with these two.

Ok,
  dd if=/dev/zero of=xxx
is enough instead of "kernel update".

Writeback mount doesn't help.

>> How was
>> the file system mounted,
> 
> Both are actually a single device /dev/sda5:
> /dev/sda5 on /win type ext4 (rw,noatime,data=ordered)
> 
> Should I try writeback?
> 
>> and can you send me the output of dumpe2fs -h
>> /dev/XXX?
> 
> dumpe2fs 1.42.7 (21-Jan-2013)
> Filesystem volume name:   <none>
> Last mounted on:  /win
> Filesystem UUID:  cd4bf4d2-bc32-4777-a437-ee24c4ee5f1b
> Filesystem magic number:  0xEF53
> Filesystem revision #:1 (dynamic)
> Filesystem features:  has_journal ext_attr resize_inode dir_index
> filetype needs_recovery extent flex_bg sparse_super large_file huge_file
> uninit_bg dir_nlink extra_isize
> Filesystem flags: signed_directory_hash
> Default mount options:user_xattr acl
> Filesystem state: clean
> Errors behavior:  Continue
> Filesystem OS type:   Linux
> Inode count:  30507008
> Block count:  122012416
> Reserved block count: 0
> Free blocks:  72021328
> Free inodes:  30474619
> First block:  0
> Block size:   4096
> Fragment size:4096
> Reserved GDT blocks:  994
> Blocks per group: 32768
> Fragments per group:  32768
> Inodes per group: 8192
> Inode blocks per group:   512
> RAID stride:  32747
> Flex block group size:16
> Filesystem created:   Fri Sep  7 20:44:21 2012
> Last mount time:  Thu Apr  4 12:22:01 2013
> Last write time:  Thu Apr  4 12:22:01 2013
> Mount count:  256
> Maximum mount count:  -1
> Last checked: Sat Sep  8 21:13:28 2012
> Check interval:   0 (<none>)
> Lifetime writes:  1011 GB
> Reserved blocks uid:  0 (user root)
> Reserved blocks gid:  0 (group root)
> First inode:  11
> Inode size:   256
> Required extra isize: 28
> Desired extra isize:  28
> Journal inode:8
> Default directory hash:   half_md4
> Directory Hash Seed:  b6ad3f8b-72ce-49d6-92cb-abccd7dbe98e
> Journal backup:   inode blocks
> Journal features: journal_incompat_revoke
> Journal size: 128M
> Journal length:   32768
> Journal sequence: 0x00054dc7
> Journal start:8193
> 
>> Oh, and what options were you using to when you kicked off
>> the VM?
> 
> qemu-kvm -k en-us -smp 2 -m 1200 -soundhw hda -usb -usbdevice tablet
> -net user -net nic,model=e1000 -serial pty -balloon virtio -hda x.img
> 
>> The other thing that would be useful was to enable the jbd2_run_stats
>> tracepoint and to send the output of the trace log when you notice the
>> interactivity problems.
> 
> Ok, I will try.
> 
> thanks,
> 


-- 
js
suse labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-06 Thread Jiri Slaby
On 04/06/2013 01:16 AM, Theodore Ts'o wrote:
> On Sat, Apr 06, 2013 at 12:18:11AM +0200, Jiri Slaby wrote:
>> Ok, so now I'm running 3.9.0-rc5-next-20130404, it's not that bad, but
>> it still sucks. Updating a kernel in a VM still results in "Your system
>> is too SLOW to play this!" by mplayer and frame dropping.
> 
> What was the first kernel where you didn't have the problem?  Were you
> using the 3.8 kernel earlier, and did you see the interactivity
> problems there?

I'm not sure, as I am using -next like forever. But sure, there was a
kernel which didn't have this problem.

> What else was running in on your desktop at the same time?

Nothing, just VM (kernel update from console) and mplayer2 on the host.
This is more-or-less reproducible with these two.

> How was
> the file system mounted,

Both are actually a single device /dev/sda5:
/dev/sda5 on /win type ext4 (rw,noatime,data=ordered)

Should I try writeback?

> and can you send me the output of dumpe2fs -h
> /dev/XXX?

dumpe2fs 1.42.7 (21-Jan-2013)
Filesystem volume name:   <none>
Last mounted on:  /win
Filesystem UUID:  cd4bf4d2-bc32-4777-a437-ee24c4ee5f1b
Filesystem magic number:  0xEF53
Filesystem revision #:1 (dynamic)
Filesystem features:  has_journal ext_attr resize_inode dir_index
filetype needs_recovery extent flex_bg sparse_super large_file huge_file
uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options:user_xattr acl
Filesystem state: clean
Errors behavior:  Continue
Filesystem OS type:   Linux
Inode count:  30507008
Block count:  122012416
Reserved block count: 0
Free blocks:  72021328
Free inodes:  30474619
First block:  0
Block size:   4096
Fragment size:4096
Reserved GDT blocks:  994
Blocks per group: 32768
Fragments per group:  32768
Inodes per group: 8192
Inode blocks per group:   512
RAID stride:  32747
Flex block group size:16
Filesystem created:   Fri Sep  7 20:44:21 2012
Last mount time:  Thu Apr  4 12:22:01 2013
Last write time:  Thu Apr  4 12:22:01 2013
Mount count:  256
Maximum mount count:  -1
Last checked: Sat Sep  8 21:13:28 2012
Check interval:   0 (<none>)
Lifetime writes:  1011 GB
Reserved blocks uid:  0 (user root)
Reserved blocks gid:  0 (group root)
First inode:  11
Inode size:   256
Required extra isize: 28
Desired extra isize:  28
Journal inode:8
Default directory hash:   half_md4
Directory Hash Seed:  b6ad3f8b-72ce-49d6-92cb-abccd7dbe98e
Journal backup:   inode blocks
Journal features: journal_incompat_revoke
Journal size: 128M
Journal length:   32768
Journal sequence: 0x00054dc7
Journal start:8193

> Oh, and what options were you using to when you kicked off
> the VM?

qemu-kvm -k en-us -smp 2 -m 1200 -soundhw hda -usb -usbdevice tablet
-net user -net nic,model=e1000 -serial pty -balloon virtio -hda x.img

> The other thing that would be useful was to enable the jbd2_run_stats
> tracepoint and to send the output of the trace log when you notice the
> interactivity problems.

Ok, I will try.

thanks,
-- 
js
suse labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-06 Thread Theodore Ts'o
On Sat, Apr 06, 2013 at 09:29:48AM +0200, Jiri Slaby wrote:
 
 I'm not sure, as I am using -next like forever. But sure, there was a
 kernel which didn't have this problem.

Any chance you could try rolling back to 3.2 or 3.5 to see if you can
get a starting point?  Even a high-level bisection search would be
helpful to give us a hint.
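
Even a coarse bisection is only a handful of commands -- a sketch, where
the good/bad endpoints are illustrative and should be adjusted to
whatever last behaved well:

  # mark the endpoints and let git pick revisions to test
  git bisect start
  git bisect bad v3.9-rc5
  git bisect good v3.2
  # build and boot the suggested kernel, repeat the mplayer test, then:
  git bisect good    # or: git bisect bad
  # repeat until git names the first bad commit, then clean up
  git bisect reset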

Ok,
  dd if=/dev/zero of=xxx
is enough instead of kernel update.

Was the dd running in the VM or in the host OS?  Basically, is running
the VM required?

 Nothing, just VM (kernel update from console) and mplayer2 on the host.
 This is more-or-less reproducible with these two.

No browser or anything else running that might be introducing a stream
of fsync()'s?
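
If in doubt, fsync() callers are easy to spot from the syscall
tracepoint -- a sketch, assuming CONFIG_FTRACE_SYSCALLS and debugfs
mounted at /sys/kernel/debug:

  # log every fsync() caller until interrupted
  echo 1 > /sys/kernel/debug/tracing/events/syscalls/sys_enter_fsync/enable
  cat /sys/kernel/debug/tracing/trace_pipe
  # disable it again when done
  echo 0 > /sys/kernel/debug/tracing/events/syscalls/sys_enter_fsync/enable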

 jbd2/sda5-8-10969 [000]    403.679552: jbd2_run_stats: dev
259,655360 tid 348895 wait 0 request_delay 0 running 5000 locked 1040
flushing 0 logging 112 handle_count 148224 blocks 1 blocks_logged 2

  jbd2/md2-8-959   [000]    408.111339: jbd2_run_stats: dev 9,2
tid 382991 wait 0 request_delay 0 running 5156 locked 2268 flushing 0
logging 124 handle_count 5 blocks 1 blocks_logged 2

OK, so this is interesting.  The commit is stalling for 1 second in
the transaction commit on sda5, and then very shortly thereafter
for 2.2 seconds on md2, while we are trying to lock down the
transaction.  What that means is that we are waiting for all of the
transaction handles opened against that particular commit to complete
before we can let the transaction commit proceed.

Is md2 sharing the same disk spindle as sda5?  And to which disk were
you doing the dd if=/dev/zero of=/dev/XXX command?

If I had to guess what's going on, the disk is accepting a huge amount
of writes to its track buffer, and then occasionally it is going out
to lunch trying to write all of this data to the disk platter.  This
is not (always) happening when we do the commit (with its attendant
cache flush command), but in a few cases, we are doing a read command
which is getting stalled.  There are a few cases where we start a
transaction handle, and then discover that we need to read in a disk
block, and if that read stalls for a long period of time, it will hold
the transaction handle open, and this will in turn stall the commit.

If you were to grab a blktrace, I suspect that is what you will
find: that it's actually a read command which is stalling at some
point, correlated with when we are trying to start the transaction commit.
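
A minimal way to grab and summarise such a trace (a sketch, assuming
the blktrace tools are installed and /dev/sda is the device of
interest):

  # record 60 seconds of block-layer events while reproducing the stall
  blktrace -d /dev/sda -o stall -w 60
  # merge the per-CPU traces into one binary stream and summarise the
  # per-request latencies (Q2G, D2C, etc.)
  blkparse -i stall -d stall.bin > /dev/null
  btt -i stall.bin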

 - Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-05 Thread Theodore Ts'o
On Sat, Apr 06, 2013 at 12:18:11AM +0200, Jiri Slaby wrote:
> Ok, so now I'm runnning 3.9.0-rc5-next-20130404, it's not that bad, but
> it still sucks. Updating a kernel in a VM still results in "Your system
> is too SLOW to play this!" by mplayer and frame dropping.

What was the first kernel where you didn't have the problem?  Were you
using the 3.8 kernel earlier, and did you see the interactivity
problems there?

What else was running on your desktop at the same time?  How was
the file system mounted, and can you send me the output of dumpe2fs -h
/dev/XXX?  Oh, and what options were you using when you kicked off
the VM?

The other thing that would be useful is to enable the jbd2_run_stats
tracepoint and to send the output of the trace log when you notice the
interactivity problems.

Thanks,

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-05 Thread Jiri Slaby
On 04/03/2013 12:19 PM, Mel Gorman wrote:
> On Tue, Apr 02, 2013 at 11:14:36AM -0400, Theodore Ts'o wrote:
>> On Tue, Apr 02, 2013 at 11:06:51AM -0400, Theodore Ts'o wrote:
>>>
>>> Can you try 3.9-rc4 or later and see if the problem still persists?
>>> There were a number of ext4 issues especially around low memory
>>> performance which weren't resolved until -rc4.
>>
>> Actually, sorry, I took a closer look and I'm not as sure going to
>> -rc4 is going to help (although we did have some ext4 patches to fix a
>> number of bugs that flowed in as late as -rc4).
>>
> 
> I'm running with -rc5 now. I have not noticed many interactivity problems
> as such but the stall detection script reported that mutt stalled for
> 20 seconds opening an inbox and imapd blocked for 59 seconds doing path
> lookups, imaps blocked again for 12 seconds doing an atime update, an RSS
> reader blocked for 3.5 seconds writing a file. etc.
> 
> There has been no reclaim activity in the system yet and 2G is still free
> so it's very unlikely to be a page or slab reclaim problem.

Ok, so now I'm running 3.9.0-rc5-next-20130404, it's not that bad, but
it still sucks. Updating a kernel in a VM still results in "Your system
is too SLOW to play this!" by mplayer and frame dropping.

3.5G out of 6G memory used, the rest is I/O cache.

I have 7200RPM disks in my desktop.

-- 
js
suse labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-03 Thread Mel Gorman
On Tue, Apr 02, 2013 at 07:16:13PM -0400, Theodore Ts'o wrote:
> I've tried doing some quick timing, and if it is a performance
> regression, it's not a recent one --- or I haven't been able to
> reproduce what Mel is seeing.  I tried the following commands while
> booted into 3.2, 3.8, and 3.9-rc3 kernels:
> 
> time git clone ...
> rm .git/index ; time git reset
> 

FWIW, I had run a number of git checkout-based tests over time and none
of them revealed anything useful. Granted it was on other machines but I
don't think it's git on its own. It's a combination that leads to this
problem. Maybe it's really an IO scheduler problem and I need to figure
out what combination triggers it.
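
If the IO scheduler is a suspect, switching it per device is cheap to
test -- a sketch; sda and the deadline/cfq choices are illustrative:

  # the scheduler shown in [brackets] is the active one
  cat /sys/block/sda/queue/scheduler
  # do a comparison run with deadline, then switch back
  echo deadline > /sys/block/sda/queue/scheduler
  echo cfq > /sys/block/sda/queue/scheduler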

> <SNIP>
>
> Mel, how bad are the various git commands that you are trying?  Have you
> tried using time to get estimates of how long a git clone or other git
> operation is taking?
> 

Unfortunately, the mileage varies considerably and it's not always
possible to time the operation. It may be that on one occasion opening
a mail takes an abnormally long time, with git operations occasionally
making it far worse.

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-03 Thread Mel Gorman
On Wed, Apr 03, 2013 at 08:05:30AM -0400, Theodore Ts'o wrote:
> On Wed, Apr 03, 2013 at 11:19:25AM +0100, Mel Gorman wrote:
> > 
> > I'm running with -rc5 now. I have not noticed many interactivity problems
> > as such but the stall detection script reported that mutt stalled for
> > 20 seconds opening an inbox and imapd blocked for 59 seconds doing path
> > lookups, imaps blocked again for 12 seconds doing an atime update, an RSS
> > reader blocked for 3.5 seconds writing a file. etc.
> 
> If imaps blocked for 12 seconds during an atime update, combined with
> everything else, at a guess it got caught by something holding up a
> journal commit. 

It's a possibility.

I apologise but I forgot that mail is stored on an encrypted partition on
this machine. It's formatted ext4 but dmcrypt could be making this problem
worse if it's stalling ext4 waiting to encrypt/decrypt data due to either
a scheduler or workqueue change.

> Could you try enabling the jbd2_run_stats tracepoint
> and grabbing the trace log?  This will give you statistics on how long
> (in milliseconds) each of the various phases of a jbd2 commit is
> taking, i.e.:
> 
> jbd2/sdb1-8-327   [002]  39681.874661: jbd2_run_stats: dev 8,17 tid 
> 7163786 wait 0 request_delay 0 running 3530 locked 0 flushing 0 logging 0 
> handle_count 75 blocks 8 blocks_logged 9
>  jbd2/sdb1-8-327   [003]  39682.514153: jbd2_run_stats: dev 8,17 tid 
> 7163787 wait 0 request_delay 0 running 640 locked 0 flushing 0 logging 0 
> handle_count 39 blocks 12 blocks_logged 13
>  jbd2/sdb1-8-327   [000]  39687.665609: jbd2_run_stats: dev 8,17 tid 
> 7163788 wait 0 request_delay 0 running 5150 locked 0 flushing 0 logging 0 
> handle_count 60 blocks 13 blocks_logged 14
>  jbd2/sdb1-8-327   [000]  39693.200453: jbd2_run_stats: dev 8,17 tid 
> 7163789 wait 0 request_delay 0 running 4840 locked 0 flushing 0 logging 0 
> handle_count 53 blocks 10 blocks_logged 11
>  jbd2/sdb1-8-327   [001]  39695.061657: jbd2_run_stats: dev 8,17 tid 
> 7163790 wait 0 request_delay 0 running 1860 locked 0 flushing 0 logging 0 
> handle_count 124 blocks 19 blocks_logged 20
> 

Attached as well as the dstate summary that was recorded at the same
time. It's not quite as compelling but I'll keep the monitor running and
see if something falls out. I didn't find anything useful in the existing
mmtests tests that could be used to bisect this but not many of them are
focused on IO.

> In the above sample each journal commit is running for no more than 5
> seconds or so (since that's the default jbd2 commit timeout; if a
> transaction is running for less than 5 seconds, then either we ran out
> of room in the journal, and the blocks_logged number will be high, or
> a commit was forced by something such as an fsync call).  
> 

I didn't see anything majorly compelling in the jbd tracepoints but I'm
not 100% sure I'm looking for the right thing either. I also recorded
/proc/latency_stat and there were some bad sync latencies from the file
as you can see here

3 4481 1586 jbd2_log_wait_commit ext4_sync_file vfs_fsync sys_msync 
system_call_fastpath
3 11325 4373 sleep_on_page wait_on_page_bit kretprobe_trampoline 
filemap_write_and_wait_range ext4_sync_file vfs_fsync sys_msync 
system_call_fastpath
85 1130707 14904 jbd2_journal_stop jbd2_journal_force_commit ext4_force_commit 
ext4_sync_file do_fsync sys_fsync system_call_fastpath
1 2161073 2161073 start_this_handle jbd2__journal_start.part.8 
jbd2__journal_start __ext4_journal_start_sb ext4_da_writepages do_writepages 
__filemap_fdatawrite_range filemap_write_and_wait_range ext4_sync_file do_fsync 
sys_fsync system_call_fastpath
118 7798435 596184 jbd2_log_wait_commit jbd2_journal_stop 
jbd2_journal_force_commit ext4_force_commit ext4_sync_file do_fsync sys_fsync 
system_call_fastpath
599 15496449 3405822 sleep_on_page wait_on_page_bit kretprobe_trampoline 
filemap_write_and_wait_range ext4_sync_file do_fsync sys_fsync 
system_call_fastpath
405 28572881 2619592 jbd2_log_wait_commit ext4_sync_file do_fsync sys_fsync 
system_call_fastpath
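
The numbers above come from the latencytop interface; to collect the
same data, a sketch, assuming CONFIG_LATENCYTOP=y:

  # enable collection and clear any stale statistics
  echo 1 > /proc/sys/kernel/latencytop
  echo 0 > /proc/latency_stat
  # reproduce the stalls, then read back count/total/max per backtrace
  cat /proc/latency_stat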


> If an atime update is getting blocked by 12 seconds, then it would be
> interesting to see if a journal commit is running for significantly
> longer than 5 seconds, or if one of the other commit phases is taking
> significant amounts of time.  (On the example above they are all
> taking no time, since I ran this on a relatively uncontended system;
> only a single git operation taking place.)
> 
> Something else that might be worth trying is grabbing a lock_stat
> report and see if something is sitting on an ext4 or jbd2 mutex for a
> long time.
> 

Ok, if nothing useful falls out in this session I'll enable lock
debugging. latency_stat on its own would not be enough to conclude that
a problem was related to lock contention.

> Finally, as I mentioned I tried some rather simplistic tests and I
> didn't notice any difference between a 3.2 kernel and a 3.8/3.9-rc5
> kernel.  Assuming you 

Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-03 Thread Theodore Ts'o
On Wed, Apr 03, 2013 at 11:19:25AM +0100, Mel Gorman wrote:
> 
> I'm running with -rc5 now. I have not noticed many interactivity problems
> as such but the stall detection script reported that mutt stalled for
> 20 seconds opening an inbox and imapd blocked for 59 seconds doing path
> lookups, imaps blocked again for 12 seconds doing an atime update, an RSS
> reader blocked for 3.5 seconds writing a file. etc.

If imaps blocked for 12 seconds during an atime update, combined with
everything else, at a guess it got caught by something holding up a
journal commit.  Could you try enabling the jbd2_run_stats tracepoint
and grabbing the trace log?  This will give you statistics on how long
(in milliseconds) each of the various phases of a jbd2 commit is
taking, i.e.:

jbd2/sdb1-8-327   [002]  39681.874661: jbd2_run_stats: dev 8,17 tid 
7163786 wait 0 request_delay 0 running 3530 locked 0 flushing 0 logging 0 
handle_count 75 blocks 8 blocks_logged 9
 jbd2/sdb1-8-327   [003]  39682.514153: jbd2_run_stats: dev 8,17 tid 
7163787 wait 0 request_delay 0 running 640 locked 0 flushing 0 logging 0 
handle_count 39 blocks 12 blocks_logged 13
 jbd2/sdb1-8-327   [000]  39687.665609: jbd2_run_stats: dev 8,17 tid 
7163788 wait 0 request_delay 0 running 5150 locked 0 flushing 0 logging 0 
handle_count 60 blocks 13 blocks_logged 14
 jbd2/sdb1-8-327   [000]  39693.200453: jbd2_run_stats: dev 8,17 tid 
7163789 wait 0 request_delay 0 running 4840 locked 0 flushing 0 logging 0 
handle_count 53 blocks 10 blocks_logged 11
 jbd2/sdb1-8-327   [001]  39695.061657: jbd2_run_stats: dev 8,17 tid 
7163790 wait 0 request_delay 0 running 1860 locked 0 flushing 0 logging 0 
handle_count 124 blocks 19 blocks_logged 20

In the above sample each journal commit is running for no more than 5
seconds or so (since that's the default jbd2 commit timeout; if a
transaction is running for less than 5 seconds, then either we ran out
of room in the journal, and the blocks_logged number will be high, or
a commit was forced by something such as an fsync call).  
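
Enabling the tracepoint is just a couple of writes to the ftrace
control files -- a minimal sketch, assuming debugfs is mounted at
/sys/kernel/debug:

  # enable only the jbd2_run_stats event
  echo 1 > /sys/kernel/debug/tracing/events/jbd2/jbd2_run_stats/enable
  # reproduce the stall, then save the accumulated log
  cat /sys/kernel/debug/tracing/trace > jbd2-run-stats.log
  # disable it again afterwards
  echo 0 > /sys/kernel/debug/tracing/events/jbd2/jbd2_run_stats/enable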

If an atime update is getting blocked by 12 seconds, then it would be
interesting to see if a journal commit is running for significantly
longer than 5 seconds, or if one of the other commit phases is taking
significant amounts of time.  (On the example above they are all
taking no time, since I ran this on a relatively uncontended system;
only a single git operation taking place.)

Something else that might be worth trying is grabbing a lock_stat
report and see if something is sitting on an ext4 or jbd2 mutex for a
long time.
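
The lock_stat report needs a kernel built with CONFIG_LOCK_STAT=y;
collecting it is a similar in-and-out through /proc (a sketch):

  # clear old data and turn collection on
  echo 0 > /proc/lock_stat
  echo 1 > /proc/sys/kernel/lock_stat
  # reproduce the stall, then dump the per-lock contention and hold times
  cat /proc/lock_stat > lock_stat.log
  echo 0 > /proc/sys/kernel/lock_stat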

Finally, as I mentioned I tried some rather simplistic tests and I
didn't notice any difference between a 3.2 kernel and a 3.8/3.9-rc5
kernel.  Assuming you can get a version of systemtap that
simultaneously works on 3.2 and 3.9-rc5 :-P, any chance you could do a
quick experiment and see if you're seeing a difference on your setup?

Thanks!!

 - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-03 Thread Mel Gorman
On Tue, Apr 02, 2013 at 11:14:36AM -0400, Theodore Ts'o wrote:
> On Tue, Apr 02, 2013 at 11:06:51AM -0400, Theodore Ts'o wrote:
> > 
> > Can you try 3.9-rc4 or later and see if the problem still persists?
> > There were a number of ext4 issues especially around low memory
> > performance which weren't resolved until -rc4.
> 
> Actually, sorry, I took a closer look and I'm not as sure going to
> -rc4 is going to help (although we did have some ext4 patches to fix a
> number of bugs that flowed in as late as -rc4).
> 

I'm running with -rc5 now. I have not noticed many interactivity problems
as such but the stall detection script reported that mutt stalled for
20 seconds opening an inbox and imapd blocked for 59 seconds doing path
lookups, imaps blocked again for 12 seconds doing an atime update, an RSS
reader blocked for 3.5 seconds writing a file. etc.

There has been no reclaim activity in the system yet and 2G is still free
so it's very unlikely to be a page or slab reclaim problem.

> Can you send us the patch that you used to record these long stall
> times? 

No patch but it depends on systemtap which you are already aware is a wreck
to work with and frequently breaks between kernel versions for a variety of
reasons. Minimally, it is necessary to revert commit ba6fdda4 (profiling:
Remove unused timer hook) to get systemtap working.  I've reported this
problem to the patch author and the systemtap mailing list.

Other workarounds are necessary so I updated mmtests in git and at
http://www.csn.ul.ie/~mel/projects/mmtests/mmtests-0.10-mmtests-0.01.tar.gz
. Download and untar it

1. stap can be "fixed" by running bin/stap-fix.sh. It will try and run
   a one-liner stap script and if that fails it'll try very crude workarounds.
   Your mileage may vary considerably

2. If you want to run the monitor script yourself, it's
   sudo monitors/watch-dstate.pl | tee /tmp/foo.log

   but be aware the log may be truncated due to buffering.  Optionally you
   can avoid the buffered write problem by running mmtests as

   sudo ./run-mmtests.sh --config configs/config-monitor-interactive stall-debug

   and the log will be in work/log/dstate-stall-debug-monitor.gz

3. Summarise the report with

   cat /tmp/foo.log | subreport/stap-dstate-frequency

I'll be digging through other mmtests results shortly to see if I already
have a better reproduction case that is eligible for bisection but those
results are based on different machines so no guarantees of success.

> And I assume you're using a laptop drive?  5400RPM or 7200RPM?
> 

Yes, laptop drive, 7200RPM. CFQ scheduler. Drive queue depth is 32. 

/dev/sda:

ATA device, with non-removable media
Model Number:   ST9320423AS 
Serial Number:  5VH5M0LY
Firmware Revision:  0003LVM1
Transport:  Serial
Standards:
Used: unknown (minor revision code 0x0029) 
Supported: 8 7 6 5 
Likely used: 8
Configuration:
Logical max current
cylinders   16383   16383
heads   16  16
sectors/track   63  63
--
CHS current addressable sectors:   16514064
LBAuser addressable sectors:  268435455
LBA48  user addressable sectors:  625142448
Logical  Sector size:   512 bytes
Physical Sector size:   512 bytes
device size with M = 1024*1024:  305245 MBytes
device size with M = 1000*1000:  320072 MBytes (320 GB)
cache/buffer size  = 16384 KBytes
Nominal Media Rotation Rate: 7200
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16  Current = 16
Advanced power management level: 128
Recommended acoustic management value: 254, current value: 0
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5 
 Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4 
 Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
Enabled Supported:
   *SMART feature set
Security Mode feature set
   *Power Management feature set
   *Write cache
   *Look-ahead
   *Host Protected Area feature set
   *WRITE_BUFFER command
   *READ_BUFFER command
   *DOWNLOAD_MICROCODE
   *Advanced Power Management feature set
SET_MAX security extension
   *48-bit Address feature set
   *Device Configuration Overlay feature set
   *Mandatory FLUSH_CACHE
   *FLUSH_CACHE_EXT
   *SMART error logging
   *SMART self-test
   *General Purpose Logging feature set
   *64-bit World wide name

Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-02 Thread Theodore Ts'o
I've tried doing some quick timing, and if it is a performance
regression, it's not a recent one --- or I haven't been able to
reproduce what Mel is seeing.  I tried the following commands while
booted into 3.2, 3.8, and 3.9-rc3 kernels:

time git clone ...
rm .git/index ; time git reset

I did this on a number of git repos, including one that was freshly
cloned, and one that had around 3 dozen patches applied via git am (so
there were a bunch of loose objects).  And I tried doing this on an
SSD and a 5400rpm HDD, and I did it with all of the in-memory cache
flushed via "echo 3 > /proc/sys/vm/drop_caches".  The worst case was
doing a "time git reset" after deleting the .git/index file after
applying all of Kent Overstreet's recent AIO patches that had been
sent out for review.  It took around 55 seconds, on 3.2, 3.8 and
3.9-rc3.  That is pretty horrible, but for me that's the reason why I
use SSD's.
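
For reference, the cold-cache variant of that measurement is just (a
sketch; the repository path is illustrative, and dropping caches needs
root):

  cd /path/to/repo
  rm .git/index
  sync
  echo 3 > /proc/sys/vm/drop_caches   # drop page cache, dentries and inodes
  time git reset                      # rebuilds the index; metadata-read heavy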

Mel, how bad are the various git commands that you are trying?  Have you
tried using time to get estimates of how long a git clone or other git
operation is taking?

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-02 Thread Theodore Ts'o
So I tried to reproduce the problem, and so I installed systemtap
(bleeding edge, since otherwise it won't work with a development
kernel), and then rebuilt a kernel with all of the necessary CONFIG
options enabled:

CONFIG_DEBUG_INFO, CONFIG_KPROBES, CONFIG_RELAY, CONFIG_DEBUG_FS,
CONFIG_MODULES, CONFIG_MODULE_UNLOAD

I then pulled down mmtests, and tried running watch-dstate.pl, which
is what I assume you were using, and I got a reminder of why I've
tried very hard to avoid using systemtap:

semantic error: while resolving probe point: identifier 'kprobe' at 
/tmp/stapdjN4_l:18:7
source: probe kprobe.function("get_request_wait")
  ^

semantic error: no match
semantic error: while resolving probe point: identifier 'kprobe' at :74:8
source: }probe kprobe.function("get_request_wait").return
   ^

Pass 2: analysis failed.  [man error::pass2]
Unexpected exit of STAP script at ./watch-dstate.pl line 296.

I have no clue what to do next.  Can you give me a hint?

Thanks,

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-02 Thread Mel Gorman
On Tue, Apr 02, 2013 at 11:03:36PM +0800, Zheng Liu wrote:
> Hi Mel,
> 
> Thanks for reporting it.
> 
> On 04/02/2013 10:27 PM, Mel Gorman wrote:
> > I'm testing a page-reclaim-related series on my laptop that is partially
> > aimed at fixing long stalls when doing metadata-intensive operations on
> > low memory such as a git checkout. I've been running 3.9-rc2 with the
> > series applied but found that the interactive performance was awful even
> > when there was plenty of free memory.
> > 
> > I activated a monitor from mmtests that logs when a process is stuck for
> > a long time in D state and found that there are a lot of stalls in ext4.
> > The report first states that processes have been stalled for a total of
> > 6498 seconds on IO which seems like a lot. Here is a breakdown of the
> > recorded events.
> 
> In this merge window, we add a status tree as an extent cache.  Meanwhile
> an es_cache shrinker is registered to try to reclaim from this cache when
> we are under high memory pressure. 

Ok.

> So I suspect that the root cause
> is this shrinker.  Could you please tell me how to reproduce this
> problem?  If I understand correctly, I can run mmtest to reproduce this
> problem, right?
> 

This is normal desktop usage with some development thrown in, nothing
spectacular but nothing obviously reproducible either unfortunately. I
just noticed that some git operations were taking abnormally long, mutt
was very slow opening mail, applications like mozilla were very slow to
launch etc. and dug a little further. I haven't checked if regression
tests under mmtests captured something similar yet.

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-02 Thread Theodore Ts'o
On Tue, Apr 02, 2013 at 11:06:51AM -0400, Theodore Ts'o wrote:
> 
> Can you try 3.9-rc4 or later and see if the problem still persists?
> There were a number of ext4 issues especially around low memory
> performance which weren't resolved until -rc4.

Actually, sorry, I took a closer look and I'm not as sure going to
-rc4 is going to help (although we did have some ext4 patches to fix a
number of bugs that flowed in as late as -rc4).

Can you send us the patch that you used to record these long stall
times?  And I assume you're using a laptop drive?  5400RPM or 7200RPM?

   - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-02 Thread Theodore Ts'o
On Tue, Apr 02, 2013 at 03:27:17PM +0100, Mel Gorman wrote:
> I'm testing a page-reclaim-related series on my laptop that is partially
> aimed at fixing long stalls when doing metadata-intensive operations on
> low memory such as a git checkout. I've been running 3.9-rc2 with the
> series applied but found that the interactive performance was awful even
> when there was plenty of free memory.

Can you try 3.9-rc4 or later and see if the problem still persists?
There were a number of ext4 issues especially around low memory
performance which weren't resolved until -rc4.

Thanks,

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-02 Thread Zheng Liu
Hi Mel,

Thanks for reporting it.

On 04/02/2013 10:27 PM, Mel Gorman wrote:
> I'm testing a page-reclaim-related series on my laptop that is partially
> aimed at fixing long stalls when doing metadata-intensive operations on
> low memory such as a git checkout. I've been running 3.9-rc2 with the
> series applied but found that the interactive performance was awful even
> when there was plenty of free memory.
> 
> I activated a monitor from mmtests that logs when a process is stuck for
> a long time in D state and found that there are a lot of stalls in ext4.
> The report first states that processes have been stalled for a total of
> 6498 seconds on IO which seems like a lot. Here is a breakdown of the
> recorded events.

In this merge window, we add a status tree as an extent cache.  Meanwhile
an es_cache shrinker is registered to try to reclaim from this cache when
we are under high memory pressure.  So I suspect that the root cause
is this shrinker.  Could you please tell me how to reproduce this
problem?  If I understand correctly, I can run mmtest to reproduce this
problem, right?

Thanks,
- Zheng
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Excessive stall times on ext4 in 3.9-rc2

2013-04-02 Thread Jiri Slaby
On 04/02/2013 04:27 PM, Mel Gorman wrote:
> I'm testing a page-reclaim-related series on my laptop that is partially
> aimed at fixing long stalls when doing metadata-intensive operations on
> low memory such as a git checkout. I've been running 3.9-rc2 with the
> series applied but found that the interactive performance was awful even
> when there was plenty of free memory.
> 
> I activated a monitor from mmtests that logs when a process is stuck for
> a long time in D state and found that there are a lot of stalls in ext4.
> The report first states that processes have been stalled for a total of
> 6498 seconds on IO which seems like a lot. Here is a breakdown of the
> recorded events.

Just a note that I am indeed using ext4 on the affected machine for all
filesystems I have except for an efi partition...

-- 
js
suse labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

