Re: Excessive stall times on ext4 in 3.9-rc2
On Wed, Apr 24, 2013 at 03:09:13PM -0400, Jeff Moyer wrote:
> Mel Gorman writes:
> >> I'll also note that even though your I/O is going all over the place
> >> (D2C is pretty bad, 14ms), most of the time is spent waiting for a
> >> struct request allocation or between Queue and Merge:
> >>
> >> All Devices
> >>
> >>     ALL     MIN           AVG            MAX              N
> >>     ---     ---           ---            ---              ---
> >>
> >>     Q2Q     0.1           0.000992259    8.898375882      2300861
> >>     Q2G     0.00843       10.193261239   2064.079501935   1016463  <
> >
> > This is not normally my sandbox so do you mind spelling this out?
> >
> > IIUC, the time to allocate the struct request from the slab cache is just a
> > small portion of this time. The bulk of the time is spent in get_request()
> > waiting for congestion to clear on the request list for either the sync or
> > async queue. Once a process goes to sleep on that waitqueue, it has to wait
> > until enough requests on that queue have been serviced before it gets woken
> > again, at which point it gets priority access to prevent further starvation.
> > This is the Queue To Get Request (Q2G) delay. What we may be seeing here
> > is that the async queue was congested and, on average, we are waiting for
> > 10 seconds for it to clear. The maximum value may be bogus for reasons
> > explained later.
> >
> > Is that accurate?
>
> Yes, without getting into excruciating detail.

Good enough, thanks.

> >> I'm not sure I believe that max value. 2064 seconds seems a bit high.
> >
> > It is, so I looked closer at the timestamps and there is a one-hour
> > correction about 4400 seconds into the test. Daylight savings time kicked
> > in on March 31st and the machine is rarely rebooted until this test case
> > came along. It looks like there is a timezone or time misconfiguration
> > on the laptop that starts the machine with the wrong time. NTP must have
> > corrected the time, which skewed the readings in that window severely :(
>
> Not sure I'm buying that argument, as there are no gaps in the blkparse
> output.
> The logging is not done using wallclock time. I still haven't
> had sufficient time to dig into these numbers.

Ok.

> > It's awkward but it's not like there are standard benchmarks lying around
> > and it seemed the best way to reproduce the problems I typically see early
> > in the lifetime of a system or when running a git checkout when the tree
> > has not been used in a few hours. Run the actual test with
> >
> > ./run-mmtests.sh --config configs/config-global-dhp__io-multiple-source-latency --run-monitor test-name-of-your-choice
> >
> > Results will be in work/log. You'll need to run this as root so it
> > can run blktrace and so it can drop_caches between git checkouts
> > (to force disk IO). If systemtap craps out on you, then edit
> > configs/config-global-dhp__io-multiple-source-latency and remove dstate
> > from MONITORS_GZIP
>
> And how do I determine whether I've hit the problem?

If systemtap is available then

cat work/log/dstate-TESTNAME-gitcheckout | subreport/stap-dstate-frequency

will give you a report on the worst stalls and the stack traces when those
stalls occurred. If the stalls are 10+ seconds then you've certainly hit
the problem.

Alternatively

cd work/log
../../compare-kernels.sh

Look at the average time it takes to run the git checkout. Is it abnormally
high in comparison to if there was no parallel IO? If you do not know what
the normal time is, run with

./run-mmtests.sh --config configs/config-global-dhp__io-multiple-source-latency --no-monitor test-name-no-monitor

The monitors are what's opening the maildir and generating the parallel
IO. If there is an excessive difference between the git checkout times,
then you've hit the problem.

Furthermore, look at the await times. If they do not appear in the
compare-kernels.sh output then

../../bin/compare-mmtests.pl -d . -b gitcheckout -n test-name-of-your-choice --print-monitor iostat

And look at the await times. Are they very high? If so, you hit the problem.
If you want a better look at the await figures over time, either look at
the iostat file or you can graph it with

../../bin/graph-mmtests.sh -d . -b gitcheckout -n test-name-of-your-choice --print-monitor iostat --sub-heading sda-await

where sda-await should be substituted for whatever the disk is that you're
running the test on.

> > If you have trouble getting this running, ping me on IRC.
>
> Yes, I'm having issues getting things to go, but you didn't provide me a
> time zone, an irc server or a nick to help me find you. Was that
> intentional? ;-)

Not consciously :). I'm in the IST timezone (GMT+1), OFTC IRC network and
#mm channel. If it's my evening I'm not always responsive so send me the
error output on email and I'll try fixing any problems that way.

--
Mel Gorman
SUSE Labs
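As an aside on the DST/NTP theory earlier in this message: whether a clock
step landed mid-trace can be checked mechanically, since a step shows up as
one implausible delta between consecutive timestamps. A minimal sketch of
that sanity check (the data here is synthetic, not from the trace in
question):

```python
def find_time_jumps(timestamps, threshold=60.0):
    """Return (index, delta) pairs where consecutive timestamps
    differ by more than `threshold` seconds -- a likely clock step."""
    jumps = []
    for i in range(1, len(timestamps)):
        delta = timestamps[i] - timestamps[i - 1]
        if abs(delta) > threshold:
            jumps.append((i, delta))
    return jumps

# Hypothetical log: steady 0.5s samples with a +3600s step (DST or NTP)
ts = [i * 0.5 for i in range(10)] + [3605.0 + i * 0.5 for i in range(10)]
print(find_time_jumps(ts))  # -> [(10, 3600.5)]
```

Running this over the monitor logs (which do use wallclock time, unlike
blktrace) would confirm or rule out a correction in the test window.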
Re: Excessive stall times on ext4 in 3.9-rc2
Mel Gorman writes:

>> I'll also note that even though your I/O is going all over the place
>> (D2C is pretty bad, 14ms), most of the time is spent waiting for a
>> struct request allocation or between Queue and Merge:
>>
>> All Devices
>>
>>     ALL     MIN           AVG            MAX              N
>>     ---     ---           ---            ---              ---
>>
>>     Q2Q     0.1           0.000992259    8.898375882      2300861
>>     Q2G     0.00843       10.193261239   2064.079501935   1016463  <
>
> This is not normally my sandbox so do you mind spelling this out?
>
> IIUC, the time to allocate the struct request from the slab cache is just a
> small portion of this time. The bulk of the time is spent in get_request()
> waiting for congestion to clear on the request list for either the sync or
> async queue. Once a process goes to sleep on that waitqueue, it has to wait
> until enough requests on that queue have been serviced before it gets woken
> again, at which point it gets priority access to prevent further starvation.
> This is the Queue To Get Request (Q2G) delay. What we may be seeing here
> is that the async queue was congested and, on average, we are waiting for
> 10 seconds for it to clear. The maximum value may be bogus for reasons
> explained later.
>
> Is that accurate?

Yes, without getting into excruciating detail.

>>     G2I     0.00461       0.44702        3.237065090      1015803
>>     Q2M     0.00101       8.203147238    2064.079367557   1311662
>>     I2D     0.02012       1.476824812    2064.089774419   1014890
>>     M2D     0.03283       6.994306138    283.573348664    1284872
>>     D2C     0.61889       0.014438316    0.857811758      2291996
>>     Q2C     0.72284       13.363007244   2064.092228625   2292191
>>
>> Device Overhead
>>
>>        DEV |       Q2G       G2I       Q2M       I2D       D2C
>> ---------- | --------- --------- --------- --------- ---------
>>   ( 8, 0)  |  33.8259%   0.0001%  35.1275%   4.8932%   0.1080%
>> ---------- | --------- --------- --------- --------- ---------
>>    Overall |  33.8259%   0.0001%  35.1275%   4.8932%   0.1080%
>>
>> I'm not sure I believe that max value. 2064 seconds seems a bit high.
>
> It is, so I looked closer at the timestamps and there is a one-hour
> correction about 4400 seconds into the test.
> Daylight savings time kicked
> in on March 31st and the machine is rarely rebooted until this test case
> came along. It looks like there is a timezone or time misconfiguration
> on the laptop that starts the machine with the wrong time. NTP must have
> corrected the time, which skewed the readings in that window severely :(

Not sure I'm buying that argument, as there are no gaps in the blkparse
output. The logging is not done using wallclock time. I still haven't
had sufficient time to dig into these numbers.

>> Also, Q2M should not be anywhere near that big, so more investigation is
>> required there. A quick look over the data doesn't show any such delays
>> (making me question the tools), but I'll write some code tomorrow to
>> verify the btt output.
>
> It might be a single set of readings during a time correction that
> screwed it.

Again, I don't think so.

> I can reproduce it at will. Due to the nature of the test, the test
> results are variable and unfortunately it is one of the trickier mmtest
> configurations to set up.
>
> 1. Get access to a webserver
> 2. Clone mmtests to your test machine
>    git clone https://github.com/gormanm/mmtests.git
> 3. Edit shellpacks/common-config.sh and set WEBROOT to a webserver path
> 4. Create a tar.gz of a large git tree and place it at
>    $WEBROOT/linux-2.6.tar.gz
>    Alternatively place a compressed git tree anywhere and edit
>    configs/config-global-dhp__io-multiple-source-latency
>    and update GITCHECKOUT_SOURCETAR
> 5. Create a tar.gz of a large maildir directory and place it at
>    $WEBROOT/maildir.tar.gz
>    Alternatively, use an existing maildir folder and set
>    MONITOR_INBOX_OPEN_MAILDIR in
>    configs/config-global-dhp__io-multiple-source-latency
>
> It's awkward but it's not like there are standard benchmarks lying around
> and it seemed the best way to reproduce the problems I typically see early
> in the lifetime of a system or when running a git checkout when the tree
> has not been used in a few hours.
> Run the actual test with
>
> ./run-mmtests.sh --config configs/config-global-dhp__io-multiple-source-latency --run-monitor test-name-of-your-choice
>
> Results will be in work/log. You'll need to run this as root so it
> can run blktrace and so it can drop_caches between git checkouts
> (to force disk IO). If systemtap craps out on you, then edit
> configs/config-global-dhp__io-multiple-source-latency and remove dstate
> from MONITORS_GZIP

And how do I determine whether I've hit the problem?
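For readers unfamiliar with the btt notation in the table above: the
per-phase figures come from pairing per-request blktrace events and taking
timestamp differences. A toy illustration of that bookkeeping (the event
tuples and timings are invented; this is not btt itself):

```python
# Toy per-request event log: (request_id, event, timestamp_seconds).
# Events mirror blktrace actions: Q=queued, G=got request, I=inserted,
# D=dispatched to driver, C=completed.
events = [
    ("r1", "Q", 0.000),
    ("r1", "G", 0.008),  # 8ms spent waiting in get_request()
    ("r1", "I", 0.009),
    ("r1", "D", 0.020),
    ("r1", "C", 0.034),
]

times = {}
for rid, ev, ts in events:
    times.setdefault(rid, {})[ev] = ts

for rid, t in times.items():
    q2g = t["G"] - t["Q"]   # struct request allocation wait (Q2G)
    d2c = t["C"] - t["D"]   # device service time (D2C)
    q2c = t["C"] - t["Q"]   # total queue-to-complete latency (Q2C)
    print(rid, round(q2g, 3), round(d2c, 3), round(q2c, 3))
```

A 10-second average Q2G against a 14ms average D2C, as in the table, is
what points the finger at request-list congestion rather than the device.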
Re: Excessive stall times on ext4 in 3.9-rc2
On Tue, Apr 23, 2013 at 11:50:19AM -0400, Theodore Ts'o wrote:
> On Tue, Apr 23, 2013 at 04:33:05PM +0100, Mel Gorman wrote:
> > That's a pretty big drop but it gets bad again for the second worst stall --
> > wait_on_page_bit as a result of generic_file_buffered_write.
> >
> > Vanilla kernel 1336064 ms stalled with 109 events
> > Patched kernel 2338781 ms stalled with 164 events
>
> Do you have the stack trace for this stall? I'm wondering if this is
> caused by the waiting for stable pages in write_begin(), or something
> else.

[] wait_on_page_bit+0x78/0x80
[] kretprobe_trampoline+0x0/0x4c
[] generic_file_buffered_write+0x114/0x2a0
[] __generic_file_aio_write+0x1bd/0x3c0
[] generic_file_aio_write+0x7a/0xf0
[] ext4_file_write+0x99/0x420
[] do_sync_write+0xa7/0xe0
[] vfs_write+0xa7/0x180
[] sys_write+0x4d/0x90
[] system_call_fastpath+0x1a/0x1f
[] 0x

The processes that stalled in this particular trace are wget,
latency-output, tar and tclsh. Most of these are sequential writers except
for tar which is both a sequential reader and a sequential writer.

> If it is blocking caused by stable page writeback that's interesting,
> since it would imply that something in your workload is trying to
> write to a page that has already been modified (i.e., appending to a
> log file, or updating a database file). Does that make sense given
> what your workload might be running?

I doubt it is stable writes considering the type of processes that are
running. I would expect the bulk of the activity to be sequential readers
or writers of multiple files. The summarised report from the raw data is
now at

http://www.csn.ul.ie/~mel/postings/ext4tag-20130423/dstate-summary-vanilla
http://www.csn.ul.ie/~mel/postings/ext4tag-20130423/dstate-summary-ext4tag

It's an aside, but the worst of the stalls are incurred by
systemd-tmpfiles, which was not a deliberate part of the test and yet
another thing that I would not have caught unless I was running tests on
my laptop.
Looking closer at that thing, the default configuration is to run the
service 15 minutes after boot and after that it runs once a day. It looks
like the bulk of the scanning would be in /var/tmp/ looking at systemd's
own files (over 3000 of them) which I'm a little amused by. My normal test
machines would not hit this because they are not systemd based, but the
existence of this thing is worth noting. Any IO-based tests run on
systemd-based distributions may give different results depending on
whether this service triggered during the test or not.

--
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
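The dstate-summary reports linked above amount to grouping stall samples
by stack trace and ranking by total stalled time. A sketch of that
aggregation (the stack signatures and millisecond values below are purely
illustrative, not taken from the real reports):

```python
from collections import defaultdict

# Hypothetical (stack_signature, stall_ms) samples, in the spirit of
# what the dstate systemtap monitor records.
samples = [
    ("sleep_on_buffer/ext4_bread", 1200),
    ("wait_on_page_bit/generic_file_buffered_write", 300),
    ("sleep_on_buffer/ext4_bread", 900),
]

totals = defaultdict(lambda: [0, 0])  # stack -> [total_ms, event_count]
for stack, ms in samples:
    totals[stack][0] += ms
    totals[stack][1] += 1

# Rank by total stalled time, worst first.
for stack, (ms, n) in sorted(totals.items(), key=lambda kv: -kv[1][0]):
    print(f"{ms:>6} ms over {n} events  {stack}")
```

Ranking by total rather than per-event time is what makes a rare but
enormous stall (like the sleep_on_buffer case) float to the top.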
Re: Excessive stall times on ext4 in 3.9-rc2
On Tue, Apr 23, 2013 at 04:33:05PM +0100, Mel Gorman wrote:
> That's a pretty big drop but it gets bad again for the second worst stall --
> wait_on_page_bit as a result of generic_file_buffered_write.
>
> Vanilla kernel 1336064 ms stalled with 109 events
> Patched kernel 2338781 ms stalled with 164 events

Do you have the stack trace for this stall? I'm wondering if this is
caused by the waiting for stable pages in write_begin(), or something
else.

If it is blocking caused by stable page writeback that's interesting,
since it would imply that something in your workload is trying to
write to a page that has already been modified (i.e., appending to a
log file, or updating a database file). Does that make sense given
what your workload might be running?

- Ted
Re: Excessive stall times on ext4 in 3.9-rc2
On Sat, Apr 20, 2013 at 08:05:22PM -0400, Theodore Ts'o wrote:
> An alternate solution which I've been playing around adds buffer_head
> flags so we can indicate that a buffer contains metadata and/or should
> have I/O submitted with the REQ_PRIO flag set.

I beefed up the reporting slightly and tested the patches comparing
3.9-rc6 vanilla with your patches. The full report with graphs is at

http://www.csn.ul.ie/~mel/postings/ext4tag-20130423/report.html

                      3.9.0-rc6            3.9.0-rc6
                        vanilla              ext4tag
User    min        0.00 (  0.00%)       0.00 (  0.00%)
User    mean        nan (   nan%)        nan (   nan%)
User    stddev      nan (   nan%)        nan (   nan%)
User    max        0.00 (  0.00%)       0.00 (  0.00%)
User    range      0.00 (  0.00%)       0.00 (  0.00%)
System  min        9.14 (  0.00%)       9.13 (  0.11%)
System  mean       9.60 (  0.00%)       9.73 ( -1.33%)
System  stddev     0.39 (  0.00%)       0.94 (-142.69%)
System  max       10.31 (  0.00%)      11.58 ( -12.32%)
System  range      1.17 (  0.00%)       2.45 (-109.40%)
Elapsed min      665.54 (  0.00%)     612.25 (  8.01%)
Elapsed mean     775.35 (  0.00%)     688.01 ( 11.26%)
Elapsed stddev    69.11 (  0.00%)      58.22 ( 15.75%)
Elapsed max      858.40 (  0.00%)     773.06 (  9.94%)
Elapsed range    192.86 (  0.00%)     160.81 ( 16.62%)
CPU     min        3.00 (  0.00%)       3.00 (  0.00%)
CPU     mean       3.60 (  0.00%)       4.20 ( -16.67%)
CPU     stddev     0.49 (  0.00%)       0.75 ( -52.75%)
CPU     max        4.00 (  0.00%)       5.00 ( -25.00%)
CPU     range      1.00 (  0.00%)       2.00 (-100.00%)

The patches appear to improve the git checkout times slightly but this
test is quite variable. The vmstat figures report some reclaim activity
but if you look at the graphs further down you will see that the bulk of
the kswapd reclaim scan and steal activity is at the start of the test
when it's downloading and untarring a git tree to work with. (I also note
that the mouse-over graph for direct reclaim efficiency is broken but
it's not important right now.)
From iostat

                     3.9.0-rc6    3.9.0-rc6
                       vanilla      ext4tag
Mean  dm-0-avgqz         1.18         1.19
Mean  dm-0-await        17.30        16.50
Mean  dm-0-r_await      17.30        16.50
Mean  dm-0-w_await       0.94         0.48
Mean  sda-avgqz        650.29       719.81
Mean  sda-await       2501.33      2597.23
Mean  sda-r_await       30.01        24.91
Mean  sda-w_await    11228.80     11120.64
Max   dm-0-avgqz        12.30        10.14
Max   dm-0-await        42.65        52.23
Max   dm-0-r_await      42.65        52.23
Max   dm-0-w_await     541.00       263.83
Max   sda-avgqz       3811.93      3375.11
Max   sda-await       7178.61      7170.44
Max   sda-r_await      384.37       297.85
Max   sda-w_await    51353.93     50338.25

There are no really obvious massive advantages to me there and if you look
at the graphs for the avgqz, await etc. over time, the patched kernel is
not obviously better. The Wait CPU usage looks roughly the same too.

On the more positive side, the dstate systemtap monitor script tells me
that all processes were stalled for less time -- 9575 seconds versus
10910. The most severe event to stall on is sleep_on_buffer() as a result
of ext4_bread.

Vanilla kernel 3325677 ms stalled with 57 events
Patched kernel 2411471 ms stalled with 42 events

That's a pretty big drop but it gets bad again for the second worst stall --
wait_on_page_bit as a result of generic_file_buffered_write.

Vanilla kernel 1336064 ms stalled with 109 events
Patched kernel 2338781 ms stalled with 164 events

So conceptually the patches make sense but the first set of tests do not
indicate that they'll fix the problem and the stall times do not indicate
that interactivity will be any better. I'll still apply them and boot them
on my main work machine and see how they "feel" this evening.

--
Mel Gorman
SUSE Labs
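One way to read those two pairs of stall figures is to normalise by event
count. A back-of-the-envelope check using only the numbers quoted above:

```python
# Figures quoted in the mail: (total_ms_stalled, event_count)
stalls = {
    "sleep_on_buffer vanilla":  (3325677, 57),
    "sleep_on_buffer patched":  (2411471, 42),
    "wait_on_page_bit vanilla": (1336064, 109),
    "wait_on_page_bit patched": (2338781, 164),
}

# Average seconds stalled per event.
for name, (ms, n) in stalls.items():
    print(f"{name}: {ms / n / 1000:.1f} s average per stall")
```

Per-event severity for sleep_on_buffer barely moves (roughly 58 s versus
57 s per stall; the win is fewer events), while the wait_on_page_bit
stalls become both more frequent and longer on average, which matches the
"interactivity will not be any better" conclusion.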
Re: Excessive stall times on ext4 in 3.9-rc2
On Mon, Apr 22, 2013 at 06:42:23PM -0400, Jeff Moyer wrote:
> > 3. The blktrace indicates that reads can starve writes from flusher
> >
> >    While there are people that can look at a blktrace and find problems
> >    like they are rain man, I'm more like an ADHD squirrel when looking at
> >    a trace. I wrote a script to look for what unrelated requests completed
> >    while a request got stalled for over a second. It seemed like something
> >    that a tool should already exist for but I didn't find one unless btt
> >    can give the information somehow.
>
> Care to share that script?

I would have preferred not to because it is an ugly hatchet job churned
out in a few minutes. It's written in perl and uses the text output from
blkparse, making it slow. It uses an excessive amount of memory because I
was taking shortcuts so it is resource heavy. It ignores most of the
information from blkparse and so there are gaps in what it reports. Even
though it's dogshit, it was useful in this particular case so I added it
to mmtests anyway. Be aware that it takes ages to run and you might want
to break the blkparse output into pieces.

It's used something like

blkparse -i blktrace-log > blkparse-log
cat blkparse-log | $PATH_TO_MMTESTS/subreport/blktrace-queue-watch.pl

> [snip]
>
> > I recognise that the output will have a WTF reaction but the key
> > observations to me are
> >
> > a) a single write request from flusher took over a second to complete
> > b) at the time it was queued, it was mostly other writes that were in
> >    the queue at the same time
> > c) the write request and the parallel writes were all asynchronous write
> >    requests
> > d) at the time the request completed, there were a LARGE number of
> >    other requests queued and completed at the same time.
> >
> > Of the requests queued and completed in the meantime the breakdown was
> >
> >  22 RM
> >  31 RA
> >  82 W
> > 445 R
> >
> > If I'm reading this correctly, it is saying that 22 reads were merged (RM),
> > 31 reads were remapped to another device (RA) which is probably reads from
> > the dm-crypt partition, 82 were writes (W) which is not far off the number
> > of writes that were in the queue and 445 were other reads. The delay was
> > dominated by reads that were queued after the write request and completed
> > before it.
>
> RM == Read Meta
> RA == Read Ahead (remapping, by the way, does not happen across
>       devices, just into partitions)
> W and R you understood correctly.

Thanks for those corrections. I misread the meaning of the action
identifiers section of the blkparse manual. I should have double checked
the source.

> >> The thing is, we do want to make ext4 work well with cfq, and
> >> prioritizing non-readahead read requests ahead of data writeback does
> >> make sense. The issue is that metadata writes going through
> >> the block device could in some cases effectively cause a priority
> >> inversion when what had previously been an asynchronous writeback
> >> starts blocking a foreground, user-visible process.
> >>
> >> At least, that's the theory;
> >
> > I *think* the data more or less confirms the theory but it'd be nice if
> > someone else double checked in case I'm seeing what I want to see
> > instead of what is actually there.
>
> Looks sane. You can also see a lot of "preempt"s in the blkparse
> output, which indicates exactly what you're saying. Any sync request
> gets priority over the async requests.

Good to know.
> I'll also note that even though your I/O is going all over the place
> (D2C is pretty bad, 14ms), most of the time is spent waiting for a
> struct request allocation or between Queue and Merge:
>
> All Devices
>
>     ALL     MIN           AVG            MAX              N
>     ---     ---           ---            ---              ---
>
>     Q2Q     0.1           0.000992259    8.898375882      2300861
>     Q2G     0.00843       10.193261239   2064.079501935   1016463  <

This is not normally my sandbox so do you mind spelling this out?

IIUC, the time to allocate the struct request from the slab cache is just a
small portion of this time. The bulk of the time is spent in get_request()
waiting for congestion to clear on the request list for either the sync or
async queue. Once a process goes to sleep on that waitqueue, it has to wait
until enough requests on that queue have been serviced before it gets woken
again, at which point it gets priority access to prevent further starvation.
This is the Queue To Get Request (Q2G) delay. What we may be seeing here
is that the async queue was congested and, on average, we are waiting for
10 seconds for it to clear. The maximum value may be bogus for reasons
explained later.

Is that accurate?

>     G2I     0.00461       0.44702        3.237065090      1015803
>     Q2M     0.00101       8.203147238
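The idea behind the queue-watch script discussed above is simple even if
the perl is not: for any request whose queue-to-complete time exceeds a
threshold, list the requests that were queued no earlier and completed no
later. A rough Python sketch of that logic (the request tuples here are
invented; the real script parses blkparse text output):

```python
# (request_id, queued_at_s, completed_at_s) -- illustrative data showing
# reads overtaking a stalled async write from the flusher.
requests = [
    ("flusher-W", 0.0, 1.4),   # async write stalled for over a second
    ("read-1",    0.2, 0.3),
    ("read-2",    0.5, 0.6),
    ("write-1",   0.1, 1.5),
]

STALL = 1.0  # report requests stalled longer than this (seconds)
for rid, q, c in requests:
    if c - q > STALL:
        # Requests that fit entirely inside this one's stall window,
        # i.e. were queued after it and still completed before it.
        overtakers = [o for o, oq, oc in requests
                      if o != rid and oq >= q and oc <= c]
        print(f"{rid} stalled {c - q:.1f}s; completed meanwhile: {overtakers}")
```

With real blkparse input you would also keep the action type per request,
so the overtakers can be broken down into R/RA/RM/W as in the mail.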
Re: Excessive stall times on ext4 in 3.9-rc2
On Mon 22-04-13 18:42:23, Jeff Moyer wrote:
> Jan, if I were to come up with a way of promoting a particular async
> queue to the front of the line, where would I put such a call in the
> ext4/jbd2 code to be effective?

As Ted wrote, the simplest might be to put this directly in
__lock_buffer(). Something like:

diff --git a/fs/buffer.c b/fs/buffer.c
index b4dcb34..e026a3e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -69,6 +69,12 @@ static int sleep_on_buffer(void *word)
 
 void __lock_buffer(struct buffer_head *bh)
 {
+	/*
+	 * Likely under async writeback? Tell io scheduler we are
+	 * now waiting for the IO...
+	 */
+	if (PageWriteback(bh->b_page))
+		io_now_sync(bh->b_bdev, bh->b_blocknr);
 	wait_on_bit_lock(&bh->b_state, BH_Lock, sleep_on_buffer,
 							TASK_UNINTERRUPTIBLE);
 }

								Honza
--
Jan Kara
SUSE Labs, CR
Re: Excessive stall times on ext4 in 3.9-rc2
On Mon, Apr 22, 2013 at 06:42:23PM -0400, Jeff Moyer wrote:
> > 3. The blktrace indicates that reads can starve writes from flusher
> >
> >    While there are people that can look at a blktrace and find problems
> >    like they are rain man, I'm more like an ADHD squirrel when looking at
> >    a trace. I wrote a script to look for what unrelated requests completed
> >    while a request got stalled for over a second. It seemed like something
> >    that a tool should already exist for but I didn't find one unless btt
> >    can give the information somehow.
> 
> Care to share that script?

I would have preferred not to because it is an ugly hatchet job churned
out in a few minutes. It's written in perl and uses the text output from
blkparse, making it slow. It uses an excessive amount of memory because
I was taking shortcuts, so it is resource heavy. It ignores most of the
information from blkparse and so there are gaps in what it reports. Even
though it's dogshit, it was useful in this particular case so I added it
to mmtests anyway. Be aware that it takes ages to run and you might want
to break the blkparse output into pieces. It's used something like

blkparse -i blktrace-log > blkparse-log
cat blkparse-log | $PATH_TO_MMTESTS/subreport/blktrace-queue-watch.pl

[snip]

> > I recognise that the output will have a WTF reaction but the key
> > observations to me are
> >
> > a) a single write request from flusher took over a second to complete
> > b) at the time it was queued, it was mostly other writes that were in
> >    the queue at the same time
> > c) the write request and the parallel writes were all asynchronous
> >    write requests
> > d) at the time the request completed, there were a LARGE number of
> >    other requests queued and completed at the same time.
> > Of the requests queued and completed in the meantime the breakdown was
> >
> >      22 RM
> >      31 RA
> >      82 W
> >     445 R
> >
> > If I'm reading this correctly, it is saying that 22 reads were merged
> > (RM), 31 reads were remapped to another device (RA) which is probably
> > reads from the dm-crypt partition, 82 were writes (W) which is not far
> > off the number of writes that were in the queue and 445 were other
> > reads. The delay was dominated by reads that were queued after the
> > write request and completed before it.
> 
> RM == Read Meta
> RA == Read Ahead (remapping, by the way, does not happen across devices,
>                   just into partitions)
> 
> W and R you understood correctly.

Thanks for those corrections. I misread the meaning of the action
identifiers section of the blkparse manual. I should have double checked
the source.

> [snip]
> 
> >> The thing is, we do want to make ext4 work well with cfq, and
> >> prioritizing non-readahead read requests ahead of data writeback does
> >> make sense.  The issue is that metadata writes going through the
> >> block device could in some cases effectively cause a priority
> >> inversion when what had previously been an asynchronous writeback
> >> starts blocking a foreground, user-visible process.
> >>
> >> At least, that's the theory;
> >
> > I *think* the data more or less confirms the theory but it'd be nice if
> > someone else double checked in case I'm seeing what I want to see
> > instead of what is actually there.
> 
> Looks sane.  You can also see a lot of "preempt"s in the blkparse
> output, which indicates exactly what you're saying.  Any sync request
> gets priority over the async requests.

Good to know.

> I'll also note that even though your I/O is going all over the place
> (D2C is pretty bad, 14ms), most of the time is spent waiting for a
> struct request allocation or between Queue and Merge:
> 
> All Devices
> 
>             ALL           MIN           AVG           MAX           N
> --------------- ------------- ------------- ------------- -----------
> 
> Q2Q               0.000000001   0.000992259   8.898375882     2300861
> Q2G               0.000000843  10.193261239 2064.079501935    1016463 <

This is not normally my sandbox so do you mind spelling this out?
IIUC, the time to allocate the struct request from the slab cache is just
a small portion of this time. The bulk of the time is spent in
get_request() waiting for congestion to clear on the request list for
either the sync or async queue. Once a process goes to sleep on that
waitqueue, it has to wait until enough requests on that queue have been
serviced before it gets woken again, at which point it gets priority
access to prevent further starvation. This is the Queue To Get Request
(Q2G) delay. What we may be seeing here is that the async queue was
congested and on average, we are waiting for 10 seconds for it to clear.
The maximum value may be bogus for reasons explained later.

Is that accurate?

> G2I               0.000000461   0.000044702   3.237065090     1015803
> Q2M               0.000000101   8.203147238 2064.079367557    1311662
> I2D               0.000002012   1.476824812 2064.089774419    1014890
> M2D               0.000003283
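For what it's worth, the core idea of the script Mel describes above can be sketched in a few lines. This is an illustrative Python rewrite, not the actual perl tool in mmtests, and it assumes a deliberately simplified event format of "time action rwbs sector" (real blkparse output carries more fields: device, CPU, sequence number, PID, sector + length, process name).

```python
# Rough sketch of the stall-watching idea: track each request from
# Q (queued) to C (completed) by sector, flag any that took over a
# second, and report what else completed while it waited.
from collections import Counter

STALL_THRESHOLD = 1.0  # seconds

def watch(lines):
    queued = {}        # sector -> (queue time, rwbs)
    completions = []   # (time, rwbs)
    stalls = []
    for line in lines:
        time_s, action, rwbs, sector = line.split()
        t, sector = float(time_s), int(sector)
        if action == 'Q':
            queued[sector] = (t, rwbs)
        elif action == 'C' and sector in queued:
            t0, q_rwbs = queued.pop(sector)
            completions.append((t, rwbs))
            if t - t0 > STALL_THRESHOLD:
                # breakdown of everything that completed while we waited
                meanwhile = Counter(r for (tc, r) in completions
                                    if t0 < tc < t)
                stalls.append((sector, t - t0, q_rwbs, meanwhile))
    return stalls

trace = [
    "0.0 Q W 1000",   # async write queued by flusher
    "0.1 Q R 2000",
    "0.2 C R 2000",   # reads queued later complete first
    "0.3 Q RA 3000",
    "0.4 C RA 3000",
    "1.5 C W 1000",   # the write finally completes, 1.5s later
]
for sector, delay, rwbs, meanwhile in watch(trace):
    print(sector, round(delay, 1), rwbs, dict(meanwhile))
    # -> 1000 1.5 W {'R': 1, 'RA': 1}
```

The breakdown printed for each stalled request corresponds to the "22 RM / 31 RA / 82 W / 445 R" style summaries quoted earlier in the thread.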
Re: Excessive stall times on ext4 in 3.9-rc2
On Sat, Apr 20, 2013 at 08:05:22PM -0400, Theodore Ts'o wrote:
> An alternate solution which I've been playing around with adds
> buffer_head flags so we can indicate that a buffer contains metadata
> and/or should have I/O submitted with the REQ_PRIO flag set.

I beefed up the reporting slightly and tested the patches comparing
3.9-rc6 vanilla with your patches. The full report with graphs is at
http://www.csn.ul.ie/~mel/postings/ext4tag-20130423/report.html

                       3.9.0-rc6             3.9.0-rc6
                         vanilla               ext4tag
User    min          0.00 (  0.00%)        0.00 (  0.00%)
User    mean          nan (   nan%)         nan (   nan%)
User    stddev        nan (   nan%)         nan (   nan%)
User    max          0.00 (  0.00%)        0.00 (  0.00%)
User    range        0.00 (  0.00%)        0.00 (  0.00%)
System  min          9.14 (  0.00%)        9.13 (  0.11%)
System  mean         9.60 (  0.00%)        9.73 ( -1.33%)
System  stddev       0.39 (  0.00%)        0.94 (-142.69%)
System  max         10.31 (  0.00%)       11.58 ( -12.32%)
System  range        1.17 (  0.00%)        2.45 (-109.40%)
Elapsed min        665.54 (  0.00%)      612.25 (  8.01%)
Elapsed mean       775.35 (  0.00%)      688.01 ( 11.26%)
Elapsed stddev      69.11 (  0.00%)       58.22 ( 15.75%)
Elapsed max        858.40 (  0.00%)      773.06 (  9.94%)
Elapsed range      192.86 (  0.00%)      160.81 ( 16.62%)
CPU     min          3.00 (  0.00%)        3.00 (  0.00%)
CPU     mean         3.60 (  0.00%)        4.20 ( -16.67%)
CPU     stddev       0.49 (  0.00%)        0.75 ( -52.75%)
CPU     max          4.00 (  0.00%)        5.00 ( -25.00%)
CPU     range        1.00 (  0.00%)        2.00 (-100.00%)

The patches appear to improve the git checkout times slightly but this
test is quite variable. The vmstat figures report some reclaim activity
but if you look at the graphs further down you will see that the bulk of
the kswapd reclaim scan and steal activity is at the start of the test
when it's downloading and untarring a git tree to work with. (I also note
that the mouse-over graph for direct reclaim efficiency is broken but
it's not important right now.)
From iostat

                       3.9.0-rc6   3.9.0-rc6
                         vanilla     ext4tag
Mean  dm-0-avgqz            1.18        1.19
Mean  dm-0-await           17.30       16.50
Mean  dm-0-r_await         17.30       16.50
Mean  dm-0-w_await          0.94        0.48
Mean  sda-avgqz           650.29      719.81
Mean  sda-await          2501.33     2597.23
Mean  sda-r_await          30.01       24.91
Mean  sda-w_await       11228.80    11120.64
Max   dm-0-avgqz           12.30       10.14
Max   dm-0-await           42.65       52.23
Max   dm-0-r_await         42.65       52.23
Max   dm-0-w_await        541.00      263.83
Max   sda-avgqz          3811.93     3375.11
Max   sda-await          7178.61     7170.44
Max   sda-r_await         384.37      297.85
Max   sda-w_await       51353.93    50338.25

There are no really obvious massive advantages to me there and if you
look at the graphs for the avgqz, await etc. over time, the patched
kernel is not obviously better. The Wait CPU usage looks roughly the
same too.

On the more positive side, the dstate systemtap monitor script tells me
that all processes were stalled for less time -- 9575 seconds versus
10910. The most severe event to stall on is sleep_on_buffer() as a
result of ext4_bread.

Vanilla kernel 3325677 ms stalled with 57 events
Patched kernel 2411471 ms stalled with 42 events

That's a pretty big drop but it gets bad again for the second worst
stall -- wait_on_page_bit as a result of generic_file_buffered_write.

Vanilla kernel 1336064 ms stalled with 109 events
Patched kernel 2338781 ms stalled with 164 events

So conceptually the patches make sense but the first set of tests do not
indicate that they'll fix the problem and the stall times do not
indicate that interactivity will be any better. I'll still apply them
and boot them on my main work machine and see how they feel this
evening.

-- 
Mel Gorman
SUSE Labs
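As a reading aid for the first comparison table in this message: mmtests reports deltas relative to the first (vanilla) kernel, so positive percentages are improvements. Note the report computes percentages from unrounded data, so recomputing from the two-decimal numbers shown only roughly matches; the "Elapsed mean" row happens to reproduce exactly.

```python
# How the percentage columns in the comparison table are derived:
# improvement of the patched kernel relative to vanilla (lower is better).

def pct_gain(vanilla, patched):
    """Percentage improvement of `patched` over `vanilla`."""
    return (vanilla - patched) / vanilla * 100.0

# "Elapsed mean": 775.35s vanilla vs 688.01s ext4tag
print(round(pct_gain(775.35, 688.01), 2))  # -> 11.26, matching the report
```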
Re: Excessive stall times on ext4 in 3.9-rc2
On Tue, Apr 23, 2013 at 04:33:05PM +0100, Mel Gorman wrote:
> That's a pretty big drop but it gets bad again for the second worst
> stall -- wait_on_page_bit as a result of generic_file_buffered_write.
>
> Vanilla kernel 1336064 ms stalled with 109 events
> Patched kernel 2338781 ms stalled with 164 events

Do you have the stack trace for this stall? I'm wondering if this is
caused by the waiting for stable pages in write_begin(), or something
else. If it is blocking caused by stable page writeback that's
interesting, since it would imply that something in your workload is
trying to write to a page that has already been modified (i.e.,
appending to a log file, or updating a database file). Does that make
sense given what your workload might be running?

					- Ted
Re: Excessive stall times on ext4 in 3.9-rc2
On Tue, Apr 23, 2013 at 11:50:19AM -0400, Theodore Ts'o wrote:
> On Tue, Apr 23, 2013 at 04:33:05PM +0100, Mel Gorman wrote:
> > That's a pretty big drop but it gets bad again for the second worst
> > stall -- wait_on_page_bit as a result of generic_file_buffered_write.
> >
> > Vanilla kernel 1336064 ms stalled with 109 events
> > Patched kernel 2338781 ms stalled with 164 events
> 
> Do you have the stack trace for this stall? I'm wondering if this is
> caused by the waiting for stable pages in write_begin(), or something
> else.

[81110238] wait_on_page_bit+0x78/0x80
[815af294] kretprobe_trampoline+0x0/0x4c
[81110e84] generic_file_buffered_write+0x114/0x2a0
[8ccd] __generic_file_aio_write+0x1bd/0x3c0
[8f4a] generic_file_aio_write+0x7a/0xf0
[811ee639] ext4_file_write+0x99/0x420
[81174d87] do_sync_write+0xa7/0xe0
[81175447] vfs_write+0xa7/0x180
[811758cd] sys_write+0x4d/0x90
[815b3eed] system_call_fastpath+0x1a/0x1f
[] 0x

The processes that stalled in this particular trace are wget,
latency-output, tar and tclsh. Most of these are sequential writers
except for tar which is both a sequential reader and a sequential
writer.

> If it is blocking caused by stable page writeback that's interesting,
> since it would imply that something in your workload is trying to write
> to a page that has already been modified (i.e., appending to a log
> file, or updating a database file). Does that make sense given what
> your workload might be running?

I doubt it is stable writes considering the type of processes that are
running. I would expect the bulk of the activity to be sequential
readers or writers of multiple files. The summarised report from the
raw data is now at

http://www.csn.ul.ie/~mel/postings/ext4tag-20130423/dstate-summary-vanilla
http://www.csn.ul.ie/~mel/postings/ext4tag-20130423/dstate-summary-ext4tag

It's an aside, but the worst of the stalls are incurred by
systemd-tmpfiles which was not a deliberate part of the test and is yet
another thing that I would not have caught unless I was running tests on
my laptop.
Looking closer at that thing, the default configuration is to run the
service 15 minutes after boot and after that it runs once a day. It
looks like the bulk of the scanning would be in /var/tmp/ looking at
systemd's own files (over 3000 of them) which I'm a little amused by. My
normal test machines would not hit this because they are not systemd
based but the existence of this thing is worth noting. Any IO-based
tests run on systemd-based distributions may give different results
depending on whether this service triggered during the test or not.

-- 
Mel Gorman
SUSE Labs
Re: Excessive stall times on ext4 in 3.9-rc2
On Mon, Apr 22, 2013 at 06:42:23PM -0400, Jeff Moyer wrote:
> 
> Jan, if I were to come up with a way of promoting a particular async
> queue to the front of the line, where would I put such a call in the
> ext4/jbd2 code to be effective?

Well, I thought we had discussed trying to bump a pending I/O
automatically when there was an attempt to call lock_buffer() on the bh?
That would be ideal, because we could keep the async writeback low
priority until someone is trying to wait upon it, at which point
obviously it should no longer be considered an async write call.

Failing that, this is something I've been toying with; what do you
think?

http://patchwork.ozlabs.org/patch/238192/
http://patchwork.ozlabs.org/patch/238257/

(The first patch in the series just makes sure that allocation bitmap
reads are marked with the META/PRIO flags. It's not strictly speaking
related to the problem discussed here, but for completeness:
http://patchwork.ozlabs.org/patch/238193/)

					- Ted
Re: Excessive stall times on ext4 in 3.9-rc2
Mel Gorman writes:
> (Adding Jeff Moyer to the cc as I'm told he is interested in the
> blktrace)

Thanks.  I've got a few comments and corrections for you below.

> TLDR: Flusher writes pages very quickly after processes dirty a buffer.
> Reads starve flusher writes.

[snip]

> 3. The blktrace indicates that reads can starve writes from flusher
>
>    While there are people that can look at a blktrace and find problems
>    like they are rain man, I'm more like an ADHD squirrel when looking
>    at a trace. I wrote a script to look for what unrelated requests
>    completed while a request got stalled for over a second. It seemed
>    like something that a tool should already exist for but I didn't
>    find one unless btt can give the information somehow.

Care to share that script?

[snip]

> I recognise that the output will have a WTF reaction but the key
> observations to me are
>
> a) a single write request from flusher took over a second to complete
> b) at the time it was queued, it was mostly other writes that were in
>    the queue at the same time
> c) the write request and the parallel writes were all asynchronous
>    write requests
> d) at the time the request completed, there were a LARGE number of
>    other requests queued and completed at the same time.
>
> Of the requests queued and completed in the meantime the breakdown was
>
>      22 RM
>      31 RA
>      82 W
>     445 R
>
> If I'm reading this correctly, it is saying that 22 reads were merged
> (RM), 31 reads were remapped to another device (RA) which is probably
> reads from the dm-crypt partition, 82 were writes (W) which is not far
> off the number of writes that were in the queue and 445 were other
> reads. The delay was dominated by reads that were queued after the
> write request and completed before it.

RM == Read Meta
RA == Read Ahead (remapping, by the way, does not happen across devices,
                  just into partitions)

W and R you understood correctly.
> That's saying that the 27128th request in the trace took over 7 seconds
> to complete and was an asynchronous write from flusher. The contents of
> the queue are displayed at that time and the breakdown of requests is
>
>      23 WS   [JEM: write sync]
>      86 RM   [JEM: Read Meta]
>     124 RA   [JEM: Read Ahead]
>     442 W
>    1931 R
>
> 7 seconds later when it was completed the breakdown of completed
> requests was
>
>      25 WS
>     114 RM
>     155 RA
>     408 W
>    2457 R
>
> In combination, that confirms for me that asynchronous writes from
> flush are being starved by reads. When a process requires a buffer that
> is locked by that asynchronous write from flusher, it stalls.
>
>> The thing is, we do want to make ext4 work well with cfq, and
>> prioritizing non-readahead read requests ahead of data writeback does
>> make sense.  The issue is that metadata writes going through
>> the block device could in some cases effectively cause a priority
>> inversion when what had previously been an asynchronous writeback
>> starts blocking a foreground, user-visible process.
>>
>> At least, that's the theory;
>
> I *think* the data more or less confirms the theory but it'd be nice if
> someone else double checked in case I'm seeing what I want to see
> instead of what is actually there.

Looks sane.  You can also see a lot of "preempt"s in the blkparse
output, which indicates exactly what you're saying.  Any sync request
gets priority over the async requests.
I'll also note that even though your I/O is going all over the place
(D2C is pretty bad, 14ms), most of the time is spent waiting for a
struct request allocation or between Queue and Merge:

All Devices

            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------

Q2Q               0.000000001   0.000992259   8.898375882     2300861
Q2G               0.000000843  10.193261239 2064.079501935    1016463 <
G2I               0.000000461   0.000044702   3.237065090     1015803
Q2M               0.000000101   8.203147238 2064.079367557    1311662
I2D               0.000002012   1.476824812 2064.089774419    1014890
M2D               0.000003283   6.994306138  283.573348664    1284872
D2C               0.000061889   0.014438316    0.857811758    2291996
Q2C               0.000072284  13.363007244 2064.092228625    2292191

Device Overhead

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8,  0) |  33.8259%   0.0001%  35.1275%   4.8932%   0.1080%
---------- | --------- --------- --------- --------- ---------
   Overall |  33.8259%   0.0001%  35.1275%   4.8932%   0.1080%

I'm not sure I believe that max value.  2064 seconds seems a bit high.
Also, Q2M should not be anywhere near that big, so more investigation is
required there.  A quick look over the data doesn't show any such delays
(making me question the
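For readers unfamiliar with btt, the two-letter rows above are per-request deltas between blktrace events: Q(ueued), G(et request), I(nserted into elevator), M(erged into an existing request), D(ispatched), C(ompleted). A synthetic single-request timeline makes the accounting concrete; the timestamps here are made up, chosen only to echo the averages in the table:

```python
# How btt's stage columns relate to one request's blktrace events.
# Timeline is illustrative, not taken from the trace.

events = {'Q': 0.000, 'G': 10.193, 'I': 10.194, 'D': 11.670, 'C': 11.684}

q2g = events['G'] - events['Q']  # waiting for a struct request allocation
g2i = events['I'] - events['G']  # allocation to IO scheduler insertion
i2d = events['D'] - events['I']  # sitting in the scheduler queues
d2c = events['C'] - events['D']  # on the hardware
q2c = events['C'] - events['Q']  # total latency seen by the submitter

# The stages partition the total: Q2G + G2I + I2D + D2C == Q2C.
assert abs((q2g + g2i + i2d + d2c) - q2c) < 1e-9

print(round(q2g, 3), round(d2c, 3))  # -> 10.193 0.014
```

With numbers shaped like the trace's averages, Q2G dwarfs D2C, which is Jeff's point: the device itself is fast; the time is lost before the request even exists.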
Re: Excessive stall times on ext4 in 3.9-rc2
(Adding Jeff Moyer to the cc as I'm told he is interested in the blktrace)

On Fri, Apr 12, 2013 at 11:19:52AM -0400, Theodore Ts'o wrote:
> On Fri, Apr 12, 2013 at 02:50:42PM +1000, Dave Chinner wrote:
> > > If that is the case, one possible solution that comes to mind would
> > > be to mark buffer_heads that contain metadata with a flag, so that
> > > the flusher thread can write them back at the same priority as reads.
> > 
> > Ext4 is already using REQ_META for this purpose.
> 
> We're using REQ_META | REQ_PRIO for reads, not writes.
> 
> > I'm surprised that no-one has suggested "change the IO elevator"
> > yet.
> 
> Well, testing to see if the stalls go away with the noop scheduler is a
> good thing to try just to validate the theory.

I still haven't tested with a different elevator. While this bug is
relatively high priority for me, there are still other issues in the way.

TLDR: Flusher writes pages very quickly after processes dirty a buffer.
Reads starve flusher writes.

Now the ugliness and being a windbag. I collected blktrace and some other
logs and they are available at
http://www.csn.ul.ie/~mel/postings/stalls-20130419/log.tar.gz
and there is a lot of stuff in there.

The unix time the test started is in the first line of the file
tests-timestamp-bisect. This can help figure out how far into the test
some of the other timestamped logs are. The kernel log with information
from the lock_buffer debugging patch is in dmesg-bisect-gitcheckout. The
information in it is race-prone and cannot be 100% trusted but it's
still useful. iostat is in iostat-bisect-gitcheckout.

Here are a few observations I got from the data.

1. flushers appear to acquire buffer locks *very* quickly after an
   application writes.
Look for lines that look like "foo failed trylock without holder
released 0 ms ago acquired 0 ms ago by bar". There are a lot of entries
like this

jbd2 239 flush-8:0 failed trylock without holder released 0 ms ago acquired 0 ms ago by 2124 tar
jbd2 239 flush-8:0 failed trylock without holder released 0 ms ago acquired 0 ms ago by 2124 tar
jbd2 2124 tar failed trylock without holder released 0 ms ago acquired 0 ms ago by 239 flush-8:0
jbd2 239 flush-8:0 failed trylock without holder released 0 ms ago acquired 0 ms ago by 2124 tar
jbd2 239 flush-8:0 failed trylock without holder released 0 ms ago acquired 0 ms ago by 2124 tar
jbd2 239 flush-8:0 failed trylock without holder released 0 ms ago acquired 0 ms ago by 2124 tar
jbd2 239 flush-8:0 failed trylock without holder released 0 ms ago acquired 0 ms ago by 2124 tar

I expected flushers to be writing back the buffers just released in
about 5 seconds time, not immediately. It may indicate that when
flushers wake to clean expired inodes, they keep cleaning inodes as
they are being dirtied.

2. The flush thread can prevent a process making forward progress for a
   long time.

Take this as an example

jbd2 stalled dev 8,8 for 8168 ms lock holdtime 20692 ms
Last Owner 239 flush-8:0 Acquired Stack
 [] save_stack_trace+0x2a/0x50
 [] set_lock_buffer_owner+0x86/0x90
 [] __block_write_full_page+0x16e/0x360
 [] block_write_full_page_endio+0xd3/0x110
 [] block_write_full_page+0x10/0x20
 [] blkdev_writepage+0x13/0x20
 [] __writepage+0x12/0x40
 [] write_cache_pages+0x206/0x460
 [] generic_writepages+0x45/0x70
 [] do_writepages+0x1b/0x30
 [] __writeback_single_inode+0x40/0x1b0
 [] writeback_sb_inodes+0x19a/0x350
 [] __writeback_inodes_wb+0x96/0xc0
 [] wb_writeback+0x27b/0x330
 [] wb_do_writeback+0x190/0x1d0
 [] bdi_writeback_thread+0x83/0x280
 [] kthread+0xbb/0xc0
 [] ret_from_fork+0x7c/0xb0
 [] 0x

This part is saying that we locked the buffer due to blkdev_writepage
which I assume must be a metadata update.
Based on where we lock the buffer, the only reason we would leave the
buffer unlocked is if this was an asynchronous write request, leaving
the buffer to be unlocked by end_buffer_async_write at some time in the
future.

Last Owner Activity Stack: 239 flush-8:0
 [] __blkdev_issue_zeroout+0x191/0x1a0
 [] blkdev_issue_zeroout+0xe1/0xf0
 [] ext4_ext_zeroout.isra.30+0x49/0x60
 [] ext4_ext_convert_to_initialized+0x227/0x5f0
 [] ext4_ext_handle_uninitialized_extents+0x2f3/0x3a0
 [] ext4_ext_map_blocks+0x5d7/0xa00
 [] ext4_map_blocks+0x2d5/0x470
 [] mpage_da_map_and_submit+0xba/0x2f0
 [] ext4_da_writepages+0x380/0x620
 [] do_writepages+0x1b/0x30
 [] __writeback_single_inode+0x40/0x1b0
 [] writeback_sb_inodes+0x19a/0x350
 [] __writeback_inodes_wb+0x96/0xc0
 [] wb_writeback+0x27b/0x330

This part is indicating that
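The trylock diagnostic lines quoted in this message have a regular shape that lends itself to mechanical summarising. A rough sketch of how to aggregate who is contending with whom follows; note the line format comes from Mel's debugging patch, not mainline, so it is an assumption here:

```python
# Aggregate lock_buffer contention pairs from debug lines shaped like:
#   <where> <pid> <comm> failed trylock without holder released <R> ms ago
#   acquired <A> ms ago by <pid> <comm>
import re
from collections import Counter

PAT = re.compile(r'^\S+ \d+ (\S+) failed trylock without holder '
                 r'released \d+ ms ago acquired \d+ ms ago by \d+ (\S+)$')

def contention(lines):
    """Count (waiter comm, holder comm) pairs seen in the debug log."""
    pairs = Counter()
    for line in lines:
        m = PAT.match(line)
        if m:
            pairs[(m.group(1), m.group(2))] += 1
    return pairs

log = [
    "jbd2 239 flush-8:0 failed trylock without holder released 0 ms ago acquired 0 ms ago by 2124 tar",
    "jbd2 239 flush-8:0 failed trylock without holder released 0 ms ago acquired 0 ms ago by 2124 tar",
    "jbd2 2124 tar failed trylock without holder released 0 ms ago acquired 0 ms ago by 239 flush-8:0",
]
print(contention(log).most_common())
# -> [(('flush-8:0', 'tar'), 2), (('tar', 'flush-8:0'), 1)]
```

On the real dmesg this would surface the flush-8:0/tar ping-pong quoted above without reading thousands of lines by eye.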
Re: Excessive stall times on ext4 in 3.9-rc2
As an update to this thread, we brought up this issue at LSF/MM, and there is a thought that we should be able to solve this problem by having lock_buffer() check to see if the buffer is locked due to a write being queued, and if so, have the priority of the write bumped up in the write queues to resolve the priority inversion. I believe Jeff Moyer was going to look into this, if I remember correctly.

An alternate solution which I've been playing around with adds buffer_head flags so we can indicate that a buffer contains metadata and/or should have I/O submitted with the REQ_PRIO flag set.

Adding a buffer_head flag for at least BH_Meta is probably a good thing, since that way the blktrace will be properly annotated. Whether we should keep the BH_Prio flag or rely on lock_buffer() automatically raising the priority is an open question; my feeling is that if lock_buffer() can do the right thing, we should probably do it via lock_buffer(). I have a feeling this might be decidedly non-trivial, though, so perhaps we should just do it via BH flags?

- Ted
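As a rough illustration of the flag-based variant: the BH_Meta/BH_Prio bits on the buffer would simply be translated into REQ_META/REQ_PRIO at submission time, so blktrace can tell metadata I/O apart. A toy sketch (the bit values are illustrative, not the kernel's actual definitions):

```python
# Illustrative bit values only -- the real kernel flags live elsewhere.
BH_META, BH_PRIO = 1 << 0, 1 << 1     # buffer_head annotation bits
REQ_META, REQ_PRIO = 1 << 0, 1 << 1   # request flags seen by the elevator

def submit_bh(bh_flags):
    """Map buffer_head annotation bits to request flags at submission time."""
    rq_flags = 0
    if bh_flags & BH_META:
        rq_flags |= REQ_META          # annotate: this is metadata
    if bh_flags & BH_PRIO:
        rq_flags |= REQ_PRIO          # ask the elevator to prioritize it
    return rq_flags

assert submit_bh(BH_META) == REQ_META
assert submit_bh(BH_META | BH_PRIO) == REQ_META | REQ_PRIO
```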
Re: Excessive stall times on ext4 in 3.9-rc2
On Fri, Apr 12, 2013 at 11:19:52AM -0400, Theodore Ts'o wrote:
> On Fri, Apr 12, 2013 at 02:50:42PM +1000, Dave Chinner wrote:
> > > If that is the case, one possible solution that comes to mind would be
> > > to mark buffer_heads that contain metadata with a flag, so that the
> > > flusher thread can write them back at the same priority as reads.
> >
> > Ext4 is already using REQ_META for this purpose.
>
> We're using REQ_META | REQ_PRIO for reads, not writes.
>
> > I'm surprised that no-one has suggested "change the IO elevator"
> > yet.
>
> Well, testing to see if the stalls go away with the noop scheduler is a
> good thing to try just to validate the theory.

Exactly.

> The thing is, we do want to make ext4 work well with cfq, and
> prioritizing non-readahead read requests ahead of data writeback does
> make sense. The issue is that metadata writes going through
> the block device could in some cases effectively cause a priority
> inversion when what had previously been an asynchronous writeback
> starts blocking a foreground, user-visible process.

Here's the historic problem with CFQ: its scheduling algorithms change from release to release, so what you tune the filesystem to for this release is likely to cause different behaviour in a few releases' time. We've had this problem time and time again with CFQ+XFS, so we stopped trying to "tune" to a particular elevator long ago. The best you can do is tag the IO as appropriately as possible (e.g. metadata with REQ_META, sync IO with ?_SYNC, etc), and then hope CFQ hasn't been broken since the last release.

> At least, that's the theory; we should confirm that this is indeed
> what is causing the data stalls which Mel is reporting on HDD's before
> we start figuring out how to fix this problem.

*nod*.

Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
Re: Excessive stall times on ext4 in 3.9-rc2
On Fri, Apr 12, 2013 at 02:50:42PM +1000, Dave Chinner wrote:
> > If that is the case, one possible solution that comes to mind would be
> > to mark buffer_heads that contain metadata with a flag, so that the
> > flusher thread can write them back at the same priority as reads.
>
> Ext4 is already using REQ_META for this purpose.

We're using REQ_META | REQ_PRIO for reads, not writes.

> I'm surprised that no-one has suggested "change the IO elevator"
> yet.

Well, testing to see if the stalls go away with the noop scheduler is a good thing to try just to validate the theory.

The thing is, we do want to make ext4 work well with cfq, and prioritizing non-readahead read requests ahead of data writeback does make sense. The issue is that metadata writes going through the block device could in some cases effectively cause a priority inversion when what had previously been an asynchronous writeback starts blocking a foreground, user-visible process.

At least, that's the theory; we should confirm that this is indeed what is causing the data stalls which Mel is reporting on HDD's before we start figuring out how to fix this problem.

- Ted
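For anyone wanting to run that validation: the elevator is switched per device through sysfs, e.g. `echo noop > /sys/block/sda/queue/scheduler` (device name is an example). If you want to script the test, the scheduler file uses a "noop deadline [cfq]" format where brackets mark the active elevator; a small parser for it:

```python
def parse_scheduler(line):
    """Return (current, available) from a /sys/block/<dev>/queue/scheduler
    line such as "noop deadline [cfq]" (brackets mark the active one)."""
    current = None
    available = []
    for name in line.split():
        if name.startswith('[') and name.endswith(']'):
            current = name[1:-1]
            available.append(current)
        else:
            available.append(name)
    return current, available

cur, avail = parse_scheduler("noop deadline [cfq]")
assert cur == "cfq"
assert avail == ["noop", "deadline", "cfq"]
```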
Re: Excessive stall times on ext4 in 3.9-rc2
Hi all,

On Thursday 11 April 2013 22:57:08 Theodore Ts'o wrote:
> That's an interesting theory. If the workload is one which is very
> heavy on reads and writes, that could explain the high latency. That
> would explain why those of us who are using primarily SSD's aren't
> seeing the problems, because would-be reads are nice and fast.
>
> If that is the case, one possible solution that comes to mind would be
> to mark buffer_heads that contain metadata with a flag, so that the
> flusher thread can write them back at the same priority as reads.
>
> The only problem I can see with this hypothesis is that if this is the
> explanation for what Mel and Jiri are seeing, it's something that
> would have been around for a long time, and would affect ext3 as well
> as ext4. That isn't quite consistent, however, with Mel's observation
> that this is a problem which has gotten worse relatively recently.

Dropping in as a casual observer, having missed the start of the thread and risking that I will just muddle the water for you.

I had a similar problem for quite a while with ext4, at least that was my conclusion, since the fix was to migrate one filesystem to xfs, which fixed it for me. The time period when I observed this was between the 3.5 and 3.7 kernels.

The situation was that I had an ext4 filesystem (on top of LVM, which was on top of MD RAID 1, which was on top of two mechanical hard drives) dedicated to holding a large SVN check-out. Other filesystems were also ext4 on different logical volumes (but the same spindles).

Symptoms were long stalls of everything (including window management!) on a relatively heavily loaded desktop (which was KDE). Stalls would last anything from five to maybe even 30 seconds. Not sure exactly, but long enough that you think the system has actually crashed. I couldn't even switch away to a different virtual terminal during the stall, nothing.
Eventually I traced it down to kdesvn (a subversion client) periodically refreshing (or something) its metadata and hence generating some IO on that dedicated filesystem. That, combined with some other desktop activity, had the effect of stalling everything else. I thought it was very weird, but I suppose KDE and all the rest nowadays do too much IO in everything they do.

Following a hunch I reformatted that filesystem as XFS, which fixed the problem.

I can't reproduce this now to run any tests, so I know this is not very helpful at this point. But perhaps some of the info will be useful to someone.

Tvrtko
Re: Excessive stall times on ext4 in 3.9-rc2
On Thu, Apr 11, 2013 at 10:57:08PM -0400, Theodore Ts'o wrote:
> On Thu, Apr 11, 2013 at 11:33:35PM +0200, Jan Kara wrote:
> > I think it might be more enlightening if Mel traced which process in
> > which function is holding the buffer lock. I suspect we'll find out that
> > the flusher thread has submitted the buffer for IO as an async write and
> > thus it takes a long time to complete in presence of reads which have
> > higher priority.
>
> That's an interesting theory. If the workload is one which is very
> heavy on reads and writes, that could explain the high latency. That
> would explain why those of us who are using primarily SSD's aren't
> seeing the problems, because would-be reads are nice and fast.
>
> If that is the case, one possible solution that comes to mind would be
> to mark buffer_heads that contain metadata with a flag, so that the
> flusher thread can write them back at the same priority as reads.
>
> The only problem I can see with this hypothesis is that if this is the
> explanation for what Mel and Jiri are seeing, it's something that
> would have been around for a long time, and would affect ext3 as well
> as ext4. That isn't quite consistent, however, with Mel's observation
> that this is a problem which has gotten worse relatively recently.

According to the tests I've run, multi-second stalls have been a problem for a while but never really bothered me. I'm not sure why it felt particularly bad around -rc2 or why it seems to be better now. Maybe I just had my cranky pants on.

-- Mel Gorman SUSE Labs
Re: Excessive stall times on ext4 in 3.9-rc2
On Thu, Apr 11, 2013 at 02:35:12PM -0400, Theodore Ts'o wrote:
> On Thu, Apr 11, 2013 at 06:04:02PM +0100, Mel Gorman wrote:
> > > If we're stalling on lock_buffer(), that implies that buffer was being
> > > written, and for some reason it was taking a very long time to
> > > complete.
> >
> > Yes.
> >
> > > It might be worthwhile to put a timestamp in struct dm_crypt_io, and
> > > record the time when a particular I/O encryption/decryption is getting
> > > queued to the kcryptd workqueues, and when they finally squirt out.
> >
> > That somewhat assumes that dm_crypt was at fault, which is not unreasonable,
> > but I was skeptical as the workload on dm_crypt was opening a maildir
> > and mostly reads.
>
> Hmm... well, I've reviewed all of the places in the ext4 and jbd2
> layer where we call lock_buffer(), and with one exception[1] we're not
> holding the bh locked any longer than necessary. There are a few
> places where we grab a spinlock or two before we can do what we need
> to do and then release the locked buffer head, but the only time we
> hold the bh locked for long periods of time is when we submit metadata
> blocks for I/O.

Yeah, ok. This is not the answer I was hoping for but it's the answer I expected.

> Could you write code which checks the hold time of lock_buffer(),
> measuring from when the lock is successfully grabbed, to see if I
> missed some code path in ext4 or jbd2 where the bh is locked and then
> there is some call to some function which needs to block for some
> random reason?
>
> What I'd suggest is putting a timestamp in the buffer_head structure,
> which is set by lock_buffer() once it has successfully grabbed the
> lock, and then in unlock_buffer(), if it is held for more than a
> second or some such, to dump out the stack trace.

I can do that but the results might lack meaning. What I could do instead is use a variation of the page owner tracking patch (current iteration at https://lkml.org/lkml/2012/12/7/487) to record a stack trace in lock_buffer() and dump it from jbd2/transaction.c if it stalls for too long. I'll report if I find something useful.

> Because at this point, either I'm missing something or I'm beginning
> to suspect that your hard drive (or maybe something in the block layer?)
> is simply taking a long time to service an I/O request.

It could be because the drive is a piece of crap.

-- Mel Gorman SUSE Labs
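Mel's page-owner-style variant differs from the unlock-time check in that the stack is captured at acquisition time and reported later from the stall site. A user-space sketch of that shape (owner_stack and the function names are stand-ins for the per-buffer storage a kernel patch would add):

```python
import traceback

owner_stack = {}   # stand-in for per-buffer_head storage of the locker's stack

def lock_buffer(bh):
    """Record who took the lock, page-owner style."""
    owner_stack[bh] = traceback.format_stack()   # capture acquirer's stack now

def report_stall(bh):
    """Called later from the stall-detection site (e.g. jbd2/transaction.c):
    return the stack recorded when the buffer was locked."""
    return "".join(owner_stack.get(bh, ["<no owner recorded>"]))

def touch_metadata():
    # Some code path that locks a buffer and (in the kernel) would then
    # submit it for slow async writeback.
    lock_buffer("bh0")

touch_metadata()
report = report_stall("bh0")
assert "touch_metadata" in report   # the recorded stack names the locker
```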
Re: Excessive stall times on ext4 in 3.9-rc2
On Thu, Apr 11, 2013 at 10:57:08PM -0400, Theodore Ts'o wrote:
> On Thu, Apr 11, 2013 at 11:33:35PM +0200, Jan Kara wrote:
> > I think it might be more enlightening if Mel traced which process in
> > which function is holding the buffer lock. I suspect we'll find out that
> > the flusher thread has submitted the buffer for IO as an async write and
> > thus it takes a long time to complete in presence of reads which have
> > higher priority.
>
> That's an interesting theory. If the workload is one which is very
> heavy on reads and writes, that could explain the high latency. That
> would explain why those of us who are using primarily SSD's aren't
> seeing the problems, because would-be reads are nice and fast.
>
> If that is the case, one possible solution that comes to mind would be
> to mark buffer_heads that contain metadata with a flag, so that the
> flusher thread can write them back at the same priority as reads.

Ext4 is already using REQ_META for this purpose.

I'm surprised that no-one has suggested "change the IO elevator" yet.

Cheers, Dave.

-- Dave Chinner da...@fromorbit.com
Re: Excessive stall times on ext4 in 3.9-rc2
On Thu, Apr 11, 2013 at 11:33:35PM +0200, Jan Kara wrote:
> I think it might be more enlightening if Mel traced which process in
> which function is holding the buffer lock. I suspect we'll find out that
> the flusher thread has submitted the buffer for IO as an async write and
> thus it takes a long time to complete in presence of reads which have
> higher priority.

That's an interesting theory. If the workload is one which is very heavy on reads and writes, that could explain the high latency. That would explain why those of us who are using primarily SSD's aren't seeing the problems, because would-be reads are nice and fast.

If that is the case, one possible solution that comes to mind would be to mark buffer_heads that contain metadata with a flag, so that the flusher thread can write them back at the same priority as reads.

The only problem I can see with this hypothesis is that if this is the explanation for what Mel and Jiri are seeing, it's something that would have been around for a long time, and would affect ext3 as well as ext4. That isn't quite consistent, however, with Mel's observation that this is a problem which has gotten worse relatively recently.

- Ted
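The inversion being theorized can be shown with a toy model: if the scheduler always serves reads before async writes, the earliest-submitted metadata write finishes last, and whoever blocked in lock_buffer() on that buffer waits out the whole read burst. A deliberately simplified sketch (timings omitted; only the ordering matters):

```python
def service_order(requests):
    """requests: list of (name, is_read). Model an elevator that serves
    all reads before async writes, FIFO within each class."""
    reads  = [r for r in requests if r[1]]
    writes = [r for r in requests if not r[1]]
    return reads + writes

# One async metadata write submitted first, then a stream of reads.
reqs = [("metadata-write", False)] + [(f"read-{i}", True) for i in range(5)]
order = [name for name, _ in service_order(reqs)]

# The write, although submitted first, completes last -- so a task holding
# or waiting on lock_buffer() for that buffer stalls for the entire burst.
assert order[-1] == "metadata-write"
assert order[:5] == [f"read-{i}" for i in range(5)]
```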
Re: Excessive stall times on ext4 in 3.9-rc2
On Thu 11-04-13 14:35:12, Ted Tso wrote:
> On Thu, Apr 11, 2013 at 06:04:02PM +0100, Mel Gorman wrote:
> > > If we're stalling on lock_buffer(), that implies that buffer was being
> > > written, and for some reason it was taking a very long time to
> > > complete.
> >
> > Yes.
> >
> > > It might be worthwhile to put a timestamp in struct dm_crypt_io, and
> > > record the time when a particular I/O encryption/decryption is getting
> > > queued to the kcryptd workqueues, and when they finally squirt out.
> >
> > That somewhat assumes that dm_crypt was at fault, which is not unreasonable,
> > but I was skeptical as the workload on dm_crypt was opening a maildir
> > and mostly reads.
>
> Hmm... well, I've reviewed all of the places in the ext4 and jbd2
> layer where we call lock_buffer(), and with one exception[1] we're not
> holding the bh locked any longer than necessary. There are a few
> places where we grab a spinlock or two before we can do what we need
> to do and then release the locked buffer head, but the only time we
> hold the bh locked for long periods of time is when we submit metadata
> blocks for I/O.
>
> [1] There is one exception in ext4_xattr_release_block() where I
> believe we should move the call to unlock_buffer(bh) before the call
> to ext4_free_blocks(), since we've already elevated the bh count and
> ext4_free_blocks() does not need to have the bh locked. It's not
> related to any of the stalls you've reported, though, as near as I can
> tell (none of the stack traces include the ext4 xattr code, and this
> would only affect external extended attribute blocks).
>
> Could you write code which checks the hold time of lock_buffer(),
> measuring from when the lock is successfully grabbed, to see if I
> missed some code path in ext4 or jbd2 where the bh is locked and then
> there is some call to some function which needs to block for some
> random reason?
>
> What I'd suggest is putting a timestamp in the buffer_head structure,
> which is set by lock_buffer() once it has successfully grabbed the
> lock, and then in unlock_buffer(), if it is held for more than a
> second or some such, to dump out the stack trace.
>
> Because at this point, either I'm missing something or I'm beginning
> to suspect that your hard drive (or maybe something in the block layer?)
> is simply taking a long time to service an I/O request. Putting in
> this check should be able to very quickly determine what code path
> and/or which subsystem we should be focused upon.

I think it might be more enlightening if Mel traced which process in which function is holding the buffer lock. I suspect we'll find out that the flusher thread has submitted the buffer for IO as an async write and thus it takes a long time to complete in the presence of reads, which have higher priority.

Honza

-- Jan Kara SUSE Labs, CR
Re: Excessive stall times on ext4 in 3.9-rc2
On Thu, Apr 11, 2013 at 06:04:02PM +0100, Mel Gorman wrote:
> > If we're stalling on lock_buffer(), that implies that buffer was being
> > written, and for some reason it was taking a very long time to
> > complete.
>
> Yes.
>
> > It might be worthwhile to put a timestamp in struct dm_crypt_io, and
> > record the time when a particular I/O encryption/decryption is getting
> > queued to the kcryptd workqueues, and when they finally squirt out.
>
> That somewhat assumes that dm_crypt was at fault, which is not unreasonable,
> but I was skeptical as the workload on dm_crypt was opening a maildir
> and mostly reads.

Hmm... well, I've reviewed all of the places in the ext4 and jbd2 layer where we call lock_buffer(), and with one exception[1] we're not holding the bh locked any longer than necessary. There are a few places where we grab a spinlock or two before we can do what we need to do and then release the locked buffer head, but the only time we hold the bh locked for long periods of time is when we submit metadata blocks for I/O.

[1] There is one exception in ext4_xattr_release_block() where I believe we should move the call to unlock_buffer(bh) before the call to ext4_free_blocks(), since we've already elevated the bh count and ext4_free_blocks() does not need to have the bh locked. It's not related to any of the stalls you've reported, though, as near as I can tell (none of the stack traces include the ext4 xattr code, and this would only affect external extended attribute blocks).

Could you write code which checks the hold time of lock_buffer(), measuring from when the lock is successfully grabbed, to see if I missed some code path in ext4 or jbd2 where the bh is locked and then there is some call to some function which needs to block for some random reason?

What I'd suggest is putting a timestamp in the buffer_head structure, which is set by lock_buffer() once it has successfully grabbed the lock, and then in unlock_buffer(), if it is held for more than a second or some such, to dump out the stack trace.

Because at this point, either I'm missing something or I'm beginning to suspect that your hard drive (or maybe something in the block layer?) is simply taking a long time to service an I/O request. Putting in this check should be able to very quickly determine what code path and/or which subsystem we should be focused upon.

Thanks,

- Ted
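In user space the suggested instrumentation looks roughly like the sketch below: a timestamp recorded on acquire and a stack dump at release if the hold time crossed a threshold. TimedLock is a stand-in class, not lock_buffer() itself; in the kernel the timestamp would live in the buffer_head and the check in unlock_buffer().

```python
import sys
import time
import traceback

class TimedLock:
    """Stand-in for a lock_buffer()/unlock_buffer() pair that reports
    long hold times (threshold is configurable; 1s mirrors the suggestion)."""

    def __init__(self, threshold_s=1.0):
        self.threshold = threshold_s
        self.acquired_at = None

    def lock(self):
        self.acquired_at = time.monotonic()   # set once the lock is grabbed

    def unlock(self):
        held = time.monotonic() - self.acquired_at
        if held > self.threshold:
            # Held too long: dump the releaser's stack so the culprit
            # code path shows up in the log.
            traceback.print_stack(file=sys.stderr)
        return held

lk = TimedLock(threshold_s=0.05)
lk.lock()
time.sleep(0.1)        # simulate a slow I/O completion while holding the lock
held = lk.unlock()     # exceeds the threshold, so a stack trace is dumped
assert held > 0.05
```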
Re: Excessive stall times on ext4 in 3.9-rc2
On Wed, Apr 10, 2013 at 09:12:45AM -0400, Theodore Ts'o wrote:
> On Wed, Apr 10, 2013 at 11:56:08AM +0100, Mel Gorman wrote:
> > During major activity there is likely to be "good" behaviour
> > with stalls roughly every 30 seconds roughly corresponding to
> > dirty_expire_centiseconds. As you'd expect, the flusher thread is stuck
> > when this happens.
> >
> > 237 ? 00:00:00 flush-8:0
> > [] sleep_on_buffer+0x9/0x10
> > [] __lock_buffer+0x2e/0x30
> > [] do_get_write_access+0x43f/0x4b0
>
> If we're stalling on lock_buffer(), that implies that buffer was being
> written, and for some reason it was taking a very long time to
> complete.

Yes.

> It might be worthwhile to put a timestamp in struct dm_crypt_io, and
> record the time when a particular I/O encryption/decryption is getting
> queued to the kcryptd workqueues, and when they finally squirt out.

That somewhat assumes that dm_crypt was at fault which is not unreasonable but I was skeptical as the workload on dm_crypt was opening a maildir and mostly reads. I used a tracepoint in jbd2 to get an idea of what device the buffer_head was managing and dm did not show up on the list. This is what a trace-cmd log of the test told me.
         flush-8:0-240   [005]  236.655363: jbd2_lock_buffer_stall: dev 8,8 stall_ms 1096
          awesome-1364   [005]  290.594396: jbd2_lock_buffer_stall: dev 8,6 stall_ms 7312
  gnome-pty-helpe-2256   [005]  290.836952: jbd2_lock_buffer_stall: dev 8,8 stall_ms 7528
         flush-8:0-240   [003]  304.012424: jbd2_lock_buffer_stall: dev 8,8 stall_ms 4472
   gnome-terminal-2332   [005]  308.290879: jbd2_lock_buffer_stall: dev 8,6 stall_ms 3060
          awesome-1364   [006]  308.291318: jbd2_lock_buffer_stall: dev 8,6 stall_ms 3048
         flush-8:0-240   [005]  331.525996: jbd2_lock_buffer_stall: dev 8,5 stall_ms 8732
         flush-8:0-240   [005]  332.353526: jbd2_lock_buffer_stall: dev 8,5 stall_ms 472
         flush-8:0-240   [005]  345.341547: jbd2_lock_buffer_stall: dev 8,5 stall_ms 10024
   gnome-terminal-2418   [005]  347.166876: jbd2_lock_buffer_stall: dev 8,6 stall_ms 11852
          awesome-1364   [005]  347.167082: jbd2_lock_buffer_stall: dev 8,6 stall_ms 11844
         flush-8:0-240   [005]  347.424520: jbd2_lock_buffer_stall: dev 8,5 stall_ms 2012
         flush-8:0-240   [005]  347.583752: jbd2_lock_buffer_stall: dev 8,5 stall_ms 156
         flush-8:0-240   [005]  390.079682: jbd2_lock_buffer_stall: dev 8,8 stall_ms 396
         flush-8:0-240   [002]  407.882385: jbd2_lock_buffer_stall: dev 8,8 stall_ms 12244
         flush-8:0-240   [005]  408.003976: jbd2_lock_buffer_stall: dev 8,8 stall_ms 124
   gnome-terminal-2610   [005]  413.613365: jbd2_lock_buffer_stall: dev 8,6 stall_ms 3400
          awesome-1364   [006]  413.613605: jbd2_lock_buffer_stall: dev 8,6 stall_ms 3736
         flush-8:0-240   [002]  430.706475: jbd2_lock_buffer_stall: dev 8,5 stall_ms 9748
         flush-8:0-240   [005]  458.188896: jbd2_lock_buffer_stall: dev 8,5 stall_ms 7748
         flush-8:0-240   [005]  458.828143: jbd2_lock_buffer_stall: dev 8,5 stall_ms 348
         flush-8:0-240   [006]  459.163814: jbd2_lock_buffer_stall: dev 8,5 stall_ms 252
         flush-8:0-240   [005]  462.340173: jbd2_lock_buffer_stall: dev 8,5 stall_ms 3160
         flush-8:0-240   [005]  469.917705: jbd2_lock_buffer_stall: dev 8,5 stall_ms 6340
         flush-8:0-240   [005]  474.434206: jbd2_lock_buffer_stall: dev 8,5 stall_ms 4512
              tar-2315   [005]  510.043613: jbd2_lock_buffer_stall: dev 8,5 stall_ms 4316
            tclsh-1780   [005]  773.336488: jbd2_lock_buffer_stall: dev 8,5 stall_ms 736
              git-3100   [005]  775.933506: jbd2_lock_buffer_stall: dev 8,5 stall_ms 3664
              git-4763   [005]  864.093317: jbd2_lock_buffer_stall: dev 8,5 stall_ms 140
         flush-8:0-240   [005]  864.242068: jbd2_lock_buffer_stall: dev 8,6 stall_ms 280
              git-4763   [005]  864.264157: jbd2_lock_buffer_stall: dev 8,5 stall_ms 148
         flush-8:0-240   [005]  865.24: jbd2_lock_buffer_stall: dev 8,5 stall_ms 464
              git-4763   [000]  865.602469: jbd2_lock_buffer_stall: dev 8,5 stall_ms 300
         flush-8:0-240   [005]  865.705448: jbd2_lock_buffer_stall: dev 8,5 stall_ms 500
         flush-8:0-240   [005]  885.367576: jbd2_lock_buffer_stall: dev 8,8 stall_ms 11024
         flush-8:0-240   [005]  895.339697: jbd2_lock_buffer_stall: dev 8,5 stall_ms 120
         flush-8:0-240   [005]  895.765488: jbd2_lock_buffer_stall: dev 8,5 stall_ms 424
   systemd-journal-265   [005]  915.687201: jbd2_lock_buffer_stall: dev 8,8 stall_ms 14844
         flush-8:0-240   [005]  915.690529: jbd2_lock_buffer_stall: dev 8,6 stall_ms 19656
              git-5442   [005] 1034.845674: jbd2_lock_buffer_stall: dev 8,5 stall_ms 344
              git-5442   [005] 1035.157389: jbd2_lock_buffer_stall: dev 8,5 stall_ms 264
         flush-8:0-240   [005] 1035.875478: jbd2_lock_buffer_stall: dev 8,8 stall_ms 1368
Re: Excessive stall times on ext4 in 3.9-rc2
On Thu 11-04-13 14:35:12, Ted Tso wrote:
> [...]
> What I'd suggest is putting a timestamp in the buffer_head structure,
> which is set by lock_buffer() once it has successfully grabbed the lock,
> and then in unlock_buffer(), if it has been held for more than a second
> or some such, dump out the stack trace. Because at this point, either
> I'm missing something or I'm beginning to suspect that your hard drive
> (or maybe something in the block layer?) is simply taking a long time
> to service an I/O request. Putting in this check should be able to very
> quickly determine which code path and/or subsystem we should be focused
> upon.

I think it might be more enlightening if Mel traced which process in which function is holding the buffer lock. I suspect we'll find out that the flusher thread has submitted the buffer for IO as an async write and thus it takes a long time to complete in presence of reads which have higher priority.

								Honza
--
Jan Kara <j...@suse.cz>
SUSE Labs, CR
Re: Excessive stall times on ext4 in 3.9-rc2
On Thu, Apr 11, 2013 at 11:33:35PM +0200, Jan Kara wrote:
> I think it might be more enlightening if Mel traced which process in
> which function is holding the buffer lock. I suspect we'll find out that
> the flusher thread has submitted the buffer for IO as an async write and
> thus it takes a long time to complete in presence of reads which have
> higher priority.

That's an interesting theory. If the workload is one which is very heavy on reads and writes, that could explain the high latency. That would explain why those of us who are using primarily SSD's are not seeing the problems, because would-be reads are nice and fast.

If that is the case, one possible solution that comes to mind would be to mark buffer_heads that contain metadata with a flag, so that the flusher thread can write them back at the same priority as reads.

The only problem I can see with this hypothesis is that if this is the explanation for what Mel and Jiri are seeing, it's something that would have been around for a long time, and would affect ext3 as well as ext4. That isn't quite consistent, however, with Mel's observation that this is a problem which has gotten worse relatively recently.

- Ted
Re: Excessive stall times on ext4 in 3.9-rc2
On Thu, Apr 11, 2013 at 10:57:08PM -0400, Theodore Ts'o wrote:
> On Thu, Apr 11, 2013 at 11:33:35PM +0200, Jan Kara wrote:
> > I think it might be more enlightening if Mel traced which process in
> > which function is holding the buffer lock. I suspect we'll find out
> > that the flusher thread has submitted the buffer for IO as an async
> > write and thus it takes a long time to complete in presence of reads
> > which have higher priority.
>
> That's an interesting theory. If the workload is one which is very
> heavy on reads and writes, that could explain the high latency.
>
> If that is the case, one possible solution that comes to mind would be
> to mark buffer_heads that contain metadata with a flag, so that the
> flusher thread can write them back at the same priority as reads.

Ext4 is already using REQ_META for this purpose.

I'm surprised that no-one has suggested changing the IO elevator yet.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: Excessive stall times on ext4 in 3.9-rc2
On Wed, Apr 10, 2013 at 11:56:08AM +0100, Mel Gorman wrote:
> During major activity there is likely to be "good" behaviour
> with stalls roughly every 30 seconds roughly corresponding to
> dirty_expire_centiseconds. As you'd expect, the flusher thread is stuck
> when this happens.
>
> 237 ? 00:00:00 flush-8:0
> [] sleep_on_buffer+0x9/0x10
> [] __lock_buffer+0x2e/0x30
> [] do_get_write_access+0x43f/0x4b0

If we're stalling on lock_buffer(), that implies that buffer was being written, and for some reason it was taking a very long time to complete.

It might be worthwhile to put a timestamp in struct dm_crypt_io, and record the time when a particular I/O encryption/decryption is getting queued to the kcryptd workqueues, and when they finally squirt out.

Something else that might be worth trying is to add WQ_HIGHPRI to the workqueue flags and see if that makes a difference.

- Ted
Re: Excessive stall times on ext4 in 3.9-rc2
On Tue, Apr 02, 2013 at 11:06:51AM -0400, Theodore Ts'o wrote:
> On Tue, Apr 02, 2013 at 03:27:17PM +0100, Mel Gorman wrote:
> > I'm testing a page-reclaim-related series on my laptop that is partially
> > aimed at fixing long stalls when doing metadata-intensive operations on
> > low memory such as a git checkout. I've been running 3.9-rc2 with the
> > series applied but found that the interactive performance was awful even
> > when there was plenty of free memory.
>
> Can you try 3.9-rc4 or later and see if the problem still persists?
> There were a number of ext4 issues especially around low memory
> performance which weren't resolved until -rc4.

I experimented with this for a while. -rc6 "feels" much better where -rc2 felt it would stall for prolonged periods of time but it could be my imagination too. It does appear that queue depth and await times are slowly increasing for various reasons.

It's always been the case for me that metadata intensive and write activities in the background (opening maildir + cache cold git checkout) would stall the machine for periods of time. This time around, I timed how long it takes gnome-terminal to open, run find on a directory and exit again while a cache cold git checkout and a maildir folder were running.

v3.0.66
count time
  471    5
   23   10
   11   15
   14   20
    4   25
    8   30
    3   35

v3.7
count time
  636    5
   20   10
   13   15
   11   20
    7   25
    1   30
    3   35
    1   40
    1   45

v3.8
count time
  394    5
   10   10
   12   15
    8   20
    9   25
    6   30
    2   35
    3   40

v3.9-rc6
count time
  481    5
   14   10
    9   15
   12   20
    8   25
    4   30
    2   35
    3   40
    1   45
    1   50
    1  140

This shows that kernel 3.7 was able to open the terminal in 5 seconds or less 636 times during the test. Very broadly speaking, v3.0.66 is snappier and generally able to open the terminal and do some work faster. v3.9-rc6 is sometimes much slower, such as when it took 140 seconds to open the terminal, but not consistently slow enough to allow it to be reliably bisected. Further, whatever my perceptions are telling me, the fact is that git checkouts are not obviously worse.
However, queue depth and IO wait times are higher but it's gradual and would not obviously make a very bad impression. See here;

v3.0.66  checkout:278 depth:387.36 await: 878.97 launch:29.39 max_launch: 34.20
v3.7     checkout:268 depth:439.96 await: 971.39 launch:29.46 max_launch: 40.42
v3.8     checkout:275 depth:598.12 await:1280.62 launch:31.95 max_launch: 38.50
v3.9-rc6 checkout:266 depth:540.74 await:1182.10 launch:45.39 max_launch:138.14

Cache cold git checkout times are roughly comparable but average queue depth has been increasing and average wait times in v3.8 and v3.9-rc6 are higher in comparison to v3.0.66. The average time it takes to launch a terminal and do something with it is also increasing. Unfortunately, these results are not always perfectly reproducible and it cannot be reliably bisected. That said, the worst IO wait times (in milliseconds) are getting higher

             await   r_await    w_await
v3.0.66    5811.24     39.19   28309.72
v3.7       7508.79     46.36   36318.96
v3.8       7083.35     47.55   35305.46
v3.9-rc2   9211.14     35.25   34560.08
v3.9-rc6   7499.53     95.21  122780.43

Worst-case small read times have almost doubled. A worst case write delay was 122 seconds in v3.9-rc6! The average wait times are also not painting a pretty picture

             await   r_await    w_await
v3.0.66     878.97      7.79    6975.51
v3.7        971.39      7.84    7745.57
v3.8       1280.63      7.75   10306.62
v3.9-rc2   1280.37      7.55    7687.20
v3.9-rc6   1182.11      8.11   13869.67

That is indicating that average wait times have almost doubled since v3.7. Even though -rc2 felt bad, it's not obviously reflected in the await figures which is partially what makes bisecting this difficult. At least you can get an impression of the wait times from this smoothened graph showing await times from iostat

http://www.csn.ul.ie/~mel/postings/interactivity-20130410/await-times-smooth.png

Again, while one can see the wait times are worse, it's not generally worse enough to pinpoint it to a single change.
Other observations

On my laptop, pm-utils was setting dirty_background_ratio to 5% and dirty_ratio to 10%, away from the expected defaults of 10% and 20%. Any of the changes related to dirty balancing could have affected how often processes get dirty rate-limited.

During major activity there is likely to be "good" behaviour with stalls roughly every 30 seconds roughly corresponding to dirty_expire_centiseconds. As you'd expect, the flusher thread is stuck when this happens.

237 ? 00:00:00 flush-8:0
[] sleep_on_buffer+0x9/0x10
[] __lock_buffer+0x2e/0x30
[] do_get_write_access+0x43f/0x4b0
[]
Re: Excessive stall times on ext4 in 3.9-rc2
On Sun, Apr 07, 2013 at 05:59:06PM -0400, Frank Ch. Eigler wrote:
> > semantic error: while resolving probe point: identifier 'kprobe' at
> > /tmp/stapdjN4_l:18:7
> > source: probe kprobe.function("get_request_wait")
> >                ^
> > Pass 2: analysis failed. [man error::pass2]
> > Unexpected exit of STAP script at ./watch-dstate.pl line 296.
> > I have no clue what to do next. Can you give me a hint?

Is there any reason why the error message couldn't be simplified to something such as "kernel symbol not found"? I wasn't sure if the problem was that there was some incompatibility between a recent change to kprobes and systemtap, or a parse failure in the systemtap script, etc.

> Systemtap could endeavour to list roughly-matching functions that do
> exist, if you think that'd be helpful.

If the goal is ease of use, I suspect the more important thing that systemtap could do is to make its error messages more easily understandable, instead of pointing the user to read a man page where the user then has to figure out which one of a number of failure scenarios was caused by a particularly opaque error message. (The man page doesn't even say that "semantic error while resolving probe point" means that a kernel function doesn't exist -- especially complaining about the kprobe identifier points the user in the wrong direction.)

- Ted
Re: Excessive stall times on ext4 in 3.9-rc2
Hi, Mel -

> > [...] git kernel developers
> > should use git systemtap, as has always been the case. [...]
>
> At one point in the past this used to be the case but then systemtap had to
> be compiled as part of automated tests across different kernel versions. It
> could have been worked around in various ways or even installed manually
> when machines were deployed but stap-fix.sh generally took less time to
> keep working.

OK, if that works for you. Keep in mind though that newer versions of systemtap retain backward compatibility for ancient versions of the kernel, so git systemtap should work on those older versions just fine.

> [...]
> Yes, this was indeed the problem. The next version of watch-dstate.pl
> treated get_request_wait() as a function that may or may not exist. It
> uses /proc/kallsyms to figure it out.

... or you can use the "?" punctuation in the script to have systemtap adapt:

probe kprobe.function("get_request_wait") ? { ... }

- FChE
Re: Excessive stall times on ext4 in 3.9-rc2
On Sun, Apr 07, 2013 at 05:59:06PM -0400, Frank Ch. Eigler wrote: > > Hi - > > > tytso wrote: > > > So I tried to reproduce the problem, and so I installed systemtap > > (bleeding edge, since otherwise it won't work with development > > kernel), and then rebuilt a kernel with all of the necessary CONFIG > > options enabled: > > > > CONFIG_DEBUG_INFO, CONFIG_KPROBES, CONFIG_RELAY, CONFIG_DEBUG_FS, > > CONFIG_MODULES, CONFIG_MODULE_UNLOAD > > [...] > > That sounds about right. > > > > I then pulled down mmtests, and tried running watch-dstate.pl, which > > is what I assume you were using [...] > > I just took a look at the mmtests, particularly the stap-fix.sh stuff. > The heroics therein are really not called for. git kernel developers > should use git systemtap, as has always been the case. All > compatibility hacks in stap-fix.sh have already been merged, in many > cases for months. > At one point in the past this used to be the case but then systemtap had to be compiled as part of automated tests across different kernel versions. It could have been worked around in various ways or even installed manually when machines were deployed but stap-fix.sh generally took less time to keep working. > > > [...] > > semantic error: while resolving probe point: identifier 'kprobe' at > > /tmp/stapdjN4_l:18:7 > > source: probe kprobe.function("get_request_wait") > > ^ > > Pass 2: analysis failed. [man error::pass2] > > Unexpected exit of STAP script at ./watch-dstate.pl line 296. > > I have no clue what to do next. Can you give me a hint? > > You should see the error::pass2 man page, which refers to > error::reporting, which refers to involving stap folks and running > stap-report to gather needed info. 
> But in this case, that's unnecessary: the problem is most likely that
> the get_request_wait function does not actually exist any longer, since
>
> commit a06e05e6afab70b4b23c0a7975aaeae24b195cd6
> Author: Tejun Heo
> Date:   Mon Jun 4 20:40:55 2012 -0700
>
>     block: refactor get_request[_wait]()

Yes, this was indeed the problem. The next version of watch-dstate.pl
treated get_request_wait() as a function that may or may not exist. It
uses /proc/kallsyms to figure it out.

Thanks.

--
Mel Gorman
SUSE Labs
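[Editorial note: Mel's /proc/kallsyms check can be sketched roughly as follows. This is a hypothetical Python rendering of the idea; the real logic lives in watch-dstate.pl (Perl), and the helper name and fake listing below are illustrative, not mmtests code.]

```python
# Scan a kallsyms-format listing ("address type name [module]") to
# decide whether a function such as get_request_wait exists before
# generating a probe for it. Hypothetical helper, not the real script.

def symbol_exists(symbol, kallsyms_lines):
    """Return True if `symbol` appears as a symbol name in the listing."""
    for line in kallsyms_lines:
        fields = line.split()
        if len(fields) >= 3 and fields[2] == symbol:
            return True
    return False

# On a real system you would read open("/proc/kallsyms"); a fake
# snippet with made-up addresses keeps the example self-contained:
fake = [
    "ffffffff8133a2b0 T get_request",
    "ffffffff8133a5c0 T blk_queue_bio",
]
print(symbol_exists("get_request_wait", fake))  # → False after the refactor
```

A script built this way simply omits the kprobe on kernels where the symbol is gone, which is the same effect the "?" probe suffix achieves inside systemtap itself.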
Re: Excessive stall times on ext4 in 3.9-rc2
On Sun, Apr 07, 2013 at 05:59:06PM -0400, Frank Ch. Eigler wrote:
> > semantic error: while resolving probe point: identifier 'kprobe' at
> > /tmp/stapdjN4_l:18:7
> > source: probe kprobe.function("get_request_wait")
> >         ^
> > Pass 2: analysis failed. [man error::pass2]
> > Unexpected exit of STAP script at ./watch-dstate.pl line 296.
> > I have no clue what to do next. Can you give me a hint?

Is there any reason why the error message couldn't be simplified to
something such as "kernel symbol not found"? I wasn't sure if the problem
was some incompatibility between a recent change to kprobes and systemtap,
or a parse failure in the systemtap script, etc.

> Systemtap could endeavour to list roughly-matching functions that do
> exist, if you think that'd be helpful.

If the goal is ease of use, I suspect the more important thing that
systemtap could do is to make its error messages more easily
understandable, instead of pointing the user to read a man page where the
user then has to figure out which one of a number of failure scenarios was
caused by a particularly opaque error message. (The man page doesn't even
say that "semantic error while resolving probe point" means that a kernel
function doesn't exist -- especially complaining about the 'kprobe'
identifier points the user in the wrong direction.)

- Ted
Re: Excessive stall times on ext4 in 3.9-rc2
Hi -

tytso wrote:
> So I tried to reproduce the problem, and so I installed systemtap
> (bleeding edge, since otherwise it won't work with development
> kernel), and then rebuilt a kernel with all of the necessary CONFIG
> options enabled:
>
> CONFIG_DEBUG_INFO, CONFIG_KPROBES, CONFIG_RELAY, CONFIG_DEBUG_FS,
> CONFIG_MODULES, CONFIG_MODULE_UNLOAD
> [...]

That sounds about right.

> I then pulled down mmtests, and tried running watch-dstate.pl, which
> is what I assume you were using [...]

I just took a look at the mmtests, particularly the stap-fix.sh stuff.
The heroics therein are really not called for. git kernel developers
should use git systemtap, as has always been the case. All
compatibility hacks in stap-fix.sh have already been merged, in many
cases for months.

> [...]
> semantic error: while resolving probe point: identifier 'kprobe' at
> /tmp/stapdjN4_l:18:7
> source: probe kprobe.function("get_request_wait")
>         ^
> Pass 2: analysis failed. [man error::pass2]
> Unexpected exit of STAP script at ./watch-dstate.pl line 296.
> I have no clue what to do next. Can you give me a hint?

You should see the error::pass2 man page, which refers to
error::reporting, which refers to involving stap folks and running
stap-report to gather needed info.

But in this case, that's unnecessary: the problem is most likely that
the get_request_wait function does not actually exist any longer, since

commit a06e05e6afab70b4b23c0a7975aaeae24b195cd6
Author: Tejun Heo
Date:   Mon Jun 4 20:40:55 2012 -0700

    block: refactor get_request[_wait]()

Systemtap could endeavour to list roughly-matching functions that do
exist, if you think that'd be helpful.

- FChE
Re: Excessive stall times on ext4 in 3.9-rc2
On Sat, Apr 06, 2013 at 09:29:48AM +0200, Jiri Slaby wrote:
> I'm not sure, as I am using -next like for ever. But sure, there was a
> kernel which didn't have this problem.

Any chance you could try rolling back to 3.2 or 3.5 to see if you can get
a starting point? Even a high-level bisection search would be helpful to
give us a hint.

> Ok,
>   dd if=/dev/zero of=xxx
> is enough instead of "kernel update".

Was the dd running in the VM or in the host OS? Basically, is running the
VM required?

> Nothing, just VM (kernel update from console) and mplayer2 on the host.
> This is more-or-less reproducible with these two.

No browser or anything else running that might be introducing a stream of
fsync()'s?

>  jbd2/sda5-8-10969 [000]   403.679552: jbd2_run_stats: dev 259,655360
> tid 348895 wait 0 request_delay 0 running 5000 locked 1040 flushing 0
> logging 112 handle_count 148224 blocks 1 blocks_logged 2
>  jbd2/md2-8-959   [000]   408.111339: jbd2_run_stats: dev 9,2
> tid 382991 wait 0 request_delay 0 running 5156 locked 2268 flushing 0
> logging 124 handle_count 5 blocks 1 blocks_logged 2

OK, so this is interesting. The commit is stalling for 1 second in the
transaction commit on sda5, and then very shortly thereafter for 2.2
seconds on md2, while we are trying to lock down the transaction. What
that means is that we are waiting for all of the transaction handles
opened against that particular commit to complete before we can let the
transaction commit proceed.

Is md2 sharing the same disk spindle as sda5? And to which disk were you
doing the "dd if=/dev/zero of=/dev/XXX" command?

If I had to guess what's going on, the disk is accepting a huge amount of
writes to its track buffer, and then occasionally it is going out to
lunch trying to write all of this data to the disk platter. This is not
(always) happening when we do the commit (with its attendant cache flush
command), but in a few cases, we are doing a read command which is
getting stalled.
There are a few cases where we start a transaction handle, and then
discover that we need to read in a disk block, and if that read stalls
for a long period of time, it will hold the transaction handle open, and
this will in turn stall the commit. If you were to grab a blktrace, I
suspect that is what you will find; that it's actually a read command
which is stalling at some point, correlated with when we are trying to
start the transaction commit.

- Ted
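[Editorial note: the jbd2_run_stats fields Ted is reading (wait, running, locked, flushing, logging, all in milliseconds) can be pulled out of the trace with a small script. The sketch below is illustrative, not a tool used in the thread; the 1000 ms threshold is an arbitrary example cutoff.]

```python
# Extract per-commit phase timings from jbd2_run_stats trace lines like
# those quoted above, and flag commits that spent a long time in the
# "locked" phase (waiting for open handles before commit can proceed).
import re

FIELDS = ("wait", "request_delay", "running", "locked", "flushing", "logging")

def parse_run_stats(line):
    """Return a dict of the timing fields (ms) found in one trace line."""
    stats = {}
    for name in FIELDS:
        m = re.search(r"\b%s (\d+)" % name, line)
        if m:
            stats[name] = int(m.group(1))
    return stats

line = ("jbd2/sda5-8-10969 [000] 403.679552: jbd2_run_stats: dev 259,655360 "
        "tid 348895 wait 0 request_delay 0 running 5000 locked 1040 "
        "flushing 0 logging 112 handle_count 148224 blocks 1 blocks_logged 2")
stats = parse_run_stats(line)
if stats["locked"] > 1000:
    # → "commit stalled 1040 ms waiting to lock down the transaction"
    print("commit stalled %d ms waiting to lock down the transaction"
          % stats["locked"])
```

This is exactly the arithmetic Ted does by eye: locked 1040 on sda5 and locked 2268 on md2 are the one- and two-second stalls he calls out.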
Re: Excessive stall times on ext4 in 3.9-rc2
On 04/06/2013 09:37 AM, Jiri Slaby wrote:
> On 04/06/2013 09:29 AM, Jiri Slaby wrote:
>> On 04/06/2013 01:16 AM, Theodore Ts'o wrote:
>>> On Sat, Apr 06, 2013 at 12:18:11AM +0200, Jiri Slaby wrote:
>>>> Ok, so now I'm running 3.9.0-rc5-next-20130404, it's not that bad,
>>>> but it still sucks. Updating a kernel in a VM still results in "Your
>>>> system is too SLOW to play this!" by mplayer and frame dropping.
>>>
>>> What was the first kernel where you didn't have the problem? Were you
>>> using the 3.8 kernel earlier, and did you see the interactivity
>>> problems there?
>>
>> I'm not sure, as I am using -next like for ever. But sure, there was a
>> kernel which didn't have this problem.
>>
>>> What else was running on your desktop at the same time?
>>
>> Nothing, just VM (kernel update from console) and mplayer2 on the host.
>> This is more-or-less reproducible with these two.
>
> Ok,
>   dd if=/dev/zero of=xxx
> is enough instead of "kernel update".
>
> Writeback mount doesn't help.
>
>>> How was
>>> the file system mounted,
>>
>> Both are actually a single device /dev/sda5:
>> /dev/sda5 on /win type ext4 (rw,noatime,data=ordered)
>>
>> Should I try writeback?
>>
>>> and can you send me the output of dumpe2fs -h
>>> /dev/XXX?
>>
>> dumpe2fs 1.42.7 (21-Jan-2013)
>> Filesystem volume name: <none>
>> Last mounted on: /win
>> Filesystem UUID: cd4bf4d2-bc32-4777-a437-ee24c4ee5f1b
>> Filesystem magic number: 0xEF53
>> Filesystem revision #: 1 (dynamic)
>> Filesystem features: has_journal ext_attr resize_inode dir_index
>> filetype needs_recovery extent flex_bg sparse_super large_file huge_file
>> uninit_bg dir_nlink extra_isize
>> Filesystem flags: signed_directory_hash
>> Default mount options: user_xattr acl
>> Filesystem state: clean
>> Errors behavior: Continue
>> Filesystem OS type: Linux
>> Inode count: 30507008
>> Block count: 122012416
>> Reserved block count: 0
>> Free blocks: 72021328
>> Free inodes: 30474619
>> First block: 0
>> Block size: 4096
>> Fragment size: 4096
>> Reserved GDT blocks: 994
>> Blocks per group: 32768
>> Fragments per group: 32768
>> Inodes per group: 8192
>> Inode blocks per group: 512
>> RAID stride: 32747
>> Flex block group size: 16
>> Filesystem created: Fri Sep  7 20:44:21 2012
>> Last mount time: Thu Apr  4 12:22:01 2013
>> Last write time: Thu Apr  4 12:22:01 2013
>> Mount count: 256
>> Maximum mount count: -1
>> Last checked: Sat Sep  8 21:13:28 2012
>> Check interval: 0 (<none>)
>> Lifetime writes: 1011 GB
>> Reserved blocks uid: 0 (user root)
>> Reserved blocks gid: 0 (group root)
>> First inode: 11
>> Inode size: 256
>> Required extra isize: 28
>> Desired extra isize: 28
>> Journal inode: 8
>> Default directory hash: half_md4
>> Directory Hash Seed: b6ad3f8b-72ce-49d6-92cb-abccd7dbe98e
>> Journal backup: inode blocks
>> Journal features: journal_incompat_revoke
>> Journal size: 128M
>> Journal length: 32768
>> Journal sequence: 0x00054dc7
>> Journal start: 8193
>>
>>> Oh, and what options were you using when you kicked off
>>> the VM?
>>
>> qemu-kvm -k en-us -smp 2 -m 1200 -soundhw hda -usb -usbdevice tablet
>> -net user -net nic,model=e1000 -serial pty -balloon virtio -hda x.img
>>
>>> The other thing that would be useful was to enable the jbd2_run_stats
>>> tracepoint and to send the output of the trace log when you notice the
>>> interactivity problems.
>>
>> Ok, I will try.

Inline here, as well as attached:

# tracer: nop
#
# entries-in-buffer/entries-written: 46/46   #P:2
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
  jbd2/sda5-8-10969 [000]   387.054319: jbd2_run_stats: dev 259,655360
tid 348892 wait 0 request_delay 0 running 5728 locked 0 flushing 0
logging 28 handle_count 10 blocks 1 blocks_logged 2
  jbd2/sda5-8-10969 [000]   392.594132: jbd2_run_stats: dev 259,655360
tid 348893 wait 0 request_delay 0 running 5300 locked 0 flushing 0
logging 64 handle_count 75944 blocks 1 blocks_logged 2
  jbd2/md2-8-959   [000]   396.249990: jbd2_run_stats: dev 9,2
tid 382990 wait 0 request_delay 0 running 5500 locked 0 flushing 0
logging 220 handle_count 3 blocks 1 blocks_logged 2
  jbd2/md1-8-1826  [000]   397.205670: jbd2_run_stats: dev 9,1
tid 1081270 wait 0 request_delay 0 running 5760 locked 0 flushing 0
logging 200 handle_count 2 blocks 0 blocks_logged 0
  jbd2/sda5-8-10969 [000]   397.563660: jbd2_run_stats: dev 259,655360
tid 348894 wait 0 request_delay 0 running 5000 locked 0 flushing 0
logging 32 handle_count 89397 blocks
Re: Excessive stall times on ext4 in 3.9-rc2
On 04/06/2013 09:29 AM, Jiri Slaby wrote:
> On 04/06/2013 01:16 AM, Theodore Ts'o wrote:
>> On Sat, Apr 06, 2013 at 12:18:11AM +0200, Jiri Slaby wrote:
>>> Ok, so now I'm running 3.9.0-rc5-next-20130404, it's not that bad,
>>> but it still sucks. Updating a kernel in a VM still results in "Your
>>> system is too SLOW to play this!" by mplayer and frame dropping.
>>
>> What was the first kernel where you didn't have the problem? Were you
>> using the 3.8 kernel earlier, and did you see the interactivity
>> problems there?
>
> I'm not sure, as I am using -next like for ever. But sure, there was a
> kernel which didn't have this problem.
>
>> What else was running on your desktop at the same time?
>
> Nothing, just VM (kernel update from console) and mplayer2 on the host.
> This is more-or-less reproducible with these two.

Ok,
  dd if=/dev/zero of=xxx
is enough instead of "kernel update".

Writeback mount doesn't help.

>> How was
>> the file system mounted,
>
> Both are actually a single device /dev/sda5:
> /dev/sda5 on /win type ext4 (rw,noatime,data=ordered)
>
> Should I try writeback?
>
>> and can you send me the output of dumpe2fs -h
>> /dev/XXX?
>
> dumpe2fs 1.42.7 (21-Jan-2013)
> Filesystem volume name: <none>
> Last mounted on: /win
> Filesystem UUID: cd4bf4d2-bc32-4777-a437-ee24c4ee5f1b
> Filesystem magic number: 0xEF53
> Filesystem revision #: 1 (dynamic)
> Filesystem features: has_journal ext_attr resize_inode dir_index
> filetype needs_recovery extent flex_bg sparse_super large_file huge_file
> uninit_bg dir_nlink extra_isize
> Filesystem flags: signed_directory_hash
> Default mount options: user_xattr acl
> Filesystem state: clean
> Errors behavior: Continue
> Filesystem OS type: Linux
> Inode count: 30507008
> Block count: 122012416
> Reserved block count: 0
> Free blocks: 72021328
> Free inodes: 30474619
> First block: 0
> Block size: 4096
> Fragment size: 4096
> Reserved GDT blocks: 994
> Blocks per group: 32768
> Fragments per group: 32768
> Inodes per group: 8192
> Inode blocks per group: 512
> RAID stride: 32747
> Flex block group size: 16
> Filesystem created: Fri Sep  7 20:44:21 2012
> Last mount time: Thu Apr  4 12:22:01 2013
> Last write time: Thu Apr  4 12:22:01 2013
> Mount count: 256
> Maximum mount count: -1
> Last checked: Sat Sep  8 21:13:28 2012
> Check interval: 0 (<none>)
> Lifetime writes: 1011 GB
> Reserved blocks uid: 0 (user root)
> Reserved blocks gid: 0 (group root)
> First inode: 11
> Inode size: 256
> Required extra isize: 28
> Desired extra isize: 28
> Journal inode: 8
> Default directory hash: half_md4
> Directory Hash Seed: b6ad3f8b-72ce-49d6-92cb-abccd7dbe98e
> Journal backup: inode blocks
> Journal features: journal_incompat_revoke
> Journal size: 128M
> Journal length: 32768
> Journal sequence: 0x00054dc7
> Journal start: 8193
>
>> Oh, and what options were you using when you kicked off
>> the VM?
>
> qemu-kvm -k en-us -smp 2 -m 1200 -soundhw hda -usb -usbdevice tablet
> -net user -net nic,model=e1000 -serial pty -balloon virtio -hda x.img
>
>> The other thing that would be useful was to enable the jbd2_run_stats
>> tracepoint and to send the output of the trace log when you notice the
>> interactivity problems.
>
> Ok, I will try.
>
> thanks,
> --
> js
> suse labs
Re: Excessive stall times on ext4 in 3.9-rc2
On 04/06/2013 01:16 AM, Theodore Ts'o wrote:
> On Sat, Apr 06, 2013 at 12:18:11AM +0200, Jiri Slaby wrote:
>> Ok, so now I'm running 3.9.0-rc5-next-20130404, it's not that bad,
>> but it still sucks. Updating a kernel in a VM still results in "Your
>> system is too SLOW to play this!" by mplayer and frame dropping.
>
> What was the first kernel where you didn't have the problem? Were you
> using the 3.8 kernel earlier, and did you see the interactivity
> problems there?

I'm not sure, as I am using -next like for ever. But sure, there was a
kernel which didn't have this problem.

> What else was running on your desktop at the same time?

Nothing, just VM (kernel update from console) and mplayer2 on the host.
This is more-or-less reproducible with these two.

> How was
> the file system mounted,

Both are actually a single device /dev/sda5:
/dev/sda5 on /win type ext4 (rw,noatime,data=ordered)

Should I try writeback?

> and can you send me the output of dumpe2fs -h
> /dev/XXX?

dumpe2fs 1.42.7 (21-Jan-2013)
Filesystem volume name: <none>
Last mounted on: /win
Filesystem UUID: cd4bf4d2-bc32-4777-a437-ee24c4ee5f1b
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index
filetype needs_recovery extent flex_bg sparse_super large_file huge_file
uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 30507008
Block count: 122012416
Reserved block count: 0
Free blocks: 72021328
Free inodes: 30474619
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 994
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
RAID stride: 32747
Flex block group size: 16
Filesystem created: Fri Sep  7 20:44:21 2012
Last mount time: Thu Apr  4 12:22:01 2013
Last write time: Thu Apr  4 12:22:01 2013
Mount count: 256
Maximum mount count: -1
Last checked: Sat Sep  8 21:13:28 2012
Check interval: 0 (<none>)
Lifetime writes: 1011 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: b6ad3f8b-72ce-49d6-92cb-abccd7dbe98e
Journal backup: inode blocks
Journal features: journal_incompat_revoke
Journal size: 128M
Journal length: 32768
Journal sequence: 0x00054dc7
Journal start: 8193

> Oh, and what options were you using when you kicked off
> the VM?

qemu-kvm -k en-us -smp 2 -m 1200 -soundhw hda -usb -usbdevice tablet
-net user -net nic,model=e1000 -serial pty -balloon virtio -hda x.img

> The other thing that would be useful was to enable the jbd2_run_stats
> tracepoint and to send the output of the trace log when you notice the
> interactivity problems.

Ok, I will try.

thanks,
--
js suse labs
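[Editorial note: "enable the jbd2_run_stats tracepoint" amounts to writing "1" to the event's enable file under tracefs. A minimal sketch, assuming the usual debugfs mount point (paths vary by distribution; requires root); the helper name is ours, not part of any tool in the thread.]

```python
# Enable a kernel tracepoint by writing to its tracefs enable file.
# Assumes debugfs is mounted at /sys/kernel/debug; falls back cleanly
# when tracefs (or the jbd2 events) are not available.
import os

def enable_tracepoint(event="jbd2/jbd2_run_stats",
                      tracefs="/sys/kernel/debug/tracing"):
    path = os.path.join(tracefs, "events", event, "enable")
    if not os.path.exists(path):
        return False  # tracefs not mounted here, or event not compiled in
    with open(path, "w") as f:
        f.write("1")
    return True

# Records then show up in <tracefs>/trace, as in the log Jiri posted.
print(enable_tracepoint(tracefs="/nonexistent"))  # → False when tracefs is absent
```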
Re: Excessive stall times on ext4 in 3.9-rc2
On 04/06/2013 01:16 AM, Theodore Ts'o wrote: On Sat, Apr 06, 2013 at 12:18:11AM +0200, Jiri Slaby wrote: Ok, so now I'm runnning 3.9.0-rc5-next-20130404, it's not that bad, but it still sucks. Updating a kernel in a VM still results in Your system is too SLOW to play this! by mplayer and frame dropping. What was the first kernel where you didn't have the problem? Were you using the 3.8 kernel earlier, and did you see the interactivity problems there? I'm not sure, as I am using -next like for ever. But sure, there was a kernel which didn't ahve this problem. What else was running in on your desktop at the same time? Nothing, just VM (kernel update from console) and mplayer2 on the host. This is more-or-less reproducible with these two. How was the file system mounted, Both are actually a single device /dev/sda5: /dev/sda5 on /win type ext4 (rw,noatime,data=ordered) Should I try writeback? and can you send me the output of dumpe2fs -h /dev/XXX? dumpe2fs 1.42.7 (21-Jan-2013) Filesystem volume name: none Last mounted on: /win Filesystem UUID: cd4bf4d2-bc32-4777-a437-ee24c4ee5f1b Filesystem magic number: 0xEF53 Filesystem revision #:1 (dynamic) Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize Filesystem flags: signed_directory_hash Default mount options:user_xattr acl Filesystem state: clean Errors behavior: Continue Filesystem OS type: Linux Inode count: 30507008 Block count: 122012416 Reserved block count: 0 Free blocks: 72021328 Free inodes: 30474619 First block: 0 Block size: 4096 Fragment size:4096 Reserved GDT blocks: 994 Blocks per group: 32768 Fragments per group: 32768 Inodes per group: 8192 Inode blocks per group: 512 RAID stride: 32747 Flex block group size:16 Filesystem created: Fri Sep 7 20:44:21 2012 Last mount time: Thu Apr 4 12:22:01 2013 Last write time: Thu Apr 4 12:22:01 2013 Mount count: 256 Maximum mount count: -1 Last checked: 
Sat Sep 8 21:13:28 2012 Check interval: 0 (none) Lifetime writes: 1011 GB Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group root) First inode: 11 Inode size: 256 Required extra isize: 28 Desired extra isize: 28 Journal inode:8 Default directory hash: half_md4 Directory Hash Seed: b6ad3f8b-72ce-49d6-92cb-abccd7dbe98e Journal backup: inode blocks Journal features: journal_incompat_revoke Journal size: 128M Journal length: 32768 Journal sequence: 0x00054dc7 Journal start:8193 Oh, and what options were you using to when you kicked off the VM? qemu-kvm -k en-us -smp 2 -m 1200 -soundhw hda -usb -usbdevice tablet -net user -net nic,model=e1000 -serial pty -balloon virtio -hda x.img The other thing that would be useful was to enable the jbd2_run_stats tracepoint and to send the output of the trace log when you notice the interactivity problems. Ok, I will try. thanks, -- js suse labs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Excessive stall times on ext4 in 3.9-rc2
On 04/06/2013 09:29 AM, Jiri Slaby wrote: On 04/06/2013 01:16 AM, Theodore Ts'o wrote: On Sat, Apr 06, 2013 at 12:18:11AM +0200, Jiri Slaby wrote: Ok, so now I'm runnning 3.9.0-rc5-next-20130404, it's not that bad, but it still sucks. Updating a kernel in a VM still results in Your system is too SLOW to play this! by mplayer and frame dropping. What was the first kernel where you didn't have the problem? Were you using the 3.8 kernel earlier, and did you see the interactivity problems there? I'm not sure, as I am using -next like for ever. But sure, there was a kernel which didn't ahve this problem. What else was running in on your desktop at the same time? Nothing, just VM (kernel update from console) and mplayer2 on the host. This is more-or-less reproducible with these two. Ok, dd if=/dev/zero of=xxx is enough instead of kernel update. Writeback mount doesn't help. How was the file system mounted, Both are actually a single device /dev/sda5: /dev/sda5 on /win type ext4 (rw,noatime,data=ordered) Should I try writeback? and can you send me the output of dumpe2fs -h /dev/XXX? 
dumpe2fs 1.42.7 (21-Jan-2013) Filesystem volume name: none Last mounted on: /win Filesystem UUID: cd4bf4d2-bc32-4777-a437-ee24c4ee5f1b Filesystem magic number: 0xEF53 Filesystem revision #:1 (dynamic) Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize Filesystem flags: signed_directory_hash Default mount options:user_xattr acl Filesystem state: clean Errors behavior: Continue Filesystem OS type: Linux Inode count: 30507008 Block count: 122012416 Reserved block count: 0 Free blocks: 72021328 Free inodes: 30474619 First block: 0 Block size: 4096 Fragment size:4096 Reserved GDT blocks: 994 Blocks per group: 32768 Fragments per group: 32768 Inodes per group: 8192 Inode blocks per group: 512 RAID stride: 32747 Flex block group size:16 Filesystem created: Fri Sep 7 20:44:21 2012 Last mount time: Thu Apr 4 12:22:01 2013 Last write time: Thu Apr 4 12:22:01 2013 Mount count: 256 Maximum mount count: -1 Last checked: Sat Sep 8 21:13:28 2012 Check interval: 0 (none) Lifetime writes: 1011 GB Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group root) First inode: 11 Inode size: 256 Required extra isize: 28 Desired extra isize: 28 Journal inode:8 Default directory hash: half_md4 Directory Hash Seed: b6ad3f8b-72ce-49d6-92cb-abccd7dbe98e Journal backup: inode blocks Journal features: journal_incompat_revoke Journal size: 128M Journal length: 32768 Journal sequence: 0x00054dc7 Journal start:8193 Oh, and what options were you using to when you kicked off the VM? qemu-kvm -k en-us -smp 2 -m 1200 -soundhw hda -usb -usbdevice tablet -net user -net nic,model=e1000 -serial pty -balloon virtio -hda x.img The other thing that would be useful was to enable the jbd2_run_stats tracepoint and to send the output of the trace log when you notice the interactivity problems. Ok, I will try. 
thanks, -- js suse labs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Excessive stall times on ext4 in 3.9-rc2
On 04/06/2013 09:37 AM, Jiri Slaby wrote: On 04/06/2013 09:29 AM, Jiri Slaby wrote: On 04/06/2013 01:16 AM, Theodore Ts'o wrote: On Sat, Apr 06, 2013 at 12:18:11AM +0200, Jiri Slaby wrote: Ok, so now I'm runnning 3.9.0-rc5-next-20130404, it's not that bad, but it still sucks. Updating a kernel in a VM still results in Your system is too SLOW to play this! by mplayer and frame dropping. What was the first kernel where you didn't have the problem? Were you using the 3.8 kernel earlier, and did you see the interactivity problems there? I'm not sure, as I am using -next like for ever. But sure, there was a kernel which didn't ahve this problem. What else was running in on your desktop at the same time? Nothing, just VM (kernel update from console) and mplayer2 on the host. This is more-or-less reproducible with these two. Ok, dd if=/dev/zero of=xxx is enough instead of kernel update. Writeback mount doesn't help. How was the file system mounted, Both are actually a single device /dev/sda5: /dev/sda5 on /win type ext4 (rw,noatime,data=ordered) Should I try writeback? and can you send me the output of dumpe2fs -h /dev/XXX? 
dumpe2fs 1.42.7 (21-Jan-2013) Filesystem volume name: none Last mounted on: /win Filesystem UUID: cd4bf4d2-bc32-4777-a437-ee24c4ee5f1b Filesystem magic number: 0xEF53 Filesystem revision #:1 (dynamic) Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize Filesystem flags: signed_directory_hash Default mount options:user_xattr acl Filesystem state: clean Errors behavior: Continue Filesystem OS type: Linux Inode count: 30507008 Block count: 122012416 Reserved block count: 0 Free blocks: 72021328 Free inodes: 30474619 First block: 0 Block size: 4096 Fragment size:4096 Reserved GDT blocks: 994 Blocks per group: 32768 Fragments per group: 32768 Inodes per group: 8192 Inode blocks per group: 512 RAID stride: 32747 Flex block group size:16 Filesystem created: Fri Sep 7 20:44:21 2012 Last mount time: Thu Apr 4 12:22:01 2013 Last write time: Thu Apr 4 12:22:01 2013 Mount count: 256 Maximum mount count: -1 Last checked: Sat Sep 8 21:13:28 2012 Check interval: 0 (none) Lifetime writes: 1011 GB Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group root) First inode: 11 Inode size: 256 Required extra isize: 28 Desired extra isize: 28 Journal inode:8 Default directory hash: half_md4 Directory Hash Seed: b6ad3f8b-72ce-49d6-92cb-abccd7dbe98e Journal backup: inode blocks Journal features: journal_incompat_revoke Journal size: 128M Journal length: 32768 Journal sequence: 0x00054dc7 Journal start:8193 Oh, and what options were you using to when you kicked off the VM? qemu-kvm -k en-us -smp 2 -m 1200 -soundhw hda -usb -usbdevice tablet -net user -net nic,model=e1000 -serial pty -balloon virtio -hda x.img The other thing that would be useful was to enable the jbd2_run_stats tracepoint and to send the output of the trace log when you notice the interactivity problems. Ok, I will try. 
Inline here, as well as attached:

# tracer: nop
#
# entries-in-buffer/entries-written: 46/46   #P:2
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#      TIMESTAMP  FUNCTION
#              | |       |           |         |
jbd2/sda5-8-10969 [000] 387.054319: jbd2_run_stats: dev 259,655360 tid 348892 wait 0 request_delay 0 running 5728 locked 0 flushing 0 logging 28 handle_count 10 blocks 1 blocks_logged 2
jbd2/sda5-8-10969 [000] 392.594132: jbd2_run_stats: dev 259,655360 tid 348893 wait 0 request_delay 0 running 5300 locked 0 flushing 0 logging 64 handle_count 75944 blocks 1 blocks_logged 2
jbd2/md2-8-959 [000] 396.249990: jbd2_run_stats: dev 9,2 tid 382990 wait 0 request_delay 0 running 5500 locked 0 flushing 0 logging 220 handle_count 3 blocks 1 blocks_logged 2
jbd2/md1-8-1826 [000] 397.205670: jbd2_run_stats: dev 9,1 tid 1081270 wait 0 request_delay 0 running 5760 locked 0 flushing 0 logging 200 handle_count 2 blocks 0 blocks_logged 0
jbd2/sda5-8-10969 [000] 397.563660: jbd2_run_stats: dev 259,655360 tid 348894 wait 0 request_delay 0 running 5000 locked 0 flushing 0 logging 32 handle_count 89397 blocks
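A trace like the one above gets long quickly. As a sketch (the 1000 ms threshold and the field handling are my own assumptions, not something from this thread), the jbd2_run_stats records can be scanned with awk for commits that spent significant time in the "locked" or "flushing" phases:

```shell
# Sketch: flag jbd2_run_stats records whose "locked" or "flushing"
# phase (reported in ms) exceeded an arbitrary 1000 ms threshold.
stalled_commits() {
    awk '/jbd2_run_stats/ {
        for (i = 1; i <= NF; i++) {
            if ($i == "dev")      dev  = $(i + 1)
            if ($i == "running")  run  = $(i + 1)
            if ($i == "locked")   lock = $(i + 1)
            if ($i == "flushing") flsh = $(i + 1)
        }
        if (lock + 0 > 1000 || flsh + 0 > 1000)
            printf "dev %s: running %s ms, locked %s ms, flushing %s ms\n",
                   dev, run, lock, flsh
    }'
}

# Demo on two records quoted elsewhere in this thread:
stalled_commits <<'EOF'
jbd2/sda5-8-10969 [000] 403.679552: jbd2_run_stats: dev 259,655360 tid 348895 wait 0 request_delay 0 running 5000 locked 1040 flushing 0 logging 112 handle_count 148224 blocks 1 blocks_logged 2
jbd2/md2-8-959 [000] 408.111339: jbd2_run_stats: dev 9,2 tid 382991 wait 0 request_delay 0 running 5156 locked 2268 flushing 0 logging 124 handle_count 5 blocks 1 blocks_logged 2
EOF
```

Both sample records trip the threshold, so each prints one summary line.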
Re: Excessive stall times on ext4 in 3.9-rc2
On Sat, Apr 06, 2013 at 09:29:48AM +0200, Jiri Slaby wrote:
> I'm not sure, as I am using -next like for ever. But sure, there was a
> kernel which didn't have this problem.

Any chance you could try rolling back to 3.2 or 3.5 to see if you can
get a starting point? Even a high-level bisection search would be
helpful to give us a hint.

> Ok, dd if=/dev/zero of=xxx is enough instead of kernel update.

Was the dd running in the VM or in the host OS? Basically, is running
the VM required?

> Nothing, just VM (kernel update from console) and mplayer2 on the
> host. This is more-or-less reproducible with these two.

No browser or anything else running that might be introducing a stream
of fsync()'s?

> jbd2/sda5-8-10969 [000] 403.679552: jbd2_run_stats: dev 259,655360 tid 348895 wait 0 request_delay 0 running 5000 locked 1040 flushing 0 logging 112 handle_count 148224 blocks 1 blocks_logged 2
> jbd2/md2-8-959 [000] 408.111339: jbd2_run_stats: dev 9,2 tid 382991 wait 0 request_delay 0 running 5156 locked 2268 flushing 0 logging 124 handle_count 5 blocks 1 blocks_logged 2

OK, so this is interesting. The commit is stalling for 1 second in the
transaction commit on sda5, and then very shortly thereafter for 2.2
seconds on md2, while we are trying to lock down the transaction. What
that means is that we are waiting for all of the transaction handles
opened against that particular commit to complete before we can let
the transaction commit proceed.

Is md2 sharing the same disk spindle as sda5? And to which disk were
you doing the dd if=/dev/zero of=/dev/XXX command?

If I had to guess what's going on, the disk is accepting a huge amount
of writes to its track buffer, and then occasionally it is going out
to lunch trying to write all of this data to the disk platter. This is
not (always) happening when we do the commit (with its attendant cache
flush command), but in a few cases, we are doing a read command which
is getting stalled.
There are a few cases where we start a transaction handle, and then
discover that we need to read in a disk block, and if that read stalls
for a long period of time, it will hold the transaction handle open,
and this will in turn stall the commit. If you were to grab a
blocktrace, I suspect that is what you will find; that it's actually a
read command which is stalling at some point, correlated with when we
are trying to start the transaction commit.

   - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
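A hedged sketch of what "grab a blocktrace" could look like in practice: capture with something like `blktrace -d /dev/sda -o - | blkparse -i -` (requires root and CONFIG_BLK_DEV_IO_TRACE), then scan the completions for suspiciously long gaps. The filter below assumes the default blkparse layout (field 4 = timestamp, field 6 = action, field 7 = RWBS flags), and the sample records are made up purely for illustration:

```shell
# Sketch: find gaps longer than 1 second between successive I/O
# completions ("C" records) in blkparse output. Sample data below
# is hypothetical, not from the trace discussed in this thread.
completion_gaps() {
    awk '$6 == "C" {
        if (prev != "" && $4 - prev > 1.0)
            printf "%.3f s gap before completion at %s (%s)\n",
                   $4 - prev, $4, $7
        prev = $4
    }'
}

completion_gaps <<'EOF'
8,0 0 1 10.000000000 4721 C R 1000 + 8 [0]
8,0 0 2 10.020000000 4721 C W 2000 + 8 [0]
8,0 0 3 12.400000000 4721 C R 3000 + 8 [0]
EOF
```

A stalled read of the kind described above would show up as a multi-second gap ending in an "R" completion.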
Re: Excessive stall times on ext4 in 3.9-rc2
On Sat, Apr 06, 2013 at 12:18:11AM +0200, Jiri Slaby wrote:
> Ok, so now I'm running 3.9.0-rc5-next-20130404, it's not that bad, but
> it still sucks. Updating a kernel in a VM still results in "Your system
> is too SLOW to play this!" by mplayer and frame dropping.

What was the first kernel where you didn't have the problem? Were you
using the 3.8 kernel earlier, and did you see the interactivity
problems there?

What else was running on your desktop at the same time?

How was the file system mounted, and can you send me the output of
dumpe2fs -h /dev/XXX?

Oh, and what options were you using when you kicked off the VM?

The other thing that would be useful was to enable the jbd2_run_stats
tracepoint and to send the output of the trace log when you notice the
interactivity problems.

Thanks,

   - Ted
Re: Excessive stall times on ext4 in 3.9-rc2
On 04/03/2013 12:19 PM, Mel Gorman wrote:
> On Tue, Apr 02, 2013 at 11:14:36AM -0400, Theodore Ts'o wrote:
>> On Tue, Apr 02, 2013 at 11:06:51AM -0400, Theodore Ts'o wrote:
>>>
>>> Can you try 3.9-rc4 or later and see if the problem still persists?
>>> There were a number of ext4 issues especially around low memory
>>> performance which weren't resolved until -rc4.
>>
>> Actually, sorry, I took a closer look and I'm not as sure going to
>> -rc4 is going to help (although we did have some ext4 patches to fix a
>> number of bugs that flowed in as late as -rc4).
>>
>
> I'm running with -rc5 now. I have not noticed much interactivity problems
> as such but the stall detection script reported that mutt stalled for
> 20 seconds opening an inbox and imapd blocked for 59 seconds doing path
> lookups, imaps blocked again for 12 seconds doing an atime update, an RSS
> reader blocked for 3.5 seconds writing a file, etc.
>
> There has been no reclaim activity in the system yet and 2G is still free
> so it's very unlikely to be a page or slab reclaim problem.

Ok, so now I'm running 3.9.0-rc5-next-20130404, it's not that bad, but
it still sucks. Updating a kernel in a VM still results in "Your system
is too SLOW to play this!" by mplayer and frame dropping.

3.5G out of 6G memory used, the rest is I/O cache. I have 7200RPM disks
in my desktop.

--
js
suse labs
Re: Excessive stall times on ext4 in 3.9-rc2
On Tue, Apr 02, 2013 at 07:16:13PM -0400, Theodore Ts'o wrote:
> I've tried doing some quick timing, and if it is a performance
> regression, it's not a recent one --- or I haven't been able to
> reproduce what Mel is seeing. I tried the following commands while
> booted into 3.2, 3.8, and 3.9-rc3 kernels:
>
> time git clone ...
> rm .git/index ; time git reset
>

FWIW, I had run a number of git checkout-based tests over time and none
of them revealed anything useful. Granted it was on other machines but
I don't think it's git on its own. It's a combination that leads to
this problem. Maybe it's really an IO scheduler problem and I need to
figure out what combination triggers it.

> Mel, how bad are the various git commands that you are trying? Have you
> tried using time to get estimates of how long a git clone or other git
> operation is taking?
>

Unfortunately, the mileage varies considerably and it's not always
possible to time the operation. It may be that on one occasion opening
a mail takes an abnormal length of time, with git operations
occasionally making it far worse.

--
Mel Gorman
SUSE Labs
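Because single timings vary so much, repeated runs are more informative. A sketch of the kind of timing harness that could be used here (my own, not part of mmtests; the git commands are placeholders and the demo times a no-op):

```shell
# Sketch: run an operation N times and print min/avg/max wall-clock
# seconds. One-second date granularity is deliberately crude; the
# stalls discussed in this thread are in the tens of seconds.
time_op() {
    n=$1; shift
    i=0
    while [ "$i" -lt "$n" ]; do
        start=$(date +%s)
        "$@" > /dev/null 2>&1
        end=$(date +%s)
        echo $((end - start))
        i=$((i + 1))
    done | awk '{ s += $1
                  if (NR == 1 || $1 < min) min = $1
                  if ($1 > max) max = $1 }
            END { printf "min %ds avg %.1fs max %ds over %d runs\n",
                  min, s / NR, max, NR }'
}

# e.g. time_op 3 git reset    # after rm .git/index, as in the test above
time_op 2 true
```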
Re: Excessive stall times on ext4 in 3.9-rc2
On Wed, Apr 03, 2013 at 08:05:30AM -0400, Theodore Ts'o wrote:
> On Wed, Apr 03, 2013 at 11:19:25AM +0100, Mel Gorman wrote:
>>
>> I'm running with -rc5 now. I have not noticed much interactivity problems
>> as such but the stall detection script reported that mutt stalled for
>> 20 seconds opening an inbox and imapd blocked for 59 seconds doing path
>> lookups, imaps blocked again for 12 seconds doing an atime update, an RSS
>> reader blocked for 3.5 seconds writing a file, etc.
>
> If imaps blocked for 12 seconds during an atime update, combined with
> everything else, at a guess it got caught by something holding up a
> journal commit.

It's a possibility. I apologise but I forgot that mail is stored on a
crypted partition on this machine. It's formatted ext4 but dmcrypt
could be making this problem worse if it's stalling ext4 waiting to
encrypt/decrypt data due to either a scheduler or workqueue change.

> Could you try enabling the jbd2_run_stats tracepoint and grabbing the
> trace log? This will give you statistics on how long (in milliseconds)
> each of the various phases of a jbd2 commit is taking, i.e.:
>
> jbd2/sdb1-8-327 [002] 39681.874661: jbd2_run_stats: dev 8,17 tid 7163786 wait 0 request_delay 0 running 3530 locked 0 flushing 0 logging 0 handle_count 75 blocks 8 blocks_logged 9
> jbd2/sdb1-8-327 [003] 39682.514153: jbd2_run_stats: dev 8,17 tid 7163787 wait 0 request_delay 0 running 640 locked 0 flushing 0 logging 0 handle_count 39 blocks 12 blocks_logged 13
> jbd2/sdb1-8-327 [000] 39687.665609: jbd2_run_stats: dev 8,17 tid 7163788 wait 0 request_delay 0 running 5150 locked 0 flushing 0 logging 0 handle_count 60 blocks 13 blocks_logged 14
> jbd2/sdb1-8-327 [000] 39693.200453: jbd2_run_stats: dev 8,17 tid 7163789 wait 0 request_delay 0 running 4840 locked 0 flushing 0 logging 0 handle_count 53 blocks 10 blocks_logged 11
> jbd2/sdb1-8-327 [001] 39695.061657: jbd2_run_stats: dev 8,17 tid 7163790 wait 0 request_delay 0 running 1860 locked 0 flushing 0 logging 0 handle_count 124 blocks 19 blocks_logged 20

Attached as well as the dstate summary that was recorded at the same
time. It's not quite as compelling but I'll keep the monitor running
and see if something falls out. I didn't find anything useful in the
existing mmtests tests that could be used to bisect this but not many
of them are focused on IO.

> In the above sample each journal commit is running for no more than 5
> seconds or so (since that's the default jbd2 commit timeout; if a
> transaction is running for less than 5 seconds, then either we ran out
> of room in the journal, and the blocks_logged number will be high, or
> a commit was forced by something such as an fsync call).
>

I didn't see anything majorly compelling in the jbd tracepoints but I'm
not 100% sure I'm looking for the right thing either.
I also recorded /proc/latency_stat and there were some bad sync
latencies from the file as you can see here

3 4481 1586 jbd2_log_wait_commit ext4_sync_file vfs_fsync sys_msync system_call_fastpath
3 11325 4373 sleep_on_page wait_on_page_bit kretprobe_trampoline filemap_write_and_wait_range ext4_sync_file vfs_fsync sys_msync system_call_fastpath
85 1130707 14904 jbd2_journal_stop jbd2_journal_force_commit ext4_force_commit ext4_sync_file do_fsync sys_fsync system_call_fastpath
1 2161073 2161073 start_this_handle jbd2__journal_start.part.8 jbd2__journal_start __ext4_journal_start_sb ext4_da_writepages do_writepages __filemap_fdatawrite_range filemap_write_and_wait_range ext4_sync_file do_fsync sys_fsync system_call_fastpath
118 7798435 596184 jbd2_log_wait_commit jbd2_journal_stop jbd2_journal_force_commit ext4_force_commit ext4_sync_file do_fsync sys_fsync system_call_fastpath
599 15496449 3405822 sleep_on_page wait_on_page_bit kretprobe_trampoline filemap_write_and_wait_range ext4_sync_file do_fsync sys_fsync system_call_fastpath
405 28572881 2619592 jbd2_log_wait_commit ext4_sync_file do_fsync sys_fsync system_call_fastpath

> If an atime update is getting blocked by 12 seconds, then it would be
> interesting to see if a journal commit is running for significantly
> longer than 5 seconds, or if one of the other commit phases is taking
> significant amounts of time. (On the example above they are all
> taking no time, since I ran this on a relatively uncontended system;
> only a single git operation taking place.)
>
> Something else that might be worth trying is grabbing a lock_stat
> report and see if something is sitting on an ext4 or jbd2 mutex for a
> long time.
>

Ok, if nothing useful falls out in this session I'll enable lock
debugging. latency_stat on its own would not be enough to conclude that
a problem was related to lock contention.
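The /proc/latency_stat rows above are "count total-usec max-usec backtrace...". A small sketch (my own, not part of mmtests) for ranking such entries by worst-case latency, reported in seconds:

```shell
# Sketch: rank /proc/latency_stat entries by maximum latency.
# Columns: hit count ($1), total latency in us ($2), max latency in
# us ($3), then the backtrace; only the top frame ($4) is shown.
worst_latencies() {
    awk 'NF >= 4 { printf "%.3f s  x%d  %s\n", $3 / 1000000, $1, $4 }' | sort -rn
}

# Demo on two of the entries quoted above:
worst_latencies <<'EOF'
405 28572881 2619592 jbd2_log_wait_commit ext4_sync_file do_fsync sys_fsync system_call_fastpath
599 15496449 3405822 sleep_on_page wait_on_page_bit kretprobe_trampoline filemap_write_and_wait_range ext4_sync_file do_fsync sys_fsync system_call_fastpath
EOF
```

On the full file this would surface the 3.4 s sleep_on_page and 2.6 s jbd2_log_wait_commit worst cases first.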
> Finally, as I mentioned I tried some rather simplistic tests and I
> didn't notice any difference between a 3.2 kernel and a 3.8/3.9-rc5
> kernel. Assuming you can get a version of systemtap that simultaneously
> works on 3.2 and 3.9-rc5 :-P, any chance you could do a quick
> experiment and see if you're seeing a difference on your setup?
Re: Excessive stall times on ext4 in 3.9-rc2
On Wed, Apr 03, 2013 at 11:19:25AM +0100, Mel Gorman wrote:
>
> I'm running with -rc5 now. I have not noticed much interactivity problems
> as such but the stall detection script reported that mutt stalled for
> 20 seconds opening an inbox and imapd blocked for 59 seconds doing path
> lookups, imaps blocked again for 12 seconds doing an atime update, an RSS
> reader blocked for 3.5 seconds writing a file, etc.

If imaps blocked for 12 seconds during an atime update, combined with
everything else, at a guess it got caught by something holding up a
journal commit. Could you try enabling the jbd2_run_stats tracepoint
and grabbing the trace log? This will give you statistics on how long
(in milliseconds) each of the various phases of a jbd2 commit is
taking, i.e.:

jbd2/sdb1-8-327 [002] 39681.874661: jbd2_run_stats: dev 8,17 tid 7163786 wait 0 request_delay 0 running 3530 locked 0 flushing 0 logging 0 handle_count 75 blocks 8 blocks_logged 9
jbd2/sdb1-8-327 [003] 39682.514153: jbd2_run_stats: dev 8,17 tid 7163787 wait 0 request_delay 0 running 640 locked 0 flushing 0 logging 0 handle_count 39 blocks 12 blocks_logged 13
jbd2/sdb1-8-327 [000] 39687.665609: jbd2_run_stats: dev 8,17 tid 7163788 wait 0 request_delay 0 running 5150 locked 0 flushing 0 logging 0 handle_count 60 blocks 13 blocks_logged 14
jbd2/sdb1-8-327 [000] 39693.200453: jbd2_run_stats: dev 8,17 tid 7163789 wait 0 request_delay 0 running 4840 locked 0 flushing 0 logging 0 handle_count 53 blocks 10 blocks_logged 11
jbd2/sdb1-8-327 [001] 39695.061657: jbd2_run_stats: dev 8,17 tid 7163790 wait 0 request_delay 0 running 1860 locked 0 flushing 0 logging 0 handle_count 124 blocks 19 blocks_logged 20

In the above sample each journal commit is running for no more than 5
seconds or so (since that's the default jbd2 commit timeout; if a
transaction is running for less than 5 seconds, then either we ran out
of room in the journal, and the blocks_logged number will be high, or
a commit was forced by something such as an fsync call).

If an atime update is getting blocked by 12 seconds, then it would be
interesting to see if a journal commit is running for significantly
longer than 5 seconds, or if one of the other commit phases is taking
significant amounts of time. (On the example above they are all
taking no time, since I ran this on a relatively uncontended system;
only a single git operation taking place.)

Something else that might be worth trying is grabbing a lock_stat
report and see if something is sitting on an ext4 or jbd2 mutex for a
long time.

Finally, as I mentioned I tried some rather simplistic tests and I
didn't notice any difference between a 3.2 kernel and a 3.8/3.9-rc5
kernel. Assuming you can get a version of systemtap that simultaneously
works on 3.2 and 3.9-rc5 :-P, any chance you could do a quick
experiment and see if you're seeing a difference on your setup?

Thanks!!

   - Ted
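The phase breakdown described here suggests an obvious way to skim a large trace: for each commit, report which phase dominated. This is my own sketch, not an existing tool, and it assumes the exact field layout of the jbd2_run_stats lines shown in this thread:

```shell
# Sketch: for each jbd2_run_stats record, report the largest of the
# six phase timings (wait/request_delay/running/locked/flushing/
# logging), all reported in milliseconds.
dominant_phase() {
    awk '/jbd2_run_stats/ {
        n = split("wait request_delay running locked flushing logging", keys, " ")
        best = ""; bestv = -1
        for (i = 1; i <= NF; i++) {
            if ($i == "tid") tid = $(i + 1)
            for (k = 1; k <= n; k++)
                if ($i == keys[k] && $(i + 1) + 0 > bestv) {
                    bestv = $(i + 1) + 0
                    best = keys[k]
                }
        }
        printf "tid %s: %s dominates (%d ms)\n", tid, best, bestv
    }'
}

# Demo on one of the sample records from this message:
dominant_phase <<'EOF'
jbd2/sdb1-8-327 [002] 39681.874661: jbd2_run_stats: dev 8,17 tid 7163786 wait 0 request_delay 0 running 3530 locked 0 flushing 0 logging 0 handle_count 75 blocks 8 blocks_logged 9
EOF
```

A commit dominated by "running" is the normal 5-second cadence; a large "locked" value is the interesting case here, since it points at handles held open while the commit tries to start.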
Re: Excessive stall times on ext4 in 3.9-rc2
On Tue, Apr 02, 2013 at 11:14:36AM -0400, Theodore Ts'o wrote:
> On Tue, Apr 02, 2013 at 11:06:51AM -0400, Theodore Ts'o wrote:
>>
>> Can you try 3.9-rc4 or later and see if the problem still persists?
>> There were a number of ext4 issues especially around low memory
>> performance which weren't resolved until -rc4.
>
> Actually, sorry, I took a closer look and I'm not as sure going to
> -rc4 is going to help (although we did have some ext4 patches to fix a
> number of bugs that flowed in as late as -rc4).
>

I'm running with -rc5 now. I have not noticed much interactivity problems
as such but the stall detection script reported that mutt stalled for
20 seconds opening an inbox and imapd blocked for 59 seconds doing path
lookups, imaps blocked again for 12 seconds doing an atime update, an RSS
reader blocked for 3.5 seconds writing a file, etc.

There has been no reclaim activity in the system yet and 2G is still free
so it's very unlikely to be a page or slab reclaim problem.

> Can you send us the patch that you used to record these long stall
> times?

No patch but it depends on systemtap which you are already aware is a
wreck to work with and frequently breaks between kernel versions for a
variety of reasons. Minimally, it is necessary to revert commit
ba6fdda4 (profiling: Remove unused timer hook) to get systemtap working.
I've reported this problem to the patch author and the systemtap mailing
list. Other workarounds are necessary so I updated mmtests in git and at
http://www.csn.ul.ie/~mel/projects/mmtests/mmtests-0.10-mmtests-0.01.tar.gz .
Download and untar it.

1. stap can be "fixed" by running bin/stap-fix.sh . It will try and run
   a one-liner stap script and if that fails it'll try very crude
   workarounds. Your mileage may vary considerably.

2. If you want to run the monitor script yourself, it's

   sudo monitors/watch-dstate.pl | tee /tmp/foo.log

   but be aware the log may be truncated due to buffering.
   Optionally you can avoid the buffered write problem by running
   mmtests as

   sudo ./run-mmtests.sh --config configs/config-monitor-interactive stall-debug

   and the log will be in work/log/dstate-stall-debug-monitor.gz

3. Summarise the report with

   cat /tmp/foo.log | subreport/stap-dstate-frequency

I'll be digging through other mmtests results shortly to see if I
already have a better reproduction case that is eligible for bisection
but those results are based on different machines so no guarantees of
success.

> And I assume you're using a laptop drive? 5400RPM or 7200RPM?
>

Yes, laptop drive, 7200RPM. CFQ scheduler. Drive queue depth is 32.

/dev/sda:

ATA device, with non-removable media
	Model Number:       ST9320423AS
	Serial Number:      5VH5M0LY
	Firmware Revision:  0003LVM1
	Transport:          Serial
Standards:
	Used: unknown (minor revision code 0x0029)
	Supported: 8 7 6 5
	Likely used: 8
Configuration:
	Logical		max	current
	cylinders	16383	16383
	heads		16	16
	sectors/track	63	63
	--
	CHS current addressable sectors:   16514064
	LBA    user addressable sectors:  268435455
	LBA48  user addressable sectors:  625142448
	Logical  Sector size:                   512 bytes
	Physical Sector size:                   512 bytes
	device size with M = 1024*1024:      305245 MBytes
	device size with M = 1000*1000:      320072 MBytes (320 GB)
	cache/buffer size  = 16384 KBytes
	Nominal Media Rotation Rate: 7200
Capabilities:
	LBA, IORDY(can be disabled)
	Queue depth: 32
	Standby timer values: spec'd by Standard, no device specific minimum
	R/W multiple sector transfer: Max = 16	Current = 16
	Advanced power management level: 128
	Recommended acoustic management value: 254, current value: 0
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	SMART feature set
	    	Security Mode feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
	   *	Host Protected Area feature set
	   *	WRITE_BUFFER command
	   *	READ_BUFFER command
	   *	DOWNLOAD_MICROCODE
	   *	Advanced Power Management feature set
	    	SET_MAX security extension
	   *	48-bit Address feature set
	   *	Device Configuration Overlay feature set
	   *	Mandatory FLUSH_CACHE
	   *	FLUSH_CACHE_EXT
	   *	SMART error logging
	   *	SMART self-test
	   *	General Purpose Logging feature set
	   *	64-bit World wide name
if a journal commit is running for significantly longer than 5 seconds, or if one of the other commit phases is taking significant amounts of time. (On the example above they are all taking no time, since I ran this on a relatively uncontended system; only a single git operation taking place.) Something else that might be worth trying is grabbing a lock_stat report and see if something is sitting on an ext4 or jbd2 mutex for a long time. Ok, if nothing useful falls out in this session I'll enable lock debugging. latency_stat on its own would not be enough to conclude that a problem was related to lock contention. Finally, as I mentioned I tried some rather simplistic tests and I didn't notice any difference between a 3.2 kernel and a 3.8/3.9-rc5 kernel. Assuming you can get a version of systemtap that simultaneously works on
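[Editorial aside: the jbd2_run_stats lines quoted above lend themselves to quick post-processing. The following is a sketch of a throwaway parser (a hypothetical helper, not part of mmtests) that flags commits whose "running" time exceeds the 5-second commit interval Ted describes:]

```python
import re

# Fields of a jbd2_run_stats trace line as shown above; times are in
# milliseconds. Only the fields up to "logging" are captured here.
STATS_RE = re.compile(
    r"jbd2_run_stats: dev (?P<dev>\S+) tid (?P<tid>\d+) wait (?P<wait>\d+) "
    r"request_delay (?P<request_delay>\d+) running (?P<running>\d+) "
    r"locked (?P<locked>\d+) flushing (?P<flushing>\d+) logging (?P<logging>\d+)"
)

def slow_commits(trace_lines, threshold_ms=5000):
    """Return (tid, running_ms) for commits running longer than threshold_ms."""
    slow = []
    for line in trace_lines:
        m = STATS_RE.search(line)
        if m:
            running = int(m.group("running"))
            if running > threshold_ms:
                slow.append((int(m.group("tid")), running))
    return slow
```

Feeding it the trace log would surface any commit running significantly longer than 5 seconds, which is exactly the condition of interest for the 12-second atime stall.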
Re: Excessive stall times on ext4 in 3.9-rc2
On Tue, Apr 02, 2013 at 07:16:13PM -0400, Theodore Ts'o wrote:
> I've tried doing some quick timing, and if it is a performance
> regression, it's not a recent one --- or I haven't been able to
> reproduce what Mel is seeing. I tried the following commands while
> booted into 3.2, 3.8, and 3.9-rc3 kernels:
>
> time git clone ...
> rm .git/index ; time git reset

FWIW, I had run a number of git-checkout-based tests over time and none
of them revealed anything useful. Granted it was on other machines but I
don't think it's git on its own. It's a combination that leads to this
problem. Maybe it's really an IO scheduler problem and I need to figure
out what combination triggers it.

<SNIP>

> Mel, how bad are the various git commands that you are trying? Have
> you tried using time to get estimates of how long a git clone or other
> git operation is taking?

Unfortunately, the mileage varies considerably and it's not always
possible to time the operation. It may be that on one occasion opening
a mail takes an abnormal length of time, with git operations
occasionally making it far worse.

--
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Excessive stall times on ext4 in 3.9-rc2
I've tried doing some quick timing, and if it is a performance
regression, it's not a recent one --- or I haven't been able to
reproduce what Mel is seeing. I tried the following commands while
booted into 3.2, 3.8, and 3.9-rc3 kernels:

time git clone ...
rm .git/index ; time git reset

I did this on a number of git repos, including one that was freshly
cloned and one that had around 3 dozen patches applied via git am (so
there were a bunch of loose objects). And I tried doing this on an SSD
and a 5400rpm HDD, and I did it with all of the in-memory cache flushed
via "echo 3 > /proc/sys/vm/drop_caches".

The worst case was doing a "time git reset" after deleting the
.git/index file after applying all of Kent Overstreet's recent AIO
patches that had been sent out for review. It took around 55 seconds on
3.2, 3.8 and 3.9-rc3. That is pretty horrible, but for me that's the
reason why I use SSDs.

Mel, how bad are the various git commands that you are trying? Have you
tried using time to get estimates of how long a git clone or other git
operation is taking?

- Ted
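[Editorial aside: the cold-cache timing recipe above (sync, drop caches, then time the git command) is easy to script. A minimal sketch follows; it is a hypothetical harness, not something posted in the thread, and the drop_caches write assumes root on Linux:]

```python
import subprocess
import time

def drop_caches():
    """Flush pagecache, dentries and inodes (needs root). Writing "3"
    combines pagecache (1) with dentries and inodes (2)."""
    subprocess.run(["sync"], check=True)
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")

def timed(cmd, cold=False):
    """Run cmd and return wall-clock seconds, optionally with cold caches."""
    if cold:
        drop_caches()
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    return time.monotonic() - start
```

For example, `timed(["git", "reset"], cold=True)` after removing .git/index reproduces the worst case Ted timed, and repeating it across kernel versions gives comparable numbers.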
Re: Excessive stall times on ext4 in 3.9-rc2
So I tried to reproduce the problem, and so I installed systemtap
(bleeding edge, since otherwise it won't work with a development
kernel), and then rebuilt a kernel with all of the necessary CONFIG
options enabled:

CONFIG_DEBUG_INFO, CONFIG_KPROBES, CONFIG_RELAY, CONFIG_DEBUG_FS,
CONFIG_MODULES, CONFIG_MODULE_UNLOAD

I then pulled down mmtests, and tried running watch-dstate.pl, which is
what I assume you were using, and I got a reminder of why I've tried
very hard to avoid systemtap:

semantic error: while resolving probe point: identifier 'kprobe' at /tmp/stapdjN4_l:18:7
        source: probe kprobe.function("get_request_wait")
                      ^
semantic error: no match
semantic error: while resolving probe point: identifier 'kprobe' at :74:8
        source: }probe kprobe.function("get_request_wait").return
                       ^
Pass 2: analysis failed.  [man error::pass2]
Unexpected exit of STAP script at ./watch-dstate.pl line 296.

I have no clue what to do next. Can you give me a hint?

Thanks,

- Ted
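[Editorial aside: a pass-2 "no match" on a kprobe.function probe point typically means the named function is not a probeable symbol in the running kernel (it may have been inlined or removed in that version). One quick sanity check, sketched below against /proc/kallsyms with a hypothetical helper, is to confirm the symbol actually exists before debugging systemtap itself:]

```python
def find_symbols(kallsyms_lines, name):
    """Return kallsyms symbol names starting with `name`.
    Each line looks like: 'ffffffff8130b2a0 T get_request'."""
    hits = []
    for line in kallsyms_lines:
        parts = line.split()
        if len(parts) >= 3 and parts[2].startswith(name):
            hits.append(parts[2])
    return hits

# On a live system (requires readable /proc/kallsyms):
# with open("/proc/kallsyms") as f:
#     print(find_symbols(f, "get_request"))
```

An empty result for "get_request_wait" would explain the "no match" from the stap probe point above.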
Re: Excessive stall times on ext4 in 3.9-rc2
On Tue, Apr 02, 2013 at 11:03:36PM +0800, Zheng Liu wrote:
> Hi Mel,
>
> Thanks for reporting it.
>
> On 04/02/2013 10:27 PM, Mel Gorman wrote:
> > I'm testing a page-reclaim-related series on my laptop that is
> > partially aimed at fixing long stalls when doing metadata-intensive
> > operations on low memory such as a git checkout. I've been running
> > 3.9-rc2 with the series applied but found that the interactive
> > performance was awful even when there was plenty of free memory.
> >
> > I activated a monitor from mmtests that logs when a process is stuck
> > for a long time in D state and found that there are a lot of stalls
> > in ext4. The report first states that processes have been stalled for
> > a total of 6498 seconds on IO which seems like a lot. Here is a
> > breakdown of the recorded events.
>
> In this merge window, we added a status tree as an extent cache.
> Meanwhile an es_cache shrinker is registered to try to reclaim from
> this cache when we are under high memory pressure.

Ok.

> So I suspect that the root cause is this shrinker. Could you please
> tell me how to reproduce this problem? If I understand correctly, I
> can run mmtests to reproduce this problem, right?

This is normal desktop usage with some development thrown in, nothing
spectacular but nothing obviously reproducible either, unfortunately. I
just noticed that some git operations were taking abnormally long, mutt
was very slow opening mail, applications like mozilla were very slow to
launch etc. and dug a little further. I haven't checked if regression
tests under mmtests captured something similar yet.

--
Mel Gorman
SUSE Labs
Re: Excessive stall times on ext4 in 3.9-rc2
On Tue, Apr 02, 2013 at 11:06:51AM -0400, Theodore Ts'o wrote:
> Can you try 3.9-rc4 or later and see if the problem still persists?
> There were a number of ext4 issues especially around low memory
> performance which weren't resolved until -rc4.

Actually, sorry, I took a closer look and I'm not as sure going to -rc4
is going to help (although we did have some ext4 patches to fix a number
of bugs that flowed in as late as -rc4).

Can you send us the patch that you used to record these long stall
times? And I assume you're using a laptop drive? 5400RPM or 7200RPM?

- Ted
Re: Excessive stall times on ext4 in 3.9-rc2
On Tue, Apr 02, 2013 at 03:27:17PM +0100, Mel Gorman wrote:
> I'm testing a page-reclaim-related series on my laptop that is
> partially aimed at fixing long stalls when doing metadata-intensive
> operations on low memory such as a git checkout. I've been running
> 3.9-rc2 with the series applied but found that the interactive
> performance was awful even when there was plenty of free memory.

Can you try 3.9-rc4 or later and see if the problem still persists?
There were a number of ext4 issues especially around low memory
performance which weren't resolved until -rc4.

Thanks,

- Ted
Re: Excessive stall times on ext4 in 3.9-rc2
Hi Mel,

Thanks for reporting it.

On 04/02/2013 10:27 PM, Mel Gorman wrote:
> I'm testing a page-reclaim-related series on my laptop that is
> partially aimed at fixing long stalls when doing metadata-intensive
> operations on low memory such as a git checkout. I've been running
> 3.9-rc2 with the series applied but found that the interactive
> performance was awful even when there was plenty of free memory.
>
> I activated a monitor from mmtests that logs when a process is stuck
> for a long time in D state and found that there are a lot of stalls in
> ext4. The report first states that processes have been stalled for a
> total of 6498 seconds on IO which seems like a lot. Here is a
> breakdown of the recorded events.

In this merge window, we added a status tree as an extent cache.
Meanwhile an es_cache shrinker is registered to try to reclaim from this
cache when we are under high memory pressure. So I suspect that the root
cause is this shrinker. Could you please tell me how to reproduce this
problem? If I understand correctly, I can run mmtests to reproduce this
problem, right?

Thanks,
- Zheng
Re: Excessive stall times on ext4 in 3.9-rc2
On 04/02/2013 04:27 PM, Mel Gorman wrote:
> I'm testing a page-reclaim-related series on my laptop that is
> partially aimed at fixing long stalls when doing metadata-intensive
> operations on low memory such as a git checkout. I've been running
> 3.9-rc2 with the series applied but found that the interactive
> performance was awful even when there was plenty of free memory.
>
> I activated a monitor from mmtests that logs when a process is stuck
> for a long time in D state and found that there are a lot of stalls in
> ext4. The report first states that processes have been stalled for a
> total of 6498 seconds on IO which seems like a lot. Here is a
> breakdown of the recorded events.

Just a note that I am indeed using ext4 on the affected machine for all
filesystems I have except for an efi partition...

--
js
suse labs