Re: [PATCH] block/025: test discard sector alignment and sector size overflow
On Thu, Nov 15, 2018 at 12:00:17PM +0800, Ming Lei wrote:
> This test covers the following two issues:
>
> 1) discard sector needs to be aligned with logical block size
>
> 2) make sure 'sector_t' instead of 'unsigned int' is used when comparing
> with discard sector size
>
> Signed-off-by: Ming Lei
> ---
>  tests/block/025     | 37 +
>  tests/block/025.out |  2 ++
>  2 files changed, 39 insertions(+)
>  create mode 100755 tests/block/025
>  create mode 100644 tests/block/025.out

Applied, thanks Ming.
Re: [GIT PULL] Block fixes for 4.20-rc6
The pull request you sent on Wed, 5 Dec 2018 13:25:47 -0700:

> git://git.kernel.dk/linux-block.git tags/for-linus-20181205

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/4eaaa2b99e30305f4bc677f4abfe56c1f8b39670

Thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker
[GIT PULL] Block fixes for 4.20-rc6
Hi Linus,

A bit earlier in the week than usual, but there's a fix here that should
go in sooner rather than later. Under a combination of circumstances, the
direct issue path in blk-mq could corrupt data. This wasn't easy to hit,
but the ones that are affected by it seem to hit it pretty easily. Full
explanation in the patch. None of the regular filesystem and storage
testing has triggered it, even though it's been around since 4.19-rc1.

Outside of that, a whitelist trim tweak for certain Samsung devices for
libata.

Please pull!

  git://git.kernel.dk/linux-block.git tags/for-linus-20181205

Jens Axboe (1):
      blk-mq: fix corruption with direct issue

Juha-Matti Tilli (1):
      libata: whitelist all SAMSUNG MZ7KM* solid-state disks

 block/blk-mq.c            | 26 +-
 drivers/ata/libata-core.c |  1 +
 2 files changed, 26 insertions(+), 1 deletion(-)

-- 
Jens Axboe
Re: [PATCH] blk-mq: fix corruption with direct issue
On 12/5/18 12:09 PM, Guenter Roeck wrote:
> On Wed, Dec 05, 2018 at 10:59:21AM -0700, Jens Axboe wrote:
> [ ... ]
>>
>>> Also, it seems to me that even with this problem fixed, blk-mq may not
>>> be ready for primetime after all. With that in mind, maybe commit
>>> d5038a13eca72 ("scsi: core: switch to scsi-mq by default") was a
>>> bit premature. Should that be reverted ?
>>
>> I have to strongly disagree with that, the timing is just unfortunate.
>> There are literally millions of machines running blk-mq/scsi-mq, and
>> this is the only hiccup we've had. So I want to put this one to rest
>> once and for all, there's absolutely no reason not to continue with
>> what we've planned.
>>
> Guess we have to agree to disagree. In my opinion, for infrastructure
> as critical as this, a single hiccup is one hiccup too many. Not that
> I would describe this as a hiccup in the first place; I would describe
> it as a major disaster.

Don't get me wrong, I don't mean to use hiccup in a diminishing fashion,
this was by all means a disaster for the ones hit by it. But if you look
at the scope of how many folks are using blk-mq/scsi-mq, and have been
for years, we're really talking about a tiny tiny percentage here. This
could just as easily have happened with the old IO stack.

The bug was a freak accident, and even with full knowledge of why it
happened, I'm still having an extraordinarily hard time triggering it at
will on my test boxes. As with any disaster, it's usually a combination
of multiple things that go wrong, and this one is no different. The
folks that hit this generally hit it pretty easily, and (by far) the
majority would never hit it.

Bugs happen, whether you like it or not. They happen in file systems,
memory management, and they happen in storage. Things are continually
developed, and that sometimes introduces bugs. We do our best to ensure
that doesn't happen, but sometimes freak accidents like this happen.
I think my track record of decades of work speaks for itself there, it's
not like this is a frequent occurrence. And if this particular issue
wasn't well understood and we had instead just reverted the offending
commits, then I would agree with you. But that's not the case. I'm very
confident in the stability, among other things, of blk-mq and the
drivers that utilize it. Most of the storage drivers are using it today,
and have been for a long time.

-- 
Jens Axboe
Re: [PATCH] blk-mq: fix corruption with direct issue
On Wed, Dec 05, 2018 at 10:59:21AM -0700, Jens Axboe wrote:
[ ... ]
>
>> Also, it seems to me that even with this problem fixed, blk-mq may not
>> be ready for primetime after all. With that in mind, maybe commit
>> d5038a13eca72 ("scsi: core: switch to scsi-mq by default") was a
>> bit premature. Should that be reverted ?
>
> I have to strongly disagree with that, the timing is just unfortunate.
> There are literally millions of machines running blk-mq/scsi-mq, and
> this is the only hiccup we've had. So I want to put this one to rest
> once and for all, there's absolutely no reason not to continue with
> what we've planned.
>

Guess we have to agree to disagree. In my opinion, for infrastructure as
critical as this, a single hiccup is one hiccup too many. Not that I
would describe this as a hiccup in the first place; I would describe it
as a major disaster.

I also don't think the timing is unfortunate. If anything, it proves my
point.

Thanks,
Guenter
Re: [PATCH] blk-mq: fix corruption with direct issue
On 12/5/18 10:55 AM, Guenter Roeck wrote: > On Tue, Dec 04, 2018 at 07:25:05PM -0700, Jens Axboe wrote: >> On 12/4/18 6:38 PM, Guenter Roeck wrote: >>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: If we attempt a direct issue to a SCSI device, and it returns BUSY, then we queue the request up normally. However, the SCSI layer may have already setup SG tables etc for this particular command. If we later merge with this request, then the old tables are no longer valid. Once we issue the IO, we only read/write the original part of the request, not the new state of it. This causes data corruption, and is most often noticed with the file system complaining about the just read data being invalid: [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256) because most of it is garbage... This doesn't happen from the normal issue path, as we will simply defer the request to the hardware queue dispatch list if we fail. Once it's on the dispatch list, we never merge with it. Fix this from the direct issue path by flagging the request as REQ_NOMERGE so we don't change the size of it before issue. See also: https://bugzilla.kernel.org/show_bug.cgi?id=201685 Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'") Signed-off-by: Jens Axboe >>> >>> Tested-by: Guenter Roeck >>> >>> ... on two systems affected by the problem. >> >> Thanks for testing! And for being persistent in reproducing and >> providing clues for getting this nailed. >> > > My pleasure. > > I see that there is some discussion about this patch. > > Unfortunately, everyone running a 4.19 or later kernel is at serious > risk of data corruption. Given that, if this patch doesn't make it > upstream for one reason or another, would it be possible to at least > revert the two patches introducing the problem until this is sorted > out for good ? 
> If this is not acceptable either, maybe mark blk-mq
> as broken ? After all, it _is_ broken. This is even more true if it
> turns out that a problem may exist since 4.1, as suggested in the
> discussion.

It is queued up, it'll go upstream later today.

> Also, it seems to me that even with this problem fixed, blk-mq may not
> be ready for primetime after all. With that in mind, maybe commit
> d5038a13eca72 ("scsi: core: switch to scsi-mq by default") was a
> bit premature. Should that be reverted ?

I have to strongly disagree with that, the timing is just unfortunate.
There are literally millions of machines running blk-mq/scsi-mq, and
this is the only hiccup we've had. So I want to put this one to rest
once and for all, there's absolutely no reason not to continue with what
we've planned.

-- 
Jens Axboe
Re: [PATCH] blk-mq: fix corruption with direct issue
On Tue, Dec 04, 2018 at 07:25:05PM -0700, Jens Axboe wrote: > On 12/4/18 6:38 PM, Guenter Roeck wrote: > > On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: > >> If we attempt a direct issue to a SCSI device, and it returns BUSY, then > >> we queue the request up normally. However, the SCSI layer may have > >> already setup SG tables etc for this particular command. If we later > >> merge with this request, then the old tables are no longer valid. Once > >> we issue the IO, we only read/write the original part of the request, > >> not the new state of it. > >> > >> This causes data corruption, and is most often noticed with the file > >> system complaining about the just read data being invalid: > >> > >> [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: > >> comm dpkg-query: bad extra_isize 24937 (inode size 256) > >> > >> because most of it is garbage... > >> > >> This doesn't happen from the normal issue path, as we will simply defer > >> the request to the hardware queue dispatch list if we fail. Once it's on > >> the dispatch list, we never merge with it. > >> > >> Fix this from the direct issue path by flagging the request as > >> REQ_NOMERGE so we don't change the size of it before issue. > >> > >> See also: > >> https://bugzilla.kernel.org/show_bug.cgi?id=201685 > >> > >> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case > >> of 'none'") > >> Signed-off-by: Jens Axboe > > > > Tested-by: Guenter Roeck > > > > ... on two systems affected by the problem. > > Thanks for testing! And for being persistent in reproducing and > providing clues for getting this nailed. > My pleasure. I see that there is some discussion about this patch. Unfortunately, everyone running a 4.19 or later kernel is at serious risk of data corruption. 
Given that, if this patch doesn't make it upstream for one reason or another, would it be possible to at least revert the two patches introducing the problem until this is sorted out for good ? If this is not acceptable either, maybe mark blk-mq as broken ? After all, it _is_ broken. This is even more true if it turns out that a problem may exist since 4.1, as suggested in the discussion. Also, it seems to me that even with this problem fixed, blk-mq may not be ready for primetime after all. With that in mind, maybe commit d5038a13eca72 ("scsi: core: switch to scsi-mq by default") was a bit premature. Should that be reverted ? Thanks, Guenter
Re: [PATCH 0/3] Unify the throttling code for wbt and io-latency
On 12/4/18 10:59 AM, Josef Bacik wrote:
> Originally when I wrote io-latency and the rq_qos code to provide a
> common base between wbt and io-latency I left out the throttling part.
> These were basically the same, but slightly different in both cases.
> The difference was enough, and the code wasn't too complicated, so I
> just copied it into io-latency and modified it for what I needed and
> carried on.
>
> Since then Jens has fixed a few issues with wakeups with the naive
> approach. Before, you could easily cycle waiters back to the end of the
> line if they were woken up without the ability to actually do their IO
> yet. But because this was only in wbt we didn't get it in io-latency.
>
> Resolve this by creating a unified interface for doing the throttling,
> and then just handle the differences between the two users with user
> specific callbacks. This allows us to have one place where we have to
> mess with wakeups, and gives each user the ability to be their own
> special snowflake.
>
> Jens, I based this on for-next from 12/03, let me know if you want a
> different base. I tested this with my blktests test. Thanks,

Applies fine to for-4.21/block, which is what I care about. Looks good
to me, applied, thanks.

-- 
Jens Axboe
Re: [PATCH 1/2] block: get rid of BLK_MAX_TIMEOUT
On Wed, 2018-12-05 at 23:37 +0800, Weiping Zhang wrote:
> @@ -130,7 +119,7 @@ void blk_add_timer(struct request *req)
>  	 * than an existing one, modify the timer. Round up to next nearest
>  	 * second.
>  	 */
> -	expiry = blk_rq_timeout(round_jiffies_up(expiry));
> +	expiry = round_jiffies_up(expiry);

If you had read the comment above this code, you would have known that
this patch does not do what you think it does, and additionally that it
introduces a regression.

Bart.
[PATCH 1/2] block: get rid of BLK_MAX_TIMEOUT
Since io_timeout was added to sysfs, the user can tune timeouts by that
attribute, so kill this limitation.

Signed-off-by: Weiping Zhang
---
 block/blk-timeout.c | 13 +
 block/blk.h         |  4
 2 files changed, 1 insertion(+), 16 deletions(-)

diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 124c26128bf6..6b2b0f7e5929 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -89,17 +89,6 @@ void blk_abort_request(struct request *req)
 }
 EXPORT_SYMBOL_GPL(blk_abort_request);
 
-unsigned long blk_rq_timeout(unsigned long timeout)
-{
-	unsigned long maxt;
-
-	maxt = round_jiffies_up(jiffies + BLK_MAX_TIMEOUT);
-	if (time_after(timeout, maxt))
-		timeout = maxt;
-
-	return timeout;
-}
-
 /**
  * blk_add_timer - Start timeout timer for a single request
  * @req: request that is about to start running.
@@ -130,7 +119,7 @@ void blk_add_timer(struct request *req)
 	 * than an existing one, modify the timer. Round up to next nearest
 	 * second.
 	 */
-	expiry = blk_rq_timeout(round_jiffies_up(expiry));
+	expiry = round_jiffies_up(expiry);
 
 	if (!timer_pending(&q->timeout) ||
 	    time_before(expiry, q->timeout.expires)) {
diff --git a/block/blk.h b/block/blk.h
index 848278c52030..f0d0a18fb276 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -7,9 +7,6 @@
 #include
 #include "blk-mq.h"
 
-/* Max future timer expiry for timeouts */
-#define BLK_MAX_TIMEOUT	(5 * HZ)
-
 #ifdef CONFIG_DEBUG_FS
 extern struct dentry *blk_debugfs_root;
 #endif
@@ -151,7 +148,6 @@ static inline bool bio_integrity_endio(struct bio *bio)
 }
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
-unsigned long blk_rq_timeout(unsigned long timeout);
 void blk_add_timer(struct request *req);
 
 bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
-- 
2.14.1
[PATCH 2/2] block: add BLK_DEF_TIMEOUT
Add an obvious definition for default request timeout.

Signed-off-by: Weiping Zhang
---
 block/blk-mq.c | 2 +-
 block/blk.h    | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 900550594651..53337d5c37af 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2845,7 +2845,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 		goto err_hctxs;
 
 	INIT_WORK(&q->timeout_work, blk_mq_timeout_work);
-	blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ);
+	blk_queue_rq_timeout(q, set->timeout ? set->timeout : BLK_DEF_TIMEOUT);
 
 	q->tag_set = set;
diff --git a/block/blk.h b/block/blk.h
index f0d0a18fb276..7905b952b7b3 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -7,6 +7,9 @@
 #include
 #include "blk-mq.h"
 
+/* the default request timeout */
+#define BLK_DEF_TIMEOUT (30 * HZ)
+
 #ifdef CONFIG_DEBUG_FS
 extern struct dentry *blk_debugfs_root;
 #endif
-- 
2.14.1
[PATCH 0/2] get rid of BLK_MAX_TIMEOUT
Get rid of BLK_MAX_TIMEOUT: we want io_timeout to control the maximum
timeout of a request, so remove this limitation.

Weiping Zhang (2):
  block: get rid of BLK_MAX_TIMEOUT
  block: add BLK_DEF_TIMEOUT

 block/blk-mq.c      |  2 +-
 block/blk-timeout.c | 13 +
 block/blk.h         |  5 ++---
 3 files changed, 4 insertions(+), 16 deletions(-)

-- 
2.14.1
[PATCH 2/2] blktests: block/025: an io.latency test
This is a test to verify io.latency is working properly. It does this by
first running a fio job by itself to figure out how fast it runs. Then we
calculate some thresholds, set up 2 cgroups, a fast and a slow cgroup,
and then run the same job in both groups simultaneously. We should see
the slow group get throttled until the first cgroup is able to finish,
and then the slow cgroup will be allowed to finish.

Signed-off-by: Josef Bacik
---
 tests/block/025     | 133
 tests/block/025.out |   1 +
 2 files changed, 134 insertions(+)
 create mode 100644 tests/block/025
 create mode 100644 tests/block/025.out

diff --git a/tests/block/025 b/tests/block/025
new file mode 100644
index ..cdaa35c5e335
--- /dev/null
+++ b/tests/block/025
@@ -0,0 +1,133 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-3.0+
+#
+# Test io.latency to make sure it's protecting the higher priority group
+# properly.
+
+. tests/block/rc
+
+DESCRIPTION="test the io.latency interface to make sure it's working right"
+
+requires() {
+	_have_cgroup2_controller_file io io.latency && _have_fio && \
+		_have_program jq
+}
+
+_fio_results_key() {
+	_job=$1
+	_key=$2
+	_resultfile=$3
+
+	jq '.jobs[] | select(.jobname == "'${_job}'") | .'${_key} ${_resultfile}
+}
+
+test_device() {
+	local fio_config_single fio_config_double fio_results fio_args qd
+
+	fio_config_single=${TMPDIR}/single.fio
+	fio_config_double=${TMPDIR}/double.fio
+	fio_results=${TMPDIR}/025.json
+	fio_args="--output-format=json --output=${fio_results}"
+	qd=$(cat ${TEST_DEV_SYSFS}/queue/nr_requests)
+
+	cat << EOF > "${fio_config_single}"
+	[fast]
+	filename=${TEST_DEV}
+	direct=1
+	allrandrepeat=1
+	readwrite=randrw
+	size=4G
+	ioengine=libaio
+	iodepth=${qd}
+	fallocate=none
+	randseed=12345
+EOF
+
+	cat << EOF > "${fio_config_double}"
+	[global]
+	filename=${TEST_DEV}
+	direct=1
+	allrandrepeat=1
+	readwrite=randrw
+	size=4G
+	ioengine=libaio
+	iodepth=${qd}
+	fallocate=none
+	randseed=12345
+
+	[fast]
+	cgroup=blktests/fast
+
+	[slow]
+	cgroup=blktests/slow
+EOF
+
+	# We run the test once so we have an idea of how fast this workload will
+	# go with nobody else doing IO on the device.
+	if ! fio ${fio_args} ${fio_config_single}; then
+		echo "fio exited with status $?"
+		return 1
+	fi
+
+	_time_taken=$(_fio_results_key fast job_runtime ${fio_results})
+	if [ "${_time_taken}" = "" ]; then
+		echo "fio doesn't report job_runtime"
+		return 1
+	fi
+
+	echo "normal time taken ${_time_taken}" >> $FULL
+
+	# There's no way to predict how the two workloads are going to affect
+	# each other, so we want to set thresholds to something reasonable so
+	# we can verify io.latency is doing something. This means we set 15%
+	# for the fast cgroup, just to give us enough wiggle room as throttling
+	# doesn't happen immediately. But if we have a super fast disk we could
+	# run both groups really fast and make it under our fast threshold, so
+	# we need to set a threshold for the slow group at 50%. We assume that
+	# if it was faster than 50% of the fast threshold then we probably
+	# didn't throttle and we can assume io.latency is broken.
+	_fast_thresh=$((${_time_taken} + ${_time_taken} * 15 / 100))
+	_slow_thresh=$((${_time_taken} + ${_time_taken} * 50 / 100))
+	echo "fast threshold time is ${_fast_thresh}" >> $FULL
+	echo "slow threshold time is ${_slow_thresh}" >> $FULL
+
+	# Create the cgroup files
+	_dir=$(_cgroup2_base_dir)/blktests
+	echo "+io" > ${_dir}/cgroup.subtree_control
+	mkdir ${_dir}/fast
+	mkdir ${_dir}/slow
+
+	# We set the target to 1usec because we could have a fast device that is
+	# capable of remarkable IO latencies that would skew the test. It needs
+	# to be low enough that we do actually throttle the slow group,
+	# otherwise the test will fail when there's nothing wrong.
+	_major=$((0x$(stat -c "%t" ${TEST_DEV})))
+	_minor=$((0x$(stat -c "%T" ${TEST_DEV})))
+	echo "${_major}:${_minor} is our device" >> $FULL
+	if ! echo "${_major}:${_minor} target=1" > ${_dir}/fast/io.latency; then
+		echo "Failed to set our latency target"
+		return 1
+	fi
+
+	if ! fio ${fio_args} ${fio_config_double}; then
+		echo "fio exited with status $?"
+		return 1
+	fi
+
+	_fast_time=$(_fio_results_key fast job_runtime ${fio_results})
+	echo "Fast time ${_fast_time}" >> $FULL
+	_slow_time=$(_fio_results_key slow job_runtime ${fio_results})
+	echo
[PATCH 1/2] blktests: add cgroup2 infrastructure
In order to test io.latency and other cgroup related things we need some
supporting helpers to setup and tear down cgroup2. This adds support for
checking that we can even configure cgroup2 things, set them up if need
be, and then add the cleanup stuff to the main cleanup function so
everything is always in a clean state.

Signed-off-by: Josef Bacik
---
 check     |  2 ++
 common/rc | 48
 2 files changed, 50 insertions(+)

diff --git a/check b/check
index ebd87c097e25..1c9dbc518fa1 100755
--- a/check
+++ b/check
@@ -294,6 +294,8 @@ _cleanup() {
 		done
 		unset RESTORE_CPUS_ONLINE
 	fi
+
+	_cleanup_cgroup2
 }
 
 _call_test() {
diff --git a/common/rc b/common/rc
index 8a892bcd5fde..a785f2329687 100644
--- a/common/rc
+++ b/common/rc
@@ -202,3 +202,51 @@ _test_dev_in_hotplug_slot() {
 _filter_xfs_io_error() {
 	sed -e 's/^\(.*\)64\(: .*$\)/\1\2/'
 }
+
+_cgroup2_base_dir()
+{
+	grep cgroup2 /proc/mounts | awk '{ print $2 }'
+}
+
+_cleanup_cgroup2()
+{
+	_dir=$(_cgroup2_base_dir)/blktests
+	[ -d "${_dir}" ] || return
+
+	for i in $(find ${_dir} -type d | tac)
+	do
+		rmdir $i
+	done
+}
+
+_have_cgroup2()
+{
+	if ! grep -q 'cgroup2' /proc/mounts; then
+		SKIP_REASON="This test requires cgroup2"
+		return 1
+	fi
+	return 0
+}
+
+_have_cgroup2_controller_file()
+{
+	_have_cgroup2 || return 1
+
+	_controller=$1
+	_file=$2
+	_dir=$(_cgroup2_base_dir)
+
+	if ! grep -q ${_controller} ${_dir}/cgroup.controllers; then
+		SKIP_REASON="No support for ${_controller} cgroup controller"
+		return 1
+	fi
+
+	mkdir ${_dir}/blktests
+	echo "+${_controller}" > ${_dir}/cgroup.subtree_control
+	if [ ! -f ${_dir}/blktests/${_file} ]; then
+		_cleanup_cgroup2
+		SKIP_REASON="Cgroup file ${_file} doesn't exist"
+		return 1
+	fi
+	return 0
+}
-- 
2.14.3
[PATCH 0/2][V2] io.latency test for blktests
v1->v2:
- dropped my python library, TIL about jq.
- fixed the spelling mistakes in the test.

-- Original message --

This patchset is to add a test to verify io.latency is working properly,
and to add all the supporting code to run that test. First is the
cgroup2 infrastructure, which is fairly straightforward. It just
verifies we have cgroup2, and gives us the helpers to check and make
sure we have the right controllers in place for the test.

The second patch brings over some python scripts I put in xfstests for
parsing the fio json output. I looked at the existing fio performance
stuff in blktests, but we only capture bw stuff, which is wonky with
this particular test because once the fast group is finished the slow
group is allowed to go as fast as it wants. So I needed this to pull out
actual job time spent. This will give us flexibility to pull out other
fio performance data in the future.

The final patch is the test itself. It simply runs a job by itself to
get a baseline view of the disk performance. Then it creates 2 cgroups,
one fast and one slow, and runs the same job simultaneously in both
groups. The result should be that the fast group takes just slightly
longer time than the baseline (I use a 15% threshold to be safe), and
that the slow one takes considerably longer.

Thanks,
Josef
Re: [PATCH] blk-mq: fix corruption with direct issue
On 12/5/18 7:41 AM, Christoph Hellwig wrote: > On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: >> If we attempt a direct issue to a SCSI device, and it returns BUSY, then >> we queue the request up normally. However, the SCSI layer may have >> already setup SG tables etc for this particular command. If we later >> merge with this request, then the old tables are no longer valid. Once >> we issue the IO, we only read/write the original part of the request, >> not the new state of it. >> >> This causes data corruption, and is most often noticed with the file >> system complaining about the just read data being invalid: > > Looks good as the first quick fix: > > Reviewed-by: Christoph Hellwig > > But as we discussed I think we need to be even more careful. including > the only read and write check below in the thread or some higher level > approach. I agree, I'm reviewing the patches from Jianchao now and will do something cleaner on top of that. -- Jens Axboe
Re: [PATCH v3] block: add documentation for io_timeout
Weiping Zhang 于2018年12月5日周三 下午10:49写道: > > Christoph Hellwig 于2018年12月5日周三 下午10:40写道: > > > > Can you please also send a patch to not show this attribute for > > drivers without a timeout handler? Thanks! > > Is there a simple way do that ? Shall we return -ENOTSUPP when user read/write this attribute when driver has no timeout handler ? > Do you means checking q->mq_ops->timeout is NULL or not ? > > static void blk_mq_rq_timed_out(struct request *req, bool reserved) > { > req->rq_flags |= RQF_TIMED_OUT; > If it's NULL, only re-start a timer for this request in blk_add_timer. > Is there some defect ? if show this attribute and the driver doesn't > support timeout handler ? > > if (req->q->mq_ops->timeout) { > enum blk_eh_timer_return ret; > > ret = req->q->mq_ops->timeout(req, reserved); > if (ret == BLK_EH_DONE) > return; > WARN_ON_ONCE(ret != BLK_EH_RESET_TIMER); > } > > blk_add_timer(req); > } > > > > On Wed, Dec 05, 2018 at 10:17:06PM +0800, Weiping Zhang wrote: > > > Add documentation for /sys/block//queue/io_timeout. > > > > > > Signed-off-by: Weiping Zhang > > > --- > > > Documentation/ABI/testing/sysfs-block | 10 ++ > > > Documentation/block/queue-sysfs.txt | 7 +++ > > > 2 files changed, 17 insertions(+) > > > > > > diff --git a/Documentation/ABI/testing/sysfs-block > > > b/Documentation/ABI/testing/sysfs-block > > > index dea212db9df3..f254a374710a 100644 > > > --- a/Documentation/ABI/testing/sysfs-block > > > +++ b/Documentation/ABI/testing/sysfs-block > > > @@ -271,3 +271,13 @@ Description: > > > size of 512B sectors of the zones of the device, with > > > the eventual exception of the last zone of the device > > > which may be smaller. > > > + > > > +What:/sys/block//queue/io_timeout > > > +Date:November 2018 > > > +Contact: Weiping Zhang > > > +Description: > > > + io_timeout is a request’s timeouts at block layer in > > > + milliseconds. 
When the underlying driver starts processing > > > + a request, the generic block layer will start a timer, if > > > + this request cannot be completed in io_timeout milliseconds, > > > + a timeout event will occur. > > > diff --git a/Documentation/block/queue-sysfs.txt > > > b/Documentation/block/queue-sysfs.txt > > > index 2c1e67058fd3..f0c9bbce73fd 100644 > > > --- a/Documentation/block/queue-sysfs.txt > > > +++ b/Documentation/block/queue-sysfs.txt > > > @@ -67,6 +67,13 @@ If set to a value larger than 0, the kernel will put > > > the process issuing > > > IO to sleep for this amont of microseconds before entering classic > > > polling. > > > > > > +io_timeout (RW) > > > +--- > > > +This is a request’s timeouts at block layer in milliseconds. When the > > > +underlying driver starts processing a request, the generic block layer > > > +will start a timer, if this request cannot be completed in io_timeout > > > +milliseconds, a timeout event will occur. > > > + > > > iostats (RW) > > > - > > > This file is used to control (on/off) the iostats accounting of the > > > -- > > > 2.14.1 > > > > > ---end quoted text---
Re: [PATCH v3] block: add documentation for io_timeout
Christoph Hellwig wrote on Wed, Dec 5, 2018, 10:40 PM:
>
> Can you please also send a patch to not show this attribute for
> drivers without a timeout handler? Thanks!
>

Do you mean checking whether q->mq_ops->timeout is NULL or not?

static void blk_mq_rq_timed_out(struct request *req, bool reserved)
{
	req->rq_flags |= RQF_TIMED_OUT;

If it's NULL, we only re-start a timer for this request in
blk_add_timer. Is there some defect if we show this attribute and the
driver doesn't support a timeout handler?

	if (req->q->mq_ops->timeout) {
		enum blk_eh_timer_return ret;

		ret = req->q->mq_ops->timeout(req, reserved);
		if (ret == BLK_EH_DONE)
			return;
		WARN_ON_ONCE(ret != BLK_EH_RESET_TIMER);
	}

	blk_add_timer(req);
}

> On Wed, Dec 05, 2018 at 10:17:06PM +0800, Weiping Zhang wrote:
> > Add documentation for /sys/block/<disk>/queue/io_timeout.
> >
> > Signed-off-by: Weiping Zhang
> > ---
> >  Documentation/ABI/testing/sysfs-block | 10 ++
> >  Documentation/block/queue-sysfs.txt   |  7 +++
> >  2 files changed, 17 insertions(+)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
> > index dea212db9df3..f254a374710a 100644
> > --- a/Documentation/ABI/testing/sysfs-block
> > +++ b/Documentation/ABI/testing/sysfs-block
> > @@ -271,3 +271,13 @@ Description:
> >  		size of 512B sectors of the zones of the device, with
> >  		the eventual exception of the last zone of the device
> >  		which may be smaller.
> > +
> > +What:		/sys/block/<disk>/queue/io_timeout
> > +Date:		November 2018
> > +Contact:	Weiping Zhang
> > +Description:
> > +		io_timeout is a request’s timeout at the block layer, in
> > +		milliseconds. When the underlying driver starts processing
> > +		a request, the generic block layer will start a timer; if
> > +		this request cannot be completed in io_timeout milliseconds,
> > +		a timeout event will occur.
> > diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt
> > index 2c1e67058fd3..f0c9bbce73fd 100644
> > --- a/Documentation/block/queue-sysfs.txt
> > +++ b/Documentation/block/queue-sysfs.txt
> > @@ -67,6 +67,13 @@ If set to a value larger than 0, the kernel will put the process issuing
> >  IO to sleep for this amount of microseconds before entering classic
> >  polling.
> >
> > +io_timeout (RW)
> > +---------------
> > +This is a request’s timeout at the block layer, in milliseconds. When
> > +the underlying driver starts processing a request, the generic block
> > +layer will start a timer; if this request cannot be completed in
> > +io_timeout milliseconds, a timeout event will occur.
> > +
> >  iostats (RW)
> >  ------------
> >  This file is used to control (on/off) the iostats accounting of the
> > --
> > 2.14.1
>
> ---end quoted text---
Re: [PATCH] blk-mq: fix corruption with direct issue
On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
> we queue the request up normally. However, the SCSI layer may have
> already setup SG tables etc for this particular command. If we later
> merge with this request, then the old tables are no longer valid. Once
> we issue the IO, we only read/write the original part of the request,
> not the new state of it.
>
> This causes data corruption, and is most often noticed with the file
> system complaining about the just read data being invalid:

Looks good as the first quick fix:

Reviewed-by: Christoph Hellwig

But as we discussed I think we need to be even more careful, including
the only read and write check below in the thread or some higher level
approach.
[PATCH v3] block: add documentation for io_timeout
Add documentation for /sys/block//queue/io_timeout. Signed-off-by: Weiping Zhang --- Documentation/ABI/testing/sysfs-block | 10 ++ Documentation/block/queue-sysfs.txt | 7 +++ 2 files changed, 17 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block index dea212db9df3..f254a374710a 100644 --- a/Documentation/ABI/testing/sysfs-block +++ b/Documentation/ABI/testing/sysfs-block @@ -271,3 +271,13 @@ Description: size of 512B sectors of the zones of the device, with the eventual exception of the last zone of the device which may be smaller. + +What: /sys/block//queue/io_timeout +Date: November 2018 +Contact: Weiping Zhang +Description: + io_timeout is a request’s timeouts at block layer in + milliseconds. When the underlying driver starts processing + a request, the generic block layer will start a timer, if + this request cannot be completed in io_timeout milliseconds, + a timeout event will occur. diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt index 2c1e67058fd3..f0c9bbce73fd 100644 --- a/Documentation/block/queue-sysfs.txt +++ b/Documentation/block/queue-sysfs.txt @@ -67,6 +67,13 @@ If set to a value larger than 0, the kernel will put the process issuing IO to sleep for this amont of microseconds before entering classic polling. +io_timeout (RW) +--- +This is a request’s timeouts at block layer in milliseconds. When the +underlying driver starts processing a request, the generic block layer +will start a timer, if this request cannot be completed in io_timeout +milliseconds, a timeout event will occur. + iostats (RW) - This file is used to control (on/off) the iostats accounting of the -- 2.14.1
Re: [PATCH 15/26] aio: support for IO polling
On 12/5/18 2:56 AM, Benny Halevy wrote:
>> +static int aio_iopoll_getevents(struct kioctx *ctx,
>> +		struct io_event __user *event,
>> +		unsigned int *nr_events, long min, long max)
>> +{
>> +	struct aio_kiocb *iocb;
>> +	int to_poll, polled, ret;
>> +
>> +	/*
>> +	 * Check if we already have done events that satisfy what we need
>> +	 */
>> +	if (!list_empty(&ctx->poll_completing)) {
>> +		ret = aio_iopoll_reap(ctx, event, nr_events, max);
>> +		if (ret < 0)
>> +			return ret;
>> +		/* if min is zero, still go through a poll check */
>
> A function-level comment about that would be more visible.

Yes, that might be a better idea.

>> +		if (min && *nr_events >= min)
>> +			return 0;
>
> if ((min && *nr_events >= min) || *nr_events >= max) ?
>
> Otherwise, if poll_completing or poll_submitted are not empty,
> besides getting to the "Shouldn't happen" case in aio_iopoll_reap,
> it looks like we're gonna waste some cycles and never call f_op->iopoll

Agree, that would be a better place to catch it. I'll make those two
changes, thanks!

>> +
>> +	/*
>> +	 * Find up to 'max' worth of events to poll for, including the
>> +	 * events we already successfully polled
>> +	 */
>> +	polled = to_poll = 0;
>> +	list_for_each_entry(iocb, &ctx->poll_completing, ki_list) {
>> +		/*
>> +		 * Poll for needed events with spin == true, anything after
>> +		 * that we just check if we have more, up to max.
>> +		 */
>> +		bool spin = polled + *nr_events >= min;
>
> I'm confused. Shouldn't spin == true if polled + *nr_events < min?

Heh, that does look off, and it is. I think the correct condition is:

	bool spin = !polled || *nr_events < min;

and I'm not sure why that got broken. I just tested it, slight
performance win as expected correcting this. It ties in with the above
nr_events check - it's perfectly valid to pass min_nr == 0 and just do
a poll check. Conversely, if we already spun, just do non-spin checks
on followup poll checks.
>> +static int __aio_iopoll_check(struct kioctx *ctx, struct io_event __user *event,
>> +		unsigned int *nr_events, long min_nr, long max_nr)
>> +{
>> +	int ret = 0;
>> +
>> +	while (!*nr_events || !need_resched()) {
>> +		int tmin = 0;
>> +
>> +		if (*nr_events < min_nr)
>> +			tmin = min_nr - *nr_events;
>
> Otherwise, shouldn't the function just return 0?
>
> Or perhaps you explicitly want to go through the tmin==0 case
> in aio_iopoll_getevents if *nr_events == min_nr (or min_nr==0)?

No, we need to go through poll, if min_nr == 0, but only if min_nr == 0
and !nr_events. But we only get here for nr_events != 0, so should be
fine as-is.

Thanks for your review, very useful! I'll make the above changes
right now.

-- 
Jens Axboe
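The corrected spin condition from this exchange can be exercised with a small standalone model (this is a sketch of the logic, not the kernel code; here `polled` simply counts prior poll attempts): busy-spin on the first check or while fewer than `min_nr` events have been reaped, then fall back to cheap non-spinning checks up to `max`:

```python
def spin_decisions(completions_per_poll, min_nr):
    """Model the corrected condition `spin = !polled || *nr_events < min`
    across successive poll iterations. completions_per_poll[i] is how many
    events iteration i reaps."""
    nr_events = 0
    polled = 0
    decisions = []
    for found in completions_per_poll:
        spin = (polled == 0) or (nr_events < min_nr)
        decisions.append(spin)
        polled += 1
        nr_events += found
    return decisions

# Spin until min_nr events are in; later checks are non-spinning.
assert spin_decisions([0, 2, 1], min_nr=2) == [True, True, False]
# min_nr == 0 is valid: it still does exactly one spinning poll check.
assert spin_decisions([0, 0], min_nr=0) == [True, False]
```

This mirrors the two points made above: a `min_nr` of zero still goes through one poll check, and once we have spun, followup checks only look for already-completed events.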
Re: [PATCH 15/26] aio: support for IO polling
On Tue, 2018-12-04 at 16:37 -0700, Jens Axboe wrote:
> Add polled variants of PREAD/PREADV and PWRITE/PWRITEV. These act
> like their non-polled counterparts, except we expect to poll for
> completion of them. The polling happens at io_getevent() time, and
> works just like non-polled IO.
>
> To setup an io_context for polled IO, the application must call
> io_setup2() with IOCTX_FLAG_IOPOLL as one of the flags. It is illegal
> to mix and match polled and non-polled IO on an io_context.
>
> Polled IO doesn't support the user mapped completion ring. Events
> must be reaped through the io_getevents() system call. For non-irq
> driven poll devices, there's no way to support completion reaping
> from userspace by just looking at the ring. The application itself
> is the one that pulls completion entries.
>
> Signed-off-by: Jens Axboe
> ---
>  fs/aio.c                     | 393 +++
>  include/uapi/linux/aio_abi.h |   3 +
>  2 files changed, 360 insertions(+), 36 deletions(-)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index bb6f07ca6940..5d34317c2929 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -153,6 +153,18 @@ struct kioctx {
>  		atomic_t	reqs_available;
>  	} ____cacheline_aligned_in_smp;
>
> +	/* iopoll submission state */
> +	struct {
> +		spinlock_t poll_lock;
> +		struct list_head poll_submitted;
> +	} ____cacheline_aligned_in_smp;
> +
> +	/* iopoll completion state */
> +	struct {
> +		struct list_head poll_completing;
> +		struct mutex getevents_lock;
> +	} ____cacheline_aligned_in_smp;
> +
>  	struct {
>  		spinlock_t	ctx_lock;
>  		struct list_head active_reqs;	/* used for cancellation */
> @@ -205,14 +217,27 @@ struct aio_kiocb {
>  	__u64		ki_user_data;	/* user's data for completion */
>
>  	struct list_head	ki_list;	/* the aio core uses this
> -					 * for cancellation */
> +					 * for cancellation, or for
> +					 * polled IO */
> +
> +	unsigned long	ki_flags;
> +#define IOCB_POLL_COMPLETED	0	/* polled IO has completed */
> +#define IOCB_POLL_EAGAIN	1	/* polled submission got EAGAIN */
> +
>  	refcount_t	ki_refcnt;
>
> -	/*
> -	 * If the aio_resfd field of the userspace iocb is not zero,
> -	 * this is the underlying eventfd context to deliver events to.
> -	 */
> -	struct eventfd_ctx	*ki_eventfd;
> +	union {
> +		/*
> +		 * If the aio_resfd field of the userspace iocb is not zero,
> +		 * this is the underlying eventfd context to deliver events to.
> +		 */
> +		struct eventfd_ctx	*ki_eventfd;
> +
> +		/*
> +		 * For polled IO, stash completion info here
> +		 */
> +		struct io_event		ki_ev;
> +	};
>  };
>
>  /*------ sysctl variables----*/
> @@ -233,6 +258,7 @@ static const unsigned int iocb_page_shift =
>  				ilog2(PAGE_SIZE / sizeof(struct iocb));
>
>  static void aio_useriocb_unmap(struct kioctx *);
> +static void aio_iopoll_reap_events(struct kioctx *);
>
>  static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
>  {
> @@ -471,11 +497,15 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
>  	int i;
>  	struct file *file;
>
> -	/* Compensate for the ring buffer's head/tail overlap entry */
> -	nr_events += 2;	/* 1 is required, 2 for good luck */
> -
> +	/*
> +	 * Compensate for the ring buffer's head/tail overlap entry.
> +	 * IO polling doesn't require any io event entries
> +	 */
>  	size = sizeof(struct aio_ring);
> -	size += sizeof(struct io_event) * nr_events;
> +	if (!(ctx->flags & IOCTX_FLAG_IOPOLL)) {
> +		nr_events += 2;	/* 1 is required, 2 for good luck */
> +		size += sizeof(struct io_event) * nr_events;
> +	}
>
>  	nr_pages = PFN_UP(size);
>  	if (nr_pages < 0)
> @@ -758,6 +788,11 @@ static struct kioctx *io_setup_flags(unsigned long ctxid,
>
>  	INIT_LIST_HEAD(&ctx->active_reqs);
>
> +	spin_lock_init(&ctx->poll_lock);
> +	INIT_LIST_HEAD(&ctx->poll_submitted);
> +	INIT_LIST_HEAD(&ctx->poll_completing);
> +	mutex_init(&ctx->getevents_lock);
> +
>  	if (percpu_ref_init(&ctx->users, free_ioctx_users, 0, GFP_KERNEL))
>  		goto err;
>
> @@ -829,11 +864,15 @@ static int kill_ioctx(struct mm_struct *mm, struct kioctx *ctx,
>  {
>  	struct kioctx_table *table;
>
> +	mutex_lock(&ctx->getevents_lock);
>  	spin_lock(&mm->ioctx_lock);
>  	if (atomic_xchg(&ctx->dead, 1)) {
>  		spin_unlock(&mm->ioctx_lock);
> +		mutex_unlock(&ctx->getevents_lock);
>  		return -EINVAL;
>  	}
> +
Re: [PATCH 2/3] blktests: add python scripts for parsing fio json output
On 04/12/2018 18:47, Josef Bacik wrote:
> +For example
> +	"write" : {
> +		"io_bytes" : 313360384,
> +		"bw" : 1016,
> +	}
> +
> +Get's collapsed to
> +
> +"write_io_bytes" : 313360384,
> +"write_bw": 1016,

I'd definitely prefer to not pull in python as a dependency.

The above example can be done in shell (with the jq utility) as follows:

jthumshirn@linux-x5ow:~$ cat test.json |\
	jq '. | {write_io_bytes: .write.io_bytes, write_bw: .write.bw}'
{
  "write_io_bytes": 313360384,
  "write_bw": 1016
}

And on a plus side jq is small as well:

jthumshirn@linux-x5ow:~$ zypper info jq
Loading repository data...
Reading installed packages...

Information for package jq:
---------------------------
Repository     : openSUSE:Leap:15.0
Name           : jq
Version        : 1.5-lp150.1.10
Arch           : x86_64
Vendor         : openSUSE
Installed Size : 93.4 KiB
Installed      : Yes
[...]

-- 
Johannes Thumshirn                                          SUSE Labs Filesystems
jthumsh...@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
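For reference, the key-collapse transformation from Josef's example is also only a few lines of dependency-free Python; this is a generic sketch of the idea, not the actual script from the patch under discussion:

```python
def collapse(obj, prefix=""):
    """Collapse nested JSON objects into flat keys joined with underscores,
    e.g. {"write": {"bw": 1016}} -> {"write_bw": 1016}."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse, carrying the accumulated key path as the prefix.
            flat.update(collapse(value, name))
        else:
            flat[name] = value
    return flat

fio = {"write": {"io_bytes": 313360384, "bw": 1016}}
assert collapse(fio) == {"write_io_bytes": 313360384, "write_bw": 1016}
```

Unlike the fixed-path jq expression above, this recurses over arbitrary nesting; a generic jq equivalent is possible too, at the cost of a less readable filter.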