Re: [PATCH] block/025: test discard sector alignment and sector size overflow

2018-12-05 Thread Omar Sandoval
On Thu, Nov 15, 2018 at 12:00:17PM +0800, Ming Lei wrote:
> This test covers the following two issues:
> 
> 1) the discard sector needs to be aligned with the logical block size
> 
> 2) make sure 'sector_t' instead of 'unsigned int' is used when comparing
> with discard sector size
> 
> Signed-off-by: Ming Lei 
> ---
>  tests/block/025 | 37 +
>  tests/block/025.out |  2 ++
>  2 files changed, 39 insertions(+)
>  create mode 100755 tests/block/025
>  create mode 100644 tests/block/025.out

Applied, thanks Ming.
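
As a side note on issue 2) above: the class of bug being exercised is the
usual truncation that happens when a sector count that can exceed 32 bits is
passed through an 'unsigned int'. A reduced, self-contained illustration
(not code taken from the kernel or from the test itself) is:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t sector_t;	/* 64-bit sector count, as on 64-bit kernels */

int main(void)
{
	sector_t nr_sects = 8ULL << 32;		/* sector count larger than 2^32 */
	unsigned int truncated = nr_sects;	/* high bits silently dropped, becomes 0 */

	printf("%llu vs %u\n", (unsigned long long)nr_sects, truncated);
	return 0;
}

Comparisons done against the truncated value can then pass (or fail) for the
wrong reason, which is what the test is meant to catch.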


Re: [GIT PULL] Block fixes for 4.20-rc6

2018-12-05 Thread pr-tracker-bot
The pull request you sent on Wed, 5 Dec 2018 13:25:47 -0700:

> git://git.kernel.dk/linux-block.git tags/for-linus-20181205

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/4eaaa2b99e30305f4bc677f4abfe56c1f8b39670

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker


[GIT PULL] Block fixes for 4.20-rc6

2018-12-05 Thread Jens Axboe
Hi Linus,

A bit earlier in the week than usual, but there's a fix here that should
go in sooner rather than later. Under a combination of circumstances, the
direct issue path in blk-mq could corrupt data. This wasn't easy to hit,
but the ones that are affected by it seem to hit it pretty easily. Full
explanation in the patch. None of the regular filesystem and storage
testing has triggered it, even though it's been around since 4.19-rc1.

Outside of that, a whitelist trim tweak for certain Samsung devices in
libata.

Please pull!


  git://git.kernel.dk/linux-block.git tags/for-linus-20181205



Jens Axboe (1):
  blk-mq: fix corruption with direct issue

Juha-Matti Tilli (1):
  libata: whitelist all SAMSUNG MZ7KM* solid-state disks

 block/blk-mq.c| 26 +-
 drivers/ata/libata-core.c |  1 +
 2 files changed, 26 insertions(+), 1 deletion(-)

-- 
Jens Axboe



Re: [PATCH] blk-mq: fix corruption with direct issue

2018-12-05 Thread Jens Axboe
On 12/5/18 12:09 PM, Guenter Roeck wrote:
> On Wed, Dec 05, 2018 at 10:59:21AM -0700, Jens Axboe wrote:
> [ ... ]
>>
>>> Also, it seems to me that even with this problem fixed, blk-mq may not
>>> be ready for primetime after all. With that in mind, maybe commit
>>> d5038a13eca72 ("scsi: core: switch to scsi-mq by default") was a
>>> bit premature. Should that be reverted ?
>>
>> I have to strongly disagree with that, the timing is just unfortunate.
>> There are literally millions of machines running blk-mq/scsi-mq, and
> this is the only hiccup we've had. So I want to put this one to rest
>> once and for all, there's absolutely no reason not to continue with
>> what we've planned.
>>
> 
> Guess we have to agree to disagree. In my opinion, for infrastructure
> as critical as this, a single hiccup is one hiccup too many. Not that
> I would describe this as a hiccup in the first place; I would describe
> it as a major disaster.

Don't get me wrong, I don't mean to use hiccup in a diminishing fashion,
this was by all means a disaster for the ones hit by it. But if you look
at the scope of how many folks are using blk-mq/scsi-mq and have been
for years, we're really talking about a tiny tiny percentage here. This
could just as easily have happened with the old IO stack. The bug was a
freak accident, and even with full knowledge of why it happened, I'm
still having an extraordinarily hard time triggering it at will on my
test boxes. As with any disaster, it's usually a combination of multiple
things that go wrong, and this one is no different. The folks that hit
this generally hit it pretty easily, and (by far) the majority would
never hit it.

Bugs happen, whether you like it or not. They happen in file systems,
memory management, and they happen in storage. Things are continually
developed, and that sometimes introduces bugs. We do our best to ensure
that doesn't happen, but sometimes freak accidents like this happen. I
think my track record of decades of work speaks for itself there, it's
not like this is a frequent occurrence. And if this particular issue
wasn't well understood and instead just reverted the offending commits,
then I would agree with you. But that's not the case. I'm very confident
in the stability, among other things, of blk-mq and the drivers that
utilize it. Most of the storage drivers are using it today, and have
been for a long time.

-- 
Jens Axboe



Re: [PATCH] blk-mq: fix corruption with direct issue

2018-12-05 Thread Guenter Roeck
On Wed, Dec 05, 2018 at 10:59:21AM -0700, Jens Axboe wrote:
[ ... ]
> 
> > Also, it seems to me that even with this problem fixed, blk-mq may not
> > be ready for primetime after all. With that in mind, maybe commit
> > d5038a13eca72 ("scsi: core: switch to scsi-mq by default") was a
> > bit premature. Should that be reverted ?
> 
> I have to strongly disagree with that, the timing is just unfortunate.
> There are literally millions of machines running blk-mq/scsi-mq, and
> this is the only hiccup we've had. So I want to put this one to rest
> once and for all, there's absolutely no reason not to continue with
> what we've planned.
> 

Guess we have to agree to disagree. In my opinion, for infrastructure as
critical as this, a single hiccup is one hiccup too many. Not that I would
describe this as a hiccup in the first place; I would describe it as a major
disaster. I also don't think the timing is unfortunate. If anything,
it proves my point.

Thanks,
Guenter


Re: [PATCH] blk-mq: fix corruption with direct issue

2018-12-05 Thread Jens Axboe
On 12/5/18 10:55 AM, Guenter Roeck wrote:
> On Tue, Dec 04, 2018 at 07:25:05PM -0700, Jens Axboe wrote:
>> On 12/4/18 6:38 PM, Guenter Roeck wrote:
>>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
>>>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
>>>> we queue the request up normally. However, the SCSI layer may have
>>>> already setup SG tables etc for this particular command. If we later
>>>> merge with this request, then the old tables are no longer valid. Once
>>>> we issue the IO, we only read/write the original part of the request,
>>>> not the new state of it.
>>>>
>>>> This causes data corruption, and is most often noticed with the file
>>>> system complaining about the just read data being invalid:
>>>>
>>>> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
>>>>
>>>> because most of it is garbage...
>>>>
>>>> This doesn't happen from the normal issue path, as we will simply defer
>>>> the request to the hardware queue dispatch list if we fail. Once it's on
>>>> the dispatch list, we never merge with it.
>>>>
>>>> Fix this from the direct issue path by flagging the request as
>>>> REQ_NOMERGE so we don't change the size of it before issue.
>>>>
>>>> See also:
>>>>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
>>>>
>>>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
>>>> Signed-off-by: Jens Axboe 
>>>
>>> Tested-by: Guenter Roeck 
>>>
>>> ... on two systems affected by the problem.
>>
>> Thanks for testing! And for being persistent in reproducing and
>> providing clues for getting this nailed.
>>
> 
> My pleasure.
> 
> I see that there is some discussion about this patch.
> 
> Unfortunately, everyone running a 4.19 or later kernel is at serious
> risk of data corruption. Given that, if this patch doesn't make it
> upstream for one reason or another, would it be possible to at least
> revert the two patches introducing the problem until this is sorted
> out for good ? If this is not acceptable either, maybe mark blk-mq
> as broken ? After all, it _is_ broken. This is even more true if it
> turns out that a problem may exist since 4.1, as suggested in the
> discussion.

It is queued up, it'll go upstream later today.

> Also, it seems to me that even with this problem fixed, blk-mq may not
> be ready for primetime after all. With that in mind, maybe commit
> d5038a13eca72 ("scsi: core: switch to scsi-mq by default") was a
> bit premature. Should that be reverted ?

I have to strongly disagree with that, the timing is just unfortunate.
There are literally millions of machines running blk-mq/scsi-mq, and
this is the only hiccup we've had. So I want to put this one to rest
once and for all, there's absolutely no reason not to continue with
what we've planned.

-- 
Jens Axboe
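
For a concrete picture of the fix discussed above: the idea is to mark a
request as non-mergeable before it is issued directly, so that if the driver
returns BUSY and the request gets queued for later dispatch, no new bios can
be merged into it after the SCSI layer has already set up its SG tables. A
minimal sketch of that idea follows — an illustration only, not the merged
patch, and the helper name is made up:

#include <linux/blkdev.h>
#include <linux/blk-mq.h>

/*
 * Sketch: flag the request as non-mergeable before attempting direct
 * issue.  REQ_NOMERGE lives in rq->cmd_flags, so later merge attempts
 * against a BUSY-requeued request are refused.
 */
static void sketch_prep_direct_issue(struct request *rq)
{
	rq->cmd_flags |= REQ_NOMERGE;
}

The actual patch does this from the direct issue path in blk-mq itself rather
than through a separate helper.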



Re: [PATCH] blk-mq: fix corruption with direct issue

2018-12-05 Thread Guenter Roeck
On Tue, Dec 04, 2018 at 07:25:05PM -0700, Jens Axboe wrote:
> On 12/4/18 6:38 PM, Guenter Roeck wrote:
> > On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
> >> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
> >> we queue the request up normally. However, the SCSI layer may have
> >> already setup SG tables etc for this particular command. If we later
> >> merge with this request, then the old tables are no longer valid. Once
> >> we issue the IO, we only read/write the original part of the request,
> >> not the new state of it.
> >>
> >> This causes data corruption, and is most often noticed with the file
> >> system complaining about the just read data being invalid:
> >>
> >> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
> >>
> >> because most of it is garbage...
> >>
> >> This doesn't happen from the normal issue path, as we will simply defer
> >> the request to the hardware queue dispatch list if we fail. Once it's on
> >> the dispatch list, we never merge with it.
> >>
> >> Fix this from the direct issue path by flagging the request as
> >> REQ_NOMERGE so we don't change the size of it before issue.
> >>
> >> See also:
> >>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
> >>
> >> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
> >> Signed-off-by: Jens Axboe 
> > 
> > Tested-by: Guenter Roeck 
> > 
> > ... on two systems affected by the problem.
> 
> Thanks for testing! And for being persistent in reproducing and
> providing clues for getting this nailed.
> 

My pleasure.

I see that there is some discussion about this patch.

Unfortunately, everyone running a 4.19 or later kernel is at serious
risk of data corruption. Given that, if this patch doesn't make it
upstream for one reason or another, would it be possible to at least
revert the two patches introducing the problem until this is sorted
out for good ? If this is not acceptable either, maybe mark blk-mq
as broken ? After all, it _is_ broken. This is even more true if it
turns out that a problem may exist since 4.1, as suggested in the
discussion.

Also, it seems to me that even with this problem fixed, blk-mq may not
be ready for primetime after all. With that in mind, maybe commit
d5038a13eca72 ("scsi: core: switch to scsi-mq by default") was a
bit premature. Should that be reverted ?

Thanks,
Guenter


Re: [PATCH 0/3] Unify the throttling code for wbt and io-latency

2018-12-05 Thread Jens Axboe
On 12/4/18 10:59 AM, Josef Bacik wrote:
> Originally when I wrote io-latency and the rq_qos code to provide a common
> base between wbt and io-latency I left out the throttling part.  These were
> basically the same, but slightly different in both cases.  The difference was
> enough and the code wasn't too complicated that I just copied it into
> io-latency and modified it for what I needed and carried on.
> 
> Since then Jens has fixed a few issues with wakeups with the naive approach.
> Before you could easily cycle waiters back to the end of the line if they were
> woken up without the ability to actually do their IO yet.  But because this
> was only in wbt we didn't get it in io-latency.
> 
> Resolve this by creating a unified interface for doing the throttling, and
> then just handle the differences between the two users with user specific
> callbacks.  This allows us to have one place where we have to mess with
> wakeups, and gives each user the ability to be their own special snowflake.
> 
> Jens, I based this on for-next from 12/03, let me know if you want a
> different base.  I tested this with my blktests test.  Thanks,

Applies fine to for-4.21/block, which is what I care about. Looks good to me,
applied, thanks.

-- 
Jens Axboe



Re: [PATCH 1/2] block: get rid of BLK_MAX_TIMEOUT

2018-12-05 Thread Bart Van Assche
On Wed, 2018-12-05 at 23:37 +0800, Weiping Zhang wrote:
> @@ -130,7 +119,7 @@ void blk_add_timer(struct request *req)
>* than an existing one, modify the timer. Round up to next nearest
>* second.
>*/
> - expiry = blk_rq_timeout(round_jiffies_up(expiry));
> + expiry = round_jiffies_up(expiry);

If you had read the comment above this code, you would have known
that this patch does not do what you think it does, and additionally that
it introduces a regression.

Bart.


[PATCH 1/2] block: get rid of BLK_MAX_TIMEOUT

2018-12-05 Thread Weiping Zhang
Since io_timeout was added to sysfs, the user can tune timeouts
by that attribute, so kill this limitation.

Signed-off-by: Weiping Zhang 
---
 block/blk-timeout.c | 13 +
 block/blk.h |  4 
 2 files changed, 1 insertion(+), 16 deletions(-)

diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 124c26128bf6..6b2b0f7e5929 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -89,17 +89,6 @@ void blk_abort_request(struct request *req)
 }
 EXPORT_SYMBOL_GPL(blk_abort_request);
 
-unsigned long blk_rq_timeout(unsigned long timeout)
-{
-   unsigned long maxt;
-
-   maxt = round_jiffies_up(jiffies + BLK_MAX_TIMEOUT);
-   if (time_after(timeout, maxt))
-   timeout = maxt;
-
-   return timeout;
-}
-
 /**
  * blk_add_timer - Start timeout timer for a single request
  * @req:   request that is about to start running.
@@ -130,7 +119,7 @@ void blk_add_timer(struct request *req)
 * than an existing one, modify the timer. Round up to next nearest
 * second.
 */
-   expiry = blk_rq_timeout(round_jiffies_up(expiry));
+   expiry = round_jiffies_up(expiry);
 
	if (!timer_pending(&q->timeout) ||
time_before(expiry, q->timeout.expires)) {
diff --git a/block/blk.h b/block/blk.h
index 848278c52030..f0d0a18fb276 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -7,9 +7,6 @@
 #include 
 #include "blk-mq.h"
 
-/* Max future timer expiry for timeouts */
-#define BLK_MAX_TIMEOUT	(5 * HZ)
-
 #ifdef CONFIG_DEBUG_FS
 extern struct dentry *blk_debugfs_root;
 #endif
@@ -151,7 +148,6 @@ static inline bool bio_integrity_endio(struct bio *bio)
 }
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
-unsigned long blk_rq_timeout(unsigned long timeout);
 void blk_add_timer(struct request *req);
 
 bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
-- 
2.14.1



[PATCH 2/2] block: add BLK_DEF_TIMEOUT

2018-12-05 Thread Weiping Zhang
Add an obvious definition for the default request timeout.

Signed-off-by: Weiping Zhang 
---
 block/blk-mq.c | 2 +-
 block/blk.h| 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 900550594651..53337d5c37af 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2845,7 +2845,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
goto err_hctxs;
 
	INIT_WORK(&q->timeout_work, blk_mq_timeout_work);
-   blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ);
+   blk_queue_rq_timeout(q, set->timeout ? set->timeout : BLK_DEF_TIMEOUT);
 
q->tag_set = set;
 
diff --git a/block/blk.h b/block/blk.h
index f0d0a18fb276..7905b952b7b3 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -7,6 +7,9 @@
 #include 
 #include "blk-mq.h"
 
+/* the default request timeout */
+#define BLK_DEF_TIMEOUT (30 * HZ)
+
 #ifdef CONFIG_DEBUG_FS
 extern struct dentry *blk_debugfs_root;
 #endif
-- 
2.14.1



[PATCH 0/2] get rid of BLK_MAX_TIMEOUT

2018-12-05 Thread Weiping Zhang
Get rid of BLK_MAX_TIMEOUT: since we want to use io_timeout to control
the maximum timeout of a request, remove this limitation.

Weiping Zhang (2):
  block: get rid of BLK_MAX_TIMEOUT
  block: add BLK_DEF_TIMEOUT

 block/blk-mq.c  |  2 +-
 block/blk-timeout.c | 13 +
 block/blk.h |  5 ++---
 3 files changed, 4 insertions(+), 16 deletions(-)

-- 
2.14.1



[PATCH 2/2] blktests: block/025: an io.latency test

2018-12-05 Thread Josef Bacik
This is a test to verify io.latency is working properly.  It does this
by first running a fio job by itself to figure out how fast it runs.
Then we calculate some thresholds, set up 2 cgroups, a fast and a slow
cgroup, and then run the same job in both groups simultaneously.  We
should see the slow group get throttled until the first cgroup is able
to finish, and then the slow cgroup will be allowed to finish.

Signed-off-by: Josef Bacik 
---
 tests/block/025 | 133 
 tests/block/025.out |   1 +
 2 files changed, 134 insertions(+)
 create mode 100644 tests/block/025
 create mode 100644 tests/block/025.out

diff --git a/tests/block/025 b/tests/block/025
new file mode 100644
index ..cdaa35c5e335
--- /dev/null
+++ b/tests/block/025
@@ -0,0 +1,133 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-3.0+
+#
+# Test io.latency to make sure it's protecting the higher priority group
+# properly.
+
+. tests/block/rc
+
+DESCRIPTION="test the io.latency interface to make sure it's working right"
+
+requires() {
+   _have_cgroup2_controller_file io io.latency && _have_fio && \
+   _have_program jq
+}
+
+_fio_results_key() {
+   _job=$1
+   _key=$2
+   _resultfile=$3
+
+   jq '.jobs[] | select(.jobname == "'${_job}'") | .'${_key} ${_resultfile}
+}
+
+test_device() {
+   local fio_config_single fio_config_double fio_results fio_args qd
+
+   fio_config_single=${TMPDIR}/single.fio
+   fio_config_double=${TMPDIR}/double.fio
+   fio_results=${TMPDIR}/025.json
+   fio_args="--output-format=json --output=${fio_results}"
+   qd=$(cat ${TEST_DEV_SYSFS}/queue/nr_requests)
+
+   cat << EOF > "${fio_config_single}"
+   [fast]
+   filename=${TEST_DEV}
+   direct=1
+   allrandrepeat=1
+   readwrite=randrw
+   size=4G
+   ioengine=libaio
+   iodepth=${qd}
+   fallocate=none
+   randseed=12345
+EOF
+
+   cat << EOF > "${fio_config_double}"
+   [global]
+   filename=${TEST_DEV}
+   direct=1
+   allrandrepeat=1
+   readwrite=randrw
+   size=4G
+   ioengine=libaio
+   iodepth=${qd}
+   fallocate=none
+   randseed=12345
+
+   [fast]
+   cgroup=blktests/fast
+
+   [slow]
+   cgroup=blktests/slow
+EOF
+   # We run the test once so we have an idea of how fast this workload will
+   # go with nobody else doing IO on the device.
+   if ! fio ${fio_args} ${fio_config_single}; then
+   echo "fio exited with status $?"
+   return 1
+   fi
+
+   _time_taken=$(_fio_results_key fast job_runtime ${fio_results})
+   if [ "${_time_taken}" = "" ]; then
+   echo "fio doesn't report job_runtime"
+   return 1
+   fi
+
+   echo "normal time taken ${_time_taken}" >> $FULL
+
+   # There's no way to predict how the two workloads are going to affect
+   # each other, so we want to set thresholds to something reasonable so
+   # we can verify io.latency is doing something.  This means we set 15%
+   # for the fast cgroup, just to give us enough wiggle room as throttling
+   # doesn't happen immediately.  But if we have a super fast disk we could
+   # run both groups really fast and make it under our fast threshold, so
+   # we need to set a threshold for the slow group at 50%.  We assume that
+   # if it was faster than 50% of the fast threshold then we probably
+   # didn't throttle and we can assume io.latency is broken.
+   _fast_thresh=$((${_time_taken} + ${_time_taken} * 15 / 100))
+   _slow_thresh=$((${_time_taken} + ${_time_taken} * 50 / 100))
+   echo "fast threshold time is ${_fast_thresh}" >> $FULL
+   echo "slow threshold time is ${_slow_thresh}" >> $FULL
+
+   # Create the cgroup files
+   _dir=$(_cgroup2_base_dir)/blktests
+   echo "+io" > ${_dir}/cgroup.subtree_control
+   mkdir ${_dir}/fast
+   mkdir ${_dir}/slow
+
+   # We set the target to 1usec because we could have a fast device that is
+   # capable of remarkable IO latencies that would skew the test.  It needs
+   # to be low enough that we do actually throttle the slow group,
+   # otherwise the test will fail when there's nothing wrong.
+   _major=$((0x$(stat -c "%t" ${TEST_DEV})))
+   _minor=$((0x$(stat -c "%T" ${TEST_DEV})))
+   echo "${_major}:${_minor} is our device" >> $FULL
+   if ! echo "${_major}:${_minor} target=1" > ${_dir}/fast/io.latency; then
+   echo "Failed to set our latency target"
+   return 1
+   fi
+
+   if ! fio ${fio_args} ${fio_config_double}; then
+   echo "fio exited with status $?"
+   return 1
+   fi
+
+   _fast_time=$(_fio_results_key fast job_runtime ${fio_results})
+   echo "Fast time ${_fast_time}" >> $FULL
+   _slow_time=$(_fio_results_key slow job_runtime ${fio_results})
+   echo 

[PATCH 1/2] blktests: add cgroup2 infrastructure

2018-12-05 Thread Josef Bacik
In order to test io.latency and other cgroup related things we need some
supporting helpers to setup and tear down cgroup2.  This adds support
for checking that we can even configure cgroup2 things, set them up if
need be, and then add the cleanup stuff to the main cleanup function so
everything is always in a clean state.

Signed-off-by: Josef Bacik 
---
 check |  2 ++
 common/rc | 48 
 2 files changed, 50 insertions(+)

diff --git a/check b/check
index ebd87c097e25..1c9dbc518fa1 100755
--- a/check
+++ b/check
@@ -294,6 +294,8 @@ _cleanup() {
done
unset RESTORE_CPUS_ONLINE
fi
+
+   _cleanup_cgroup2
 }
 
 _call_test() {
diff --git a/common/rc b/common/rc
index 8a892bcd5fde..a785f2329687 100644
--- a/common/rc
+++ b/common/rc
@@ -202,3 +202,51 @@ _test_dev_in_hotplug_slot() {
 _filter_xfs_io_error() {
sed -e 's/^\(.*\)64\(: .*$\)/\1\2/'
 }
+
+_cgroup2_base_dir()
+{
+   grep cgroup2 /proc/mounts | awk '{ print $2 }'
+}
+
+_cleanup_cgroup2()
+{
+   _dir=$(_cgroup2_base_dir)/blktests
+   [ -d "${_dir}" ] || return
+
+   for i in $(find ${_dir} -type d | tac)
+   do
+   rmdir $i
+   done
+}
+
+_have_cgroup2()
+{
+   if ! grep -q 'cgroup2' /proc/mounts; then
+   SKIP_REASON="This test requires cgroup2"
+   return 1
+   fi
+   return 0
+}
+
+_have_cgroup2_controller_file()
+{
+   _have_cgroup2 || return 1
+
+   _controller=$1
+   _file=$2
+   _dir=$(_cgroup2_base_dir)
+
+   if ! grep -q ${_controller} ${_dir}/cgroup.controllers; then
+   SKIP_REASON="No support for ${_controller} cgroup controller"
+   return 1
+   fi
+
+   mkdir ${_dir}/blktests
+   echo "+${_controller}" > ${_dir}/cgroup.subtree_control
+   if [ ! -f ${_dir}/blktests/${_file} ]; then
+   _cleanup_cgroup2
+   SKIP_REASON="Cgroup file ${_file} doesn't exist"
+   return 1
+   fi
+   return 0
+}
-- 
2.14.3



[PATCH 0/2][V2] io.latency test for blktests

2018-12-05 Thread Josef Bacik
v1->v2:
- dropped my python library, TIL about jq.
- fixed the spelling mistakes in the test.

-- Original message --

This patchset is to add a test to verify io.latency is working properly, and to
add all the supporting code to run that test.

First is the cgroup2 infrastructure which is fairly straightforward.  Just
verifies we have cgroup2, and gives us the helpers to check and make sure we
have the right controllers in place for the test.

The second patch brings over some python scripts I put in xfstests for parsing
the fio json output.  I looked at the existing fio performance stuff in
blktests, but we only capture bw stuff, which is wonky with this particular test
because once the fast group is finished the slow group is allowed to go as fast
as it wants.  So I needed this to pull out actual jobtime spent.  This will give
us flexibility to pull out other fio performance data in the future.

The final patch is the test itself.  It simply runs a job by itself to get a
baseline view of the disk performance.  Then it creates 2 cgroups, one fast and
one slow, and runs the same job simultaneously in both groups.  The result
should be that the fast group takes just slightly longer time than the baseline
(I use a 15% threshold to be safe), and that the slow one takes considerably
longer.  Thanks,

Josef


Re: [PATCH] blk-mq: fix corruption with direct issue

2018-12-05 Thread Jens Axboe
On 12/5/18 7:41 AM, Christoph Hellwig wrote:
> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
>> we queue the request up normally. However, the SCSI layer may have
>> already setup SG tables etc for this particular command. If we later
>> merge with this request, then the old tables are no longer valid. Once
>> we issue the IO, we only read/write the original part of the request,
>> not the new state of it.
>>
>> This causes data corruption, and is most often noticed with the file
>> system complaining about the just read data being invalid:
> 
> Looks good as the first quick fix:
> 
> Reviewed-by: Christoph Hellwig 
> 
> But as we discussed I think we need to be even more careful, including
> the read-and-write-only check below in the thread, or some higher level
> approach.

I agree, I'm reviewing the patches from Jianchao now and will do
something cleaner on top of that.

-- 
Jens Axboe



Re: [PATCH v3] block: add documentation for io_timeout

2018-12-05 Thread Weiping Zhang
Weiping Zhang wrote on Wed, Dec 5, 2018 at 10:49 PM:
>
> Christoph Hellwig wrote on Wed, Dec 5, 2018 at 10:40 PM:
> >
> > Can you please also send a patch to not show this attribute for
> > drivers without a timeout handler?  Thanks!
> >
Is there a simple way to do that?

Shall we return -ENOTSUPP when the user reads or writes this attribute
and the driver has no timeout handler?
> Do you mean checking whether q->mq_ops->timeout is NULL or not?
>
> static void blk_mq_rq_timed_out(struct request *req, bool reserved)
> {
> req->rq_flags |= RQF_TIMED_OUT;
> If it's NULL, blk_add_timer only re-starts a timer for this request.
> Is there some defect if we show this attribute and the driver doesn't
> support a timeout handler?
>
> if (req->q->mq_ops->timeout) {
> enum blk_eh_timer_return ret;
>
> ret = req->q->mq_ops->timeout(req, reserved);
> if (ret == BLK_EH_DONE)
> return;
> WARN_ON_ONCE(ret != BLK_EH_RESET_TIMER);
> }
>
> blk_add_timer(req);
> }
>
>
> > On Wed, Dec 05, 2018 at 10:17:06PM +0800, Weiping Zhang wrote:
> > > Add documentation for /sys/block//queue/io_timeout.
> > >
> > > Signed-off-by: Weiping Zhang 
> > > ---
> > >  Documentation/ABI/testing/sysfs-block | 10 ++
> > >  Documentation/block/queue-sysfs.txt   |  7 +++
> > >  2 files changed, 17 insertions(+)
> > >
> > > diff --git a/Documentation/ABI/testing/sysfs-block 
> > > b/Documentation/ABI/testing/sysfs-block
> > > index dea212db9df3..f254a374710a 100644
> > > --- a/Documentation/ABI/testing/sysfs-block
> > > +++ b/Documentation/ABI/testing/sysfs-block
> > > @@ -271,3 +271,13 @@ Description:
> > >   size of 512B sectors of the zones of the device, with
> > >   the eventual exception of the last zone of the device
> > >   which may be smaller.
> > > +
> > > +What:/sys/block//queue/io_timeout
> > > +Date:November 2018
> > > +Contact: Weiping Zhang 
> > > +Description:
> > > + io_timeout is a request’s timeouts at block layer in
> > > + milliseconds. When the underlying driver starts processing
> > > + a request, the generic block layer will start a timer, if
> > > + this request cannot be completed in io_timeout milliseconds,
> > > + a timeout event will occur.
> > > diff --git a/Documentation/block/queue-sysfs.txt 
> > > b/Documentation/block/queue-sysfs.txt
> > > index 2c1e67058fd3..f0c9bbce73fd 100644
> > > --- a/Documentation/block/queue-sysfs.txt
> > > +++ b/Documentation/block/queue-sysfs.txt
> > > @@ -67,6 +67,13 @@ If set to a value larger than 0, the kernel will put 
> > > the process issuing
> > >  IO to sleep for this amont of microseconds before entering classic
> > >  polling.
> > >
> > > +io_timeout (RW)
> > > +---
> > > +This is a request’s timeouts at block layer in milliseconds. When the
> > > +underlying driver starts processing a request, the generic block layer
> > > +will start a timer, if this request cannot be completed in io_timeout
> > > +milliseconds, a timeout event will occur.
> > > +
> > >  iostats (RW)
> > >  -
> > >  This file is used to control (on/off) the iostats accounting of the
> > > --
> > > 2.14.1
> > >
> > ---end quoted text---
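
Coming back to the question above about what to do when the driver has no
timeout handler: one possible shape for the -ENOTSUPP idea, purely a sketch
under the assumption that io_timeout is wired up through the usual queue
sysfs show/store helpers in block/blk-sysfs.c (not a tested or merged patch),
would be to reject the access when no handler is registered:

/* Sketch only: refuse the attribute when the driver has no ->timeout. */
static ssize_t queue_io_timeout_show(struct request_queue *q, char *page)
{
	if (!q->mq_ops || !q->mq_ops->timeout)
		return -EOPNOTSUPP;

	return sprintf(page, "%u\n", jiffies_to_msecs(q->rq_timeout));
}

What Christoph asked for is slightly stronger — not showing the attribute at
all — which would mean registering the attribute conditionally rather than
failing reads and writes on it.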


Re: [PATCH v3] block: add documentation for io_timeout

2018-12-05 Thread Weiping Zhang
Christoph Hellwig wrote on Wed, Dec 5, 2018 at 10:40 PM:
>
> Can you please also send a patch to not show this attribute for
> drivers without a timeout handler?  Thanks!
>
Do you mean checking whether q->mq_ops->timeout is NULL or not?

static void blk_mq_rq_timed_out(struct request *req, bool reserved)
{
req->rq_flags |= RQF_TIMED_OUT;
If it's NULL, blk_add_timer only re-starts a timer for this request.
Is there some defect if we show this attribute and the driver doesn't
support a timeout handler?

if (req->q->mq_ops->timeout) {
enum blk_eh_timer_return ret;

ret = req->q->mq_ops->timeout(req, reserved);
if (ret == BLK_EH_DONE)
return;
WARN_ON_ONCE(ret != BLK_EH_RESET_TIMER);
}

blk_add_timer(req);
}


> On Wed, Dec 05, 2018 at 10:17:06PM +0800, Weiping Zhang wrote:
> > Add documentation for /sys/block//queue/io_timeout.
> >
> > Signed-off-by: Weiping Zhang 
> > ---
> >  Documentation/ABI/testing/sysfs-block | 10 ++
> >  Documentation/block/queue-sysfs.txt   |  7 +++
> >  2 files changed, 17 insertions(+)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-block 
> > b/Documentation/ABI/testing/sysfs-block
> > index dea212db9df3..f254a374710a 100644
> > --- a/Documentation/ABI/testing/sysfs-block
> > +++ b/Documentation/ABI/testing/sysfs-block
> > @@ -271,3 +271,13 @@ Description:
> >   size of 512B sectors of the zones of the device, with
> >   the eventual exception of the last zone of the device
> >   which may be smaller.
> > +
> > +What:/sys/block//queue/io_timeout
> > +Date:November 2018
> > +Contact: Weiping Zhang 
> > +Description:
> > + io_timeout is a request’s timeouts at block layer in
> > + milliseconds. When the underlying driver starts processing
> > + a request, the generic block layer will start a timer, if
> > + this request cannot be completed in io_timeout milliseconds,
> > + a timeout event will occur.
> > diff --git a/Documentation/block/queue-sysfs.txt 
> > b/Documentation/block/queue-sysfs.txt
> > index 2c1e67058fd3..f0c9bbce73fd 100644
> > --- a/Documentation/block/queue-sysfs.txt
> > +++ b/Documentation/block/queue-sysfs.txt
> > @@ -67,6 +67,13 @@ If set to a value larger than 0, the kernel will put the 
> > process issuing
> >  IO to sleep for this amont of microseconds before entering classic
> >  polling.
> >
> > +io_timeout (RW)
> > +---
> > +This is a request’s timeouts at block layer in milliseconds. When the
> > +underlying driver starts processing a request, the generic block layer
> > +will start a timer, if this request cannot be completed in io_timeout
> > +milliseconds, a timeout event will occur.
> > +
> >  iostats (RW)
> >  -
> >  This file is used to control (on/off) the iostats accounting of the
> > --
> > 2.14.1
> >
> ---end quoted text---


Re: [PATCH] blk-mq: fix corruption with direct issue

2018-12-05 Thread Christoph Hellwig
On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
> we queue the request up normally. However, the SCSI layer may have
> already setup SG tables etc for this particular command. If we later
> merge with this request, then the old tables are no longer valid. Once
> we issue the IO, we only read/write the original part of the request,
> not the new state of it.
> 
> This causes data corruption, and is most often noticed with the file
> system complaining about the just read data being invalid:

Looks good as the first quick fix:

Reviewed-by: Christoph Hellwig 

But as we discussed I think we need to be even more careful, including
the read-and-write-only check below in the thread, or some higher level
approach.


[PATCH v3] block: add documentation for io_timeout

2018-12-05 Thread Weiping Zhang
Add documentation for /sys/block/<disk>/queue/io_timeout.

Signed-off-by: Weiping Zhang 
---
 Documentation/ABI/testing/sysfs-block | 10 ++
 Documentation/block/queue-sysfs.txt   |  7 +++
 2 files changed, 17 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
index dea212db9df3..f254a374710a 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -271,3 +271,13 @@ Description:
size of 512B sectors of the zones of the device, with
the eventual exception of the last zone of the device
which may be smaller.
+
+What:		/sys/block/<disk>/queue/io_timeout
+Date:		November 2018
+Contact:	Weiping Zhang 
+Description:
+   io_timeout is the request timeout at the block layer, in
+   milliseconds. When the underlying driver starts processing
+   a request, the generic block layer will start a timer; if
+   the request cannot be completed within io_timeout
+   milliseconds, a timeout event will occur.
diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt
index 2c1e67058fd3..f0c9bbce73fd 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -67,6 +67,13 @@ If set to a value larger than 0, the kernel will put the process issuing
 IO to sleep for this amont of microseconds before entering classic
 polling.
 
+io_timeout (RW)
+---
+This is the request timeout at the block layer, in milliseconds. When the
+underlying driver starts processing a request, the generic block layer
+will start a timer; if the request cannot be completed within io_timeout
+milliseconds, a timeout event will occur.
+
 iostats (RW)
 -
 This file is used to control (on/off) the iostats accounting of the
-- 
2.14.1



Re: [PATCH 15/26] aio: support for IO polling

2018-12-05 Thread Jens Axboe
On 12/5/18 2:56 AM, Benny Halevy wrote:
>> +static int aio_iopoll_getevents(struct kioctx *ctx,
>> +struct io_event __user *event,
>> +unsigned int *nr_events, long min, long max)
>> +{
>> +struct aio_kiocb *iocb;
>> +int to_poll, polled, ret;
>> +
>> +/*
>> + * Check if we already have done events that satisfy what we need
>> + */
>> +if (!list_empty(&ctx->poll_completing)) {
>> +ret = aio_iopoll_reap(ctx, event, nr_events, max);
>> +if (ret < 0)
>> +return ret;
>> +/* if min is zero, still go through a poll check */
> 
> A function-level comment about that would be more visible.

Yes, that might be a better idea.

>> +if (min && *nr_events >= min)
>>
>> +return 0;
> 
> if ((min && *nr_events >= min) || *nr_events >= max) ?
> 
> Otherwise, if poll_completing or poll_submitted are not empty,
> besides getting to the "Shouldn't happen" case in aio_iopoll_reap,
> it looks like we're gonna waste some cycles and never call f_op->iopoll

Agree, that would be a better place to catch it. I'll make those two
changes, thanks!

>> +
>> +/*
>> + * Find up to 'max' worth of events to poll for, including the
>> + * events we already successfully polled
>> + */
>> +polled = to_poll = 0;
>> +list_for_each_entry(iocb, &ctx->poll_completing, ki_list) {
>> +/*
>> + * Poll for needed events with spin == true, anything after
>> + * that we just check if we have more, up to max.
>> + */
>> +bool spin = polled + *nr_events >= min;
> 
> I'm confused. Shouldn't spin == true if polled + *nr_events < min?

Heh, that does look off, and it is. I think the correct condition is:

bool spin = !polled || *nr_events < min;

and I'm not sure why that got broken. I just tested it, slight
performance win as expected correcting this. It ties in with the above
nr_events check - it's perfectly valid to pass min_nr == 0 and just do a
poll check. Conversely, if we already spun, just do non-spin checks on
followup poll checks.

>> +static int __aio_iopoll_check(struct kioctx *ctx, struct io_event __user 
>> *event,
>> +  unsigned int *nr_events, long min_nr, long max_nr)
>> +{
>> +int ret = 0;
>> +
>> +while (!*nr_events || !need_resched()) {
>> +int tmin = 0;
>> +
>> +if (*nr_events < min_nr)
>> +tmin = min_nr - *nr_events;
> 
> Otherwise, shouldn't the function just return 0?
> 
> Or perhaps you explicitly want to go through the tmin==0 case
> in aio_iopoll_getevents if *nr_events == min_nr (or min_nr==0)?

No, we need to go through poll, if min_nr == 0, but only if min_nr == 0
and !nr_events. But we only get here for nr_events != 0, so should be
fine as-is.

Thanks for your review, very useful! I'll make the above changes right
now.

-- 
Jens Axboe



Re: [PATCH 15/26] aio: support for IO polling

2018-12-05 Thread Benny Halevy
On Tue, 2018-12-04 at 16:37 -0700, Jens Axboe wrote:
> Add polled variants of PREAD/PREADV and PWRITE/PWRITEV. These act
> like their non-polled counterparts, except we expect to poll for
> completion of them. The polling happens at io_getevent() time, and
> works just like non-polled IO.
> 
> To setup an io_context for polled IO, the application must call
> io_setup2() with IOCTX_FLAG_IOPOLL as one of the flags. It is illegal
> to mix and match polled and non-polled IO on an io_context.
> 
> Polled IO doesn't support the user mapped completion ring. Events
> must be reaped through the io_getevents() system call. For non-irq
> driven poll devices, there's no way to support completion reaping
> from userspace by just looking at the ring. The application itself
> is the one that pulls completion entries.
> 
> Signed-off-by: Jens Axboe 
> ---
>  fs/aio.c | 393 +++
>  include/uapi/linux/aio_abi.h |   3 +
>  2 files changed, 360 insertions(+), 36 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index bb6f07ca6940..5d34317c2929 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -153,6 +153,18 @@ struct kioctx {
> 	atomic_t	reqs_available;
>   } cacheline_aligned_in_smp;
>  
> + /* iopoll submission state */
> + struct {
> + spinlock_t poll_lock;
> + struct list_head poll_submitted;
> + } cacheline_aligned_in_smp;
> +
> + /* iopoll completion state */
> + struct {
> + struct list_head poll_completing;
> + struct mutex getevents_lock;
> + } cacheline_aligned_in_smp;
> +
>   struct {
>   spinlock_t  ctx_lock;
>   struct list_head active_reqs;   /* used for cancellation */
> @@ -205,14 +217,27 @@ struct aio_kiocb {
>   __u64   ki_user_data;   /* user's data for completion */
>  
>   struct list_headki_list;/* the aio core uses this
> -  * for cancellation */
> +  * for cancellation, or for
> +  * polled IO */
> +
> + unsigned long   ki_flags;
> +#define IOCB_POLL_COMPLETED  0   /* polled IO has completed */
> +#define IOCB_POLL_EAGAIN 1   /* polled submission got EAGAIN */
> +
>   refcount_t  ki_refcnt;
>  
> - /*
> -  * If the aio_resfd field of the userspace iocb is not zero,
> -  * this is the underlying eventfd context to deliver events to.
> -  */
> - struct eventfd_ctx  *ki_eventfd;
> + union {
> + /*
> +  * If the aio_resfd field of the userspace iocb is not zero,
> +  * this is the underlying eventfd context to deliver events to.
> +  */
> + struct eventfd_ctx  *ki_eventfd;
> +
> + /*
> +  * For polled IO, stash completion info here
> +  */
> + struct io_event ki_ev;
> + };
>  };
>  
>  /*-- sysctl variables*/
> @@ -233,6 +258,7 @@ static const unsigned int iocb_page_shift =
>   ilog2(PAGE_SIZE / sizeof(struct iocb));
>  
>  static void aio_useriocb_unmap(struct kioctx *);
> +static void aio_iopoll_reap_events(struct kioctx *);
>  
>  static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
>  {
> @@ -471,11 +497,15 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned 
> int nr_events)
>   int i;
>   struct file *file;
>  
> - /* Compensate for the ring buffer's head/tail overlap entry */
> - nr_events += 2; /* 1 is required, 2 for good luck */
> -
> + /*
> +  * Compensate for the ring buffer's head/tail overlap entry.
> +  * IO polling doesn't require any io event entries
> +  */
>   size = sizeof(struct aio_ring);
> - size += sizeof(struct io_event) * nr_events;
> + if (!(ctx->flags & IOCTX_FLAG_IOPOLL)) {
> + nr_events += 2; /* 1 is required, 2 for good luck */
> + size += sizeof(struct io_event) * nr_events;
> + }
>  
>   nr_pages = PFN_UP(size);
>   if (nr_pages < 0)
> @@ -758,6 +788,11 @@ static struct kioctx *io_setup_flags(unsigned long ctxid,
>  
>   INIT_LIST_HEAD(>active_reqs);
>  
> +	spin_lock_init(&ctx->poll_lock);
> +	INIT_LIST_HEAD(&ctx->poll_submitted);
> +	INIT_LIST_HEAD(&ctx->poll_completing);
> +	mutex_init(&ctx->getevents_lock);
> +
> 	if (percpu_ref_init(&ctx->users, free_ioctx_users, 0, GFP_KERNEL))
>   goto err;
>  
> @@ -829,11 +864,15 @@ static int kill_ioctx(struct mm_struct *mm, struct 
> kioctx *ctx,
>  {
>   struct kioctx_table *table;
>  
> +	mutex_lock(&ctx->getevents_lock);
> 	spin_lock(&mm->ioctx_lock);
> 	if (atomic_xchg(&ctx->dead, 1)) {
> 		spin_unlock(&mm->ioctx_lock);
> +		mutex_unlock(&ctx->getevents_lock);
>   return -EINVAL;
>   }
> + 

Re: [PATCH 2/3] blktests: add python scripts for parsing fio json output

2018-12-05 Thread Johannes Thumshirn
On 04/12/2018 18:47, Josef Bacik wrote:
> +For example
> +"write" : {
> +"io_bytes" : 313360384,
> +"bw" : 1016,
> +}
> +
> +Gets collapsed to
> +
> +"write_io_bytes" : 313360384,
> +"write_bw": 1016,

I'd definitely prefer to not pull in python as a dependency. The above
example can be done in shell (with the jq utility) as follows:

jthumshirn@linux-x5ow:~$ cat test.json |\
   jq '. | {write_io_bytes: .write.io_bytes, write_bw: .write.bw}'
{
  "write_io_bytes": 313360384,
  "write_bw": 1016
}

And on the plus side, jq is small as well:

jthumshirn@linux-x5ow:~$ zypper info jq
Loading repository data...
Reading installed packages...


Information for package jq:
---
Repository : openSUSE:Leap:15.0
Name   : jq
Version: 1.5-lp150.1.10
Arch   : x86_64
Vendor : openSUSE
Installed Size : 93.4 KiB
Installed  : Yes
[...]

-- 
Johannes Thumshirn        SUSE Labs Filesystems
jthumsh...@suse.de        +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850