Re: [PATCH 0/2][V2] io.latency test for blktests

2018-12-06 Thread Omar Sandoval
On Wed, Dec 05, 2018 at 10:34:02AM -0500, Josef Bacik wrote:
> v1->v2:
> - dropped my python library, TIL about jq.
> - fixed the spelling mistakes in the test.
> 
> -- Original message --
> 
> This patchset is to add a test to verify io.latency is working properly, and
> to add all the supporting code to run that test.
> 
> First is the cgroup2 infrastructure which is fairly straightforward.  Just
> verifies we have cgroup2, and gives us the helpers to check and make sure we
> have the right controllers in place for the test.
> 
> The second patch brings over some python scripts I put in xfstests for parsing
> the fio json output.  I looked at the existing fio performance stuff in
> blktests, but we only capture bw stuff, which is wonky with this particular
> test because once the fast group is finished the slow group is allowed to go
> as fast as it wants.  So I needed this to pull out the actual job time spent.
> This will give us flexibility to pull out other fio performance data in the
> future.
> 
> The final patch is the test itself.  It simply runs a job by itself to get a
> baseline view of the disk performance.  Then it creates 2 cgroups, one fast
> and one slow, and runs the same job simultaneously in both groups.  The
> result should be that the fast group takes only slightly longer than the
> baseline (I use a 15% threshold to be safe), and that the slow one takes
> considerably longer.  Thanks,

I cleaned up a ton of shellcheck warnings (from `make check`) and pushed
to https://github.com/osandov/blktests/tree/josef. I tested with QEMU on
Jens' for-next branch. With an emulated NVMe device, it failed with
"Too much of a performance drop for the protected workload". On
virtio-blk, I hit this:

[ 1843.056452] INFO: task fio:20750 blocked for more than 120 seconds.
[ 1843.057495]   Not tainted 4.20.0-rc5-00251-g90efb26fa9a4 #19
[ 1843.058487] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1843.059769] fio D0 20750  20747 0x0080
[ 1843.060688] Call Trace:
[ 1843.061123]  ? __schedule+0x286/0x870
[ 1843.061735]  ? blkcg_iolatency_done_bio+0x680/0x680
[ 1843.062574]  ? blkcg_iolatency_cleanup+0x60/0x60
[ 1843.063347]  schedule+0x32/0x80
[ 1843.063874]  io_schedule+0x12/0x40
[ 1843.064449]  rq_qos_wait+0x9a/0x120
[ 1843.065007]  ? karma_partition+0x210/0x210
[ 1843.065661]  ? blkcg_iolatency_done_bio+0x680/0x680
[ 1843.066435]  blkcg_iolatency_throttle+0x185/0x360
[ 1843.067196]  __rq_qos_throttle+0x23/0x30
[ 1843.067958]  blk_mq_make_request+0x101/0x5c0
[ 1843.068637]  generic_make_request+0x1b3/0x3c0
[ 1843.069329]  submit_bio+0x45/0x140
[ 1843.069876]  blkdev_direct_IO+0x3db/0x440
[ 1843.070527]  ? aio_complete+0x2f0/0x2f0
[ 1843.071146]  generic_file_direct_write+0x96/0x160
[ 1843.071880]  __generic_file_write_iter+0xb3/0x1c0
[ 1843.072599]  ? blk_mq_dispatch_rq_list+0x3aa/0x550
[ 1843.073340]  blkdev_write_iter+0xa0/0x120
[ 1843.073960]  ? __fget+0x6e/0xa0
[ 1843.074452]  aio_write+0x11f/0x1d0
[ 1843.074979]  ? __blk_mq_run_hw_queue+0x6f/0xe0
[ 1843.075658]  ? __check_object_size+0xa0/0x189
[ 1843.076345]  ? preempt_count_add+0x5a/0xb0
[ 1843.077086]  ? aio_read_events+0x259/0x380
[ 1843.077819]  ? kmem_cache_alloc+0x16e/0x1c0
[ 1843.078427]  io_submit_one+0x4a8/0x790
[ 1843.078975]  ? read_events+0x76/0x150
[ 1843.079510]  __se_sys_io_submit+0x98/0x1a0
[ 1843.080116]  ? syscall_trace_enter+0x1d3/0x2d0
[ 1843.080785]  do_syscall_64+0x55/0x160
[ 1843.081404]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1843.082210] RIP: 0033:0x7f6e571fc4ed
[ 1843.082763] Code: Bad RIP value.
[ 1843.084445] RSP: 002b:7ffc212b76f8 EFLAGS: 0246 ORIG_RAX: 00d1
[ 1843.084445] RAX: ffda RBX: 7f6e4c876870 RCX: 7f6e571fc4ed
[ 1843.085545] RDX: 557c4bc11208 RSI: 0001 RDI: 7f6e4c85e000
[ 1843.086251] RBP: 7f6e4c85e000 R08: 557c4bc2b130 R09: 02f8
[ 1843.087308] R10: 557c4bbf4470 R11: 0246 R12: 0001
[ 1843.088310] R13:  R14: 557c4bc11208 R15: 7f6e2b17f070


Re: [PATCH] block/025: test discard sector alignement and sector size overflow

2018-12-05 Thread Omar Sandoval
On Thu, Nov 15, 2018 at 12:00:17PM +0800, Ming Lei wrote:
> This test covers the following two issues:
> 
> 1) discard sector need to be aligned with logical block size
> 
> 2) make sure 'sector_t' instead of 'unsigned int' is used when comparing
> with discard sector size
> 
> Signed-off-by: Ming Lei 
> ---
>  tests/block/025 | 37 +
>  tests/block/025.out |  2 ++
>  2 files changed, 39 insertions(+)
>  create mode 100755 tests/block/025
>  create mode 100644 tests/block/025.out

Applied, thanks Ming.


[PATCH] sbitmap: fix sbitmap_for_each_set()

2018-12-03 Thread Omar Sandoval
From: Omar Sandoval 

We need to ignore bits in the cleared mask when iterating over all set
bits.

Fixes: ea86ea2cdced ("sbitmap: ammortize cost of clearing bits")
Signed-off-by: Omar Sandoval 
---
 include/linux/sbitmap.h | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 92806a2dbab7..03f50fcedc79 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -265,12 +265,14 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
nr = SB_NR_TO_BIT(sb, start);
 
while (scanned < sb->depth) {
-   struct sbitmap_word *word = &sb->map[index];
-   unsigned int depth = min_t(unsigned int, word->depth - nr,
+   unsigned long word;
+   unsigned int depth = min_t(unsigned int,
+  sb->map[index].depth - nr,
   sb->depth - scanned);
 
scanned += depth;
-   if (!word->word)
+   word = sb->map[index].word & ~sb->map[index].cleared;
+   if (!word)
goto next;
 
/*
@@ -280,7 +282,7 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
 */
depth += nr;
while (1) {
-   nr = find_next_bit(&word->word, depth, nr);
+   nr = find_next_bit(&word, depth, nr);
if (nr >= depth)
break;
if (!fn(sb, (index << sb->shift) + nr, data))
-- 
2.19.2



Re: sbitmap: check cleared bits when iterating busy bits

2018-12-03 Thread Omar Sandoval
On Mon, Dec 03, 2018 at 02:56:17PM -0700, Jens Axboe wrote:
> When we are iterating the set bits in a word, we also need to factor in
> the cleared bits. Don't call fn() unless the bit is also not set in
> the cleared word.
> 
> Fixes: ea86ea2cdced ("sbitmap: ammortize cost of clearing bits")
> Signed-off-by: Jens Axboe 
> 
> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
> index 92806a2dbab7..9f374fbcdba6 100644
> --- a/include/linux/sbitmap.h
> +++ b/include/linux/sbitmap.h
> @@ -283,6 +283,11 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
>   nr = find_next_bit(&word->word, depth, nr);
>   if (nr >= depth)
>   break;
> + /* if set in cleared, it's actually free */
> + if (test_bit(nr, &word->cleared)) {
> + nr++;
> + continue;
> + }
>   if (!fn(sb, (index << sb->shift) + nr, data))
>   return;
>  
> -- 
> Jens Axboe
> 

How about something like this:

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index f0f49bbb2617..fe9122386255 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -265,12 +265,14 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
nr = SB_NR_TO_BIT(sb, start);
 
while (scanned < sb->depth) {
-   struct sbitmap_word *word = &sb->map[index];
-   unsigned int depth = min_t(unsigned int, word->depth - nr,
+   unsigned long word;
+   unsigned int depth = min_t(unsigned int,
+  sb->map[index].depth - nr,
   sb->depth - scanned);
 
scanned += depth;
-   if (!word->word)
+   word = sb->map[index].word & ~sb->map[index].cleared;
+   if (!word)
goto next;
 
/*
@@ -280,7 +282,7 @@ static inline void __sbitmap_for_each_set(struct sbitmap *sb,
 */
depth += nr;
while (1) {
-   nr = find_next_bit(&word->word, depth, nr);
+   nr = find_next_bit(&word, depth, nr);
if (nr >= depth)
break;
if (!fn(sb, (index << sb->shift) + nr, data))

Might be marginally faster.


Re: [PATCH v2] blk-mq: don't call ktime_get_ns() if we don't need it

2018-12-03 Thread Omar Sandoval
On Fri, Nov 30, 2018 at 02:13:54PM -0700, Jens Axboe wrote:
> We only need the request fields and the end_io time if we have
> stats enabled, or if we have a scheduler attached as those may
> use it for completion time stats.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Jens Axboe 
> 
> ---
> 
> v2: add helper, use it in both spots. also clear ->start_time_ns
> so merging doesn't read garbage.
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 7dcef565dc0f..e09d7f500077 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -281,6 +281,15 @@ bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
>  }
>  EXPORT_SYMBOL(blk_mq_can_queue);
>  
> +/*
> + * Only need start/end time stamping if we have stats enabled, or using
> + * an IO scheduler.
> + */
> +static inline bool blk_mq_need_time_stamp(struct request *rq)
> +{
> + return (rq->rq_flags & RQF_IO_STAT) || rq->q->elevator;
> +}
> +
>  static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
>   unsigned int tag, unsigned int op)
>  {
> @@ -316,7 +325,10 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
>   RB_CLEAR_NODE(&rq->rb_node);
>   rq->rq_disk = NULL;
>   rq->part = NULL;
> - rq->start_time_ns = ktime_get_ns();
> + if (blk_mq_need_time_stamp(rq))
> + rq->start_time_ns = ktime_get_ns();
> + else
> + rq->start_time_ns = 0;
>   rq->io_start_time_ns = 0;
>   rq->nr_phys_segments = 0;
>  #if defined(CONFIG_BLK_DEV_INTEGRITY)
> @@ -522,7 +534,10 @@ EXPORT_SYMBOL_GPL(blk_mq_free_request);
>  
>  inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
>  {
> - u64 now = ktime_get_ns();
> + u64 now = 0;
> +
> + if (blk_mq_need_time_stamp(rq))
> + now = ktime_get_ns();
>  
>   if (rq->rq_flags & RQF_STATS) {
>   blk_mq_poll_stats_start(rq->q);
> -- 
> Jens Axboe
> 


Re: [PATCH 1/2] sbitmap: ammortize cost of clearing bits

2018-11-30 Thread Omar Sandoval
On Fri, Nov 30, 2018 at 01:10:47PM -0700, Jens Axboe wrote:
> On 11/30/18 1:03 PM, Omar Sandoval wrote:
> > On Fri, Nov 30, 2018 at 09:01:17AM -0700, Jens Axboe wrote:
> >> sbitmap maintains a set of words that we use to set and clear bits, with
> >> each bit representing a tag for blk-mq. Even though we spread the bits
> >> out and maintain a hint cache, one particular bit allocated will end up
> >> being cleared in the exact same spot.
> >>
> >> This introduces batched clearing of bits. Instead of clearing a given
> >> bit, the same bit is set in a cleared/free mask instead. If we fail
> >> allocating a bit from a given word, then we check the free mask, and
> >> batch move those cleared bits at that time. This trades 64 atomic bitops
> >> for 2 cmpxchg().
> >>
> >> In a threaded poll test case, half the overhead of getting and clearing
> >> tags is removed with this change. On another poll test case with a
> >> single thread, performance is unchanged.
> >>
> >> Signed-off-by: Jens Axboe 
> >> ---
> >>  include/linux/sbitmap.h | 31 +---
> >>  lib/sbitmap.c   | 80 +
> >>  2 files changed, 100 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
> >> index 804a50983ec5..07f117ee19dc 100644
> >> --- a/include/linux/sbitmap.h
> >> +++ b/include/linux/sbitmap.h
> >> @@ -30,14 +30,24 @@ struct seq_file;
> >>   */
> >>  struct sbitmap_word {
> >>/**
> >> -   * @word: The bitmap word itself.
> >> +   * @depth: Number of bits being used in @word/@cleared
> >> */
> >> -  unsigned long word;
> >> +  unsigned long depth;
> >>  
> >>/**
> >> -   * @depth: Number of bits being used in @word.
> >> +   * @word: word holding free bits
> >> */
> >> -  unsigned long depth;
> >> +  unsigned long word cacheline_aligned_in_smp;
> > 
> > Still splitting up word and depth in separate cachelines?
> 
> Yeah, I mentioned that in one of the other postings, there's still a
> definite win to doing that.
> 
> > Okay, I couldn't find any holes in this one :)
> 
> Good to hear that :-)
> 
> >> -unsigned int sbitmap_weight(const struct sbitmap *sb)
> >> +static unsigned int __sbitmap_weight(const struct sbitmap *sb, bool set)
> >>  {
> >>unsigned int i, weight = 0;
> >>  
> >>for (i = 0; i < sb->map_nr; i++) {
> >>const struct sbitmap_word *word = &sb->map[i];
> >>  
> >> -  weight += bitmap_weight(&word->word, word->depth);
> >> +  if (set)
> >> +  weight += bitmap_weight(&word->word, word->depth);
> > 
> > Should probably do
> > weight -= bitmap_weight(&word->cleared, word->depth);
> > 
> > too, right?
> 
> We only use these for the debugfs stuff, how about I just make it static
> instead?

Yeah, with that,

Reviewed-by: Omar Sandoval 


Re: [PATCH 2/2] sbitmap: optimize wakeup check

2018-11-30 Thread Omar Sandoval
On Fri, Nov 30, 2018 at 09:01:18AM -0700, Jens Axboe wrote:
> Even if we have no waiters on any of the sbitmap_queue wait states, we
> still have to loop every entry to check. We do this for every IO, so
> the cost adds up.
> 
> Shift a bit of the cost to the slow path, when we actually have waiters.
> Wrap prepare_to_wait_exclusive() and finish_wait(), so we can maintain
> an internal count of how many are currently active. Then we can simply
> check this count in sbq_wake_ptr() and not have to loop if we don't
> have any sleepers.
> 
> Convert the two users of sbitmap with waiting, blk-mq-tag and iSCSI.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Jens Axboe 
> ---
>  block/blk-mq-tag.c   | 11 
>  drivers/target/iscsi/iscsi_target_util.c | 12 +
>  include/linux/sbitmap.h  | 34 
>  lib/sbitmap.c| 28 +++
>  4 files changed, 74 insertions(+), 11 deletions(-)
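For readers skimming the archive, the shape of the change under review is roughly
the following. This is a sketch based on the commit message above; the names
sbq_wait, ws_active, sbitmap_prepare_to_wait() and sbitmap_finish_wait() are taken
to match the posted series and may differ from what was finally merged:

/*
 * Sketch: wrap the waitqueue enter/exit so sbitmap_queue can keep a count
 * of active waiters, then have the wakeup path bail out early when there
 * are none, instead of scanning every wait state on each completion.
 */
struct sbq_wait {
	struct sbitmap_queue *sbq;	/* non-NULL while accounted in ws_active */
	struct wait_queue_entry wait;
};

static void sbitmap_prepare_to_wait(struct sbitmap_queue *sbq,
				    struct sbq_wait_state *ws,
				    struct sbq_wait *sbq_wait, int state)
{
	if (!sbq_wait->sbq) {
		sbq_wait->sbq = sbq;
		atomic_inc(&sbq->ws_active);
	}
	prepare_to_wait_exclusive(&ws->wait, &sbq_wait->wait, state);
}

static void sbitmap_finish_wait(struct sbitmap_queue *sbq,
				struct sbq_wait_state *ws,
				struct sbq_wait *sbq_wait)
{
	finish_wait(&ws->wait, &sbq_wait->wait);
	if (sbq_wait->sbq) {
		atomic_dec(&sbq->ws_active);
		sbq_wait->sbq = NULL;
	}
}

static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
{
	/* Fast path: no sleepers, so skip looping over the wait queues. */
	if (!atomic_read(&sbq->ws_active))
		return NULL;

	/* ... fall through to the existing scan of sbq->ws[] ... */
	return NULL;
}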


Re: [PATCH 1/2] sbitmap: ammortize cost of clearing bits

2018-11-30 Thread Omar Sandoval
On Fri, Nov 30, 2018 at 09:01:17AM -0700, Jens Axboe wrote:
> sbitmap maintains a set of words that we use to set and clear bits, with
> each bit representing a tag for blk-mq. Even though we spread the bits
> out and maintain a hint cache, one particular bit allocated will end up
> being cleared in the exact same spot.
> 
> This introduces batched clearing of bits. Instead of clearing a given
> bit, the same bit is set in a cleared/free mask instead. If we fail
> allocating a bit from a given word, then we check the free mask, and
> batch move those cleared bits at that time. This trades 64 atomic bitops
> for 2 cmpxchg().
> 
> In a threaded poll test case, half the overhead of getting and clearing
> tags is removed with this change. On another poll test case with a
> single thread, performance is unchanged.
> 
> Signed-off-by: Jens Axboe 
> ---
>  include/linux/sbitmap.h | 31 +---
>  lib/sbitmap.c   | 80 +
>  2 files changed, 100 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
> index 804a50983ec5..07f117ee19dc 100644
> --- a/include/linux/sbitmap.h
> +++ b/include/linux/sbitmap.h
> @@ -30,14 +30,24 @@ struct seq_file;
>   */
>  struct sbitmap_word {
>   /**
> -  * @word: The bitmap word itself.
> +  * @depth: Number of bits being used in @word/@cleared
>*/
> - unsigned long word;
> + unsigned long depth;
>  
>   /**
> -  * @depth: Number of bits being used in @word.
> +  * @word: word holding free bits
>*/
> - unsigned long depth;
> + unsigned long word cacheline_aligned_in_smp;

Still splitting up word and depth in separate cachelines?

> + /**
> +  * @cleared: word holding cleared bits
> +  */
> + unsigned long cleared cacheline_aligned_in_smp;
> +
> + /**
> +  * @swap_lock: Held while swapping word <-> cleared
> +  */
> + spinlock_t swap_lock;
>  } cacheline_aligned_in_smp;
>  
>  /**
> @@ -310,6 +320,19 @@ static inline void sbitmap_clear_bit(struct sbitmap *sb, unsigned int bitnr)
>   clear_bit(SB_NR_TO_BIT(sb, bitnr), __sbitmap_word(sb, bitnr));
>  }
>  
> +/*
> + * This one is special, since it doesn't actually clear the bit, rather it
> + * sets the corresponding bit in the ->cleared mask instead. Paired with
> + * the caller doing sbitmap_batch_clear() if a given index is full, which
> + * will clear the previously freed entries in the corresponding ->word.
> + */
> +static inline void sbitmap_deferred_clear_bit(struct sbitmap *sb, unsigned int bitnr)
> +{
> + unsigned long *addr = &sb->map[SB_NR_TO_INDEX(sb, bitnr)].cleared;
> +
> + set_bit(SB_NR_TO_BIT(sb, bitnr), addr);
> +}
> +
>  static inline void sbitmap_clear_bit_unlock(struct sbitmap *sb,
>   unsigned int bitnr)
>  {
> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
> index 45cab6bbc1c7..f6a9553617bd 100644
> --- a/lib/sbitmap.c
> +++ b/lib/sbitmap.c
> @@ -59,6 +59,7 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
>   for (i = 0; i < sb->map_nr; i++) {
>   sb->map[i].depth = min(depth, bits_per_word);
>   depth -= sb->map[i].depth;
> + spin_lock_init(&sb->map[i].swap_lock);
>   }
>   return 0;
>  }
> @@ -111,6 +112,57 @@ static int __sbitmap_get_word(unsigned long *word, unsigned long depth,
>   return nr;
>  }
>  
> +/*
> + * See if we have deferred clears that we can batch move
> + */
> +static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
> +{
> + unsigned long mask, val;
> + bool ret = false;
> +
> + spin_lock(&sb->map[index].swap_lock);
> +
> + if (!sb->map[index].cleared)
> + goto out_unlock;
> +
> + /*
> +  * First get a stable cleared mask, setting the old mask to 0.
> +  */
> + do {
> + mask = sb->map[index].cleared;
> + } while (cmpxchg(&sb->map[index].cleared, mask, 0) != mask);
> +
> + /*
> +  * Now clear the masked bits in our free word
> +  */
> + do {
> + val = sb->map[index].word;
> + } while (cmpxchg(&sb->map[index].word, val, val & ~mask) != val);
> +
> + ret = true;
> +out_unlock:
> + spin_unlock(&sb->map[index].swap_lock);
> + return ret;
> +}

Okay, I couldn't find any holes in this one :)

> +static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
> +  unsigned int alloc_hint, bool round_robin)
> +{
> + int nr;
> +
> + do {
> + nr = __sbitmap_get_word(&sb->map[index].word,
> + sb->map[index].depth, alloc_hint,
> + !round_robin);
> + if (nr != -1)
> + break;
> + if (!sbitmap_deferred_clear(sb, index))
> + break;
> + } while (1);
> +
> + return nr;
> +}
> +
>  int 

Re: [PATCH] sbitmap: ammortize cost of clearing bits

2018-11-29 Thread Omar Sandoval
On Thu, Nov 29, 2018 at 01:00:25PM -0700, Jens Axboe wrote:
> sbitmap maintains a set of words that we use to set and clear bits, with
> each bit representing a tag for blk-mq. Even though we spread the bits
> out and maintain a hint cache, one particular bit allocated will end up
> being cleared in the exact same spot.
> 
> This introduces batched clearing of bits. Instead of clearing a given
> bit, the same bit is set in a cleared/free mask instead. If we fail
> allocating a bit from a given word, then we check the free mask, and
> batch move those cleared bits at that time. This trades 64 atomic bitops
> for 2 cmpxchg(). On top of that, we do those sequentially, hopefully
> making that a bit cheaper as well.
> 
> In a threaded poll test case, half the overhead of getting and clearing
> tags is removed with this change. On another poll test case with a
> single thread, performance is unchanged.
> 
> Signed-off-by: Jens Axboe 
> 
> ---
> 
> This patch is on top of the round robin fix for sbitmap just posted.

So I think this can lead to hangs due to interactions with wake_batch.
Bear with me:

Let's say the sbitmap_queue was created with depth=64 and shift=6, i.e.,
64 bits all in one word. In this case, wake_batch=8, because 64 bits
clearing should be enough to wake up in 8 batches of 8.

Let's say all 64 bits are allocated and there are 8 waiters. All 64 bits
are then freed (in the cleared word), so we wake up the 8 waiters. Let's
say they all attempt __sbitmap_get_word(), fail, and get to
sbitmap_deferred_clear() at the same time. One of them does the
cmpxchg() on cleared, setting it to zero, and then the rest see that
cleared is zero, so they return because there don't appear to be any
cleared bits. The first thread succeeds, and the rest go to sleep. 

Now, when the first thread clears the bit, it only subtracts one from
the batch, which isn't enough to do a wakeup. Hang!

This example is contrived, but in general I think that the window
between the cleared mask being zeroed and the word being cleared is
going to cause problems.
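
(For reference, the wake_batch=8 figure comes from the depth-based batch
calculation. A simplified sketch, assuming the SBQ_WAIT_QUEUES/SBQ_WAKE_BATCH
constants in lib/sbitmap.c and ignoring the shallow-depth handling in the real
code:)

#define SBQ_WAIT_QUEUES	8
#define SBQ_WAKE_BATCH	8

/*
 * With depth=64 this gives 8: eight wait queues, each expecting a batch of
 * eight frees before a wakeup.  Frees that sit in ->cleared without ever
 * being folded back into ->word never advance that count, which is the
 * hang described above.
 */
static unsigned int sbq_calc_wake_batch(unsigned int depth)
{
	return clamp_t(unsigned int, depth / SBQ_WAIT_QUEUES, 1, SBQ_WAKE_BATCH);
}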

> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
> index 804a50983ec5..cec685b89998 100644
> --- a/include/linux/sbitmap.h
> +++ b/include/linux/sbitmap.h
> @@ -30,14 +30,19 @@ struct seq_file;
>   */
>  struct sbitmap_word {
>   /**
> -  * @word: The bitmap word itself.
> +  * @depth: Number of bits being used in @word/@cleared
>*/
> - unsigned long word;
> + unsigned long depth;
>  
>   /**
> -  * @depth: Number of bits being used in @word.
> +  * @word: word holding free bits
>*/
> - unsigned long depth;
> + unsigned long word cacheline_aligned_in_smp;

Completely separate note (assuming that we can iron out the race above),
this triples the size of each map. Does the word have to be in a
separate cacheline from the depth? Ignoring resize, depth and word are
always accessed together, so it seems to me that it would benefit from
sharing a cacheline now that clearing happens elsewhere.

> + /**
> +  * @cleared: word holding cleared bits
> +  */
> + unsigned long cleared cacheline_aligned_in_smp;
>  } cacheline_aligned_in_smp;
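
A sketch of the layout suggested in the comment above -- @word and @depth back
on one cacheline, with only the new @cleared mask isolated. Illustrative only;
per the v2 review further up in this archive, the posted series kept word and
depth on separate cachelines because that still measured as a win:

struct sbitmap_word {
	/**
	 * @depth: Number of bits being used in @word/@cleared.
	 */
	unsigned long depth;

	/**
	 * @word: word holding free bits; read together with @depth on every
	 * allocation, so it could share @depth's cacheline.
	 */
	unsigned long word;

	/**
	 * @cleared: word holding cleared bits; written from the completion
	 * side, so keep it on its own cacheline.
	 */
	unsigned long cleared ____cacheline_aligned_in_smp;
} ____cacheline_aligned_in_smp;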


Re: [PATCH] sbitmap: don't loop for find_next_zero_bit() for !round_robin

2018-11-29 Thread Omar Sandoval
On Thu, Nov 29, 2018 at 12:34:12PM -0700, Jens Axboe wrote:
> If we aren't forced to do round robin tag allocation, just use the
> allocation hint to find the index for the tag word, don't use it for the
> offset inside the word.

Maybe also add "We're already fetching that cache line, so we might as
well check the whole word."

> This avoids a potential extra round trip in the
> bit looping.
> 
> Signed-off-by: Jens Axboe 
> 
> ---
> 
> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
> index fdd1b8aa8ac6..2987b2ac8ed7 100644
> --- a/lib/sbitmap.c
> +++ b/lib/sbitmap.c
> @@ -118,10 +118,19 @@ int sbitmap_get(struct sbitmap *sb, unsigned int alloc_hint, bool round_robin)
>  
>   index = SB_NR_TO_INDEX(sb, alloc_hint);
>  
> + /*
> +  * Unless we're doing round robin tag allocation, just use the
> +  * alloc_hint to find the right word index. No point in looping
> +  * twice in find_next_zero_bit() for that case.
> +  */
> + if (round_robin)
> + alloc_hint = SB_NR_TO_BIT(sb, alloc_hint);
> + else
> + alloc_hint = 0;
> +
>   for (i = 0; i < sb->map_nr; i++) {
>   nr = __sbitmap_get_word(&sb->map[index].word,
> - sb->map[index].depth,
> - SB_NR_TO_BIT(sb, alloc_hint),
> + sb->map[index].depth, alloc_hint,
>   !round_robin);
>   if (nr != -1) {
>   nr += index << sb->shift;
> @@ -130,12 +139,8 @@ int sbitmap_get(struct sbitmap *sb, unsigned int alloc_hint, bool round_robin)
>  
>   /* Jump to next index. */
>   index++;
> - alloc_hint = index << sb->shift;
> -
> - if (index >= sb->map_nr) {
> + if (index >= sb->map_nr)
>   index = 0;
> - alloc_hint = 0;
> - }

We need to set alloc_hint = 0 here for the round_robin case.

>   }
>  
>   return nr;
> 
> -- 
> Jens Axboe
> 
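
For clarity, the index-advance with the reset pointed out above would look
something like this (sketch only, folding the review comment into the quoted
hunk):

	for (i = 0; i < sb->map_nr; i++) {
		nr = __sbitmap_get_word(&sb->map[index].word,
					sb->map[index].depth, alloc_hint,
					!round_robin);
		if (nr != -1) {
			nr += index << sb->shift;
			break;
		}

		/*
		 * Jump to the next index and restart that word's scan from
		 * bit 0.  Without this reset, the round_robin case would
		 * carry a stale in-word hint over to the next word and
		 * could skip free bits at its start.
		 */
		alloc_hint = 0;
		index++;
		if (index >= sb->map_nr)
			index = 0;
	}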


Re: [PATCH 2/8] block: improve logic around when to sort a plug list

2018-11-27 Thread Omar Sandoval
On Tue, Nov 27, 2018 at 04:59:14PM -0700, Jens Axboe wrote:
> On 11/27/18 4:49 PM, Jens Axboe wrote:
> > On 11/27/18 4:31 PM, Omar Sandoval wrote:
> >> On Mon, Nov 26, 2018 at 09:35:50AM -0700, Jens Axboe wrote:
> >>> Do it for the nr_hw_queues == 1 case, but only do it for the multi queue
> >>> case if we have requests for multiple devices in the plug.
> >>>
> >>> Signed-off-by: Jens Axboe 
> >>> ---
> >>>  block/blk-core.c   | 1 +
> >>>  block/blk-mq.c | 7 +--
> >>>  include/linux/blkdev.h | 1 +
> >>>  3 files changed, 7 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/block/blk-core.c b/block/blk-core.c
> >>> index be9233400314..c9758d185357 100644
> >>> --- a/block/blk-core.c
> >>> +++ b/block/blk-core.c
> >>> @@ -1780,6 +1780,7 @@ void blk_start_plug(struct blk_plug *plug)
> >>>   INIT_LIST_HEAD(&plug->mq_list);
> >>>   INIT_LIST_HEAD(&plug->cb_list);
> >>>   plug->rq_count = 0;
> >>> + plug->do_sort = false;
> >>>  
> >>>   /*
> >>>* Store ordering should not be needed here, since a potential
> >>> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >>> index 99c66823d52f..6a249bf6ed00 100644
> >>> --- a/block/blk-mq.c
> >>> +++ b/block/blk-mq.c
> >>> @@ -1678,7 +1678,8 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
> >>>   list_splice_init(&plug->mq_list, &list);
> >>>   plug->rq_count = 0;
> >>>  
> >>> - list_sort(NULL, &list, plug_rq_cmp);
> >>> + if (plug->do_sort)
> >>> + list_sort(NULL, &list, plug_rq_cmp);
> >>>  
> >>>   this_q = NULL;
> >>>   this_hctx = NULL;
> >>> @@ -1935,6 +1936,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
> >>>  
> >>>   list_add_tail(&rq->queuelist, &plug->mq_list);
> >>>   plug->rq_count++;
> >>> + plug->do_sort = true;
> >>>   } else if (plug && !blk_queue_nomerges(q)) {
> >>>   blk_mq_bio_to_request(rq, bio);
> >>>  
> >>> @@ -1958,7 +1960,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
> >>>   data.hctx = same_queue_rq->mq_hctx;
> >>>   blk_mq_try_issue_directly(data.hctx, same_queue_rq,
> >>>   &cookie);
> >>> - }
> >>> + } else if (plug->rq_count > 1)
> >>> + plug->do_sort = true;
> >>
> >> If plug->rq_count == 2, there's no benefit to sorting, either. The
> >> nr_hw_queues == 1 case could also avoid sorting in that case. So maybe
> >> this whole patch could just be replaced with:
> > 
> > Heh yes, good point, it should be 3 at least. But if you look at the
> > later mq plug patch, we only sort for that one if we have multiple
> > queues. So the logic should be something ala:
> > 
> > if (plug->rq_count > 2 && plug->has_multiple_queues)
> > 
> > since that's the only case we want to sort for.
> 
> How about something like this?
> 
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index be9233400314..d107d016b92b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1780,6 +1780,7 @@ void blk_start_plug(struct blk_plug *plug)
>   INIT_LIST_HEAD(&plug->mq_list);
>   INIT_LIST_HEAD(&plug->cb_list);
>   plug->rq_count = 0;
> + plug->multiple_queues = false;
>  
>   /*
>* Store ordering should not be needed here, since a potential
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 7b7dff85cf6c..02daa32c5d77 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1677,7 +1677,8 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>   list_splice_init(&plug->mq_list, &list);
>   plug->rq_count = 0;
>  
> - list_sort(NULL, &list, plug_rq_cmp);
> + if (plug->rq_count > 2 && plug->multiple_queues)
> + list_sort(NULL, &list, plug_rq_cmp);
>  
>   this_q = NULL;
>   this_hctx = NULL;
> @@ -1866,6 +1867,20 @@ void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
>   }
>  }
>  
> +static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
> +{
> + list_add_tail(&rq->queuelist, &plug->mq_

Re: [PATCH 2/8] block: improve logic around when to sort a plug list

2018-11-27 Thread Omar Sandoval
On Tue, Nov 27, 2018 at 04:49:27PM -0700, Jens Axboe wrote:
> On 11/27/18 4:31 PM, Omar Sandoval wrote:
> > On Mon, Nov 26, 2018 at 09:35:50AM -0700, Jens Axboe wrote:
> >> Do it for the nr_hw_queues == 1 case, but only do it for the multi queue
> >> case if we have requests for multiple devices in the plug.
> >>
> >> Signed-off-by: Jens Axboe 
> >> ---
> >>  block/blk-core.c   | 1 +
> >>  block/blk-mq.c | 7 +--
> >>  include/linux/blkdev.h | 1 +
> >>  3 files changed, 7 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/block/blk-core.c b/block/blk-core.c
> >> index be9233400314..c9758d185357 100644
> >> --- a/block/blk-core.c
> >> +++ b/block/blk-core.c
> >> @@ -1780,6 +1780,7 @@ void blk_start_plug(struct blk_plug *plug)
> >>INIT_LIST_HEAD(&plug->mq_list);
> >>INIT_LIST_HEAD(&plug->cb_list);
> >>plug->rq_count = 0;
> >> +  plug->do_sort = false;
> >>  
> >>/*
> >> * Store ordering should not be needed here, since a potential
> >> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >> index 99c66823d52f..6a249bf6ed00 100644
> >> --- a/block/blk-mq.c
> >> +++ b/block/blk-mq.c
> >> @@ -1678,7 +1678,8 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
> >>list_splice_init(&plug->mq_list, &list);
> >>plug->rq_count = 0;
> >>  
> >> -  list_sort(NULL, &list, plug_rq_cmp);
> >> +  if (plug->do_sort)
> >> +  list_sort(NULL, &list, plug_rq_cmp);
> >>  
> >>this_q = NULL;
> >>this_hctx = NULL;
> >> @@ -1935,6 +1936,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
> >>  
> >>list_add_tail(&rq->queuelist, &plug->mq_list);
> >>plug->rq_count++;
> >> +  plug->do_sort = true;
> >>} else if (plug && !blk_queue_nomerges(q)) {
> >>blk_mq_bio_to_request(rq, bio);
> >>  
> >> @@ -1958,7 +1960,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
> >>data.hctx = same_queue_rq->mq_hctx;
> >>blk_mq_try_issue_directly(data.hctx, same_queue_rq,
> >>&cookie);
> >> -  }
> >> +  } else if (plug->rq_count > 1)
> >> +  plug->do_sort = true;
> > 
> > If plug->rq_count == 2, there's no benefit to sorting, either. The
> > nr_hw_queues == 1 case could also avoid sorting in that case. So maybe
> > this whole patch could just be replaced with:
> 
> Heh yes, good point, it should be 3 at least. But if you look at the
> later mq plug patch, we only sort for that one if we have multiple
> queues. So the logic should be something ala:
> 
> if (plug->rq_count > 2 && plug->has_multiple_queues)
> 
> since that's the only case we want to sort for.

Yeah, just got to that one. If you're going to respin this, I'll wait
for you to change that around before really looking at it.


Re: [PATCH 7/8] blk-mq: use bd->last == true for list inserts

2018-11-27 Thread Omar Sandoval
On Mon, Nov 26, 2018 at 09:35:55AM -0700, Jens Axboe wrote:
> If we are issuing a list of requests, we know if we're at the last one.
> If we fail issuing, ensure that we call ->commits_rqs() to flush any
> potential previous requests.

One comment below, otherwise

Reviewed-by: Omar Sandoval 

> Signed-off-by: Jens Axboe 
> ---
>  block/blk-core.c |  2 +-
>  block/blk-mq.c   | 32 
>  block/blk-mq.h   |  2 +-
>  3 files changed, 26 insertions(+), 10 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index c9758d185357..808a65d23f1a 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1334,7 +1334,7 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *rq)
>* bypass a potential scheduler on the bottom device for
>* insert.
>*/
> - return blk_mq_request_issue_directly(rq);
> + return blk_mq_request_issue_directly(rq, true);
>  }
>  EXPORT_SYMBOL_GPL(blk_insert_cloned_request);
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 6a249bf6ed00..0a12cec0b426 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1260,6 +1260,14 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
>   if (!list_empty(list)) {
>   bool needs_restart;
>  
> + /*
> +  * If we didn't flush the entire list, we could have told
> +  * the driver there was more coming, but that turned out to
> +  * be a lie.
> +  */
> + if (q->mq_ops->commit_rqs)
> + q->mq_ops->commit_rqs(hctx);
> +

This hunk seems like it should go with the patch adding commit_rqs.

>   spin_lock(&hctx->lock);
>   list_splice_init(list, &hctx->dispatch);
>   spin_unlock(&hctx->lock);
> @@ -1736,12 +1744,12 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx *hctx, struct request *rq)
>  
>  static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
>   struct request *rq,
> - blk_qc_t *cookie)
> + blk_qc_t *cookie, bool last)
>  {
>   struct request_queue *q = rq->q;
>   struct blk_mq_queue_data bd = {
>   .rq = rq,
> - .last = true,
> + .last = last,
>   };
>   blk_qc_t new_cookie;
>   blk_status_t ret;
> @@ -1776,7 +1784,7 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
>  static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>   struct request *rq,
>   blk_qc_t *cookie,
> - bool bypass_insert)
> + bool bypass_insert, bool last)
>  {
>   struct request_queue *q = rq->q;
>   bool run_queue = true;
> @@ -1805,7 +1813,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>   goto insert;
>   }
>  
> - return __blk_mq_issue_directly(hctx, rq, cookie);
> + return __blk_mq_issue_directly(hctx, rq, cookie, last);
>  insert:
>   if (bypass_insert)
>   return BLK_STS_RESOURCE;
> @@ -1824,7 +1832,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>  
>   hctx_lock(hctx, &srcu_idx);
>  
> - ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
> + ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false, true);
>   if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
>   blk_mq_sched_insert_request(rq, false, true, false);
>   else if (ret != BLK_STS_OK)
> @@ -1833,7 +1841,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>   hctx_unlock(hctx, srcu_idx);
>  }
>  
> -blk_status_t blk_mq_request_issue_directly(struct request *rq)
> +blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last)
>  {
>   blk_status_t ret;
>   int srcu_idx;
> @@ -1841,7 +1849,7 @@ blk_status_t blk_mq_request_issue_directly(struct request *rq)
>   struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
>  
>   hctx_lock(hctx, &srcu_idx);
> - ret = __blk_mq_try_issue_directly(hctx, rq, &unused_cookie, true);
> + ret = __blk_mq_try_issue_directly(hctx, rq, &unused_cookie, true, last);
>   hctx_unlock(hctx, srcu_idx);
>  
>   return ret;
> @@ -1856,7 +1864,7 @@ void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
> 

Re: [PATCH 6/8] ataflop: implement mq_ops->commit_rqs() hook

2018-11-27 Thread Omar Sandoval
On Mon, Nov 26, 2018 at 09:35:54AM -0700, Jens Axboe wrote:
> We need this for blk-mq to kick things into gear, if we told it that
> we had more IO coming, but then failed to deliver on that promise.

Who converted this one? Oh yeah, it was me...

Reviewed-by: Omar Sandoval 

> Signed-off-by: Jens Axboe 
> ---
>  drivers/block/ataflop.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/block/ataflop.c b/drivers/block/ataflop.c
> index f88b4c26d422..475cb972f324 100644
> --- a/drivers/block/ataflop.c
> +++ b/drivers/block/ataflop.c
> @@ -1471,6 +1471,13 @@ static void setup_req_params( int drive )
>   ReqTrack, ReqSector, (unsigned long)ReqData ));
>  }
>  
> +static void ataflop_commit_rqs(struct blk_mq_hw_ctx *hctx)
> +{
> + spin_lock_irq(&ataflop_lock);
> + finish_fdc();
> + spin_unlock_irq(&ataflop_lock);
> +}
> +
>  static blk_status_t ataflop_queue_rq(struct blk_mq_hw_ctx *hctx,
>const struct blk_mq_queue_data *bd)
>  {
> @@ -1947,6 +1954,7 @@ static const struct block_device_operations floppy_fops = {
>  
>  static const struct blk_mq_ops ataflop_mq_ops = {
>   .queue_rq = ataflop_queue_rq,
> + .commit_rqs = ataflop_commit_rqs,
>  };
>  
>  static struct kobject *floppy_find(dev_t dev, int *part, void *data)
> -- 
> 2.17.1
> 


Re: [PATCH 5/8] virtio_blk: implement mq_ops->commit_rqs() hook

2018-11-27 Thread Omar Sandoval
On Mon, Nov 26, 2018 at 09:35:53AM -0700, Jens Axboe wrote:
> We need this for blk-mq to kick things into gear, if we told it that
> we had more IO coming, but then failed to deliver on that promise.

Reviewed-by: Omar Sandoval 

But also cc'd the virtio-blk maintainers.

> Signed-off-by: Jens Axboe 
> ---
>  drivers/block/virtio_blk.c | 15 +++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 6e869d05f91e..b49c57e77780 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -214,6 +214,20 @@ static void virtblk_done(struct virtqueue *vq)
>   spin_unlock_irqrestore(>vqs[qid].lock, flags);
>  }
>  
> +static void virtio_commit_rqs(struct blk_mq_hw_ctx *hctx)
> +{
> + struct virtio_blk *vblk = hctx->queue->queuedata;
> + int qid = hctx->queue_num;
> + bool kick;
> +
> + spin_lock_irq(&vblk->vqs[qid].lock);
> + kick = virtqueue_kick_prepare(vblk->vqs[qid].vq);
> + spin_unlock_irq(&vblk->vqs[qid].lock);
> +
> + if (kick)
> + virtqueue_notify(vblk->vqs[qid].vq);
> +}
> +
>  static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
>  const struct blk_mq_queue_data *bd)
>  {
> @@ -638,6 +652,7 @@ static void virtblk_initialize_rq(struct request *req)
>  
>  static const struct blk_mq_ops virtio_mq_ops = {
>   .queue_rq   = virtio_queue_rq,
> + .commit_rqs = virtio_commit_rqs,
>   .complete   = virtblk_request_done,
>   .init_request   = virtblk_init_request,
>  #ifdef CONFIG_VIRTIO_BLK_SCSI
> -- 
> 2.17.1
> 


Re: [PATCH 3/8] blk-mq: add mq_ops->commit_rqs()

2018-11-27 Thread Omar Sandoval
On Mon, Nov 26, 2018 at 09:35:51AM -0700, Jens Axboe wrote:
> blk-mq passes information to the hardware about any given request being
> the last that we will issue in this sequence. The point is that hardware
> can defer costly doorbell type writes to the last request. But if we run
> into errors issuing a sequence of requests, we may never send the request
> with bd->last == true set. For that case, we need a hook that tells the
> hardware that nothing else is coming right now.
> 
> For failures returned by the drivers ->queue_rq() hook, the driver is
> responsible for flushing pending requests, if it uses bd->last to
> optimize that part. This works like before, no changes there.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Jens Axboe 
> ---
>  include/linux/blk-mq.h | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index ca0520ca6437..1fd139b65a6e 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -117,6 +117,7 @@ struct blk_mq_queue_data {
>  
>  typedef blk_status_t (queue_rq_fn)(struct blk_mq_hw_ctx *,
>   const struct blk_mq_queue_data *);
> +typedef void (commit_rqs_fn)(struct blk_mq_hw_ctx *);
>  /* takes rq->cmd_flags as input, returns a hardware type index */
>  typedef int (rq_flags_to_type_fn)(struct request_queue *, unsigned int);
>  typedef bool (get_budget_fn)(struct blk_mq_hw_ctx *);
> @@ -144,6 +145,15 @@ struct blk_mq_ops {
>*/
>   queue_rq_fn *queue_rq;
>  
> + /*
> +  * If a driver uses bd->last to judge when to submit requests to
> +  * hardware, it must define this function. In case of errors that
> +  * make us stop issuing further requests, this hook serves the
> +  * purpose of kicking the hardware (which the last request otherwise
> +  * would have done).
> +  */
> + commit_rqs_fn   *commit_rqs;
> +
>   /*
>* Return a queue map type for the given request/bio flags
>*/
> -- 
> 2.17.1
> 


Re: [PATCH 2/8] block: improve logic around when to sort a plug list

2018-11-27 Thread Omar Sandoval
On Mon, Nov 26, 2018 at 09:35:50AM -0700, Jens Axboe wrote:
> Do it for the nr_hw_queues == 1 case, but only do it for the multi queue
> case if we have requests for multiple devices in the plug.
> 
> Signed-off-by: Jens Axboe 
> ---
>  block/blk-core.c   | 1 +
>  block/blk-mq.c | 7 +--
>  include/linux/blkdev.h | 1 +
>  3 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index be9233400314..c9758d185357 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1780,6 +1780,7 @@ void blk_start_plug(struct blk_plug *plug)
>   INIT_LIST_HEAD(&plug->mq_list);
>   INIT_LIST_HEAD(&plug->cb_list);
>   plug->rq_count = 0;
> + plug->do_sort = false;
>  
>   /*
>* Store ordering should not be needed here, since a potential
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 99c66823d52f..6a249bf6ed00 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1678,7 +1678,8 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>   list_splice_init(&plug->mq_list, &list);
>   plug->rq_count = 0;
>  
> - list_sort(NULL, &list, plug_rq_cmp);
> + if (plug->do_sort)
> + list_sort(NULL, &list, plug_rq_cmp);
>  
>   this_q = NULL;
>   this_hctx = NULL;
> @@ -1935,6 +1936,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
>  
>   list_add_tail(&rq->queuelist, &plug->mq_list);
>   plug->rq_count++;
> + plug->do_sort = true;
>   } else if (plug && !blk_queue_nomerges(q)) {
>   blk_mq_bio_to_request(rq, bio);
>  
> @@ -1958,7 +1960,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
>   data.hctx = same_queue_rq->mq_hctx;
>   blk_mq_try_issue_directly(data.hctx, same_queue_rq,
>   &cookie);
> - }
> + } else if (plug->rq_count > 1)
> + plug->do_sort = true;

If plug->rq_count == 2, there's no benefit to sorting, either. The
nr_hw_queues == 1 case could also avoid sorting in that case. So maybe
this whole patch could just be replaced with:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7b7dff85cf6c..de93c14e4700 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1675,9 +1675,10 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
unsigned int depth;
 
list_splice_init(&plug->mq_list, &list);
-   plug->rq_count = 0;
 
-   list_sort(NULL, &list, plug_rq_cmp);
+   if (plug->rq_count > 2)
+   list_sort(NULL, &list, plug_rq_cmp);
+   plug->rq_count = 0;
 
this_q = NULL;
this_hctx = NULL;

That will also handle the multi-queue case since we only plug one
request per multi-queue device. Thoughts?


Re: [PATCH] block/025: test discard sector alignement and sector size overflow

2018-11-26 Thread Omar Sandoval
On Thu, Nov 15, 2018 at 12:00:17PM +0800, Ming Lei wrote:
> This test covers the following two issues:
> 
> 1) discard sector need to be aligned with logical block size
> 
> 2) make sure 'sector_t' instead of 'unsigned int' is used when comparing
> with discard sector size
> 
> Signed-off-by: Ming Lei 
> ---
>  tests/block/025 | 37 +
>  tests/block/025.out |  2 ++
>  2 files changed, 39 insertions(+)
>  create mode 100755 tests/block/025
>  create mode 100644 tests/block/025.out
> 
> diff --git a/tests/block/025 b/tests/block/025
> new file mode 100755
> index ..32b632431793
> --- /dev/null
> +++ b/tests/block/025
> @@ -0,0 +1,37 @@
> +#!/bin/bash
> +# SPDX-License-Identifier: GPL-3.0+
> +# Copyright (C) 2018 Ming Lei 
> +#
> +# Check two corner cases of BLKDISCARD.
> +#
> +# 1) test if discard bio's sector is aligned with logical block size, fixed by
> +#1adfc5e4136f ("block: make sure discard bio is aligned with logical block size")

Hm, I'm not seeing how this test case tests this commit. Aren't 2049G
and 512M both aligned to 4096 bytes?

> +# 2) test 32 bit overflow when comparing discard sector size. Fixed by
> +#4800bf7bc8c72 ("block: fix 32 bit overflow in __blkdev_issue_discard()")
> +
> +. tests/block/rc
> +. common/scsi_debug
> +
> +DESCRIPTION="check sector alignment and sector size overflow of BLKDISCARD"
> +
> +requires() {
> + _have_scsi_debug
> +}
> +
> +test() {
> + echo "Running ${TEST_NAME}"
> +
> + rm -f "$FULL"
> +
> + # Create virtual device with unmap_zeroes_data support
> + if ! _init_scsi_debug virtual_gb=2049 sector_size=4096 lbpws10=1 dev_size_mb=512; then
> + return 1
> + fi
> +
> + local dev="/dev/${SCSI_DEBUG_DEVICES[0]}"
> + blkdiscard "$dev"
> +
> + _exit_scsi_debug
> +
> + echo "Test complete"
> +}
> diff --git a/tests/block/025.out b/tests/block/025.out
> new file mode 100644
> index ..fd9a6d5f70de
> --- /dev/null
> +++ b/tests/block/025.out
> @@ -0,0 +1,2 @@
> +Running block/025
> +Test complete
> -- 
> 2.9.5
> 


Re: [PATCH blktests] Add use of logger so that syslog files show when each test starts

2018-11-26 Thread Omar Sandoval
On Mon, Nov 26, 2018 at 02:31:29PM -0500, Theodore Y. Ts'o wrote:
> On Mon, Nov 26, 2018 at 10:37:23AM -0800, Omar Sandoval wrote:
> > 
> > Hm, what if we output it as KERN_INFO?
> > 
> > diff --git a/check b/check
> > index f6c3537..9b4765f 100755
> > --- a/check
> > +++ b/check
> > @@ -314,7 +314,7 @@ _call_test() {
> >  
> > if [[ -w /dev/kmsg ]]; then
> > local dmesg_marker="run blktests $TEST_NAME at ${TEST_RUN["date"]}"
> > -   echo "$dmesg_marker" >> /dev/kmsg
> > +   echo "<6>$dmesg_marker" >> /dev/kmsg
> > else
> > local dmesg_marker=""
> > CHECK_DMESG=0
> 
> Still not working, alas.   I tested it via:
> 
> root@xfstests-ltm:~# echo "<6>testing 123" >> /dev/kmsg 
> 
> It still didn't appear in /var/log/kern.log
> 
>   - Ted

Ok, well that's my due diligence, so applied. Thanks!


Re: [PATCH blktests] Add use of logger so that syslog files show when each test starts

2018-11-26 Thread Omar Sandoval
On Mon, Nov 26, 2018 at 01:32:11PM -0500, Theodore Y. Ts'o wrote:
> On Mon, Nov 26, 2018 at 09:50:57AM -0800, Omar Sandoval wrote:
> > 
> > Hey, Ted, sorry, I meant to ask you about this in person at LPC but
> > forgot to. Forgive my ignorance about syslog, but does syslog not pick
> > up the line we write to dmesg?
> 
> Unfortunately, it does not.
> 
> root@xfstests-ltm:~# echo foo testing >> /dev/kmsg 
> root@xfstests-ltm:~# dmesg | tail
> [7.103899] random: crng init done
> [7.104759] random: 7 urandom warning(s) missed due to ratelimiting
> [7.154525] EDAC MC: Ver: 3.0.0
> [7.221973] EDAC sbridge: Seeking for: PCI ID 8086:2fa0
> [7.221977] EDAC sbridge:  Ver: 1.1.1 
> [7.259154] device-mapper: uevent: version 1.0.3
> [7.267252] device-mapper: ioctl: 4.35.0-ioctl (2016-06-23) initialised: dm-de...@redhat.com
> [7.762306] EXT4-fs (sda1): resizing filesystem from 2620928 to 13106683 blocks
> [8.541258] EXT4-fs (sda1): resized filesystem to 13106683
> [840160.646340] foo testing
> root@xfstests-ltm:~# tail /var/log/kern.log
> root@xfstests-ltm:~# grep kern.log /etc/rsyslog.conf 
> kern.*-/var/log/kern.log
> 
> So as you can see nothing written to kern.log, even though we wrote to
> /dev/kmsg.
> 
> To prove that kern.log will get kernel messages:
> 
> root@xfstests-ltm:~# mke2fs -t ext4 -Fq /tmp/foo.img 8M
> root@xfstests-ltm:~# mount -t ext4 /tmp/foo.img /mnt
> root@xfstests-ltm:~# echo testing > /sys/fs/ext4/loop0/trigger_fs_error 
> root@xfstests-ltm:~# tail /var/log/kern.log
> Nov 26 13:30:22 xfstests-ltm kernel: [840286.728296] loop: module loaded
> Nov 26 13:30:22 xfstests-ltm kernel: [840286.752181] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
> Nov 26 13:30:41 xfstests-ltm kernel: [840305.773418] EXT4-fs error (device loop0): trigger_test_error:123: comm bash: testing
> root@xfstests-ltm:~# 

Hm, what if we output it as KERN_INFO?

diff --git a/check b/check
index f6c3537..9b4765f 100755
--- a/check
+++ b/check
@@ -314,7 +314,7 @@ _call_test() {
 
if [[ -w /dev/kmsg ]]; then
local dmesg_marker="run blktests $TEST_NAME at ${TEST_RUN["date"]}"
-   echo "$dmesg_marker" >> /dev/kmsg
+   echo "<6>$dmesg_marker" >> /dev/kmsg
else
local dmesg_marker=""
CHECK_DMESG=0


Re: [PATCH blktests] Add use of logger so that syslog files show when each test starts

2018-11-26 Thread Omar Sandoval
On Thu, Nov 22, 2018 at 08:02:21PM -0500, Theodore Y. Ts'o wrote:
> Ping?
> 
>   - Ted
>   
> On Mon, Oct 29, 2018 at 12:15:57PM -0400, Theodore Ts'o wrote:
> > Signed-off-by: Theodore Ts'o 
> > ---
> >  check | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/check b/check
> > index f6c3537..ebd87c0 100755
> > --- a/check
> > +++ b/check
> > @@ -319,6 +319,7 @@ _call_test() {
> > local dmesg_marker=""
> > CHECK_DMESG=0
> > fi
> > +   $LOGGER_PROG "run blktests $TEST_NAME"
> >  
> > trap _cleanup EXIT
> > if ! TMPDIR="$(mktemp --tmpdir -p "$OUTPUT" -d "tmpdir.${TEST_NAME//\//.}.XXX")"; then
> > @@ -578,6 +579,8 @@ fi
> >  eval set -- "$TEMP"
> >  unset TEMP
> >  
> > +LOGGER_PROG="$(type -P logger)" || LOGGER_PROG=true
> > +
> >  if [[ -r config ]]; then
> > # shellcheck disable=SC1091
> > . config
> > -- 
> > 2.18.0.rc0
> > 

Hey, Ted, sorry, I meant to ask you about this in person at LPC but
forgot to. Forgive my ignorance about syslog, but does syslog not pick
up the line we write to dmesg?


Re: [PATCH 6/6] mmc: stop abusing the request queue_lock pointer

2018-11-16 Thread Omar Sandoval
On Fri, Nov 16, 2018 at 09:10:06AM +0100, Christoph Hellwig wrote:
> Replace the lock in mmc_blk_data that is only used through a pointer
> in struct mmc_queue and to protect fields in that structure with
> an actual lock in struct mmc_queue.

Looks sane to me, but I'll let the mmc people ack.

> Suggested-by: Ulf Hansson 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/mmc/core/block.c | 24 +++-
>  drivers/mmc/core/queue.c | 31 +++
>  drivers/mmc/core/queue.h |  4 ++--
>  3 files changed, 28 insertions(+), 31 deletions(-)
> 
> diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
> index 70ec465beb69..2c329a3e3fdb 100644
> --- a/drivers/mmc/core/block.c
> +++ b/drivers/mmc/core/block.c
> @@ -100,7 +100,6 @@ static DEFINE_IDA(mmc_rpmb_ida);
>   * There is one mmc_blk_data per slot.
>   */
>  struct mmc_blk_data {
> - spinlock_t  lock;
>   struct device   *parent;
>   struct gendisk  *disk;
>   struct mmc_queue queue;
> @@ -1483,7 +1482,7 @@ static void mmc_blk_cqe_complete_rq(struct mmc_queue *mq, struct request *req)
>   blk_mq_end_request(req, BLK_STS_OK);
>   }
>  
> - spin_lock_irqsave(mq->lock, flags);
> + spin_lock_irqsave(&mq->lock, flags);
>  
>   mq->in_flight[mmc_issue_type(mq, req)] -= 1;
>  
> @@ -1491,7 +1490,7 @@ static void mmc_blk_cqe_complete_rq(struct mmc_queue *mq, struct request *req)
>  
>   mmc_cqe_check_busy(mq);
>  
> - spin_unlock_irqrestore(mq->lock, flags);
> + spin_unlock_irqrestore(&mq->lock, flags);
>  
>   if (!mq->cqe_busy)
>   blk_mq_run_hw_queues(q, true);
> @@ -1991,13 +1990,13 @@ static void mmc_blk_mq_dec_in_flight(struct mmc_queue *mq, struct request *req)
>   unsigned long flags;
>   bool put_card;
>  
> - spin_lock_irqsave(mq->lock, flags);
> + spin_lock_irqsave(&mq->lock, flags);
>  
>   mq->in_flight[mmc_issue_type(mq, req)] -= 1;
>  
>   put_card = (mmc_tot_in_flight(mq) == 0);
>  
> - spin_unlock_irqrestore(mq->lock, flags);
> + spin_unlock_irqrestore(&mq->lock, flags);
>  
>   if (put_card)
>   mmc_put_card(mq->card, >ctx);
> @@ -2093,11 +2092,11 @@ static void mmc_blk_mq_req_done(struct mmc_request *mrq)
>* request does not need to wait (although it does need to
>* complete complete_req first).
>*/
> - spin_lock_irqsave(mq->lock, flags);
> + spin_lock_irqsave(&mq->lock, flags);
>   mq->complete_req = req;
>   mq->rw_wait = false;
>   waiting = mq->waiting;
> - spin_unlock_irqrestore(mq->lock, flags);
> + spin_unlock_irqrestore(&mq->lock, flags);
>  
>   /*
>* If 'waiting' then the waiting task will complete this
> @@ -2116,10 +2115,10 @@ static void mmc_blk_mq_req_done(struct mmc_request *mrq)
>   /* Take the recovery path for errors or urgent background operations */
>   if (mmc_blk_rq_error(>brq) ||
>   mmc_blk_urgent_bkops_needed(mq, mqrq)) {
> - spin_lock_irqsave(mq->lock, flags);
> + spin_lock_irqsave(&mq->lock, flags);
>   mq->recovery_needed = true;
>   mq->recovery_req = req;
> - spin_unlock_irqrestore(mq->lock, flags);
> + spin_unlock_irqrestore(&mq->lock, flags);
>   wake_up(&mq->wait);
>   schedule_work(&mq->recovery_work);
>   return;
> @@ -2142,7 +2141,7 @@ static bool mmc_blk_rw_wait_cond(struct mmc_queue *mq, int *err)
>* Wait while there is another request in progress, but not if recovery
>* is needed. Also indicate whether there is a request waiting to start.
>*/
> - spin_lock_irqsave(mq->lock, flags);
> + spin_lock_irqsave(&mq->lock, flags);
>   if (mq->recovery_needed) {
>   *err = -EBUSY;
>   done = true;
> @@ -2150,7 +2149,7 @@ static bool mmc_blk_rw_wait_cond(struct mmc_queue *mq, int *err)
>   done = !mq->rw_wait;
>   }
>   mq->waiting = !done;
> - spin_unlock_irqrestore(mq->lock, flags);
> + spin_unlock_irqrestore(&mq->lock, flags);
>  
>   return done;
>  }
> @@ -2327,12 +2326,11 @@ static struct mmc_blk_data *mmc_blk_alloc_req(struct mmc_card *card,
>   goto err_kfree;
>   }
>  
> - spin_lock_init(&md->lock);
>   INIT_LIST_HEAD(&md->part);
>   INIT_LIST_HEAD(&md->rpmbs);
>   md->usage = 1;
>  
> - ret = mmc_init_queue(&md->queue, card, &md->lock);
> + ret = mmc_init_queue(&md->queue, card);
>   if (ret)
>   goto err_putdisk;
>  
> diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
> index 4485cf12218c..35cc138b096d 100644
> --- a/drivers/mmc/core/queue.c
> +++ b/drivers/mmc/core/queue.c
> @@ -89,9 +89,9 @@ void mmc_cqe_recovery_notifier(struct mmc_request *mrq)
>   struct mmc_queue *mq = q->queuedata;
>   unsigned long flags;
>  
> - 

Re: [PATCH] block: fix 32 bit overflow in __blkdev_issue_discard()

2018-11-16 Thread Omar Sandoval
On Fri, Nov 16, 2018 at 12:32:33AM -0800, Christoph Hellwig wrote:
> On Fri, Nov 16, 2018 at 03:04:57PM +1100, Dave Chinner wrote:
> > They don't run on my test machines because they require a modular
> > kernel and I run a monolithic kernel specified externally by the
> > qemu command line on all my test VMs.
> > 
> > generic/349 [not run] scsi_debug module not found
> > generic/350 [not run] scsi_debug module not found
> > generic/351 [not run] scsi_debug module not found
> 
> Same here, btw.  Any test that requires modules is a rather bad idea.

I'll plug my vm.py script that supports running a kernel build with
modules without installing them into the VM (by mounting the modules
over 9p): https://github.com/osandov/osandov-linux#vm-setup


Re: [PATCH 5/6] ide: don't acquire queue_lock in ide_complete_pm_rq

2018-11-16 Thread Omar Sandoval
On Fri, Nov 16, 2018 at 09:40:04AM +0100, Christoph Hellwig wrote:
> On Fri, Nov 16, 2018 at 12:37:32AM -0800, Omar Sandoval wrote:
> > On Fri, Nov 16, 2018 at 09:10:05AM +0100, Christoph Hellwig wrote:
> > > blk_mq_stop_hw_queues doesn't need any locking, and the ide
> > > dev_flags field isn't protected by it either.
> > 
> > Is it a bug that dev_flags is no longer protected by queue_lock after
> > the mq conversion?
> 
> Most if not all users weren't under queue_lock before.  As far as I can
> tell it generally is set in the probe path, and without any obvious
> locking.

Good enough for me, you can add

Reviewed-by: Omar Sandoval 


Re: [PATCH 5/6] ide: don't acquire queue_lock in ide_complete_pm_rq

2018-11-16 Thread Omar Sandoval
On Fri, Nov 16, 2018 at 09:10:05AM +0100, Christoph Hellwig wrote:
> blk_mq_stop_hw_queues doesn't need any locking, and the ide
> dev_flags field isn't protected by it either.

Is it a bug that dev_flags is no longer protected by queue_lock after
the mq conversion?

> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/ide/ide-pm.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/ide/ide-pm.c b/drivers/ide/ide-pm.c
> index 56690f523100..192e6c65d34e 100644
> --- a/drivers/ide/ide-pm.c
> +++ b/drivers/ide/ide-pm.c
> @@ -201,7 +201,6 @@ void ide_complete_pm_rq(ide_drive_t *drive, struct request *rq)
>  {
>   struct request_queue *q = drive->queue;
>   struct ide_pm_state *pm = ide_req(rq)->special;
> - unsigned long flags;
>  
>   ide_complete_power_step(drive, rq);
>   if (pm->pm_step != IDE_PM_COMPLETED)
> @@ -211,12 +210,10 @@ void ide_complete_pm_rq(ide_drive_t *drive, struct request *rq)
>   printk("%s: completing PM request, %s\n", drive->name,
>  (ide_req(rq)->type == ATA_PRIV_PM_SUSPEND) ? "suspend" : 
> "resume");
>  #endif
> - spin_lock_irqsave(&q->queue_lock, flags);
>   if (ide_req(rq)->type == ATA_PRIV_PM_SUSPEND)
>   blk_mq_stop_hw_queues(q);
>   else
>   drive->dev_flags &= ~IDE_DFLAG_BLOCKED;
> - spin_unlock_irqrestore(&q->queue_lock, flags);
>  
>   drive->hwif->rq = NULL;
>  
> -- 
> 2.19.1
> 


Re: [PATCH 4/6] ide: don't acquire queue lock in ide_pm_execute_rq

2018-11-16 Thread Omar Sandoval
On Fri, Nov 16, 2018 at 09:10:04AM +0100, Christoph Hellwig wrote:
> There is nothing we can synchronize against over a call to
> blk_queue_dying.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/ide/ide-pm.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/ide/ide-pm.c b/drivers/ide/ide-pm.c
> index 51fe10ac02fa..56690f523100 100644
> --- a/drivers/ide/ide-pm.c
> +++ b/drivers/ide/ide-pm.c
> @@ -44,15 +44,12 @@ static int ide_pm_execute_rq(struct request *rq)
>  {
>   struct request_queue *q = rq->q;
>  
> - spin_lock_irq(&q->queue_lock);
>   if (unlikely(blk_queue_dying(q))) {
>   rq->rq_flags |= RQF_QUIET;
>   scsi_req(rq)->result = -ENXIO;
> - spin_unlock_irq(&q->queue_lock);
>   blk_mq_end_request(rq, BLK_STS_OK);
>   return -ENXIO;
>   }
> - spin_unlock_irq(>queue_lock);
>   blk_execute_rq(q, NULL, rq, true);
>  
>   return scsi_req(rq)->result ? -EIO : 0;
> -- 
> 2.19.1
> 


Re: [PATCH 3/6] pktcdvd: remove queue_lock around blk_queue_max_hw_sectors

2018-11-16 Thread Omar Sandoval
On Fri, Nov 16, 2018 at 09:10:03AM +0100, Christoph Hellwig wrote:
> blk_queue_max_hw_sectors can't do anything with queue_lock protection
> so don't hold it.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/block/pktcdvd.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
> index 4adf4c8861cd..f5a71023f76c 100644
> --- a/drivers/block/pktcdvd.c
> +++ b/drivers/block/pktcdvd.c
> @@ -2203,9 +2203,7 @@ static int pkt_open_dev(struct pktcdvd_device *pd, 
> fmode_t write)
>* Some CDRW drives can not handle writes larger than one 
> packet,
>* even if the size is a multiple of the packet size.
>*/
> - spin_lock_irq(&q->queue_lock);
>   blk_queue_max_hw_sectors(q, pd->settings.size);
> - spin_unlock_irq(&q->queue_lock);
>   set_bit(PACKET_WRITABLE, &pd->flags);
>   } else {
>   pkt_set_speed(pd, MAX_SPEED, MAX_SPEED);
> -- 
> 2.19.1
> 


Re: [PATCH 2/6] floppy: remove queue_lock around floppy_end_request

2018-11-16 Thread Omar Sandoval
On Fri, Nov 16, 2018 at 09:10:02AM +0100, Christoph Hellwig wrote:
> There is nothing the queue_lock could protect inside floppy_end_request,
> so remove it.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/block/floppy.c | 5 -
>  1 file changed, 5 deletions(-)
> 
> diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
> index eeb4be8d000b..218099dd8e44 100644
> --- a/drivers/block/floppy.c
> +++ b/drivers/block/floppy.c
> @@ -2254,10 +2254,7 @@ static void request_done(int uptodate)
>   if (block > _floppy->sect)
>   DRS->maxtrack = 1;
>  
> - /* unlock chained buffers */
> - spin_lock_irqsave(&q->queue_lock, flags);
>   floppy_end_request(req, 0);
> - spin_unlock_irqrestore(&q->queue_lock, flags);
>   } else {
>   if (rq_data_dir(req) == WRITE) {
>   /* record write error information */
> @@ -2269,9 +2266,7 @@ static void request_done(int uptodate)
>   DRWE->last_error_sector = blk_rq_pos(req);
>   DRWE->last_error_generation = DRS->generation;
>   }
> - spin_lock_irqsave(&q->queue_lock, flags);
>   floppy_end_request(req, BLK_STS_IOERR);
> - spin_unlock_irqrestore(&q->queue_lock, flags);
>   }
>  }
>  
> -- 
> 2.19.1
> 


Re: [PATCH 1/6] block: remove the rq_alloc_data request_queue field

2018-11-16 Thread Omar Sandoval
On Fri, Nov 16, 2018 at 09:10:01AM +0100, Christoph Hellwig wrote:

Reviewed-by: Omar Sandoval 

> Signed-off-by: Christoph Hellwig 
> ---
>  include/linux/blkdev.h | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 1d185f1fc333..5c5ef461845f 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -567,7 +567,6 @@ struct request_queue {
>   boolmq_sysfs_init_done;
>  
>   size_t  cmd_size;
> - void*rq_alloc_data;
>  
>   struct work_struct  release_work;
>  
> -- 
> 2.19.1
> 


Re: "kyber: add tracepoints" causes write beyond size of object

2018-11-14 Thread Omar Sandoval
On Wed, Nov 14, 2018 at 05:23:06PM -0600, Kees Cook wrote:
> On Sat, Nov 10, 2018 at 8:15 AM, Jordan Glover
>  wrote:
> > Hello,
> >
> > Commit 6c3b7af1c975b87b86dcb2af233d1ae21eb05107 ("kyber: add 
> > tracepoints")[1] causes write beyond size of object. This was detected by 
> > "FORTIFY_SOURCE intra-object overflow checking"[2] feature which is part of 
> > linux-hardened out-of-tree patchset designed to catch such errors.
> >
> > The specific error is:
> >
> > In file included from ./include/linux/bitmap.h:9,
> >  from ./include/linux/cpumask.h:12,
> >  from ./arch/x86/include/asm/cpumask.h:5,
> >  from ./arch/x86/include/asm/msr.h:11,
> >  from ./arch/x86/include/asm/processor.h:21,
> >  from ./arch/x86/include/asm/cpufeature.h:8,
> >  from ./arch/x86/include/asm/thread_info.h:53,
> >  from ./include/linux/thread_info.h:38,
> >  from ./arch/x86/include/asm/preempt.h:7,
> >  from ./include/linux/preempt.h:81,
> >  from ./include/linux/rcupdate.h:40,
> >  from ./include/linux/rculist.h:11,
> >  from ./include/linux/pid.h:5,
> >  from ./include/linux/sched.h:14,
> >  from ./include/linux/blkdev.h:5,
> >  from block/kyber-iosched.c:21:
> > In function ‘strlcpy’,
> > inlined from ‘perf_trace_kyber_latency’ at 
> > ./include/trace/events/kyber.h:14:1:
> > ./include/linux/string.h:310:4: error: call to ‘__write_overflow’ declared 
> > with attribute error: detected write beyond size of object passed as 1st 
> > parameter
> > __write_overflow();
> > ^~
> > In function ‘strlcpy’,
> > inlined from ‘trace_event_raw_event_kyber_latency’ at 
> > ./include/trace/events/kyber.h:14:1:
> > ./include/linux/string.h:310:4: error: call to ‘__write_overflow’ declared 
> > with attribute error: detected write beyond size of object passed as 1st 
> > parameter
> > __write_overflow();
> > ^~
> > make[1]: *** [scripts/Makefile.build:293: block/kyber-iosched.o] Error 1
> > make: *** [Makefile:1063: block] Error 2
> > make: *** Waiting for unfinished jobs
> >
> > Using 'strlcpy' function is generally not recommended[3][4].
> 
> Due to the macros, this was a little tricky to find, but it looks like
> a cut/paste typo:
> 
> #define DOMAIN_LEN  16
> #define LATENCY_TYPE_LEN8
> 
> strlcpy(__entry->domain, domain, DOMAIN_LEN);
> strlcpy(__entry->type, type, DOMAIN_LEN);
> 
> This should use strscpy() regardless, and should use sizeof(dst)
> instead of separate literals. The primary bug is using DOMAIN_LEN for
> __entry->type when it is actually LATENCY_TYPE_LEN bytes.
> 
> Can you build a patch for this? I'm happy to review.
> 
> Thanks for finding this!

Sorry, I forgot to reply to this thread, but Jens queued up a fix for
this already:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-linus&id=18e962ac0781bcb70d433de3b2a325ff792b4288
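
That fix keeps strlcpy() but sizes the copies with sizeof() (see the v2
patch in this thread). For comparison, the strscpy() variant Kees
suggested would look roughly like this; a sketch, not the applied patch:

    strscpy(__entry->domain, domain, sizeof(__entry->domain));
    strscpy(__entry->type, type, sizeof(__entry->type));

Either way the copy is bounded by the destination field rather than by a
constant that can get out of sync with it.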


Re: [PATCH v2] makefile: Add install rule

2018-11-14 Thread Omar Sandoval
On Wed, Nov 14, 2018 at 11:39:31AM -0800, Gwendal Grignou wrote:
> Add rule to install to a target directory, /usr/local/blktests by
> default.
> 
> Signed-off-by: Gwendal Grignou 

Pushed with a slightly expanded description in the README. Thanks!


Re: [PATCH blktests] makefile: Add install rule

2018-11-14 Thread Omar Sandoval
On Fri, Nov 09, 2018 at 07:15:24AM -0800, Gwendal Grignou wrote:
> Add rule to install to a target directory, /usr/local/blktests by
> default.

This seems reasonable, but I'm curious, what's your use case?

> Signed-off-by: Gwendal Grignou 
> ---
>  Makefile | 9 +
>  src/Makefile | 3 +++
>  2 files changed, 12 insertions(+)
> 
> diff --git a/Makefile b/Makefile
> index 38b8ad1..d7c2b74 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -1,9 +1,18 @@
> +prefix ?= /usr/local
> +dest = $(DESTDIR)$(prefix)/blktests
> +
>  all:
>   $(MAKE) -C src all
>  
>  clean:
>   $(MAKE) -C src clean
>  
> +install:

install should be added to the .PHONY targets.

> + install -m755 -d $(dest)
> + install check $(dest)
> + cp -R tests common $(dest)
> + $(MAKE) -C src dest=$(dest) install

The src programs need to be installed to $(dest)/src.

>  # SC2119: "Use foo "$@" if function's $1 should mean script's $1". False
>  # positives on helpers like _init_scsi_debug.
>  SHELLCHECK_EXCLUDE := SC2119
> diff --git a/src/Makefile b/src/Makefile
> index 15c1022..f0ddbb5 100644
> --- a/src/Makefile
> +++ b/src/Makefile
> @@ -20,6 +20,9 @@ all: $(TARGETS)
>  clean:
>   rm -f $(TARGETS)
>  
> +install: $(TARGETS)
> + install $(TARGETS) $(dest)
> +
>  $(C_TARGETS): %: %.c
>   $(CC) $(CPPFLAGS) $(CFLAGS) -o $@ $^

This ability to install tests could also use a mention in the README.


[PATCH v2] kyber: fix wrong strlcpy() size in trace_kyber_latency()

2018-11-12 Thread Omar Sandoval
From: Omar Sandoval 

When copying to the latency type, we should be passing LATENCY_TYPE_LEN,
not DOMAIN_LEN (this isn't a problem in practice because we only pass
"total" or "I/O"). Fix it by changing all of the strlcpy() calls to use
sizeof().

Fixes: 6c3b7af1c975 ("kyber: add tracepoints")
Reported-by: Jordan Glover 
Signed-off-by: Omar Sandoval 
---
Changes from v1:
- Change to using sizeof() instead of constants

 include/trace/events/kyber.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/trace/events/kyber.h b/include/trace/events/kyber.h
index a9834c37ac40..c0e7d24ca256 100644
--- a/include/trace/events/kyber.h
+++ b/include/trace/events/kyber.h
@@ -31,8 +31,8 @@ TRACE_EVENT(kyber_latency,
 
TP_fast_assign(
__entry->dev= 
disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
-   strlcpy(__entry->domain, domain, DOMAIN_LEN);
-   strlcpy(__entry->type, type, DOMAIN_LEN);
+   strlcpy(__entry->domain, domain, sizeof(__entry->domain));
+   strlcpy(__entry->type, type, sizeof(__entry->type));
__entry->percentile = percentile;
__entry->numerator  = numerator;
__entry->denominator= denominator;
@@ -60,7 +60,7 @@ TRACE_EVENT(kyber_adjust,
 
TP_fast_assign(
__entry->dev= 
disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
-   strlcpy(__entry->domain, domain, DOMAIN_LEN);
+   strlcpy(__entry->domain, domain, sizeof(__entry->domain));
__entry->depth  = depth;
),
 
@@ -82,7 +82,7 @@ TRACE_EVENT(kyber_throttled,
 
TP_fast_assign(
__entry->dev= 
disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
-   strlcpy(__entry->domain, domain, DOMAIN_LEN);
+   strlcpy(__entry->domain, domain, sizeof(__entry->domain));
),
 
TP_printk("%d,%d %s", MAJOR(__entry->dev), MINOR(__entry->dev),
-- 
2.19.1



Re: [PATCH] kyber: fix wrong strlcpy() size in trace_kyber_latency()

2018-11-11 Thread Omar Sandoval
On Sun, Nov 11, 2018 at 09:25:27AM -0800, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> When copying to the latency type, we should be passing LATENCY_TYPE_LEN,
> not DOMAIN_LEN. This isn't a problem in practice because we only pass
> "total" or "I/O", but let's fix it.

Oh, I forgot

Fixes: 6c3b7af1c975 ("kyber: add tracepoints")

> Reported-by: Jordan Glover 
> Signed-off-by: Omar Sandoval 
> ---
>  include/trace/events/kyber.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/trace/events/kyber.h b/include/trace/events/kyber.h
> index a9834c37ac40..7aaa298375ad 100644
> --- a/include/trace/events/kyber.h
> +++ b/include/trace/events/kyber.h
> @@ -32,7 +32,7 @@ TRACE_EVENT(kyber_latency,
>   TP_fast_assign(
>   __entry->dev= 
> disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
>   strlcpy(__entry->domain, domain, DOMAIN_LEN);
> - strlcpy(__entry->type, type, DOMAIN_LEN);
> + strlcpy(__entry->type, type, LATENCY_TYPE_LEN);
>   __entry->percentile = percentile;
>   __entry->numerator  = numerator;
>   __entry->denominator= denominator;
> -- 
> 2.19.1
> 


[PATCH] kyber: fix wrong strlcpy() size in trace_kyber_latency()

2018-11-11 Thread Omar Sandoval
From: Omar Sandoval 

When copying to the latency type, we should be passing LATENCY_TYPE_LEN,
not DOMAIN_LEN. This isn't a problem in practice because we only pass
"total" or "I/O", but let's fix it.

Reported-by: Jordan Glover 
Signed-off-by: Omar Sandoval 
---
 include/trace/events/kyber.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/trace/events/kyber.h b/include/trace/events/kyber.h
index a9834c37ac40..7aaa298375ad 100644
--- a/include/trace/events/kyber.h
+++ b/include/trace/events/kyber.h
@@ -32,7 +32,7 @@ TRACE_EVENT(kyber_latency,
TP_fast_assign(
__entry->dev= 
disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
strlcpy(__entry->domain, domain, DOMAIN_LEN);
-   strlcpy(__entry->type, type, DOMAIN_LEN);
+   strlcpy(__entry->type, type, LATENCY_TYPE_LEN);
__entry->percentile = percentile;
__entry->numerator  = numerator;
__entry->denominator= denominator;
-- 
2.19.1



Re: [PATCH] floppy: fix race condition in __floppy_read_block_0()

2018-11-09 Thread Omar Sandoval
On Fri, Nov 09, 2018 at 03:58:40PM -0700, Jens Axboe wrote:
> LKP recently reported a hang at bootup in the floppy code:
> 
> [  245.678853] INFO: task mount:580 blocked for more than 120 seconds.
> [  245.679906]   Tainted: GT 4.19.0-rc6-00172-ga9f38e1 #1
> [  245.680959] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [  245.682181] mount   D 6372   580  1 0x0004
> [  245.683023] Call Trace:
> [  245.683425]  __schedule+0x2df/0x570
> [  245.683975]  schedule+0x2d/0x80
> [  245.684476]  schedule_timeout+0x19d/0x330
> [  245.685090]  ? wait_for_common+0xa5/0x170
> [  245.685735]  wait_for_common+0xac/0x170
> [  245.686339]  ? do_sched_yield+0x90/0x90
> [  245.686935]  wait_for_completion+0x12/0x20
> [  245.687571]  __floppy_read_block_0+0xfb/0x150
> [  245.688244]  ? floppy_resume+0x40/0x40
> [  245.688844]  floppy_revalidate+0x20f/0x240
> [  245.689486]  check_disk_change+0x43/0x60
> [  245.690087]  floppy_open+0x1ea/0x360
> [  245.690653]  __blkdev_get+0xb4/0x4d0
> [  245.691212]  ? blkdev_get+0x1db/0x370
> [  245.691777]  blkdev_get+0x1f3/0x370
> [  245.692351]  ? path_put+0x15/0x20
> [  245.692871]  ? lookup_bdev+0x4b/0x90
> [  245.693539]  blkdev_get_by_path+0x3d/0x80
> [  245.694165]  mount_bdev+0x2a/0x190
> [  245.694695]  squashfs_mount+0x10/0x20
> [  245.695271]  ? squashfs_alloc_inode+0x30/0x30
> [  245.695960]  mount_fs+0xf/0x90
> [  245.696451]  vfs_kern_mount+0x43/0x130
> [  245.697036]  do_mount+0x187/0xc40
> [  245.697563]  ? memdup_user+0x28/0x50
> [  245.698124]  ksys_mount+0x60/0xc0
> [  245.698639]  sys_mount+0x19/0x20
> [  245.699167]  do_int80_syscall_32+0x61/0x130
> [  245.699813]  entry_INT80_32+0xc7/0xc7
> 
> showing that we never complete that read request. The reason is that
> the completion setup is racy - it initializes the completion event
> AFTER submitting the IO, which means that the IO could complete
> before/during the init. If it does, we are passing garbage to
> complete() and we may sleep forever waiting for the event to
> occur.
> 
> Fixes: 7b7b68bba5ef ("floppy: bail out in open() if drive is not responding 
> to block0 read")

Reviewed-by: Omar Sandoval 

> Signed-off-by: Jens Axboe 
> ---
>  drivers/block/floppy.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
> index a8cfa011c284..fb23578e9a41 100644
> --- a/drivers/block/floppy.c
> +++ b/drivers/block/floppy.c
> @@ -4148,10 +4148,11 @@ static int __floppy_read_block_0(struct block_device 
> *bdev, int drive)
>   bio.bi_end_io = floppy_rb0_cb;
>   bio_set_op_attrs(&bio, REQ_OP_READ, 0);
>  
> + init_completion(&cbdata.complete);
> +
>   submit_bio(&bio);
>   process_fd_request();
>  
> - init_completion(&cbdata.complete);
>   wait_for_completion(&cbdata.complete);
>  
>   __free_page(page);
> -- 
> 2.17.1
> 
> -- 
> Jens Axboe
> 


Re: [PATCH 0/2] blktests: New loop tests

2018-10-25 Thread Omar Sandoval
On Thu, Oct 18, 2018 at 12:31:45PM +0200, Jan Kara wrote:
> 
> Hello,
> 
> these two patches create two new tests for blktests as regression tests
> for my recently posted loopback device fixes. More details in individual
> patches.

Thanks, Jan, I applied 007, renamed to 006.


Re: [PATCH v2] blktest: remove instances of null_blk queue_mode=1

2018-10-25 Thread Omar Sandoval
On Thu, Oct 25, 2018 at 03:03:30PM -0600, Jens Axboe wrote:
> This is no longer supported in recent kernels, get rid of
> any testing of queue_mode=1. queue_mode=1 tested the legacy
> IO path, which is going away completely. As such, there's
> no point in doing anymore testing with it.
> 
> Signed-off-by: Jens Axboe 

Thanks, applied this and squashed it with the one removing block/022.


Re: [PATCH 1/2] loop/006: Add test for setting partscan flag

2018-10-23 Thread Omar Sandoval
On Tue, Oct 23, 2018 at 12:05:12PM +0200, Jan Kara wrote:
> On Mon 22-10-18 15:52:55, Omar Sandoval wrote:
> > On Thu, Oct 18, 2018 at 12:31:46PM +0200, Jan Kara wrote:
> > > Add test for setting partscan flag.
> > > 
> > > Signed-off-by: Jan Kara 
> > 
> > Sorry I didn't notice this earlier, but loop/001 already does a
> > partition rescan (via losetup -P). Does that cover this test case?
> 
> Yes I know. But the partition rescanning on device creation has been
> handled properly while partition rescanning as a result of LOOP_SET_STATUS
> was buggy. That's why I've added this test.

At least here, losetup -P does a LOOP_SET_STATUS:

$ sudo strace -e ioctl losetup -f --show -P test.img
ioctl(3, LOOP_CTL_GET_FREE) = 0
ioctl(4, LOOP_SET_FD, 3)= 0
ioctl(4, LOOP_SET_STATUS64, {lo_offset=0, lo_number=0, 
lo_flags=LO_FLAGS_PARTSCAN, lo_file_name="/home/osandov/test.img", ...}) = 0
/dev/loop0
+++ exited with 0 +++
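
For reference, the case Jan describes, LO_FLAGS_PARTSCAN set via
LOOP_SET_STATUS64 on a device that is already bound, boils down to
roughly the following. This is only a sketch with a hypothetical helper
name; the real test source is in the original patch:

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/loop.h>

    /* fd refers to an already-bound loop device */
    static int request_partscan(int fd)
    {
            struct loop_info64 info;

            memset(&info, 0, sizeof(info));
            info.lo_flags = LO_FLAGS_PARTSCAN;
            return ioctl(fd, LOOP_SET_STATUS64, &info);
    }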


Re: [PATCH blktests] common/rc: allow the loop driver to be built into the kernel

2018-10-23 Thread Omar Sandoval
On Tue, Oct 23, 2018 at 08:56:24AM -0400, Theodore Y. Ts'o wrote:
> On Mon, Oct 22, 2018 at 04:08:30PM -0700, Omar Sandoval wrote:
> > On Mon, Oct 22, 2018 at 06:50:04AM -0400, Theodore Ts'o wrote:
> > > A number of kernel modules used by blktests must be compiled as
> > > modules, since the module needs to be loaded with specific options, or
> > > part of the test is to exercise what happens when the kernel
> > > module is loaded.  This is not true for the loop driver, so add a new
> > > bash function, _have_kernel_module which works like _have_module but
> > > will not fail if the driver is compiled directly into the kernel.
> > 
> > `modprobe loop` works for me if the module is built in, are you using
> > one from busybox or something? According to strace, it looks at the
> > depmod information (namely, /lib/modules/$(uname -r)/modules.builtin.bin).
> 
> Ah, you're right.  When I was first trying to use blktests, I was
> integrating it into my xfstests test appliance, and normally I build a
> completely module-free kernel.  This allows me to boot directly into a
> kernel by using kvm's "--kernel /path/to/bzImage" option without
> having to deal with the extra work of trying to install modules into a
> test appliance.

FWIW, I have a VM setup that uses --kernel and a virtfs mount in the
guest to use modules without needing a manual install step:
https://github.com/osandov/osandov-linux#running-custom-kernel-builds.


Re: [PATCH blktests] common/rc: allow the loop driver to be built into the kernel

2018-10-22 Thread Omar Sandoval
On Mon, Oct 22, 2018 at 06:50:04AM -0400, Theodore Ts'o wrote:
> A number of kernel modules used by blktests must be compiled as
> modules, since the module needs to be loaded with specific options, or
> part of the test is to exercise what happens when the kernel
> module is loaded.  This is not true for the loop driver, so add a new
> bash function, _have_kernel_module which works like _have_module but
> will not fail if the driver is compiled directly into the kernel.

`modprobe loop` works for me if the module is built in, are you using
one from busybox or something? According to strace, it looks at the
depmod information (namely, /lib/modules/$(uname -r)/modules.builtin.bin).


Re: [PATCH 2/2] loop/007: Add test for oops during backing file verification

2018-10-22 Thread Omar Sandoval
On Thu, Oct 18, 2018 at 12:31:47PM +0200, Jan Kara wrote:
> Add regression test for patch "block/loop: Use global lock for ioctl()
> operation." where we can oops while traversing list of loop devices
> backing newly created device.
> 
> Signed-off-by: Jan Kara 

Looks good, sans a missing addition to src/.gitignore. I can fix that
and apply this once I hear back regarding the other test.


Re: [PATCH 1/2] loop/006: Add test for setting partscan flag

2018-10-22 Thread Omar Sandoval
On Thu, Oct 18, 2018 at 12:31:46PM +0200, Jan Kara wrote:
> Add test for setting partscan flag.
> 
> Signed-off-by: Jan Kara 

Sorry I didn't notice this earlier, but loop/001 already does a
partition rescan (via losetup -P). Does that cover this test case?

> ---
>  src/Makefile   |  3 ++-
>  src/loop_set_status_partscan.c | 45 
> ++
>  tests/loop/006 | 33 +++
>  tests/loop/006.out |  2 ++
>  4 files changed, 82 insertions(+), 1 deletion(-)
>  create mode 100644 src/loop_set_status_partscan.c
>  create mode 100755 tests/loop/006
>  create mode 100644 tests/loop/006.out
> 
> diff --git a/src/Makefile b/src/Makefile
> index f89f61701179..6dadcbec8beb 100644
> --- a/src/Makefile
> +++ b/src/Makefile
> @@ -4,7 +4,8 @@ C_TARGETS := \
>   openclose \
>   sg/dxfer-from-dev \
>   sg/syzkaller1 \
> - nbdsetsize
> + nbdsetsize \
> + loop_set_status_partscan
>  
>  CXX_TARGETS := \
>   discontiguous-io
> diff --git a/src/loop_set_status_partscan.c b/src/loop_set_status_partscan.c
> new file mode 100644
> index ..8873a12e4334
> --- /dev/null
> +++ b/src/loop_set_status_partscan.c
> @@ -0,0 +1,45 @@
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +void usage(const char *progname)
> +{
> + fprintf(stderr, "usage: %s PATH\n", progname);
> + exit(EXIT_FAILURE);
> +}
> +
> +int main(int argc, char **argv)
> +{
> + int ret;
> + int fd;
> + struct loop_info64 info;
> +
> + if (argc != 2)
> + usage(argv[0]);
> +
> + fd = open(argv[1], O_RDONLY);
> + if (fd == -1) {
> + perror("open");
> + return EXIT_FAILURE;
> + }
> +
> + memset(&info, 0, sizeof(info));
> + info.lo_flags = LO_FLAGS_PARTSCAN;
> + memcpy(info.lo_file_name, "part", 5);

What's the significance of this file name?

> + ret = ioctl(fd, LOOP_SET_STATUS64, &info);
> + if (ret == -1) {
> + perror("ioctl");
> + close(fd);
> + return EXIT_FAILURE;
> + }
> + close(fd);
> + return EXIT_SUCCESS;
> +}

[snip]


Re: [PATCH 0/15 v2] loop: Fix oops and possible deadlocks

2018-10-16 Thread Omar Sandoval
On Tue, Oct 16, 2018 at 01:36:54PM +0200, Jan Kara wrote:
> On Wed 10-10-18 14:28:09, Jan Kara wrote:
> > On Wed 10-10-18 13:42:27, Johannes Thumshirn wrote:
> > > On Wed, Oct 10, 2018 at 07:19:00PM +0900, Tetsuo Handa wrote:
> > > > On 2018/10/10 19:04, Jan Kara wrote:
> > > > > Hi,
> > > > > 
> > > > > this patch series fixes oops and possible deadlocks as reported by 
> > > > > syzbot [1]
> > > > > [2]. The second patch in the series (from Tetsuo) fixes the oops, the 
> > > > > remaining
> > > > > patches are cleaning up the locking in the loop driver so that we can 
> > > > > in the
> > > > > end reasonably easily switch to rereading partitions without holding 
> > > > > mutex
> > > > > protecting the loop device.
> > > > > 
> > > > > I have lightly tested the patches by creating, deleting, and 
> > > > > modifying loop
> > > > > devices but if there's some more comprehensive loopback device 
> > > > > testsuite, I
> > > > > can try running it. Review is welcome!
> > > > 
> > > > Testing on linux-next by syzbot will be the most comprehensive. ;-)
> > > 
> > > Apart from that blktests has a loop category and I think it could also be
> > > worthwhile to add the C reproducer from syzkaller to blktests.
> > 
> > Yeah, I did run loop tests now and they ran fine. I can try converting the
> > syzbot reproducers into something legible but it will take a while.
> 
> So I took a stab at this. But I hit two issues:
> 
> 1) For the reproducer triggering the lockdep warning, you need a 32-bit
> binary (so that it uses compat_ioctl). I don't think we want to introduce
> 32-bit devel environment dependency to blktests. With 64-bits, the problem
> is also there but someone noticed and silenced lockdep (with a reason that
> I consider incorrect)... I think the test is still worth it though as
> I'll remove the lockdep-fooling code in my patches and thus new breakage
> will be noticed.

Agreed, even if it doesn't trigger lockdep now, it's a good regression
test.

> 2) For the oops (use-after-free) issue I was not able to reproduce that in
> my test KVM in a couple of hours. The race window is rather narrow and syzbot
> with KASAN and everything hit it only 11 times. So I'm not sure how useful
> that test is. Any opinions?

I'd say we should add it anyways. If anything, it's a smoke test for
changing fds on a loop device. You could add a note that the race it's
testing for is very narrow.
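
For what it's worth, the ioctl such a smoke test presumably exercises is
LOOP_CHANGE_FD, which swaps the backing file of an already-bound loop
device (and only succeeds if the device is read-only). A minimal sketch
with a hypothetical helper, not the actual reproducer:

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/loop.h>

    /* Point an already-bound, read-only loop device at a new backing
     * file; the new file's fd is passed directly as the ioctl argument. */
    static int loop_change_backing_file(int loop_fd, const char *path)
    {
            int file_fd = open(path, O_RDONLY);

            if (file_fd < 0)
                    return -1;
            return ioctl(loop_fd, LOOP_CHANGE_FD, file_fd);
    }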


Re: [PATCH v2 00/11] Convert floppy drivers to blk-mq

2018-10-15 Thread Omar Sandoval
On Sun, Oct 14, 2018 at 12:12:24PM +1100, Finn Thain wrote:
> On Thu, 11 Oct 2018, Omar Sandoval wrote:
> 
> > From: Omar Sandoval 
> > 
> > Hi,
> > 
> > This series converts the various floppy drivers to blk-mq. Save for the
> > last one (floppy), they're compile-tested only. If I've Cc'd you, it's
> > because I think you might be able to test the changes. Please test if
> > you can, or let me know if there's a way to use QEMU/some other emulator
> > to test. The full series is available at [1]. Thanks!
> > 
> 
> I built your mq-conversions branch (425e985d1937) and asked Stan to test 
> this on his Centris 650. There are no regressions in the swim driver.

Thanks for testing :) I'm pretty impressed that it's still possible to
run Linux on those at all.


[PATCH v2 07/11] amiflop: convert to blk-mq

2018-10-11 Thread Omar Sandoval
From: Omar Sandoval 

Straightforward conversion, just use the existing amiflop_lock to
serialize access to the controller. Compile-tested only.

Cc: Laurent Vivier 
Signed-off-by: Omar Sandoval 
---
 drivers/block/amiflop.c | 125 +++-
 1 file changed, 46 insertions(+), 79 deletions(-)

diff --git a/drivers/block/amiflop.c b/drivers/block/amiflop.c
index eef3b085e70a..1225b802f632 100644
--- a/drivers/block/amiflop.c
+++ b/drivers/block/amiflop.c
@@ -201,6 +201,7 @@ struct amiga_floppy_struct {
int dirty;  /* true when trackbuf is not on disk */
int status; /* current error code for unit */
struct gendisk *gendisk;
+   struct blk_mq_tag_set tag_set;
 };
 
 /*
@@ -281,7 +282,6 @@ static volatile int selected = -1;  /* currently selected 
drive */
 static int writepending;
 static int writefromint;
 static char *raw_buf;
-static int fdc_queue;
 
 static DEFINE_SPINLOCK(amiflop_lock);
 
@@ -1454,76 +1454,20 @@ static int get_track(int drive, int track)
return -1;
 }
 
-/*
- * Round-robin between our available drives, doing one request from each
- */
-static struct request *set_next_request(void)
-{
-   struct request_queue *q;
-   int cnt = FD_MAX_UNITS;
-   struct request *rq = NULL;
-
-   /* Find next queue we can dispatch from */
-   fdc_queue = fdc_queue + 1;
-   if (fdc_queue == FD_MAX_UNITS)
-   fdc_queue = 0;
-
-   for(cnt = FD_MAX_UNITS; cnt > 0; cnt--) {
-
-   if (unit[fdc_queue].type->code == FD_NODRIVE) {
-   if (++fdc_queue == FD_MAX_UNITS)
-   fdc_queue = 0;
-   continue;
-   }
-
-   q = unit[fdc_queue].gendisk->queue;
-   if (q) {
-   rq = blk_fetch_request(q);
-   if (rq)
-   break;
-   }
-
-   if (++fdc_queue == FD_MAX_UNITS)
-   fdc_queue = 0;
-   }
-
-   return rq;
-}
-
-static void redo_fd_request(void)
+static blk_status_t amiflop_rw_cur_segment(struct amiga_floppy_struct *floppy,
+  struct request *rq)
 {
-   struct request *rq;
+   int drive = floppy - unit;
unsigned int cnt, block, track, sector;
-   int drive;
-   struct amiga_floppy_struct *floppy;
char *data;
-   unsigned long flags;
-   blk_status_t err;
-
-next_req:
-   rq = set_next_request();
-   if (!rq) {
-   /* Nothing left to do */
-   return;
-   }
-
-   floppy = rq->rq_disk->private_data;
-   drive = floppy - unit;
 
-next_segment:
-   /* Here someone could investigate to be more efficient */
-   for (cnt = 0, err = BLK_STS_OK; cnt < blk_rq_cur_sectors(rq); cnt++) {
+   for (cnt = 0; cnt < blk_rq_cur_sectors(rq); cnt++) {
 #ifdef DEBUG
printk("fd: sector %ld + %d requested for %s\n",
   blk_rq_pos(rq), cnt,
   (rq_data_dir(rq) == READ) ? "read" : "write");
 #endif
block = blk_rq_pos(rq) + cnt;
-   if ((int)block > floppy->blocks) {
-   err = BLK_STS_IOERR;
-   break;
-   }
-
track = block / (floppy->dtype->sects * 
floppy->type->sect_mult);
sector = block % (floppy->dtype->sects * 
floppy->type->sect_mult);
data = bio_data(rq->bio) + 512 * cnt;
@@ -1532,10 +1476,8 @@ static void redo_fd_request(void)
   "0x%08lx\n", track, sector, data);
 #endif
 
-   if (get_track(drive, track) == -1) {
-   err = BLK_STS_IOERR;
-   break;
-   }
+   if (get_track(drive, track) == -1)
+   return BLK_STS_IOERR;
 
if (rq_data_dir(rq) == READ) {
memcpy(data, floppy->trackbuf + sector * 512, 512);
@@ -1543,31 +1485,40 @@ static void redo_fd_request(void)
memcpy(floppy->trackbuf + sector * 512, data, 512);
 
/* keep the drive spinning while writes are scheduled */
-   if (!fd_motor_on(drive)) {
-   err = BLK_STS_IOERR;
-   break;
-   }
+   if (!fd_motor_on(drive))
+   return BLK_STS_IOERR;
/*
 * setup a callback to write the track buffer
 * after a short (1 tick) delay.
 */
-   local_irq_save(flags);
-
floppy->dirty = 1;

[PATCH v2 08/11] ataflop: fold headers into C file

2018-10-11 Thread Omar Sandoval
From: Omar Sandoval 

atafd.h and atafdreg.h are only used from ataflop.c, so merge them in
there.

Signed-off-by: Omar Sandoval 
---
 arch/m68k/include/asm/atafd.h| 13 -
 arch/m68k/include/asm/atafdreg.h | 80 --
 drivers/block/ataflop.c  | 83 +++-
 3 files changed, 81 insertions(+), 95 deletions(-)
 delete mode 100644 arch/m68k/include/asm/atafd.h
 delete mode 100644 arch/m68k/include/asm/atafdreg.h

diff --git a/arch/m68k/include/asm/atafd.h b/arch/m68k/include/asm/atafd.h
deleted file mode 100644
index ad7014cad633..
--- a/arch/m68k/include/asm/atafd.h
+++ /dev/null
@@ -1,13 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ASM_M68K_FD_H
-#define _ASM_M68K_FD_H
-
-/* Definitions for the Atari Floppy driver */
-
-struct atari_format_descr {
-int track; /* to be formatted */
-int head;  /*   "" "" */
-int sect_offset;   /* offset of first sector */
-};
-
-#endif
diff --git a/arch/m68k/include/asm/atafdreg.h b/arch/m68k/include/asm/atafdreg.h
deleted file mode 100644
index c31b4919ed2d..
--- a/arch/m68k/include/asm/atafdreg.h
+++ /dev/null
@@ -1,80 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _LINUX_FDREG_H
-#define _LINUX_FDREG_H
-
-/*
-** WD1772 stuff
- */
-
-/* register codes */
-
-#define FDCSELREG_STP   (0x80)   /* command/status register */
-#define FDCSELREG_TRA   (0x82)   /* track register */
-#define FDCSELREG_SEC   (0x84)   /* sector register */
-#define FDCSELREG_DTA   (0x86)   /* data register */
-
-/* register names for FDC_READ/WRITE macros */
-
-#define FDCREG_CMD 0
-#define FDCREG_STATUS  0
-#define FDCREG_TRACK   2
-#define FDCREG_SECTOR  4
-#define FDCREG_DATA6
-
-/* command opcodes */
-
-#define FDCCMD_RESTORE  (0x00)   /*  -   */
-#define FDCCMD_SEEK (0x10)   /*   |  */
-#define FDCCMD_STEP (0x20)   /*   |  TYP 1 Commands  */
-#define FDCCMD_STIN (0x40)   /*   |  */
-#define FDCCMD_STOT (0x60)   /*  -   */
-#define FDCCMD_RDSEC(0x80)   /*  -   TYP 2 Commands  */
-#define FDCCMD_WRSEC(0xa0)   /*  -  "*/
-#define FDCCMD_RDADR(0xc0)   /*  -   */
-#define FDCCMD_RDTRA(0xe0)   /*   |  TYP 3 Commands  */
-#define FDCCMD_WRTRA(0xf0)   /*  -   */
-#define FDCCMD_FORCI(0xd0)   /*  -   TYP 4 Command   */
-
-/* command modifier bits */
-
-#define FDCCMDADD_SR6   (0x00)   /* step rate settings */
-#define FDCCMDADD_SR12  (0x01)
-#define FDCCMDADD_SR2   (0x02)
-#define FDCCMDADD_SR3   (0x03)
-#define FDCCMDADD_V (0x04)   /* verify */
-#define FDCCMDADD_H (0x08)   /* wait for spin-up */
-#define FDCCMDADD_U (0x10)   /* update track register */
-#define FDCCMDADD_M (0x10)   /* multiple sector access */
-#define FDCCMDADD_E (0x04)   /* head settling flag */
-#define FDCCMDADD_P (0x02)   /* precompensation off */
-#define FDCCMDADD_A0(0x01)   /* DAM flag */
-
-/* status register bits */
-
-#defineFDCSTAT_MOTORON (0x80)   /* motor on */
-#defineFDCSTAT_WPROT   (0x40)   /* write protected (FDCCMD_WR*) */
-#defineFDCSTAT_SPINUP  (0x20)   /* motor speed stable (Type I) */
-#defineFDCSTAT_DELDAM  (0x20)   /* sector has deleted DAM (Type 
II+III) */
-#defineFDCSTAT_RECNF   (0x10)   /* record not found */
-#defineFDCSTAT_CRC (0x08)   /* CRC error */
-#defineFDCSTAT_TR00(0x04)   /* Track 00 flag (Type I) */
-#defineFDCSTAT_LOST(0x04)   /* Lost Data (Type II+III) */
-#defineFDCSTAT_IDX (0x02)   /* Index status (Type I) */
-#defineFDCSTAT_DRQ (0x02)   /* DRQ status (Type II+III) */
-#defineFDCSTAT_BUSY(0x01)   /* FDC is busy */
-
-
-/* PSG Port A Bit Nr 0 .. Side Sel .. 0 -> Side 1  1 -> Side 2 */
-#define DSKSIDE (0x01)
-
-#define DSKDRVNONE  (0x06)
-#define DSKDRV0 (0x02)
-#define DSKDRV1 (0x04)
-
-/* step rates */
-#defineFDCSTEP_6   0x00
-#defineFDCSTEP_12  0x01
-#defineFDCSTEP_2   0x02
-#defineFDCSTEP_3   0x03
-
-#endif
diff --git a/drivers/block/ataflop.c b/drivers/block/ataflop.c
index dfb2c2622e5a..17df631c5d85 100644
--- a/drivers/block/ataflop.c
+++ b/drivers/block/ataflop.c
@@ -71,8 +71,6 @@
 #include 
 #include 
 
-#include 
-#include 
 #include 
 #include 
 #include 
@@ -85,6 +83,87 @@ static DEFINE_MUTEX(ataflop_mutex);
 static struct request *fd_request;
 static int fdc_queue;
 
+/*
+ * WD1772 stuff
+ */
+
+/* register codes */
+
+#define FDCSELREG_STP   (0x80)   /* command/status register */
+#define FDCSELREG_TRA   (0x82)   /* track register */
+#define FDCSELREG_SEC   (0x84)   /* sector register */
+#define FDCSELREG_DTA   (0x86)   /* data register */
+
+/* register names for FDC_READ/WRITE 

[PATCH v2 09/11] ataflop: fix error handling during setup

2018-10-11 Thread Omar Sandoval
From: Omar Sandoval 

Move queue allocation next to disk allocation to fix a couple of issues:

- If add_disk() hasn't been called, we should clear disk->queue before
  calling put_disk().
- If we fail to allocate a request queue, we still need to put all of
  the disks, not just the ones that we allocated queues for.

Signed-off-by: Omar Sandoval 
---
 drivers/block/ataflop.c | 25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/drivers/block/ataflop.c b/drivers/block/ataflop.c
index 17df631c5d85..0144d598ac47 100644
--- a/drivers/block/ataflop.c
+++ b/drivers/block/ataflop.c
@@ -2014,6 +2014,11 @@ static int __init atari_floppy_init (void)
unit[i].disk = alloc_disk(1);
if (!unit[i].disk)
goto Enomem;
+
+   unit[i].disk->queue = blk_init_queue(do_fd_request,
+   &ataflop_lock);
+   if (!unit[i].disk->queue)
+   goto Enomem;
}
 
if (UseTrackbuffer < 0)
@@ -2045,10 +2050,6 @@ static int __init atari_floppy_init (void)
sprintf(unit[i].disk->disk_name, "fd%d", i);
unit[i].disk->fops = &floppy_fops;
unit[i].disk->private_data = &unit[i];
-   unit[i].disk->queue = blk_init_queue(do_fd_request,
-   &ataflop_lock);
-   if (!unit[i].disk->queue)
-   goto Enomem;
set_capacity(unit[i].disk, MAX_DISK_SIZE * 2);
add_disk(unit[i].disk);
}
@@ -2063,13 +2064,17 @@ static int __init atari_floppy_init (void)
 
return 0;
 Enomem:
-   while (i--) {
-   struct request_queue *q = unit[i].disk->queue;
+   do {
+   struct gendisk *disk = unit[i].disk;
 
-   put_disk(unit[i].disk);
-   if (q)
-   blk_cleanup_queue(q);
-   }
+   if (disk) {
+   if (disk->queue) {
+   blk_cleanup_queue(disk->queue);
+   disk->queue = NULL;
+   }
+   put_disk(unit[i].disk);
+   }
+   } while (i--);
 
unregister_blkdev(FLOPPY_MAJOR, "fd");
return -ENOMEM;
-- 
2.19.1



[PATCH v2 06/11] amiflop: clean up on errors during setup

2018-10-11 Thread Omar Sandoval
From: Omar Sandoval 

The error handling in fd_probe_drives() doesn't clean up at all. Fix it
up in preparation for converting to blk-mq. While we're here, get rid of
the commented out amiga_floppy_remove().

Signed-off-by: Omar Sandoval 
---
 drivers/block/amiflop.c | 84 -
 1 file changed, 40 insertions(+), 44 deletions(-)

diff --git a/drivers/block/amiflop.c b/drivers/block/amiflop.c
index a7d6e6a9b12f..eef3b085e70a 100644
--- a/drivers/block/amiflop.c
+++ b/drivers/block/amiflop.c
@@ -1818,11 +1818,41 @@ static const struct block_device_operations floppy_fops 
= {
.check_events   = amiga_check_events,
 };
 
+static struct gendisk *fd_alloc_disk(int drive)
+{
+   struct gendisk *disk;
+
+   disk = alloc_disk(1);
+   if (!disk)
+   goto out;
+
+   disk->queue = blk_init_queue(do_fd_request, &amiflop_lock);
+   if (IS_ERR(disk->queue)) {
+   disk->queue = NULL;
+   goto out_put_disk;
+   }
+
+   unit[drive].trackbuf = kmalloc(FLOPPY_MAX_SECTORS * 512, GFP_KERNEL);
+   if (!unit[drive].trackbuf)
+   goto out_cleanup_queue;
+
+   return disk;
+
+out_cleanup_queue:
+   blk_cleanup_queue(disk->queue);
+   disk->queue = NULL;
+out_put_disk:
+   put_disk(disk);
+out:
+   unit[drive].type->code = FD_NODRIVE;
+   return NULL;
+}
+
 static int __init fd_probe_drives(void)
 {
int drive,drives,nomem;
 
-   printk(KERN_INFO "FD: probing units\nfound ");
+   pr_info("FD: probing units\nfound");
drives=0;
nomem=0;
for(drive=0;drive<FD_MAX_UNITS;drive++) {
struct gendisk *disk;
if (unit[drive].type->code == FD_NODRIVE)
continue;
-   disk = alloc_disk(1);
+
+   disk = fd_alloc_disk(drive);
if (!disk) {
-   unit[drive].type->code = FD_NODRIVE;
+   pr_cont(" no mem for fd%d", drive);
+   nomem = 1;
continue;
}
unit[drive].gendisk = disk;
-
-   disk->queue = blk_init_queue(do_fd_request, &amiflop_lock);
-   if (!disk->queue) {
-   unit[drive].type->code = FD_NODRIVE;
-   continue;
-   }
-
drives++;
-   if ((unit[drive].trackbuf = kmalloc(FLOPPY_MAX_SECTORS * 512, 
GFP_KERNEL)) == NULL) {
-   printk("no mem for ");
-   unit[drive].type = &drive_types[num_dr_types - 1]; /* 
FD_NODRIVE */
-   drives--;
-   nomem = 1;
-   }
-   printk("fd%d ",drive);
+
+   pr_cont(" fd%d",drive);
disk->major = FLOPPY_MAJOR;
disk->first_minor = drive;
disk->fops = &floppy_fops;
@@ -1861,11 +1881,11 @@ static int __init fd_probe_drives(void)
}
if ((drives > 0) || (nomem == 0)) {
if (drives == 0)
-   printk("no drives");
-   printk("\n");
+   pr_cont(" no drives");
+   pr_cont("\n");
return drives;
}
-   printk("\n");
+   pr_cont("\n");
return -ENOMEM;
 }
  
@@ -1948,30 +1968,6 @@ static int __init amiga_floppy_probe(struct 
platform_device *pdev)
return ret;
 }
 
-#if 0 /* not safe to unload */
-static int __exit amiga_floppy_remove(struct platform_device *pdev)
-{
-   int i;
-
-   for( i = 0; i < FD_MAX_UNITS; i++) {
-   if (unit[i].type->code != FD_NODRIVE) {
-   struct request_queue *q = unit[i].gendisk->queue;
-   del_gendisk(unit[i].gendisk);
-   put_disk(unit[i].gendisk);
-   kfree(unit[i].trackbuf);
-   if (q)
-   blk_cleanup_queue(q);
-   }
-   }
-   blk_unregister_region(MKDEV(FLOPPY_MAJOR, 0), 256);
-   free_irq(IRQ_AMIGA_CIAA_TB, NULL);
-   free_irq(IRQ_AMIGA_DSKBLK, NULL);
-   custom.dmacon = DMAF_DISK; /* disable DMA */
-   amiga_chip_free(raw_buf);
-   unregister_blkdev(FLOPPY_MAJOR, "fd");
-}
-#endif
-
 static struct platform_driver amiga_floppy_driver = {
.driver   = {
.name   = "amiga-floppy",
-- 
2.19.1



[PATCH v2 10/11] ataflop: convert to blk-mq

2018-10-11 Thread Omar Sandoval
From: Omar Sandoval 

This driver is already pretty broken, in that it has two wait_events()
(one in stdma_lock()) in request_fn. Get rid of the first one by
freezing/quiescing the queue on format, and the second one by replacing
it with stdma_try_lock(). The rest is straightforward. Compile-tested
only and probably incorrect.

Cc: Laurent Vivier 
Signed-off-by: Omar Sandoval 
---
 drivers/block/ataflop.c | 183 ++--
 1 file changed, 81 insertions(+), 102 deletions(-)

diff --git a/drivers/block/ataflop.c b/drivers/block/ataflop.c
index 0144d598ac47..158d4dde6ad1 100644
--- a/drivers/block/ataflop.c
+++ b/drivers/block/ataflop.c
@@ -66,7 +66,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
@@ -81,7 +81,6 @@
 
 static DEFINE_MUTEX(ataflop_mutex);
 static struct request *fd_request;
-static int fdc_queue;
 
 /*
  * WD1772 stuff
@@ -300,6 +299,7 @@ static struct atari_floppy_struct {
struct gendisk *disk;
int ref;
int type;
+   struct blk_mq_tag_set tag_set;
 } unit[FD_MAX_UNITS];
 
 #defineUD  unit[drive]
@@ -379,9 +379,6 @@ static int IsFormatting = 0, FormatError;
 static int UserSteprate[FD_MAX_UNITS] = { -1, -1 };
 module_param_array(UserSteprate, int, NULL, 0);
 
-/* Synchronization of FDC access. */
-static volatile int fdc_busy = 0;
-static DECLARE_WAIT_QUEUE_HEAD(fdc_wait);
 static DECLARE_COMPLETION(format_wait);
 
 static unsigned long changed_floppies = 0xff, fake_change = 0;
@@ -441,7 +438,6 @@ static void fd_times_out(struct timer_list *unused);
 static void finish_fdc( void );
 static void finish_fdc_done( int dummy );
 static void setup_req_params( int drive );
-static void redo_fd_request( void);
 static int fd_locked_ioctl(struct block_device *bdev, fmode_t mode, unsigned 
int
  cmd, unsigned long param);
 static void fd_probe( int drive );
@@ -459,8 +455,11 @@ static DEFINE_TIMER(fd_timer, check_change);

 static void fd_end_request_cur(blk_status_t err)
 {
-   if (!__blk_end_request_cur(fd_request, err))
+   if (!blk_update_request(fd_request, err,
+   blk_rq_cur_bytes(fd_request))) {
+   __blk_mq_end_request(fd_request, err);
fd_request = NULL;
+   }
 }
 
 static inline void start_motor_off_timer(void)
@@ -706,7 +705,6 @@ static void fd_error( void )
if (SelectedDrive != -1)
SUD.track = -1;
}
-   redo_fd_request();
 }
 
 
@@ -724,14 +722,15 @@ static void fd_error( void )
 
 static int do_format(int drive, int type, struct atari_format_descr *desc)
 {
+   struct request_queue *q = unit[drive].disk->queue;
unsigned char   *p;
int sect, nsect;
unsigned long   flags;
+   int ret;
 
-   DPRINT(("do_format( dr=%d tr=%d he=%d offs=%d )\n",
-   drive, desc->track, desc->head, desc->sect_offset ));
+   blk_mq_freeze_queue(q);
+   blk_mq_quiesce_queue(q);
 
-   wait_event(fdc_wait, cmpxchg(&fdc_busy, 0, 1) == 0);
local_irq_save(flags);
stdma_lock(floppy_irq, NULL);
atari_turnon_irq( IRQ_MFP_FDC ); /* should be already, just to be sure 
*/
@@ -740,16 +739,16 @@ static int do_format(int drive, int type, struct 
atari_format_descr *desc)
if (type) {
if (--type >= NUM_DISK_MINORS ||
minor2disktype[type].drive_types > DriveType) {
-   redo_fd_request();
-   return -EINVAL;
+   ret = -EINVAL;
+   goto out;
}
type = minor2disktype[type].index;
UDT = &atari_disk_type[type];
}
 
if (!UDT || desc->track >= UDT->blocks/UDT->spt/2 || desc->head >= 2) {
-   redo_fd_request();
-   return -EINVAL;
+   ret = -EINVAL;
+   goto out;
}
 
nsect = UDT->spt;
@@ -788,8 +787,11 @@ static int do_format(int drive, int type, struct 
atari_format_descr *desc)
 
wait_for_completion(&format_wait);
 
-   redo_fd_request();
-   return( FormatError ? -EIO : 0 );   
+   ret = FormatError ? -EIO : 0;
+out:
+   blk_mq_unquiesce_queue(q);
+   blk_mq_unfreeze_queue(q);
+   return ret;
 }
 
 
@@ -819,7 +821,6 @@ static void do_fd_action( int drive )
else {
/* all sectors finished */
fd_end_request_cur(BLK_STS_OK);
-   redo_fd_request();
return;
}
}
@@ -1224,7 +1225,6 @@ static void fd_rwsec_done1(int status)
else {
/* all sectors finished */
fd_end_request_cur(BLK_STS_OK);
-   redo_fd_request();
}
return;
   
@@ -1382,8 +1382,6 @@ static void finish_fdc

[PATCH v2 04/11] swim3: convert to blk-mq

2018-10-11 Thread Omar Sandoval
From: Omar Sandoval 

Pretty simple conversion. grab_drive() could probably be replaced by
some freeze/quiesce incantation, but I left it alone, and just used
freeze/quiesce for eject. Compile-tested only.

Cc: Benjamin Herrenschmidt 
Signed-off-by: Omar Sandoval 
---
 drivers/block/swim3.c | 163 +++---
 1 file changed, 75 insertions(+), 88 deletions(-)

diff --git a/drivers/block/swim3.c b/drivers/block/swim3.c
index df7ebe016e2c..cc28f17bfa5b 100644
--- a/drivers/block/swim3.c
+++ b/drivers/block/swim3.c
@@ -25,7 +25,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
@@ -206,6 +206,7 @@ struct floppy_state {
chardbdma_cmd_space[5 * sizeof(struct dbdma_cmd)];
int index;
struct request *cur_req;
+   struct blk_mq_tag_set tag_set;
 };
 
 #define swim3_err(fmt, arg...) dev_err(>mdev->ofdev.dev, "[fd%d] " fmt, 
fs->index, arg)
@@ -260,16 +261,15 @@ static int floppy_revalidate(struct gendisk *disk);
 static bool swim3_end_request(struct floppy_state *fs, blk_status_t err, 
unsigned int nr_bytes)
 {
struct request *req = fs->cur_req;
-   int rc;
 
swim3_dbg("  end request, err=%d nr_bytes=%d, cur_req=%p\n",
  err, nr_bytes, req);
 
if (err)
nr_bytes = blk_rq_cur_bytes(req);
-   rc = __blk_end_request(req, err, nr_bytes);
-   if (rc)
+   if (blk_update_request(req, err, nr_bytes))
return true;
+   __blk_mq_end_request(req, err);
fs->cur_req = NULL;
return false;
 }
@@ -309,86 +309,58 @@ static int swim3_readbit(struct floppy_state *fs, int bit)
return (stat & DATA) == 0;
 }
 
-static void start_request(struct floppy_state *fs)
+static blk_status_t swim3_queue_rq(struct blk_mq_hw_ctx *hctx,
+  const struct blk_mq_queue_data *bd)
 {
-   struct request *req;
+   struct floppy_state *fs = hctx->queue->queuedata;
+   struct request *req = bd->rq;
unsigned long x;
 
-   swim3_dbg("start request, initial state=%d\n", fs->state);
-
-   if (fs->state == idle && fs->wanted) {
-   fs->state = available;
-   wake_up(&fs->wait);
-   return;
+   spin_lock_irq(&swim3_lock);
+   if (fs->cur_req || fs->state != idle) {
+   spin_unlock_irq(&swim3_lock);
+   return BLK_STS_DEV_RESOURCE;
}
-   while (fs->state == idle) {
-   swim3_dbg("start request, idle loop, cur_req=%p\n", 
fs->cur_req);
-   if (!fs->cur_req) {
-   fs->cur_req = 
blk_fetch_request(disks[fs->index]->queue);
-   swim3_dbg("  fetched request %p\n", fs->cur_req);
-   if (!fs->cur_req)
-   break;
-   }
-   req = fs->cur_req;
-
-   if (fs->mdev->media_bay &&
-   check_media_bay(fs->mdev->media_bay) != MB_FD) {
-   swim3_dbg("%s", "  media bay absent, dropping req\n");
-   swim3_end_request(fs, BLK_STS_IOERR, 0);
-   continue;
-   }
-
-#if 0 /* This is really too verbose */
-   swim3_dbg("do_fd_req: dev=%s cmd=%d sec=%ld nr_sec=%u buf=%p\n",
- req->rq_disk->disk_name, req->cmd,
- (long)blk_rq_pos(req), blk_rq_sectors(req),
- bio_data(req->bio));
-   swim3_dbg("   current_nr_sectors=%u\n",
- blk_rq_cur_sectors(req));
-#endif
-
-   if (blk_rq_pos(req) >= fs->total_secs) {
-   swim3_dbg("  pos out of bounds (%ld, max is %ld)\n",
- (long)blk_rq_pos(req), (long)fs->total_secs);
-   swim3_end_request(fs, BLK_STS_IOERR, 0);
-   continue;
-   }
-   if (fs->ejected) {
-   swim3_dbg("%s", "  disk ejected\n");
+   blk_mq_start_request(req);
+   fs->cur_req = req;
+   if (fs->mdev->media_bay &&
+   check_media_bay(fs->mdev->media_bay) != MB_FD) {
+   swim3_dbg("%s", "  media bay absent, dropping req\n");
+   swim3_end_request(fs, BLK_STS_IOERR, 0);
+   goto out;
+   }
+   if (fs->ejected) {
+   swim3_dbg("%s", "  disk ejected\n");
+   swim3_end_request(fs, BLK_STS_IOERR, 0);
+   goto out;
+   }
+   if (rq_data_dir(req) == WRITE) {
+   if (fs->write_prot < 0)
+   fs->write_p

[PATCH v2 02/11] swim: convert to blk-mq

2018-10-11 Thread Omar Sandoval
From: Omar Sandoval 

The only interesting thing here is that there may be two floppies (i.e.,
request queues) sharing the same controller, so we use the global struct
swim_priv->lock to check whether the controller is busy. Compile-tested
only.

Cc: Finn Thain 
Cc: Laurent Vivier 
Signed-off-by: Omar Sandoval 
---
 drivers/block/swim.c | 101 +--
 1 file changed, 49 insertions(+), 52 deletions(-)

diff --git a/drivers/block/swim.c b/drivers/block/swim.c
index cbe909c51847..f07fa993c364 100644
--- a/drivers/block/swim.c
+++ b/drivers/block/swim.c
@@ -19,7 +19,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
@@ -190,6 +190,7 @@ struct floppy_state {
int ref_count;
 
struct gendisk *disk;
+   struct blk_mq_tag_set tag_set;
 
/* parent controller */
 
@@ -211,7 +212,6 @@ enum head {
 struct swim_priv {
struct swim __iomem *base;
spinlock_t lock;
-   int fdc_queue;
int floppy_count;
struct floppy_state unit[FD_MAX_UNIT];
 };
@@ -525,58 +525,36 @@ static blk_status_t floppy_read_sectors(struct 
floppy_state *fs,
return 0;
 }
 
-static struct request *swim_next_request(struct swim_priv *swd)
+static blk_status_t swim_queue_rq(struct blk_mq_hw_ctx *hctx,
+ const struct blk_mq_queue_data *bd)
 {
-   struct request_queue *q;
-   struct request *rq;
-   int old_pos = swd->fdc_queue;
+   struct floppy_state *fs = hctx->queue->queuedata;
+   struct swim_priv *swd = fs->swd;
+   struct request *req = bd->rq;
+   blk_status_t err;
 
-   do {
-   q = swd->unit[swd->fdc_queue].disk->queue;
-   if (++swd->fdc_queue == swd->floppy_count)
-   swd->fdc_queue = 0;
-   if (q) {
-   rq = blk_fetch_request(q);
-   if (rq)
-   return rq;
-   }
-   } while (swd->fdc_queue != old_pos);
+   if (!spin_trylock_irq(&swd->lock))
+   return BLK_STS_DEV_RESOURCE;
 
-   return NULL;
-}
+   blk_mq_start_request(req);
 
-static void do_fd_request(struct request_queue *q)
-{
-   struct swim_priv *swd = q->queuedata;
-   struct request *req;
-   struct floppy_state *fs;
+   if (!fs->disk_in || rq_data_dir(req) == WRITE) {
+   err = BLK_STS_IOERR;
+   goto out;
+   }
 
-   req = swim_next_request(swd);
-   while (req) {
-   blk_status_t err = BLK_STS_IOERR;
+   do {
+   err = floppy_read_sectors(fs, blk_rq_pos(req),
+ blk_rq_cur_sectors(req),
+ bio_data(req->bio));
+   } while (blk_update_request(req, err, blk_rq_cur_bytes(req)));
+   __blk_mq_end_request(req, err);
 
-   fs = req->rq_disk->private_data;
-   if (blk_rq_pos(req) >= fs->total_secs)
-   goto done;
-   if (!fs->disk_in)
-   goto done;
-   if (rq_data_dir(req) == WRITE && fs->write_protected)
-   goto done;
+   err = BLK_STS_OK;
+out:
+   spin_unlock_irq(&swd->lock);
+   return err;
 
-   switch (rq_data_dir(req)) {
-   case WRITE:
-   /* NOT IMPLEMENTED */
-   break;
-   case READ:
-   err = floppy_read_sectors(fs, blk_rq_pos(req),
- blk_rq_cur_sectors(req),
- bio_data(req->bio));
-   break;
-   }
-   done:
-   if (!__blk_end_request_cur(req, err))
-   req = swim_next_request(swd);
-   }
 }
 
 static struct floppy_struct floppy_type[4] = {
@@ -823,6 +801,10 @@ static int swim_add_floppy(struct swim_priv *swd, enum 
drive_location location)
return 0;
 }
 
+static const struct blk_mq_ops swim_mq_ops = {
+   .queue_rq = swim_queue_rq,
+};
+
 static int swim_floppy_init(struct swim_priv *swd)
 {
int err;
@@ -852,20 +834,33 @@ static int swim_floppy_init(struct swim_priv *swd)
spin_lock_init(&swd->lock);
 
for (drive = 0; drive < swd->floppy_count; drive++) {
+   struct blk_mq_tag_set *set;
+
swd->unit[drive].disk = alloc_disk(1);
if (swd->unit[drive].disk == NULL) {
err = -ENOMEM;
goto exit_put_disks;
}
-   swd->unit[drive].disk->queue = blk_init_queue(do_fd_request,
- &swd->lock);
-   if (!swd->unit[drive].disk->queue) {
- 

[PATCH v2 03/11] swim3: add real error handling in setup

2018-10-11 Thread Omar Sandoval
From: Omar Sandoval 

The driver doesn't have support for removing a device that has already
been configured, but with more careful ordering we can avoid the need
for that and make sure that we don't leak generic resources.

Signed-off-by: Omar Sandoval 
---
 drivers/block/swim3.c | 60 ++-
 1 file changed, 36 insertions(+), 24 deletions(-)

diff --git a/drivers/block/swim3.c b/drivers/block/swim3.c
index 469541c1e51e..df7ebe016e2c 100644
--- a/drivers/block/swim3.c
+++ b/drivers/block/swim3.c
@@ -1202,47 +1202,59 @@ static int swim3_add_device(struct macio_dev *mdev, int 
index)
 static int swim3_attach(struct macio_dev *mdev,
const struct of_device_id *match)
 {
+   struct floppy_state *fs;
struct gendisk *disk;
-   int index, rc;
+   int rc;
 
-   index = floppy_count++;
-   if (index >= MAX_FLOPPIES)
+   if (floppy_count >= MAX_FLOPPIES)
return -ENXIO;
 
-   /* Add the drive */
-   rc = swim3_add_device(mdev, index);
-   if (rc)
-   return rc;
-   /* Now register that disk. Same comment about failure handling */
-   disk = disks[index] = alloc_disk(1);
-   if (disk == NULL)
-   return -ENOMEM;
+   if (floppy_count == 0) {
+   rc = register_blkdev(FLOPPY_MAJOR, "fd");
+   if (rc)
+   return rc;
+   }
+
+   fs = &floppy_states[floppy_count];
+
+   disk = alloc_disk(1);
+   if (disk == NULL) {
+   rc = -ENOMEM;
+   goto out_unregister;
+   }
disk->queue = blk_init_queue(do_fd_request, &swim3_lock);
if (disk->queue == NULL) {
-   put_disk(disk);
-   return -ENOMEM;
+   rc = -ENOMEM;
+   goto out_put_disk;
}
blk_queue_bounce_limit(disk->queue, BLK_BOUNCE_HIGH);
-   disk->queue->queuedata = &floppy_states[index];
+   disk->queue->queuedata = fs;
 
-   if (index == 0) {
-   /* If we failed, there isn't much we can do as the driver is 
still
-* too dumb to remove the device, just bail out
-*/
-   if (register_blkdev(FLOPPY_MAJOR, "fd"))
-   return 0;
-   }
+   rc = swim3_add_device(mdev, floppy_count);
+   if (rc)
+   goto out_cleanup_queue;
 
disk->major = FLOPPY_MAJOR;
-   disk->first_minor = index;
+   disk->first_minor = floppy_count;
disk->fops = &floppy_fops;
-   disk->private_data = &floppy_states[index];
+   disk->private_data = fs;
disk->flags |= GENHD_FL_REMOVABLE;
-   sprintf(disk->disk_name, "fd%d", index);
+   sprintf(disk->disk_name, "fd%d", floppy_count);
set_capacity(disk, 2880);
add_disk(disk);
 
+   disks[floppy_count++] = disk;
return 0;
+
+out_cleanup_queue:
+   blk_cleanup_queue(disk->queue);
+   disk->queue = NULL;
+out_put_disk:
+   put_disk(disk);
+out_unregister:
+   if (floppy_count == 0)
+   unregister_blkdev(FLOPPY_MAJOR, "fd");
+   return rc;
 }
 
 static const struct of_device_id swim3_match[] =
-- 
2.19.1



[PATCH v2 05/11] amiflop: fold headers into C file

2018-10-11 Thread Omar Sandoval
From: Omar Sandoval 

amifd.h and amifdreg.h are only used from amiflop.c, and they're pretty
small, so move the contents to amiflop.c and get rid of the .h files.
This is preparation for adding a struct blk_mq_tag_set to struct
amiga_floppy_struct.

Signed-off-by: Omar Sandoval 
---
 drivers/block/amiflop.c  | 123 ++-
 include/linux/amifd.h|  63 
 include/linux/amifdreg.h |  82 --
 3 files changed, 120 insertions(+), 148 deletions(-)
 delete mode 100644 include/linux/amifd.h
 delete mode 100644 include/linux/amifdreg.h

diff --git a/drivers/block/amiflop.c b/drivers/block/amiflop.c
index 3aaf6af3ec23..a7d6e6a9b12f 100644
--- a/drivers/block/amiflop.c
+++ b/drivers/block/amiflop.c
@@ -61,10 +61,8 @@
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
@@ -86,6 +84,125 @@
  *  Defines
  */
 
+/*
+ * CIAAPRA bits (read only)
+ */
+
+#define DSKRDY  (0x1<<5)/* disk ready when low */
+#define DSKTRACK0   (0x1<<4)/* head at track zero when low */
+#define DSKPROT (0x1<<3)/* disk protected when low */
+#define DSKCHANGE   (0x1<<2)/* low when disk removed */
+
+/*
+ * CIAAPRB bits (read/write)
+ */
+
+#define DSKMOTOR(0x1<<7)/* motor on when low */
+#define DSKSEL3 (0x1<<6)/* select drive 3 when low */
+#define DSKSEL2 (0x1<<5)/* select drive 2 when low */
+#define DSKSEL1 (0x1<<4)/* select drive 1 when low */
+#define DSKSEL0 (0x1<<3)/* select drive 0 when low */
+#define DSKSIDE (0x1<<2)/* side selection: 0 = upper, 1 = lower */
+#define DSKDIREC(0x1<<1)/* step direction: 0=in, 1=out (to trk 0) 
*/
+#define DSKSTEP (0x1)   /* pulse low to step head 1 track */
+
+/*
+ * DSKBYTR bits (read only)
+ */
+
+#define DSKBYT  (1<<15) /* register contains valid byte when set */
+#define DMAON   (1<<14) /* disk DMA enabled */
+#define DISKWRITE   (1<<13) /* disk write bit in DSKLEN enabled */
+#define WORDEQUAL   (1<<12) /* DSKSYNC register match when true */
+/* bits 7-0 are data */
+
+/*
+ * ADKCON/ADKCONR bits
+ */
+
+#ifndef SETCLR
+#define ADK_SETCLR  (1<<15) /* control bit */
+#endif
+#define ADK_PRECOMP1(1<<14) /* precompensation selection */
+#define ADK_PRECOMP0(1<<13) /* 00=none, 01=140ns, 10=280ns, 11=500ns */
+#define ADK_MFMPREC (1<<12) /* 0=GCR precomp., 1=MFM precomp. */
+#define ADK_WORDSYNC(1<<10) /* enable DSKSYNC auto DMA */
+#define ADK_MSBSYNC (1<<9)  /* when 1, enable sync on MSbit (for GCR) 
*/
+#define ADK_FAST(1<<8)  /* bit cell: 0=2us (GCR), 1=1us (MFM) */
+
+/*
+ * DSKLEN bits
+ */
+
+#define DSKLEN_DMAEN(1<<15)
+#define DSKLEN_WRITE(1<<14)
+
+/*
+ * INTENA/INTREQ bits
+ */
+
+#define DSKINDEX(0x1<<4)/* DSKINDEX bit */
+
+/*
+ * Misc
+ */
+
+#define MFM_SYNC0x4489  /* standard MFM sync value */
+
+/* Values for FD_COMMAND */
+#define FD_RECALIBRATE 0x07/* move to track 0 */
+#define FD_SEEK0x0F/* seek track */
+#define FD_READ0xE6/* read with MT, MFM, SKip 
deleted */
+#define FD_WRITE   0xC5/* write with MT, MFM */
+#define FD_SENSEI  0x08/* Sense Interrupt Status */
+#define FD_SPECIFY 0x03/* specify HUT etc */
+#define FD_FORMAT  0x4D/* format one track */
+#define FD_VERSION 0x10/* get version code */
+#define FD_CONFIGURE   0x13/* configure FIFO operation */
+#define FD_PERPENDICULAR   0x12/* perpendicular r/w mode */
+
+#define FD_MAX_UNITS4  /* Max. Number of drives */
+#define FLOPPY_MAX_SECTORS 22  /* Max. Number of sectors per track */
+
+struct fd_data_type {
+   char *name; /* description of data type */
+   int sects;  /* sectors per track */
+   int (*read_fkt)(int);   /* read whole track */
+   void (*write_fkt)(int); /* write whole track */
+};
+
+struct fd_drive_type {
+   unsigned long code; /* code returned from drive */
+   char *name; /* description of drive */
+   unsigned int tracks;/* number of tracks */
+   unsigned int heads; /* number of heads */
+   unsigned int read_size; /* raw read size for one track */
+   unsigned int write_size;/* raw write size for one track */
+   unsigned int sect_mult; /* sectors and gap multiplier (HD = 2) */
+   unsigned int precomp1;  /* start track for precomp 1 */
+   unsigned int precomp2;  /* start track for precomp 2 */
+   unsigned int step_delay;  

[PATCH v2 11/11] floppy: convert to blk-mq

2018-10-11 Thread Omar Sandoval
From: Omar Sandoval 

This driver likes to fetch requests from all over the place, so make
queue_rq put requests on a list so that the logic stays the same. Tested
with QEMU.

Signed-off-by: Omar Sandoval 
---
 drivers/block/floppy.c | 74 +-
 1 file changed, 44 insertions(+), 30 deletions(-)

diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index f2b6f4da1034..62e0c218e5c6 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -252,13 +252,13 @@ static int allowed_drive_mask = 0x33;
 
 static int irqdma_allocated;
 
-#include 
+#include 
 #include 
 #include/* for the compatibility eject ioctl */
 #include 
 
+static LIST_HEAD(floppy_reqs);
 static struct request *current_req;
-static void do_fd_request(struct request_queue *q);
 static int set_next_request(void);
 
 #ifndef fd_get_dma_residue
@@ -414,10 +414,10 @@ static struct floppy_drive_struct drive_state[N_DRIVE];
 static struct floppy_write_errors write_errors[N_DRIVE];
 static struct timer_list motor_off_timer[N_DRIVE];
 static struct gendisk *disks[N_DRIVE];
+static struct blk_mq_tag_set tag_sets[N_DRIVE];
 static struct block_device *opened_bdev[N_DRIVE];
 static DEFINE_MUTEX(open_lock);
 static struct floppy_raw_cmd *raw_cmd, default_raw_cmd;
-static int fdc_queue;
 
 /*
  * This struct defines the different floppy types.
@@ -2216,8 +2216,9 @@ static void floppy_end_request(struct request *req, 
blk_status_t error)
/* current_count_sectors can be zero if transfer failed */
if (error)
nr_sectors = blk_rq_cur_sectors(req);
-   if (__blk_end_request(req, error, nr_sectors << 9))
+   if (blk_update_request(req, error, nr_sectors << 9))
return;
+   __blk_mq_end_request(req, error);
 
/* We're done with the request */
floppy_off(drive);
@@ -2797,27 +2798,14 @@ static int make_raw_rw_request(void)
return 2;
 }
 
-/*
- * Round-robin between our available drives, doing one request from each
- */
 static int set_next_request(void)
 {
-   struct request_queue *q;
-   int old_pos = fdc_queue;
-
-   do {
-   q = disks[fdc_queue]->queue;
-   if (++fdc_queue == N_DRIVE)
-   fdc_queue = 0;
-   if (q) {
-   current_req = blk_fetch_request(q);
-   if (current_req) {
-   current_req->error_count = 0;
-   break;
-   }
-   }
-   } while (fdc_queue != old_pos);
-
+   current_req = list_first_entry_or_null(&floppy_reqs, struct request,
+  queuelist);
+   if (current_req) {
+   current_req->error_count = 0;
+   list_del_init(&current_req->queuelist);
+   }
return current_req != NULL;
 }
 
@@ -2901,29 +2889,38 @@ static void process_fd_request(void)
schedule_bh(redo_fd_request);
 }
 
-static void do_fd_request(struct request_queue *q)
+static blk_status_t floppy_queue_rq(struct blk_mq_hw_ctx *hctx,
+   const struct blk_mq_queue_data *bd)
 {
+   blk_mq_start_request(bd->rq);
+
if (WARN(max_buffer_sectors == 0,
 "VFS: %s called on non-open device\n", __func__))
-   return;
+   return BLK_STS_IOERR;
 
if (WARN(atomic_read(&usage_count) == 0,
 "warning: usage count=0, current_req=%p sect=%ld flags=%llx\n",
 current_req, (long)blk_rq_pos(current_req),
 (unsigned long long) current_req->cmd_flags))
-   return;
+   return BLK_STS_IOERR;
+
+   spin_lock_irq(&floppy_lock);
+   list_add_tail(&bd->rq->queuelist, &floppy_reqs);
+   spin_unlock_irq(&floppy_lock);
 
if (test_and_set_bit(0, &fdc_busy)) {
/* fdc busy, this new request will be treated when the
   current one is done */
is_alive(__func__, "old request running");
-   return;
+   return BLK_STS_OK;
}
+
command_status = FD_COMMAND_NONE;
__reschedule_timeout(MAXTIMEOUT, "fd_request");
set_fdc(0);
process_fd_request();
is_alive(__func__, "");
+   return BLK_STS_OK;
 }
 
 static const struct cont_t poll_cont = {
@@ -4486,6 +4483,10 @@ static struct platform_driver floppy_driver = {
},
 };
 
+static const struct blk_mq_ops floppy_mq_ops = {
+   .queue_rq = floppy_queue_rq,
+};
+
 static struct platform_device floppy_device[N_DRIVE];
 
 static bool floppy_available(int drive)
@@ -4527,15 +4528,28 @@ static int __init do_floppy_init(void)
return -ENOMEM;
 
for (drive = 0; drive < N_DRIVE; drive++) {
+   struct blk_mq_tag_set *set;
+
disks[drive] = alloc_disk(1);
   

[PATCH v2 00/11] Convert floppy drivers to blk-mq

2018-10-11 Thread Omar Sandoval
From: Omar Sandoval 

Hi,

This series converts the various floppy drivers to blk-mq. Save for the
last one (floppy), they're compile-tested only. If I've Cc'd you, it's
because I think you might be able to test the changes. Please test if
you can, or let me know if there's a way to use QEMU/some other emulator
to test. The full series is available at [1]. Thanks!

Changes from v1:

- Drop "swim3: end whole request on error" and "floppy: end whole
  request on error".
- Handle errors on individual bios correctly in the other drivers.

Cc: Benjamin Herrenschmidt 
Cc: Finn Thain 
Cc: Laurent Vivier 

1: https://github.com/osandov/linux/tree/mq-conversions.

Omar Sandoval (11):
  swim: fix cleanup on setup error
  swim: convert to blk-mq
  swim3: add real error handling in setup
  swim3: convert to blk-mq
  amiflop: fold headers into C file
  amiflop: clean up on errors during setup
  amiflop: convert to blk-mq
  ataflop: fold headers into C file
  ataflop: fix error handling during setup
  ataflop: convert to blk-mq
  floppy: convert to blk-mq

 arch/m68k/include/asm/atafd.h|  13 --
 arch/m68k/include/asm/atafdreg.h |  80 
 drivers/block/amiflop.c  | 328 +++
 drivers/block/ataflop.c  | 283 +++---
 drivers/block/floppy.c   |  74 ---
 drivers/block/swim.c | 114 ++-
 drivers/block/swim3.c| 219 ++---
 include/linux/amifd.h|  63 --
 include/linux/amifdreg.h |  82 
 9 files changed, 590 insertions(+), 666 deletions(-)
 delete mode 100644 arch/m68k/include/asm/atafd.h
 delete mode 100644 arch/m68k/include/asm/atafdreg.h
 delete mode 100644 include/linux/amifd.h
 delete mode 100644 include/linux/amifdreg.h

-- 
2.19.1



[PATCH v2 01/11] swim: fix cleanup on setup error

2018-10-11 Thread Omar Sandoval
From: Omar Sandoval 

If we fail to allocate the request queue for a disk, we still need to
free that disk, not just the previous ones. Additionally, we need to
cleanup the previous request queues.

Signed-off-by: Omar Sandoval 
---
 drivers/block/swim.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/block/swim.c b/drivers/block/swim.c
index 0e31884a9519..cbe909c51847 100644
--- a/drivers/block/swim.c
+++ b/drivers/block/swim.c
@@ -887,8 +887,17 @@ static int swim_floppy_init(struct swim_priv *swd)
 
 exit_put_disks:
unregister_blkdev(FLOPPY_MAJOR, "fd");
-   while (drive--)
-   put_disk(swd->unit[drive].disk);
+   do {
+   struct gendisk *disk = swd->unit[drive].disk;
+
+   if (disk) {
+   if (disk->queue) {
+   blk_cleanup_queue(disk->queue);
+   disk->queue = NULL;
+   }
+   put_disk(disk);
+   }
+   } while (drive--);
return err;
 }
 
-- 
2.19.1



Re: [PATCH 12/13] floppy: end whole request on error

2018-10-11 Thread Omar Sandoval
On Thu, Oct 11, 2018 at 12:39:51PM -0600, Jens Axboe wrote:
> On 10/11/18 12:30 PM, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > If floppy_end_request() gets passed an error, it should end the whole
> > request, not just the current segment.
> 
> I don't think this is correct, we should still just end the
> individual chunks.

Yup, I'll drop this and the swim3 equivalent, thanks.


[PATCH 06/13] amiflop: fold headers into C file

2018-10-11 Thread Omar Sandoval
From: Omar Sandoval 

amifd.h and amifdreg.h are only used from amiflop.c, and they're pretty
small, so move the contents to amiflop.c and get rid of the .h files.
This is preparation for adding a struct blk_mq_tag_set to struct
amiga_floppy_struct.

Signed-off-by: Omar Sandoval 
---
 drivers/block/amiflop.c  | 123 ++-
 include/linux/amifd.h|  63 
 include/linux/amifdreg.h |  82 --
 3 files changed, 120 insertions(+), 148 deletions(-)
 delete mode 100644 include/linux/amifd.h
 delete mode 100644 include/linux/amifdreg.h

diff --git a/drivers/block/amiflop.c b/drivers/block/amiflop.c
index 3aaf6af3ec23..a7d6e6a9b12f 100644
--- a/drivers/block/amiflop.c
+++ b/drivers/block/amiflop.c
@@ -61,10 +61,8 @@
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
@@ -86,6 +84,125 @@
  *  Defines
  */
 
+/*
+ * CIAAPRA bits (read only)
+ */
+
+#define DSKRDY  (0x1<<5)/* disk ready when low */
+#define DSKTRACK0   (0x1<<4)/* head at track zero when low */
+#define DSKPROT (0x1<<3)/* disk protected when low */
+#define DSKCHANGE   (0x1<<2)/* low when disk removed */
+
+/*
+ * CIAAPRB bits (read/write)
+ */
+
+#define DSKMOTOR(0x1<<7)/* motor on when low */
+#define DSKSEL3 (0x1<<6)/* select drive 3 when low */
+#define DSKSEL2 (0x1<<5)/* select drive 2 when low */
+#define DSKSEL1 (0x1<<4)/* select drive 1 when low */
+#define DSKSEL0 (0x1<<3)/* select drive 0 when low */
+#define DSKSIDE (0x1<<2)/* side selection: 0 = upper, 1 = lower */
+#define DSKDIREC(0x1<<1)/* step direction: 0=in, 1=out (to trk 0) */
+#define DSKSTEP (0x1)   /* pulse low to step head 1 track */
+
+/*
+ * DSKBYTR bits (read only)
+ */
+
+#define DSKBYT  (1<<15) /* register contains valid byte when set */
+#define DMAON   (1<<14) /* disk DMA enabled */
+#define DISKWRITE   (1<<13) /* disk write bit in DSKLEN enabled */
+#define WORDEQUAL   (1<<12) /* DSKSYNC register match when true */
+/* bits 7-0 are data */
+
+/*
+ * ADKCON/ADKCONR bits
+ */
+
+#ifndef SETCLR
+#define ADK_SETCLR  (1<<15) /* control bit */
+#endif
+#define ADK_PRECOMP1(1<<14) /* precompensation selection */
+#define ADK_PRECOMP0(1<<13) /* 00=none, 01=140ns, 10=280ns, 11=500ns */
+#define ADK_MFMPREC (1<<12) /* 0=GCR precomp., 1=MFM precomp. */
+#define ADK_WORDSYNC(1<<10) /* enable DSKSYNC auto DMA */
+#define ADK_MSBSYNC (1<<9)  /* when 1, enable sync on MSbit (for GCR) */
+#define ADK_FAST(1<<8)  /* bit cell: 0=2us (GCR), 1=1us (MFM) */
+
+/*
+ * DSKLEN bits
+ */
+
+#define DSKLEN_DMAEN(1<<15)
+#define DSKLEN_WRITE(1<<14)
+
+/*
+ * INTENA/INTREQ bits
+ */
+
+#define DSKINDEX(0x1<<4)/* DSKINDEX bit */
+
+/*
+ * Misc
+ */
+
+#define MFM_SYNC0x4489  /* standard MFM sync value */
+
+/* Values for FD_COMMAND */
+#define FD_RECALIBRATE 0x07/* move to track 0 */
+#define FD_SEEK0x0F/* seek track */
+#define FD_READ0xE6/* read with MT, MFM, SKip deleted */
+#define FD_WRITE   0xC5/* write with MT, MFM */
+#define FD_SENSEI  0x08/* Sense Interrupt Status */
+#define FD_SPECIFY 0x03/* specify HUT etc */
+#define FD_FORMAT  0x4D/* format one track */
+#define FD_VERSION 0x10/* get version code */
+#define FD_CONFIGURE   0x13/* configure FIFO operation */
+#define FD_PERPENDICULAR   0x12/* perpendicular r/w mode */
+
+#define FD_MAX_UNITS4  /* Max. Number of drives */
+#define FLOPPY_MAX_SECTORS 22  /* Max. Number of sectors per track */
+
+struct fd_data_type {
+   char *name; /* description of data type */
+   int sects;  /* sectors per track */
+   int (*read_fkt)(int);   /* read whole track */
+   void (*write_fkt)(int); /* write whole track */
+};
+
+struct fd_drive_type {
+   unsigned long code; /* code returned from drive */
+   char *name; /* description of drive */
+   unsigned int tracks;/* number of tracks */
+   unsigned int heads; /* number of heads */
+   unsigned int read_size; /* raw read size for one track */
+   unsigned int write_size;/* raw write size for one track */
+   unsigned int sect_mult; /* sectors and gap multiplier (HD = 2) */
+   unsigned int precomp1;  /* start track for precomp 1 */
+   unsigned int precomp2;  /* start track for precomp 2 */
+   unsigned int step_delay;  

[PATCH 08/13] amiflop: convert to blk-mq

2018-10-11 Thread Omar Sandoval
From: Omar Sandoval 

Straightforward conversion, just use the existing amiflop_lock to
serialize access to the controller. Compile-tested only.

Cc: Laurent Vivier 
Signed-off-by: Omar Sandoval 
---
 drivers/block/amiflop.c | 127 +++-
 1 file changed, 48 insertions(+), 79 deletions(-)
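
As background for the blk_mq_tag_set field added below, the generic setup
sequence for a small single-queue device looks roughly like this (a hedged
sketch with generic names, not the patch's exact code):

/* Illustrative per-unit blk-mq setup; not the patch's exact code. */
#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/err.h>
#include <linux/string.h>

static blk_status_t example_queue_rq(struct blk_mq_hw_ctx *hctx,
				     const struct blk_mq_queue_data *bd);

static const struct blk_mq_ops example_mq_ops = {
	.queue_rq	= example_queue_rq,
};

static int example_init_queue(struct blk_mq_tag_set *set, struct gendisk *disk)
{
	int err;

	memset(set, 0, sizeof(*set));
	set->ops = &example_mq_ops;
	set->nr_hw_queues = 1;			/* one submission path per drive */
	set->queue_depth = 2;			/* the hardware is strictly serial */
	set->numa_node = NUMA_NO_NODE;
	set->flags = BLK_MQ_F_SHOULD_MERGE;

	err = blk_mq_alloc_tag_set(set);
	if (err)
		return err;

	disk->queue = blk_mq_init_queue(set);
	if (IS_ERR(disk->queue)) {
		err = PTR_ERR(disk->queue);
		disk->queue = NULL;
		blk_mq_free_tag_set(set);
		return err;
	}
	return 0;
}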

diff --git a/drivers/block/amiflop.c b/drivers/block/amiflop.c
index eef3b085e70a..e9c701f9b32b 100644
--- a/drivers/block/amiflop.c
+++ b/drivers/block/amiflop.c
@@ -201,6 +201,7 @@ struct amiga_floppy_struct {
int dirty;  /* true when trackbuf is not on disk */
int status; /* current error code for unit */
struct gendisk *gendisk;
+   struct blk_mq_tag_set tag_set;
 };
 
 /*
@@ -281,7 +282,6 @@ static volatile int selected = -1;  /* currently selected drive */
 static int writepending;
 static int writefromint;
 static char *raw_buf;
-static int fdc_queue;
 
 static DEFINE_SPINLOCK(amiflop_lock);
 
@@ -1454,76 +1454,20 @@ static int get_track(int drive, int track)
return -1;
 }
 
-/*
- * Round-robin between our available drives, doing one request from each
- */
-static struct request *set_next_request(void)
+static blk_status_t amiflop_rw_cur_segment(struct amiga_floppy_struct *floppy,
+  struct request *rq)
 {
-   struct request_queue *q;
-   int cnt = FD_MAX_UNITS;
-   struct request *rq = NULL;
-
-   /* Find next queue we can dispatch from */
-   fdc_queue = fdc_queue + 1;
-   if (fdc_queue == FD_MAX_UNITS)
-   fdc_queue = 0;
-
-   for(cnt = FD_MAX_UNITS; cnt > 0; cnt--) {
-
-   if (unit[fdc_queue].type->code == FD_NODRIVE) {
-   if (++fdc_queue == FD_MAX_UNITS)
-   fdc_queue = 0;
-   continue;
-   }
-
-   q = unit[fdc_queue].gendisk->queue;
-   if (q) {
-   rq = blk_fetch_request(q);
-   if (rq)
-   break;
-   }
-
-   if (++fdc_queue == FD_MAX_UNITS)
-   fdc_queue = 0;
-   }
-
-   return rq;
-}
-
-static void redo_fd_request(void)
-{
-   struct request *rq;
+   int drive = floppy - unit;
unsigned int cnt, block, track, sector;
-   int drive;
-   struct amiga_floppy_struct *floppy;
char *data;
-   unsigned long flags;
-   blk_status_t err;
-
-next_req:
-   rq = set_next_request();
-   if (!rq) {
-   /* Nothing left to do */
-   return;
-   }
 
-   floppy = rq->rq_disk->private_data;
-   drive = floppy - unit;
-
-next_segment:
-   /* Here someone could investigate to be more efficient */
-   for (cnt = 0, err = BLK_STS_OK; cnt < blk_rq_cur_sectors(rq); cnt++) {
+   for (cnt = 0; cnt < blk_rq_cur_sectors(rq); cnt++) {
 #ifdef DEBUG
printk("fd: sector %ld + %d requested for %s\n",
   blk_rq_pos(rq), cnt,
   (rq_data_dir(rq) == READ) ? "read" : "write");
 #endif
block = blk_rq_pos(rq) + cnt;
-   if ((int)block > floppy->blocks) {
-   err = BLK_STS_IOERR;
-   break;
-   }
-
track = block / (floppy->dtype->sects * floppy->type->sect_mult);
sector = block % (floppy->dtype->sects * floppy->type->sect_mult);
data = bio_data(rq->bio) + 512 * cnt;
@@ -1532,10 +1476,8 @@ static void redo_fd_request(void)
   "0x%08lx\n", track, sector, data);
 #endif
 
-   if (get_track(drive, track) == -1) {
-   err = BLK_STS_IOERR;
-   break;
-   }
+   if (get_track(drive, track) == -1)
+   return BLK_STS_IOERR;
 
if (rq_data_dir(rq) == READ) {
memcpy(data, floppy->trackbuf + sector * 512, 512);
@@ -1543,31 +1485,42 @@ static void redo_fd_request(void)
memcpy(floppy->trackbuf + sector * 512, data, 512);
 
/* keep the drive spinning while writes are scheduled */
-   if (!fd_motor_on(drive)) {
-   err = BLK_STS_IOERR;
-   break;
-   }
+   if (!fd_motor_on(drive))
+   return BLK_STS_IOERR;
/*
 * setup a callback to write the track buffer
 * after a short (1 tick) delay.
 */
-   local_irq_save(flags);
-
floppy->dirty = 1;

[PATCH 05/13] swim3: convert to blk-mq

2018-10-11 Thread Omar Sandoval
From: Omar Sandoval 

Pretty simple conversion. To avoid extra churn, we keep the
swim3_end_request() wrapper. grab_drive() could probably be replaced by
some freeze/quiesce incantation, but I left it alone, and just used
freeze/quiesce for eject. Compile-tested only.

Cc: Benjamin Herrenschmidt 
Signed-off-by: Omar Sandoval 
---
 drivers/block/swim3.c | 163 +++---
 1 file changed, 75 insertions(+), 88 deletions(-)
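
The swim3_end_request() wrapper kept here maps the old partial-completion
behaviour onto blk-mq; conceptually it is the pattern sketched below
(illustrative code, not the driver's):

/* Sketch of a partial-completion helper under blk-mq; illustrative only. */
#include <linux/blk-mq.h>

/* Returns true while the request still has segments left to transfer. */
static bool example_end_request(struct request *req, blk_status_t err,
				unsigned int nr_bytes)
{
	/* On error, fail whatever is left of the request, not just one chunk. */
	if (err)
		nr_bytes = blk_rq_bytes(req);

	/* blk_update_request() returns true if the request is not done yet. */
	if (blk_update_request(req, err, nr_bytes))
		return true;

	/* Everything transferred (or failed): complete the request. */
	__blk_mq_end_request(req, err);
	return false;
}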

diff --git a/drivers/block/swim3.c b/drivers/block/swim3.c
index 24e121ee274b..e2061a4d43f7 100644
--- a/drivers/block/swim3.c
+++ b/drivers/block/swim3.c
@@ -25,7 +25,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
@@ -206,6 +206,7 @@ struct floppy_state {
chardbdma_cmd_space[5 * sizeof(struct dbdma_cmd)];
int index;
struct request *cur_req;
+   struct blk_mq_tag_set tag_set;
 };
 
 #define swim3_err(fmt, arg...) dev_err(&fs->mdev->ofdev.dev, "[fd%d] " fmt, fs->index, arg)
@@ -260,16 +261,15 @@ static int floppy_revalidate(struct gendisk *disk);
 static bool swim3_end_request(struct floppy_state *fs, blk_status_t err, 
unsigned int nr_bytes)
 {
struct request *req = fs->cur_req;
-   int rc;
 
swim3_dbg("  end request, err=%d nr_bytes=%d, cur_req=%p\n",
  err, nr_bytes, req);
 
if (err)
nr_bytes = blk_rq_bytes(req);
-   rc = __blk_end_request(req, err, nr_bytes);
-   if (rc)
+   if (blk_update_request(req, err, nr_bytes))
return true;
+   __blk_mq_end_request(req, err);
fs->cur_req = NULL;
return false;
 }
@@ -309,86 +309,58 @@ static int swim3_readbit(struct floppy_state *fs, int bit)
return (stat & DATA) == 0;
 }
 
-static void start_request(struct floppy_state *fs)
+static blk_status_t swim3_queue_rq(struct blk_mq_hw_ctx *hctx,
+  const struct blk_mq_queue_data *bd)
 {
-   struct request *req;
+   struct floppy_state *fs = hctx->queue->queuedata;
+   struct request *req = bd->rq;
unsigned long x;
 
-   swim3_dbg("start request, initial state=%d\n", fs->state);
-
-   if (fs->state == idle && fs->wanted) {
-   fs->state = available;
-   wake_up(>wait);
-   return;
+   spin_lock_irq(&swim3_lock);
+   if (fs->cur_req || fs->state != idle) {
+   spin_unlock_irq(&swim3_lock);
+   return BLK_STS_DEV_RESOURCE;
}
-   while (fs->state == idle) {
-   swim3_dbg("start request, idle loop, cur_req=%p\n", 
fs->cur_req);
-   if (!fs->cur_req) {
-   fs->cur_req = blk_fetch_request(disks[fs->index]->queue);
-   swim3_dbg("  fetched request %p\n", fs->cur_req);
-   if (!fs->cur_req)
-   break;
-   }
-   req = fs->cur_req;
-
-   if (fs->mdev->media_bay &&
-   check_media_bay(fs->mdev->media_bay) != MB_FD) {
-   swim3_dbg("%s", "  media bay absent, dropping req\n");
-   swim3_end_request(fs, BLK_STS_IOERR, 0);
-   continue;
-   }
-
-#if 0 /* This is really too verbose */
-   swim3_dbg("do_fd_req: dev=%s cmd=%d sec=%ld nr_sec=%u buf=%p\n",
- req->rq_disk->disk_name, req->cmd,
- (long)blk_rq_pos(req), blk_rq_sectors(req),
- bio_data(req->bio));
-   swim3_dbg("   current_nr_sectors=%u\n",
- blk_rq_cur_sectors(req));
-#endif
-
-   if (blk_rq_pos(req) >= fs->total_secs) {
-   swim3_dbg("  pos out of bounds (%ld, max is %ld)\n",
- (long)blk_rq_pos(req), (long)fs->total_secs);
-   swim3_end_request(fs, BLK_STS_IOERR, 0);
-   continue;
-   }
-   if (fs->ejected) {
-   swim3_dbg("%s", "  disk ejected\n");
+   blk_mq_start_request(req);
+   fs->cur_req = req;
+   if (fs->mdev->media_bay &&
+   check_media_bay(fs->mdev->media_bay) != MB_FD) {
+   swim3_dbg("%s", "  media bay absent, dropping req\n");
+   swim3_end_request(fs, BLK_STS_IOERR, 0);
+   goto out;
+   }
+   if (fs->ejected) {
+   swim3_dbg("%s", "  disk ejected\n");
+   swim3_end_request(fs, BLK_STS_IOERR, 0);
+   goto out;
+   }
+   if (rq_data_dir(req) == WRITE) {
+   i

[PATCH 07/13] amiflop: clean up on errors during setup

2018-10-11 Thread Omar Sandoval
From: Omar Sandoval 

The error handling in fd_probe_drives() doesn't clean up at all. Fix it
up in preparation for converting to blk-mq. While we're here, get rid of
the commented out amiga_floppy_remove().

Signed-off-by: Omar Sandoval 
---
 drivers/block/amiflop.c | 84 -
 1 file changed, 40 insertions(+), 44 deletions(-)

diff --git a/drivers/block/amiflop.c b/drivers/block/amiflop.c
index a7d6e6a9b12f..eef3b085e70a 100644
--- a/drivers/block/amiflop.c
+++ b/drivers/block/amiflop.c
@@ -1818,11 +1818,41 @@ static const struct block_device_operations floppy_fops 
= {
.check_events   = amiga_check_events,
 };
 
+static struct gendisk *fd_alloc_disk(int drive)
+{
+   struct gendisk *disk;
+
+   disk = alloc_disk(1);
+   if (!disk)
+   goto out;
+
+   disk->queue = blk_init_queue(do_fd_request, &amiflop_lock);
+   if (IS_ERR(disk->queue)) {
+   disk->queue = NULL;
+   goto out_put_disk;
+   }
+
+   unit[drive].trackbuf = kmalloc(FLOPPY_MAX_SECTORS * 512, GFP_KERNEL);
+   if (!unit[drive].trackbuf)
+   goto out_cleanup_queue;
+
+   return disk;
+
+out_cleanup_queue:
+   blk_cleanup_queue(disk->queue);
+   disk->queue = NULL;
+out_put_disk:
+   put_disk(disk);
+out:
+   unit[drive].type->code = FD_NODRIVE;
+   return NULL;
+}
+
 static int __init fd_probe_drives(void)
 {
int drive,drives,nomem;
 
-   printk(KERN_INFO "FD: probing units\nfound ");
+   pr_info("FD: probing units\nfound");
drives=0;
nomem=0;
	for(drive=0;drive<FD_MAX_UNITS;drive++) {
		if (unit[drive].type->code == FD_NODRIVE)
			continue;
-   disk = alloc_disk(1);
+
+   disk = fd_alloc_disk(drive);
if (!disk) {
-   unit[drive].type->code = FD_NODRIVE;
+   pr_cont(" no mem for fd%d", drive);
+   nomem = 1;
continue;
}
unit[drive].gendisk = disk;
-
-   disk->queue = blk_init_queue(do_fd_request, &amiflop_lock);
-   if (!disk->queue) {
-   unit[drive].type->code = FD_NODRIVE;
-   continue;
-   }
-
drives++;
-   if ((unit[drive].trackbuf = kmalloc(FLOPPY_MAX_SECTORS * 512, GFP_KERNEL)) == NULL) {
-   printk("no mem for ");
-   unit[drive].type = &drive_types[num_dr_types - 1]; /* FD_NODRIVE */
-   drives--;
-   nomem = 1;
-   }
-   printk("fd%d ",drive);
+
+   pr_cont(" fd%d",drive);
disk->major = FLOPPY_MAJOR;
disk->first_minor = drive;
disk->fops = _fops;
@@ -1861,11 +1881,11 @@ static int __init fd_probe_drives(void)
}
if ((drives > 0) || (nomem == 0)) {
if (drives == 0)
-   printk("no drives");
-   printk("\n");
+   pr_cont(" no drives");
+   pr_cont("\n");
return drives;
}
-   printk("\n");
+   pr_cont("\n");
return -ENOMEM;
 }
  
@@ -1948,30 +1968,6 @@ static int __init amiga_floppy_probe(struct 
platform_device *pdev)
return ret;
 }
 
-#if 0 /* not safe to unload */
-static int __exit amiga_floppy_remove(struct platform_device *pdev)
-{
-   int i;
-
-   for( i = 0; i < FD_MAX_UNITS; i++) {
-   if (unit[i].type->code != FD_NODRIVE) {
-   struct request_queue *q = unit[i].gendisk->queue;
-   del_gendisk(unit[i].gendisk);
-   put_disk(unit[i].gendisk);
-   kfree(unit[i].trackbuf);
-   if (q)
-   blk_cleanup_queue(q);
-   }
-   }
-   blk_unregister_region(MKDEV(FLOPPY_MAJOR, 0), 256);
-   free_irq(IRQ_AMIGA_CIAA_TB, NULL);
-   free_irq(IRQ_AMIGA_DSKBLK, NULL);
-   custom.dmacon = DMAF_DISK; /* disable DMA */
-   amiga_chip_free(raw_buf);
-   unregister_blkdev(FLOPPY_MAJOR, "fd");
-}
-#endif
-
 static struct platform_driver amiga_floppy_driver = {
.driver   = {
.name   = "amiga-floppy",
-- 
2.19.1



[PATCH 13/13] floppy: convert to blk-mq

2018-10-11 Thread Omar Sandoval
From: Omar Sandoval 

This driver likes to fetch requests from all over the place, so make
queue_rq put requests on a list so that the logic stays the same. Tested
with QEMU.

Signed-off-by: Omar Sandoval 
---
 drivers/block/floppy.c | 74 +-
 1 file changed, 44 insertions(+), 30 deletions(-)

diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index acbafd831ea3..eb19125beb90 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -252,13 +252,13 @@ static int allowed_drive_mask = 0x33;
 
 static int irqdma_allocated;
 
-#include 
+#include 
 #include 
 #include/* for the compatibility eject ioctl */
 #include 
 
+static LIST_HEAD(floppy_reqs);
 static struct request *current_req;
-static void do_fd_request(struct request_queue *q);
 static int set_next_request(void);
 
 #ifndef fd_get_dma_residue
@@ -414,10 +414,10 @@ static struct floppy_drive_struct drive_state[N_DRIVE];
 static struct floppy_write_errors write_errors[N_DRIVE];
 static struct timer_list motor_off_timer[N_DRIVE];
 static struct gendisk *disks[N_DRIVE];
+static struct blk_mq_tag_set tag_sets[N_DRIVE];
 static struct block_device *opened_bdev[N_DRIVE];
 static DEFINE_MUTEX(open_lock);
 static struct floppy_raw_cmd *raw_cmd, default_raw_cmd;
-static int fdc_queue;
 
 /*
  * This struct defines the different floppy types.
@@ -2218,8 +2218,9 @@ static void floppy_end_request(struct request *req, 
blk_status_t error)
nr_bytes = blk_rq_bytes(req);
else
nr_bytes = current_count_sectors << 9;
-   if (__blk_end_request(req, error, nr_bytes))
+   if (blk_update_request(req, error, nr_bytes))
return;
+   __blk_mq_end_request(req, error);
 
/* We're done with the request */
floppy_off(drive);
@@ -2799,27 +2800,14 @@ static int make_raw_rw_request(void)
return 2;
 }
 
-/*
- * Round-robin between our available drives, doing one request from each
- */
 static int set_next_request(void)
 {
-   struct request_queue *q;
-   int old_pos = fdc_queue;
-
-   do {
-   q = disks[fdc_queue]->queue;
-   if (++fdc_queue == N_DRIVE)
-   fdc_queue = 0;
-   if (q) {
-   current_req = blk_fetch_request(q);
-   if (current_req) {
-   current_req->error_count = 0;
-   break;
-   }
-   }
-   } while (fdc_queue != old_pos);
-
+   current_req = list_first_entry_or_null(&floppy_reqs, struct request,
+  queuelist);
+   if (current_req) {
+   current_req->error_count = 0;
+   list_del_init(&current_req->queuelist);
+   }
return current_req != NULL;
 }
 
@@ -2903,29 +2891,38 @@ static void process_fd_request(void)
schedule_bh(redo_fd_request);
 }
 
-static void do_fd_request(struct request_queue *q)
+static blk_status_t floppy_queue_rq(struct blk_mq_hw_ctx *hctx,
+   const struct blk_mq_queue_data *bd)
 {
+   blk_mq_start_request(bd->rq);
+
if (WARN(max_buffer_sectors == 0,
 "VFS: %s called on non-open device\n", __func__))
-   return;
+   return BLK_STS_IOERR;
 
if (WARN(atomic_read(&usage_count) == 0,
 "warning: usage count=0, current_req=%p sect=%ld flags=%llx\n",
 current_req, (long)blk_rq_pos(current_req),
 (unsigned long long) current_req->cmd_flags))
-   return;
+   return BLK_STS_IOERR;
+
+   spin_lock_irq(&floppy_lock);
+   list_add_tail(&bd->rq->queuelist, &floppy_reqs);
+   spin_unlock_irq(&floppy_lock);
 
if (test_and_set_bit(0, &fdc_busy)) {
/* fdc busy, this new request will be treated when the
   current one is done */
is_alive(__func__, "old request running");
-   return;
+   return BLK_STS_OK;
}
+
command_status = FD_COMMAND_NONE;
__reschedule_timeout(MAXTIMEOUT, "fd_request");
set_fdc(0);
process_fd_request();
is_alive(__func__, "");
+   return BLK_STS_OK;
 }
 
 static const struct cont_t poll_cont = {
@@ -4488,6 +4485,10 @@ static struct platform_driver floppy_driver = {
},
 };
 
+static const struct blk_mq_ops floppy_mq_ops = {
+   .queue_rq = floppy_queue_rq,
+};
+
 static struct platform_device floppy_device[N_DRIVE];
 
 static bool floppy_available(int drive)
@@ -4529,15 +4530,28 @@ static int __init do_floppy_init(void)
return -ENOMEM;
 
for (drive = 0; drive < N_DRIVE; drive++) {
+   struct blk_mq_tag_set *set;
+
disks[drive] = alloc_disk(1);
if (!disks[drive])

[PATCH 10/13] ataflop: fix error handling during setup

2018-10-11 Thread Omar Sandoval
From: Omar Sandoval 

Move queue allocation next to disk allocation to fix a couple of issues:

- If add_disk() hasn't been called, we should clear disk->queue before
  calling put_disk().
- If we fail to allocate a request queue, we still need to put all of
  the disks, not just the ones that we allocated queues for.

Signed-off-by: Omar Sandoval 
---
 drivers/block/ataflop.c | 25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/drivers/block/ataflop.c b/drivers/block/ataflop.c
index 17df631c5d85..0144d598ac47 100644
--- a/drivers/block/ataflop.c
+++ b/drivers/block/ataflop.c
@@ -2014,6 +2014,11 @@ static int __init atari_floppy_init (void)
unit[i].disk = alloc_disk(1);
if (!unit[i].disk)
goto Enomem;
+
+   unit[i].disk->queue = blk_init_queue(do_fd_request,
+    &ataflop_lock);
+   if (!unit[i].disk->queue)
+   goto Enomem;
}
 
if (UseTrackbuffer < 0)
@@ -2045,10 +2050,6 @@ static int __init atari_floppy_init (void)
sprintf(unit[i].disk->disk_name, "fd%d", i);
unit[i].disk->fops = _fops;
unit[i].disk->private_data = [i];
-   unit[i].disk->queue = blk_init_queue(do_fd_request,
-   &ataflop_lock);
-   if (!unit[i].disk->queue)
-   goto Enomem;
set_capacity(unit[i].disk, MAX_DISK_SIZE * 2);
add_disk(unit[i].disk);
}
@@ -2063,13 +2064,17 @@ static int __init atari_floppy_init (void)
 
return 0;
 Enomem:
-   while (i--) {
-   struct request_queue *q = unit[i].disk->queue;
+   do {
+   struct gendisk *disk = unit[i].disk;
 
-   put_disk(unit[i].disk);
-   if (q)
-   blk_cleanup_queue(q);
-   }
+   if (disk) {
+   if (disk->queue) {
+   blk_cleanup_queue(disk->queue);
+   disk->queue = NULL;
+   }
+   put_disk(unit[i].disk);
+   }
+   } while (i--);
 
unregister_blkdev(FLOPPY_MAJOR, "fd");
return -ENOMEM;
-- 
2.19.1



[PATCH 11/13] ataflop: convert to blk-mq

2018-10-11 Thread Omar Sandoval
From: Omar Sandoval 

This driver is already pretty broken, in that it has two wait_events()
(one in stdma_lock()) in request_fn. Get rid of the first one by
freezing/quiescing the queue on format, and the second one by replacing
it with stdma_try_lock(). The rest is straightforward. Compile-tested
only and probably incorrect.

Cc: Laurent Vivier 
Signed-off-by: Omar Sandoval 
---
 drivers/block/ataflop.c | 188 ++--
 1 file changed, 86 insertions(+), 102 deletions(-)
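
"Freezing/quiescing the queue on format" refers to the standard blk-mq pattern
sketched below (hedged, generic code, not the patch itself): drain in-flight
requests, stop new calls into ->queue_rq(), drive the hardware directly, then
resume.

/* Illustrative sketch of freezing a queue around a direct hardware op. */
#include <linux/blkdev.h>
#include <linux/blk-mq.h>

static int example_format(struct request_queue *q)
{
	int ret;

	blk_mq_freeze_queue(q);		/* wait for in-flight requests to finish */
	blk_mq_quiesce_queue(q);	/* stop ->queue_rq() from being invoked */

	ret = 0;			/* talk to the controller directly here */

	blk_mq_unquiesce_queue(q);
	blk_mq_unfreeze_queue(q);
	return ret;
}

The second wait_event() is presumably avoided by bailing out of ->queue_rq()
(e.g. with BLK_STS_DEV_RESOURCE) when the shared ST-DMA lock cannot be taken,
so blk-mq simply retries the dispatch later.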

diff --git a/drivers/block/ataflop.c b/drivers/block/ataflop.c
index 0144d598ac47..5934e30d4805 100644
--- a/drivers/block/ataflop.c
+++ b/drivers/block/ataflop.c
@@ -66,7 +66,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
@@ -81,7 +81,6 @@
 
 static DEFINE_MUTEX(ataflop_mutex);
 static struct request *fd_request;
-static int fdc_queue;
 
 /*
  * WD1772 stuff
@@ -300,6 +299,7 @@ static struct atari_floppy_struct {
struct gendisk *disk;
int ref;
int type;
+   struct blk_mq_tag_set tag_set;
 } unit[FD_MAX_UNITS];
 
 #defineUD  unit[drive]
@@ -379,9 +379,6 @@ static int IsFormatting = 0, FormatError;
 static int UserSteprate[FD_MAX_UNITS] = { -1, -1 };
 module_param_array(UserSteprate, int, NULL, 0);
 
-/* Synchronization of FDC access. */
-static volatile int fdc_busy = 0;
-static DECLARE_WAIT_QUEUE_HEAD(fdc_wait);
 static DECLARE_COMPLETION(format_wait);
 
 static unsigned long changed_floppies = 0xff, fake_change = 0;
@@ -441,7 +438,6 @@ static void fd_times_out(struct timer_list *unused);
 static void finish_fdc( void );
 static void finish_fdc_done( int dummy );
 static void setup_req_params( int drive );
-static void redo_fd_request( void);
 static int fd_locked_ioctl(struct block_device *bdev, fmode_t mode, unsigned int
   cmd, unsigned long param);
 static void fd_probe( int drive );
@@ -459,8 +455,16 @@ static DEFINE_TIMER(fd_timer, check_change);

 static void fd_end_request_cur(blk_status_t err)
 {
-   if (!__blk_end_request_cur(fd_request, err))
+   if (err) {
+   blk_mq_end_request(fd_request, err);
+   return;
+   }
+
+   if (!blk_update_request(fd_request, err,
+   blk_rq_cur_bytes(fd_request))) {
+   __blk_mq_end_request(fd_request, err);
fd_request = NULL;
+   }
 }
 
 static inline void start_motor_off_timer(void)
@@ -706,7 +710,6 @@ static void fd_error( void )
if (SelectedDrive != -1)
SUD.track = -1;
}
-   redo_fd_request();
 }
 
 
@@ -724,14 +727,15 @@ static void fd_error( void )
 
 static int do_format(int drive, int type, struct atari_format_descr *desc)
 {
+   struct request_queue *q = unit[drive].disk->queue;
unsigned char   *p;
int sect, nsect;
unsigned long   flags;
+   int ret;
 
-   DPRINT(("do_format( dr=%d tr=%d he=%d offs=%d )\n",
-   drive, desc->track, desc->head, desc->sect_offset ));
+   blk_mq_freeze_queue(q);
+   blk_mq_quiesce_queue(q);
 
-   wait_event(fdc_wait, cmpxchg(&fdc_busy, 0, 1) == 0);
local_irq_save(flags);
stdma_lock(floppy_irq, NULL);
atari_turnon_irq( IRQ_MFP_FDC ); /* should be already, just to be sure 
*/
@@ -740,16 +744,16 @@ static int do_format(int drive, int type, struct 
atari_format_descr *desc)
if (type) {
if (--type >= NUM_DISK_MINORS ||
minor2disktype[type].drive_types > DriveType) {
-   redo_fd_request();
-   return -EINVAL;
+   ret = -EINVAL;
+   goto out;
}
type = minor2disktype[type].index;
UDT = &atari_disk_type[type];
}
 
if (!UDT || desc->track >= UDT->blocks/UDT->spt/2 || desc->head >= 2) {
-   redo_fd_request();
-   return -EINVAL;
+   ret = -EINVAL;
+   goto out;
}
 
nsect = UDT->spt;
@@ -788,8 +792,11 @@ static int do_format(int drive, int type, struct 
atari_format_descr *desc)
 
wait_for_completion(&format_wait);
 
-   redo_fd_request();
-   return( FormatError ? -EIO : 0 );   
+   ret = FormatError ? -EIO : 0;
+out:
+   blk_mq_unquiesce_queue(q);
+   blk_mq_unfreeze_queue(q);
+   return ret;
 }
 
 
@@ -819,7 +826,6 @@ static void do_fd_action( int drive )
else {
/* all sectors finished */
fd_end_request_cur(BLK_STS_OK);
-   redo_fd_request();
return;
}
}
@@ -1224,7 +1230,6 @@ static void fd_rwsec_done1(int status)
else {
/* all sectors finished */
fd_end_request_cur(BLK_STS_OK);
- 

[PATCH 09/13] ataflop: fold headers into C file

2018-10-11 Thread Omar Sandoval
From: Omar Sandoval 

atafd.h and atafdreg.h are only used from ataflop.c, so merge them in
there.

Signed-off-by: Omar Sandoval 
---
 arch/m68k/include/asm/atafd.h| 13 -
 arch/m68k/include/asm/atafdreg.h | 80 --
 drivers/block/ataflop.c  | 83 +++-
 3 files changed, 81 insertions(+), 95 deletions(-)
 delete mode 100644 arch/m68k/include/asm/atafd.h
 delete mode 100644 arch/m68k/include/asm/atafdreg.h

diff --git a/arch/m68k/include/asm/atafd.h b/arch/m68k/include/asm/atafd.h
deleted file mode 100644
index ad7014cad633..
--- a/arch/m68k/include/asm/atafd.h
+++ /dev/null
@@ -1,13 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ASM_M68K_FD_H
-#define _ASM_M68K_FD_H
-
-/* Definitions for the Atari Floppy driver */
-
-struct atari_format_descr {
-int track; /* to be formatted */
-int head;  /*   "" "" */
-int sect_offset;   /* offset of first sector */
-};
-
-#endif
diff --git a/arch/m68k/include/asm/atafdreg.h b/arch/m68k/include/asm/atafdreg.h
deleted file mode 100644
index c31b4919ed2d..
--- a/arch/m68k/include/asm/atafdreg.h
+++ /dev/null
@@ -1,80 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _LINUX_FDREG_H
-#define _LINUX_FDREG_H
-
-/*
-** WD1772 stuff
- */
-
-/* register codes */
-
-#define FDCSELREG_STP   (0x80)   /* command/status register */
-#define FDCSELREG_TRA   (0x82)   /* track register */
-#define FDCSELREG_SEC   (0x84)   /* sector register */
-#define FDCSELREG_DTA   (0x86)   /* data register */
-
-/* register names for FDC_READ/WRITE macros */
-
-#define FDCREG_CMD 0
-#define FDCREG_STATUS  0
-#define FDCREG_TRACK   2
-#define FDCREG_SECTOR  4
-#define FDCREG_DATA6
-
-/* command opcodes */
-
-#define FDCCMD_RESTORE  (0x00)   /*  -   */
-#define FDCCMD_SEEK (0x10)   /*   |  */
-#define FDCCMD_STEP (0x20)   /*   |  TYP 1 Commands  */
-#define FDCCMD_STIN (0x40)   /*   |  */
-#define FDCCMD_STOT (0x60)   /*  -   */
-#define FDCCMD_RDSEC(0x80)   /*  -   TYP 2 Commands  */
-#define FDCCMD_WRSEC(0xa0)   /*  -  "*/
-#define FDCCMD_RDADR(0xc0)   /*  -   */
-#define FDCCMD_RDTRA(0xe0)   /*   |  TYP 3 Commands  */
-#define FDCCMD_WRTRA(0xf0)   /*  -   */
-#define FDCCMD_FORCI(0xd0)   /*  -   TYP 4 Command   */
-
-/* command modifier bits */
-
-#define FDCCMDADD_SR6   (0x00)   /* step rate settings */
-#define FDCCMDADD_SR12  (0x01)
-#define FDCCMDADD_SR2   (0x02)
-#define FDCCMDADD_SR3   (0x03)
-#define FDCCMDADD_V (0x04)   /* verify */
-#define FDCCMDADD_H (0x08)   /* wait for spin-up */
-#define FDCCMDADD_U (0x10)   /* update track register */
-#define FDCCMDADD_M (0x10)   /* multiple sector access */
-#define FDCCMDADD_E (0x04)   /* head settling flag */
-#define FDCCMDADD_P (0x02)   /* precompensation off */
-#define FDCCMDADD_A0(0x01)   /* DAM flag */
-
-/* status register bits */
-
-#defineFDCSTAT_MOTORON (0x80)   /* motor on */
-#defineFDCSTAT_WPROT   (0x40)   /* write protected (FDCCMD_WR*) */
-#defineFDCSTAT_SPINUP  (0x20)   /* motor speed stable (Type I) */
-#defineFDCSTAT_DELDAM  (0x20)   /* sector has deleted DAM (Type II+III) */
-#defineFDCSTAT_RECNF   (0x10)   /* record not found */
-#defineFDCSTAT_CRC (0x08)   /* CRC error */
-#defineFDCSTAT_TR00(0x04)   /* Track 00 flag (Type I) */
-#defineFDCSTAT_LOST(0x04)   /* Lost Data (Type II+III) */
-#defineFDCSTAT_IDX (0x02)   /* Index status (Type I) */
-#defineFDCSTAT_DRQ (0x02)   /* DRQ status (Type II+III) */
-#defineFDCSTAT_BUSY(0x01)   /* FDC is busy */
-
-
-/* PSG Port A Bit Nr 0 .. Side Sel .. 0 -> Side 1  1 -> Side 2 */
-#define DSKSIDE (0x01)
-
-#define DSKDRVNONE  (0x06)
-#define DSKDRV0 (0x02)
-#define DSKDRV1 (0x04)
-
-/* step rates */
-#defineFDCSTEP_6   0x00
-#defineFDCSTEP_12  0x01
-#defineFDCSTEP_2   0x02
-#defineFDCSTEP_3   0x03
-
-#endif
diff --git a/drivers/block/ataflop.c b/drivers/block/ataflop.c
index dfb2c2622e5a..17df631c5d85 100644
--- a/drivers/block/ataflop.c
+++ b/drivers/block/ataflop.c
@@ -71,8 +71,6 @@
 #include 
 #include 
 
-#include 
-#include 
 #include 
 #include 
 #include 
@@ -85,6 +83,87 @@ static DEFINE_MUTEX(ataflop_mutex);
 static struct request *fd_request;
 static int fdc_queue;
 
+/*
+ * WD1772 stuff
+ */
+
+/* register codes */
+
+#define FDCSELREG_STP   (0x80)   /* command/status register */
+#define FDCSELREG_TRA   (0x82)   /* track register */
+#define FDCSELREG_SEC   (0x84)   /* sector register */
+#define FDCSELREG_DTA   (0x86)   /* data register */
+
+/* register names for FDC_READ/WRITE 

[PATCH 12/13] floppy: end whole request on error

2018-10-11 Thread Omar Sandoval
From: Omar Sandoval 

If floppy_end_request() gets passed an error, it should end the whole
request, not just the current segment.

Signed-off-by: Omar Sandoval 
---
 drivers/block/floppy.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index f2b6f4da1034..acbafd831ea3 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -2210,13 +2210,15 @@ static int do_format(int drive, struct format_descr 
*tmp_format_req)
 
 static void floppy_end_request(struct request *req, blk_status_t error)
 {
-   unsigned int nr_sectors = current_count_sectors;
+   unsigned int nr_bytes;
unsigned int drive = (unsigned long)req->rq_disk->private_data;
 
/* current_count_sectors can be zero if transfer failed */
if (error)
-   nr_sectors = blk_rq_cur_sectors(req);
-   if (__blk_end_request(req, error, nr_sectors << 9))
+   nr_bytes = blk_rq_bytes(req);
+   else
+   nr_bytes = current_count_sectors << 9;
+   if (__blk_end_request(req, error, nr_bytes))
return;
 
/* We're done with the request */
-- 
2.19.1



[PATCH 04/13] swim3: end whole request on error

2018-10-11 Thread Omar Sandoval
From: Omar Sandoval 

When swim3_end_request() gets passed an error, it seems that the intent
is to end the whole request, but we're only ending the current segment.

Signed-off-by: Omar Sandoval 
---
 drivers/block/swim3.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/swim3.c b/drivers/block/swim3.c
index df7ebe016e2c..24e121ee274b 100644
--- a/drivers/block/swim3.c
+++ b/drivers/block/swim3.c
@@ -266,7 +266,7 @@ static bool swim3_end_request(struct floppy_state *fs, 
blk_status_t err, unsigne
  err, nr_bytes, req);
 
if (err)
-   nr_bytes = blk_rq_cur_bytes(req);
+   nr_bytes = blk_rq_bytes(req);
rc = __blk_end_request(req, err, nr_bytes);
if (rc)
return true;
-- 
2.19.1



[PATCH 03/13] swim3: add real error handling in setup

2018-10-11 Thread Omar Sandoval
From: Omar Sandoval 

The driver doesn't have support for removing a device that has already
been configured, but with more careful ordering we can avoid the need
for that and make sure that we don't leak generic resources.

Signed-off-by: Omar Sandoval 
---
 drivers/block/swim3.c | 60 ++-
 1 file changed, 36 insertions(+), 24 deletions(-)

diff --git a/drivers/block/swim3.c b/drivers/block/swim3.c
index 469541c1e51e..df7ebe016e2c 100644
--- a/drivers/block/swim3.c
+++ b/drivers/block/swim3.c
@@ -1202,47 +1202,59 @@ static int swim3_add_device(struct macio_dev *mdev, int 
index)
 static int swim3_attach(struct macio_dev *mdev,
const struct of_device_id *match)
 {
+   struct floppy_state *fs;
struct gendisk *disk;
-   int index, rc;
+   int rc;
 
-   index = floppy_count++;
-   if (index >= MAX_FLOPPIES)
+   if (floppy_count >= MAX_FLOPPIES)
return -ENXIO;
 
-   /* Add the drive */
-   rc = swim3_add_device(mdev, index);
-   if (rc)
-   return rc;
-   /* Now register that disk. Same comment about failure handling */
-   disk = disks[index] = alloc_disk(1);
-   if (disk == NULL)
-   return -ENOMEM;
+   if (floppy_count == 0) {
+   rc = register_blkdev(FLOPPY_MAJOR, "fd");
+   if (rc)
+   return rc;
+   }
+
+   fs = &floppy_states[floppy_count];
+
+   disk = alloc_disk(1);
+   if (disk == NULL) {
+   rc = -ENOMEM;
+   goto out_unregister;
+   }
disk->queue = blk_init_queue(do_fd_request, &swim3_lock);
if (disk->queue == NULL) {
-   put_disk(disk);
-   return -ENOMEM;
+   rc = -ENOMEM;
+   goto out_put_disk;
}
blk_queue_bounce_limit(disk->queue, BLK_BOUNCE_HIGH);
-   disk->queue->queuedata = &floppy_states[index];
+   disk->queue->queuedata = fs;
 
-   if (index == 0) {
-   /* If we failed, there isn't much we can do as the driver is still
-* too dumb to remove the device, just bail out
-*/
-   if (register_blkdev(FLOPPY_MAJOR, "fd"))
-   return 0;
-   }
+   rc = swim3_add_device(mdev, floppy_count);
+   if (rc)
+   goto out_cleanup_queue;
 
disk->major = FLOPPY_MAJOR;
-   disk->first_minor = index;
+   disk->first_minor = floppy_count;
disk->fops = _fops;
-   disk->private_data = &floppy_states[index];
+   disk->private_data = fs;
disk->flags |= GENHD_FL_REMOVABLE;
-   sprintf(disk->disk_name, "fd%d", index);
+   sprintf(disk->disk_name, "fd%d", floppy_count);
set_capacity(disk, 2880);
add_disk(disk);
 
+   disks[floppy_count++] = disk;
return 0;
+
+out_cleanup_queue:
+   blk_cleanup_queue(disk->queue);
+   disk->queue = NULL;
+out_put_disk:
+   put_disk(disk);
+out_unregister:
+   if (floppy_count == 0)
+   unregister_blkdev(FLOPPY_MAJOR, "fd");
+   return rc;
 }
 
 static const struct of_device_id swim3_match[] =
-- 
2.19.1



[PATCH 02/13] swim: convert to blk-mq

2018-10-11 Thread Omar Sandoval
From: Omar Sandoval 

The only interesting thing here is that there may be two floppies (i.e.,
request queues) sharing the same controller, so we use the global struct
swim_priv->lock to check whether the controller is busy. Compile-tested
only.

Cc: Finn Thain 
Cc: Laurent Vivier 
Signed-off-by: Omar Sandoval 
---
 drivers/block/swim.c | 103 +--
 1 file changed, 51 insertions(+), 52 deletions(-)
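
The "controller is busy" check described above is the usual blk-mq idiom of
failing the dispatch with BLK_STS_DEV_RESOURCE so the block layer re-runs the
queue later; a minimal sketch (generic names, not the driver's code):

/* Illustrative sketch; ctrl_lock stands in for the shared controller lock. */
#include <linux/blk-mq.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(ctrl_lock);	/* one controller shared by the drives */

static blk_status_t example_queue_rq(struct blk_mq_hw_ctx *hctx,
				     const struct blk_mq_queue_data *bd)
{
	blk_status_t err;

	/* Another queue owns the controller: ask blk-mq to retry later. */
	if (!spin_trylock_irq(&ctrl_lock))
		return BLK_STS_DEV_RESOURCE;

	blk_mq_start_request(bd->rq);
	err = BLK_STS_OK;		/* do the actual transfer here */
	blk_mq_end_request(bd->rq, err);

	spin_unlock_irq(&ctrl_lock);
	return BLK_STS_OK;
}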

diff --git a/drivers/block/swim.c b/drivers/block/swim.c
index cbe909c51847..bb3ac4da5b8d 100644
--- a/drivers/block/swim.c
+++ b/drivers/block/swim.c
@@ -19,7 +19,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
@@ -190,6 +190,7 @@ struct floppy_state {
int ref_count;
 
struct gendisk *disk;
+   struct blk_mq_tag_set tag_set;
 
/* parent controller */
 
@@ -211,7 +212,6 @@ enum head {
 struct swim_priv {
struct swim __iomem *base;
spinlock_t lock;
-   int fdc_queue;
int floppy_count;
struct floppy_state unit[FD_MAX_UNIT];
 };
@@ -525,58 +525,38 @@ static blk_status_t floppy_read_sectors(struct 
floppy_state *fs,
return 0;
 }
 
-static struct request *swim_next_request(struct swim_priv *swd)
+static blk_status_t swim_queue_rq(struct blk_mq_hw_ctx *hctx,
+ const struct blk_mq_queue_data *bd)
 {
-   struct request_queue *q;
-   struct request *rq;
-   int old_pos = swd->fdc_queue;
+   struct floppy_state *fs = hctx->queue->queuedata;
+   struct swim_priv *swd = fs->swd;
+   struct request *req = bd->rq;
+   blk_status_t err;
 
-   do {
-   q = swd->unit[swd->fdc_queue].disk->queue;
-   if (++swd->fdc_queue == swd->floppy_count)
-   swd->fdc_queue = 0;
-   if (q) {
-   rq = blk_fetch_request(q);
-   if (rq)
-   return rq;
-   }
-   } while (swd->fdc_queue != old_pos);
+   if (!spin_trylock_irq(&swd->lock))
+   return BLK_STS_DEV_RESOURCE;
 
-   return NULL;
-}
+   blk_mq_start_request(req);
 
-static void do_fd_request(struct request_queue *q)
-{
-   struct swim_priv *swd = q->queuedata;
-   struct request *req;
-   struct floppy_state *fs;
+   if (!fs->disk_in || rq_data_dir(req) == WRITE) {
+   err = BLK_STS_IOERR;
+   goto out;
+   }
 
-   req = swim_next_request(swd);
-   while (req) {
-   blk_status_t err = BLK_STS_IOERR;
+   do {
+   err = floppy_read_sectors(fs, blk_rq_pos(req),
+ blk_rq_cur_sectors(req),
+ bio_data(req->bio));
+   if (err)
+   break;
+   } while (blk_update_request(req, err, blk_rq_cur_bytes(req)));
+   blk_mq_end_request(req, err);
 
-   fs = req->rq_disk->private_data;
-   if (blk_rq_pos(req) >= fs->total_secs)
-   goto done;
-   if (!fs->disk_in)
-   goto done;
-   if (rq_data_dir(req) == WRITE && fs->write_protected)
-   goto done;
+   err = BLK_STS_OK;
+out:
+   spin_unlock_irq(&swd->lock);
+   return err;
 
-   switch (rq_data_dir(req)) {
-   case WRITE:
-   /* NOT IMPLEMENTED */
-   break;
-   case READ:
-   err = floppy_read_sectors(fs, blk_rq_pos(req),
- blk_rq_cur_sectors(req),
- bio_data(req->bio));
-   break;
-   }
-   done:
-   if (!__blk_end_request_cur(req, err))
-   req = swim_next_request(swd);
-   }
 }
 
 static struct floppy_struct floppy_type[4] = {
@@ -823,6 +803,10 @@ static int swim_add_floppy(struct swim_priv *swd, enum 
drive_location location)
return 0;
 }
 
+static const struct blk_mq_ops swim_mq_ops = {
+   .queue_rq = swim_queue_rq,
+};
+
 static int swim_floppy_init(struct swim_priv *swd)
 {
int err;
@@ -852,20 +836,33 @@ static int swim_floppy_init(struct swim_priv *swd)
spin_lock_init(&swd->lock);
 
for (drive = 0; drive < swd->floppy_count; drive++) {
+   struct blk_mq_tag_set *set;
+
swd->unit[drive].disk = alloc_disk(1);
if (swd->unit[drive].disk == NULL) {
err = -ENOMEM;
goto exit_put_disks;
}
-   swd->unit[drive].disk->queue = blk_init_queue(do_fd_request,
- &swd->lock);
-   if (

[PATCH 00/13] Convert floppy drivers to blk-mq

2018-10-11 Thread Omar Sandoval
From: Omar Sandoval 

Hi,

This series converts the various floppy drivers to blk-mq. Save for the
last one (floppy), they're compile-tested only. If I've Cc'd you, it's
because I think you might be able to test the changes. Please test if
you can, or let me know if there's a way to use QEMU/some other emulator
to test. The full series is available at [1]. Thanks!

Cc: Benjamin Herrenschmidt 
Cc: Finn Thain 
Cc: Laurent Vivier 

1: https://github.com/osandov/linux/tree/mq-conversions.

Omar Sandoval (13):
  swim: fix cleanup on setup error
  swim: convert to blk-mq
  swim3: add real error handling in setup
  swim3: end whole request on error
  swim3: convert to blk-mq
  amiflop: fold headers into C file
  amiflop: clean up on errors during setup
  amiflop: convert to blk-mq
  ataflop: fold headers into C file
  ataflop: fix error handling during setup
  ataflop: convert to blk-mq
  floppy: end whole request on error
  floppy: convert to blk-mq

 arch/m68k/include/asm/atafd.h|  13 --
 arch/m68k/include/asm/atafdreg.h |  80 
 drivers/block/amiflop.c  | 330 +++
 drivers/block/ataflop.c  | 288 ---
 drivers/block/floppy.c   |  80 +---
 drivers/block/swim.c | 116 ++-
 drivers/block/swim3.c| 221 +++--
 include/linux/amifd.h|  63 --
 include/linux/amifdreg.h |  82 
 9 files changed, 604 insertions(+), 669 deletions(-)
 delete mode 100644 arch/m68k/include/asm/atafd.h
 delete mode 100644 arch/m68k/include/asm/atafdreg.h
 delete mode 100644 include/linux/amifd.h
 delete mode 100644 include/linux/amifdreg.h

-- 
2.19.1



[PATCH 01/13] swim: fix cleanup on setup error

2018-10-11 Thread Omar Sandoval
From: Omar Sandoval 

If we fail to allocate the request queue for a disk, we still need to
free that disk, not just the previous ones. Additionally, we need to
cleanup the previous request queues.

Signed-off-by: Omar Sandoval 
---
 drivers/block/swim.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/block/swim.c b/drivers/block/swim.c
index 0e31884a9519..cbe909c51847 100644
--- a/drivers/block/swim.c
+++ b/drivers/block/swim.c
@@ -887,8 +887,17 @@ static int swim_floppy_init(struct swim_priv *swd)
 
 exit_put_disks:
unregister_blkdev(FLOPPY_MAJOR, "fd");
-   while (drive--)
-   put_disk(swd->unit[drive].disk);
+   do {
+   struct gendisk *disk = swd->unit[drive].disk;
+
+   if (disk) {
+   if (disk->queue) {
+   blk_cleanup_queue(disk->queue);
+   disk->queue = NULL;
+   }
+   put_disk(disk);
+   }
+   } while (drive--);
return err;
 }
 
-- 
2.19.1



Re: [PATCH] blk-mq-debugfs: Also show requests that have not yet been started

2018-10-05 Thread Omar Sandoval
On Fri, Oct 05, 2018 at 08:18:00AM -0600, Jens Axboe wrote:
> On 10/4/18 11:35 AM, Bart Van Assche wrote:
> > When debugging e.g. the SCSI timeout handler it is important that
> > requests that have not yet been started or that already have
> > completed are also reported through debugfs.
> 
> Thanks, I like this better - applied. BTW, what's up with the
> reverse ordering on this:
> 
> > Signed-off-by: Bart Van Assche 
> > Cc: Christoph Hellwig 
> > Cc: Ming Lei 
> > Cc: Hannes Reinecke 
> > Cc: Johannes Thumshirn 
> > Cc: Martin K. Petersen 
> 
> For some reason that really annoys me, and I see it in various
> patches these days. IMHO the SOB should be last, with whatever
> acks, reviews, CC, before that.

I could've sworn that this guideline was even documented somewhere, but
I can't find it now ¯\_(ツ)_/¯


Re: [PATCH blktests 0/3] Add NVMeOF multipath tests

2018-10-05 Thread Omar Sandoval
On Thu, Sep 27, 2018 at 04:26:42PM -0700, Bart Van Assche wrote:
> On Tue, 2018-09-18 at 17:18 -0700, Omar Sandoval wrote:
> > On Tue, Sep 18, 2018 at 05:02:47PM -0700, Bart Van Assche wrote:
> > > On 9/18/18 4:24 PM, Omar Sandoval wrote:
> > > > On Tue, Sep 18, 2018 at 02:20:59PM -0700, Bart Van Assche wrote:
> > > > > Can you have a look at the updated master branch of
> > > > > https://github.com/bvanassche/blktests? That code should no longer 
> > > > > fail if
> > > > > unloading the nvme kernel module fails. Please note that you will need
> > > > > kernel v4.18 to test these scripts - a KASAN complaint appears if I 
> > > > > run
> > > > > these tests against kernel v4.19-rc4.
> > > > 
> > > > Thanks, these pass now. Is it expected that my nvme device gets a
> > > > multipath device configured after running these tests?
> > > > 
> > > > $ lsblk
> > > > NAME MAJ:MIN RM SIZE RO TYPE  MOUNTPOINT
> > > > vda  254:00  16G  0 disk
> > > > └─vda1   254:10  16G  0 part  /
> > > > vdb  254:16   0   8G  0 disk
> > > > vdc  254:32   0   8G  0 disk
> > > > vdd  254:48   0   8G  0 disk
> > > > nvme0n1  259:00   8G  0 disk
> > > > └─mpatha 253:00   8G  0 mpath
> > > 
> > > No, all multipath devices that were created during a test should be 
> > > removed
> > > before that test finishes. I will look into this.
> > > 
> > > > Also, can you please fix:
> > > > 
> > > > _have_kernel_option NVME_MULTIPATH && exit 1
> > > > 
> > > > to not exit on failure? It should use SKIP_REASON and return 1. You
> > > > might need to add something like _dont_have_kernel_option to properly
> > > > handle the case where the config is not found.
> > > 
> > > OK, I will change this.
> > > 
> > > > Side note which isn't a blocker for merging is that there's a lot of
> > > > duplicated code between these helpers and the srp helpers. How hard
> > > > would it be to refactor that?
> > > 
> > > Are you perhaps referring to the code that is shared between the
> > > tests/srp/rc tests/nvmeof-mp/rc shell scripts?
> > 
> > Yes, those.
> > 
> > > The hardest part is probably
> > > to chose a location where to store these functions. Should I create a file
> > > with common code under common/, under tests/srp/, under tests/nvmeof-mp/ 
> > > or
> > > perhaps somewhere else?
> > 
> > Just put it under common.
> 
> Hi Omar,
> 
> All feedback mentioned above has been addressed. The following pull request 
> has
> been updated: https://github.com/osandov/blktests/pull/33. Please let me know
> if you want me to post these patches on the linux-block mailing list.
> 
> Note: neither the upstream kernel v4.18 nor v4.19-rc4 are stable enough to 
> pass
> all nvmeof-mp tests if kernel debugging options like KASAN are enabled.
> Additionally, the NVMe device_add_disk() race condition often causes 
> multipathd
> to refuse to consider /dev/nvme... devices. The output on my test setup is as
> follows (all tests pass):
> 
> # ./check -q nvmeof-mp
> nvmeof-mp/001 (Log in and log out)   [passed]
> runtime  1.528s  ...  1.909s
> nvmeof-mp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
> runtime  38.968s  ...  38.571s
> nvmeof-mp/004 (File I/O on top of multipath concurrently with logout and login (sq-on-mq)) [passed]
> runtime  38.632s  ...  37.529s
> nvmeof-mp/005 (Direct I/O with large transfer sizes and bs=4M) [passed]
> runtime  13.382s  ...  13.684s
> nvmeof-mp/006 (Direct I/O with large transfer sizes and bs=8M) [passed]
> runtime  13.511s  ...  13.480s
> nvmeof-mp/009 (Buffered I/O with large transfer sizes and bs=4M) [passed]
> runtime  13.665s  ...  13.763s
> nvmeof-mp/010 (Buffered I/O with large transfer sizes and bs=8M) [passed]
> runtime  13.442s  ...  13.900s
> nvmeof-mp/011 (Block I/O on top of multipath concurrently with logout and login) [passed]
> runtime  37.988s  ...  37.945s
> nvmeof-mp/012 (dm-mpath on top of multiple I/O schedulers)   [passed]
> runtime  21.659s  ...  21.733s

Thanks, Bart, merged.


[PATCH] kyber: fix integer overflow of latency targets on 32-bit

2018-09-28 Thread Omar Sandoval
From: Omar Sandoval 

NSEC_PER_SEC has type long, so 5 * NSEC_PER_SEC is calculated as a long.
However, 5 seconds is 5,000,000,000 nanoseconds, which overflows a
32-bit long. Make sure all of the targets are calculated as 64-bit
values.

Fixes: 6e25cb01ea20 ("kyber: implement improved heuristics")
Reported-by: Stephen Rothwell 
Signed-off-by: Omar Sandoval 
---
 block/kyber-iosched.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
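
The overflow is easy to reproduce in isolation; a small standalone sketch (with
NSEC_PER_SEC defined the way the kernel does, as a long) shows why the ULL
suffix matters when long is 32 bits:

/* Standalone demonstration of the 32-bit overflow; not kernel code. */
#include <stdio.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000L	/* long is 32 bits on the affected targets */

int main(void)
{
	/* A 32-bit multiply wraps: 5,000,000,000 mod 2^32 = 705,032,704. */
	uint32_t wrapped = 5U * (uint32_t)NSEC_PER_SEC;
	/* Forcing a 64-bit multiply keeps the full value. */
	uint64_t full = 5ULL * NSEC_PER_SEC;

	printf("32-bit multiply: %u\n", wrapped);
	printf("64-bit multiply: %llu\n", (unsigned long long)full);
	return 0;
}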

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 2b62e362fb36..eccac01a10b6 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -79,9 +79,9 @@ static const unsigned int kyber_depth[] = {
  * Default latency targets for each scheduling domain.
  */
 static const u64 kyber_latency_targets[] = {
-   [KYBER_READ] = 2 * NSEC_PER_MSEC,
-   [KYBER_WRITE] = 10 * NSEC_PER_MSEC,
-   [KYBER_DISCARD] = 5 * NSEC_PER_SEC,
+   [KYBER_READ] = 2ULL * NSEC_PER_MSEC,
+   [KYBER_WRITE] = 10ULL * NSEC_PER_MSEC,
+   [KYBER_DISCARD] = 5ULL * NSEC_PER_SEC,
 };
 
 /*
-- 
2.19.0



[PATCH v2 0/5] kyber: better heuristics

2018-09-27 Thread Omar Sandoval
From: Omar Sandoval 

Hi,

This is my series to improve the heuristics used by Kyber. Patches 1 and
2 are preparation. Patch 3 is a minor optimization. Patch 4 is the main
change, and includes a detailed description of the new heuristics. Patch
5 adds tracepoints for debugging. This is basically the same as the RFC
I posted back in August [1] with one added tracepoint (kyber_throttled)
and rebased on linux-block/for-next.

Thanks!

1: https://www.spinics.net/lists/linux-block/msg29453.html

Omar Sandoval (5):
  block: move call of scheduler's ->completed_request() hook
  block: export blk_stat_enable_accounting()
  kyber: don't make domain token sbitmap larger than necessary
  kyber: implement improved heuristics
  kyber: add tracepoints

 block/blk-mq-sched.h |   4 +-
 block/blk-mq.c   |   5 +-
 block/blk-stat.c |   1 +
 block/kyber-iosched.c| 547 ---
 include/linux/elevator.h |   2 +-
 include/trace/events/kyber.h |  96 ++
 6 files changed, 409 insertions(+), 246 deletions(-)
 create mode 100644 include/trace/events/kyber.h

-- 
2.19.0



[PATCH v2 1/5] block: move call of scheduler's ->completed_request() hook

2018-09-27 Thread Omar Sandoval
From: Omar Sandoval 

Commit 4bc6339a583c ("block: move blk_stat_add() to
__blk_mq_end_request()") consolidated some calls using ktime_get() so
we'd only need to call it once. Kyber's ->completed_request() hook also
calls ktime_get(), so let's move it to the same place, too.

Signed-off-by: Omar Sandoval 
---
 block/blk-mq-sched.h | 4 ++--
 block/blk-mq.c   | 5 +++--
 block/kyber-iosched.c| 5 ++---
 include/linux/elevator.h | 2 +-
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 4e028ee42430..8a9544203173 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -49,12 +49,12 @@ blk_mq_sched_allow_merge(struct request_queue *q, struct 
request *rq,
return true;
 }
 
-static inline void blk_mq_sched_completed_request(struct request *rq)
+static inline void blk_mq_sched_completed_request(struct request *rq, u64 now)
 {
struct elevator_queue *e = rq->q->elevator;
 
if (e && e->type->ops.mq.completed_request)
-   e->type->ops.mq.completed_request(rq);
+   e->type->ops.mq.completed_request(rq, now);
 }
 
 static inline void blk_mq_sched_started_request(struct request *rq)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d384ab700afd..1e72d53e8f2d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -528,6 +528,9 @@ inline void __blk_mq_end_request(struct request *rq, 
blk_status_t error)
blk_stat_add(rq, now);
}
 
+   if (rq->internal_tag != -1)
+   blk_mq_sched_completed_request(rq, now);
+
blk_account_io_done(rq, now);
 
if (rq->end_io) {
@@ -564,8 +567,6 @@ static void __blk_mq_complete_request(struct request *rq)
 
if (!blk_mq_mark_complete(rq))
return;
-   if (rq->internal_tag != -1)
-   blk_mq_sched_completed_request(rq);
 
if (!test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags)) {
rq->q->softirq_done_fn(rq);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index a1660bafc912..95d062c07c61 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -558,12 +558,12 @@ static void kyber_finish_request(struct request *rq)
rq_clear_domain_token(kqd, rq);
 }
 
-static void kyber_completed_request(struct request *rq)
+static void kyber_completed_request(struct request *rq, u64 now)
 {
struct request_queue *q = rq->q;
struct kyber_queue_data *kqd = q->elevator->elevator_data;
unsigned int sched_domain;
-   u64 now, latency, target;
+   u64 latency, target;
 
/*
 * Check if this request met our latency goal. If not, quickly gather
@@ -585,7 +585,6 @@ static void kyber_completed_request(struct request *rq)
if (blk_stat_is_active(kqd->cb))
return;
 
-   now = ktime_get_ns();
if (now < rq->io_start_time_ns)
return;
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index a02deea30185..015bb59c0331 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -111,7 +111,7 @@ struct elevator_mq_ops {
void (*insert_requests)(struct blk_mq_hw_ctx *, struct list_head *, 
bool);
struct request *(*dispatch_request)(struct blk_mq_hw_ctx *);
bool (*has_work)(struct blk_mq_hw_ctx *);
-   void (*completed_request)(struct request *);
+   void (*completed_request)(struct request *, u64);
void (*started_request)(struct request *);
void (*requeue_request)(struct request *);
struct request *(*former_request)(struct request_queue *, struct 
request *);
-- 
2.19.0



[PATCH v2 2/5] block: export blk_stat_enable_accounting()

2018-09-27 Thread Omar Sandoval
From: Omar Sandoval 

Kyber will need this in a future change if it is built as a module.

Signed-off-by: Omar Sandoval 
---
 block/blk-stat.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/blk-stat.c b/block/blk-stat.c
index 7587b1c3caaf..90561af85a62 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -190,6 +190,7 @@ void blk_stat_enable_accounting(struct request_queue *q)
blk_queue_flag_set(QUEUE_FLAG_STATS, q);
spin_unlock(&q->stats->lock);
 }
+EXPORT_SYMBOL_GPL(blk_stat_enable_accounting);
 
 struct blk_queue_stats *blk_alloc_queue_stats(void)
 {
-- 
2.19.0



[PATCH v2 5/5] kyber: add tracepoints

2018-09-27 Thread Omar Sandoval
From: Omar Sandoval 

When debugging Kyber, it's really useful to know what latencies we've
been having, how the domain depths have been adjusted, and if we've
actually been throttling. Add three tracepoints, kyber_latency,
kyber_adjust, and kyber_throttled, to record that.

Signed-off-by: Omar Sandoval 
---
 block/kyber-iosched.c| 52 ---
 include/trace/events/kyber.h | 96 
 2 files changed, 130 insertions(+), 18 deletions(-)
 create mode 100644 include/trace/events/kyber.h

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index adc8e6393829..2b62e362fb36 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -30,6 +30,9 @@
 #include "blk-mq-sched.h"
 #include "blk-mq-tag.h"
 
+#define CREATE_TRACE_POINTS
+#include 
+
 /*
  * Scheduling domains: the device is divided into multiple domains based on the
  * request type.
@@ -42,6 +45,13 @@ enum {
KYBER_NUM_DOMAINS,
 };
 
+static const char *kyber_domain_names[] = {
+   [KYBER_READ] = "READ",
+   [KYBER_WRITE] = "WRITE",
+   [KYBER_DISCARD] = "DISCARD",
+   [KYBER_OTHER] = "OTHER",
+};
+
 enum {
/*
 * In order to prevent starvation of synchronous requests by a flood of
@@ -122,6 +132,11 @@ enum {
KYBER_IO_LATENCY,
 };
 
+static const char *kyber_latency_type_names[] = {
+   [KYBER_TOTAL_LATENCY] = "total",
+   [KYBER_IO_LATENCY] = "I/O",
+};
+
 /*
  * Per-cpu latency histograms: total latency and I/O latency for each 
scheduling
  * domain except for KYBER_OTHER.
@@ -144,6 +159,8 @@ struct kyber_ctx_queue {
 } cacheline_aligned_in_smp;
 
 struct kyber_queue_data {
+   struct request_queue *q;
+
/*
 * Each scheduling domain has a limited number of in-flight requests
 * device-wide, limited by these tokens.
@@ -249,6 +266,10 @@ static int calculate_percentile(struct kyber_queue_data 
*kqd,
}
memset(buckets, 0, sizeof(kqd->latency_buckets[sched_domain][type]));
 
+   trace_kyber_latency(kqd->q, kyber_domain_names[sched_domain],
+   kyber_latency_type_names[type], percentile,
+   bucket + 1, 1 << KYBER_LATENCY_SHIFT, samples);
+
return bucket;
 }
 
@@ -256,8 +277,11 @@ static void kyber_resize_domain(struct kyber_queue_data 
*kqd,
unsigned int sched_domain, unsigned int depth)
 {
depth = clamp(depth, 1U, kyber_depth[sched_domain]);
-   if (depth != kqd->domain_tokens[sched_domain].sb.depth)
+   if (depth != kqd->domain_tokens[sched_domain].sb.depth) {
sbitmap_queue_resize(&kqd->domain_tokens[sched_domain], depth);
+   trace_kyber_adjust(kqd->q, kyber_domain_names[sched_domain],
+  depth);
+   }
 }
 
 static void kyber_timer_fn(struct timer_list *t)
@@ -360,6 +384,8 @@ static struct kyber_queue_data 
*kyber_queue_data_alloc(struct request_queue *q)
if (!kqd)
goto err;
 
+   kqd->q = q;
+
kqd->cpu_latency = alloc_percpu_gfp(struct kyber_cpu_latency,
GFP_KERNEL | __GFP_ZERO);
if (!kqd->cpu_latency)
@@ -756,6 +782,9 @@ kyber_dispatch_cur_domain(struct kyber_queue_data *kqd,
rq_set_domain_token(rq, nr);
list_del_init(>queuelist);
return rq;
+   } else {
+   trace_kyber_throttled(kqd->q,
+ 
kyber_domain_names[khd->cur_domain]);
}
} else if (sbitmap_any_bit_set(&khd->kcq_map[khd->cur_domain])) {
nr = kyber_get_domain_token(kqd, khd, hctx);
@@ -766,6 +795,9 @@ kyber_dispatch_cur_domain(struct kyber_queue_data *kqd,
rq_set_domain_token(rq, nr);
list_del_init(>queuelist);
return rq;
+   } else {
+   trace_kyber_throttled(kqd->q,
+ 
kyber_domain_names[khd->cur_domain]);
}
}
 
@@ -944,23 +976,7 @@ static int kyber_cur_domain_show(void *data, struct 
seq_file *m)
struct blk_mq_hw_ctx *hctx = data;
struct kyber_hctx_data *khd = hctx->sched_data;
 
-   switch (khd->cur_domain) {
-   case KYBER_READ:
-   seq_puts(m, "READ\n");
-   break;
-   case KYBER_WRITE:
-   seq_puts(m, "WRITE\n");
-   break;
-   case KYBER_DISCARD:
-   seq_puts(m, "DISCARD\n");
-   break;
-   case KYBER_OTHER:
-   seq_puts(m, "OTHER\n");
-   break;
-   default:
-   seq_printf(m, "%u\n", 

[PATCH v2 4/5] kyber: implement improved heuristics

2018-09-27 Thread Omar Sandoval
From: Omar Sandoval 

Kyber's current heuristics have a few flaws:

- It's based on the mean latency, but p99 latency tends to be more
  meaningful to anyone who cares about latency. The mean can also be
  skewed by rare outliers that the scheduler can't do anything about.
- The statistics calculations are purely time-based with a short window.
  This works for steady, high load, but is more sensitive to outliers
  with bursty workloads.
- It only considers the latency once an I/O has been submitted to the
  device, but the user cares about the time spent in the kernel, as
  well.

These are shortcomings of the generic blk-stat code which doesn't quite
fit the ideal use case for Kyber. So, this replaces the statistics with
a histogram used to calculate percentiles of total latency and I/O
latency, which we then use to adjust depths in a slightly more
intelligent manner:

- Sync and async writes are now the same domain.
- Discards are a separate domain.
- Domain queue depths are scaled by the ratio of the p99 total latency
  to the target latency (e.g., if the p99 latency is double the target
  latency, we will double the queue depth; if the p99 latency is half of
  the target latency, we can halve the queue depth).
- We use the I/O latency to determine whether we should scale queue
  depths down: we will only scale down if any domain's I/O latency
  exceeds the target latency, which is an indicator of congestion in the
  device.

These new heuristics are just as scalable as the heuristics they
replace.
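
As a rough, self-contained sketch of the depth-scaling rule above
(assumed simplifications: the percentile search and histogram layout are
condensed, and the real scheduler only applies the scaling when the I/O
latency indicates congestion, per the last bullet):

#include <stdio.h>

#define KYBER_LATENCY_SHIFT	2
#define KYBER_LATENCY_BUCKETS	(2 << KYBER_LATENCY_SHIFT)	/* 8 buckets */

/* Index of the first bucket whose cumulative count reaches the percentile. */
static int percentile_bucket(const unsigned int *buckets, unsigned int p)
{
	unsigned int total = 0, cumulative = 0;
	int i;

	for (i = 0; i < KYBER_LATENCY_BUCKETS; i++)
		total += buckets[i];
	if (!total)
		return -1;
	for (i = 0; i < KYBER_LATENCY_BUCKETS - 1; i++) {
		cumulative += buckets[i];
		if (cumulative * 100 >= total * p)
			break;
	}
	return i;
}

int main(void)
{
	/*
	 * Hypothetical total-latency histogram for the read domain; bucket i
	 * covers latencies up to (i + 1)/4 of the target.
	 */
	unsigned int hist[KYBER_LATENCY_BUCKETS] = {
		100, 300, 250, 150, 100, 60, 30, 10,
	};
	unsigned int depth = 256;
	int p99 = percentile_bucket(hist, 99);

	if (p99 >= 0) {
		/* Scale the depth by (p99 latency / target latency). */
		unsigned int new_depth =
			(depth * (unsigned int)(p99 + 1)) >> KYBER_LATENCY_SHIFT;

		printf("p99 in bucket %d: depth %u -> %u\n", p99, depth,
		       new_depth);
	}
	return 0;
}

With this made-up histogram the p99 lands at roughly 1.75x the target,
so the sketch grows the read depth from 256 to 448, matching the
"double the latency, double the depth" rule described above.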

Signed-off-by: Omar Sandoval 
---
 block/kyber-iosched.c | 497 --
 1 file changed, 279 insertions(+), 218 deletions(-)

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 08eb5295c18d..adc8e6393829 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -29,13 +29,16 @@
 #include "blk-mq-debugfs.h"
 #include "blk-mq-sched.h"
 #include "blk-mq-tag.h"
-#include "blk-stat.h"
 
-/* Scheduling domains. */
+/*
+ * Scheduling domains: the device is divided into multiple domains based on the
+ * request type.
+ */
 enum {
KYBER_READ,
-   KYBER_SYNC_WRITE,
-   KYBER_OTHER, /* Async writes, discard, etc. */
+   KYBER_WRITE,
+   KYBER_DISCARD,
+   KYBER_OTHER,
KYBER_NUM_DOMAINS,
 };
 
@@ -49,25 +52,82 @@ enum {
 };
 
 /*
- * Initial device-wide depths for each scheduling domain.
+ * Maximum device-wide depth for each scheduling domain.
  *
- * Even for fast devices with lots of tags like NVMe, you can saturate
- * the device with only a fraction of the maximum possible queue depth.
- * So, we cap these to a reasonable value.
+ * Even for fast devices with lots of tags like NVMe, you can saturate the
+ * device with only a fraction of the maximum possible queue depth. So, we cap
+ * these to a reasonable value.
  */
 static const unsigned int kyber_depth[] = {
[KYBER_READ] = 256,
-   [KYBER_SYNC_WRITE] = 128,
-   [KYBER_OTHER] = 64,
+   [KYBER_WRITE] = 128,
+   [KYBER_DISCARD] = 64,
+   [KYBER_OTHER] = 16,
 };
 
 /*
- * Scheduling domain batch sizes. We favor reads.
+ * Default latency targets for each scheduling domain.
+ */
+static const u64 kyber_latency_targets[] = {
+   [KYBER_READ] = 2 * NSEC_PER_MSEC,
+   [KYBER_WRITE] = 10 * NSEC_PER_MSEC,
+   [KYBER_DISCARD] = 5 * NSEC_PER_SEC,
+};
+
+/*
+ * Batch size (number of requests we'll dispatch in a row) for each scheduling
+ * domain.
  */
 static const unsigned int kyber_batch_size[] = {
[KYBER_READ] = 16,
-   [KYBER_SYNC_WRITE] = 8,
-   [KYBER_OTHER] = 8,
+   [KYBER_WRITE] = 8,
+   [KYBER_DISCARD] = 1,
+   [KYBER_OTHER] = 1,
+};
+
+/*
+ * Requests latencies are recorded in a histogram with buckets defined relative
+ * to the target latency:
+ *
+ * <= 1/4 * target latency
+ * <= 1/2 * target latency
+ * <= 3/4 * target latency
+ * <= target latency
+ * <= 1 1/4 * target latency
+ * <= 1 1/2 * target latency
+ * <= 1 3/4 * target latency
+ * > 1 3/4 * target latency
+ */
+enum {
+   /*
+* The width of the latency histogram buckets is
+* 1 / (1 << KYBER_LATENCY_SHIFT) * target latency.
+*/
+   KYBER_LATENCY_SHIFT = 2,
+   /*
+* The first (1 << KYBER_LATENCY_SHIFT) buckets are <= target latency,
+* thus, "good".
+*/
+   KYBER_GOOD_BUCKETS = 1 << KYBER_LATENCY_SHIFT,
+   /* There are also (1 << KYBER_LATENCY_SHIFT) "bad" buckets. */
+   KYBER_LATENCY_BUCKETS = 2 << KYBER_LATENCY_SHIFT,
+};
+
+/*
+ * We measure both the total latency and the I/O latency (i.e., latency after
+ * submitting to the device).
+ */
+enum {
+   KYBER_TOTAL_LATENCY,
+   KYBER_IO_LATENCY,
+};
+
+/*
+ * Per-cpu latency histograms: total latency and I/O latency for each 
scheduling
+ * domain except for KYBER_OTHER.
+ */
+struct kybe

[PATCH v2 3/5] kyber: don't make domain token sbitmap larger than necessary

2018-09-27 Thread Omar Sandoval
From: Omar Sandoval 

The domain token sbitmaps are currently initialized to the device queue
depth or 256, whichever is larger, and immediately resized to the
maximum depth for that domain (256, 128, or 64 for read, write, and
other, respectively). The sbitmap is never resized larger than that, so
it's unnecessary to allocate a bitmap larger than the maximum depth.
Let's just allocate it to the maximum depth to begin with. This will use
marginally less memory, and more importantly, give us a more appropriate
number of bits per sbitmap word.
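
To get a feel for the bits-per-word point, here is a small stand-alone
sketch; the shift selection below is an assumed model of sbitmap's
default behaviour (spread small maps over at least a few words), not a
copy of lib/sbitmap.c:

#include <stdio.h>

#define BITS_PER_LONG 64

/* Assumed model of how sbitmap picks bits per word when shift < 0. */
static unsigned int assumed_default_shift(unsigned int depth)
{
	unsigned int shift = 6;	/* log2(BITS_PER_LONG) */

	/* Spread small bitmaps over at least four words. */
	if (depth >= 4)
		while ((4U << shift) > depth)
			shift--;
	return shift;
}

int main(void)
{
	/* The per-domain maximum depths from kyber_depth[]. */
	unsigned int depths[] = { 256, 128, 64 };
	int i;

	for (i = 0; i < 3; i++) {
		unsigned int shift = assumed_default_shift(depths[i]);

		printf("depth %3u: %2u bits/word, %u words\n", depths[i],
		       1U << shift, depths[i] >> shift);
	}
	return 0;
}

Allocating at max_tokens (at least 256) fixes the word size for a
256-token map, and the later resize only lowers the usable depth rather
than repacking the words, so a 64-token domain ends up crammed into a
single 64-bit word; sizing to the maximum depth up front spreads those
64 tokens over several smaller words instead.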

Signed-off-by: Omar Sandoval 
---
 block/kyber-iosched.c | 15 ++-
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 95d062c07c61..08eb5295c18d 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -40,8 +40,6 @@ enum {
 };
 
 enum {
-   KYBER_MIN_DEPTH = 256,
-
/*
 * In order to prevent starvation of synchronous requests by a flood of
 * asynchronous requests, we reserve 25% of requests for synchronous
@@ -305,7 +303,6 @@ static int kyber_bucket_fn(const struct request *rq)
 static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
 {
struct kyber_queue_data *kqd;
-   unsigned int max_tokens;
unsigned int shift;
int ret = -ENOMEM;
int i;
@@ -320,25 +317,17 @@ static struct kyber_queue_data 
*kyber_queue_data_alloc(struct request_queue *q)
if (!kqd->cb)
goto err_kqd;
 
-   /*
-* The maximum number of tokens for any scheduling domain is at least
-* the queue depth of a single hardware queue. If the hardware doesn't
-* have many tags, still provide a reasonable number.
-*/
-   max_tokens = max_t(unsigned int, q->tag_set->queue_depth,
-  KYBER_MIN_DEPTH);
for (i = 0; i < KYBER_NUM_DOMAINS; i++) {
WARN_ON(!kyber_depth[i]);
WARN_ON(!kyber_batch_size[i]);
ret = sbitmap_queue_init_node(&kqd->domain_tokens[i],
- max_tokens, -1, false, GFP_KERNEL,
- q->node);
+ kyber_depth[i], -1, false,
+ GFP_KERNEL, q->node);
if (ret) {
while (--i >= 0)
sbitmap_queue_free(&kqd->domain_tokens[i]);
goto err_cb;
}
-   sbitmap_queue_resize(&kqd->domain_tokens[i], kyber_depth[i]);
}
 
shift = kyber_sched_tags_shift(kqd);
-- 
2.19.0



Re: [PATCH] blk-mq: I/O and timer unplugs are inverted in blktrace

2018-09-27 Thread Omar Sandoval
On Wed, Sep 26, 2018 at 02:35:50PM +0200, Ilya Dryomov wrote:
> trace_block_unplug() takes true for explicit unplugs and false for
> implicit unplugs.  schedule() unplugs are implicit and should be
> reported as timer unplugs.  While correct in the legacy code, this has
> been inverted in blk-mq since 4.11.
> 
> Cc: sta...@vger.kernel.org
> Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO 
> schedulers")

Reviewed-by: Omar Sandoval 

> Signed-off-by: Ilya Dryomov 
> ---
>  block/blk-mq.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 85a1c1a59c72..e3c39ea8e17b 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1628,7 +1628,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool 
> from_schedule)
>   BUG_ON(!rq->q);
>   if (rq->mq_ctx != this_ctx) {
>   if (this_ctx) {
> - trace_block_unplug(this_q, depth, 
> from_schedule);
> + trace_block_unplug(this_q, depth, 
> !from_schedule);
>   blk_mq_sched_insert_requests(this_q, this_ctx,
>   &ctx_list,
>   from_schedule);
> @@ -1648,7 +1648,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool 
> from_schedule)
>* on 'ctx_list'. Do those.
>*/
>   if (this_ctx) {
> - trace_block_unplug(this_q, depth, from_schedule);
> + trace_block_unplug(this_q, depth, !from_schedule);
>   blk_mq_sched_insert_requests(this_q, this_ctx, &ctx_list,
>   from_schedule);
>   }
> -- 
> 2.14.4
> 


Re: [PATCH 1/4] block/loop: Don't grab "struct file" for vfs_getattr() operation.

2018-09-25 Thread Omar Sandoval
On Wed, Sep 26, 2018 at 12:26:46AM +0900, Tetsuo Handa wrote:
> vfs_getattr() needs "struct path" rather than "struct file".
> Let's use path_get()/path_put() rather than get_file()/fput().

Reviewed-by: Omar Sandoval 

> Signed-off-by: Tetsuo Handa 
> Reviewed-by: Jan Kara 
> ---
>  drivers/block/loop.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index abad6d1..c2745e6 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1206,7 +1206,7 @@ static int loop_clr_fd(struct loop_device *lo)
>  static int
>  loop_get_status(struct loop_device *lo, struct loop_info64 *info)
>  {
> - struct file *file;
> + struct path path;
>   struct kstat stat;
>   int ret;
>  
> @@ -1231,16 +1231,16 @@ static int loop_clr_fd(struct loop_device *lo)
>   }
>  
>   /* Drop lo_ctl_mutex while we call into the filesystem. */
> - file = get_file(lo->lo_backing_file);
> + path = lo->lo_backing_file->f_path;
> + path_get(&path);
>   mutex_unlock(&lo->lo_ctl_mutex);
> - ret = vfs_getattr(&file->f_path, &stat, STATX_INO,
> -   AT_STATX_SYNC_AS_STAT);
> + ret = vfs_getattr(&path, &stat, STATX_INO, AT_STATX_SYNC_AS_STAT);
>   if (!ret) {
>   info->lo_device = huge_encode_dev(stat.dev);
>   info->lo_inode = stat.ino;
>   info->lo_rdevice = huge_encode_dev(stat.rdev);
>   }
> - fput(file);
> + path_put(&path);
>   return ret;
>  }
>  
> -- 
> 1.8.3.1
> 


Re: [PATCH] block: use nanosecond resolution for iostat

2018-09-24 Thread Omar Sandoval
On Fri, Sep 21, 2018 at 08:27:17PM -0600, Jens Axboe wrote:
> On 9/21/18 5:44 PM, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > Klaus Kusche reported that the I/O busy time in /proc/diskstats was not
> > updating properly on 4.18. This is because we started using ktime to
> > track elapsed time, and we convert nanoseconds to jiffies when we update
> > the partition counter. However, this gets rounded down, so any I/Os that
> > take less than a jiffy are not accounted for. Previously in this case,
> > the value of jiffies would sometimes increment while we were doing I/O,
> > so at least some I/Os were accounted for.
> > 
> > Let's convert the stats to use nanoseconds internally. We still report
> > milliseconds as before, now more accurately than ever. The value is
> > still truncated to 32 bits for backwards compatibility.
> 
> Thanks Omar, applied for 4.19.

Thanks, Jens. I also just pushed a regression test to blktests.


[PATCH] block: use nanosecond resolution for iostat

2018-09-21 Thread Omar Sandoval
From: Omar Sandoval 

Klaus Kusche reported that the I/O busy time in /proc/diskstats was not
updating properly on 4.18. This is because we started using ktime to
track elapsed time, and we convert nanoseconds to jiffies when we update
the partition counter. However, this gets rounded down, so any I/Os that
take less than a jiffy are not accounted for. Previously in this case,
the value of jiffies would sometimes increment while we were doing I/O,
so at least some I/Os were accounted for.

Let's convert the stats to use nanoseconds internally. We still report
milliseconds as before, now more accurately than ever. The value is
still truncated to 32 bits for backwards compatibility.
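
A toy user-space model of the effect, assuming HZ=250 so that one jiffy
is 4 ms (the kernel's nsecs_to_jiffies() rounding is more involved, but
sub-jiffy durations truncate to zero the same way):

#include <stdio.h>

#define HZ		250ULL
#define NSEC_PER_SEC	1000000000ULL
#define NSEC_PER_MSEC	1000000ULL

/* Truncating conversion, as a stand-in for nsecs_to_jiffies(). */
static unsigned long long nsecs_to_jiffies_model(unsigned long long ns)
{
	return ns / (NSEC_PER_SEC / HZ);
}

int main(void)
{
	unsigned long long io_ns = 3 * NSEC_PER_MSEC;	/* a 3 ms I/O */
	unsigned long long old_ticks = 0, new_nsecs = 0;
	int i;

	for (i = 0; i < 1000; i++) {
		old_ticks += nsecs_to_jiffies_model(io_ns);	/* adds 0 */
		new_nsecs += io_ns;				/* adds 3 ms */
	}
	printf("old accounting: %llu ms of I/O time\n", old_ticks * (1000 / HZ));
	printf("new accounting: %llu ms of I/O time\n", new_nsecs / NSEC_PER_MSEC);
	return 0;
}

A thousand 3 ms I/Os show up as 0 ms of I/O time under the old
jiffies-based counters, and as 3000 ms when the nanoseconds are
accumulated and only converted to milliseconds at read time.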

Fixes: 522a777566f5 ("block: consolidate struct request timestamp fields")
Cc: sta...@vger.kernel.org
Reported-by: Klaus Kusche 
Signed-off-by: Omar Sandoval 
---
 block/bio.c   | 2 +-
 block/blk-core.c  | 4 +---
 block/genhd.c | 6 +++---
 block/partition-generic.c | 6 +++---
 include/linux/genhd.h | 5 -
 5 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 8c680a776171..0093bed81c0e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1684,7 +1684,7 @@ void generic_end_io_acct(struct request_queue *q, int 
req_op,
const int sgrp = op_stat_group(req_op);
int cpu = part_stat_lock();
 
-   part_stat_add(cpu, part, ticks[sgrp], duration);
+   part_stat_add(cpu, part, nsecs[sgrp], jiffies_to_nsecs(duration));
part_round_stats(q, cpu, part);
part_dec_in_flight(q, part, op_is_write(req_op));
 
diff --git a/block/blk-core.c b/block/blk-core.c
index 4dbc93f43b38..cff0a60ee200 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2733,17 +2733,15 @@ void blk_account_io_done(struct request *req, u64 now)
 * containing request is enough.
 */
if (blk_do_io_stat(req) && !(req->rq_flags & RQF_FLUSH_SEQ)) {
-   unsigned long duration;
const int sgrp = op_stat_group(req_op(req));
struct hd_struct *part;
int cpu;
 
-   duration = nsecs_to_jiffies(now - req->start_time_ns);
cpu = part_stat_lock();
part = req->part;
 
part_stat_inc(cpu, part, ios[sgrp]);
-   part_stat_add(cpu, part, ticks[sgrp], duration);
+   part_stat_add(cpu, part, nsecs[sgrp], now - req->start_time_ns);
part_round_stats(req->q, cpu, part);
part_dec_in_flight(req->q, part, rq_data_dir(req));
 
diff --git a/block/genhd.c b/block/genhd.c
index 8cc719a37b32..be5bab20b2ab 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1343,18 +1343,18 @@ static int diskstats_show(struct seq_file *seqf, void 
*v)
   part_stat_read(hd, ios[STAT_READ]),
   part_stat_read(hd, merges[STAT_READ]),
   part_stat_read(hd, sectors[STAT_READ]),
-  jiffies_to_msecs(part_stat_read(hd, 
ticks[STAT_READ])),
+  (unsigned int)part_stat_read_msecs(hd, STAT_READ),
   part_stat_read(hd, ios[STAT_WRITE]),
   part_stat_read(hd, merges[STAT_WRITE]),
   part_stat_read(hd, sectors[STAT_WRITE]),
-  jiffies_to_msecs(part_stat_read(hd, 
ticks[STAT_WRITE])),
+  (unsigned int)part_stat_read_msecs(hd, STAT_WRITE),
   inflight[0],
   jiffies_to_msecs(part_stat_read(hd, io_ticks)),
   jiffies_to_msecs(part_stat_read(hd, time_in_queue)),
   part_stat_read(hd, ios[STAT_DISCARD]),
   part_stat_read(hd, merges[STAT_DISCARD]),
   part_stat_read(hd, sectors[STAT_DISCARD]),
-  jiffies_to_msecs(part_stat_read(hd, 
ticks[STAT_DISCARD]))
+  (unsigned int)part_stat_read_msecs(hd, STAT_DISCARD)
);
}
disk_part_iter_exit(&piter);
diff --git a/block/partition-generic.c b/block/partition-generic.c
index 5a8975a1201c..d3d14e81fb12 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -136,18 +136,18 @@ ssize_t part_stat_show(struct device *dev,
part_stat_read(p, ios[STAT_READ]),
part_stat_read(p, merges[STAT_READ]),
(unsigned long long)part_stat_read(p, sectors[STAT_READ]),
-   jiffies_to_msecs(part_stat_read(p, ticks[STAT_READ])),
+   (unsigned int)part_stat_read_msecs(p, STAT_READ),
part_stat_read(p, ios[STAT_WRITE]),
part_stat_read(p, merges[STAT_WRITE]),
(unsigned long long)part_stat_read(p, sectors[STAT_WRITE]),
-   jiffies_to_msecs(part_stat_read(p, ticks[STAT

Re: [PATCH blktests 0/3] Add NVMeOF multipath tests

2018-09-18 Thread Omar Sandoval
On Tue, Sep 18, 2018 at 05:02:47PM -0700, Bart Van Assche wrote:
> On 9/18/18 4:24 PM, Omar Sandoval wrote:
> > On Tue, Sep 18, 2018 at 02:20:59PM -0700, Bart Van Assche wrote:
> > > Can you have a look at the updated master branch of
> > > https://github.com/bvanassche/blktests? That code should no longer fail if
> > > unloading the nvme kernel module fails. Please note that you will need
> > > kernel v4.18 to test these scripts - a KASAN complaint appears if I run
> > > these tests against kernel v4.19-rc4.
> > 
> > Thanks, these pass now. Is it expected that my nvme device gets a
> > multipath device configured after running these tests?
> > 
> > $ lsblk
> > NAME MAJ:MIN RM SIZE RO TYPE  MOUNTPOINT
> > vda  254:00  16G  0 disk
> > └─vda1   254:10  16G  0 part  /
> > vdb  254:16   0   8G  0 disk
> > vdc  254:32   0   8G  0 disk
> > vdd  254:48   0   8G  0 disk
> > nvme0n1  259:00   8G  0 disk
> > └─mpatha 253:00   8G  0 mpath
> 
> No, all multipath devices that were created during a test should be removed
> before that test finishes. I will look into this.
> 
> > Also, can you please fix:
> > 
> > _have_kernel_option NVME_MULTIPATH && exit 1
> > 
> > to not exit on failure? It should use SKIP_REASON and return 1. You
> > might need to add something like _dont_have_kernel_option to properly
> > handle the case where the config is not found.
> 
> OK, I will change this.
> 
> > Side note which isn't a blocker for merging is that there's a lot of
> > duplicated code between these helpers and the srp helpers. How hard
> > would it be to refactor that?
> 
> Are you perhaps referring to the code that is shared between the
> tests/srp/rc tests/nvmeof-mp/rc shell scripts?

Yes, those.

> The hardest part is probably
> to chose a location where to store these functions. Should I create a file
> with common code under common/, under tests/srp/, under tests/nvmeof-mp/ or
> perhaps somewhere else?

Just put it under common.

Thanks!


Re: [PATCH blktests 0/3] Add NVMeOF multipath tests

2018-09-18 Thread Omar Sandoval
On Tue, Sep 18, 2018 at 02:20:59PM -0700, Bart Van Assche wrote:
> On 8/23/18 5:21 PM, Omar Sandoval wrote:
> > On Thu, Aug 23, 2018 at 01:53:33AM +, Bart Van Assche wrote:
> > > On Tue, 2018-08-21 at 08:46 +0200, Johannes Thumshirn wrote:
> > > > On Mon, Aug 20, 2018 at 03:46:45PM +, Bart Van Assche wrote:
> > > > > Moving these tests into the nvme directory is possible but will make 
> > > > > it
> > > > > harder to run the NVMeOF multipath tests separately. Are you fine 
> > > > > with this?
> > > > 
> > > > Both way's have it's up and downsides, I agree.
> > > > 
> > > > Having two distinct groups requires to run './check nvme nvmeof-mp' to
> > > > run full coverage with nvme.
> > > > 
> > > > Having it all in one group would require to run './check nvme 18 19 20
> > > > 21 22 23 24 ...' to get only the dm-mpath ones.
> > > > 
> > > > Honestly I hate both but your's (the two distinct groups) is probably
> > > > easier to handle in the end, I have to admit.
> > > 
> > > Omar, do you have a preference for one of the two aforementioned 
> > > approaches?
> > 
> > Let's keep it in a separate category, since lots of people running nvme
> > tests probably aren't interested in testing multipath.
> > 
> > A bunch of the tests failed with
> > 
> > modprobe: FATAL: Module nvme is in use.
> > 
> > Maybe related to my test VM having an nvme device?
> 
> Hello Omar,
> 
> Can you have a look at the updated master branch of
> https://github.com/bvanassche/blktests? That code should no longer fail if
> unloading the nvme kernel module fails. Please note that you will need
> kernel v4.18 to test these scripts - a KASAN complaint appears if I run
> these tests against kernel v4.19-rc4.

Thanks, these pass now. Is it expected that my nvme device gets a
multipath device configured after running these tests?

$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE  MOUNTPOINT
vda  254:00  16G  0 disk
└─vda1   254:10  16G  0 part  /
vdb  254:16   0   8G  0 disk
vdc  254:32   0   8G  0 disk
vdd  254:48   0   8G  0 disk
nvme0n1  259:00   8G  0 disk
└─mpatha 253:00   8G  0 mpath

Also, can you please fix:

_have_kernel_option NVME_MULTIPATH && exit 1

to not exit on failure? It should use SKIP_REASON and return 1. You
might need to add something like _dont_have_kernel_option to properly
handle the case where the config is not found.

Side note which isn't a blocker for merging is that there's a lot of
duplicated code between these helpers and the srp helpers. How hard
would it be to refactor that?


Re: [PATCH blktests 0/3] Add NVMeOF multipath tests

2018-09-12 Thread Omar Sandoval
On Thu, Aug 23, 2018 at 05:21:33PM -0700, Omar Sandoval wrote:
> On Thu, Aug 23, 2018 at 01:53:33AM +, Bart Van Assche wrote:
> > On Tue, 2018-08-21 at 08:46 +0200, Johannes Thumshirn wrote:
> > > On Mon, Aug 20, 2018 at 03:46:45PM +, Bart Van Assche wrote:
> > > > Moving these tests into the nvme directory is possible but will make it
> > > > harder to run the NVMeOF multipath tests separately. Are you fine with 
> > > > this?
> > > 
> > > Both way's have it's up and downsides, I agree.
> > > 
> > > Having two distinct groups requires to run './check nvme nvmeof-mp' to
> > > run full coverage with nvme.
> > > 
> > > Having it all in one group would require to run './check nvme 18 19 20
> > > 21 22 23 24 ...' to get only the dm-mpath ones.
> > > 
> > > Honestly I hate both but your's (the two distinct groups) is probably
> > > easier to handle in the end, I have to admit.
> > 
> > Omar, do you have a preference for one of the two aforementioned approaches?
> > 
> > Thanks,
> > 
> > Bart.
> > 
> 
> Let's keep it in a separate category, since lots of people running nvme
> tests probably aren't interested in testing multipath.
> 
> A bunch of the tests failed with
> 
> modprobe: FATAL: Module nvme is in use.
> 
> Maybe related to my test VM having an nvme device?

Ping, Bart, can you look into this? It'd be nice to get this in.


Re: [PATCH] null_blk: fix zoned support for non-rq based operation

2018-09-12 Thread Omar Sandoval
On Wed, Sep 12, 2018 at 04:15:28PM -0600, Jens Axboe wrote:
> The supported added for zones in null_blk seem to assume that
> only rq based operation is possible. But this depends on the
> queue_mode setting, if this is set to 0, then cmd->bio is what
> we need to be operating on. Right now any attempt to load null_blk
> with queue_mode=0 will insta-crash, since cmd->rq is NULL and
> null_handle_cmd() assumes it to always be set.
> 
> Make the zoned code deal with bio's instead, or pass in the
> appropriate sector/nr_sectors instead.
> 
> Fixes: ca4b2a011948 ("null_blk: add zone support")

I just added block/023 to blktests for this, so

Tested-by: Omar Sandoval 

> Signed-off-by: Jens Axboe 
> 
> ---
> 
> diff --git a/drivers/block/null_blk.h b/drivers/block/null_blk.h
> index d81781f22dba..34e0030f0592 100644
> --- a/drivers/block/null_blk.h
> +++ b/drivers/block/null_blk.h
> @@ -87,10 +87,10 @@ struct nullb {
>  #ifdef CONFIG_BLK_DEV_ZONED
>  int null_zone_init(struct nullb_device *dev);
>  void null_zone_exit(struct nullb_device *dev);
> -blk_status_t null_zone_report(struct nullb *nullb,
> - struct nullb_cmd *cmd);
> -void null_zone_write(struct nullb_cmd *cmd);
> -void null_zone_reset(struct nullb_cmd *cmd);
> +blk_status_t null_zone_report(struct nullb *nullb, struct bio *bio);
> +void null_zone_write(struct nullb_cmd *cmd, sector_t sector,
> + unsigned int nr_sectors);
> +void null_zone_reset(struct nullb_cmd *cmd, sector_t sector);
>  #else
>  static inline int null_zone_init(struct nullb_device *dev)
>  {
> @@ -98,11 +98,14 @@ static inline int null_zone_init(struct nullb_device *dev)
>  }
>  static inline void null_zone_exit(struct nullb_device *dev) {}
>  static inline blk_status_t null_zone_report(struct nullb *nullb,
> - struct nullb_cmd *cmd)
> + struct bio *bio)
>  {
>   return BLK_STS_NOTSUPP;
>  }
> -static inline void null_zone_write(struct nullb_cmd *cmd) {}
> -static inline void null_zone_reset(struct nullb_cmd *cmd) {}
> +static inline void null_zone_write(struct nullb_cmd *cmd, sector_t sector,
> +unsigned int nr_sectors)
> +{
> +}
> +static inline void null_zone_reset(struct nullb_cmd *cmd, sector_t sector) {}
>  #endif /* CONFIG_BLK_DEV_ZONED */
>  #endif /* __NULL_BLK_H */
> diff --git a/drivers/block/null_blk_main.c b/drivers/block/null_blk_main.c
> index 6127e3ff7b4b..093b614d6524 100644
> --- a/drivers/block/null_blk_main.c
> +++ b/drivers/block/null_blk_main.c
> @@ -1157,16 +1157,33 @@ static void null_restart_queue_async(struct nullb 
> *nullb)
>   }
>  }
>  
> +static bool cmd_report_zone(struct nullb *nullb, struct nullb_cmd *cmd)
> +{
> + struct nullb_device *dev = cmd->nq->dev;
> +
> + if (dev->queue_mode == NULL_Q_BIO) {
> + if (bio_op(cmd->bio) == REQ_OP_ZONE_REPORT) {
> + cmd->error = null_zone_report(nullb, cmd->bio);
> + return true;
> + }
> + } else {
> + if (req_op(cmd->rq) == REQ_OP_ZONE_REPORT) {
> + cmd->error = null_zone_report(nullb, cmd->rq->bio);
> + return true;
> + }
> + }
> +
> + return false;
> +}
> +
>  static blk_status_t null_handle_cmd(struct nullb_cmd *cmd)
>  {
>   struct nullb_device *dev = cmd->nq->dev;
>   struct nullb *nullb = dev->nullb;
>   int err = 0;
>  
> - if (req_op(cmd->rq) == REQ_OP_ZONE_REPORT) {
> - cmd->error = null_zone_report(nullb, cmd);
> + if (cmd_report_zone(nullb, cmd))
>   goto out;
> - }
>  
>   if (test_bit(NULLB_DEV_FL_THROTTLED, &dev->flags)) {
>   struct request *rq = cmd->rq;
> @@ -1234,10 +1251,24 @@ static blk_status_t null_handle_cmd(struct nullb_cmd 
> *cmd)
>   cmd->error = errno_to_blk_status(err);
>  
>   if (!cmd->error && dev->zoned) {
> - if (req_op(cmd->rq) == REQ_OP_WRITE)
> - null_zone_write(cmd);
> - else if (req_op(cmd->rq) == REQ_OP_ZONE_RESET)
> - null_zone_reset(cmd);
> + sector_t sector;
> + unsigned int nr_sectors;
> + int op;
> +
> + if (dev->queue_mode == NULL_Q_BIO) {
> + op = bio_op(cmd->bio);
> + sector = cmd->bio->bi_iter.bi_sector;
> + nr_sectors = cmd->bio->bi_iter.bi_size >> 9;
> +  

Re: [PATCH] blktests: remove unused null_blk parameter for _init_null_blk in block/016

2018-08-28 Thread Omar Sandoval
On Sat, Aug 25, 2018 at 06:06:19PM +0800, Yi Zhang wrote:
> Signed-off-by: Yi Zhang 
> ---
>  tests/block/016 | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tests/block/016 b/tests/block/016
> index 0e6f2e1..e0a63ad 100755
> --- a/tests/block/016
> +++ b/tests/block/016
> @@ -32,7 +32,7 @@ requires() {
>  test() {
>   echo "Running ${TEST_NAME}"
>  
> - if ! _init_null_blk null_blk queue_mode=2 irqmode=2 
> completion_nsec=20; then
> + if ! _init_null_blk queue_mode=2 irqmode=2 completion_nsec=20; 
> then
>   return 1
>   fi

Applied, thanks!


Re: [PATCH blktests 0/3] Add NVMeOF multipath tests

2018-08-23 Thread Omar Sandoval
On Thu, Aug 23, 2018 at 01:53:33AM +, Bart Van Assche wrote:
> On Tue, 2018-08-21 at 08:46 +0200, Johannes Thumshirn wrote:
> > On Mon, Aug 20, 2018 at 03:46:45PM +, Bart Van Assche wrote:
> > > Moving these tests into the nvme directory is possible but will make it
> > > harder to run the NVMeOF multipath tests separately. Are you fine with 
> > > this?
> > 
> > Both way's have it's up and downsides, I agree.
> > 
> > Having two distinct groups requires to run './check nvme nvmeof-mp' to
> > run full coverage with nvme.
> > 
> > Having it all in one group would require to run './check nvme 18 19 20
> > 21 22 23 24 ...' to get only the dm-mpath ones.
> > 
> > Honestly I hate both but your's (the two distinct groups) is probably
> > easier to handle in the end, I have to admit.
> 
> Omar, do you have a preference for one of the two aforementioned approaches?
> 
> Thanks,
> 
> Bart.
> 

Let's keep it in a separate category, since lots of people running nvme
tests probably aren't interested in testing multipath.

A bunch of the tests failed with

modprobe: FATAL: Module nvme is in use.

Maybe related to my test VM having an nvme device?


Re: [PATCH] block: remove duplicate initialization

2018-08-17 Thread Omar Sandoval
On Thu, Aug 16, 2018 at 03:45:29PM -0700, Chaitanya Kulkarni wrote:
> This patch removes the duplicate initialization of q->queue_head
> in the blk_alloc_queue_node(). This removes the 2nd initialization
> so that we preserve the initialization order same as declaration
> present in struct request_queue.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Chaitanya Kulkarni 
> ---
>  block/blk-core.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 21c7cb35d3b4..dee56c282efb 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1036,7 +1036,6 @@ struct request_queue *blk_alloc_queue_node(gfp_t 
> gfp_mask, int node_id,
>   laptop_mode_timer_fn, 0);
>   timer_setup(&q->timeout, blk_rq_timed_out_timer, 0);
>   INIT_WORK(&q->timeout_work, NULL);
> - INIT_LIST_HEAD(&q->queue_head);
>   INIT_LIST_HEAD(&q->timeout_list);
>   INIT_LIST_HEAD(&q->icq_list);
>  #ifdef CONFIG_BLK_CGROUP
> -- 
> 2.17.0
> 


Re: [RFC PATCH 1/5] block: move call of scheduler's ->completed_request() hook

2018-08-10 Thread Omar Sandoval
On Fri, Aug 10, 2018 at 10:59:37AM +0800, jianchao.wang wrote:
> Hi Omar
> 
> On 08/10/2018 04:26 AM, Omar Sandoval wrote:
> > @@ -524,6 +524,9 @@ inline void __blk_mq_end_request(struct request *rq, 
> > blk_status_t error)
> > blk_stat_add(rq, now);
> > }
> >  
> > +   if (rq->internal_tag != -1)
> > +   blk_mq_sched_completed_request(rq, now);
> > +
> 
> Is it OK to move the io scheduler completed callback into 
> __blk_mq_end_request ?

It's ok in the sense that only Kyber uses it.

> There is a relatively long distance between the __blk_mq_complete_request and 
> __blk_mq_end_request,
> especially the driver's mq_ops.complete and the bio_endio.

That's a fair point. We could maybe consolidate all of the I/O time
measurements to __blk_mq_complete_request() and have a separate callback
for the total time measurement in __blk_mq_end_request(), but I'm not
sure it's worth adding another hook for that.


[RFC PATCH 4/5] kyber: implement improved heuristics

2018-08-09 Thread Omar Sandoval
From: Omar Sandoval 

Kyber's current heuristics have a few flaws:

- It's based on the mean latency, but p99 latency tends to be more
  meaningful to anyone who cares about latency. The mean can also be
  skewed by rare outliers that the scheduler can't do anything about.
- The statistics calculations are purely time-based with a short window.
  This works for steady, high load, but is more sensitive to outliers
  with bursty workloads.
- It only considers the latency once an I/O has been submitted to the
  device, but the user cares about the time spent in the kernel, as
  well.

These are shortcomings of the generic blk-stat code which doesn't quite
fit the ideal use case for Kyber. So, this replaces the statistics with
a histogram used to calculate percentiles of total latency and I/O
latency, which we then use to adjust depths in a slightly more
intelligent manner:

- Sync and async writes are now the same domain.
- Discards are a separate domain.
- Domain queue depths are scaled by the ratio of the p99 total latency
  to the target latency (e.g., if the p99 latency is double the target
  latency, we will double the queue depth; if the p99 latency is half of
  the target latency, we can halve the queue depth).
- We use the I/O latency to determine whether we should scale queue
  depths down: we will only scale down if any domain's I/O latency
  exceeds the target latency, which is an indicator of congestion in the
  device.

These new heuristics are just as scalable as the heuristics they
replace.

Signed-off-by: Omar Sandoval 
---
 block/kyber-iosched.c | 497 --
 1 file changed, 279 insertions(+), 218 deletions(-)

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 08eb5295c18d..adc8e6393829 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -29,13 +29,16 @@
 #include "blk-mq-debugfs.h"
 #include "blk-mq-sched.h"
 #include "blk-mq-tag.h"
-#include "blk-stat.h"
 
-/* Scheduling domains. */
+/*
+ * Scheduling domains: the device is divided into multiple domains based on the
+ * request type.
+ */
 enum {
KYBER_READ,
-   KYBER_SYNC_WRITE,
-   KYBER_OTHER, /* Async writes, discard, etc. */
+   KYBER_WRITE,
+   KYBER_DISCARD,
+   KYBER_OTHER,
KYBER_NUM_DOMAINS,
 };
 
@@ -49,25 +52,82 @@ enum {
 };
 
 /*
- * Initial device-wide depths for each scheduling domain.
+ * Maximum device-wide depth for each scheduling domain.
  *
- * Even for fast devices with lots of tags like NVMe, you can saturate
- * the device with only a fraction of the maximum possible queue depth.
- * So, we cap these to a reasonable value.
+ * Even for fast devices with lots of tags like NVMe, you can saturate the
+ * device with only a fraction of the maximum possible queue depth. So, we cap
+ * these to a reasonable value.
  */
 static const unsigned int kyber_depth[] = {
[KYBER_READ] = 256,
-   [KYBER_SYNC_WRITE] = 128,
-   [KYBER_OTHER] = 64,
+   [KYBER_WRITE] = 128,
+   [KYBER_DISCARD] = 64,
+   [KYBER_OTHER] = 16,
 };
 
 /*
- * Scheduling domain batch sizes. We favor reads.
+ * Default latency targets for each scheduling domain.
+ */
+static const u64 kyber_latency_targets[] = {
+   [KYBER_READ] = 2 * NSEC_PER_MSEC,
+   [KYBER_WRITE] = 10 * NSEC_PER_MSEC,
+   [KYBER_DISCARD] = 5 * NSEC_PER_SEC,
+};
+
+/*
+ * Batch size (number of requests we'll dispatch in a row) for each scheduling
+ * domain.
  */
 static const unsigned int kyber_batch_size[] = {
[KYBER_READ] = 16,
-   [KYBER_SYNC_WRITE] = 8,
-   [KYBER_OTHER] = 8,
+   [KYBER_WRITE] = 8,
+   [KYBER_DISCARD] = 1,
+   [KYBER_OTHER] = 1,
+};
+
+/*
+ * Requests latencies are recorded in a histogram with buckets defined relative
+ * to the target latency:
+ *
+ * <= 1/4 * target latency
+ * <= 1/2 * target latency
+ * <= 3/4 * target latency
+ * <= target latency
+ * <= 1 1/4 * target latency
+ * <= 1 1/2 * target latency
+ * <= 1 3/4 * target latency
+ * > 1 3/4 * target latency
+ */
+enum {
+   /*
+* The width of the latency histogram buckets is
+* 1 / (1 << KYBER_LATENCY_SHIFT) * target latency.
+*/
+   KYBER_LATENCY_SHIFT = 2,
+   /*
+* The first (1 << KYBER_LATENCY_SHIFT) buckets are <= target latency,
+* thus, "good".
+*/
+   KYBER_GOOD_BUCKETS = 1 << KYBER_LATENCY_SHIFT,
+   /* There are also (1 << KYBER_LATENCY_SHIFT) "bad" buckets. */
+   KYBER_LATENCY_BUCKETS = 2 << KYBER_LATENCY_SHIFT,
+};
+
+/*
+ * We measure both the total latency and the I/O latency (i.e., latency after
+ * submitting to the device).
+ */
+enum {
+   KYBER_TOTAL_LATENCY,
+   KYBER_IO_LATENCY,
+};
+
+/*
+ * Per-cpu latency histograms: total latency and I/O latency for each 
scheduling
+ * domain except for KYBER_OTHER.
+ */
+struct kybe

[RFC PATCH 5/5] kyber: add tracepoints

2018-08-09 Thread Omar Sandoval
From: Omar Sandoval 

When debugging Kyber, it's really useful to know what latencies we've
been having and how the domain depths have been adjusted. Add two
tracepoints, kyber_latency and kyber_adjust, to record that.

Signed-off-by: Omar Sandoval 
---
 block/kyber-iosched.c| 46 +-
 include/trace/events/kyber.h | 76 
 2 files changed, 104 insertions(+), 18 deletions(-)
 create mode 100644 include/trace/events/kyber.h

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index adc8e6393829..d8ddcb4f2435 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -30,6 +30,9 @@
 #include "blk-mq-sched.h"
 #include "blk-mq-tag.h"
 
+#define CREATE_TRACE_POINTS
+#include 
+
 /*
  * Scheduling domains: the device is divided into multiple domains based on the
  * request type.
@@ -42,6 +45,13 @@ enum {
KYBER_NUM_DOMAINS,
 };
 
+static const char *kyber_domain_names[] = {
+   [KYBER_READ] = "READ",
+   [KYBER_WRITE] = "WRITE",
+   [KYBER_DISCARD] = "DISCARD",
+   [KYBER_OTHER] = "OTHER",
+};
+
 enum {
/*
 * In order to prevent starvation of synchronous requests by a flood of
@@ -122,6 +132,11 @@ enum {
KYBER_IO_LATENCY,
 };
 
+static const char *kyber_latency_type_names[] = {
+   [KYBER_TOTAL_LATENCY] = "total",
+   [KYBER_IO_LATENCY] = "I/O",
+};
+
 /*
  * Per-cpu latency histograms: total latency and I/O latency for each 
scheduling
  * domain except for KYBER_OTHER.
@@ -144,6 +159,8 @@ struct kyber_ctx_queue {
 } cacheline_aligned_in_smp;
 
 struct kyber_queue_data {
+   struct request_queue *q;
+
/*
 * Each scheduling domain has a limited number of in-flight requests
 * device-wide, limited by these tokens.
@@ -249,6 +266,10 @@ static int calculate_percentile(struct kyber_queue_data 
*kqd,
}
memset(buckets, 0, sizeof(kqd->latency_buckets[sched_domain][type]));
 
+   trace_kyber_latency(kqd->q, kyber_domain_names[sched_domain],
+   kyber_latency_type_names[type], percentile,
+   bucket + 1, 1 << KYBER_LATENCY_SHIFT, samples);
+
return bucket;
 }
 
@@ -256,8 +277,11 @@ static void kyber_resize_domain(struct kyber_queue_data 
*kqd,
unsigned int sched_domain, unsigned int depth)
 {
depth = clamp(depth, 1U, kyber_depth[sched_domain]);
-   if (depth != kqd->domain_tokens[sched_domain].sb.depth)
+   if (depth != kqd->domain_tokens[sched_domain].sb.depth) {
sbitmap_queue_resize(&kqd->domain_tokens[sched_domain], depth);
+   trace_kyber_adjust(kqd->q, kyber_domain_names[sched_domain],
+  depth);
+   }
 }
 
 static void kyber_timer_fn(struct timer_list *t)
@@ -360,6 +384,8 @@ static struct kyber_queue_data 
*kyber_queue_data_alloc(struct request_queue *q)
if (!kqd)
goto err;
 
+   kqd->q = q;
+
kqd->cpu_latency = alloc_percpu_gfp(struct kyber_cpu_latency,
GFP_KERNEL | __GFP_ZERO);
if (!kqd->cpu_latency)
@@ -944,23 +970,7 @@ static int kyber_cur_domain_show(void *data, struct 
seq_file *m)
struct blk_mq_hw_ctx *hctx = data;
struct kyber_hctx_data *khd = hctx->sched_data;
 
-   switch (khd->cur_domain) {
-   case KYBER_READ:
-   seq_puts(m, "READ\n");
-   break;
-   case KYBER_WRITE:
-   seq_puts(m, "WRITE\n");
-   break;
-   case KYBER_DISCARD:
-   seq_puts(m, "DISCARD\n");
-   break;
-   case KYBER_OTHER:
-   seq_puts(m, "OTHER\n");
-   break;
-   default:
-   seq_printf(m, "%u\n", khd->cur_domain);
-   break;
-   }
+   seq_printf(m, "%s\n", kyber_domain_names[khd->cur_domain]);
return 0;
 }
 
diff --git a/include/trace/events/kyber.h b/include/trace/events/kyber.h
new file mode 100644
index ..9bf85dbef492
--- /dev/null
+++ b/include/trace/events/kyber.h
@@ -0,0 +1,76 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM kyber
+
+#if !defined(_TRACE_KYBER_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_KYBER_H
+
+#include 
+#include 
+
+#define DOMAIN_LEN 16
+#define LATENCY_TYPE_LEN   8
+
+TRACE_EVENT(kyber_latency,
+
+   TP_PROTO(struct request_queue *q, const char *domain, const char *type,
+unsigned int percentile, unsigned int numerator,
+unsigned int denominator, unsigned int samples),
+
+   TP_ARGS(q, domain, type, percentile, numerator, denominator, samples),
+
+   TP_STRUCT__entry(
+   __field(   

[RFC PATCH 3/5] kyber: don't make domain token sbitmap larger than necessary

2018-08-09 Thread Omar Sandoval
From: Omar Sandoval 

The domain token sbitmaps are currently initialized to the device queue
depth or 256, whichever is larger, and immediately resized to the
maximum depth for that domain (256, 128, or 64 for read, write, and
other, respectively). The sbitmap is never resized larger than that, so
it's unnecessary to allocate a bitmap larger than the maximum depth.
Let's just allocate it to the maximum depth to begin with. This will use
marginally less memory, and more importantly, give us a more appropriate
number of bits per sbitmap word.

Signed-off-by: Omar Sandoval 
---
 block/kyber-iosched.c | 15 ++-
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 95d062c07c61..08eb5295c18d 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -40,8 +40,6 @@ enum {
 };
 
 enum {
-   KYBER_MIN_DEPTH = 256,
-
/*
 * In order to prevent starvation of synchronous requests by a flood of
 * asynchronous requests, we reserve 25% of requests for synchronous
@@ -305,7 +303,6 @@ static int kyber_bucket_fn(const struct request *rq)
 static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
 {
struct kyber_queue_data *kqd;
-   unsigned int max_tokens;
unsigned int shift;
int ret = -ENOMEM;
int i;
@@ -320,25 +317,17 @@ static struct kyber_queue_data 
*kyber_queue_data_alloc(struct request_queue *q)
if (!kqd->cb)
goto err_kqd;
 
-   /*
-* The maximum number of tokens for any scheduling domain is at least
-* the queue depth of a single hardware queue. If the hardware doesn't
-* have many tags, still provide a reasonable number.
-*/
-   max_tokens = max_t(unsigned int, q->tag_set->queue_depth,
-  KYBER_MIN_DEPTH);
for (i = 0; i < KYBER_NUM_DOMAINS; i++) {
WARN_ON(!kyber_depth[i]);
WARN_ON(!kyber_batch_size[i]);
ret = sbitmap_queue_init_node(&kqd->domain_tokens[i],
- max_tokens, -1, false, GFP_KERNEL,
- q->node);
+ kyber_depth[i], -1, false,
+ GFP_KERNEL, q->node);
if (ret) {
while (--i >= 0)
sbitmap_queue_free(&kqd->domain_tokens[i]);
goto err_cb;
}
-   sbitmap_queue_resize(&kqd->domain_tokens[i], kyber_depth[i]);
}
 
shift = kyber_sched_tags_shift(kqd);
-- 
2.18.0



[RFC PATCH 1/5] block: move call of scheduler's ->completed_request() hook

2018-08-09 Thread Omar Sandoval
From: Omar Sandoval 

Commit 4bc6339a583c ("block: move blk_stat_add() to
__blk_mq_end_request()") consolidated some calls using ktime_get() so
we'd only need to call it once. Kyber's ->completed_request() hook also
calls ktime_get(), so let's move it to the same place, too.

Signed-off-by: Omar Sandoval 
---
 block/blk-mq-sched.h | 4 ++--
 block/blk-mq.c   | 5 +++--
 block/kyber-iosched.c| 5 ++---
 include/linux/elevator.h | 2 +-
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 0cb8f938dff9..74fb6ff9a30d 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -54,12 +54,12 @@ blk_mq_sched_allow_merge(struct request_queue *q, struct 
request *rq,
return true;
 }
 
-static inline void blk_mq_sched_completed_request(struct request *rq)
+static inline void blk_mq_sched_completed_request(struct request *rq, u64 now)
 {
struct elevator_queue *e = rq->q->elevator;
 
if (e && e->type->ops.mq.completed_request)
-   e->type->ops.mq.completed_request(rq);
+   e->type->ops.mq.completed_request(rq, now);
 }
 
 static inline void blk_mq_sched_started_request(struct request *rq)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 654b0dc7e001..0c5be0001d0f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -524,6 +524,9 @@ inline void __blk_mq_end_request(struct request *rq, 
blk_status_t error)
blk_stat_add(rq, now);
}
 
+   if (rq->internal_tag != -1)
+   blk_mq_sched_completed_request(rq, now);
+
blk_account_io_done(rq, now);
 
if (rq->end_io) {
@@ -560,8 +563,6 @@ static void __blk_mq_complete_request(struct request *rq)
 
if (!blk_mq_mark_complete(rq))
return;
-   if (rq->internal_tag != -1)
-   blk_mq_sched_completed_request(rq);
 
if (!test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags)) {
rq->q->softirq_done_fn(rq);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index a1660bafc912..95d062c07c61 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -558,12 +558,12 @@ static void kyber_finish_request(struct request *rq)
rq_clear_domain_token(kqd, rq);
 }
 
-static void kyber_completed_request(struct request *rq)
+static void kyber_completed_request(struct request *rq, u64 now)
 {
struct request_queue *q = rq->q;
struct kyber_queue_data *kqd = q->elevator->elevator_data;
unsigned int sched_domain;
-   u64 now, latency, target;
+   u64 latency, target;
 
/*
 * Check if this request met our latency goal. If not, quickly gather
@@ -585,7 +585,6 @@ static void kyber_completed_request(struct request *rq)
if (blk_stat_is_active(kqd->cb))
return;
 
-   now = ktime_get_ns();
if (now < rq->io_start_time_ns)
return;
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index a02deea30185..015bb59c0331 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -111,7 +111,7 @@ struct elevator_mq_ops {
void (*insert_requests)(struct blk_mq_hw_ctx *, struct list_head *, 
bool);
struct request *(*dispatch_request)(struct blk_mq_hw_ctx *);
bool (*has_work)(struct blk_mq_hw_ctx *);
-   void (*completed_request)(struct request *);
+   void (*completed_request)(struct request *, u64);
void (*started_request)(struct request *);
void (*requeue_request)(struct request *);
struct request *(*former_request)(struct request_queue *, struct 
request *);
-- 
2.18.0



[RFC PATCH 0/5] kyber: better heuristics

2018-08-09 Thread Omar Sandoval
From: Omar Sandoval 

Hello,

I've spent the past few weeks experimenting with different heuristics
for Kyber in order to deal with some edge cases we've hit here. This
series is my progress so far, implementing less handwavy heuristics
while keeping the same basic mechanisms. Patches 1 and 2 are
preparation. Patch 3 is a minor optimization. Patch 4 is the main
change, and includes a detailed description of the new heuristics. Patch
5 adds tracepoints for debugging.

Please, take a look and try it out.

Thanks!

Omar Sandoval (5):
  block: move call of scheduler's ->completed_request() hook
  block: export blk_stat_enable_accounting()
  kyber: don't make domain token sbitmap larger than necessary
  kyber: implement improved heuristics
  kyber: add tracepoints

 block/blk-mq-sched.h |   4 +-
 block/blk-mq.c   |   5 +-
 block/blk-stat.c |   1 +
 block/kyber-iosched.c| 541 +++
 include/linux/elevator.h |   2 +-
 include/trace/events/kyber.h |  76 +
 6 files changed, 383 insertions(+), 246 deletions(-)
 create mode 100644 include/trace/events/kyber.h

-- 
2.18.0



[RFC PATCH 2/5] block: export blk_stat_enable_accounting()

2018-08-09 Thread Omar Sandoval
From: Omar Sandoval 

Kyber will need this in a future change if it is built as a module.

Signed-off-by: Omar Sandoval 
---
 block/blk-stat.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/blk-stat.c b/block/blk-stat.c
index 175c143ac5b9..d98f3ad6794e 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -190,6 +190,7 @@ void blk_stat_enable_accounting(struct request_queue *q)
blk_queue_flag_set(QUEUE_FLAG_STATS, q);
spin_unlock(&q->stats->lock);
 }
+EXPORT_SYMBOL_GPL(blk_stat_enable_accounting);
 
 struct blk_queue_stats *blk_alloc_queue_stats(void)
 {
-- 
2.18.0



Re: [PATCH 0/2] blktests: test ANA base support

2018-07-26 Thread Omar Sandoval
On Thu, Jul 26, 2018 at 02:31:32PM -0700, Omar Sandoval wrote:
> On Thu, Jul 26, 2018 at 09:35:53AM -0700, James Smart wrote:
> > make sure this fix has been picked up in your kernel:
> > http://git.infradead.org/nvme.git/commit/6cdefc6e2ad52170f89a8d0e8b1a1339f91834dc
> > 
> > deletes were broken otherwise and have the exact trace below.
> > 
> > -- james
> 
> Thanks, James, that indeed fixes the problem. I'll go ahead and merge
> these tests since there's a fix in flight.

By "these tests", I mean Chaitanya's tests, Hannes still has some
comments to resolve :)


  1   2   3   4   5   6   7   >