Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-05 Thread Mike Galbraith
On Tue, 2018-02-06 at 08:44 +0100, Oleksandr Natalenko wrote:
> Hi, Paolo.
> 
> I can confirm that this patch fixes the cfdisk hang for me. I've also
> tried to trigger the issue Mike has encountered, but with no luck (maybe
> I wasn't insistent enough; I was just doing dd on a usb-storage device in
> the VM).

I was doing kbuilds, and it blew up on me twice.  Switching back to cfq
seemed to confirm it was indeed the patch causing trouble, but that's
by no means a certainty.

-Mike


Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-05 Thread Oleksandr Natalenko

Hi, Paolo.

I can confirm that this patch fixes the cfdisk hang for me. I've also
tried to trigger the issue Mike has encountered, but with no luck (maybe
I wasn't insistent enough; I was just doing dd on a usb-storage device in
the VM).


So, with regard to cfdisk hang on usb-storage:

Tested-by: Oleksandr Natalenko 

Thank you.


Re: [PATCH rfc 1/5] irq-am: Introduce library implementing generic adaptive moderation

2018-02-05 Thread Or Gerlitz
On Tue, Feb 6, 2018 at 12:03 AM, Sagi Grimberg  wrote:
> irq-am library helps I/O devices implement interrupt moderation in
> an adaptive fashion, based on online stats.
>
> The consumer can initialize an irq-am context with a callback that
> performs the device-specific moderation programming, and also with the
> number of am (adaptive moderation) levels, which are likewise abstracted
> and allow for device-specific tuning.
>
> The irq-am code will sample once every nr_events and will check for a
> significant change in workload characteristics (completions per second,
> events per second); if it detects one, it will perform an am level update
> (called a step).
>
> The irq-am code assumes that the am levels are sorted in increasing order,
> where the lowest level corresponds to the optimum latency tuning (short
> time and low completion-count), gradually increasing towards the optimum
> throughput tuning (longer time and higher completion-count). So there is a
> trend and tuning direction tracked by the moderator. When the moderator
> collects sufficient statistics (also controlled by the consumer defining
> nr_events), it compares the current stats with the previous stats and, if
> a significant change in load was observed, attempts to increment/decrement
> its current level (step) and schedules a program dispatch work.
>
> Signed-off-by: Sagi Grimberg 
> ---
>  include/linux/irq-am.h | 116 +++

Talking to Tal, it seems that this is what landed upstream as
include/linux/net_dim.h, and it could work for you with a few adjustments.
I suggest you take a look.
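
For illustration, here is a rough, hypothetical sketch of the level/step
logic described above, in plain C. The names are invented for this sketch;
this is not the proposed irq-am API nor net_dim.

struct am_sample {
        unsigned long long comps;       /* completions seen in the window */
        unsigned long long events;      /* interrupt events in the window */
};

struct am_ctx {
        unsigned int level;             /* current moderation level */
        unsigned int nr_levels;         /* number of device-defined levels */
        struct am_sample prev;          /* stats from the previous window */
};

/* treat a >10% difference in completion count as "significant" */
static int am_significant_change(const struct am_sample *a,
                                 const struct am_sample *b)
{
        unsigned long long hi = a->comps > b->comps ? a->comps : b->comps;
        unsigned long long lo = a->comps > b->comps ? b->comps : a->comps;

        return hi - lo > hi / 10;
}

/* returns the (possibly updated) level; the caller programs the device */
static unsigned int am_step(struct am_ctx *am, struct am_sample cur)
{
        if (am_significant_change(&cur, &am->prev)) {
                if (cur.comps > am->prev.comps &&
                    am->level + 1 < am->nr_levels)
                        am->level++;    /* load rising: favor throughput */
                else if (cur.comps < am->prev.comps && am->level > 0)
                        am->level--;    /* load falling: favor latency */
        }
        am->prev = cur;
        return am->level;
}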


[PATCH] blk-throttle: fix race between blkcg_bio_issue_check and cgroup_rmdir

2018-02-05 Thread Joseph Qi
We've triggered a WARNING in blk_throtl_bio when throttling writeback
io, which complains that blkg->refcnt is already 0 when calling blkg_get,
and then the kernel crashes with an invalid page request.
After investigating this issue, we've found there is a race between
blkcg_bio_issue_check and cgroup_rmdir. The race is described below.

writeback kworker
  blkcg_bio_issue_check
rcu_read_lock
blkg_lookup
<<< *race window*
blk_throtl_bio
  spin_lock_irq(q->queue_lock)
  spin_unlock_irq(q->queue_lock)
rcu_read_unlock

cgroup_rmdir
  cgroup_destroy_locked
kill_css
  css_killed_ref_fn
css_killed_work_fn
  offline_css
blkcg_css_offline
  spin_trylock(q->queue_lock)
  blkg_destroy
  spin_unlock(q->queue_lock)

Since rcu can only prevent the blkg from being released while it is being
used, blkg->refcnt can be decreased to 0 during blkg_destroy, which then
schedules the blkg release.
Trying to blkg_get in blk_throtl_bio will then trigger the WARNING.
And the corresponding blkg_put will schedule the blkg release again,
which results in a double free.
This race was introduced by commit ae1188963611 ("blkcg: consolidate blkg
creation in blkcg_bio_issue_check()"). Before this commit, the code did a
lockless lookup first and then tried to lookup/create again under
queue_lock. So revive this logic to fix the race.
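
For clarity, a minimal sketch of the revived ordering, lockless lookup
first, then lookup/create under queue_lock (blkg_should_throttle() and
do_throttle() are hypothetical helpers standing in for the real logic;
the actual change is in the diff below):

static bool throttle_check_sketch(struct request_queue *q,
                                  struct blkcg *blkcg, struct bio *bio)
{
        struct blkcg_gq *blkg;
        bool throttled = false;

        WARN_ON_ONCE(!rcu_read_lock_held());

        /* fast path: lockless lookup, may race with blkg_destroy() */
        blkg = blkg_lookup(blkcg, q);
        if (blkg && !blkg_should_throttle(blkg, bio))
                return false;

        /* slow path: blkg_destroy() also runs under queue_lock, so the
         * blkg returned here cannot disappear while we hold the lock */
        spin_lock_irq(q->queue_lock);
        blkg = blkg_lookup_create(blkcg, q);
        if (IS_ERR(blkg))
                blkg = q->root_blkg;
        throttled = do_throttle(blkg, bio);
        spin_unlock_irq(q->queue_lock);

        return throttled;
}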

Fixes: ae1188963611 ("blkcg: consolidate blkg creation in 
blkcg_bio_issue_check()")
Reported-by: Jiufei Xue 
Signed-off-by: Joseph Qi 
CC: sta...@vger.kernel.org
---
 block/blk-cgroup.c |  2 +-
 block/blk-throttle.c   | 29 +
 block/cfq-iosched.c| 33 +++--
 include/linux/blk-cgroup.h | 27 +--
 4 files changed, 54 insertions(+), 37 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4117524..b1d22e5 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -162,7 +162,6 @@ struct blkcg_gq *blkg_lookup_slowpath(struct blkcg *blkcg,
 
return NULL;
 }
-EXPORT_SYMBOL_GPL(blkg_lookup_slowpath);
 
 /*
  * If @new_blkg is %NULL, this function tries to allocate a new one as
@@ -306,6 +305,7 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
return blkg;
}
 }
+EXPORT_SYMBOL_GPL(blkg_lookup_create);
 
 static void blkg_destroy(struct blkcg_gq *blkg)
 {
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index c5a1316..ec830be 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2143,26 +2143,41 @@ static void blk_throtl_assoc_bio(struct throtl_grp *tg, 
struct bio *bio)
 #endif
 }
 
-bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
+bool blk_throtl_bio(struct request_queue *q, struct blkcg *blkcg,
struct bio *bio)
 {
+   struct throtl_data *td = q->td;
struct throtl_qnode *qn = NULL;
-   struct throtl_grp *tg = blkg_to_tg(blkg ?: q->root_blkg);
+   struct throtl_grp *tg;
+   struct blkcg_gq *blkg;
struct throtl_service_queue *sq;
bool rw = bio_data_dir(bio);
bool throttled = false;
-   struct throtl_data *td = tg->td;
 
WARN_ON_ONCE(!rcu_read_lock_held());
 
+   /*
+* If a group has no rules, just update the dispatch stats in lockless
+* manner and return.
+*/
+   blkg = blkg_lookup(blkcg, q);
+   tg = blkg_to_tg(blkg);
+   if (tg && !tg->has_rules[rw])
+   goto out;
+
/* see throtl_charge_bio() */
-   if (bio_flagged(bio, BIO_THROTTLED) || !tg->has_rules[rw])
+   if (bio_flagged(bio, BIO_THROTTLED))
goto out;
 
spin_lock_irq(q->queue_lock);
 
throtl_update_latency_buckets(td);
 
+   blkg = blkg_lookup_create(blkcg, q);
+   if (IS_ERR(blkg))
+   blkg = q->root_blkg;
+   tg = blkg_to_tg(blkg);
+
if (unlikely(blk_queue_bypass(q)))
goto out_unlock;
 
@@ -2253,6 +2268,12 @@ bool blk_throtl_bio(struct request_queue *q, struct 
blkcg_gq *blkg,
if (throttled || !td->track_bio_latency)
bio->bi_issue_stat.stat |= SKIP_LATENCY;
 #endif
+   if (!throttled) {
+   blkg_rwstat_add(&blkg->stat_bytes, bio->bi_opf,
+   bio->bi_iter.bi_size);
+   blkg_rwstat_add(&blkg->stat_ios, bio->bi_opf, 1);
+   }
+
return throttled;
 }
 
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 9f342ef..60f53b5 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1674,15 +1674,28 @@ static void cfq_pd_reset_stats(struct blkg_policy_data 
*pd)
cfqg_stats_reset(&cfqg->stats);
 }
 
-static struct cfq_group *cfq_lookup_cfqg(struct cfq_data *cfqd,
-struct blkcg *blkcg)
+/*
+ * Search for the cfq group current task belongs to. request_queue lock must
+ * be held.
+ */
+static struct 

Re: [PATCH rfc 0/5] generic adaptive IRQ moderation library for I/O devices

2018-02-05 Thread Or Gerlitz
On Tue, Feb 6, 2018 at 12:03 AM, Sagi Grimberg  wrote:
[...]
> In the networking stack, each device driver implements adaptive IRQ moderation
> on its own. The approach here is a bit different, it tries to take the common 
> denominator,
> which is per-queue statistics gathering and workload change identification
> (basically decides if the moderation scheme needs to change).

Not any more. Andy and Tal made the mlx5 AM code a kernel library,
which is called DIM:

f4e5f0e MAINTAINERS: add entry for Dynamic Interrupt Moderation
6a8788f bnxt_en: add support for software dynamic interrupt moderation
8115b75 net/dim: use struct net_dim_sample as arg to net_dim
4c4dbb4 net/mlx5e: Move dynamic interrupt coalescing code to include/linux
9a31742 net/mlx5e: Change Mellanox references in DIM code
b9c872f net/mlx5e: Move generic functions to new file
f5e7f67 net/mlx5e: Move AM logic enums
138968e net/mlx5e: Remove rq references in mlx5e_rx_am
f58ee09 net/mlx5e: Move interrupt moderation forward declarations
98dd1ed net/mlx5e: Move interrupt moderation structs to new file

can you make use of that? cc-ing them in case you have questions/comments


> The library is targeted to multi-queue devices, but should work on single 
> queue
> devices as well, however I'm not sure that these devices will need something
> like interrupt moderation.
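
If the switch to net_dim is made, the consumer side is roughly the
following (sketched from memory of how bnxt/mlx5 use it; struct my_queue,
my_hw_set_coalesce() and my_irq_moder_table[] are driver-private
inventions for illustration, and the field/helper names should be
double-checked against include/linux/net_dim.h):

/* work handler, registered at init with INIT_WORK(&q->dim.work, my_dim_work) */
static void my_dim_work(struct work_struct *work)
{
        struct net_dim *dim = container_of(work, struct net_dim, work);
        struct my_queue *q = container_of(dim, struct my_queue, dim);

        /* profile_ix is the moderation level net_dim decided on */
        my_hw_set_coalesce(q, &my_irq_moder_table[dim->profile_ix]);
        dim->state = NET_DIM_START_MEASURE;
}

/* called from the queue's interrupt/napi completion path */
static void my_update_dim(struct my_queue *q)
{
        struct net_dim_sample sample;

        net_dim_sample(q->event_ctr, q->packets, q->bytes, &sample);
        net_dim(&q->dim, sample);       /* may schedule my_dim_work() */
}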


[PATCH v5 09/10] bcache: add io_disable to struct cached_dev

2018-02-05 Thread Coly Li
If a bcache device is configured in writeback mode, the current code does
not handle write I/O errors on backing devices properly.

In writeback mode, a write request is written to the cache device first,
and later flushed to the backing device. If I/O fails when writing from
the cache device to the backing device, bcache just ignores the error and
the upper layer code is NOT notified that the backing device is broken.

This patch tries to handle backing device failure the same way cache
device failure is handled,
- Add an error counter 'io_errors' and an error limit 'error_limit' in
  struct cached_dev. Add another io_disable to struct cached_dev to
  disable I/Os on the problematic backing device.
- When an I/O error happens on the backing device, increase the io_errors
  counter. And if io_errors reaches error_limit, set cached_dev->io_disable
  to true and stop the bcache device.

The result is, if the backing device is broken or disconnected and I/O
errors reach its error limit, the backing device will be disabled and the
associated bcache device will be removed from the system.
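
The stop path itself (bch_cached_dev_error(), whose super.c hunk is not
included in this mail) looks roughly like the sketch below; treat it as
illustrative of the description above rather than the exact patch:

bool bch_cached_dev_error(struct cached_dev *dc)
{
        char name[BDEVNAME_SIZE];

        if (!dc || test_bit(BCACHE_DEV_CLOSING, &dc->disk.flags))
                return false;

        dc->io_disable = true;
        /* make sure others see io_disable = true before the device stops */
        smp_mb();

        pr_err("stop %s: too many IO errors on backing device %s",
               dc->disk.disk->disk_name, bdevname(dc->bdev, name));

        bcache_device_stop(&dc->disk);
        return true;
}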

Changelog:
v2: remove "bcache: " prefix in pr_error(), and use correct name string to
print out bcache device gendisk name.
v1: indeed this is new added in v2 patch set.

Signed-off-by: Coly Li 
Reviewed-by: Hannes Reinecke 
Cc: Michael Lyle 
Cc: Junhui Tang 
---
 drivers/md/bcache/bcache.h  |  6 ++
 drivers/md/bcache/io.c  | 14 ++
 drivers/md/bcache/request.c | 14 --
 drivers/md/bcache/super.c   | 23 ++-
 drivers/md/bcache/sysfs.c   | 15 ++-
 5 files changed, 68 insertions(+), 4 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 263164490833..c59ce168bd82 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -366,6 +366,7 @@ struct cached_dev {
unsignedsequential_cutoff;
unsignedreadahead;
 
+   unsignedio_disable:1;
unsignedverify:1;
unsignedbypass_torture_test:1;
 
@@ -387,6 +388,9 @@ struct cached_dev {
unsignedwriteback_rate_minimum;
 
enum stop_on_faliurestop_when_cache_set_failed;
+#define DEFAULT_CACHED_DEV_ERROR_LIMIT 64
+   atomic_tio_errors;
+   unsignederror_limit;
 };
 
 enum alloc_reserve {
@@ -896,6 +900,7 @@ static inline void closure_bio_submit(struct cache_set *c,
 
 /* Forward declarations */
 
+void bch_count_backing_io_errors(struct cached_dev *dc, struct bio *bio);
 void bch_count_io_errors(struct cache *, blk_status_t, int, const char *);
 void bch_bbio_count_io_errors(struct cache_set *, struct bio *,
  blk_status_t, const char *);
@@ -923,6 +928,7 @@ int bch_bucket_alloc_set(struct cache_set *, unsigned,
 struct bkey *, int, bool);
 bool bch_alloc_sectors(struct cache_set *, struct bkey *, unsigned,
   unsigned, unsigned, bool);
+bool bch_cached_dev_error(struct cached_dev *dc);
 
 __printf(2, 3)
 bool bch_cache_set_error(struct cache_set *, const char *, ...);
diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index 8013ecbcdbda..7fac97ae036e 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -50,6 +50,20 @@ void bch_submit_bbio(struct bio *bio, struct cache_set *c,
 }
 
 /* IO errors */
+void bch_count_backing_io_errors(struct cached_dev *dc, struct bio *bio)
+{
+   char buf[BDEVNAME_SIZE];
+   unsigned errors;
+
+   WARN_ONCE(!dc, "NULL pointer of struct cached_dev");
+
+   errors = atomic_add_return(1, &dc->io_errors);
+   if (errors < dc->error_limit)
+   pr_err("%s: IO error on backing device, unrecoverable",
+   bio_devname(bio, buf));
+   else
+   bch_cached_dev_error(dc);
+}
 
 void bch_count_io_errors(struct cache *ca,
 blk_status_t error,
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 9c6dda3b0068..03245e6980a6 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -637,6 +637,8 @@ static void backing_request_endio(struct bio *bio)
 
if (bio->bi_status) {
struct search *s = container_of(cl, struct search, cl);
+   struct cached_dev *dc = container_of(s->d,
+struct cached_dev, disk);
/*
 * If a bio has REQ_PREFLUSH for writeback mode, it is
 * specially assembled in cached_dev_write() for a non-zero
@@ -657,6 +659,7 @@ static void backing_request_endio(struct bio *bio)
}
s->recoverable = false;
/* should count I/O error for backing device here */
+   bch_count_backing_io_errors(dc, bio);
}
 
bio_put(bio);

Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-05 Thread Mike Galbraith
Hi Paolo,

I applied this to master.today, flipped udev back to bfq and took it
for a spin.  Unfortunately, box fairly quickly went boom under load.

[  454.739975] [ cut here ]
[  454.739979] list_add corruption. prev->next should be next 
(5f99a42a), but was   (null). (prev=fc569ec9).
[  454.739989] WARNING: CPU: 3 PID: 0 at lib/list_debug.c:28 
__list_add_valid+0x6a/0x70
[  454.739990] Modules linked in: fuse(E) ebtable_filter(E) ebtables(E) 
af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) 
nf_conntrack_ipv6(E) nf_defrag_ipv6(E) ipt_REJECT(E) xt_tcpudp(E) 
iptable_filter(E) ip6table_mangle(E) nf_conntrack_netbios_ns(E) 
nf_conntrack_broadcast(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) ip_tables(E) 
xt_conntrack(E) nf_conntrack(E) ip6table_filter(E) ip6_tables(E) x_tables(E) 
nls_iso8859_1(E) nls_cp437(E) intel_rapl(E) x86_pkg_temp_thermal(E) 
intel_powerclamp(E) snd_hda_codec_hdmi(E) coretemp(E) snd_hda_codec_realtek(E) 
snd_hda_codec_generic(E) kvm_intel(E) kvm(E) snd_hda_intel(E) snd_hda_codec(E) 
snd_hwdep(E) snd_hda_core(E) snd_pcm(E) irqbypass(E) snd_timer(E) joydev(E) 
crct10dif_pclmul(E) snd(E) r8169(E) crc32_pclmul(E) mii(E) mei_me(E) 
soundcore(E)
[  454.740011]  crc32c_intel(E) iTCO_wdt(E) ghash_clmulni_intel(E) 
iTCO_vendor_support(E) pcbc(E) mei(E) lpc_ich(E) aesni_intel(E) i2c_i801(E) 
mfd_core(E) aes_x86_64(E) shpchp(E) intel_smartconnect(E) crypto_simd(E) 
glue_helper(E) cryptd(E) pcspkr(E) fan(E) thermal(E) nfsd(E) auth_rpcgss(E) 
nfs_acl(E) lockd(E) grace(E) sunrpc(E) sr_mod(E) cdrom(E) hid_logitech_hidpp(E) 
hid_logitech_dj(E) uas(E) usb_storage(E) hid_generic(E) usbhid(E) nouveau(E) 
wmi(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) 
sysimgblt(E) fb_sys_fops(E) ahci(E) xhci_pci(E) ttm(E) libahci(E) ehci_pci(E) 
xhci_hcd(E) ehci_hcd(E) libata(E) drm(E) usbcore(E) video(E) button(E) 
sd_mod(E) vfat(E) fat(E) virtio_blk(E) virtio_mmio(E) virtio_pci(E) 
virtio_ring(E) virtio(E) ext4(E) crc16(E) mbcache(E) jbd2(E) loop(E) sg(E)
[  454.740038]  dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) 
scsi_dh_alua(E) scsi_mod(E) efivarfs(E) autofs4(E)
[  454.740043] CPU: 3 PID: 0 Comm: swapper/3 Tainted: GE
4.15.0.ge237f98-master #605
[  454.740044] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 
09/23/2013
[  454.740046] RIP: 0010:__list_add_valid+0x6a/0x70
[  454.740047] RSP: 0018:88041ecc3ca8 EFLAGS: 00010096
[  454.740048] RAX: 0075 RBX: 8803f33fa8c0 RCX: 0006
[  454.740049] RDX:  RSI: 0082 RDI: 88041ecd5570
[  454.740050] RBP: 8803f596d7e0 R08:  R09: 0368
[  454.740051] R10:  R11: 88041ecc3a30 R12: 8803eb1c8828
[  454.740052] R13: 8803f33fa940 R14: 8803f5852600 R15: 8803f596d810
[  454.740053] FS:  () GS:88041ecc() 
knlGS:
[  454.740054] CS:  0010 DS:  ES:  CR0: 80050033
[  454.740055] CR2: 014d9788 CR3: 01e0a006 CR4: 001606e0
[  454.740056] Call Trace:
[  454.740058]  <IRQ>
[  454.740062]  blk_flush_complete_seq+0x2b1/0x370
[  454.740065]  flush_end_io+0x18c/0x280
[  454.740074]  scsi_end_request+0x95/0x1e0 [scsi_mod]
[  454.740079]  scsi_io_completion+0xbb/0x5d0 [scsi_mod]
[  454.740082]  __blk_mq_complete_request+0xb7/0x180
[  454.740084]  blk_mq_complete_request+0x50/0x90
[  454.740087]  ? scsi_vpd_tpg_id+0x90/0x90 [scsi_mod]
[  454.740095]  ata_scsi_qc_complete+0x1d8/0x470 [libata]
[  454.740100]  ata_qc_complete_multiple+0x87/0xd0 [libata]
[  454.740103]  ahci_handle_port_interrupt+0xd4/0x4e0 [libahci]
[  454.740105]  ahci_handle_port_intr+0x6f/0xb0 [libahci]
[  454.740107]  ahci_single_level_irq_intr+0x3b/0x60 [libahci]
[  454.740110]  __handle_irq_event_percpu+0x40/0x1a0
[  454.740112]  handle_irq_event_percpu+0x20/0x50
[  454.740114]  handle_irq_event+0x36/0x60
[  454.740116]  handle_edge_irq+0x90/0x190
[  454.740118]  handle_irq+0x1c/0x30
[  454.740120]  do_IRQ+0x43/0xd0
[  454.740122]  common_interrupt+0xa2/0xa2
[  454.740123]  </IRQ>
[  454.740125] RIP: 0010:cpuidle_enter_state+0xec/0x250
[  454.740126] RSP: 0018:880187febec0 EFLAGS: 0246 ORIG_RAX: 
ffdd
[  454.740127] RAX: 88041ece0040 RBX: 88041ece77e8 RCX: 001f
[  454.740128] RDX:  RSI: fffd9cb7bc38 RDI: 
[  454.740129] RBP: 0005 R08: 0006 R09: 024f
[  454.740130] R10: 0205 R11: 0018 R12: 0003
[  454.740131] R13: 0069e09a3c87 R14: 0003 R15: 0069e09d450c
[  454.740134]  do_idle+0x16a/0x1d0
[  454.740136]  cpu_startup_entry+0x19/0x20
[  454.740138]  start_secondary+0x14e/0x190
[  454.740140]  secondary_startup_64+0xa5/0xb0
[  454.740141] Code: fe 31 c0 48 c7 c7 a0 61 bf 81 e8 12 f7 d8 ff 0f ff 31 c0 
c3 48 89 d1 48 c7 c7 50 61 bf 81 48 

[PATCH v5 10/10] bcache: stop bcache device when backing device is offline

2018-02-05 Thread Coly Li
Currently bcache does not handle backing device failure: if the backing
device is offline and disconnected from the system, its bcache device can
still be accessible. If the bcache device is in writeback mode, I/O
requests can even succeed if they hit the cache device. That is to say,
when and how bcache handles an offline backing device is undefined.

This patch tries to handle backing device offline in a rather simple way,
- Add a cached_dev->status_update_thread kernel thread to update the
  backing device status every 1 second.
- Add cached_dev->offline_seconds to record how many seconds the backing
  device has been observed to be offline. If the backing device is offline
  for BACKING_DEV_OFFLINE_TIMEOUT (30) seconds, set dc->io_disable to 1 and
  call bcache_device_stop() to stop the bcache device linked to the
  offline backing device.

Now if a backing device is offline for BACKING_DEV_OFFLINE_TIMEOUT seconds,
its bcache device will be removed, then user space applications writing to
it will get errors immediately, and can handle the device failure in time.

This patch is quite simple and does not handle more complicated situations.
Once the bcache device is stopped, users need to recover the backing
device, then register and attach it manually.

Changelog:
v2: remove "bcache: " prefix when calling pr_warn().
v1: initial version.

Signed-off-by: Coly Li 
Reviewed-by: Hannes Reinecke 
Cc: Michael Lyle 
Cc: Junhui Tang 
---
 drivers/md/bcache/bcache.h |  2 ++
 drivers/md/bcache/super.c  | 55 ++
 2 files changed, 57 insertions(+)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index c59ce168bd82..aa83dd0f682f 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -344,6 +344,7 @@ struct cached_dev {
 
struct keybuf   writeback_keys;
 
+   struct task_struct  *status_update_thread;
/*
 * Order the write-half of writeback operations strongly in dispatch
 * order.  (Maintain LBA order; don't allow reads completing out of
@@ -391,6 +392,7 @@ struct cached_dev {
 #define DEFAULT_CACHED_DEV_ERROR_LIMIT 64
atomic_tio_errors;
unsignederror_limit;
+   unsignedoffline_seconds;
 };
 
 enum alloc_reserve {
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 40b07d980a20..6d672329efce 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -654,6 +654,11 @@ static int ioctl_dev(struct block_device *b, fmode_t mode,
 unsigned int cmd, unsigned long arg)
 {
struct bcache_device *d = b->bd_disk->private_data;
+   struct cached_dev *dc = container_of(d, struct cached_dev, disk);
+
+   if (dc->io_disable)
+   return -EIO;
+
return d->ioctl(d, mode, cmd, arg);
 }
 
@@ -864,6 +869,45 @@ static void calc_cached_dev_sectors(struct cache_set *c)
c->cached_dev_sectors = sectors;
 }
 
+#define BACKING_DEV_OFFLINE_TIMEOUT 5
+static int cached_dev_status_update(void *arg)
+{
+   struct cached_dev *dc = arg;
+   struct request_queue *q;
+   char buf[BDEVNAME_SIZE];
+
+   /*
+* If this delayed worker is stopping outside, directly quit here.
+* dc->io_disable might be set via sysfs interface, so check it
+* here too.
+*/
+   while (!kthread_should_stop() && !dc->io_disable) {
+   q = bdev_get_queue(dc->bdev);
+   if (blk_queue_dying(q))
+   dc->offline_seconds++;
+   else
+   dc->offline_seconds = 0;
+
+   if (dc->offline_seconds >= BACKING_DEV_OFFLINE_TIMEOUT) {
+   pr_err("%s: device offline for %d seconds",
+   bdevname(dc->bdev, buf),
+   BACKING_DEV_OFFLINE_TIMEOUT);
+   pr_err("%s: disable I/O request due to backing "
+   "device offline", dc->disk.name);
+   dc->io_disable = true;
+   /* let others know earlier that io_disable is true */
+   smp_mb();
+   bcache_device_stop(&dc->disk);
+   break;
+   }
+
+   schedule_timeout_interruptible(HZ);
+   }
+
+   dc->status_update_thread = NULL;
+   return 0;
+}
+
 void bch_cached_dev_run(struct cached_dev *dc)
 {
struct bcache_device *d = &dc->disk;
@@ -906,6 +950,15 @@ void bch_cached_dev_run(struct cached_dev *dc)
if (sysfs_create_link(&d->kobj, &disk_to_dev(d->disk)->kobj, "dev") ||
sysfs_create_link(&disk_to_dev(d->disk)->kobj, &d->kobj, "bcache"))
pr_debug("error creating sysfs link");
+
+   dc->status_update_thread = kthread_run(cached_dev_status_update,
+  dc,
+  

[PATCH v5 08/10] bcache: add backing_request_endio() for bi_end_io of attached backing device I/O

2018-02-05 Thread Coly Li
In order to catch I/O errors of the backing device, a separate bi_end_io
callback is required. Then a per-backing-device counter can record the
number of I/O errors and retire the backing device if the counter reaches
a per-backing-device I/O error limit.

This patch adds backing_request_endio() to the bcache backing device I/O
code path as a preparation for more complicated backing device failure
handling. So far there is no real code logic change; I make this change a
separate patch to make sure it is stable and reliable for further work.

Changelog:
v2: Fix code comments typo, remove a redundant bch_writeback_add() line
added in v4 patch set.
v1: indeed this is new added in this patch set.

Signed-off-by: Coly Li 
Reviewed-by: Hannes Reinecke 
Cc: Junhui Tang 
Cc: Michael Lyle 
---
 drivers/md/bcache/request.c   | 93 +++
 drivers/md/bcache/super.c |  1 +
 drivers/md/bcache/writeback.c |  1 +
 3 files changed, 79 insertions(+), 16 deletions(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index e09c5ae745be..9c6dda3b0068 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -139,6 +139,7 @@ static void bch_data_invalidate(struct closure *cl)
}
 
op->insert_data_done = true;
+   /* get in bch_data_insert() */
bio_put(bio);
 out:
continue_at(cl, bch_data_insert_keys, op->wq);
@@ -630,6 +631,38 @@ static void request_endio(struct bio *bio)
closure_put(cl);
 }
 
+static void backing_request_endio(struct bio *bio)
+{
+   struct closure *cl = bio->bi_private;
+
+   if (bio->bi_status) {
+   struct search *s = container_of(cl, struct search, cl);
+   /*
+* If a bio has REQ_PREFLUSH for writeback mode, it is
+* speically assembled in cached_dev_write() for a non-zero
+* write request which has REQ_PREFLUSH. we don't set
+* s->iop.status by this failure, the status will be decided
+* by result of bch_data_insert() operation.
+*/
+   if (unlikely(s->iop.writeback &&
+bio->bi_opf & REQ_PREFLUSH)) {
+   char buf[BDEVNAME_SIZE];
+
+   bio_devname(bio, buf);
+   pr_err("Can't flush %s: returned bi_status %i",
+   buf, bio->bi_status);
+   } else {
+   /* set to orig_bio->bi_status in bio_complete() */
+   s->iop.status = bio->bi_status;
+   }
+   s->recoverable = false;
+   /* should count I/O error for backing device here */
+   }
+
+   bio_put(bio);
+   closure_put(cl);
+}
+
 static void bio_complete(struct search *s)
 {
if (s->orig_bio) {
@@ -644,13 +677,21 @@ static void bio_complete(struct search *s)
}
 }
 
-static void do_bio_hook(struct search *s, struct bio *orig_bio)
+static void do_bio_hook(struct search *s,
+   struct bio *orig_bio,
+   bio_end_io_t *end_io_fn)
 {
struct bio *bio = &s->bio.bio;
 
bio_init(bio, NULL, 0);
__bio_clone_fast(bio, orig_bio);
-   bio->bi_end_io  = request_endio;
+   /*
+* bi_end_io can be set separately somewhere else, e.g. the
+* variants in,
+* - cache_bio->bi_end_io from cached_dev_cache_miss()
+* - n->bi_end_io from cache_lookup_fn()
+*/
+   bio->bi_end_io  = end_io_fn;
bio->bi_private = &s->cl;
 
bio_cnt_set(bio, 3);
@@ -676,7 +717,7 @@ static inline struct search *search_alloc(struct bio *bio,
s = mempool_alloc(d->c->search, GFP_NOIO);
 
closure_init(&s->cl, NULL);
-   do_bio_hook(s, bio);
+   do_bio_hook(s, bio, request_endio);
 
s->orig_bio = bio;
s->cache_miss   = NULL;
@@ -743,10 +784,11 @@ static void cached_dev_read_error(struct closure *cl)
trace_bcache_read_retry(s->orig_bio);
 
s->iop.status = 0;
-   do_bio_hook(s, s->orig_bio);
+   do_bio_hook(s, s->orig_bio, backing_request_endio);
 
/* XXX: invalidate cache */
 
+   /* I/O request sent to backing device */
closure_bio_submit(s->iop.c, bio, cl);
}
 
@@ -859,7 +901,7 @@ static int cached_dev_cache_miss(struct btree *b, struct 
search *s,
bio_copy_dev(cache_bio, miss);
cache_bio->bi_iter.bi_size  = s->insert_bio_sectors << 9;
 
-   cache_bio->bi_end_io= request_endio;
+   cache_bio->bi_end_io= backing_request_endio;
cache_bio->bi_private   = &s->cl;
 
bch_bio_map(cache_bio, NULL);
@@ -872,14 +914,16 @@ static int cached_dev_cache_miss(struct btree *b, struct 
search *s,

[PATCH v5 05/10] bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags

2018-02-05 Thread Coly Li
When too many I/Os fail on the cache device, bch_cache_set_error() is
called in the error handling code path to retire the whole problematic
cache set. If new I/O requests continue to come in and take the refcount
dc->count, the cache set won't be retired immediately, which is a problem.

Furthermore, several kernel threads and self-armed kernel workers may
still be running after bch_cache_set_error() is called. It takes quite a
while for them to stop, or they won't stop at all. They also prevent the
cache set from being retired.

The solution in this patch is to add a per-cache-set flag to disable I/O
requests on this cache set and all attached backing devices. Then new
incoming I/O requests can be rejected in *_make_request() before taking
the refcount, and kernel threads and self-armed kernel workers can stop
very fast when the flag bit CACHE_SET_IO_DISABLE is set.

Because bcache also does internal I/Os for writeback, garbage collection,
bucket allocation and journaling, this kind of I/O should be disabled
after bch_cache_set_error() is called. So closure_bio_submit() is modified
to check whether CACHE_SET_IO_DISABLE is set on cache_set->flags. If set,
closure_bio_submit() will set bio->bi_status to BLK_STS_IOERR and return;
generic_make_request() won't be called.

A sysfs interface is also added to set or clear the CACHE_SET_IO_DISABLE
bit in cache_set->flags, to disable or enable cache set I/O for debugging.
It is helpful for triggering more corner case issues with a failed cache
device.
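
The sysfs knob mentioned above (its sysfs.c hunk is not shown here) is
roughly the following sketch; the attribute name io_disable is an
assumption based on the description:

        /* inside the cache set's STORE() handler */
        if (attr == &sysfs_io_disable) {
                int v = strtoul_or_return(buf);

                if (v) {
                        if (test_and_set_bit(CACHE_SET_IO_DISABLE, &c->flags))
                                pr_warn("CACHE_SET_IO_DISABLE already set");
                } else {
                        if (!test_and_clear_bit(CACHE_SET_IO_DISABLE,
                                                &c->flags))
                                pr_warn("CACHE_SET_IO_DISABLE already cleared");
                }
        }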

Changelog
v3, change CACHE_SET_IO_DISABLE from 4 to 3, since it is bit index.
remove "bcache: " prefix when printing out kernel message.
v2, more changes by previous review,
- Use CACHE_SET_IO_DISABLE of cache_set->flags, suggested by Junhui.
- Check CACHE_SET_IO_DISABLE in bch_btree_gc() to stop a while-loop, this
  is reported by and inspired from the original patch of Pavel Vazharov.
v1, initial version.

Signed-off-by: Coly Li 
Reviewed-by: Hannes Reinecke 
Cc: Junhui Tang 
Cc: Michael Lyle 
Cc: Pavel Vazharov 
---
 drivers/md/bcache/alloc.c |  3 ++-
 drivers/md/bcache/bcache.h| 18 ++
 drivers/md/bcache/btree.c | 10 +++---
 drivers/md/bcache/io.c|  2 +-
 drivers/md/bcache/journal.c   |  4 ++--
 drivers/md/bcache/request.c   | 26 +++---
 drivers/md/bcache/super.c |  6 +-
 drivers/md/bcache/sysfs.c | 20 
 drivers/md/bcache/util.h  |  6 --
 drivers/md/bcache/writeback.c | 35 +++
 10 files changed, 101 insertions(+), 29 deletions(-)

diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index 458e1d38577d..004cc3cc6123 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -287,7 +287,8 @@ do {
\
break;  \
\
mutex_unlock(&(ca)->set->bucket_lock);  \
-   if (kthread_should_stop()) {\
+   if (kthread_should_stop() ||\
+   test_bit(CACHE_SET_IO_DISABLE, &ca->set->flags)) {  \
set_current_state(TASK_RUNNING);\
return 0;   \
}   \
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 0380626bf525..7917b3820dd5 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -475,10 +475,15 @@ struct gc_stat {
  *
  * CACHE_SET_RUNNING means all cache devices have been registered and journal
  * replay is complete.
+ *
+ * CACHE_SET_IO_DISABLE is set when bcache is stopping the whole cache set, all
+ * external and internal I/O should be denied when this flag is set.
+ *
  */
 #define CACHE_SET_UNREGISTERING0
 #defineCACHE_SET_STOPPING  1
 #defineCACHE_SET_RUNNING   2
+#define CACHE_SET_IO_DISABLE   3
 
 struct cache_set {
struct closure  cl;
@@ -868,6 +873,19 @@ static inline void wake_up_allocators(struct cache_set *c)
wake_up_process(ca->alloc_thread);
 }
 
+static inline void closure_bio_submit(struct cache_set *c,
+ struct bio *bio,
+ struct closure *cl)
+{
+   closure_get(cl);
+   if (unlikely(test_bit(CACHE_SET_IO_DISABLE, &c->flags))) {
+   bio->bi_status = BLK_STS_IOERR;
+   bio_endio(bio);
+   return;
+   }
+   generic_make_request(bio);
+}
+
 /* Forward declarations */
 
 void bch_count_io_errors(struct cache *, blk_status_t, int, const char *);
diff --git 

[PATCH v5 07/10] bcache: fix inaccurate io state for detached bcache devices

2018-02-05 Thread Coly Li
From: Tang Junhui 

When we run IO on a detached device and run iostat to show the IO status,
normally it will show something like below (some fields omitted):
Device:  ... avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdd      ...    15.89     0.53    1.82    0.20    2.23   1.81  52.30
bcache0  ...    15.89   115.42    0.00    0.00    0.00   2.40  69.60
but after IO has stopped, there are still very big avgqu-sz and %util
values, as below:
Device:  ... avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
bcache0  ...        0  5326.32    0.00    0.00    0.00   0.00 100.10

The reason for this issue is that only generic_start_io_acct() is called,
and generic_end_io_acct() is never called, for a detached device in
cached_dev_make_request(). See the code:
//start generic_start_io_acct()
generic_start_io_acct(q, rw, bio_sectors(bio), &d->disk->part0);
if (cached_dev_get(dc)) {
//will callback generic_end_io_acct()
}
else {
//will not call generic_end_io_acct()
}

This patch calls generic_end_io_acct() at the end of IO for detached
devices, so the IO state is shown correctly.

(Modified to use GFP_NOIO in kzalloc() by Coly Li)

Signed-off-by: Tang Junhui 
Reviewed-by: Coly Li 
Reviewed-by: Hannes Reinecke 
---
 drivers/md/bcache/request.c | 58 +++--
 1 file changed, 51 insertions(+), 7 deletions(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 02296bda6384..e09c5ae745be 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -986,6 +986,55 @@ static void cached_dev_nodata(struct closure *cl)
continue_at(cl, cached_dev_bio_complete, NULL);
 }
 
+struct detached_dev_io_private {
+   struct bcache_device*d;
+   unsigned long   start_time;
+   bio_end_io_t*bi_end_io;
+   void*bi_private;
+};
+
+static void detatched_dev_end_io(struct bio *bio)
+{
+   struct detached_dev_io_private *ddip;
+
+   ddip = bio->bi_private;
+   bio->bi_end_io = ddip->bi_end_io;
+   bio->bi_private = ddip->bi_private;
+
+   generic_end_io_acct(ddip->d->disk->queue,
+   bio_data_dir(bio),
+   &ddip->d->disk->part0, ddip->start_time);
+
+   kfree(ddip);
+
+   bio->bi_end_io(bio);
+}
+
+static void detached_dev_do_request(struct bcache_device *d, struct bio *bio)
+{
+   struct detached_dev_io_private *ddip;
+   struct cached_dev *dc = container_of(d, struct cached_dev, disk);
+
+   /*
+* no need to call closure_get(&dc->disk.cl),
+* because upper layer had already opened bcache device,
+* which would call closure_get(&dc->disk.cl)
+*/
+   ddip = kzalloc(sizeof(struct detached_dev_io_private), GFP_NOIO);
+   ddip->d = d;
+   ddip->start_time = jiffies;
+   ddip->bi_end_io = bio->bi_end_io;
+   ddip->bi_private = bio->bi_private;
+   bio->bi_end_io = detatched_dev_end_io;
+   bio->bi_private = ddip;
+
+   if ((bio_op(bio) == REQ_OP_DISCARD) &&
+   !blk_queue_discard(bdev_get_queue(dc->bdev)))
+   bio->bi_end_io(bio);
+   else
+   generic_make_request(bio);
+}
+
 /* Cached devices - read & write stuff */
 
 static blk_qc_t cached_dev_make_request(struct request_queue *q,
@@ -1028,13 +1077,8 @@ static blk_qc_t cached_dev_make_request(struct 
request_queue *q,
else
cached_dev_read(dc, s);
}
-   } else {
-   if ((bio_op(bio) == REQ_OP_DISCARD) &&
-   !blk_queue_discard(bdev_get_queue(dc->bdev)))
-   bio_endio(bio);
-   else
-   generic_make_request(bio);
-   }
+   } else
+   detached_dev_do_request(d, bio);
 
return BLK_QC_T_NONE;
 }
-- 
2.16.1



[PATCH v5 02/10] bcache: fix cached_dev->count usage for bch_cache_set_error()

2018-02-05 Thread Coly Li
When bcache metadata I/O fails, bcache will call bch_cache_set_error()
to retire the whole cache set. The expected behavior when retiring a cache
set is to unregister the cache set, unregister all backing devices
attached to this cache set, then remove the sysfs entries of the cache set
and all attached backing devices, and finally release the memory of structs
cache_set, cache, cached_dev and bcache_device.

In my testing, when a journal I/O failure is triggered by a disconnected
cache device, sometimes the cache set cannot be retired, and its sysfs
entry /sys/fs/bcache/ still exists and the backing device also
references it. This is not expected behavior.

When metadata I/O fails, the call sequence to retire the whole cache set is,
bch_cache_set_error()
bch_cache_set_unregister()
bch_cache_set_stop()
__cache_set_unregister() <- called as callback by calling
closure_queue(&c->caching)
cache_set_flush()        <- called as a callback when refcount
of cache_set->caching is 0
cache_set_free()         <- called as a callback when refcount
of cache_set->cl is 0
bch_cache_set_release()  <- called as a callback when refcount
of cache_set->kobj is 0

I find that if the kernel thread bch_writeback_thread() quits its
while-loop when kthread_should_stop() is true and searched_full_index is
false, the closure callback cache_set_flush() set by continue_at() will
never be called. The result is that bcache fails to retire the whole cache
set.

cache_set_flush() will be called when the refcount of closure c->caching
is 0, and in function bcache_device_detach() the refcount of closure
c->caching is released to 0 by closure_put(). In the metadata error code
path, function bcache_device_detach() is called by
cached_dev_detach_finish(). This is a callback routine being called when
cached_dev->count is 0. This refcount is decreased by cached_dev_put().

The above dependence indicates that cache_set_flush() will be called when
the refcount of cache_set->cl is 0, and the refcount of cache_set->cl
becomes 0 when the refcount of cached_dev->count is 0.

The reason why sometimes cached_dev->count is not 0 (when metadata I/O
fails and bch_cache_set_error() is called) is that in
bch_writeback_thread() the refcount of cached_dev is not decreased
properly.

In bch_writeback_thread(), cached_dev_put() is called only when
searched_full_index is true and cached_dev->writeback_keys is empty, a.k.a.
there is no dirty data on the cache. Most of the time this is correct, but
when bch_writeback_thread() quits the while-loop while the cache is still
dirty, the current code forgets to call cached_dev_put() before this kernel
thread exits. This is why sometimes cache_set_flush() is not executed and
the cache set fails to be retired.

The reason cached_dev_put() is called in bch_writeback_rate() is that when
the cache device changes from clean to dirty, cached_dev_get() is called,
to make sure that during writeback operations both the backing and cache
devices won't be released.

Adding following code in bch_writeback_thread() does not work,
   static int bch_writeback_thread(void *arg)
}

+   if (atomic_read(&dc->has_dirty))
+   cached_dev_put()
+
return 0;
 }
because the writeback kernel thread can be woken up and started via the sysfs
entry:
echo 1 > /sys/block/bcache/bcache/writeback_running
It is difficult to check whether the backing device is dirty without races
and extra locks, so the above modification would introduce a potential
refcount underflow in some conditions.

The correct fix is to take the cached dev refcount when creating the kernel
thread, and to put it before the kernel thread exits. Then bcache does not
need to take a cached dev refcount when the cache turns from clean to dirty,
or to put a cached dev refcount when the cache turns from dirty to clean.
The writeback kernel thread is always safe to reference data structures from
the cache set, cache and cached device (because a refcount of the cached
device is already taken for it), and no matter whether the kernel thread is
stopped by I/O errors or a system reboot, cached_dev->count can always be
used correctly.
The patch is simple, but understanding how it works is quite complicated.

Changelog:
v2: set dc->writeback_thread to NULL in this patch, as suggested by Hannes.
v1: initial version for review.

Signed-off-by: Coly Li 
Reviewed-by: Hannes Reinecke 
Cc: Michael Lyle 
Cc: Junhui Tang 
---
 drivers/md/bcache/super.c |  1 -
 drivers/md/bcache/writeback.c | 11 ---
 drivers/md/bcache/writeback.h |  2 --
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index a2ad37a8afc0..7d96dc6860fa 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1052,7 +1052,6 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct 

[PATCH v5 03/10] bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set

2018-02-05 Thread Coly Li
In patch "bcache: fix cached_dev->count usage for bch_cache_set_error()",
cached_dev_get() is called when creating dc->writeback_thread, and
cached_dev_put() is called when exiting dc->writeback_thread. This
modification works well unless people detach the bcache device manually by
'echo 1 > /sys/block/bcache/bcache/detach'
Because this sysfs interface only calls bch_cached_dev_detach() which wakes
up dc->writeback_thread but does not stop it. The reason is, before patch
"bcache: fix cached_dev->count usage for bch_cache_set_error()", inside
bch_writeback_thread(), if cache is not dirty after writeback,
cached_dev_put() will be called here. And in cached_dev_make_request() when
a new write request makes cache from clean to dirty, cached_dev_get() will
be called there. Since we don't operate dc->count in these locations,
refcount d->count cannot be dropped after cache becomes clean, and
cached_dev_detach_finish() won't be called to detach bcache device.

This patch fixes the issue by checking whether BCACHE_DEV_DETACHING is
set inside bch_writeback_thread(). If this bit is set and cache is clean
(no existing writeback_keys), break the while-loop, call cached_dev_put()
and quit the writeback thread.

Please note if cache is still dirty, even BCACHE_DEV_DETACHING is set the
writeback thread should continue to perform writeback, this is the original
design of manually detach.

It is safe to do the following check without locking, let me explain why,
+   if (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
+   (!atomic_read(&dc->has_dirty) || !dc->writeback_running)) {

If the kernel thread does not sleep and continues to run because the
conditions are not updated in time on the running CPU core, it just
consumes more CPU cycles and does no harm. This should-sleep-but-run case
is safe here. We just focus on the should-run-but-sleep condition, which
means the writeback thread goes to sleep by mistake while it should
continue to run.
1, First of all, no matter whether the writeback thread is hung or not,
   kthread_stop() from cached_dev_detach_finish() will wake it up and
   terminate it by making kthread_should_stop() return true. And in normal
   run time, the bit at index BCACHE_DEV_DETACHING is always cleared, so
   the condition
!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags)
   is always true and can be ignored as a constant value.
2, If one of the following conditions is true, the writeback thread should
   go to sleep,
   "!atomic_read(&dc->has_dirty)" or "!dc->writeback_running)"
   each of them independently controls whether the writeback thread should
   sleep or not; let's analyse them one by one.
2.1 condition "!atomic_read(&dc->has_dirty)"
   If dc->has_dirty is set from 0 to 1 on another CPU core, bcache will
   call bch_writeback_queue() immediately or call bch_writeback_add() which
   indirectly calls bch_writeback_queue() too. In bch_writeback_queue(),
   wake_up_process(dc->writeback_thread) is called. It sets writeback
   thread's task state to TASK_RUNNING and following an implicit memory
   barrier, then tries to wake up the writeback thread.
   In writeback thread, its task state is set to TASK_INTERRUPTIBLE before
   doing the condition check. If other CPU core sets the TASK_RUNNING state
   after writeback thread setting TASK_INTERRUPTIBLE, the writeback thread
   will be scheduled to run very soon because its state is not
   TASK_INTERRUPTIBLE. If other CPU core sets the TASK_RUNNING state before
   writeback thread setting TASK_INTERRUPTIBLE, the implicit memory barrier
   of wake_up_process() will make sure modification of dc->has_dirty on
   other CPU core is updated and observed on the CPU core of writeback
   thread. Therefore the condition check will correctly be false, and
   continue writeback code without sleeping.
2.2 condition "!dc->writeback_running)"
   dc->writeback_running can be changed via sysfs file, every time it is
   modified, a following bch_writeback_queue() is always called. So the
   change is always observed on the CPU core of writeback thread. If
   dc->writeback_running is changed from 0 to 1 on other CPU core, this
   condition check will observe the modification and allow writeback
   thread to continue to run without sleeping.
Now we can see that even without locking protection, this multiple-condition
check is safe here; no deadlock or process hang-up will happen.
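
Putting the pieces together, the loop head looks roughly like the sketch
below (based on the quoted condition; the diff itself is truncated in this
mail and other parts of the loop are elided):

        while (!kthread_should_stop()) {
                down_write(&dc->writeback_lock);
                set_current_state(TASK_INTERRUPTIBLE);

                if (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
                    (!atomic_read(&dc->has_dirty) ||
                     !dc->writeback_running)) {
                        up_write(&dc->writeback_lock);

                        if (kthread_should_stop()) {
                                set_current_state(TASK_RUNNING);
                                break;
                        }

                        schedule();
                        continue;
                }
                set_current_state(TASK_RUNNING);

                /* ... perform one round of writeback ... */

                up_write(&dc->writeback_lock);
        }

        /* quitting: drop the refcount taken when the thread was created */
        cached_dev_put(dc);
        return 0;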

I composed a separate patch because the patch "bcache: fix cached_dev->count
usage for bch_cache_set_error()" already got a "Reviewed-by:" from Hannes
Reinecke. Also, this fix is not trivial and deserves a separate patch.

Signed-off-by: Coly Li 
Cc: Michael Lyle 
Cc: Hannes Reinecke 
Cc: Huijun Tang 
---
 drivers/md/bcache/writeback.c | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index b280c134dd4d..4dbeaaa575bf 100644
--- a/drivers/md/bcache/writeback.c
+++ 

[PATCH v5 04/10] bcache: stop dc->writeback_rate_update properly

2018-02-05 Thread Coly Li
struct delayed_work writeback_rate_update in struct cached_dev is a delayed
worker that calls update_writeback_rate() periodically (the interval is
defined by dc->writeback_rate_update_seconds).

When a metadata I/O error happens on the cache device, the bcache error
handling routine bch_cache_set_error() will call bch_cache_set_unregister()
to retire the whole cache set. On the unregister code path, this delayed
work is stopped by calling
cancel_delayed_work_sync(&dc->writeback_rate_update).

dc->writeback_rate_update is a delayed work unlike the others in bcache.
In its routine update_writeback_rate(), this delayed work re-arms itself.
That means that when cancel_delayed_work_sync() returns, this delayed
work can still be executed several seconds later, as defined by
dc->writeback_rate_update_seconds.

The problem is, after cancel_delayed_work_sync() returns, the cache set
unregister code path will continue and release the memory of struct
cache_set. Then, when the delayed work is scheduled to run,
__update_writeback_rate() will reference the already released cache_set
memory and trigger a NULL pointer dereference fault.

This patch introduces two more bcache device flags,
- BCACHE_DEV_WB_RUNNING
  bit set:  bcache device is in writeback mode and running, it is OK for
dc->writeback_rate_update to re-arm itself.
  bit clear:bcache device is trying to stop dc->writeback_rate_update,
this delayed work should not re-arm itself and quit.
- BCACHE_DEV_RATE_DW_RUNNING
  bit set:  routine update_writeback_rate() is executing.
  bit clear: routine update_writeback_rate() quits.

This patch also adds a function cancel_writeback_rate_update_dwork() to
wait for dc->writeback_rate_update to quit before canceling it by calling
cancel_delayed_work_sync(). In order to avoid a deadlock when
dc->writeback_rate_update unexpectedly does not quit, after time_out
seconds this function will give up and continue to call
cancel_delayed_work_sync().

And here I explain how this patch stops the self re-arming delayed work
properly with the above pieces.

update_writeback_rate() sets BCACHE_DEV_RATE_DW_RUNNING at its beginning
and clears BCACHE_DEV_RATE_DW_RUNNING at its end. Before calling
cancel_writeback_rate_update_dwork() clear flag BCACHE_DEV_WB_RUNNING.

Before calling cancel_delayed_work_sync(), wait until flag
BCACHE_DEV_RATE_DW_RUNNING is clear. So when calling
cancel_delayed_work_sync(), dc->writeback_rate_update must either already
be re-armed, or have quit after seeing BCACHE_DEV_WB_RUNNING cleared. In
both cases the delayed work routine update_writeback_rate() won't be
executed after cancel_delayed_work_sync() returns.

Inside update_writeback_rate(), flag BCACHE_DEV_WB_RUNNING is checked
before calling schedule_delayed_work(). If this flag is cleared, it means
someone is about to stop the delayed work. Because flag
BCACHE_DEV_RATE_DW_RUNNING is already set and cancel_delayed_work_sync()
has to wait for this flag to be cleared, we don't need to worry about a
race condition here.

If update_writeback_rate() is scheduled to run after checking
BCACHE_DEV_RATE_DW_RUNNING and before calling cancel_delayed_work_sync()
in cancel_writeback_rate_update_dwork(), it is also safe. Because at this
moment BCACHE_DEV_WB_RUNNING is cleared with memory barrier. As I mentioned
previously, update_writeback_rate() will see BCACHE_DEV_WB_RUNNING is clear
and quit immediately.

Because update_writeback_rate() has more dependencies on struct cache_set
memory, dc->writeback_rate_update is not a simple self re-arming delayed
work. After trying many different methods (e.g. holding dc->count, or
using locks), this is the only way I could find that works to properly
stop the dc->writeback_rate_update delayed work.
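
A sketch of the scheme described above (function bodies reconstructed from
this description, not copied from the diff, which is truncated below):

static void update_writeback_rate(struct work_struct *work)
{
        struct cached_dev *dc = container_of(to_delayed_work(work),
                                             struct cached_dev,
                                             writeback_rate_update);

        set_bit(BCACHE_DEV_RATE_DW_RUNNING, &dc->disk.flags);
        smp_mb__after_atomic();

        /* quit immediately if someone is stopping this delayed work */
        if (!test_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags)) {
                clear_bit(BCACHE_DEV_RATE_DW_RUNNING, &dc->disk.flags);
                smp_mb__after_atomic();
                return;
        }

        /* ... recompute dc->writeback_rate here ... */

        /* re-arm only while writeback is still allowed to run */
        if (test_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags))
                schedule_delayed_work(&dc->writeback_rate_update,
                                      dc->writeback_rate_update_seconds * HZ);

        clear_bit(BCACHE_DEV_RATE_DW_RUNNING, &dc->disk.flags);
        smp_mb__after_atomic();
}

static void cancel_writeback_rate_update_dwork(struct cached_dev *dc)
{
        int time_out = WRITEBACK_RATE_UPDATE_SECS_MAX * HZ;

        /* tell the delayed work not to re-arm itself any more */
        clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags);
        smp_mb__after_atomic();

        /* wait for a running update_writeback_rate() invocation to finish */
        while (test_bit(BCACHE_DEV_RATE_DW_RUNNING, &dc->disk.flags) &&
               time_out > 0) {
                time_out--;
                schedule_timeout_interruptible(1);
        }

        if (time_out == 0)
                pr_warn("give up waiting for writeback_rate_update to quit");

        cancel_delayed_work_sync(&dc->writeback_rate_update);
}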

Changelog:
v3: change values of BCACHE_DEV_WB_RUNNING and BCACHE_DEV_RATE_DW_RUNNING
to bit index, for test_bit().
v2: Try to fix the race issue which is pointed out by Junhui.
v1: The initial version for review

Signed-off-by: Coly Li 
Reviewed-by: Junhui Tang 
Cc: Michael Lyle 
Cc: Hannes Reinecke 
---
 drivers/md/bcache/bcache.h|  9 +
 drivers/md/bcache/super.c | 39 +++
 drivers/md/bcache/sysfs.c |  3 ++-
 drivers/md/bcache/writeback.c | 29 -
 4 files changed, 70 insertions(+), 10 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index b8c2e1bef1f1..0380626bf525 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -258,10 +258,11 @@ struct bcache_device {
struct gendisk  *disk;
 
unsigned long   flags;
-#define BCACHE_DEV_CLOSING 0
-#define BCACHE_DEV_DETACHING   1
-#define BCACHE_DEV_UNLINK_DONE 2
-
+#define BCACHE_DEV_CLOSING 0
+#define BCACHE_DEV_DETACHING   1
+#define BCACHE_DEV_UNLINK_DONE 2
+#define BCACHE_DEV_WB_RUNNING  3
+#define BCACHE_DEV_RATE_DW_RUNNING 4
unsigned   

[PATCH v5 01/10] bcache: set writeback_rate_update_seconds in range [1, 60] seconds

2018-02-05 Thread Coly Li
dc->writeback_rate_update_seconds can be set via sysfs and its value can
be set to [1, ULONG_MAX].  It does not make sense to set such a large
value; 60 seconds is long enough, considering that the default 5 seconds
has worked well for a long time.

Because dc->writeback_rate_update is a special delayed work, it re-arms
itself inside the delayed work routine update_writeback_rate(). When
stopping it by cancel_delayed_work_sync(), there should be a timeout to
wait and make sure the re-armed delayed work is stopped too. A small max
value of dc->writeback_rate_update_seconds is also helpful to decide a
reasonable small timeout.

This patch limits sysfs interface to set dc->writeback_rate_update_seconds
in range of [1, 60] seconds, and replaces the hand-coded number by macros.

Changelog:
v2: fix a rebase typo in v4, which is pointed out by Michael Lyle.
v1: initial version.

Signed-off-by: Coly Li 
Reviewed-by: Hannes Reinecke 
Cc: Michael Lyle 
---
 drivers/md/bcache/sysfs.c | 4 +++-
 drivers/md/bcache/writeback.c | 2 +-
 drivers/md/bcache/writeback.h | 3 +++
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index c524305cc9a7..4a6a697e1680 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -218,7 +218,9 @@ STORE(__cached_dev)
sysfs_strtoul_clamp(writeback_rate,
dc->writeback_rate.rate, 1, INT_MAX);
 
-   d_strtoul_nonzero(writeback_rate_update_seconds);
+   sysfs_strtoul_clamp(writeback_rate_update_seconds,
+   dc->writeback_rate_update_seconds,
+   1, WRITEBACK_RATE_UPDATE_SECS_MAX);
d_strtoul(writeback_rate_i_term_inverse);
d_strtoul_nonzero(writeback_rate_p_term_inverse);
 
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 58218f7e77c3..f1d2fc15abcc 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -655,7 +655,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
dc->writeback_rate.rate = 1024;
dc->writeback_rate_minimum  = 8;
 
-   dc->writeback_rate_update_seconds = 5;
+   dc->writeback_rate_update_seconds = WRITEBACK_RATE_UPDATE_SECS_DEFAULT;
dc->writeback_rate_p_term_inverse = 40;
dc->writeback_rate_i_term_inverse = 1;
 
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 66f1c527fa24..587b25599856 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -8,6 +8,9 @@
 #define MAX_WRITEBACKS_IN_PASS  5
 #define MAX_WRITESIZE_IN_PASS   5000   /* *512b */
 
+#define WRITEBACK_RATE_UPDATE_SECS_MAX 60
+#define WRITEBACK_RATE_UPDATE_SECS_DEFAULT 5
+
 /*
  * 14 (16384ths) is chosen here as something that each backing device
  * should be a reasonable fraction of the share, and not to blow up
-- 
2.16.1



[PATCH v5 00/10] bcache: device failure handling improvement

2018-02-05 Thread Coly Li
Hi maintainers and folks,

This patch set tries to improve bcache device failure handling, includes
cache device and backing device failures.

The basic idea to handle a failed cache device is,
- Unregister the cache set
- Detach all backing devices which are attached to this cache set
- Stop all the detached bcache devices (configurable)
- Stop all flash only volumes on the cache set
The above process is named 'cache set retire' by me. The result of cache
set retire is that the cache set and its bcache devices are all removed
(configurable); subsequent I/O requests will fail immediately to notify
the upper layer or user space code that the cache device has failed or is
disconnected.

There are 2 patches from the v4 patch set that have been merged into
bcache-for-next; they are not in the v5 patch set any more.

The v5 patch set adds a new patch "bcache: add stop_when_cache_set_failed
option to backing device", which provides "auto"/"always" options to
configure whether or not to stop the bcache devices for a broken cache set.
The patch "bcache: stop all attached bcache devices for a retired cache
set" from the v4 patch set is replaced by this newly added patch.

Most of the patches are reviewed by Hannes Reinecke and Junhui Tang. There
are still several patches that need to be reviewed,
- [PATCH v5 03/10] bcache: quit dc->writeback_thread when
  BCACHE_DEV_DETACHING is set
- [PATCH v5 06/10] bcache: add stop_when_cache_set_failed option to
   backing device 

Any comment, question and review are warmly welcome. Thanks in advance.

Changelog:
v5: replace patch "bcache: stop all attached bcache devices for a retired
cache set" from v4 patch set by "bcache: add stop_when_cache_set_failed
option to backing device" from v5 patch set. 
fix issues from v4 patch set.
improve kernel message format, remove redundant prefix string.
v4: add per-cached_dev option stop_attached_devs_on_fail to avoid stopping
attached bcache device from a retiring cache set.
v3: fix detach issue found in v2 patch set.
v2: fixes all problems found in v1 review.
add patches to handle backing device failure.
add one more patch to set writeback_rate_update_seconds range.
include a patch from Junhui Tang.
v1: the initial version, only handles cache device failure.

Coly Li


Coly Li (10):
  bcache: set writeback_rate_update_seconds in range [1, 60] seconds
  bcache: fix cached_dev->count usage for bch_cache_set_error()
  bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set
  bcache: stop dc->writeback_rate_update properly
  bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags
  bcache: stop all attached bcache devices for a retired cache set
  bcache: add backing_request_endio() for bi_end_io of attached backing
device I/O
  bcache: add io_disable to struct cached_dev
  bcache: stop bcache device when backing device is offline
  bcache: add stop_when_cache_set_failed option to backing device

Tang Junhui (1):
  bcache: fix inaccurate io state for detached bcache devices

 drivers/md/bcache/alloc.c |   3 +-
 drivers/md/bcache/bcache.h|  44 -
 drivers/md/bcache/btree.c |  10 +-
 drivers/md/bcache/io.c|  16 +++-
 drivers/md/bcache/journal.c   |   4 +-
 drivers/md/bcache/request.c   | 185 +++--
 drivers/md/bcache/super.c | 206 ++
 drivers/md/bcache/sysfs.c |  59 +++-
 drivers/md/bcache/util.h  |   6 --
 drivers/md/bcache/writeback.c |  94 ---
 drivers/md/bcache/writeback.h |   5 +-
 11 files changed, 551 insertions(+), 81 deletions(-)

-- 
2.16.1



[PATCH 1/2] char_dev: Fix off-by-one bugs in find_dynamic_major()

2018-02-05 Thread Srivatsa S. Bhat
From: Srivatsa S. Bhat 

CHRDEV_MAJOR_DYN_END and CHRDEV_MAJOR_DYN_EXT_END are valid major
numbers. So fix the loop iteration to include them in the search for
free major numbers.

While at it, also remove a redundant if condition ("cd->major != i"),
as it will never be true.

Signed-off-by: Srivatsa S. Bhat 
---

 fs/char_dev.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/char_dev.c b/fs/char_dev.c
index a65e4a5..33c9385 100644
--- a/fs/char_dev.c
+++ b/fs/char_dev.c
@@ -67,18 +67,18 @@ static int find_dynamic_major(void)
int i;
struct char_device_struct *cd;
 
-   for (i = ARRAY_SIZE(chrdevs)-1; i > CHRDEV_MAJOR_DYN_END; i--) {
+   for (i = ARRAY_SIZE(chrdevs)-1; i >= CHRDEV_MAJOR_DYN_END; i--) {
if (chrdevs[i] == NULL)
return i;
}
 
for (i = CHRDEV_MAJOR_DYN_EXT_START;
-i > CHRDEV_MAJOR_DYN_EXT_END; i--) {
+i >= CHRDEV_MAJOR_DYN_EXT_END; i--) {
for (cd = chrdevs[major_to_index(i)]; cd; cd = cd->next)
if (cd->major == i)
break;
 
-   if (cd == NULL || cd->major != i)
+   if (cd == NULL)
return i;
}
 



[PATCH 2/2] block, char_dev: Use correct format specifier for unsigned ints

2018-02-05 Thread Srivatsa S. Bhat
From: Srivatsa S. Bhat 

register_blkdev() and __register_chrdev_region() treat the major
number as an unsigned int. So print it the same way to avoid
absurd error statements such as:
"... major requested (-1) is greater than the maximum (511) ..."
(and also fix off-by-one bugs in the error prints).

While at it, also update the comment describing register_blkdev().

Signed-off-by: Srivatsa S. Bhat 
---

 block/genhd.c |   19 +++
 fs/char_dev.c |4 ++--
 2 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 88a53c1..21a18e5 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -308,19 +308,22 @@ void blkdev_show(struct seq_file *seqf, off_t offset)
 /**
  * register_blkdev - register a new block device
  *
- * @major: the requested major device number [1..255]. If @major = 0, try to
- * allocate any unused major number.
+ * @major: the requested major device number [1..BLKDEV_MAJOR_MAX-1]. If
+ * @major = 0, try to allocate any unused major number.
  * @name: the name of the new block device as a zero terminated string
  *
  * The @name must be unique within the system.
  *
  * The return value depends on the @major input parameter:
  *
- *  - if a major device number was requested in range [1..255] then the
- *function returns zero on success, or a negative error code
+ *  - if a major device number was requested in range [1..BLKDEV_MAJOR_MAX-1]
+ *then the function returns zero on success, or a negative error code
  *  - if any unused major number was requested with @major = 0 parameter
  *then the return value is the allocated major number in range
- *[1..255] or a negative error code otherwise
+ *[1..BLKDEV_MAJOR_MAX-1] or a negative error code otherwise
+ *
+ * See Documentation/admin-guide/devices.txt for the list of allocated
+ * major numbers.
  */
 int register_blkdev(unsigned int major, const char *name)
 {
@@ -347,8 +350,8 @@ int register_blkdev(unsigned int major, const char *name)
}
 
if (major >= BLKDEV_MAJOR_MAX) {
-   pr_err("register_blkdev: major requested (%d) is greater than 
the maximum (%d) for %s\n",
-  major, BLKDEV_MAJOR_MAX, name);
+   pr_err("register_blkdev: major requested (%u) is greater than 
the maximum (%u) for %s\n",
+  major, BLKDEV_MAJOR_MAX-1, name);
 
ret = -EINVAL;
goto out;
@@ -375,7 +378,7 @@ int register_blkdev(unsigned int major, const char *name)
ret = -EBUSY;
 
if (ret < 0) {
-   printk("register_blkdev: cannot get major %d for %s\n",
+   printk("register_blkdev: cannot get major %u for %s\n",
   major, name);
kfree(p);
}
diff --git a/fs/char_dev.c b/fs/char_dev.c
index 33c9385..a279c58 100644
--- a/fs/char_dev.c
+++ b/fs/char_dev.c
@@ -121,8 +121,8 @@ __register_chrdev_region(unsigned int major, unsigned int 
baseminor,
}
 
if (major >= CHRDEV_MAJOR_MAX) {
-   pr_err("CHRDEV \"%s\" major requested (%d) is greater than the 
maximum (%d)\n",
-  name, major, CHRDEV_MAJOR_MAX);
+   pr_err("CHRDEV \"%s\" major requested (%u) is greater than the 
maximum (%u)\n",
+  name, major, CHRDEV_MAJOR_MAX-1);
ret = -EINVAL;
goto out;
}



Re: [PATCH V2 8/8] scsi: hpsa: use blk_mq to solve irq affinity issue

2018-02-05 Thread chenxiang (M)

On 2018/2/5 23:20, Ming Lei wrote:

This patch uses .force_blk_mq to drive HPSA via SCSI_MQ and, at the same
time, maps each reply queue to a blk_mq hw queue, so that .queuecommand can
always choose the hw queue as the reply queue. And if no online CPU is
mapped to a hw queue, requests can't be submitted to that hw queue
at all, so the irq affinity issue is finally solved.

Cc: Hannes Reinecke 
Cc: Arun Easi 
Cc: Omar Sandoval ,
Cc: "Martin K. Petersen" ,
Cc: James Bottomley ,
Cc: Christoph Hellwig ,
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Peter Rivera 
Cc: Paolo Bonzini 
Cc: Mike Snitzer 
Tested-by: Laurence Oberman 
Signed-off-by: Ming Lei 
---
  drivers/scsi/hpsa.c | 51 ++-
  1 file changed, 34 insertions(+), 17 deletions(-)

diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
index 443eabf63a9f..e517a4c74a28 100644
--- a/drivers/scsi/hpsa.c
+++ b/drivers/scsi/hpsa.c
@@ -51,6 +51,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include "hpsa_cmd.h"
@@ -956,6 +957,13 @@ static struct device_attribute *hpsa_shost_attrs[] = {
  #define HPSA_NRESERVED_CMDS   (HPSA_CMDS_RESERVED_FOR_DRIVER +\
 HPSA_MAX_CONCURRENT_PASSTHRUS)
  
+static int hpsa_map_queues(struct Scsi_Host *shost)

+{
+	struct ctlr_info *h = shost_to_hba(shost);
+
+	return blk_mq_pci_map_queues(&h->tag_set, h->pdev);
+}
+


Hi Lei Ming,
It is okay to use blk_mq_pci_map_queues to solve the automatic irq affinity
issue when the first interrupt vector for the queues is 0.
But if the first interrupt vector for the queues is not 0, it seems we
can't use blk_mq_pci_map_queues directly; blk_mq_virtio_map_queues, for
example, implements an interface of its own. Is it possible to provide a
general interface for those situations?
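
For illustration, the kind of generic helper being asked for here could look
roughly like the sketch below. This interface does not exist at the time of
writing; the name and the first_vec parameter are assumptions, modelled on the
existing blk_mq_pci_map_queues():

/*
 * Hypothetical variant of blk_mq_pci_map_queues() that takes the index
 * of the first interrupt vector used for queues, so drivers whose queue
 * vectors do not start at 0 could still reuse a common helper.
 */
static int blk_mq_pci_map_queues_offset(struct blk_mq_tag_set *set,
					struct pci_dev *pdev, int first_vec)
{
	const struct cpumask *mask;
	unsigned int queue, cpu;

	for (queue = 0; queue < set->nr_hw_queues; queue++) {
		mask = pci_irq_get_affinity(pdev, first_vec + queue);
		if (!mask)
			goto fallback;

		for_each_cpu(cpu, mask)
			set->mq_map[cpu] = queue;
	}
	return 0;

fallback:
	WARN_ON_ONCE(1);
	return blk_mq_map_queues(set);
}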



  static struct scsi_host_template hpsa_driver_template = {
.module = THIS_MODULE,
.name   = HPSA,
@@ -974,10 +982,13 @@ static struct scsi_host_template hpsa_driver_template = {
  #ifdef CONFIG_COMPAT
.compat_ioctl   = hpsa_compat_ioctl,
  #endif
+   .map_queues = hpsa_map_queues,
.sdev_attrs = hpsa_sdev_attrs,
.shost_attrs = hpsa_shost_attrs,
.max_sectors = 1024,
.no_write_same = 1,
+   .force_blk_mq = 1,
+   .host_tagset = 1,
  };
  
  static inline u32 next_command(struct ctlr_info *h, u8 q)

@@ -1045,11 +1056,7 @@ static void set_performant_mode(struct ctlr_info *h, 
struct CommandList *c,
c->busaddr |= 1 | (h->blockFetchTable[c->Header.SGList] << 1);
if (unlikely(!h->msix_vectors))
return;
-   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
-   c->Header.ReplyQueue =
-   raw_smp_processor_id() % h->nreply_queues;
-   else
-   c->Header.ReplyQueue = reply_queue % h->nreply_queues;
+   c->Header.ReplyQueue = reply_queue;
}
  }
  
@@ -1063,10 +1070,7 @@ static void set_ioaccel1_performant_mode(struct ctlr_info *h,

 * Tell the controller to post the reply to the queue for this
 * processor.  This seems to give the best I/O throughput.
 */
-   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
-   cp->ReplyQueue = smp_processor_id() % h->nreply_queues;
-   else
-   cp->ReplyQueue = reply_queue % h->nreply_queues;
+   cp->ReplyQueue = reply_queue;
/*
 * Set the bits in the address sent down to include:
 *  - performant mode bit (bit 0)
@@ -1087,10 +1091,7 @@ static void set_ioaccel2_tmf_performant_mode(struct 
ctlr_info *h,
/* Tell the controller to post the reply to the queue for this
 * processor.  This seems to give the best I/O throughput.
 */
-   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
-   cp->reply_queue = smp_processor_id() % h->nreply_queues;
-   else
-   cp->reply_queue = reply_queue % h->nreply_queues;
+   cp->reply_queue = reply_queue;
/* Set the bits in the address sent down to include:
 *  - performant mode bit not used in ioaccel mode 2
 *  - pull count (bits 0-3)
@@ -1109,10 +1110,7 @@ static void set_ioaccel2_performant_mode(struct 
ctlr_info *h,
 * Tell the controller to post the reply to the queue for this
 * processor.  This seems to give the best I/O throughput.
 */
-   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
-   cp->reply_queue = smp_processor_id() % h->nreply_queues;
-   else
-   cp->reply_queue = reply_queue % 

[PATCH rfc 1/5] irq-am: Introduce library implementing generic adaptive moderation

2018-02-05 Thread Sagi Grimberg
irq-am library helps I/O devices implement interrupt moderation in
an adaptive fashion, based on online stats.

The consumer can initialize an irq-am context with a callback that
performs the device-specific moderation programming, and with the number
of am (adaptive moderation) levels, which are kept abstract and allow
for device-specific tuning.

The irq-am code samples once every nr_events and checks for a significant
change in the workload characteristics (completions per second, events per
second); if it detects one, it performs an am level update (called a step).

The irq-am code assumes that the am levels are sorted in increasing order,
where the lowest level corresponds to the optimum latency tuning (short time
and low completion-count) and the levels gradually increase towards the
optimum throughput tuning (longer time and higher completion-count). So there
is a trend and tuning direction tracked by the moderator. When the moderator
collects sufficient statistics (also controlled by the consumer via
nr_events), it compares the current stats with the previous stats, and if a
significant change in the load is observed, the moderator attempts to
increment/decrement its current level (a step) and schedules a moderation
program dispatch work.
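
To make the decision flow above concrete, here is a small stand-alone model of
the stepping logic (illustrative only; it is not the library code, and the
meaning of "better" is reduced to a plain completions-per-second comparison):

/* Stand-alone toy model of the stepping logic described above; the level
 * count and the "better" test are assumptions for illustration.
 * Build with: cc -o am_model am_model.c
 */
#include <stdio.h>

enum tune_dir { GOING_UP, GOING_DOWN };

struct am_model {
	int level;		/* current moderation level */
	int nr_levels;
	enum tune_dir dir;	/* current tuning direction (the "trend") */
	unsigned long prev_cps;	/* completions/sec seen in the last round */
};

/* One decision round: if the new rate is not better than the previous
 * one, flip the tuning direction, then take one step in the current
 * direction, clamped to [0, nr_levels - 1]. */
static void am_decide(struct am_model *am, unsigned long cps)
{
	if (cps <= am->prev_cps)
		am->dir = (am->dir == GOING_UP) ? GOING_DOWN : GOING_UP;

	if (am->dir == GOING_UP && am->level < am->nr_levels - 1)
		am->level++;
	else if (am->dir == GOING_DOWN && am->level > 0)
		am->level--;

	am->prev_cps = cps;
}

int main(void)
{
	struct am_model am = { .level = 0, .nr_levels = 8, .dir = GOING_UP };
	unsigned long samples[] = { 100000, 180000, 250000, 240000, 260000 };
	unsigned int i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		am_decide(&am, samples[i]);
		printf("cps=%lu -> level=%d dir=%s\n", samples[i], am.level,
		       am.dir == GOING_UP ? "up" : "down");
	}
	return 0;
}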

Signed-off-by: Sagi Grimberg 
---
 include/linux/irq-am.h | 116 +++
 lib/Kconfig|   5 ++
 lib/Makefile   |   1 +
 lib/irq-am.c   | 182 +
 4 files changed, 304 insertions(+)
 create mode 100644 include/linux/irq-am.h
 create mode 100644 lib/irq-am.c

diff --git a/include/linux/irq-am.h b/include/linux/irq-am.h
new file mode 100644
index ..5ddd5ca268aa
--- /dev/null
+++ b/include/linux/irq-am.h
@@ -0,0 +1,116 @@
+/*
+ * Adaptive moderation support for I/O devices.
+ * Copyright (c) 2018 Lightbits Labs.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#ifndef _IRQ_AM_H
+#define _IRQ_AM_H
+
+#include 
+#include 
+
+struct irq_am;
+typedef int (irq_am_fn)(struct irq_am *, unsigned short level);
+
+/*
+ * struct irq_am_sample_stats - sample stats for adaptive moderation
+ * @cps:	completions per-second
+ * @eps:	events per-second
+ * @cpe:	completions per event
+ */
+struct irq_am_sample_stats {
+   u32 cps;
+   u32 eps;
+   u32 cpe;
+};
+
+/*
+ * struct irq_am_sample - per-irq interrupt batch sample unit
+ * @time: current time
+ * @comps: completions count since last sample
+ * @events:	events count since the last sample
+ */
+struct irq_am_sample {
+   ktime_t time;
+   u64 comps;
+   u64 events;
+};
+
+/*
+ * enum irq_am_state - adaptive moderation monitor states
+ * @IRQ_AM_START_MEASURING:	collect first sample (start_sample)
+ * @IRQ_AM_MEASURING:		measurement in progress
+ * @IRQ_AM_PROGRAM_MODERATION:	moderation program scheduled,
+ *				so we should not react to any stats
+ *				from the old moderation profile.
+ */
+enum irq_am_state {
+   IRQ_AM_START_MEASURING,
+   IRQ_AM_MEASURING,
+   IRQ_AM_PROGRAM_MODERATION,
+};
+
+enum irq_am_tune_state {
+   IRQ_AM_GOING_UP,
+   IRQ_AM_GOING_DOWN,
+};
+
+enum irq_am_relative_diff {
+   IRQ_AM_STATS_WORSE,
+   IRQ_AM_STATS_SAME,
+   IRQ_AM_STATS_BETTER,
+};
+
+struct irq_am_stats {
+   u64 events;
+   u64 comps;
+};
+
+/*
+ * struct irq_am - irq adaptive moderation monitor
+ * @state: adaptive moderation monitor state
+ * @tune_state:	tuning state of the moderation monitor
+ * @am_stats:	overall completions and events counters
+ * @start_sample:	first sample in moderation batch
+ * @prev_stats:	previous stats for trend detection
+ * @nr_events:	number of events between samples
+ * @nr_levels:	number of moderation levels
+ * @curr_level:	current moderation level
+ * @work:  schedule moderation program
+ * @program:   moderation program handler
+ */
+struct irq_am {
+   enum irq_am_state   state;
+   enum irq_am_tune_state  tune_state;
+
+   struct irq_am_stats am_stats;
+	struct irq_am_sample		start_sample;
+   struct irq_am_sample_stats  prev_stats;
+
+   u16 nr_events;
+   unsigned short  nr_levels;
+   unsigned short  curr_level;
+
+   struct work_struct   

[PATCH rfc 2/5] irq-am: add some debugfs exposure on tuning state

2018-02-05 Thread Sagi Grimberg
Useful for local debugging

Signed-off-by: Sagi Grimberg 
---
 include/linux/irq-am.h |   2 +
 lib/irq-am.c   | 109 +
 2 files changed, 111 insertions(+)

diff --git a/include/linux/irq-am.h b/include/linux/irq-am.h
index 5ddd5ca268aa..18df315633a4 100644
--- a/include/linux/irq-am.h
+++ b/include/linux/irq-am.h
@@ -101,6 +101,8 @@ struct irq_am {
 
struct work_struct  work;
irq_am_fn   *program;
+   struct dentry   *debugfs_dir;
+   int id;
 };
 
 void irq_am_add_event(struct irq_am *am);
diff --git a/lib/irq-am.c b/lib/irq-am.c
index ed7befd7a560..6cd2f131f5e8 100644
--- a/lib/irq-am.c
+++ b/lib/irq-am.c
@@ -12,6 +12,61 @@
  * more details.
  */
 #include 
+#include 
+
+static DEFINE_IDA(am_ida);
+
+#ifdef CONFIG_DEBUG_FS
+struct dentry *irq_am_debugfs_root;
+
+struct irq_am_debugfs_attr {
+   const char *name;
+   umode_t mode;
+   int (*show)(void *, struct seq_file *);
+   ssize_t (*write)(void *, const char __user *, size_t, loff_t *);
+};
+
+static int irq_am_tune_state_show(void *data, struct seq_file *m)
+{
+   struct irq_am *am = data;
+
+   seq_printf(m, "%d\n", am->tune_state);
+   return 0;
+}
+
+static int irq_am_curr_level_show(void *data, struct seq_file *m)
+{
+   struct irq_am *am = data;
+
+   seq_printf(m, "%d\n", am->curr_level);
+   return 0;
+}
+
+static int irq_am_debugfs_show(struct seq_file *m, void *v)
+{
+   const struct irq_am_debugfs_attr *attr = m->private;
+   void *data = d_inode(m->file->f_path.dentry->d_parent)->i_private;
+
+   return attr->show(data, m);
+}
+
+static int irq_am_debugfs_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, irq_am_debugfs_show, inode->i_private);
+}
+
+static const struct file_operations irq_am_debugfs_fops = {
+   .open   = irq_am_debugfs_open,
+   .read   = seq_read,
+	.release	= seq_release,
+};
+
+static const struct irq_am_debugfs_attr irq_am_attrs[] = {
+   {"tune_state", 0400, irq_am_tune_state_show},
+   {"curr_level", 0400, irq_am_curr_level_show},
+   {},
+};
+#endif
 
 static void irq_am_try_step(struct irq_am *am)
 {
@@ -160,10 +215,51 @@ static void irq_am_program_moderation_work(struct 
work_struct *w)
am->state = IRQ_AM_START_MEASURING;
 }
 
+#ifdef CONFIG_DEBUG_FS
+static bool debugfs_create_files(struct dentry *parent, void *data,
+const struct irq_am_debugfs_attr *attr)
+{
+   d_inode(parent)->i_private = data;
+
+   for (; attr->name; attr++) {
+   if (!debugfs_create_file(attr->name, attr->mode, parent,
+					 (void *)attr, &irq_am_debugfs_fops))
+   return false;
+   }
+   return true;
+}
+
+static int irq_am_register_debugfs(struct irq_am *am)
+{
+   char name[20];
+
+   snprintf(name, sizeof(name), "am%u", am->id);
+   am->debugfs_dir = debugfs_create_dir(name,
+   irq_am_debugfs_root);
+   if (!am->debugfs_dir)
+   return -ENOMEM;
+
+   if (!debugfs_create_files(am->debugfs_dir, am,
+   irq_am_attrs))
+   return -ENOMEM;
+   return 0;
+}
+
+static void irq_am_deregister_debugfs(struct irq_am *am)
+{
+   debugfs_remove_recursive(am->debugfs_dir);
+}
+
+#else
+static int irq_am_register_debugfs(struct irq_am *am) { return 0; }
+static void irq_am_deregister_debugfs(struct irq_am *am) {}
+#endif
 
 void irq_am_cleanup(struct irq_am *am)
 {
	flush_work(&am->work);
+   irq_am_deregister_debugfs(am);
+	ida_simple_remove(&am_ida, am->id);
 }
 EXPORT_SYMBOL_GPL(irq_am_cleanup);
 
@@ -177,6 +273,19 @@ void irq_am_init(struct irq_am *am, unsigned int nr_events,
am->nr_events = nr_events;
am->curr_level = start_level;
am->program = fn;
+	am->id = ida_simple_get(&am_ida, 0, 0, GFP_KERNEL);
+   WARN_ON(am->id < 0);
	INIT_WORK(&am->work, irq_am_program_moderation_work);
+   if (irq_am_register_debugfs(am))
+   pr_warn("irq-am %d failed to register debugfs\n", am->id);
 }
 EXPORT_SYMBOL_GPL(irq_am_init);
+
+static __init int irq_am_setup(void)
+{
+#ifdef CONFIG_DEBUG_FS
+   irq_am_debugfs_root = debugfs_create_dir("irq_am", NULL);
+#endif
+   return 0;
+}
+subsys_initcall(irq_am_setup);
-- 
2.14.1



[PATCH rfc 3/5] irq_poll: wire up irq_am

2018-02-05 Thread Sagi Grimberg
Update online stats for fired event and completions on
each poll cycle.

Also expose am initialization interface. The irqpoll consumer will
initialize the irq-am context of the irq-poll context.
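
For a consumer outside the RDMA stack, the wiring would look roughly like the
hypothetical sketch below. Everything named foo_* and the level table are
assumptions for illustration only; irq_poll_init(), irq_poll_init_am() and the
irq_poll_am_fn signature are the interfaces added or reused by this series:

/*
 * Hypothetical HBA driver hooking its per-queue irq_poll context into
 * irq-am.  struct foo_queue, foo_poll(), foo_set_coalescing() and the
 * level table are made up for illustration.
 */
struct foo_moder {
	u16 usecs;
	u16 frames;
};

static const struct foo_moder foo_levels[] = {
	{ 1, 1 }, { 4, 8 }, { 16, 32 }, { 64, 128 },
};

static int foo_irqpoll_am(struct irq_poll *iop, unsigned short level)
{
	struct foo_queue *q = container_of(iop, struct foo_queue, iop);

	/* program the device coalescing registers for this level */
	return foo_set_coalescing(q, foo_levels[level].usecs,
				  foo_levels[level].frames);
}

static void foo_queue_init(struct foo_queue *q)
{
	irq_poll_init(&q->iop, FOO_POLL_BUDGET, foo_poll);
	irq_poll_init_am(&q->iop, 64, ARRAY_SIZE(foo_levels), 0,
			 foo_irqpoll_am);
}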

Signed-off-by: Sagi Grimberg 
---
 include/linux/irq_poll.h |  9 +
 lib/Kconfig  |  1 +
 lib/irq_poll.c   | 30 +-
 3 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/include/linux/irq_poll.h b/include/linux/irq_poll.h
index 16aaeccb65cb..de3055359fd8 100644
--- a/include/linux/irq_poll.h
+++ b/include/linux/irq_poll.h
@@ -2,14 +2,20 @@
 #ifndef IRQ_POLL_H
 #define IRQ_POLL_H
 
+#include 
+
 struct irq_poll;
 typedef int (irq_poll_fn)(struct irq_poll *, int);
+typedef int (irq_poll_am_fn)(struct irq_poll *, unsigned short);
 
 struct irq_poll {
struct list_head list;
unsigned long state;
int weight;
irq_poll_fn *poll;
+
+   struct irq_am am;
+   irq_poll_am_fn *amfn;
 };
 
 enum {
@@ -23,4 +29,7 @@ extern void irq_poll_complete(struct irq_poll *);
 extern void irq_poll_enable(struct irq_poll *);
 extern void irq_poll_disable(struct irq_poll *);
 
+extern void irq_poll_init_am(struct irq_poll *iop, unsigned int nr_events,
+unsigned short nr_levels, unsigned short start_level,
+   irq_poll_am_fn *amfn);
 #endif
diff --git a/lib/Kconfig b/lib/Kconfig
index bbb4c9eea84d..d495b21cd241 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -511,6 +511,7 @@ config IRQ_AM
 
 config IRQ_POLL
bool "IRQ polling library"
+   select IRQ_AM
help
  Helper library to poll interrupt mitigation using polling.
 
diff --git a/lib/irq_poll.c b/lib/irq_poll.c
index 86a709954f5a..6bc86a677d8c 100644
--- a/lib/irq_poll.c
+++ b/lib/irq_poll.c
@@ -53,6 +53,7 @@ static void __irq_poll_complete(struct irq_poll *iop)
list_del(>list);
smp_mb__before_atomic();
clear_bit_unlock(IRQ_POLL_F_SCHED, >state);
+	irq_am_add_event(&iop->am);
 }
 
 /**
@@ -106,8 +107,10 @@ static void __latent_entropy irq_poll_softirq(struct 
softirq_action *h)
 
weight = iop->weight;
work = 0;
-   if (test_bit(IRQ_POLL_F_SCHED, >state))
+   if (test_bit(IRQ_POLL_F_SCHED, >state)) {
work = iop->poll(iop, weight);
+		irq_am_add_comps(&iop->am, work);
+   }
 
budget -= work;
 
@@ -144,6 +147,7 @@ static void __latent_entropy irq_poll_softirq(struct 
softirq_action *h)
  **/
 void irq_poll_disable(struct irq_poll *iop)
 {
+	irq_am_cleanup(&iop->am);
set_bit(IRQ_POLL_F_DISABLE, >state);
while (test_and_set_bit(IRQ_POLL_F_SCHED, >state))
msleep(1);
@@ -185,6 +189,30 @@ void irq_poll_init(struct irq_poll *iop, int weight, 
irq_poll_fn *poll_fn)
 }
 EXPORT_SYMBOL(irq_poll_init);
 
+static int irq_poll_am(struct irq_am *am, unsigned short level)
+{
+   struct irq_poll *iop = container_of(am, struct irq_poll, am);
+
+   return iop->amfn(iop, level);
+}
+
+/**
+ * irq_poll_init_am - Initialize adaptive moderation parameters on this @iop
+ * @iop:         The parent iopoll structure
+ * @nr_events:   Number of events between moderation samples
+ * @nr_levels:   Number of moderation levels
+ * @start_level: Initial moderation level
+ * @amfn:        The moderation programming handler to invoke
+ *
+ * Description:
+ * Initialize adaptive moderation for this irq_poll structure.
+ **/
+void irq_poll_init_am(struct irq_poll *iop, unsigned int nr_events,
+unsigned short nr_levels, unsigned short start_level, irq_poll_am_fn 
*amfn)
+{
+   iop->amfn = amfn;
+	irq_am_init(&iop->am, nr_events, nr_levels, start_level, irq_poll_am);
+}
+EXPORT_SYMBOL(irq_poll_init_am);
+
 static int irq_poll_cpu_dead(unsigned int cpu)
 {
/*
-- 
2.14.1



[PATCH rfc 4/5] IB/cq: add adaptive moderation support

2018-02-05 Thread Sagi Grimberg
Currently activated via modparam, obviously we will want to
find a more generic way to control this.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/core/cq.c | 48 
 1 file changed, 48 insertions(+)

diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
index f2ae75fa3128..270801d28f9d 100644
--- a/drivers/infiniband/core/cq.c
+++ b/drivers/infiniband/core/cq.c
@@ -25,6 +25,51 @@
 #define IB_POLL_FLAGS \
(IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS)
 
+#define IB_CQ_AM_NR_LEVELS 8
+#define IB_CQ_AM_NR_EVENTS 64
+
+struct ib_cq_moder {
+   u16 usec;
+   u16 comps;
+};
+
+/* XXX: magic adaptive moderation profiles */
+static const struct ib_cq_moder am_prof[IB_CQ_AM_NR_LEVELS] = {
+   {1,  1},
+   {1,  2},
+   {2,  4},
+   {4,  8},
+   {8, 16},
+   {16, 32},
+   {32, 48},
+   {32, 64},
+};
+
+static bool use_am = true;
+module_param(use_am, bool, 0444);
+MODULE_PARM_DESC(use_am, "Use cq adaptive moderation");
+
+static int ib_cq_am(struct ib_cq *cq, unsigned short level)
+{
+   u16 usec;
+   u16 comps;
+
+   if (!use_am)
+   return 0;
+
+   usec = am_prof[level].usec;
+   comps = am_prof[level].comps;
+
+   return cq->device->modify_cq(cq, comps, usec);
+}
+
+static int ib_cq_irqpoll_am(struct irq_poll *iop, unsigned short level)
+{
+   struct ib_cq *cq = container_of(iop, struct ib_cq, iop);
+
+   return ib_cq_am(cq, level);
+}
+
 static int __ib_process_cq(struct ib_cq *cq, int budget)
 {
int i, n, completed = 0;
@@ -162,6 +207,9 @@ struct ib_cq *ib_alloc_cq(struct ib_device *dev, void 
*private,
cq->comp_handler = ib_cq_completion_softirq;
 
		irq_poll_init(&cq->iop, IB_POLL_BUDGET_IRQ, ib_poll_handler);
+   if (cq->device->modify_cq)
+		irq_poll_init_am(&cq->iop, IB_CQ_AM_NR_EVENTS,
+   IB_CQ_AM_NR_LEVELS, 0, ib_cq_irqpoll_am);
ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
break;
case IB_POLL_WORKQUEUE:
-- 
2.14.1



[PATCH rfc 0/5] generic adaptive IRQ moderation library for I/O devices

2018-02-05 Thread Sagi Grimberg
Adaptive IRQ moderation (also called adaptive IRQ coalescing) has been widely 
used
in the networking stack for over 20 years and has become a standard default
setting.

Adaptive moderation is a feature supported by the device to delay an interrupt
for either a period of time or a number of completions (packets for networking
devices) in order to optimize the effective throughput by mitigating the cost of
handling interrupts, which can be expensive at the rates of modern networking
and/or storage devices.

The basic concept of adaptive moderation is to provide time- and packet-based
control of when the device generates an interrupt in an adaptive fashion,
attempting to identify the current workload based on statistics gathered online
instead of a predefined configuration. When done correctly, adaptive moderation
can win the best of both the throughput and the latency worlds.

This rfc patchset introduces a generic library that provides the mechanics for
online statistics monitoring and decision making for the consumer, which simply
needs to program the actual device while still keeping room for device-specific
tuning.

In the networking stack, each device driver implements adaptive IRQ moderation
on its own. The approach here is a bit different: it tries to take the common
denominator, which is per-queue statistics gathering and workload change
identification (basically deciding whether the moderation scheme needs to
change).

The library is targeted at multi-queue devices, but should work on single-queue
devices as well; however, I'm not sure that these devices will need something
like interrupt moderation.

The model used in the proposed implementation requires the consumer (a.k.a. the
device driver) to initialize an irq adaptive moderation context (a.k.a. irq_am)
per desired context, which will most likely be a completion queue.

The moderator is initialized with a set of am (adaptive moderation) levels,
which are essentially an abstraction of the device-specific moderation
parameters. Usually the different levels will map to different pairs of
(time, completion-count), which are left specific to the consumer.

The moderator assumes that the am levels are sorted in increasing order, where
the lowest level corresponds to the optimum latency tuning (short time and low
completion-count) and the levels gradually increase towards the optimum
throughput tuning (longer time and higher completion-count). So there is a
trend and tuning direction tracked by the moderator. When the moderator
collects sufficient statistics (also controlled by the consumer via nr_events),
it compares the current stats with the previous stats, and if a significant
change in the load is observed, the moderator attempts to increment/decrement
its current level (called a step) and schedules a moderation program dispatch
work.
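
As a rough illustration of the "significant change" test the moderator
applies, the comparison boils down to something like the stand-alone snippet
below (the 10% margin is an assumption for illustration, not a value taken
from the patches):

#include <stdio.h>

enum rel_diff { STATS_WORSE, STATS_SAME, STATS_BETTER };

/* Classify the new completion rate against the previous one; anything
 * within +/-10% of the previous rate is treated as noise. */
static enum rel_diff classify(unsigned long prev_cps, unsigned long curr_cps)
{
	unsigned long margin = prev_cps / 10;

	if (curr_cps > prev_cps + margin)
		return STATS_BETTER;
	if (curr_cps + margin < prev_cps)
		return STATS_WORSE;
	return STATS_SAME;
}

int main(void)
{
	static const char *name[] = { "worse", "same", "better" };

	printf("%s\n", name[classify(1000000, 1200000)]);	/* better */
	printf("%s\n", name[classify(1000000, 1050000)]);	/* same   */
	printf("%s\n", name[classify(1000000,  800000)]);	/* worse  */
	return 0;
}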

The main reason why this implementation is different from the common networking
devices implementation (and kept separate) is that, in my mind at least,
network devices are different animals from other I/O devices in the sense that:
(a) network devices rely heavily on the byte count of raw ethernet frames for
    adaptive moderation, while in storage or I/O the byte count is often the
    result of a submission/completion transaction and sometimes even known
    only to the application on top of the infrastructure (like in the rdma
    case).
(b) Performance characteristics and expectations in representative workloads.
(c) network devices collect all sorts of stats for different functionalities
    (where adaptive moderation is only one use-case), and I'm not sure at all
    that a subset of those stats could easily migrate to a different context.

Having said that, if sufficient reasoning comes along, I could be convinced 
otherwise if unification
of the implementations between networking and I/O is desired.

Additionally, I hooked two consumers into this framework:
1. irq-poll - interrupt polling library (used by various rdma consumers and can 
be extended to others)
2. rdma cq - generic rdma cq polling abstraction library

With this, both RDMA initiator mode consumers and RDMA target mode consumers
are able to utilize the framework (nvme, iser, srp, nfs). Moreover, I currently 
do
not see any reason why other HBAs (or devices in general) that support interrupt
moderation wouldn't be able to hook into this framework as well.

Note that the code is in *RFC* level, attempting to convey the concept.
If the direction is acceptable, the setup can be easily modified and
cleanups can be performed.

Initial benchmarking shows promising results, with a 50% improvement in
throughput when testing a high load of small I/O. The experiment used an
nvme-rdma host vs. an nvmet-rdma target exposing a null_blk device. The
workload was a multithreaded fio run with high queue depth and 512B read I/O
(a 4K block size would exceed 100 GbE wire speed).

The results without adaptive moderation reach ~8M IOPS, bottlenecking the host
and target CPUs.
With adaptive moderation enabled, 

[PATCH rfc 5/5] IB/cq: wire up adaptive moderation to workqueue based completion queues

2018-02-05 Thread Sagi Grimberg
Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/core/cq.c | 25 -
 include/rdma/ib_verbs.h  |  9 +++--
 2 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
index 270801d28f9d..1d984fd77449 100644
--- a/drivers/infiniband/core/cq.c
+++ b/drivers/infiniband/core/cq.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /* # of WCs to poll for with a single call to ib_poll_cq */
@@ -70,6 +71,13 @@ static int ib_cq_irqpoll_am(struct irq_poll *iop, unsigned 
short level)
return ib_cq_am(cq, level);
 }
 
+static int ib_cq_workqueue_am(struct irq_am *am, unsigned short level)
+{
+   struct ib_cq *cq = container_of(am, struct ib_cq, wq.am);
+
+   return ib_cq_am(cq, level);
+}
+
 static int __ib_process_cq(struct ib_cq *cq, int budget)
 {
int i, n, completed = 0;
@@ -147,18 +155,21 @@ static void ib_cq_completion_softirq(struct ib_cq *cq, 
void *private)
 
 static void ib_cq_poll_work(struct work_struct *work)
 {
-   struct ib_cq *cq = container_of(work, struct ib_cq, work);
+   struct ib_cq *cq = container_of(work, struct ib_cq, wq.work);
int completed;
 
completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE);
+	irq_am_add_comps(&cq->wq.am, completed);
if (completed >= IB_POLL_BUDGET_WORKQUEUE ||
ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
-		queue_work(ib_comp_wq, &cq->work);
+		queue_work(ib_comp_wq, &cq->wq.work);
+   else
+		irq_am_add_event(&cq->wq.am);
 }
 
 static void ib_cq_completion_workqueue(struct ib_cq *cq, void *private)
 {
-	queue_work(ib_comp_wq, &cq->work);
+	queue_work(ib_comp_wq, &cq->wq.work);
 }
 
 /**
@@ -214,7 +225,10 @@ struct ib_cq *ib_alloc_cq(struct ib_device *dev, void 
*private,
break;
case IB_POLL_WORKQUEUE:
cq->comp_handler = ib_cq_completion_workqueue;
-		INIT_WORK(&cq->work, ib_cq_poll_work);
+		INIT_WORK(&cq->wq.work, ib_cq_poll_work);
+   if (cq->device->modify_cq)
+			irq_am_init(&cq->wq.am, IB_CQ_AM_NR_EVENTS,
+   IB_CQ_AM_NR_LEVELS, 0, ib_cq_workqueue_am);
ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
break;
default:
@@ -250,7 +264,8 @@ void ib_free_cq(struct ib_cq *cq)
		irq_poll_disable(&cq->iop);
break;
case IB_POLL_WORKQUEUE:
-		cancel_work_sync(&cq->work);
+		cancel_work_sync(&cq->wq.work);
+		irq_am_cleanup(&cq->wq.am);
break;
default:
WARN_ON_ONCE(1);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index fd84cda5ed7c..14a93566adb6 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1555,6 +1555,11 @@ enum ib_poll_context {
IB_POLL_WORKQUEUE,  /* poll from workqueue */
 };
 
+struct ib_cq_workqueue_poll {
+   struct irq_am   am;
+   struct work_struct  work;
+};
+
 struct ib_cq {
struct ib_device   *device;
struct ib_uobject  *uobject;
@@ -1566,8 +1571,8 @@ struct ib_cq {
enum ib_poll_contextpoll_ctx;
	struct ib_wc		*wc;
union {
-   struct irq_poll iop;
-   struct work_struct  work;
+   struct irq_poll iop;
+   struct ib_cq_workqueue_poll wq;
};
 };
 
-- 
2.14.1



Re: [PATCH] blk-mq: Fix a race between resetting the timer and completion handling

2018-02-05 Thread t...@kernel.org
Hello, Bart.

On Mon, Feb 05, 2018 at 09:33:03PM +, Bart Van Assche wrote:
> My goal with this patch is to fix the race between resetting the timer and
> the completion path. Hence change (3). Changes (1) and (2) are needed to
> make the changes in blk_mq_rq_timed_out() work.

Ah, I see.  That makes sense.  Can I ask you to elaborate the scenario
you were fixing?

> > > @@ -831,13 +834,12 @@ static void blk_mq_rq_timed_out(struct request 
> > > *req, bool reserved)
> > >   __blk_mq_complete_request(req);
> > >   break;
> > >   case BLK_EH_RESET_TIMER:
> > > - /*
> > > -  * As nothing prevents from completion happening while
> > > -  * ->aborted_gstate is set, this may lead to ignored
> > > -  * completions and further spurious timeouts.
> > > -  */
> > > - blk_mq_rq_update_aborted_gstate(req, 0);
> > > + local_irq_disable();
> > > + write_seqcount_begin(>gstate_seq);
> > >   blk_add_timer(req);
> > > + req->aborted_gstate = 0;
> > > + write_seqcount_end(>gstate_seq);
> > > + local_irq_enable();
> > >   break;
> > 
> > So, this is #3 and I'm not sure how adding gstate_seq protection gets
> > rid of the race condition mentioned in the comment.  It's still the
> > same that nothing is protecting against racing w/ completion.
> 
> I think you are right. I will see whether I can rework this patch to address
> that race.

That race is harmless and has always been there tho.  It only happens
when the actual completion coincides with timeout expiring, which is
very unlikely, and the only downside is that the completion gets lost
and the request will get timed out down the line.  It'd of course be
better to close the race window but code simplicity likely is an
important trade-off factor here.

Thanks.

-- 
tejun


Re: [PATCH] blk-mq: Fix a race between resetting the timer and completion handling

2018-02-05 Thread Bart Van Assche
On Mon, 2018-02-05 at 13:06 -0800, Tejun Heo wrote:
> Thanks a lot for testing and fixing the issues but I'm a bit confused
> by the patch.  Maybe we can split patch a bit more?  There seem to be
> three things going on,
> 
> 1. Changing preemption protection to irq protection in issue path.
> 
> 2. Merge of aborted_gstate_sync and gstate_seq.
> 
> 3. Updates to blk_mq_rq_timed_out().
> 
> Are all three changes necessary for stability?

Hello Tejun,

My goal with this patch is to fix the race between resetting the timer and
the completion path. Hence change (3). Changes (1) and (2) are needed to
make the changes in blk_mq_rq_timed_out() work.

> > @@ -831,13 +834,12 @@ static void blk_mq_rq_timed_out(struct request *req, 
> > bool reserved)
> > __blk_mq_complete_request(req);
> > break;
> > case BLK_EH_RESET_TIMER:
> > -   /*
> > -* As nothing prevents from completion happening while
> > -* ->aborted_gstate is set, this may lead to ignored
> > -* completions and further spurious timeouts.
> > -*/
> > -   blk_mq_rq_update_aborted_gstate(req, 0);
> > +   local_irq_disable();
> > +   write_seqcount_begin(&req->gstate_seq);
> > blk_add_timer(req);
> > +   req->aborted_gstate = 0;
> > +   write_seqcount_end(&req->gstate_seq);
> > +   local_irq_enable();
> > break;
> 
> So, this is #3 and I'm not sure how adding gstate_seq protection gets
> rid of the race condition mentioned in the comment.  It's still the
> same that nothing is protecting against racing w/ completion.

I think you are right. I will see whether I can rework this patch to address
that race.

Thanks,

Bart.




Re: [PATCH] blk-mq: Fix a race between resetting the timer and completion handling

2018-02-05 Thread Tejun Heo
Hello, Bart.

Thanks a lot for testing and fixing the issues but I'm a bit confused
by the patch.  Maybe we can split patch a bit more?  There seem to be
three things going on,

1. Changing preemption protection to irq protection in issue path.

2. Merge of aborted_gstate_sync and gstate_seq.

3. Updates to blk_mq_rq_timed_out().

Are all three changes necessary for stability?

> @@ -831,13 +834,12 @@ static void blk_mq_rq_timed_out(struct request *req, 
> bool reserved)
>   __blk_mq_complete_request(req);
>   break;
>   case BLK_EH_RESET_TIMER:
> - /*
> -  * As nothing prevents from completion happening while
> -  * ->aborted_gstate is set, this may lead to ignored
> -  * completions and further spurious timeouts.
> -  */
> - blk_mq_rq_update_aborted_gstate(req, 0);
> + local_irq_disable();
> + write_seqcount_begin(&req->gstate_seq);
>   blk_add_timer(req);
> + req->aborted_gstate = 0;
> + write_seqcount_end(&req->gstate_seq);
> + local_irq_enable();
>   break;

So, this is #3 and I'm not sure how adding gstate_seq protection gets
rid of the race condition mentioned in the comment.  It's still the
same that nothing is protecting against racing w/ completion.

Thanks.

-- 
tejun


RE: [PATCH V2 4/8] block: null_blk: introduce module parameter of 'g_global_tags'

2018-02-05 Thread Don Brace
> This patch introduces the parameter of 'g_global_tags' so that we can
> test this feature by null_blk easiy.
> 
> No obvious performance drop is seen with global_tags when the whole hw
> queue depth is kept the same:
> 
> 1) no 'global_tags', each hw queue depth is 1, and 4 hw queues
> modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 global_tags=0
> submit_queues=4 hw_queue_depth=1
> 
> 2) 'global_tags', global hw queue depth is 4, and 4 hw queues
> modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 global_tags=1
> submit_queues=4 hw_queue_depth=4
> 
> 3) fio test done in above two settings:
>fio --bs=4k --size=512G  --rw=randread --norandommap --direct=1 --
> ioengine=libaio --iodepth=4 --runtime=$RUNTIME --group_reporting=1  --
> name=nullb0 --filename=/dev/nullb0 --name=nullb1 --filename=/dev/nullb1 --
> name=nullb2 --filename=/dev/nullb2 --name=nullb3 --filename=/dev/nullb3
> 
> 1M IOPS can be reached in both above tests which is done in one VM.
> 
I am getting 1.1M IOPS for both cases.

Tested-by: Don Brace 




Re: [PATCH] wbt: fix incorrect throttling due to flush latency

2018-02-05 Thread Jens Axboe
On 2/5/18 1:00 PM, Mikulas Patocka wrote:
> 
> 
> On Mon, 5 Feb 2018, Jens Axboe wrote:
> 
>> On 2/5/18 12:22 PM, Jens Axboe wrote:
>>> On 2/5/18 12:11 PM, Mikulas Patocka wrote:
 I have a workload where one process sends many asynchronous write bios
 (without waiting for them) and another process sends synchronous flush
 bios. During this workload, writeback throttling throttles down to one
 outstanding bio, and this incorrect throttling causes performance
 degradation (all write bios sleep in __wbt_wait and they couldn't be sent
 in parallel).

 The reason for this throttling is that wbt_data_dir counts flush requests
 in the read bucket. The flush requests (that take quite a long time)
 trigger this condition repeatedly:
if (stat[READ].min > rwb->min_lat_nsec)
 and that condition causes scale down to one outstanding request, despite
 the fact that there are no read bios at all.

 A similar problem could also show up with REQ_OP_ZONE_REPORT and
 REQ_OP_ZONE_RESET - they are also counted in the read bucket.

 This patch fixes the function wbt_data_dir, so that only REQ_OP_READ 
 requests are counted in the read bucket. The patch improves SATA 
 write+flush throughput from 130MB/s to 350MB/s.
>>>
>>> Good catch. But I do wonder if we should account them at all. It
>>> might make sense to account flushes as writes, but we really
>>> should not account anything that isn't regular read/write IO 
>>> related.
>>>
>>> Something like the below? Potentially including REQ_OP_FLUSH in the
>>> write latencies as well.
>>
>> Thinking about it, flush should just be counted as writes, and the
>> rest disregarded. Ala below.
>>
>>
>> diff --git a/block/blk-wbt.c b/block/blk-wbt.c
>> index ae8de9780085..f92fc84b5e2c 100644
>> --- a/block/blk-wbt.c
>> +++ b/block/blk-wbt.c
>> @@ -697,7 +697,15 @@ u64 wbt_default_latency_nsec(struct request_queue *q)
>>  
>>  static int wbt_data_dir(const struct request *rq)
>>  {
>> -return rq_data_dir(rq);
>> +const int op = req_op(rq);
>> +
>> +if (op == REQ_OP_READ)
>> +return READ;
>> +else if (op == REQ_OP_WRITE || op == REQ_OP_FLUSH)
>> +return WRITE;
>> +
>> +/* don't account */
>> +return -1;
>>  }
>>  
>>  int wbt_init(struct request_queue *q)
>>
>> -- 
>> Jens Axboe
> 
> Yes, this works.

OK good, I'll queue that up then.

> BTW. write throttling still causes some performance degradation (350MB/s 
> vs. 500MB/s without throttling) - should there be some logic that disables 
> write throttling if no sync requests are received?
> 
> Something like - if there's sync read, enable write throttling - if there 
> were no sync reads in the last 5 seconds, disable write throttling?

blk-wbt maintains a baseline that isn't too deep into writes, or you run
into problems exactly when a sync or read request does come in after 5
seconds, or whatever heuristic you choose. So that's very much by
design. It does allow steps to go negative if no reads have been seen,
but only so much so, and it bounces back immediately when one is seen.
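
For illustration, that behaviour can be reduced to a toy model like the one
below (stand-alone and deliberately simplified; it is not the actual blk-wbt
code, and MAX_NEG_STEP is an assumed bound):

#include <stdio.h>

#define MAX_NEG_STEP	3	/* assumed bound, for illustration */

struct wbt_model {
	int step;	/* 0 = baseline, negative = deeper write queue */
};

/* One monitoring window: with no reads seen, allow writes to go a bit
 * deeper than the baseline, but only by a bounded number of steps;
 * snap back to the baseline as soon as a read shows up. */
static void window_elapsed(struct wbt_model *w, int reads_seen)
{
	if (reads_seen) {
		if (w->step < 0)
			w->step = 0;
	} else if (w->step > -MAX_NEG_STEP) {
		w->step--;
	}
}

int main(void)
{
	struct wbt_model w = { 0 };
	int reads[] = { 0, 0, 0, 0, 0, 1, 0 };
	unsigned int i;

	for (i = 0; i < sizeof(reads) / sizeof(reads[0]); i++) {
		window_elapsed(&w, reads[i]);
		printf("window %u: reads=%d step=%d\n", i, reads[i], w.step);
	}
	return 0;
}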

-- 
Jens Axboe



Re: [PATCH] wbt: fix incorrect throttling due to flush latency

2018-02-05 Thread Mikulas Patocka


On Mon, 5 Feb 2018, Jens Axboe wrote:

> On 2/5/18 12:22 PM, Jens Axboe wrote:
> > On 2/5/18 12:11 PM, Mikulas Patocka wrote:
> >> I have a workload where one process sends many asynchronous write bios
> >> (without waiting for them) and another process sends synchronous flush
> >> bios. During this workload, writeback throttling throttles down to one
> >> outstanding bio, and this incorrect throttling causes performance
> >> degradation (all write bios sleep in __wbt_wait and they couldn't be sent
> >> in parallel).
> >>
> >> The reason for this throttling is that wbt_data_dir counts flush requests
> >> in the read bucket. The flush requests (that take quite a long time)
> >> trigger this condition repeatedly:
> >>if (stat[READ].min > rwb->min_lat_nsec)
> >> and that condition causes scale down to one outstanding request, despite
> >> the fact that there are no read bios at all.
> >>
> >> A similar problem could also show up with REQ_OP_ZONE_REPORT and
> >> REQ_OP_ZONE_RESET - they are also counted in the read bucket.
> >>
> >> This patch fixes the function wbt_data_dir, so that only REQ_OP_READ 
> >> requests are counted in the read bucket. The patch improves SATA 
> >> write+flush throughput from 130MB/s to 350MB/s.
> > 
> > Good catch. But I do wonder if we should account them at all. It
> > might make sense to account flushes as writes, but we really
> > should not account anything that isn't regular read/write IO 
> > related.
> > 
> > Something like the below? Potentially including REQ_OP_FLUSH in the
> > write latencies as well.
> 
> Thinking about it, flush should just be counted as writes, and the
> rest disregarded. Ala below.
> 
> 
> diff --git a/block/blk-wbt.c b/block/blk-wbt.c
> index ae8de9780085..f92fc84b5e2c 100644
> --- a/block/blk-wbt.c
> +++ b/block/blk-wbt.c
> @@ -697,7 +697,15 @@ u64 wbt_default_latency_nsec(struct request_queue *q)
>  
>  static int wbt_data_dir(const struct request *rq)
>  {
> - return rq_data_dir(rq);
> + const int op = req_op(rq);
> +
> + if (op == REQ_OP_READ)
> + return READ;
> + else if (op == REQ_OP_WRITE || op == REQ_OP_FLUSH)
> + return WRITE;
> +
> + /* don't account */
> + return -1;
>  }
>  
>  int wbt_init(struct request_queue *q)
> 
> -- 
> Jens Axboe

Yes, this works.

BTW. write throttling still causes some performance degradation (350MB/s 
vs. 500MB/s without throttling) - should there be some logic that disables 
write throttling if no sync requests are received?

Something like - if there's sync read, enable write throttling - if there 
were no sync reads in the last 5 seconds, disable write throttling?

Mikulas


Re: [PATCH] wbt: fix incorrect throttling due to flush latency

2018-02-05 Thread Jens Axboe
On 2/5/18 12:22 PM, Jens Axboe wrote:
> On 2/5/18 12:11 PM, Mikulas Patocka wrote:
>> I have a workload where one process sends many asynchronous write bios
>> (without waiting for them) and another process sends synchronous flush
>> bios. During this workload, writeback throttling throttles down to one
>> outstanding bio, and this incorrect throttling causes performance
>> degradation (all write bios sleep in __wbt_wait and they couldn't be sent
>> in parallel).
>>
>> The reason for this throttling is that wbt_data_dir counts flush requests
>> in the read bucket. The flush requests (that take quite a long time)
>> trigger this condition repeatedly:
>>  if (stat[READ].min > rwb->min_lat_nsec)
>> and that condition causes scale down to one outstanding request, despite
>> the fact that there are no read bios at all.
>>
>> A similar problem could also show up with REQ_OP_ZONE_REPORT and
>> REQ_OP_ZONE_RESET - they are also counted in the read bucket.
>>
>> This patch fixes the function wbt_data_dir, so that only REQ_OP_READ 
>> requests are counted in the read bucket. The patch improves SATA 
>> write+flush throughput from 130MB/s to 350MB/s.
> 
> Good catch. But I do wonder if we should account them at all. It
> might make sense to account flushes as writes, but we really
> should not account anything that isn't regular read/write IO 
> related.
> 
> Something like the below? Potentially including REQ_OP_FLUSH in the
> write latencies as well.

Thinking about it, flush should just be counted as writes, and the
rest disregarded. Ala below.


diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index ae8de9780085..f92fc84b5e2c 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -697,7 +697,15 @@ u64 wbt_default_latency_nsec(struct request_queue *q)
 
 static int wbt_data_dir(const struct request *rq)
 {
-   return rq_data_dir(rq);
+   const int op = req_op(rq);
+
+   if (op == REQ_OP_READ)
+   return READ;
+   else if (op == REQ_OP_WRITE || op == REQ_OP_FLUSH)
+   return WRITE;
+
+   /* don't account */
+   return -1;
 }
 
 int wbt_init(struct request_queue *q)

-- 
Jens Axboe



Re: [PATCH] wbt: fix incorrect throttling due to flush latency

2018-02-05 Thread Jens Axboe
On 2/5/18 12:11 PM, Mikulas Patocka wrote:
> I have a workload where one process sends many asynchronous write bios
> (without waiting for them) and another process sends synchronous flush
> bios. During this workload, writeback throttling throttles down to one
> outstanding bio, and this incorrect throttling causes performance
> degradation (all write bios sleep in __wbt_wait and they couldn't be sent
> in parallel).
> 
> The reason for this throttling is that wbt_data_dir counts flush requests
> in the read bucket. The flush requests (that take quite a long time)
> trigger this condition repeatedly:
>   if (stat[READ].min > rwb->min_lat_nsec)
> and that condition causes scale down to one outstanding request, despite
> the fact that there are no read bios at all.
> 
> A similar problem could also show up with REQ_OP_ZONE_REPORT and
> REQ_OP_ZONE_RESET - they are also counted in the read bucket.
> 
> This patch fixes the function wbt_data_dir, so that only REQ_OP_READ 
> requests are counted in the read bucket. The patch improves SATA 
> write+flush throughput from 130MB/s to 350MB/s.

Good catch. But I do wonder if we should account them at all. It
might make sense to account flushes as writes, but we really
should not account anything that isn't regular read/write IO 
related.

Something like the below? Potentially including REQ_OP_FLUSH in the
write latencies as well.


diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index ae8de9780085..55671e5f3b6e 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -697,7 +697,13 @@ u64 wbt_default_latency_nsec(struct request_queue *q)
 
 static int wbt_data_dir(const struct request *rq)
 {
-   return rq_data_dir(rq);
+   if (req_op(rq) == REQ_OP_READ)
+   return READ;
+   else if (req_op(rq) == REQ_OP_WRITE)
+   return WRITE;
+
+   /* don't account */
+   return -1;
 }
 
 int wbt_init(struct request_queue *q)

-- 
Jens Axboe



[PATCH] wbt: fix incorrect throttling due to flush latency

2018-02-05 Thread Mikulas Patocka
I have a workload where one process sends many asynchronous write bios
(without waiting for them) and another process sends synchronous flush
bios. During this workload, writeback throttling throttles down to one
outstanding bio, and this incorrect throttling causes performance
degradation (all write bios sleep in __wbt_wait and they couldn't be sent
in parallel).

The reason for this throttling is that wbt_data_dir counts flush requests
in the read bucket. The flush requests (that take quite a long time)
trigger this condition repeatedly:
if (stat[READ].min > rwb->min_lat_nsec)
and that condition causes scale down to one outstanding request, despite
the fact that there are no read bios at all.

A similar problem could also show up with REQ_OP_ZONE_REPORT and
REQ_OP_ZONE_RESET - they are also counted in the read bucket.

This patch fixes the function wbt_data_dir, so that only REQ_OP_READ 
requests are counted in the read bucket. The patch improves SATA 
write+flush throughput from 130MB/s to 350MB/s.

Signed-off-by: Mikulas Patocka 
Cc: sta...@vger.kernel.org  # v4.12+

---
 block/blk-wbt.c |6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

Index: linux-stable/block/blk-wbt.c
===
--- linux-stable.orig/block/blk-wbt.c   2018-02-05 12:30:09.606063908 -0500
+++ linux-stable/block/blk-wbt.c2018-02-05 13:50:02.784213501 -0500
@@ -697,7 +697,11 @@ u64 wbt_default_latency_nsec(struct requ
 
 static int wbt_data_dir(const struct request *rq)
 {
-   return rq_data_dir(rq);
+   /*
+* Flushes must be counted in the write bucket, so that high flush
+* latencies don't cause scale down.
+*/
+   return req_op(rq) == REQ_OP_READ ? READ : WRITE;
 }
 
 int wbt_init(struct request_queue *q)


Re: v4.15 and I/O hang with BFQ

2018-02-05 Thread Paolo Valente


> Il giorno 30 gen 2018, alle ore 16:40, Paolo Valente 
>  ha scritto:
> 
> 
> 
>> Il giorno 30 gen 2018, alle ore 15:40, Ming Lei  ha 
>> scritto:
>> 
>> On Tue, Jan 30, 2018 at 03:30:28PM +0100, Oleksandr Natalenko wrote:
>>> Hi.
>>> 
>> ...
>>>  systemd-udevd-271   [000]  4.311033: bfq_insert_requests: insert
>>> rq->0
>>>  systemd-udevd-271   [000] ...1 4.311037: blk_mq_do_dispatch_sched:
>>> not get rq, 1
>>> cfdisk-408   [000] 13.484220: bfq_insert_requests: insert
>>> rq->1
>>>   kworker/0:1H-174   [000] 13.484253: blk_mq_do_dispatch_sched:
>>> not get rq, 1
>>> ===
>>> 
>>> Looks the same, right?
>> 
>> Yeah, same with before.
>> 
> 
> Hi guys,
> sorry for the delay with this fix.  We are proceeding very slowly on
> this, because I'm super busy.  Anyway, now I can at least explain in
> more detail the cause that leads to this hang.  Commit 'a6a252e64914
> ("blk-mq-sched: decide how to handle flush rq via RQF_FLUSH_SEQ")'
> makes all non-flush re-prepared requests be re-inserted into the I/O
> scheduler.  With this change, I/O schedulers may get the same request
> inserted again, even several times, without a finish_request invoked
> on the request before each re-insertion.
> 
> For the I/O scheduler, every such re-prepared request is equivalent
> to the insertion of a new request. For schedulers like mq-deadline
> or kyber this fact causes no problems. In contrast, it confuses a stateful
> scheduler like BFQ, which preserves states for an I/O request until
> finish_request is invoked on it. In particular, BFQ has no way
> to know that the above re-insertions concerns the same, already dispatched
> request. So it may get stuck waiting for the completion of these
> re-inserted requests forever, thus preventing any other queue of
> requests to be served.
> 
> We are trying to address this issue by adding the hook requeue_request
> to bfq interface.
> 
> Unfortunately, with our current implementation of requeue_request in
> place, bfq eventually gets to an incoherent state.  This is apparently
> caused by a requeue of an I/O request, immediately followed by a
> completion of the same request.  This seems rather absurd, and drives
> bfq crazy.  But this is something for which we don't have definite
> results yet.
> 
> We're working on it, sorry again for the delay.
> 

Ok, patch arriving ... Please test it.

Thanks,
Paolo

> Thanks,
> Paolo
> 
>> -- 
>> Ming
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "bfq-iosched" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to bfq-iosched+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.



[PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-05 Thread Paolo Valente
Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
be re-inserted into the active I/O scheduler for that device. As a
consequence, I/O schedulers may get the same request inserted again,
even several times, without a finish_request invoked on that request
before each re-insertion.

This fact is the cause of the failure reported in [1]. For an I/O
scheduler, every re-insertion of the same re-prepared request is
equivalent to the insertion of a new request. For schedulers like
mq-deadline or kyber, this fact causes no harm. In contrast, it
confuses a stateful scheduler like BFQ, which keeps state for an I/O
request, until the finish_request hook is invoked on the request. In
particular, BFQ may get stuck, waiting forever for the number of
request dispatches, of the same request, to be balanced by an equal
number of request completions (while there will be one completion for
that request). In this state, BFQ may refuse to serve I/O requests
from other bfq_queues. The hang reported in [1] then follows.

However, the above re-prepared requests undergo a requeue, thus the
requeue_request hook of the active elevator is invoked for these
requests, if set. This commit then addresses the above issue by
properly implementing the hook requeue_request in BFQ.
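
For reference, the hook wiring this implies looks roughly as follows (a
sketch only; the authoritative hunks are in the diff below, which is
truncated in this archive):

static struct elevator_type iosched_bfq_mq = {
	.ops.mq = {
		/* ... */
		.requeue_request	= bfq_finish_requeue_request,
		.finish_request		= bfq_finish_requeue_request,
		.prepare_request	= bfq_prepare_request,
		/* ... */
	},
	/* ... */
};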

[1] https://marc.info/?l=linux-block=15127608676

Reported-by: Ivan Kozik 
Reported-by: Alban Browaeys 
Signed-off-by: Paolo Valente 
Signed-off-by: Serena Ziviani 
---
 block/bfq-iosched.c | 101 +++-
 1 file changed, 76 insertions(+), 25 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 47e6ec7427c4..343452488515 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3823,24 +3823,26 @@ static struct request *__bfq_dispatch_request(struct 
blk_mq_hw_ctx *hctx)
}
 
/*
-* We exploit the bfq_finish_request hook to decrement
-* rq_in_driver, but bfq_finish_request will not be
-* invoked on this request. So, to avoid unbalance,
-* just start this request, without incrementing
-* rq_in_driver. As a negative consequence,
-* rq_in_driver is deceptively lower than it should be
-* while this request is in service. This may cause
-* bfq_schedule_dispatch to be invoked uselessly.
+* We exploit the bfq_finish_requeue_request hook to
+* decrement rq_in_driver, but
+* bfq_finish_requeue_request will not be invoked on
+* this request. So, to avoid unbalance, just start
+* this request, without incrementing rq_in_driver. As
+* a negative consequence, rq_in_driver is deceptively
+* lower than it should be while this request is in
+* service. This may cause bfq_schedule_dispatch to be
+* invoked uselessly.
 *
 * As for implementing an exact solution, the
-* bfq_finish_request hook, if defined, is probably
-* invoked also on this request. So, by exploiting
-* this hook, we could 1) increment rq_in_driver here,
-* and 2) decrement it in bfq_finish_request. Such a
-* solution would let the value of the counter be
-* always accurate, but it would entail using an extra
-* interface function. This cost seems higher than the
-* benefit, being the frequency of non-elevator-private
+* bfq_finish_requeue_request hook, if defined, is
+* probably invoked also on this request. So, by
+* exploiting this hook, we could 1) increment
+* rq_in_driver here, and 2) decrement it in
+* bfq_finish_requeue_request. Such a solution would
+* let the value of the counter be always accurate,
+* but it would entail using an extra interface
+* function. This cost seems higher than the benefit,
+* being the frequency of non-elevator-private
 * requests very low.
 */
goto start_rq;
@@ -4515,6 +4517,8 @@ static inline void bfq_update_insert_stats(struct 
request_queue *q,
   unsigned int cmd_flags) {}
 #endif
 
+static void bfq_prepare_request(struct request *rq, struct bio *bio);
+
 static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
   bool at_head)
 {
@@ -4541,6 +4545,20 @@ static void bfq_insert_request(struct blk_mq_hw_ctx 
*hctx, struct request *rq,
else

[PATCH BUGFIX 0/1] block, bfq: handle requeues of I/O requests

2018-02-05 Thread Paolo Valente
Hi,
just a note: the most difficult part in the implementation of this
patch has been how to handle the fact that the requeue and finish
hooks of the active elevator get invoked even for requests that are
no longer referred to in that elevator. You can find details in the
comments introduced by the patch. This note is just to point out this
unexpected fact (for me), in case it is not intentional.
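
In code, the defensive pattern this refers to is roughly the following
sketch (not the actual patch hunks; RQ_BFQQ() and rq->elv.icq are where BFQ
keeps its per-request scheduler state):

static void bfq_finish_requeue_request(struct request *rq)
{
	struct bfq_queue *bfqq = RQ_BFQQ(rq);

	/*
	 * The requeue/finish hooks may be invoked for a request that
	 * BFQ never owned (or no longer owns), so bail out early if no
	 * scheduler-private state is attached to it.
	 */
	if (!rq->elv.icq || !bfqq)
		return;

	/* ... normal accounting for requests BFQ actually inserted ... */
}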

Thanks,
Paolo

Paolo Valente (1):
  block, bfq: add requeue-request hook

 block/bfq-iosched.c | 101 +++-
 1 file changed, 76 insertions(+), 25 deletions(-)

--
2.15.1


RE: [PATCH V2 7/8] scsi: hpsa: call hpsa_hba_inquiry() after adding host

2018-02-05 Thread Don Brace


> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Monday, February 05, 2018 9:21 AM
> To: Jens Axboe ; linux-block@vger.kernel.org; Christoph
> Hellwig ; Mike Snitzer 
> Cc: linux-s...@vger.kernel.org; Hannes Reinecke ; Arun Easi
> ; Omar Sandoval ; Martin K .
> Petersen ; James Bottomley
> ; Christoph Hellwig ;
> Don Brace ; Kashyap Desai
> ; Peter Rivera ;
> Paolo Bonzini ; Laurence Oberman
> ; Ming Lei 
> Subject: [PATCH V2 7/8] scsi: hpsa: call hpsa_hba_inquiry() after adding host
> 
> EXTERNAL EMAIL
> 
> 
> So that we can decide the default reply queue by the map created
> during adding host.
> 
> Cc: Hannes Reinecke 
> Cc: Arun Easi 
> Cc: Omar Sandoval ,
> Cc: "Martin K. Petersen" ,
> Cc: James Bottomley ,
> Cc: Christoph Hellwig ,
> Cc: Don Brace 
> Cc: Kashyap Desai 
> Cc: Peter Rivera 
> Cc: Paolo Bonzini 
> Cc: Mike Snitzer 
> Tested-by: Laurence Oberman 
> Signed-off-by: Ming Lei 
> ---
>  drivers/scsi/hpsa.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 287e5eb0723f..443eabf63a9f 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -8691,12 +8691,9 @@ static int hpsa_init_one(struct pci_dev *pdev, const
> struct pci_device_id *ent)
> /* Disable discovery polling.*/
> h->discovery_polling = 0;
> 
> -
> /* Turn the interrupts on so we can service requests */
> h->access.set_intr_mask(h, HPSA_INTR_ON);
> 
> -   hpsa_hba_inquiry(h);
> -
> h->lastlogicals = kzalloc(sizeof(*(h->lastlogicals)), GFP_KERNEL);
> if (!h->lastlogicals)
		dev_info(&h->pdev->dev,
> @@ -8707,6 +8704,8 @@ static int hpsa_init_one(struct pci_dev *pdev, const
> struct pci_device_id *ent)
> if (rc)
> goto clean7; /* perf, sg, cmd, irq, shost, pci, lu, aer/h */
> 
> +   hpsa_hba_inquiry(h);
> +
> /* Monitor the controller for firmware lockups */
> h->heartbeat_sample_interval = HEARTBEAT_SAMPLE_INTERVAL;
	INIT_DELAYED_WORK(&h->monitor_ctlr_work, hpsa_monitor_ctlr_worker);
> --
> 2.9.5

Tested-by: Don Brace 
P441, P431, P830i, H240

Acked-by: Don Brace 




RE: [PATCH V2 8/8] scsi: hpsa: use blk_mq to solve irq affinity issue

2018-02-05 Thread Don Brace
> -Original Message-
> From: Laurence Oberman [mailto:lober...@redhat.com]
> Sent: Monday, February 05, 2018 9:58 AM
> To: Ming Lei ; Jens Axboe ; linux-
> bl...@vger.kernel.org; Christoph Hellwig ; Mike Snitzer
> ; Don Brace 
> Cc: linux-s...@vger.kernel.org; Hannes Reinecke ; Arun Easi
> ; Omar Sandoval ; Martin K .
> Petersen ; James Bottomley
> ; Christoph Hellwig ;
> Don Brace ; Kashyap Desai
> ; Peter Rivera ;
> Paolo Bonzini 
> Subject: Re: [PATCH V2 8/8] scsi: hpsa: use blk_mq to solve irq affinity issue
> 
> EXTERNAL EMAIL
> 
> 
> On Mon, 2018-02-05 at 23:20 +0800, Ming Lei wrote:
> > This patch uses .force_blk_mq to drive HPSA via SCSI_MQ, meantime
> > maps
> > each reply queue to blk_mq's hw queue, then .queuecommand can always
> > choose the hw queue as the reply queue. And if no any online CPU is
> > mapped to one hw queue, request can't be submitted to this hw queue
> > at all, finally the irq affinity issue is solved.
> >
> > Cc: Hannes Reinecke 
> > Cc: Arun Easi 
> > Cc: Omar Sandoval ,
> > Cc: "Martin K. Petersen" ,
> > Cc: James Bottomley ,
> > Cc: Christoph Hellwig ,
> > Cc: Don Brace 
> > Cc: Kashyap Desai 
> > Cc: Peter Rivera 
> > Cc: Paolo Bonzini 
> > Cc: Mike Snitzer 
> > Tested-by: Laurence Oberman 
> > Signed-off-by: Ming Lei 
> > ---
> >  drivers/scsi/hpsa.c | 51 ++-
> This is a critical issue on the HPSA because Linus already has the
> original commit that causes the system to fail to boot.
> 
> All my testing was on DL380 G7 servers with:
> 
> Hewlett-Packard Company Smart Array G6 controllers
> Vendor: HP   Model: P410iRev: 6.64
> 
> Ming's patch fixes this so we need to try move this along.
> 
> I have a DL380 G8 as well which is also likely exposed here and I added
>  Don Brace for FYI to this list.
> 
> Thanks Ming

Tested-by: Don Brace 
P441, P431, P830i, H240

Acked-by: Don Brace 





Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Bart Van Assche

On 02/05/18 08:40, Danil Kipnis wrote:

It just occurred to me, that we could easily extend the interface in
such a way that each client (i.e. each session) would have on server
side her own directory with the devices it can access. I.e. instead of
just "dev_search_path" per server, any client would be able to only
access devices under /session_name. (session name
must already be generated by each client in a unique way). This way
one could have an explicit control over which devices can be accessed
by which clients. Do you think that would do it?


Hello Danil,

That sounds interesting to me. However, I think that approach requires
client access to be configured completely before the kernel target-side
module is loaded. It does not allow permissions to be configured
dynamically after the kernel target module has been loaded. Additionally, I don't
see how to support attributes per (initiator, block device) pair with 
that approach. LIO e.g. supports the 
/sys/kernel/config/target/srpt/*/*/acls/*/lun_*/write_protect attribute. 
You may want to implement similar functionality if you want to convince 
more users to use IBNBD.


Thanks,

Bart.


Re: [PATCH 3/4] lightnvm: add 2.0 geometry identification

2018-02-05 Thread Randy Dunlap
On 02/05/2018 04:15 AM, Matias Bjørling wrote:
> Implement the geometry data structures for 2.0 and enable a drive
> to be identified as one, including exposing the appropriate 2.0
> sysfs entries.
> 
> Signed-off-by: Matias Bjørling 
> ---
>  drivers/lightnvm/core.c  |   2 +-
>  drivers/nvme/host/lightnvm.c | 334 
> +--
>  include/linux/lightnvm.h |  11 +-
>  3 files changed, 295 insertions(+), 52 deletions(-)
> 
> diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
> index c72863b36439..250e74dfa120 100644
> --- a/drivers/lightnvm/core.c
> +++ b/drivers/lightnvm/core.c
> @@ -934,7 +934,7 @@ static int nvm_init(struct nvm_dev *dev)
>   pr_debug("nvm: ver:%x nvm_vendor:%x\n",
>   dev->identity.ver_id, dev->identity.vmnt);
>  
> - if (dev->identity.ver_id != 1) {
> + if (dev->identity.ver_id != 1 && dev->identity.ver_id != 2) {
>   pr_err("nvm: device not supported by kernel.");
>   goto err;
>   }

Hi,
The pr_err() above could be a bit more informative to the user. E.g.,
pr_err("nvm: device ver_id %d not supported by kernel.",
dev->identity.ver_id);

BTW, isn't that line missing a '\n'?
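
Combining both remarks, the call would presumably end up as something like:

	pr_err("nvm: device ver_id %d not supported by kernel.\n",
	       dev->identity.ver_id);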

-- 
~Randy


Re: [PATCH v2 2/2] block: Fix a race between the throttling code and request queue initialization

2018-02-05 Thread Bart Van Assche
On Sat, 2018-02-03 at 10:51 +0800, Joseph Qi wrote:
> Hi Bart,
> 
> On 18/2/3 00:21, Bart Van Assche wrote:
> > On Fri, 2018-02-02 at 09:02 +0800, Joseph Qi wrote:
> > > We triggered this race when using single queue. I'm not sure if it
> > > exists in multi-queue.
> > 
> > Regarding the races between modifying the queue_lock pointer and the code 
> > that
> > uses that pointer, I think the following construct in blk_cleanup_queue() is
> > sufficient to avoid races between the queue_lock pointer assignment and the 
> > code
> > that executes concurrently with blk_cleanup_queue():
> > 
> > spin_lock_irq(lock);
> > if (q->queue_lock != >__queue_lock)
> > q->queue_lock = >__queue_lock;
> > spin_unlock_irq(lock);
> > 
> 
> IMO, the race also exists.
> 
> blk_cleanup_queue   blkcg_print_blkgs
>   spin_lock_irq(lock) (1)   spin_lock_irq(blkg->q->queue_lock) (2,5)
> q->queue_lock = >__queue_lock (3)
>   spin_unlock_irq(lock) (4)
> spin_unlock_irq(blkg->q->queue_lock) (6)
> 
> (1) take driver lock;
> (2) busy loop for driver lock;
> (3) override driver lock with internal lock;
> (4) unlock driver lock; 
> (5) can take driver lock now;
> (6) but unlock internal lock.
> 
> If we get blkg->q->queue_lock to local first like blk_cleanup_queue,
> it indeed can fix the different lock use in lock/unlock. But since
> blk_cleanup_queue has overridden queue lock to internal lock now, I'm
> afraid we couldn't still use driver lock in blkcg_print_blkgs.

(+ Jan Kara)

Hello Joseph,

That's a good catch. Since modifying all code that accesses the queue_lock
pointer and that can race with blk_cleanup_queue() would be too cumbersome, I
see only one solution, namely making the request queue cgroup and sysfs
attributes invisible before the queue_lock pointer is restored. Leaving the
debugfs attributes visible while blk_cleanup_queue() is in progress should
be fine if the request queue initialization code is modified such that it
only modifies the queue_lock pointer for legacy queues. Jan, I think some
time ago you objected when I proposed to move code from __blk_release_queue()
into blk_cleanup_queue(). Would you be fine with a slightly different
approach, namely making block cgroup and sysfs attributes invisible earlier,
namely from inside blk_cleanup_queue() instead of from inside
__blk_release_queue()?
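
(As an aside, a minimal sketch of the pattern being discussed, i.e. taking a
local copy of the lock pointer so that lock and unlock always use the same
spinlock, would be:

	spinlock_t *lock = q->queue_lock;	/* snapshot before locking */

	spin_lock_irq(lock);
	/* ... access the state protected by the queue lock ... */
	spin_unlock_irq(lock);

but, as pointed out above, that alone still leaves a reader using the driver's
lock after blk_cleanup_queue() has switched q->queue_lock to the internal lock,
which is what hiding the attributes earlier is meant to avoid.)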

Thanks,

Bart.



Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Bart Van Assche
On Mon, 2018-02-05 at 18:16 +0100, Roman Penyaev wrote:
> Everything (fio jobs, setup, etc) is given in the same link:
> 
> https://www.spinics.net/lists/linux-rdma/msg48799.html
> 
> at the bottom you will find links on google docs with many pages
> and archived fio jobs and scripts. (I do not remember exactly,
> one year passed, but there should be everything).
> 
> Regarding smaller iodepth_batch_submit - that decreases performance.
> Once I played with that, even introduced new iodepth_batch_complete_max
> option for fio, but then I decided to stop and simply chose this
> configuration, which provides me fastest results.

Hello Roman,

That's weird. For which protocols did reducing iodepth_batch_submit lead
to lower performance: all the tested protocols or only some of them?

Thanks,

Bart.

Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Roman Penyaev
Hi Bart,

On Mon, Feb 5, 2018 at 5:58 PM, Bart Van Assche  wrote:
> On Mon, 2018-02-05 at 14:16 +0200, Sagi Grimberg wrote:
>> - Your latency measurements are surprisingly high for a null target
>>device (even for low end nvme device actually) regardless of the
>>transport implementation.
>>
>> For example:
>> - QD=1 read latency is 648.95 for ibnbd (I assume usecs right?) which is
>>fairly high. on nvme-rdma its 1058 us, which means over 1 millisecond
>>and even 1.254 ms for srp. Last time I tested nvme-rdma read QD=1
>>latency I got ~14 us. So something does not add up here. If this is
>>not some configuration issue, then we have serious bugs to handle..
>>
>> - QD=16 the read latencies are > 10ms for null devices?! I'm having
>>troubles understanding how you were able to get such high latencies
>>(> 100 ms for QD>=100)
>>
>> Can you share more information about your setup? It would really help
>> us understand more.
>
> I would also appreciate it if more information could be provided about the
> measurement results. In addition to answering Sagi's questions, would it
> be possible to share the fio job that was used for measuring latency? In
> https://events.static.linuxfound.org/sites/events/files/slides/Copy%20of%20IBNBD-Vault-2017-5.pdf
> I found the following:
>
> iodepth=128
> iodepth_batch_submit=128
>
> If you want to keep the pipeline full I think that you need to set the
> iodepth_batch_submit parameter to a value that is much lower than iodepth.
> I think that setting iodepth_batch_submit equal to iodepth will yield
> suboptimal IOPS results. Jens, please correct me if I got this wrong.

Sorry, Bart, I will answer here in a few words (I would like to answer
in detail tomorrow in a reply to Sagi's mail).

Everything (fio jobs, setup, etc) is given in the same link:

https://www.spinics.net/lists/linux-rdma/msg48799.html

at the bottom you will find links on google docs with many pages
and archived fio jobs and scripts. (I do not remember exactly,
one year passed, but there should be everything).

Regarding a smaller iodepth_batch_submit - that decreases performance.
I once played with that, and even introduced the new iodepth_batch_complete_max
option for fio, but then I decided to stop and simply chose this
configuration, which gives me the fastest results.

--
Roman


Re: [PATCH 05/24] ibtrs: client: main functionality

2018-02-05 Thread Roman Penyaev
On Mon, Feb 5, 2018 at 3:14 PM, Sagi Grimberg  wrote:
>
>> Indeed, seems sbitmap can be reused.
>>
>> But tags is a part of IBTRS, and is not related to block device at all.
>> One
>> IBTRS connection (session) handles many block devices
>
>
> we use host shared tag sets for the case of multiple block devices.

Unfortunately (or fortunately, depending on the intended mq design) tags are
not shared between hw_queues.  So in our case (1 session queue, N devices)
you always have to specify tags->nr_hw_queues = 1, otherwise the magic will
not happen and you will always have more tags than your session supports.  But
nr_hw_queues = 1 kills performance dramatically.  What scales well is
the following: nr_hw_queues == num_online_cpus() == number of QPs in
one session.
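
In blk-mq terms, the scaling configuration described above corresponds roughly
to the following tag-set setup (a sketch only; 'example_mq_ops', 'example_cmd'
and 'sess_queue_depth' are made-up names, not IBTRS/IBNBD code):

static int example_init_tag_set(struct blk_mq_tag_set *set,
				unsigned int sess_queue_depth)
{
	memset(set, 0, sizeof(*set));
	set->ops          = &example_mq_ops;		/* hypothetical ops */
	set->nr_hw_queues = num_online_cpus();		/* == QPs per session */
	set->queue_depth  = sess_queue_depth;		/* per-hw-queue tags */
	set->cmd_size     = sizeof(struct example_cmd);	/* hypothetical per-request pdu */
	set->numa_node    = NUMA_NO_NODE;
	set->flags        = BLK_MQ_F_SHOULD_MERGE;

	return blk_mq_alloc_tag_set(set);
}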

>> (or any IO producers).
>
>
> Lets wait until we actually have this theoretical non-block IO
> producers..
>
>> With a tag you get a free slot of a buffer where you can read/write, so
>> once
>> you've allocated a tag you won't sleep on IO path inside a library.
>
>
> Same for block tags (given that you don't set the request queue
> otherwise)
>
>> Also tag
>> helps a lot on IO fail-over to another connection (multipath
>> implementation,
>> which is also a part of the transport library, not a block device), where
>> you
>> simply reuse the same buffer slot (with a tag in your hands) forwarding IO
>> to
>> another RDMA connection.
>
>
> What is the benefit of this detached architecture?

That gives us separated rdma IO library, where ibnbd is one of the players.

> IMO, one reason why you ended up not reusing a lot of the infrastructure
> is yielded from the  attempt to support a theoretical different consumer
> that is not ibnbd.

Well, not quite.  Not using the rdma api helpers (we will use them) and not
using tags from the block layer (we need tags inside the transport), this is
not "a lot of the infrastructure" :)

I would say that we are simply not fast enough to follow all kernel trends.
That is the major reason, not some other (theoretical) user of ibtrs.

> Did you actually had plans for any other consumers?

Yep, the major target is replicated block storage; that's why the transport
is separated.

> Personally, I think you will be much better off with a unified approach
> for your block device implementation.

--
Roman


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Bart Van Assche
On Mon, 2018-02-05 at 14:16 +0200, Sagi Grimberg wrote:
> - Your latency measurements are surprisingly high for a null target
>device (even for low end nvme device actually) regardless of the
>transport implementation.
> 
> For example:
> - QD=1 read latency is 648.95 for ibnbd (I assume usecs right?) which is
>fairly high. on nvme-rdma its 1058 us, which means over 1 millisecond
>and even 1.254 ms for srp. Last time I tested nvme-rdma read QD=1
>latency I got ~14 us. So something does not add up here. If this is
>not some configuration issue, then we have serious bugs to handle..
> 
> - QD=16 the read latencies are > 10ms for null devices?! I'm having
>troubles understanding how you were able to get such high latencies
>(> 100 ms for QD>=100)
> 
> Can you share more information about your setup? It would really help
> us understand more.

I would also appreciate it if more information could be provided about the
measurement results. In addition to answering Sagi's questions, would it
be possible to share the fio job that was used for measuring latency? In
https://events.static.linuxfound.org/sites/events/files/slides/Copy%20of%20IBNBD-Vault-2017-5.pdf
I found the following:

iodepth=128
iodepth_batch_submit=128

If you want to keep the pipeline full I think that you need to set the
iodepth_batch_submit parameter to a value that is much lower than iodepth.
I think that setting iodepth_batch_submit equal to iodepth will yield
suboptimal IOPS results. Jens, please correct me if I got this wrong.
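
For instance, something along these lines (illustrative values only, not the
job file used for the published numbers):

[latency-test]
ioengine=libaio
direct=1
rw=randread
bs=4k
filename=/dev/nullb0
iodepth=128
; submit in smaller batches and reap completions as soon as they arrive
iodepth_batch_submit=16
iodepth_batch_complete_min=1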

Thanks,

Bart.




Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Danil Kipnis
On Mon, Feb 5, 2018 at 3:17 PM, Sagi Grimberg  wrote:
>
 Hi Bart,

 My another 2 cents:)
 On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche 
 wrote:
>
>
> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>>
>>
>> o Simple configuration of IBNBD:
>>  - Server side is completely passive: volumes do not need to be
>>explicitly exported.
>
>
>
> That sounds like a security hole? I think the ability to configure
> whether or
> not an initiator is allowed to log in is essential and also which
> volumes
> an
> initiator has access to.


 Our design targets a well-controlled production environment, so
 security is handled in another layer.
>>>
>>>
>>>
>>> What will happen to a new adopter of the code you are contributing?
>>
>>
>> Hi Sagi, Hi Bart,
>> thanks for your feedback.
>> We considered the "storage cluster" setup, where each ibnbd client has
>> access to each ibnbd server. Each ibnbd server manages devices under
>> his "dev_search_path" and can provide access to them to any ibnbd
>> client in the network.
>
>
> I don't understand how that helps?
>
>> On top of that Ibnbd server has an additional
>> "artificial" restriction, that a device can be mapped in writable-mode
>> by only one client at once.
>
>
> I think one would still need the option to disallow readable export as
> well.

It just occurred to me that we could easily extend the interface in
such a way that each client (i.e. each session) would have, on the server
side, its own directory with the devices it can access. I.e. instead of
just one "dev_search_path" per server, any client would only be able to
access devices under /session_name. (The session name
must already be generated by each client in a unique way.) This way
one could have explicit control over which devices can be accessed
by which clients. Do you think that would do it?
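
Purely as a hypothetical illustration of such a layout (all names invented):

    <dev_search_path>/
        session-client-A/
            vol1        <- only client A's session may open these
            vol2
        session-client-B/
            vol3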


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Jinpu Wang
On Mon, Feb 5, 2018 at 5:16 PM, Bart Van Assche  wrote:
> On Mon, 2018-02-05 at 09:56 +0100, Jinpu Wang wrote:
>> Hi Bart,
>>
>> My another 2 cents:)
>> On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche  
>> wrote:
>> > On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>> > > o Simple configuration of IBNBD:
>> > >- Server side is completely passive: volumes do not need to be
>> > >  explicitly exported.
>> >
>> > That sounds like a security hole? I think the ability to configure whether 
>> > or
>> > not an initiator is allowed to log in is essential and also which volumes 
>> > an
>> > initiator has access to.
>>
>> Our design targets a well-controlled production environment, so security is
>> handled in another layer. On the server side, the admin can set the dev_search_path
>> module parameter to the parent directory; this is concatenated with the path
>> the client sends in the open message to open a block device.
>
> Hello Jack,
>
> That approach may work well for your employer but sorry I don't think this is
> sufficient for an upstream driver. I think that most users who configure a
> network storage target expect full control over which storage devices are 
> exported
> and also over which clients do have and do not have access.
>
> Bart.
Hello Bart,

I agree that for general-purpose use, it may be good to have better access control.

Thanks,
-- 
Jack Wang
Linux Kernel Developer


Re: [PATCH 05/24] ibtrs: client: main functionality

2018-02-05 Thread Bart Van Assche
On Mon, 2018-02-05 at 15:19 +0100, Roman Penyaev wrote:
> On Mon, Feb 5, 2018 at 12:19 PM, Sagi Grimberg  wrote:
> > Do you actually ever have remote write access in your protocol?
> 
> We do not have reads, instead client writes on write and server writes
> on read. (write only storage solution :)

So there are no restrictions on which clients can log in and any client can
send RDMA writes to the target system? I think this needs to be improved ...

Thanks,

Bart.




Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Bart Van Assche
On Mon, 2018-02-05 at 09:56 +0100, Jinpu Wang wrote:
> Hi Bart,
> 
> My another 2 cents:)
> On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche  
> wrote:
> > On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
> > > o Simple configuration of IBNBD:
> > >- Server side is completely passive: volumes do not need to be
> > >  explicitly exported.
> > 
> > That sounds like a security hole? I think the ability to configure whether 
> > or
> > not an initiator is allowed to log in is essential and also which volumes an
> > initiator has access to.
> 
> Our design targets a well-controlled production environment, so security is
> handled in another layer. On the server side, the admin can set the dev_search_path
> module parameter to the parent directory; this is concatenated with the path
> the client sends in the open message to open a block device.

Hello Jack,

That approach may work well for your employer but sorry I don't think this is
sufficient for an upstream driver. I think that most users who configure a
network storage target expect full control over which storage devices are 
exported
and also over which clients do have and do not have access.

Bart.

RE: [PATCH V2 8/8] scsi: hpsa: use blk_mq to solve irq affinity issue

2018-02-05 Thread Don Brace
> -Original Message-
> This is a critical issue on the HPSA because Linus already has the
> original commit that causes the system to fail to boot.
> 
> All my testing was on DL380 G7 servers with:
> 
> Hewlett-Packard Company Smart Array G6 controllers
> Vendor: HP   Model: P410iRev: 6.64
> 
> Ming's patch fixes this so we need to try move this along.
> 
> I have a DL380 G8 as well which is also likely exposed here and I added
>  Don Brace for FYI to this list.
> 
> Thanks Ming

Running some tests now.


vgdisplay hang on iSCSI session

2018-02-05 Thread Jean-Louis Dupond

Hi All,

We've got a "strange" issue on a Xen hypervisor with CentOS 6 and a
4.9.63-29.el6.x86_64 kernel.
The system has a local raid and is connected with 2 iscsi sessions to 3
disks with multipath (6 blockdevs in total).


We've noticed that vgdisplay was hanging, and the kernel was printing 
the following message:
Feb  3 09:43:27 srv01 kernel: [3074775.157416] INFO: task 
vgdisplay:27430 blocked for more than 120 seconds.
Feb  3 09:43:27 srv01 kernel: [3074775.157507]   Not tainted 
4.9.63-29.el6.x86_64 #1
Feb  3 09:43:27 srv01 kernel: [3074775.157541] "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb  3 09:43:27 srv01 kernel: [3074775.157591] vgdisplay   D0 
27430  27429 0x0080
Feb  3 09:43:27 srv01 kernel: [3074775.157599]  880263017800 
 880244bdc540 88026cd08380
Feb  3 09:43:27 srv01 kernel: [3074775.157613]  88027de59880 
c90063effba8 818daab0 c90063effbb8
Feb  3 09:43:27 srv01 kernel: [3074775.157616]  81114992 
c90063effba8 818def5f c90063effb78

Feb  3 09:43:27 srv01 kernel: [3074775.157620] Call Trace:
Feb  3 09:43:27 srv01 kernel: [3074775.157638]  [] ? 
__schedule+0x230/0x530
Feb  3 09:43:27 srv01 kernel: [3074775.157644]  [] ? 
__call_rcu+0x132/0x220
Feb  3 09:43:27 srv01 kernel: [3074775.157648]  [] ? 
_raw_spin_lock_irqsave+0x1f/0x50
Feb  3 09:43:27 srv01 kernel: [3074775.157655]  [] 
schedule+0x3a/0xa0
Feb  3 09:43:27 srv01 kernel: [3074775.157658]  [] ? 
call_rcu_sched+0x1d/0x20
Feb  3 09:43:27 srv01 kernel: [3074775.157668]  [] 
blk_mq_freeze_queue_wait+0x6f/0xd0
Feb  3 09:43:27 srv01 kernel: [3074775.157676]  [] ? 
woken_wake_function+0x20/0x20
Feb  3 09:43:27 srv01 kernel: [3074775.157679]  [] ? 
_raw_spin_unlock_irqrestore+0x16/0x20
Feb  3 09:43:27 srv01 kernel: [3074775.157685]  [] ? 
percpu_ref_kill_and_confirm+0x63/0xd0
Feb  3 09:43:27 srv01 kernel: [3074775.157688]  [] ? 
blk_mq_run_hw_queues+0x4c/0x90
Feb  3 09:43:27 srv01 kernel: [3074775.157691]  [] 
blk_freeze_queue+0x1e/0x30
Feb  3 09:43:27 srv01 kernel: [3074775.157694]  [] 
blk_mq_freeze_queue+0xe/0x10
Feb  3 09:43:27 srv01 kernel: [3074775.157700]  [] 
loop_switch+0x1e/0xd0
Feb  3 09:43:27 srv01 kernel: [3074775.157703]  [] 
lo_release+0x7a/0x80
Feb  3 09:43:27 srv01 kernel: [3074775.157712]  [] 
__blkdev_put+0x1a7/0x200
Feb  3 09:43:27 srv01 kernel: [3074775.157734]  [] ? 
cache_free_debugcheck+0x188/0x240
Feb  3 09:43:27 srv01 kernel: [3074775.157738]  [] 
blkdev_put+0x56/0x140
Feb  3 09:43:27 srv01 kernel: [3074775.157742]  [] 
blkdev_close+0x24/0x30
Feb  3 09:43:27 srv01 kernel: [3074775.157750]  [] 
__fput+0xc8/0x240
Feb  3 09:43:27 srv01 kernel: [3074775.157753]  [] 
fput+0xe/0x10
Feb  3 09:43:27 srv01 kernel: [3074775.157760]  [] 
task_work_run+0x68/0xa0
Feb  3 09:43:27 srv01 kernel: [3074775.157767]  [] 
exit_to_usermode_loop+0xc6/0xd0
Feb  3 09:43:27 srv01 kernel: [3074775.157770]  [] 
do_syscall_64+0x185/0x240
Feb  3 09:43:27 srv01 kernel: [3074775.157774]  [] 
entry_SYSCALL64_slow_path+0x25/0x25



Now if I ran strace -f vgdisplay, it hung on the following call:
stat("/dev/loop0", {st_mode=S_IFBLK|0660, st_rdev=makedev(7, 0), ...}) = 
0

open("/dev/loop0", O_RDONLY|O_DIRECT|O_NOATIME

And it hung there forever.

multipath -l didn't show any issues on the paths!
But multipath -ll did hang on the following syscall:
close(8)= 0
close(9)= 0
close(10)   = 0
fcntl(11, F_GETFL)  = 0xc000 (flags 
O_RDONLY|O_DIRECT|O_LARGEFILE)

fcntl(11, F_SETFL, O_RDONLY|O_LARGEFILE) = 0
io_destroy(139682352197632

Also I tried to rescan the blockdevs via "echo 1 > 
/sys/block/sdx/device/rescan", and they all worked, except sdf ... 
(which is one of the iscsi blockdevs).

But the VMs using the storage were still online! So I/O was not dead.

Of course, sdf was the enabled (not active) member device of the
multipath dev:
size=5.0T features='4 queue_if_no_path pg_init_retries 50 
retain_attached_hw_handle' hwhandler='1 rdac' wp=rw

|-+- policy='round-robin 0' prio=14 status=active
| `- 12:0:0:3 sdg 8:96 active ready running
`-+- policy='round-robin 0' prio=9 status=enabled
  `- 11:0:0:3 sdf 8:80 active ready running


Also I see 1 kworker was in D-state:
root 13820  0.0  0.0  0 0 ?DFeb03   0:03  \_ 
[kworker/1:2]


Next to the tons of vgdisplay calls:
root 27430  0.0  0.0 127528  5000 ?DFeb03   0:00  \_ 
vgdisplay -c --ignorelockingfailure
root 28349  0.0  0.0 127528  5152 ?DFeb03   0:00  \_ 
vgdisplay -c --ignorelockingfailure
root 28868  0.0  0.0 127528  5160 ?DFeb03   0:00  \_ 
vgdisplay -c --ignorelockingfailure
root 29398  0.0  0.0 127528  4896 ?DFeb03   0:00  \_ 
vgdisplay -c --ignorelockingfailure
root 29708  0.0  0.0 127528  4960 ?DFeb03   0:00  \_ 
vgdisplay -c --ignorelockingfailure

Re: [PATCH V2 8/8] scsi: hpsa: use blk_mq to solve irq affinity issue

2018-02-05 Thread Laurence Oberman
On Mon, 2018-02-05 at 23:20 +0800, Ming Lei wrote:
> This patch uses .force_blk_mq to drive HPSA via SCSI_MQ, meantime
> maps
> each reply queue to blk_mq's hw queue, then .queuecommand can always
> choose the hw queue as the reply queue. And if no any online CPU is
> mapped to one hw queue, request can't be submitted to this hw queue
> at all, finally the irq affinity issue is solved.
> 
> Cc: Hannes Reinecke 
> Cc: Arun Easi 
> Cc: Omar Sandoval ,
> Cc: "Martin K. Petersen" ,
> Cc: James Bottomley ,
> Cc: Christoph Hellwig ,
> Cc: Don Brace 
> Cc: Kashyap Desai 
> Cc: Peter Rivera 
> Cc: Paolo Bonzini 
> Cc: Mike Snitzer 
> Tested-by: Laurence Oberman 
> Signed-off-by: Ming Lei 
> ---
>  drivers/scsi/hpsa.c | 51 ++-
> 
>  1 file changed, 34 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 443eabf63a9f..e517a4c74a28 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -51,6 +51,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include "hpsa_cmd.h"
> @@ -956,6 +957,13 @@ static struct device_attribute
> *hpsa_shost_attrs[] = {
>  #define HPSA_NRESERVED_CMDS  (HPSA_CMDS_RESERVED_FOR_DRIVER +\
>    HPSA_MAX_CONCURRENT_PASSTHRUS)
>  
> +static int hpsa_map_queues(struct Scsi_Host *shost)
> +{
> +struct ctlr_info *h = shost_to_hba(shost);
> +
> +return blk_mq_pci_map_queues(>tag_set, h->pdev);
> +}
> +
>  static struct scsi_host_template hpsa_driver_template = {
>   .module = THIS_MODULE,
>   .name   = HPSA,
> @@ -974,10 +982,13 @@ static struct scsi_host_template
> hpsa_driver_template = {
>  #ifdef CONFIG_COMPAT
>   .compat_ioctl   = hpsa_compat_ioctl,
>  #endif
> + .map_queues = hpsa_map_queues,
>   .sdev_attrs = hpsa_sdev_attrs,
>   .shost_attrs = hpsa_shost_attrs,
>   .max_sectors = 1024,
>   .no_write_same = 1,
> + .force_blk_mq = 1,
> + .host_tagset = 1,
>  };
>  
>  static inline u32 next_command(struct ctlr_info *h, u8 q)
> @@ -1045,11 +1056,7 @@ static void set_performant_mode(struct
> ctlr_info *h, struct CommandList *c,
>   c->busaddr |= 1 | (h->blockFetchTable[c-
> >Header.SGList] << 1);
>   if (unlikely(!h->msix_vectors))
>   return;
> - if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
> - c->Header.ReplyQueue =
> - raw_smp_processor_id() % h-
> >nreply_queues;
> - else
> - c->Header.ReplyQueue = reply_queue % h-
> >nreply_queues;
> + c->Header.ReplyQueue = reply_queue;
>   }
>  }
>  
> @@ -1063,10 +1070,7 @@ static void
> set_ioaccel1_performant_mode(struct ctlr_info *h,
>    * Tell the controller to post the reply to the queue for
> this
>    * processor.  This seems to give the best I/O throughput.
>    */
> - if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
> - cp->ReplyQueue = smp_processor_id() % h-
> >nreply_queues;
> - else
> - cp->ReplyQueue = reply_queue % h->nreply_queues;
> + cp->ReplyQueue = reply_queue;
>   /*
>    * Set the bits in the address sent down to include:
>    *  - performant mode bit (bit 0)
> @@ -1087,10 +1091,7 @@ static void
> set_ioaccel2_tmf_performant_mode(struct ctlr_info *h,
>   /* Tell the controller to post the reply to the queue for
> this
>    * processor.  This seems to give the best I/O throughput.
>    */
> - if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
> - cp->reply_queue = smp_processor_id() % h-
> >nreply_queues;
> - else
> - cp->reply_queue = reply_queue % h->nreply_queues;
> + cp->reply_queue = reply_queue;
>   /* Set the bits in the address sent down to include:
>    *  - performant mode bit not used in ioaccel mode 2
>    *  - pull count (bits 0-3)
> @@ -1109,10 +1110,7 @@ static void
> set_ioaccel2_performant_mode(struct ctlr_info *h,
>    * Tell the controller to post the reply to the queue for
> this
>    * processor.  This seems to give the best I/O throughput.
>    */
> - if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
> - cp->reply_queue = smp_processor_id() % h-
> >nreply_queues;
> - else
> - cp->reply_queue = reply_queue % h->nreply_queues;
> + cp->reply_queue = reply_queue;
>   /*
>    * Set the bits in the address sent down to include:
>    *  - performant mode bit not used in ioaccel mode 2
> @@ -1152,11 +1150,27 @@ static void
> 

Re: [PATCH V2 6/8] scsi: virtio_scsi: fix IO hang by irq vector automatic affinity

2018-02-05 Thread Paolo Bonzini
On 05/02/2018 16:20, Ming Lei wrote:
> Now 84676c1f21e8ff5(genirq/affinity: assign vectors to all possible CPUs)
> has been merged to V4.16-rc, and it is easy to allocate all offline CPUs
> for some irq vectors, this can't be avoided even though the allocation
> is improved.
> 
> For example, on a 8cores VM, 4~7 are not-present/offline, 4 queues of
> virtio-scsi, the irq affinity assigned can become the following shape:
> 
>   irq 36, cpu list 0-7
>   irq 37, cpu list 0-7
>   irq 38, cpu list 0-7
>   irq 39, cpu list 0-1
>   irq 40, cpu list 4,6
>   irq 41, cpu list 2-3
>   irq 42, cpu list 5,7
> 
> Then IO hang is triggered in case of non-SCSI_MQ.
> 
> Given storage IO is always C/S model, there isn't such issue with 
> SCSI_MQ(blk-mq),
> because no IO can be submitted to one hw queue if the hw queue hasn't online
> CPUs.
> 
> Fix this issue by forcing to use blk_mq.
> 
> BTW, I have been used virtio-scsi(scsi_mq) for several years, and it has
> been quite stable, so it shouldn't cause extra risk.

I think that's ok now that we have I/O schedulers for blk-mq.

Acked-by: Paolo Bonzini 

Paolo

> Cc: Arun Easi 
> Cc: Omar Sandoval ,
> Cc: "Martin K. Petersen" ,
> Cc: James Bottomley ,
> Cc: Christoph Hellwig ,
> Cc: Don Brace 
> Cc: Kashyap Desai 
> Cc: Peter Rivera 
> Cc: Paolo Bonzini 
> Cc: Mike Snitzer 
> Reviewed-by: Hannes Reinecke 
> Tested-by: Laurence Oberman 
> Signed-off-by: Ming Lei 
> ---
>  drivers/scsi/virtio_scsi.c | 59 
> +++---
>  1 file changed, 3 insertions(+), 56 deletions(-)
> 
> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
> index 7c28e8d4955a..54e3a0f6844c 100644
> --- a/drivers/scsi/virtio_scsi.c
> +++ b/drivers/scsi/virtio_scsi.c
> @@ -91,9 +91,6 @@ struct virtio_scsi_vq {
>  struct virtio_scsi_target_state {
>   seqcount_t tgt_seq;
>  
> - /* Count of outstanding requests. */
> - atomic_t reqs;
> -
>   /* Currently active virtqueue for requests sent to this target. */
>   struct virtio_scsi_vq *req_vq;
>  };
> @@ -152,8 +149,6 @@ static void virtscsi_complete_cmd(struct virtio_scsi 
> *vscsi, void *buf)
>   struct virtio_scsi_cmd *cmd = buf;
>   struct scsi_cmnd *sc = cmd->sc;
>   struct virtio_scsi_cmd_resp *resp = >resp.cmd;
> - struct virtio_scsi_target_state *tgt =
> - scsi_target(sc->device)->hostdata;
>  
>   dev_dbg(>device->sdev_gendev,
>   "cmd %p response %u status %#02x sense_len %u\n",
> @@ -210,8 +205,6 @@ static void virtscsi_complete_cmd(struct virtio_scsi 
> *vscsi, void *buf)
>   }
>  
>   sc->scsi_done(sc);
> -
> - atomic_dec(>reqs);
>  }
>  
>  static void virtscsi_vq_done(struct virtio_scsi *vscsi,
> @@ -580,10 +573,7 @@ static int virtscsi_queuecommand_single(struct Scsi_Host 
> *sh,
>   struct scsi_cmnd *sc)
>  {
>   struct virtio_scsi *vscsi = shost_priv(sh);
> - struct virtio_scsi_target_state *tgt =
> - scsi_target(sc->device)->hostdata;
>  
> - atomic_inc(>reqs);
>   return virtscsi_queuecommand(vscsi, >req_vqs[0], sc);
>  }
>  
> @@ -596,55 +586,11 @@ static struct virtio_scsi_vq 
> *virtscsi_pick_vq_mq(struct virtio_scsi *vscsi,
>   return >req_vqs[hwq];
>  }
>  
> -static struct virtio_scsi_vq *virtscsi_pick_vq(struct virtio_scsi *vscsi,
> -struct virtio_scsi_target_state 
> *tgt)
> -{
> - struct virtio_scsi_vq *vq;
> - unsigned long flags;
> - u32 queue_num;
> -
> - local_irq_save(flags);
> - if (atomic_inc_return(>reqs) > 1) {
> - unsigned long seq;
> -
> - do {
> - seq = read_seqcount_begin(>tgt_seq);
> - vq = tgt->req_vq;
> - } while (read_seqcount_retry(>tgt_seq, seq));
> - } else {
> - /* no writes can be concurrent because of atomic_t */
> - write_seqcount_begin(>tgt_seq);
> -
> - /* keep previous req_vq if a reader just arrived */
> - if (unlikely(atomic_read(>reqs) > 1)) {
> - vq = tgt->req_vq;
> - goto unlock;
> - }
> -
> - queue_num = smp_processor_id();
> - while (unlikely(queue_num >= vscsi->num_queues))
> - queue_num -= vscsi->num_queues;
> - tgt->req_vq = vq = >req_vqs[queue_num];
> - unlock:
> - write_seqcount_end(>tgt_seq);
> - }
> - local_irq_restore(flags);
> -
> - return vq;
> -}
> -
>  static int virtscsi_queuecommand_multi(struct Scsi_Host 

[PATCH V2 8/8] scsi: hpsa: use blk_mq to solve irq affinity issue

2018-02-05 Thread Ming Lei
This patch uses .force_blk_mq to drive HPSA via SCSI_MQ and, at the same
time, maps each reply queue to a blk_mq hw queue, so .queuecommand can
always choose the hw queue as the reply queue. And if no online CPU is
mapped to a hw queue, no request can be submitted to that hw queue at
all, which finally solves the irq affinity issue.

Cc: Hannes Reinecke 
Cc: Arun Easi 
Cc: Omar Sandoval ,
Cc: "Martin K. Petersen" ,
Cc: James Bottomley ,
Cc: Christoph Hellwig ,
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Peter Rivera 
Cc: Paolo Bonzini 
Cc: Mike Snitzer 
Tested-by: Laurence Oberman 
Signed-off-by: Ming Lei 
---
 drivers/scsi/hpsa.c | 51 ++-
 1 file changed, 34 insertions(+), 17 deletions(-)

diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
index 443eabf63a9f..e517a4c74a28 100644
--- a/drivers/scsi/hpsa.c
+++ b/drivers/scsi/hpsa.c
@@ -51,6 +51,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include "hpsa_cmd.h"
@@ -956,6 +957,13 @@ static struct device_attribute *hpsa_shost_attrs[] = {
 #define HPSA_NRESERVED_CMDS(HPSA_CMDS_RESERVED_FOR_DRIVER +\
 HPSA_MAX_CONCURRENT_PASSTHRUS)
 
+static int hpsa_map_queues(struct Scsi_Host *shost)
+{
+struct ctlr_info *h = shost_to_hba(shost);
+
+return blk_mq_pci_map_queues(&h->tag_set, h->pdev);
+}
+
 static struct scsi_host_template hpsa_driver_template = {
.module = THIS_MODULE,
.name   = HPSA,
@@ -974,10 +982,13 @@ static struct scsi_host_template hpsa_driver_template = {
 #ifdef CONFIG_COMPAT
.compat_ioctl   = hpsa_compat_ioctl,
 #endif
+   .map_queues = hpsa_map_queues,
.sdev_attrs = hpsa_sdev_attrs,
.shost_attrs = hpsa_shost_attrs,
.max_sectors = 1024,
.no_write_same = 1,
+   .force_blk_mq = 1,
+   .host_tagset = 1,
 };
 
 static inline u32 next_command(struct ctlr_info *h, u8 q)
@@ -1045,11 +1056,7 @@ static void set_performant_mode(struct ctlr_info *h, 
struct CommandList *c,
c->busaddr |= 1 | (h->blockFetchTable[c->Header.SGList] << 1);
if (unlikely(!h->msix_vectors))
return;
-   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
-   c->Header.ReplyQueue =
-   raw_smp_processor_id() % h->nreply_queues;
-   else
-   c->Header.ReplyQueue = reply_queue % h->nreply_queues;
+   c->Header.ReplyQueue = reply_queue;
}
 }
 
@@ -1063,10 +1070,7 @@ static void set_ioaccel1_performant_mode(struct 
ctlr_info *h,
 * Tell the controller to post the reply to the queue for this
 * processor.  This seems to give the best I/O throughput.
 */
-   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
-   cp->ReplyQueue = smp_processor_id() % h->nreply_queues;
-   else
-   cp->ReplyQueue = reply_queue % h->nreply_queues;
+   cp->ReplyQueue = reply_queue;
/*
 * Set the bits in the address sent down to include:
 *  - performant mode bit (bit 0)
@@ -1087,10 +1091,7 @@ static void set_ioaccel2_tmf_performant_mode(struct 
ctlr_info *h,
/* Tell the controller to post the reply to the queue for this
 * processor.  This seems to give the best I/O throughput.
 */
-   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
-   cp->reply_queue = smp_processor_id() % h->nreply_queues;
-   else
-   cp->reply_queue = reply_queue % h->nreply_queues;
+   cp->reply_queue = reply_queue;
/* Set the bits in the address sent down to include:
 *  - performant mode bit not used in ioaccel mode 2
 *  - pull count (bits 0-3)
@@ -1109,10 +1110,7 @@ static void set_ioaccel2_performant_mode(struct 
ctlr_info *h,
 * Tell the controller to post the reply to the queue for this
 * processor.  This seems to give the best I/O throughput.
 */
-   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
-   cp->reply_queue = smp_processor_id() % h->nreply_queues;
-   else
-   cp->reply_queue = reply_queue % h->nreply_queues;
+   cp->reply_queue = reply_queue;
/*
 * Set the bits in the address sent down to include:
 *  - performant mode bit not used in ioaccel mode 2
@@ -1152,11 +1150,27 @@ static void 
dial_up_lockup_detection_on_fw_flash_complete(struct ctlr_info *h,
h->heartbeat_sample_interval = HEARTBEAT_SAMPLE_INTERVAL;
 }
 
+static unsigned get_reply_queue(struct ctlr_info *h, struct CommandList *c)
+{
+  
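
[ The hunk above is cut off in this archive.  As a hedged sketch (not the
posted hpsa code) of how a driver can derive the reply queue from the blk-mq
hardware queue once .map_queues is wired up: ]

static unsigned example_reply_queue(struct ctlr_info *h, struct scsi_cmnd *sc)
{
	/*
	 * Sketch only: use the hw-queue index encoded in the request's
	 * unique tag, and fall back to the current CPU for driver-internal
	 * commands that carry no scsi_cmnd/request.
	 */
	if (sc && sc->request)
		return blk_mq_unique_tag_to_hwq(blk_mq_unique_tag(sc->request));

	return raw_smp_processor_id() % h->nreply_queues;
}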

[PATCH V2 7/8] scsi: hpsa: call hpsa_hba_inquiry() after adding host

2018-02-05 Thread Ming Lei
So that we can decide the default reply queue from the map created
while adding the host.

Cc: Hannes Reinecke 
Cc: Arun Easi 
Cc: Omar Sandoval ,
Cc: "Martin K. Petersen" ,
Cc: James Bottomley ,
Cc: Christoph Hellwig ,
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Peter Rivera 
Cc: Paolo Bonzini 
Cc: Mike Snitzer 
Tested-by: Laurence Oberman 
Signed-off-by: Ming Lei 
---
 drivers/scsi/hpsa.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
index 287e5eb0723f..443eabf63a9f 100644
--- a/drivers/scsi/hpsa.c
+++ b/drivers/scsi/hpsa.c
@@ -8691,12 +8691,9 @@ static int hpsa_init_one(struct pci_dev *pdev, const 
struct pci_device_id *ent)
/* Disable discovery polling.*/
h->discovery_polling = 0;
 
-
/* Turn the interrupts on so we can service requests */
h->access.set_intr_mask(h, HPSA_INTR_ON);
 
-   hpsa_hba_inquiry(h);
-
h->lastlogicals = kzalloc(sizeof(*(h->lastlogicals)), GFP_KERNEL);
if (!h->lastlogicals)
dev_info(&h->pdev->dev,
@@ -8707,6 +8704,8 @@ static int hpsa_init_one(struct pci_dev *pdev, const 
struct pci_device_id *ent)
if (rc)
goto clean7; /* perf, sg, cmd, irq, shost, pci, lu, aer/h */
 
+   hpsa_hba_inquiry(h);
+
/* Monitor the controller for firmware lockups */
h->heartbeat_sample_interval = HEARTBEAT_SAMPLE_INTERVAL;
INIT_DELAYED_WORK(&h->monitor_ctlr_work, hpsa_monitor_ctlr_worker);
-- 
2.9.5



[PATCH V2 6/8] scsi: virtio_scsi: fix IO hang by irq vector automatic affinity

2018-02-05 Thread Ming Lei
Now that 84676c1f21e8ff5 ("genirq/affinity: assign vectors to all possible CPUs")
has been merged into v4.16-rc, it is easy for some irq vectors to end up with
only offline CPUs assigned; this can't be avoided even though the allocation
is improved.

For example, on an 8-core VM where CPUs 4~7 are not present/offline and
virtio-scsi has 4 queues, the assigned irq affinity can take the following shape:

irq 36, cpu list 0-7
irq 37, cpu list 0-7
irq 38, cpu list 0-7
irq 39, cpu list 0-1
irq 40, cpu list 4,6
irq 41, cpu list 2-3
irq 42, cpu list 5,7

Then an IO hang is triggered in the non-SCSI_MQ case.

Given that storage IO always follows a C/S model, there is no such issue with
SCSI_MQ (blk-mq), because no IO can be submitted to a hw queue if that hw queue
has no online CPUs.

Fix this issue by forcing to use blk_mq.

BTW, I have been using virtio-scsi (scsi_mq) for several years, and it has
been quite stable, so it shouldn't introduce extra risk.

Cc: Arun Easi 
Cc: Omar Sandoval ,
Cc: "Martin K. Petersen" ,
Cc: James Bottomley ,
Cc: Christoph Hellwig ,
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Peter Rivera 
Cc: Paolo Bonzini 
Cc: Mike Snitzer 
Reviewed-by: Hannes Reinecke 
Tested-by: Laurence Oberman 
Signed-off-by: Ming Lei 
---
 drivers/scsi/virtio_scsi.c | 59 +++---
 1 file changed, 3 insertions(+), 56 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 7c28e8d4955a..54e3a0f6844c 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -91,9 +91,6 @@ struct virtio_scsi_vq {
 struct virtio_scsi_target_state {
seqcount_t tgt_seq;
 
-   /* Count of outstanding requests. */
-   atomic_t reqs;
-
/* Currently active virtqueue for requests sent to this target. */
struct virtio_scsi_vq *req_vq;
 };
@@ -152,8 +149,6 @@ static void virtscsi_complete_cmd(struct virtio_scsi 
*vscsi, void *buf)
struct virtio_scsi_cmd *cmd = buf;
struct scsi_cmnd *sc = cmd->sc;
struct virtio_scsi_cmd_resp *resp = &cmd->resp.cmd;
-   struct virtio_scsi_target_state *tgt =
-   scsi_target(sc->device)->hostdata;
 
dev_dbg(>device->sdev_gendev,
"cmd %p response %u status %#02x sense_len %u\n",
@@ -210,8 +205,6 @@ static void virtscsi_complete_cmd(struct virtio_scsi 
*vscsi, void *buf)
}
 
sc->scsi_done(sc);
-
-   atomic_dec(&tgt->reqs);
 }
 
 static void virtscsi_vq_done(struct virtio_scsi *vscsi,
@@ -580,10 +573,7 @@ static int virtscsi_queuecommand_single(struct Scsi_Host 
*sh,
struct scsi_cmnd *sc)
 {
struct virtio_scsi *vscsi = shost_priv(sh);
-   struct virtio_scsi_target_state *tgt =
-   scsi_target(sc->device)->hostdata;
 
-   atomic_inc(&tgt->reqs);
return virtscsi_queuecommand(vscsi, &vscsi->req_vqs[0], sc);
 }
 
@@ -596,55 +586,11 @@ static struct virtio_scsi_vq *virtscsi_pick_vq_mq(struct 
virtio_scsi *vscsi,
return &vscsi->req_vqs[hwq];
 }
 
-static struct virtio_scsi_vq *virtscsi_pick_vq(struct virtio_scsi *vscsi,
-  struct virtio_scsi_target_state 
*tgt)
-{
-   struct virtio_scsi_vq *vq;
-   unsigned long flags;
-   u32 queue_num;
-
-   local_irq_save(flags);
-   if (atomic_inc_return(&tgt->reqs) > 1) {
-   unsigned long seq;
-
-   do {
-   seq = read_seqcount_begin(&tgt->tgt_seq);
-   vq = tgt->req_vq;
-   } while (read_seqcount_retry(&tgt->tgt_seq, seq));
-   } else {
-   /* no writes can be concurrent because of atomic_t */
-   write_seqcount_begin(&tgt->tgt_seq);
-
-   /* keep previous req_vq if a reader just arrived */
-   if (unlikely(atomic_read(&tgt->reqs) > 1)) {
-   vq = tgt->req_vq;
-   goto unlock;
-   }
-
-   queue_num = smp_processor_id();
-   while (unlikely(queue_num >= vscsi->num_queues))
-   queue_num -= vscsi->num_queues;
-   tgt->req_vq = vq = &vscsi->req_vqs[queue_num];
- unlock:
-   write_seqcount_end(&tgt->tgt_seq);
-   }
-   local_irq_restore(flags);
-
-   return vq;
-}
-
 static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
   struct scsi_cmnd *sc)
 {
struct virtio_scsi *vscsi = shost_priv(sh);
-   struct virtio_scsi_target_state *tgt =
-   scsi_target(sc->device)->hostdata;
-   struct virtio_scsi_vq *req_vq;
-
-   if (shost_use_blk_mq(sh))
-  

[PATCH V2 5/8] scsi: introduce force_blk_mq

2018-02-05 Thread Ming Lei
From the scsi driver's view, it is a bit troublesome to support both blk-mq
and non-blk-mq at the same time, especially when drivers need to support
multiple hw queues.

This patch introduces 'force_blk_mq' to scsi_host_template so that drivers
can provide blk-mq only support, so driver code can avoid the trouble
for supporting both.

This patch may clean up drivers a lot by providing blk-mq-only support;
especially, it becomes easier to convert multiple reply queues into blk_mq's
hw queues for the following purposes:

1) use blk_mq's multiple hw queues to deal with irq vectors whose affinity
contains only offline CPUs[1]:

[1] https://marc.info/?l=linux-kernel=151748144730409=2

Now that 84676c1f21e8ff5 ("genirq/affinity: assign vectors to all possible CPUs")
has been merged into v4.16-rc, it is easy for some irq vectors to end up with
only offline CPUs assigned; this can't be avoided even though the allocation
is improved.

So all these drivers have to avoid asking the HBA to complete requests in a
reply queue which has no online CPUs assigned.

This issue can be solved generically and easily via blk_mq(scsi_mq) multiple
hw queue by mapping each reply queue into hctx.

2) some drivers[2] require requests to be completed on the submission CPU to
avoid hard/soft lockups, which is easily done with blk_mq, so there is no need
to reinvent the wheel to solve the problem.

[2] https://marc.info/?t=15160185141=1=2

Solving the above issues for the non-MQ path may not be easy, or may introduce
unnecessary work, especially since we plan to enable SCSI_MQ soon, as discussed
recently[3]:

[3] https://marc.info/?l=linux-scsi=151727684915589=2

Cc: Arun Easi 
Cc: Omar Sandoval ,
Cc: "Martin K. Petersen" ,
Cc: James Bottomley ,
Cc: Christoph Hellwig ,
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Peter Rivera 
Cc: Mike Snitzer 
Reviewed-by: Hannes Reinecke 
Tested-by: Laurence Oberman 
Signed-off-by: Ming Lei 
---
 drivers/scsi/hosts.c | 1 +
 include/scsi/scsi_host.h | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index fe3a0da3ec97..c75cebd7911d 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -471,6 +471,7 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template 
*sht, int privsize)
shost->dma_boundary = 0x;
 
shost->use_blk_mq = scsi_use_blk_mq;
+   shost->use_blk_mq = scsi_use_blk_mq || !!shost->hostt->force_blk_mq;
 
device_initialize(>shost_gendev);
dev_set_name(>shost_gendev, "host%d", shost->host_no);
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index f6623f887ee4..d67b2eaff8f1 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -452,6 +452,9 @@ struct scsi_host_template {
/* True if the controller does not support WRITE SAME */
unsigned no_write_same:1;
 
+   /* True if the low-level driver supports blk-mq only */
+   unsigned force_blk_mq:1;
+
/*
 * Countdown for host blocking with no commands outstanding.
 */
-- 
2.9.5



[PATCH V2 4/8] block: null_blk: introduce module parameter of 'g_global_tags'

2018-02-05 Thread Ming Lei
This patch introduces the 'g_global_tags' module parameter so that we can
test this feature easily via null_blk.

No obvious performance drop is seen with global_tags when the whole hw
queue depth is kept the same:

1) no 'global_tags', each hw queue depth is 1, and 4 hw queues
modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 global_tags=0 
submit_queues=4 hw_queue_depth=1

2) 'global_tags', global hw queue depth is 4, and 4 hw queues
modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 global_tags=1 
submit_queues=4 hw_queue_depth=4

3) fio test done in above two settings:
   fio --bs=4k --size=512G  --rw=randread --norandommap --direct=1 
--ioengine=libaio --iodepth=4 --runtime=$RUNTIME --group_reporting=1  
--name=nullb0 --filename=/dev/nullb0 --name=nullb1 --filename=/dev/nullb1 
--name=nullb2 --filename=/dev/nullb2 --name=nullb3 --filename=/dev/nullb3

1M IOPS can be reached in both above tests which is done in one VM.

Cc: Arun Easi 
Cc: Omar Sandoval ,
Cc: "Martin K. Petersen" ,
Cc: James Bottomley ,
Cc: Christoph Hellwig ,
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Peter Rivera 
Cc: Mike Snitzer 
Tested-by: Laurence Oberman 
Reviewed-by: Hannes Reinecke 
Signed-off-by: Ming Lei 
---
 drivers/block/null_blk.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 287a09611c0f..ad0834efad42 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -163,6 +163,10 @@ static int g_submit_queues = 1;
 module_param_named(submit_queues, g_submit_queues, int, S_IRUGO);
 MODULE_PARM_DESC(submit_queues, "Number of submission queues");
 
+static int g_global_tags = 0;
+module_param_named(global_tags, g_global_tags, int, S_IRUGO);
+MODULE_PARM_DESC(global_tags, "All submission queues share one tags");
+
 static int g_home_node = NUMA_NO_NODE;
 module_param_named(home_node, g_home_node, int, S_IRUGO);
 MODULE_PARM_DESC(home_node, "Home node for the device");
@@ -1622,6 +1626,8 @@ static int null_init_tag_set(struct nullb *nullb, struct 
blk_mq_tag_set *set)
set->flags = BLK_MQ_F_SHOULD_MERGE;
if (g_no_sched)
set->flags |= BLK_MQ_F_NO_SCHED;
+   if (g_global_tags)
+   set->flags |= BLK_MQ_F_GLOBAL_TAGS;
set->driver_data = NULL;
 
if ((nullb && nullb->dev->blocking) || g_blocking)
-- 
2.9.5



[PATCH V2 3/8] scsi: Add template flag 'host_tagset'

2018-02-05 Thread Ming Lei
From: Hannes Reinecke 

Add a host template flag 'host_tagset' to enable the use of a global
tagset for block-mq.

Cc: Hannes Reinecke 
Cc: Arun Easi 
Cc: Omar Sandoval ,
Cc: "Martin K. Petersen" ,
Cc: James Bottomley ,
Cc: Christoph Hellwig ,
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Peter Rivera 
Cc: Paolo Bonzini 
Cc: Mike Snitzer 
Tested-by: Laurence Oberman 
Signed-off-by: Hannes Reinecke 
Signed-off-by: Ming Lei 
---
 drivers/scsi/scsi_lib.c  | 2 ++
 include/scsi/scsi_host.h | 3 +++
 2 files changed, 5 insertions(+)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 55be2550c555..9ab74ac634ea 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2274,6 +2274,8 @@ int scsi_mq_setup_tags(struct Scsi_Host *shost)
shost->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
shost->tag_set.flags |=
BLK_ALLOC_POLICY_TO_MQ_FLAG(shost->hostt->tag_alloc_policy);
+   if (shost->hostt->host_tagset)
+   shost->tag_set.flags |= BLK_MQ_F_GLOBAL_TAGS;
shost->tag_set.driver_data = shost;
 
return blk_mq_alloc_tag_set(&shost->tag_set);
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index a8b7bf879ced..f6623f887ee4 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -457,6 +457,9 @@ struct scsi_host_template {
 */
unsigned int max_host_blocked;
 
+   /* True if the host supports a host-wide tagspace */
+   unsigned host_tagset:1;
+
/*
 * Default value for the blocking.  If the queue is empty,
 * host_blocked counts down in the request_fn until it restarts
-- 
2.9.5



[PATCH V2 2/8] blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS

2018-02-05 Thread Ming Lei
Quite a few HBAs (such as HPSA, megaraid, mpt3sas, ...) support multiple
reply queues, but the tag space is often HBA wide.

These HBAs have switched to use pci_alloc_irq_vectors(PCI_IRQ_AFFINITY)
for automatic affinity assignment.

Now that 84676c1f21e8ff5 ("genirq/affinity: assign vectors to all possible CPUs")
has been merged into v4.16-rc, it is easy for some irq vectors to end up with
only offline CPUs assigned; this can't be avoided even though the allocation
is improved.

So all these drivers have to avoid asking the HBA to complete requests in a
reply queue which has no online CPUs assigned, and HPSA has been broken
with v4.15+:

https://marc.info/?l=linux-kernel=151748144730409=2

This issue can be solved generically and easily via blk_mq (scsi_mq) multiple
hw queues by mapping each reply queue onto a hctx, but one tricky thing is
the HBA-wide (instead of hw-queue-wide) tags.

This patch is based on the following Hannes's patch:

https://marc.info/?l=linux-block=149132580511346=2

One big difference from Hannes's patch is that this patch only makes the tag
sbitmaps and the active_queues data structure HBA wide; the others (request,
hctx, tags, ...) keep their NUMA locality.

The following patch adds global-tags support to null_blk and provides
performance data; no obvious performance loss is observed when the whole
hw queue depth stays the same.
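
Conceptually (a rough sketch of the resulting data layout, not a verbatim copy
of the patches), every hw queue keeps its own struct blk_mq_tags so requests
and their pages stay NUMA-local, while the allocator state can be shared HBA
wide:

struct blk_mq_tags {
	unsigned int nr_tags;
	unsigned int nr_reserved_tags;

	atomic_t *active_queues;		/* shared when global tags are used */
	struct sbitmap_queue *bitmap_tags;	/* shared bitmap of free tags */
	struct sbitmap_queue *breserved_tags;	/* shared reserved-tag bitmap */

	struct request **rqs;			/* still per hw queue */
	struct request **static_rqs;		/* still per hw queue */
	struct list_head page_list;
};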

Cc: Hannes Reinecke 
Cc: Arun Easi 
Cc: Omar Sandoval ,
Cc: "Martin K. Petersen" ,
Cc: James Bottomley ,
Cc: Christoph Hellwig ,
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Peter Rivera 
Cc: Mike Snitzer 
Tested-by: Laurence Oberman 
Signed-off-by: Ming Lei 
---
 block/blk-mq-debugfs.c |  1 +
 block/blk-mq-sched.c   | 13 -
 block/blk-mq-tag.c | 23 ++-
 block/blk-mq-tag.h |  5 -
 block/blk-mq.c | 29 -
 block/blk-mq.h |  3 ++-
 include/linux/blk-mq.h |  2 ++
 7 files changed, 63 insertions(+), 13 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 0dfafa4b655a..0f0fafe03f5d 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -206,6 +206,7 @@ static const char *const hctx_flag_name[] = {
HCTX_FLAG_NAME(SHOULD_MERGE),
HCTX_FLAG_NAME(TAG_SHARED),
HCTX_FLAG_NAME(SG_MERGE),
+   HCTX_FLAG_NAME(GLOBAL_TAGS),
HCTX_FLAG_NAME(BLOCKING),
HCTX_FLAG_NAME(NO_SCHED),
 };
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 55c0a745b427..385bbec73804 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -81,6 +81,17 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx 
*hctx)
} else
clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
 
+   /* need to restart all hw queues for global tags */
+   if (hctx->flags & BLK_MQ_F_GLOBAL_TAGS) {
+   struct blk_mq_hw_ctx *hctx2;
+   int i;
+
+   queue_for_each_hw_ctx(hctx->queue, hctx2, i)
+   if (blk_mq_run_hw_queue(hctx2, true))
+   return true;
+   return false;
+   }
+
return blk_mq_run_hw_queue(hctx, true);
 }
 
@@ -495,7 +506,7 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
int ret;
 
hctx->sched_tags = blk_mq_alloc_rq_map(set, hctx_idx, q->nr_requests,
-  set->reserved_tags);
+  set->reserved_tags, false);
if (!hctx->sched_tags)
return -ENOMEM;
 
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 571797dc36cb..66377d09eaeb 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -379,9 +379,11 @@ static struct blk_mq_tags *blk_mq_init_bitmap_tags(struct 
blk_mq_tags *tags,
return NULL;
 }
 
-struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
+struct blk_mq_tags *blk_mq_init_tags(struct blk_mq_tag_set *set,
+unsigned int total_tags,
 unsigned int reserved_tags,
-int node, int alloc_policy)
+int node, int alloc_policy,
+bool global_tag)
 {
struct blk_mq_tags *tags;
 
@@ -397,6 +399,14 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int 
total_tags,
tags->nr_tags = total_tags;
tags->nr_reserved_tags = reserved_tags;
 
+   WARN_ON(global_tag && !set->global_tags);
+   if (global_tag && set->global_tags) {
+   tags->bitmap_tags = set->global_tags->bitmap_tags;
+   tags->breserved_tags = set->global_tags->breserved_tags;
+   tags->active_queues = 

[PATCH V2 1/8] blk-mq: tags: define several fields of tags as pointer

2018-02-05 Thread Ming Lei
This patch changes tags->breserved_tags, tags->bitmap_tags and
tags->active_queues as pointer, and prepares for supporting global tags.

No functional change.

Tested-by: Laurence Oberman 
Reviewed-by: Hannes Reinecke 
Cc: Mike Snitzer 
Cc: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 block/bfq-iosched.c|  4 ++--
 block/blk-mq-debugfs.c | 10 +-
 block/blk-mq-tag.c | 48 ++--
 block/blk-mq-tag.h | 10 +++---
 block/blk-mq.c |  2 +-
 block/kyber-iosched.c  |  2 +-
 6 files changed, 42 insertions(+), 34 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 47e6ec7427c4..1e1211814a57 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -534,9 +534,9 @@ static void bfq_limit_depth(unsigned int op, struct 
blk_mq_alloc_data *data)
WARN_ON_ONCE(1);
return;
}
-   bt = &tags->breserved_tags;
+   bt = tags->breserved_tags;
} else
-   bt = &tags->bitmap_tags;
+   bt = tags->bitmap_tags;
 
if (unlikely(bfqd->sb_shift != bt->sb.shift))
bfq_update_depths(bfqd, bt);
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 21cbc1f071c6..0dfafa4b655a 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -433,14 +433,14 @@ static void blk_mq_debugfs_tags_show(struct seq_file *m,
seq_printf(m, "nr_tags=%u\n", tags->nr_tags);
seq_printf(m, "nr_reserved_tags=%u\n", tags->nr_reserved_tags);
seq_printf(m, "active_queues=%d\n",
-  atomic_read(&tags->active_queues));
+  atomic_read(tags->active_queues));
 
seq_puts(m, "\nbitmap_tags:\n");
-   sbitmap_queue_show(&tags->bitmap_tags, m);
+   sbitmap_queue_show(tags->bitmap_tags, m);
 
if (tags->nr_reserved_tags) {
seq_puts(m, "\nbreserved_tags:\n");
-   sbitmap_queue_show(&tags->breserved_tags, m);
+   sbitmap_queue_show(tags->breserved_tags, m);
}
 }
 
@@ -471,7 +471,7 @@ static int hctx_tags_bitmap_show(void *data, struct 
seq_file *m)
if (res)
goto out;
if (hctx->tags)
-   sbitmap_bitmap_show(&hctx->tags->bitmap_tags.sb, m);
+   sbitmap_bitmap_show(&hctx->tags->bitmap_tags->sb, m);
mutex_unlock(>sysfs_lock);
 
 out:
@@ -505,7 +505,7 @@ static int hctx_sched_tags_bitmap_show(void *data, struct 
seq_file *m)
if (res)
goto out;
if (hctx->sched_tags)
-   sbitmap_bitmap_show(&hctx->sched_tags->bitmap_tags.sb, m);
+   sbitmap_bitmap_show(&hctx->sched_tags->bitmap_tags->sb, m);
mutex_unlock(>sysfs_lock);
 
 out:
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 336dde07b230..571797dc36cb 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -18,7 +18,7 @@ bool blk_mq_has_free_tags(struct blk_mq_tags *tags)
if (!tags)
return true;
 
-   return sbitmap_any_bit_clear(&tags->bitmap_tags.sb);
+   return sbitmap_any_bit_clear(&tags->bitmap_tags->sb);
 }
 
 /*
@@ -28,7 +28,7 @@ bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 {
if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
!test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
-   atomic_inc(&hctx->tags->active_queues);
+   atomic_inc(hctx->tags->active_queues);
 
return true;
 }
@@ -38,9 +38,9 @@ bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
  */
 void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool include_reserve)
 {
-   sbitmap_queue_wake_all(&tags->bitmap_tags);
+   sbitmap_queue_wake_all(tags->bitmap_tags);
if (include_reserve)
-   sbitmap_queue_wake_all(&tags->breserved_tags);
+   sbitmap_queue_wake_all(tags->breserved_tags);
 }
 
 /*
@@ -54,7 +54,7 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
return;

-   atomic_dec(&tags->active_queues);
+   atomic_dec(tags->active_queues);
 
blk_mq_tag_wakeup_all(tags, false);
 }
@@ -79,7 +79,7 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
if (bt->sb.depth == 1)
return true;
 
-   users = atomic_read(&hctx->tags->active_queues);
+   users = atomic_read(hctx->tags->active_queues);
if (!users)
return true;
 
@@ -117,10 +117,10 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data 
*data)
WARN_ON_ONCE(1);
return BLK_MQ_TAG_FAIL;
}
-   bt = &tags->breserved_tags;
+   bt = tags->breserved_tags;
tag_offset = 0;
} else {
-   bt = &tags->bitmap_tags;
+   bt = tags->bitmap_tags;
tag_offset = tags->nr_reserved_tags;
}
 
@@ 

[PATCH V2 0/8] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-05 Thread Ming Lei
Hi All,

This patchset supports global tags which was started by Hannes originally:

https://marc.info/?l=linux-block=149132580511346=2

Also introduce 'force_blk_mq' and 'host_tagset' to 'struct scsi_host_template',
so that drivers can avoid supporting two IO paths (legacy and blk-mq), especially
since recent discussion mentioned that SCSI_MQ will be enabled by default soon.

https://marc.info/?l=linux-scsi=151727684915589=2

With the above changes, it should be easier to convert SCSI drivers'
reply queues into blk-mq's hctx, and then the automatic irq affinity issue
can be solved easily; please see the detailed description in the commit log
and the 8th patch for converting HPSA.
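
For illustration only (not code from this patchset), a .map_queues callback
that spreads the possible CPUs evenly over the hw queues could look roughly
like the sketch below; the real HPSA conversion in the 8th patch is more
involved:

#include <linux/blk-mq.h>
#include <linux/cpumask.h>

/* sketch: one hw queue per reply queue, CPUs spread round-robin */
static int my_hba_map_queues(struct blk_mq_tag_set *set)
{
	unsigned int cpu;

	for_each_possible_cpu(cpu)
		set->mq_map[cpu] = cpu % set->nr_hw_queues;

	return 0;
}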

Also, drivers may require completing the request on the submission CPU
to avoid hard/soft deadlocks, which can be done easily with blk-mq
too.

https://marc.info/?t=15160185141=1=2
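
(For illustration only, a rough sketch of that kind of completion path: the
IRQ handler just looks the request up by tag and calls blk_mq_complete_request(),
and blk-mq redirects the completion to the submitting CPU when rq_affinity /
QUEUE_FLAG_SAME_COMP is in effect.  All my_* names below are hypothetical.)

#include <linux/blk-mq.h>
#include <linux/interrupt.h>

static irqreturn_t my_hba_isr(int irq, void *data)
{
	struct my_hba *hba = data;		/* hypothetical driver struct */
	u32 tag;

	/* my_hba_pop_completion() stands in for reading the HBA reply queue */
	while (my_hba_pop_completion(hba, &tag)) {
		struct request *rq = blk_mq_tag_to_rq(hba->tag_set.tags[0], tag);

		if (rq)
			blk_mq_complete_request(rq);
	}

	return IRQ_HANDLED;
}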

The 6th patch uses the introduced 'force_blk_mq' to fix virtio_scsi
so that the IO hang issue can be avoided inside the legacy IO path; this issue is
a bit generic, as at least HPSA/virtio-scsi are found broken with v4.15+.

The 7th & 8th patches fix HPSA's IO issue, which is caused by the reply
queue selection algorithm.

Laurence has verified that this patchset makes HPSA work with the Linus
tree.

The V2 can be found in the following tree/branch:

https://github.com/ming1/linux/commits/v4.15-for-next-global-tags-v2

V2:
- fix restart code for global tags
- fixes HPSA's IO hang issue
- add 'scsi: Add template flag 'host_tagset''
- reorder patch

Thanks
Ming

Hannes Reinecke (1):
  scsi: Add template flag 'host_tagset'

Ming Lei (7):
  blk-mq: tags: define several fields of tags as pointer
  blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS
  block: null_blk: introduce module parameter of 'g_global_tags'
  scsi: introduce force_blk_mq
  scsi: virtio_scsi: fix IO hang by irq vector automatic affinity
  scsi: hpsa: call hpsa_hba_inquiry() after adding host
  scsi: hpsa: use blk_mq to solve irq affinity issue

 block/bfq-iosched.c|  4 +--
 block/blk-mq-debugfs.c | 11 
 block/blk-mq-sched.c   | 13 -
 block/blk-mq-tag.c | 67 +-
 block/blk-mq-tag.h | 15 ---
 block/blk-mq.c | 31 -
 block/blk-mq.h |  3 ++-
 block/kyber-iosched.c  |  2 +-
 drivers/block/null_blk.c   |  6 +
 drivers/scsi/hosts.c   |  1 +
 drivers/scsi/hpsa.c| 56 --
 drivers/scsi/scsi_lib.c|  2 ++
 drivers/scsi/virtio_scsi.c | 59 +++-
 include/linux/blk-mq.h |  2 ++
 include/scsi/scsi_host.h   |  6 +
 15 files changed, 157 insertions(+), 121 deletions(-)

-- 
2.9.5



Re: [PATCH 05/24] ibtrs: client: main functionality

2018-02-05 Thread Roman Penyaev
Hi Sagi,

On Mon, Feb 5, 2018 at 12:19 PM, Sagi Grimberg  wrote:
> Hi Roman,
>
>> +static inline void ibtrs_clt_state_lock(void)
>> +{
>> +   rcu_read_lock();
>> +}
>> +
>> +static inline void ibtrs_clt_state_unlock(void)
>> +{
>> +   rcu_read_unlock();
>> +}
>
>
> This looks rather pointless...

Yeah, old scraps.  Some time later those were not only wrappers
around rcu.  Now rcu can be called explicitly, that is true.
Thanks.

>
>> +
>> +#define cmpxchg_min(var, new) ({   \
>> +   typeof(var) old;\
>> +   \
>> +   do {\
>> +   old = var;  \
>> +   new = (!old ? new : min_t(typeof(var), old, new));  \
>> +   } while (cmpxchg(&var, old, new) != old);   \
>> +})
>
>
> Why is this sort of thing local to your driver?

Good question :)  Most likely because personally I do not know
what is the good generic place for this kind of stuff.

Probably I share the same feeling with the author of these lines
in nvme/host/rdma.c: put_unaligned_le24() :)

>> +/**
>> + * struct ibtrs_fr_pool - pool of fast registration descriptors
>> + *
>> + * An entry is available for allocation if and only if it occurs in
>> @free_list.
>> + *
>> + * @size:  Number of descriptors in this pool.
>> + * @max_page_list_len: Maximum fast registration work request page list
>> length.
>> + * @lock:  Protects free_list.
>> + * @free_list: List of free descriptors.
>> + * @desc:  Fast registration descriptor pool.
>> + */
>> +struct ibtrs_fr_pool {
>> +   int size;
>> +   int max_page_list_len;
>> +   spinlock_t  lock; /* protects free_list */
>> +   struct list_headfree_list;
>> +   struct ibtrs_fr_descdesc[0];
>> +};
>
>
> We already have a per-qp fr list implementation, any specific reason to
> implement it again?

No, fr is a part of the code which we are not using, fmr is faster
in our setup.  So we will need to reiterate on fr mode, thanks.


>> +static inline struct ibtrs_tag *
>> +__ibtrs_get_tag(struct ibtrs_clt *clt, enum ibtrs_clt_con_type con_type)
>> +{
>> +   size_t max_depth = clt->queue_depth;
>> +   struct ibtrs_tag *tag;
>> +   int cpu, bit;
>> +
>> +   cpu = get_cpu();
>> +   do {
>> +   bit = find_first_zero_bit(clt->tags_map, max_depth);
>> +   if (unlikely(bit >= max_depth)) {
>> +   put_cpu();
>> +   return NULL;
>> +   }
>> +
>> +   } while (unlikely(test_and_set_bit_lock(bit, clt->tags_map)));
>> +   put_cpu();
>> +
>> +   tag = GET_TAG(clt, bit);
>> +   WARN_ON(tag->mem_id != bit);
>> +   tag->cpu_id = cpu;
>> +   tag->con_type = con_type;
>> +
>> +   return tag;
>> +}
>> +
>> +static inline void __ibtrs_put_tag(struct ibtrs_clt *clt,
>> +  struct ibtrs_tag *tag)
>> +{
>> +   clear_bit_unlock(tag->mem_id, clt->tags_map);
>> +}
>> +
>> +struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *clt,
>> +   enum ibtrs_clt_con_type con_type,
>> +   int can_wait)
>> +{
>> +   struct ibtrs_tag *tag;
>> +   DEFINE_WAIT(wait);
>> +
>> +   tag = __ibtrs_get_tag(clt, con_type);
>> +   if (likely(tag) || !can_wait)
>> +   return tag;
>> +
>> +   do {
>> +   prepare_to_wait(&clt->tags_wait, &wait,
>> TASK_UNINTERRUPTIBLE);
>> +   tag = __ibtrs_get_tag(clt, con_type);
>> +   if (likely(tag))
>> +   break;
>> +
>> +   io_schedule();
>> +   } while (1);
>> +
>> +   finish_wait(&clt->tags_wait, &wait);
>> +
>> +   return tag;
>> +}
>> +EXPORT_SYMBOL(ibtrs_clt_get_tag);
>> +
>> +void ibtrs_clt_put_tag(struct ibtrs_clt *clt, struct ibtrs_tag *tag)
>> +{
>> +   if (WARN_ON(!test_bit(tag->mem_id, clt->tags_map)))
>> +   return;
>> +
>> +   __ibtrs_put_tag(clt, tag);
>> +
>> +   /*
>> +* Putting a tag is a barrier, so we will observe
>> +* new entry in the wait list, no worries.
>> +*/
>> +   if (waitqueue_active(&clt->tags_wait))
>> +   wake_up(&clt->tags_wait);
>> +}
>> +EXPORT_SYMBOL(ibtrs_clt_put_tag);
>
>
> Again, the tags are not clear why they are needed...

We have two separate instances: a block device (IBNBD) and a transport
library (IBTRS).  Many block devices share the same IBTRS session
with a fixed-size queue depth.  Tags are a part of IBTRS: with an allocated
tag you get a free buffer slot which you can read or write, so once
you've allocated a tag you won't sleep on the IO path inside the library.
Also the tag helps a lot on IO fail-over to another 

Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Sagi Grimberg



Hi Bart,

My another 2 cents:)
On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche 
wrote:


On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:


o Simple configuration of IBNBD:
 - Server side is completely passive: volumes do not need to be
   explicitly exported.



That sounds like a security hole? I think the ability to configure
whether or
not an initiator is allowed to log in is essential and also which volumes
an
initiator has access to.


Our design targets a well controlled production environment, so
security is handled in another layer.



What will happen to a new adopter of the code you are contributing?


Hi Sagi, Hi Bart,
thanks for your feedback.
We considered the "storage cluster" setup, where each ibnbd client has
access to each ibnbd server. Each ibnbd server manages devices under
its "dev_search_path" and can provide access to them to any ibnbd
client in the network.


I don't understand how that helps?


On top of that the ibnbd server has an additional
"artificial" restriction that a device can be mapped in writable mode
by only one client at a time.


I think one would still need the option to disallow readable export as
well.


Re: [PATCH 23/24] ibnbd: a bit of documentation

2018-02-05 Thread Sagi Grimberg



/sys/kernel was chosen ages ago and I completely forgot to move it to configfs.
IBTRS is not a block device, so for some read-only entries (statistics
or states)
something else should probably be used, not configfs.  Or is it fine
to read the state
of the connection from configfs?  To me it sounds a bit weird.


Configs go via configfs, states and alike go to sysfs (but you need your
own class device).
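
A minimal sketch of that class-device approach, with a single read-only
attribute (all names here are made up for illustration):

#include <linux/device.h>
#include <linux/err.h>
#include <linux/module.h>

static struct class *ibtrs_clt_class;

static ssize_t state_show(struct device *dev,
			  struct device_attribute *attr, char *page)
{
	/* a real driver would look the session up via dev_get_drvdata(dev) */
	return scnprintf(page, PAGE_SIZE, "connected\n");
}
static DEVICE_ATTR_RO(state);

static int ibtrs_clt_sysfs_init(void)
{
	struct device *dev;

	ibtrs_clt_class = class_create(THIS_MODULE, "ibtrs_client");
	if (IS_ERR(ibtrs_clt_class))
		return PTR_ERR(ibtrs_clt_class);

	dev = device_create(ibtrs_clt_class, NULL, MKDEV(0, 0), NULL, "sess0");
	if (IS_ERR(dev))
		return PTR_ERR(dev);

	return device_create_file(dev, &dev_attr_state);
}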


Re: [PATCH 05/24] ibtrs: client: main functionality

2018-02-05 Thread Sagi Grimberg



Indeed, seems sbitmap can be reused.

But tags is a part of IBTRS, and is not related to block device at all.  One
IBTRS connection (session) handles many block devices


we use host shared tag sets for the case of multiple block devices.


(or any IO producers).


Let's wait until we actually have these theoretical non-block IO
producers..


With a tag you get a free slot of a buffer where you can read/write, so once
you've allocated a tag you won't sleep on IO path inside a library.


Same for block tags (given that you don't set the request queue
otherwise)


Also tag
helps a lot on IO fail-over to another connection (multipath implementation,
which is also a part of the transport library, not a block device), where you
simply reuse the same buffer slot (with a tag in your hands) forwarding IO to
another RDMA connection.


What is the benefit of this detached architecture? IMO, one reason why
you ended up not reusing a lot of the infrastructure is yielded from the
attempt to support a theoretical different consumer that is not ibnbd.
Did you actually had plans for any other consumers?

Personally, I think you will be much better off with a unified approach
for your block device implementation.
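
For comparison, a rough sketch of the shared-tag-set model mentioned above:
one blk_mq_tag_set allocated once, then one request queue per mapped volume
created from it.  The my_* names are hypothetical:

#include <linux/blk-mq.h>
#include <linux/err.h>

static struct blk_mq_tag_set my_tag_set;

static int my_create_volumes(int nvols)
{
	int i, ret;

	my_tag_set.ops		= &my_mq_ops;		/* hypothetical blk_mq_ops */
	my_tag_set.nr_hw_queues	= 1;
	my_tag_set.queue_depth	= 128;
	my_tag_set.numa_node	= NUMA_NO_NODE;
	my_tag_set.cmd_size	= sizeof(struct my_cmd);/* hypothetical per-cmd data */

	ret = blk_mq_alloc_tag_set(&my_tag_set);
	if (ret)
		return ret;

	for (i = 0; i < nvols; i++) {
		struct request_queue *q = blk_mq_init_queue(&my_tag_set);

		if (IS_ERR(q)) {
			ret = PTR_ERR(q);
			break;
		}
		/* attach q to a gendisk for this volume here */
	}

	return ret;
}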


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Danil Kipnis
>
>> Hi Bart,
>>
>> My another 2 cents:)
>> On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche 
>> wrote:
>>>
>>> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:

 o Simple configuration of IBNBD:
 - Server side is completely passive: volumes do not need to be
   explicitly exported.
>>>
>>>
>>> That sounds like a security hole? I think the ability to configure
>>> whether or
>>> not an initiator is allowed to log in is essential and also which volumes
>>> an
>>> initiator has access to.
>>
>> Our design targets a well controlled production environment, so
>> security is handled in another layer.
>
>
> What will happen to a new adopter of the code you are contributing?

Hi Sagi, Hi Bart,
thanks for your feedback.
We considered the "storage cluster" setup, where each ibnbd client has
access to each ibnbd server. Each ibnbd server manages devices under
its "dev_search_path" and can provide access to them to any ibnbd
client in the network. On top of that the ibnbd server has an additional
"artificial" restriction that a device can be mapped in writable mode
by only one client at a time.

-- 
Danil


Re: [PATCH 05/24] ibtrs: client: main functionality

2018-02-05 Thread Roman Penyaev
On Fri, Feb 2, 2018 at 5:54 PM, Bart Van Assche  wrote:
> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>> +static inline struct ibtrs_tag *
>> +__ibtrs_get_tag(struct ibtrs_clt *clt, enum ibtrs_clt_con_type con_type)
>> +{
>> + size_t max_depth = clt->queue_depth;
>> + struct ibtrs_tag *tag;
>> + int cpu, bit;
>> +
>> + cpu = get_cpu();
>> + do {
>> + bit = find_first_zero_bit(clt->tags_map, max_depth);
>> + if (unlikely(bit >= max_depth)) {
>> + put_cpu();
>> + return NULL;
>> + }
>> +
>> + } while (unlikely(test_and_set_bit_lock(bit, clt->tags_map)));
>> + put_cpu();
>> +
>> + tag = GET_TAG(clt, bit);
>> + WARN_ON(tag->mem_id != bit);
>> + tag->cpu_id = cpu;
>> + tag->con_type = con_type;
>> +
>> + return tag;
>> +}
>> +
>> +static inline void __ibtrs_put_tag(struct ibtrs_clt *clt,
>> +struct ibtrs_tag *tag)
>> +{
>> + clear_bit_unlock(tag->mem_id, clt->tags_map);
>> +}
>> +
>> +struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *clt,
>> + enum ibtrs_clt_con_type con_type,
>> + int can_wait)
>> +{
>> + struct ibtrs_tag *tag;
>> + DEFINE_WAIT(wait);
>> +
>> + tag = __ibtrs_get_tag(clt, con_type);
>> + if (likely(tag) || !can_wait)
>> + return tag;
>> +
>> + do {
>> + prepare_to_wait(&clt->tags_wait, &wait, TASK_UNINTERRUPTIBLE);
>> + tag = __ibtrs_get_tag(clt, con_type);
>> + if (likely(tag))
>> + break;
>> +
>> + io_schedule();
>> + } while (1);
>> +
>> + finish_wait(&clt->tags_wait, &wait);
>> +
>> + return tag;
>> +}
>> +EXPORT_SYMBOL(ibtrs_clt_get_tag);
>> +
>> +void ibtrs_clt_put_tag(struct ibtrs_clt *clt, struct ibtrs_tag *tag)
>> +{
>> + if (WARN_ON(!test_bit(tag->mem_id, clt->tags_map)))
>> + return;
>> +
>> + __ibtrs_put_tag(clt, tag);
>> +
>> + /*
>> +  * Putting a tag is a barrier, so we will observe
>> +  * new entry in the wait list, no worries.
>> +  */
>> + if (waitqueue_active(&clt->tags_wait))
>> + wake_up(&clt->tags_wait);
>> +}
>> +EXPORT_SYMBOL(ibtrs_clt_put_tag);
>
> Do these functions have any advantage over the code in lib/sbitmap.c? If not,
> please call the sbitmap functions instead of adding an additional tag 
> allocator.

Indeed, seems sbitmap can be reused.

But tags is a part of IBTRS, and is not related to block device at all.  One
IBTRS connection (session) handles many block devices (or any IO producers).
With a tag you get a free slot of a buffer where you can read/write, so once
you've allocated a tag you won't sleep on IO path inside a library.  Also tag
helps a lot on IO fail-over to another connection (multipath implementation,
which is also a part of the transport library, not a block device), where you
simply reuse the same buffer slot (with a tag in your hands) forwarding IO to
another RDMA connection.

--
Roman
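
A minimal sketch of what reusing sbitmap could look like here; the
tags_bitmap field and the helpers below are made up for illustration:

#include <linux/sbitmap.h>

static int ibtrs_tags_init(struct ibtrs_clt *clt)
{
	/* shift = -1 lets sbitmap pick the word granularity itself */
	return sbitmap_queue_init_node(&clt->tags_bitmap, clt->queue_depth,
				       -1, false, GFP_KERNEL, NUMA_NO_NODE);
}

static int ibtrs_tag_get(struct ibtrs_clt *clt)
{
	return __sbitmap_queue_get(&clt->tags_bitmap);	/* -1 when exhausted */
}

static void ibtrs_tag_put(struct ibtrs_clt *clt, int tag, unsigned int cpu)
{
	sbitmap_queue_clear(&clt->tags_bitmap, tag, cpu);
}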


Re: [PATCH 23/24] ibnbd: a bit of documentation

2018-02-05 Thread Roman Penyaev
On Fri, Feb 2, 2018 at 4:55 PM, Bart Van Assche  wrote:
> On Fri, 2018-02-02 at 15:09 +0100, Roman Pen wrote:
>> +Entries under /sys/kernel/ibnbd_client/
>> +===
>> [ ... ]
>
> You will need Greg KH's permission to add new entries directly under 
> /sys/kernel.
> Since I think that it is unlikely that he will give that permission: have you
> considered to add the new client entries under /sys/class/block for the 
> client and
> /sys/kernel/configfs/ibnbd for the target, similar to what the NVMeOF drivers 
> do
> today?

/sys/kernel was chosen ages ago and I completely forgot to move it to configfs.
IBTRS is not a block device, so for some read-only entries (statistics
or states)
something else should probably be used, not configfs.  Or is it fine
to read the state
of the connection from configfs?  To me it sounds a bit weird.

--
Roman


Re: [PATCH 16/24] ibnbd: client: main functionality

2018-02-05 Thread Roman Penyaev
On Fri, Feb 2, 2018 at 4:11 PM, Jens Axboe  wrote:
> On 2/2/18 7:08 AM, Roman Pen wrote:
>> This is main functionality of ibnbd-client module, which provides
>> interface to map remote device as local block device /dev/ibnbd
>> and feeds IBTRS with IO requests.
>
> Kill the legacy IO path for this, the driver should only support
> blk-mq. Hence kill off your BLK_RQ part, that will eliminate
> the dual path you have too.

ok, sounds good. Frankly, we have not used the rq path for a long time.

--
Roman


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Sagi Grimberg

Hi Roman and the team (again), replying to my own email :)

I forgot to mention that first of all thank you for upstreaming
your work! I fully support your goal to have your production driver
upstream to minimize your maintenance efforts. I hope that my
feedback didn't came across with a different impression, that was
certainly not my intent.

It would be great if you can address and/or reply to my feedback
(as well as others) and re-spin it again.

Cheers,
Sagi.


[PATCH 4/4] nvme: lightnvm: add late setup of block size and metadata

2018-02-05 Thread Matias Bjørling
The nvme driver sets up the size of the nvme namespace in two steps.
First it initializes the device with standard logical block and
metadata sizes, and then sets the correct logical block and metadata
size. Because the OCSSD 2.0 specification relies on the namespace to
expose these sizes for correct initialization, let them be updated
appropriately on the LightNVM side as well.

Signed-off-by: Matias Bjørling 
---
 drivers/nvme/host/core.c | 2 ++
 drivers/nvme/host/lightnvm.c | 8 
 drivers/nvme/host/nvme.h | 2 ++
 3 files changed, 12 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f837d666cbd4..740ceb28067c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1379,6 +1379,8 @@ static void __nvme_revalidate_disk(struct gendisk *disk, 
struct nvme_id_ns *id)
if (ns->noiob)
nvme_set_chunk_size(ns);
nvme_update_disk_info(disk, ns, id);
+   if (ns->ndev)
+   nvme_nvm_update_nvm_info(ns);
 #ifdef CONFIG_NVME_MULTIPATH
if (ns->head->disk)
nvme_update_disk_info(ns->head->disk, ns, id);
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index a9c010655ccc..8d4301854811 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -814,6 +814,14 @@ int nvme_nvm_ioctl(struct nvme_ns *ns, unsigned int cmd, 
unsigned long arg)
}
 }
 
+void nvme_nvm_update_nvm_info(struct nvme_ns *ns)
+{
+   struct nvm_dev *ndev = ns->ndev;
+
+   ndev->identity.csecs = ndev->geo.sec_size = 1 << ns->lba_shift;
+   ndev->identity.sos = ndev->geo.oob_size = ns->ms;
+}
+
 int nvme_nvm_register(struct nvme_ns *ns, char *disk_name, int node)
 {
struct request_queue *q = ns->queue;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ea1aa5283e8e..1ca08f4993ba 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -451,12 +451,14 @@ static inline void nvme_mpath_clear_current_path(struct 
nvme_ns *ns)
 #endif /* CONFIG_NVME_MULTIPATH */
 
 #ifdef CONFIG_NVM
+void nvme_nvm_update_nvm_info(struct nvme_ns *ns);
 int nvme_nvm_register(struct nvme_ns *ns, char *disk_name, int node);
 void nvme_nvm_unregister(struct nvme_ns *ns);
 int nvme_nvm_register_sysfs(struct nvme_ns *ns);
 void nvme_nvm_unregister_sysfs(struct nvme_ns *ns);
 int nvme_nvm_ioctl(struct nvme_ns *ns, unsigned int cmd, unsigned long arg);
 #else
+static inline void nvme_nvm_update_nvm_info(struct nvme_ns *ns) {};
 static inline int nvme_nvm_register(struct nvme_ns *ns, char *disk_name,
int node)
 {
-- 
2.11.0



[PATCH 1/4] lightnvm: make 1.2 data structures explicit

2018-02-05 Thread Matias Bjørling
Make the 1.2 data structures explicit, so it will be easy to identify
the 2.0 data structures. Also fix the order of which the nvme_nvm_*
are declared, such that they follow the nvme_nvm_command order.

Signed-off-by: Matias Bjørling 
---
 drivers/nvme/host/lightnvm.c | 82 ++--
 1 file changed, 41 insertions(+), 41 deletions(-)

diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index dc0b1335c7c6..60db3f1b59da 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -51,6 +51,21 @@ struct nvme_nvm_ph_rw {
__le64  resv;
 };
 
+struct nvme_nvm_erase_blk {
+   __u8opcode;
+   __u8flags;
+   __u16   command_id;
+   __le32  nsid;
+   __u64   rsvd[2];
+   __le64  prp1;
+   __le64  prp2;
+   __le64  spba;
+   __le16  length;
+   __le16  control;
+   __le32  dsmgmt;
+   __le64  resv;
+};
+
 struct nvme_nvm_identity {
__u8opcode;
__u8flags;
@@ -89,33 +104,18 @@ struct nvme_nvm_setbbtbl {
__u32   rsvd4[3];
 };
 
-struct nvme_nvm_erase_blk {
-   __u8opcode;
-   __u8flags;
-   __u16   command_id;
-   __le32  nsid;
-   __u64   rsvd[2];
-   __le64  prp1;
-   __le64  prp2;
-   __le64  spba;
-   __le16  length;
-   __le16  control;
-   __le32  dsmgmt;
-   __le64  resv;
-};
-
 struct nvme_nvm_command {
union {
struct nvme_common_command common;
-   struct nvme_nvm_identity identity;
struct nvme_nvm_ph_rw ph_rw;
+   struct nvme_nvm_erase_blk erase;
+   struct nvme_nvm_identity identity;
struct nvme_nvm_getbbtbl get_bb;
struct nvme_nvm_setbbtbl set_bb;
-   struct nvme_nvm_erase_blk erase;
};
 };
 
-struct nvme_nvm_id_group {
+struct nvme_nvm_id12_grp {
__u8mtype;
__u8fmtype;
__le16  res16;
@@ -141,7 +141,7 @@ struct nvme_nvm_id_group {
__u8reserved[906];
 } __packed;
 
-struct nvme_nvm_addr_format {
+struct nvme_nvm_id12_addrf {
__u8ch_offset;
__u8ch_len;
__u8lun_offset;
@@ -157,16 +157,16 @@ struct nvme_nvm_addr_format {
__u8res[4];
 } __packed;
 
-struct nvme_nvm_id {
+struct nvme_nvm_id12 {
__u8ver_id;
__u8vmnt;
__u8cgrps;
__u8res;
__le32  cap;
__le32  dom;
-   struct nvme_nvm_addr_format ppaf;
+   struct nvme_nvm_id12_addrf ppaf;
__u8resv[228];
-   struct nvme_nvm_id_group group;
+   struct nvme_nvm_id12_grp grp;
__u8resv2[2880];
 } __packed;
 
@@ -191,25 +191,25 @@ static inline void _nvme_nvm_check_size(void)
 {
BUILD_BUG_ON(sizeof(struct nvme_nvm_identity) != 64);
BUILD_BUG_ON(sizeof(struct nvme_nvm_ph_rw) != 64);
+   BUILD_BUG_ON(sizeof(struct nvme_nvm_erase_blk) != 64);
BUILD_BUG_ON(sizeof(struct nvme_nvm_getbbtbl) != 64);
BUILD_BUG_ON(sizeof(struct nvme_nvm_setbbtbl) != 64);
-   BUILD_BUG_ON(sizeof(struct nvme_nvm_erase_blk) != 64);
-   BUILD_BUG_ON(sizeof(struct nvme_nvm_id_group) != 960);
-   BUILD_BUG_ON(sizeof(struct nvme_nvm_addr_format) != 16);
-   BUILD_BUG_ON(sizeof(struct nvme_nvm_id) != NVME_IDENTIFY_DATA_SIZE);
+   BUILD_BUG_ON(sizeof(struct nvme_nvm_id12_grp) != 960);
+   BUILD_BUG_ON(sizeof(struct nvme_nvm_id12_addrf) != 16);
+   BUILD_BUG_ON(sizeof(struct nvme_nvm_id12) != NVME_IDENTIFY_DATA_SIZE);
BUILD_BUG_ON(sizeof(struct nvme_nvm_bb_tbl) != 64);
 }
 
-static int init_grps(struct nvm_id *nvm_id, struct nvme_nvm_id *nvme_nvm_id)
+static int init_grp(struct nvm_id *nvm_id, struct nvme_nvm_id12 *id12)
 {
-   struct nvme_nvm_id_group *src;
+   struct nvme_nvm_id12_grp *src;
struct nvm_id_group *grp;
int sec_per_pg, sec_per_pl, pg_per_blk;
 
-   if (nvme_nvm_id->cgrps != 1)
+   if (id12->cgrps != 1)
return -EINVAL;
 
-   src = &nvme_nvm_id->group;
+   src = &id12->grp;
grp = &nvm_id->grp;
 
grp->mtype = src->mtype;
@@ -261,34 +261,34 @@ static int init_grps(struct nvm_id *nvm_id, struct 
nvme_nvm_id *nvme_nvm_id)
 

Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Sagi Grimberg

Hi Roman and the team,

On 02/02/2018 04:08 PM, Roman Pen wrote:

This series introduces IBNBD/IBTRS modules.

IBTRS (InfiniBand Transport) is a reliable high speed transport library
which allows for establishing connection between client and server
machines via RDMA.


So it's not strictly infiniband, correct?

 It is optimized to transfer (read/write) IO blocks

in the sense that it follows the BIO semantics of providing the
possibility to either write data from a scatter-gather list to the
remote side or to request ("read") data transfer from the remote side
into a given set of buffers.

IBTRS is multipath capable and provides I/O fail-over and load-balancing
functionality.


Couple of questions on your multipath implementation?
1. What was your main objective over dm-multipath?
2. What was the consideration of this implementation over
creating a stand-alone bio based device node to reinject the
bio to the original block device?


IBNBD (InfiniBand Network Block Device) is a pair of kernel modules
(client and server) that allow for remote access of a block device on
the server over IBTRS protocol. After being mapped, the remote block
devices can be accessed on the client side as local block devices.
Internally IBNBD uses IBTRS as an RDMA transport library.

Why?

- IBNBD/IBTRS is developed in order to map thin provisioned volumes,
  thus the internal protocol is simple and consists of only a few request
 types, without awareness of the underlying hardware devices.


Can you explain how the protocol is developed for thin-p? What is the
essence of how it's suited for it?


- IBTRS was developed as an independent RDMA transport library, which
  supports fail-over and load-balancing policies using multipath, thus
 it can be used for any other IO needs rather than only for block
 device.


What do you mean by "any other IO"?


- IBNBD/IBTRS is faster than NVME over RDMA.  Old comparison results:
  https://www.spinics.net/lists/linux-rdma/msg48799.html
  (I retested on the latest 4.14 kernel - there is no significant
  difference, thus I post the old link).


That is interesting to learn.

Reading your reference brings a couple of questions though,
- It's unclear to me how ibnbd performs reads without performing memory
  registration. Is it using the global dma rkey?

- It's unclear to me how there is a difference in noreg in writes,
  because for small writes nvme-rdma never register memory (it uses
  inline data).

- Looks like with nvme-rdma you max out your iops at 1.6 MIOPs, that
  seems considerably low against other reports. Can you try and explain
  what was the bottleneck? This can be a potential bug and I (and the
  rest of the community) would be interested in knowing more details.

- srp/scst comparison is really not fair having it in legacy request
  mode. Can you please repeat it and report a bug to either linux-rdma
  or to the scst mailing list?

- Your latency measurements are surprisingly high for a null target
  device (even for low end nvme device actually) regardless of the
  transport implementation.

For example:
- QD=1 read latency is 648.95 for ibnbd (I assume usecs right?) which is
  fairly high. on nvme-rdma its 1058 us, which means over 1 millisecond
  and even 1.254 ms for srp. Last time I tested nvme-rdma read QD=1
  latency I got ~14 us. So something does not add up here. If this is
  not some configuration issue, then we have serious bugs to handle..

- QD=16 the read latencies are > 10ms for null devices?! I'm having
  troubles understanding how you were able to get such high latencies
  (> 100 ms for QD>=100)

Can you share more information about your setup? It would really help
us understand more.


- Major parts of the code were rewritten, simplified and overall code
  size was reduced by a quarter.


That is good to know.


[PATCH 3/4] lightnvm: add 2.0 geometry identification

2018-02-05 Thread Matias Bjørling
Implement the geometry data structures for 2.0 and enable a drive
to be identified as one, including exposing the appropriate 2.0
sysfs entries.

Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/core.c  |   2 +-
 drivers/nvme/host/lightnvm.c | 334 +--
 include/linux/lightnvm.h |  11 +-
 3 files changed, 295 insertions(+), 52 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index c72863b36439..250e74dfa120 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -934,7 +934,7 @@ static int nvm_init(struct nvm_dev *dev)
pr_debug("nvm: ver:%x nvm_vendor:%x\n",
dev->identity.ver_id, dev->identity.vmnt);
 
-   if (dev->identity.ver_id != 1) {
+   if (dev->identity.ver_id != 1 && dev->identity.ver_id != 2) {
pr_err("nvm: device not supported by kernel.");
goto err;
}
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 6412551ecc65..a9c010655ccc 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -184,6 +184,58 @@ struct nvme_nvm_bb_tbl {
__u8blk[0];
 };
 
+struct nvme_nvm_id20_addrf {
+   __u8grp_len;
+   __u8pu_len;
+   __u8chk_len;
+   __u8lba_len;
+   __u8resv[4];
+};
+
+struct nvme_nvm_id20 {
+   __u8mjr;
+   __u8mnr;
+   __u8resv[6];
+
+   struct nvme_nvm_id20_addrf lbaf;
+
+   __u32   mccap;
+   __u8resv2[12];
+
+   __u8wit;
+   __u8resv3[31];
+
+   /* Geometry */
+   __u16   num_grp;
+   __u16   num_pu;
+   __u32   num_chk;
+   __u32   clba;
+   __u8resv4[52];
+
+   /* Write data requirements */
+   __u32   ws_min;
+   __u32   ws_opt;
+   __u32   mw_cunits;
+   __u32   maxoc;
+   __u32   maxocpu;
+   __u8resv5[44];
+
+   /* Performance related metrics */
+   __u32   trdt;
+   __u32   trdm;
+   __u32   twrt;
+   __u32   twrm;
+   __u32   tcrst;
+   __u32   tcrsm;
+   __u8resv6[40];
+
+   /* Reserved area */
+   __u8resv7[2816];
+
+   /* Vendor specific */
+   __u8vs[1024];
+};
+
 /*
  * Check we didn't inadvertently grow the command struct
  */
@@ -198,6 +250,8 @@ static inline void _nvme_nvm_check_size(void)
BUILD_BUG_ON(sizeof(struct nvme_nvm_id12_addrf) != 16);
BUILD_BUG_ON(sizeof(struct nvme_nvm_id12) != NVME_IDENTIFY_DATA_SIZE);
BUILD_BUG_ON(sizeof(struct nvme_nvm_bb_tbl) != 64);
+   BUILD_BUG_ON(sizeof(struct nvme_nvm_id20_addrf) != 8);
+   BUILD_BUG_ON(sizeof(struct nvme_nvm_id20) != NVME_IDENTIFY_DATA_SIZE);
 }
 
 static int init_grp(struct nvm_id *nvm_id, struct nvme_nvm_id12 *id12)
@@ -256,6 +310,49 @@ static int init_grp(struct nvm_id *nvm_id, struct 
nvme_nvm_id12 *id12)
return 0;
 }
 
+static int nvme_nvm_setup_12(struct nvm_dev *nvmdev, struct nvm_id *nvm_id,
+   struct nvme_nvm_id12 *id)
+{
+   nvm_id->ver_id = id->ver_id;
+   nvm_id->vmnt = id->vmnt;
+   nvm_id->cap = le32_to_cpu(id->cap);
+   nvm_id->dom = le32_to_cpu(id->dom);
+   memcpy(&nvm_id->ppaf, &id->ppaf,
+   sizeof(struct nvm_addr_format));
+
+   return init_grp(nvm_id, id);
+}
+
+static int nvme_nvm_setup_20(struct nvm_dev *nvmdev, struct nvm_id *nvm_id,
+   struct nvme_nvm_id20 *id)
+{
+   nvm_id->ver_id = id->mjr;
+
+   nvm_id->num_ch = le16_to_cpu(id->num_grp);
+   nvm_id->num_lun = le16_to_cpu(id->num_pu);
+   nvm_id->num_chk = le32_to_cpu(id->num_chk);
+   nvm_id->clba = le32_to_cpu(id->clba);
+
+   nvm_id->ws_min = le32_to_cpu(id->ws_min);
+   nvm_id->ws_opt = le32_to_cpu(id->ws_opt);
+   nvm_id->mw_cunits = le32_to_cpu(id->mw_cunits);
+
+   nvm_id->trdt = le32_to_cpu(id->trdt);
+   nvm_id->trdm = le32_to_cpu(id->trdm);
+   nvm_id->tprt = le32_to_cpu(id->twrt);
+   nvm_id->tprm = le32_to_cpu(id->twrm);
+   nvm_id->tbet = le32_to_cpu(id->tcrst);
+   nvm_id->tbem = le32_to_cpu(id->tcrsm);
+
+   /* calculated values */
+   nvm_id->ws_per_chk = nvm_id->clba / nvm_id->ws_min;
+
+   /* 1.2 compatibility */
+   nvm_id->ws_seq = NVM_IO_SNGL_ACCESS;
+
+   return 0;
+}
+
 static int nvme_nvm_identity(struct nvm_dev *nvmdev, struct nvm_id *nvm_id)
 {
struct nvme_ns *ns 

[PATCH 2/4] lightnvm: flatten nvm_id_group into nvm_id

2018-02-05 Thread Matias Bjørling
There are no groups in the 2.0 specification, make sure that the
nvm_id structure is flattened before 2.0 data structures are added.

Signed-off-by: Matias Bjørling 
---
 drivers/lightnvm/core.c  |  25 ++-
 drivers/nvme/host/lightnvm.c | 100 +--
 include/linux/lightnvm.h |  53 +++
 3 files changed, 86 insertions(+), 92 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index dcc9e621e651..c72863b36439 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -851,33 +851,32 @@ EXPORT_SYMBOL(nvm_get_tgt_bb_tbl);
 static int nvm_core_init(struct nvm_dev *dev)
 {
struct nvm_id *id = &dev->identity;
-   struct nvm_id_group *grp = &id->grp;
struct nvm_geo *geo = &dev->geo;
int ret;
 
memcpy(&geo->ppaf, &id->ppaf, sizeof(struct nvm_addr_format));
 
-   if (grp->mtype != 0) {
+   if (id->mtype != 0) {
pr_err("nvm: memory type not supported\n");
return -EINVAL;
}
 
/* Whole device values */
-   geo->nr_chnls = grp->num_ch;
-   geo->nr_luns = grp->num_lun;
+   geo->nr_chnls = id->num_ch;
+   geo->nr_luns = id->num_lun;
 
/* Generic device geometry values */
-   geo->ws_min = grp->ws_min;
-   geo->ws_opt = grp->ws_opt;
-   geo->ws_seq = grp->ws_seq;
-   geo->ws_per_chk = grp->ws_per_chk;
-   geo->nr_chks = grp->num_chk;
-   geo->sec_size = grp->csecs;
-   geo->oob_size = grp->sos;
-   geo->mccap = grp->mccap;
+   geo->ws_min = id->ws_min;
+   geo->ws_opt = id->ws_opt;
+   geo->ws_seq = id->ws_seq;
+   geo->ws_per_chk = id->ws_per_chk;
+   geo->nr_chks = id->num_chk;
+   geo->sec_size = id->csecs;
+   geo->oob_size = id->sos;
+   geo->mccap = id->mccap;
geo->max_rq_size = dev->ops->max_phys_sect * geo->sec_size;
 
-   geo->sec_per_chk = grp->clba;
+   geo->sec_per_chk = id->clba;
geo->sec_per_lun = geo->sec_per_chk * geo->nr_chks;
geo->all_luns = geo->nr_luns * geo->nr_chnls;
 
diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 60db3f1b59da..6412551ecc65 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -203,57 +203,55 @@ static inline void _nvme_nvm_check_size(void)
 static int init_grp(struct nvm_id *nvm_id, struct nvme_nvm_id12 *id12)
 {
struct nvme_nvm_id12_grp *src;
-   struct nvm_id_group *grp;
int sec_per_pg, sec_per_pl, pg_per_blk;
 
if (id12->cgrps != 1)
return -EINVAL;
 
src = &id12->grp;
-   grp = &nvm_id->grp;
 
-   grp->mtype = src->mtype;
-   grp->fmtype = src->fmtype;
+   nvm_id->mtype = src->mtype;
+   nvm_id->fmtype = src->fmtype;
 
-   grp->num_ch = src->num_ch;
-   grp->num_lun = src->num_lun;
+   nvm_id->num_ch = src->num_ch;
+   nvm_id->num_lun = src->num_lun;
 
-   grp->num_chk = le16_to_cpu(src->num_chk);
-   grp->csecs = le16_to_cpu(src->csecs);
-   grp->sos = le16_to_cpu(src->sos);
+   nvm_id->num_chk = le16_to_cpu(src->num_chk);
+   nvm_id->csecs = le16_to_cpu(src->csecs);
+   nvm_id->sos = le16_to_cpu(src->sos);
 
pg_per_blk = le16_to_cpu(src->num_pg);
-   sec_per_pg = le16_to_cpu(src->fpg_sz) / grp->csecs;
+   sec_per_pg = le16_to_cpu(src->fpg_sz) / nvm_id->csecs;
sec_per_pl = sec_per_pg * src->num_pln;
-   grp->clba = sec_per_pl * pg_per_blk;
-   grp->ws_per_chk = pg_per_blk;
+   nvm_id->clba = sec_per_pl * pg_per_blk;
+   nvm_id->ws_per_chk = pg_per_blk;
 
-   grp->mpos = le32_to_cpu(src->mpos);
-   grp->cpar = le16_to_cpu(src->cpar);
-   grp->mccap = le32_to_cpu(src->mccap);
+   nvm_id->mpos = le32_to_cpu(src->mpos);
+   nvm_id->cpar = le16_to_cpu(src->cpar);
+   nvm_id->mccap = le32_to_cpu(src->mccap);
 
-   grp->ws_opt = grp->ws_min = sec_per_pg;
-   grp->ws_seq = NVM_IO_SNGL_ACCESS;
+   nvm_id->ws_opt = nvm_id->ws_min = sec_per_pg;
+   nvm_id->ws_seq = NVM_IO_SNGL_ACCESS;
 
-   if (grp->mpos & 0x020202) {
-   grp->ws_seq = NVM_IO_DUAL_ACCESS;
-   grp->ws_opt <<= 1;
-   } else if (grp->mpos & 0x040404) {
-   grp->ws_seq = NVM_IO_QUAD_ACCESS;
-   grp->ws_opt <<= 2;
+   if (nvm_id->mpos & 0x020202) {
+   nvm_id->ws_seq = NVM_IO_DUAL_ACCESS;
+   nvm_id->ws_opt <<= 1;
+   } else if (nvm_id->mpos & 0x040404) {
+   nvm_id->ws_seq = NVM_IO_QUAD_ACCESS;
+   nvm_id->ws_opt <<= 2;
}
 
-   grp->trdt = le32_to_cpu(src->trdt);
-   grp->trdm = le32_to_cpu(src->trdm);
-   grp->tprt = le32_to_cpu(src->tprt);
-   grp->tprm = le32_to_cpu(src->tprm);
-   grp->tbet = le32_to_cpu(src->tbet);
-   grp->tbem = le32_to_cpu(src->tbem);
+   nvm_id->trdt = le32_to_cpu(src->trdt);
+   nvm_id->trdm = 

[PATCH 0/4] lightnvm: base 2.0 implementation

2018-02-05 Thread Matias Bjørling
Hi,

A couple of patches for 2.0 support for the lightnvm subsystem. They
form the basis for integrating 2.0 support.

For the rest of the support, Javier has code that implements report
chunk and sets up the LBA format data structure. He also has a bunch
of patches that brings pblk up to speed.

The first two patches is preparation for the 2.0 work. The third patch
implements the 2.0 data structures, the geometry command, and exposes
the sysfs attributes that comes with the 2.0 specification. Note that
the attributes between 1.2 and 2.0 are different, and it is expected
that user-space shall use the version sysfs attribute to know which
attributes will be available.

The last patch implements support for using the nvme namespace logical
block and metadata fields and sync it with the internal lightnvm
identify structures.

-Matias

Matias Bjørling (4):
  lightnvm: make 1.2 data structures explicit
  lightnvm: flatten nvm_id_group into nvm_id
  lightnvm: add 2.0 geometry identification
  nvme: lightnvm: add late setup of block size and metadata

 drivers/lightnvm/core.c  |  27 ++-
 drivers/nvme/host/core.c |   2 +
 drivers/nvme/host/lightnvm.c | 508 ---
 drivers/nvme/host/nvme.h |   2 +
 include/linux/lightnvm.h |  64 +++---
 5 files changed, 426 insertions(+), 177 deletions(-)

-- 
2.11.0



[PATCH 2/2] bcache: fix for data collapse after re-attaching an attached device

2018-02-05 Thread tang . junhui
From: Tang Junhui 

The back-end device sdm has already been attached to a cache_set with ID
f67ebe1f-f8bc-4d73-bfe5-9dc88607f119; then we try to attach it to
another cache set, and it returns with an error:
[root]# cd /sys/block/sdm/bcache
[root]# echo 5ccd0a63-148e-48b8-afa2-aca9cbd6279f > attach
-bash: echo: write error: Invalid argument

After that, execute a command to modify the label of bcache
device:
[root]# echo data_disk1 > label

Then we reboot the system; when the system powers on, the back-end
device can not attach to the cache_set, and a message shows in the log:
Feb  5 12:05:52 ceph152 kernel: [922385.508498] bcache:
bch_cached_dev_attach() couldn't find uuid for sdm in set

In sysfs_attach(), dc->sb.set_uuid was assigned the value input
through sysfs, no matter whether bch_cached_dev_attach() succeeded
or not. For example, if the back-end device has already been attached
to a cache set, bch_cached_dev_attach() would fail, but dc->sb.set_uuid
was changed anyway. Then, when the label of the bcache device is
modified, bch_write_bdev_super() is called, which writes dc->sb.set_uuid
to the super block, so we record a wrong cache set ID in the super
block. After the system reboots, the cache set couldn't find the uuid
of the back-end device, so the bcache device couldn't be created and
used any more.

In this patch, we don't assign the cache set ID to dc->sb.set_uuid
in sysfs_attach() directly, but pass it into bch_cached_dev_attach(),
and assign dc->sb.set_uuid to the cache set ID only after the back-end
device has been attached to the cache set successfully.

Signed-off-by: Tang Junhui 
---
 drivers/md/bcache/bcache.h |  2 +-
 drivers/md/bcache/super.c  | 10 ++
 drivers/md/bcache/sysfs.c  |  6 --
 3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 843877e..40f0267 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -907,7 +907,7 @@ bool bch_alloc_sectors(struct cache_set *, struct bkey *, 
unsigned,
 
 int bch_flash_dev_create(struct cache_set *c, uint64_t size);
 
-int bch_cached_dev_attach(struct cached_dev *, struct cache_set *);
+int bch_cached_dev_attach(struct cached_dev *, struct cache_set *, uint8_t *);
 void bch_cached_dev_detach(struct cached_dev *);
 void bch_cached_dev_run(struct cached_dev *);
 void bcache_device_stop(struct bcache_device *);
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index b4d2892..bdabdca 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -946,7 +946,8 @@ void bch_cached_dev_detach(struct cached_dev *dc)
cached_dev_put(dc);
 }
 
-int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c)
+int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c,
+ uint8_t *set_uuid)
 {
uint32_t rtime = cpu_to_le32(get_seconds());
struct uuid_entry *u;
@@ -954,7 +955,8 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct 
cache_set *c)
 
bdevname(dc->bdev, buf);
 
-   if (memcmp(dc->sb.set_uuid, c->sb.set_uuid, 16))
+   if ((set_uuid && memcmp(set_uuid, c->sb.set_uuid, 16)) ||
+   (!set_uuid && memcmp(dc->sb.set_uuid, c->sb.set_uuid, 16)))
return -ENOENT;
 
if (dc->disk.c) {
@@ -1183,7 +1185,7 @@ static void register_bdev(struct cache_sb *sb, struct 
page *sb_page,
 
list_add(&dc->list, &uncached_devices);
list_for_each_entry(c, &bch_cache_sets, list)
-   bch_cached_dev_attach(dc, c);
+   bch_cached_dev_attach(dc, c, NULL);
 
if (BDEV_STATE(&dc->sb) == BDEV_STATE_NONE ||
BDEV_STATE(&dc->sb) == BDEV_STATE_STALE)
@@ -1705,7 +1707,7 @@ static void run_cache_set(struct cache_set *c)
bcache_write_super(c);
 
list_for_each_entry_safe(dc, t, &uncached_devices, list)
-   bch_cached_dev_attach(dc, c);
+   bch_cached_dev_attach(dc, c, NULL);
 
flash_devs_run(c);
 
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 9c8fdd5..6316399 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -267,12 +267,14 @@
}
 
if (attr == &sysfs_attach) {
-   if (bch_parse_uuid(buf, dc->sb.set_uuid) < 16)
+   uint8_t set_uuid[16];
+
+   if (bch_parse_uuid(buf, set_uuid) < 16)
return -EINVAL;
 
v = -ENOENT;
list_for_each_entry(c, &bch_cache_sets, list) {
-   v = bch_cached_dev_attach(dc, c);
+   v = bch_cached_dev_attach(dc, c, set_uuid);
if (!v)
return size;
}
-- 
1.8.3.1



Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Sagi Grimberg



Hi Bart,

My another 2 cents:)
On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche  wrote:

On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:

o Simple configuration of IBNBD:
- Server side is completely passive: volumes do not need to be
  explicitly exported.


That sounds like a security hole? I think the ability to configure whether or
not an initiator is allowed to log in is essential and also which volumes an
initiator has access to.

Our design targets a well controlled production environment, so
security is handled in another layer.


What will happen to a new adopter of the code you are contributing?


Re: [PATCH 09/24] ibtrs: server: main functionality

2018-02-05 Thread Sagi Grimberg

Hi Roman,

Some comments below.

On 02/02/2018 04:08 PM, Roman Pen wrote:

This is main functionality of ibtrs-server module, which accepts
set of RDMA connections (so called IBTRS session), creates/destroys
sysfs entries associated with IBTRS session and notifies upper layer
(user of IBTRS API) about RDMA requests or link events.

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
  drivers/infiniband/ulp/ibtrs/ibtrs-srv.c | 1811 ++
  1 file changed, 1811 insertions(+)

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-srv.c 
b/drivers/infiniband/ulp/ibtrs/ibtrs-srv.c
new file mode 100644
index ..0d1fc08bd821
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-srv.c
@@ -0,0 +1,1811 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see .
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include 
+#include 
+
+#include "ibtrs-srv.h"
+#include "ibtrs-log.h"
+
+MODULE_AUTHOR("ib...@profitbricks.com");
+MODULE_DESCRIPTION("IBTRS Server");
+MODULE_VERSION(IBTRS_VER_STRING);
+MODULE_LICENSE("GPL");
+
+#define DEFAULT_MAX_IO_SIZE_KB 128
+#define DEFAULT_MAX_IO_SIZE (DEFAULT_MAX_IO_SIZE_KB * 1024)
+#define MAX_REQ_SIZE PAGE_SIZE
+#define MAX_SG_COUNT ((MAX_REQ_SIZE - sizeof(struct ibtrs_msg_rdma_read)) \
+ / sizeof(struct ibtrs_sg_desc))
+
+static int max_io_size = DEFAULT_MAX_IO_SIZE;
+static int rcv_buf_size = DEFAULT_MAX_IO_SIZE + MAX_REQ_SIZE;
+
+static int max_io_size_set(const char *val, const struct kernel_param *kp)
+{
+   int err, ival;
+
+   err = kstrtoint(val, 0, &ival);
+   if (err)
+   return err;
+
+   if (ival < 4096 || ival + MAX_REQ_SIZE > (4096 * 1024) ||
+   (ival + MAX_REQ_SIZE) % 512 != 0) {
+   pr_err("Invalid max io size value %d, has to be"
+  " > %d, < %d\n", ival, 4096, 4194304);
+   return -EINVAL;
+   }
+
+   max_io_size = ival;
+   rcv_buf_size = max_io_size + MAX_REQ_SIZE;
+   pr_info("max io size changed to %d\n", ival);
+
+   return 0;
+}
+
+static const struct kernel_param_ops max_io_size_ops = {
+   .set= max_io_size_set,
+   .get= param_get_int,
+};
+module_param_cb(max_io_size, &max_io_size_ops, &max_io_size, 0444);
+MODULE_PARM_DESC(max_io_size,
+"Max size for each IO request, when change the unit is in byte"
+" (default: " __stringify(DEFAULT_MAX_IO_SIZE_KB) "KB)");
+
+#define DEFAULT_SESS_QUEUE_DEPTH 512
+static int sess_queue_depth = DEFAULT_SESS_QUEUE_DEPTH;
+module_param_named(sess_queue_depth, sess_queue_depth, int, 0444);
+MODULE_PARM_DESC(sess_queue_depth,
+"Number of buffers for pending I/O requests to allocate"
+" per session. Maximum: " __stringify(MAX_SESS_QUEUE_DEPTH)
+" (default: " __stringify(DEFAULT_SESS_QUEUE_DEPTH) ")");
+
+/* We guarantee to serve 10 paths at least */
+#define CHUNK_POOL_SIZE (DEFAULT_SESS_QUEUE_DEPTH * 10)
+static mempool_t *chunk_pool;
+
+static int retry_count = 7;
+
+static int retry_count_set(const char *val, const struct kernel_param *kp)
+{
+   int err, ival;
+
+   err = kstrtoint(val, 0, &ival);
+   if (err)
+   return err;
+
+   if (ival < MIN_RTR_CNT || ival > MAX_RTR_CNT) {
+   pr_err("Invalid retry count value %d, has to be"
+  " > %d, < %d\n", ival, MIN_RTR_CNT, MAX_RTR_CNT);
+   return -EINVAL;
+   }
+
+   retry_count = ival;
+   pr_info("QP retry count changed to %d\n", ival);
+
+   return 0;
+}
+
+static const 

Re: [PATCH 07/24] ibtrs: client: sysfs interface functions

2018-02-05 Thread Sagi Grimberg

Hi Roman,


This is the sysfs interface to IBTRS sessions on client side:

   /sys/kernel/ibtrs_client//
 *** IBTRS session created by ibtrs_clt_open() API call
 |
 |- max_reconnect_attempts
 |  *** number of reconnect attempts for session
 |
 |- add_path
 |  *** adds another connection path into IBTRS session
 |
 |- paths//
*** established paths to server in a session
|
|- disconnect
|  *** disconnect path
|
|- reconnect
|  *** reconnect path
|
|- remove_path
|  *** remove current path
|
|- state
|  *** retrieve current path state
|
|- stats/
   *** current path statistics
   |
  |- cpu_migration
  |- rdma
  |- rdma_lat
  |- reconnects
  |- reset_all
  |- sg_entries
  |- wc_completions

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 


I think stats usually belong in debugfs.
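
A small sketch of what moving the statistics over to debugfs could look like
(the directory layout and the counter fields are hypothetical):

#include <linux/debugfs.h>

static struct dentry *ibtrs_dbg_root;

static int ibtrs_clt_debugfs_init(struct ibtrs_clt_sess *sess)
{
	struct dentry *dir;

	if (!ibtrs_dbg_root)
		ibtrs_dbg_root = debugfs_create_dir("ibtrs_client", NULL);

	dir = debugfs_create_dir(sess->s.sessname, ibtrs_dbg_root);
	if (!dir)
		return -ENOMEM;

	/* sess->stats.* are assumed to be plain u64 counters */
	debugfs_create_u64("reconnects", 0444, dir, &sess->stats.reconnects);
	debugfs_create_u64("rdma_failures", 0444, dir, &sess->stats.rdma_failures);

	return 0;
}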


Re: [PATCH 05/24] ibtrs: client: main functionality

2018-02-05 Thread Sagi Grimberg

Hi Roman,


+static inline void ibtrs_clt_state_lock(void)
+{
+   rcu_read_lock();
+}
+
+static inline void ibtrs_clt_state_unlock(void)
+{
+   rcu_read_unlock();
+}


This looks rather pointless...


+
+#define cmpxchg_min(var, new) ({   \
+   typeof(var) old;\
+   \
+   do {\
+   old = var;  \
+   new = (!old ? new : min_t(typeof(var), old, new));  \
+   } while (cmpxchg(&var, old, new) != old);   \
+})


Why is this sort of thing local to your driver?


+/**
+ * struct ibtrs_fr_pool - pool of fast registration descriptors
+ *
+ * An entry is available for allocation if and only if it occurs in @free_list.
+ *
+ * @size:  Number of descriptors in this pool.
+ * @max_page_list_len: Maximum fast registration work request page list length.
+ * @lock:  Protects free_list.
+ * @free_list: List of free descriptors.
+ * @desc:  Fast registration descriptor pool.
+ */
+struct ibtrs_fr_pool {
+   int size;
+   int max_page_list_len;
+   spinlock_t  lock; /* protects free_list */
+   struct list_headfree_list;
+   struct ibtrs_fr_descdesc[0];
+};


We already have a per-qp fr list implementation, any specific reason to
implement it again?


+static inline struct ibtrs_tag *
+__ibtrs_get_tag(struct ibtrs_clt *clt, enum ibtrs_clt_con_type con_type)
+{
+   size_t max_depth = clt->queue_depth;
+   struct ibtrs_tag *tag;
+   int cpu, bit;
+
+   cpu = get_cpu();
+   do {
+   bit = find_first_zero_bit(clt->tags_map, max_depth);
+   if (unlikely(bit >= max_depth)) {
+   put_cpu();
+   return NULL;
+   }
+
+   } while (unlikely(test_and_set_bit_lock(bit, clt->tags_map)));
+   put_cpu();
+
+   tag = GET_TAG(clt, bit);
+   WARN_ON(tag->mem_id != bit);
+   tag->cpu_id = cpu;
+   tag->con_type = con_type;
+
+   return tag;
+}
+
+static inline void __ibtrs_put_tag(struct ibtrs_clt *clt,
+  struct ibtrs_tag *tag)
+{
+   clear_bit_unlock(tag->mem_id, clt->tags_map);
+}
+
+struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *clt,
+   enum ibtrs_clt_con_type con_type,
+   int can_wait)
+{
+   struct ibtrs_tag *tag;
+   DEFINE_WAIT(wait);
+
+   tag = __ibtrs_get_tag(clt, con_type);
+   if (likely(tag) || !can_wait)
+   return tag;
+
+   do {
+   prepare_to_wait(&clt->tags_wait, &wait, TASK_UNINTERRUPTIBLE);
+   tag = __ibtrs_get_tag(clt, con_type);
+   if (likely(tag))
+   break;
+
+   io_schedule();
+   } while (1);
+
+   finish_wait(&clt->tags_wait, &wait);
+
+   return tag;
+}
+EXPORT_SYMBOL(ibtrs_clt_get_tag);
+
+void ibtrs_clt_put_tag(struct ibtrs_clt *clt, struct ibtrs_tag *tag)
+{
+   if (WARN_ON(!test_bit(tag->mem_id, clt->tags_map)))
+   return;
+
+   __ibtrs_put_tag(clt, tag);
+
+   /*
+* Putting a tag is a barrier, so we will observe
+* new entry in the wait list, no worries.
+*/
+   if (waitqueue_active(&clt->tags_wait))
+   wake_up(&clt->tags_wait);
+}
+EXPORT_SYMBOL(ibtrs_clt_put_tag);


Again, the tags are not clear why they are needed...


+/**
+ * ibtrs_destroy_fr_pool() - free the resources owned by a pool
+ * @pool: Fast registration pool to be destroyed.
+ */
+static void ibtrs_destroy_fr_pool(struct ibtrs_fr_pool *pool)
+{
+   struct ibtrs_fr_desc *d;
+   int i, err;
+
+   if (!pool)
+   return;
+
+   for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
+   if (d->mr) {
+   err = ib_dereg_mr(d->mr);
+   if (err)
+   pr_err("Failed to deregister memory region,"
+  " err: %d\n", err);
+   }
+   }
+   kfree(pool);
+}
+
+/**
+ * ibtrs_create_fr_pool() - allocate and initialize a pool for fast 
registration
+ * @device:IB device to allocate fast registration descriptors for.
+ * @pd:Protection domain associated with the FR descriptors.
+ * @pool_size: Number of descriptors to allocate.
+ * @max_page_list_len: Maximum fast registration work request page list length.
+ */
+static struct ibtrs_fr_pool *ibtrs_create_fr_pool(struct ib_device *device,
+ struct ib_pd *pd,
+ int pool_size,
+ int max_page_list_len)
+{

Re: [PATCH 04/24] ibtrs: client: private header with client structs and functions

2018-02-05 Thread Sagi Grimberg

Hi Roman,



+struct ibtrs_clt_io_req {
+   struct list_headlist;
+   struct ibtrs_iu *iu;
+   struct scatterlist  *sglist; /* list holding user data */
+   unsigned intsg_cnt;
+   unsigned intsg_size;
+   unsigned intdata_len;
+   unsigned intusr_len;
+   void*priv;
+   boolin_use;
+   struct ibtrs_clt_con*con;
+   union {
+   struct ib_pool_fmr  **fmr_list;
+   struct ibtrs_fr_desc**fr_list;
+   };


We are pretty much stuck with fmrs for legacy devices, it has
no future support plans, please don't add new dependencies
on it. Its already hard enough to get rid of it.


+   void*map_page;
+   struct ibtrs_tag*tag;


Can I ask why do you need another tag that is not the request
tag?


+   u16 nmdesc;
+   enum dma_data_direction dir;
+   ibtrs_conf_fn   *conf;
+   unsigned long   start_time;
+};
+



+static inline struct ibtrs_clt_con *to_clt_con(struct ibtrs_con *c)
+{
+   if (unlikely(!c))
+   return NULL;
+
+   return container_of(c, struct ibtrs_clt_con, c);
+}
+
+static inline struct ibtrs_clt_sess *to_clt_sess(struct ibtrs_sess *s)
+{
+   if (unlikely(!s))
+   return NULL;
+
+   return container_of(s, struct ibtrs_clt_sess, s);
+}


Seems a bit awkward that container_of wrappers check pointer validity...


+/**
+ * list_next_or_null_rr - get next list element in round-robin fashion.
+ * @pos: entry, starting cursor.
+ * @head:head of the list to examine. This list must have at least one
+ *   element, namely @pos.
+ * @member:  name of the list_head structure within typeof(*pos).
+ *
+ * Important to understand that @pos is a list entry, which can be already
+ * removed using list_del_rcu(), so if @head has become empty NULL will be
+ * returned. Otherwise next element is returned in round-robin fashion.
+ */
+#define list_next_or_null_rcu_rr(pos, head, member) ({ \
+   typeof(pos) next = NULL;\
+   \
+   if (!list_empty(head))  \
+   next = (pos)->member.next != (head) ?\
+   list_entry_rcu((pos)->member.next,   \
+  typeof(*pos), member) :  \
+   list_entry_rcu((pos)->member.next->next,  \
+  typeof(*pos), member);   \
+   next;   \
+})


Why is this local to your driver?


+
+/* See ibtrs-log.h */
+#define TYPES_TO_SESSNAME(obj) \
+   LIST(CASE(obj, struct ibtrs_clt_sess *, s.sessname),\
+CASE(obj, struct ibtrs_clt *, sessname))
+
+#define TAG_SIZE(clt) (sizeof(struct ibtrs_tag) + (clt)->pdu_sz)
+#define GET_TAG(clt, idx) ((clt)->tags + TAG_SIZE(clt) * idx)


Still don't understand why this is even needed..


Re: [PATCH 03/24] ibtrs: core: lib functions shared between client and server modules

2018-02-05 Thread Sagi Grimberg

Hi Roman,

Here are some comments below.


+int ibtrs_post_recv_empty(struct ibtrs_con *con, struct ib_cqe *cqe)
+{
+   struct ib_recv_wr wr, *bad_wr;
+
+   wr.next= NULL;
+   wr.wr_cqe  = cqe;
+   wr.sg_list = NULL;
+   wr.num_sge = 0;
+
+   return ib_post_recv(con->qp, &wr, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(ibtrs_post_recv_empty);


What is this designed to do?


+int ibtrs_iu_post_rdma_write_imm(struct ibtrs_con *con, struct ibtrs_iu *iu,
+struct ib_sge *sge, unsigned int num_sge,
+u32 rkey, u64 rdma_addr, u32 imm_data,
+enum ib_send_flags flags)
+{
+   struct ib_send_wr *bad_wr;
+   struct ib_rdma_wr wr;
+   int i;
+
+   wr.wr.next= NULL;
+   wr.wr.wr_cqe  = &iu->cqe;
+   wr.wr.sg_list = sge;
+   wr.wr.num_sge = num_sge;
+   wr.rkey   = rkey;
+   wr.remote_addr= rdma_addr;
+   wr.wr.opcode  = IB_WR_RDMA_WRITE_WITH_IMM;
+   wr.wr.ex.imm_data = cpu_to_be32(imm_data);
+   wr.wr.send_flags  = flags;
+
+   /*
+* If one of the sges has 0 size, the operation will fail with an
+* length error
+*/
+   for (i = 0; i < num_sge; i++)
+   if (WARN_ON(sge[i].length == 0))
+   return -EINVAL;
+
+   return ib_post_send(con->qp, &wr.wr, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(ibtrs_iu_post_rdma_write_imm);
+
+int ibtrs_post_rdma_write_imm_empty(struct ibtrs_con *con, struct ib_cqe *cqe,
+   u32 imm_data, enum ib_send_flags flags)
+{
+   struct ib_send_wr wr, *bad_wr;
+
+   memset(&wr, 0, sizeof(wr));
+   wr.wr_cqe   = cqe;
+   wr.send_flags   = flags;
+   wr.opcode   = IB_WR_RDMA_WRITE_WITH_IMM;
+   wr.ex.imm_data  = cpu_to_be32(imm_data);
+
+   return ib_post_send(con->qp, &wr, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(ibtrs_post_rdma_write_imm_empty);


Christoph did a great job adding a generic rdma rw API, please
reuse it, if you rely on needed functionality that does not exist
there, please enhance it instead of open-coding a new rdma engine
library.
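
For reference, a sketch of how the generic R/W API from include/rdma/rw.h is
typically driven for an RDMA WRITE; the req->rw context and the con->c.*
locations are assumptions, and error handling is trimmed:

#include <rdma/rw.h>

static int ibtrs_rdma_write_sketch(struct ibtrs_clt_con *con,
				   struct ibtrs_clt_io_req *req,
				   u64 remote_addr, u32 rkey)
{
	struct ib_qp *qp = con->c.qp;			/* assumed qp location */
	u8 port = con->c.cm_id->port_num;		/* assumed port location */
	int ret;

	ret = rdma_rw_ctx_init(&req->rw, qp, port, req->sglist, req->sg_cnt,
			       0, remote_addr, rkey, DMA_TO_DEVICE);
	if (ret < 0)
		return ret;

	return rdma_rw_ctx_post(&req->rw, qp, port, &req->iu->cqe, NULL);
}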


+static int ibtrs_ib_dev_init(struct ibtrs_ib_dev *d, struct ib_device *dev)
+{
+   int err;
+
+   d->pd = ib_alloc_pd(dev, IB_PD_UNSAFE_GLOBAL_RKEY);
+   if (IS_ERR(d->pd))
+   return PTR_ERR(d->pd);
+   d->dev = dev;
+   d->lkey = d->pd->local_dma_lkey;
+   d->rkey = d->pd->unsafe_global_rkey;
+
+   err = ibtrs_query_device(d);
+   if (unlikely(err))
+   ib_dealloc_pd(d->pd);
+
+   return err;
+}


I must say that this makes me frustrated.. We stopped doing these
sort of things long time ago. No way we can even consider accepting
the unsafe use of the global rkey exposing the entire memory space for
remote access permissions.

Sorry for being blunt, but this protocol design which makes a concious
decision to expose unconditionally is broken by definition.
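
For comparison, a sketch of the usual alternative: a PD allocated without
IB_PD_UNSAFE_GLOBAL_RKEY plus a per-I/O IB_WR_REG_MR registration.  The MR is
assumed to come from ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, ...); invalidation
is omitted:

#include <rdma/ib_verbs.h>

static int ibtrs_reg_mr_sketch(struct ib_qp *qp, struct ib_mr *mr,
			       struct scatterlist *sg, int sg_cnt,
			       struct ib_cqe *reg_cqe)
{
	struct ib_reg_wr reg_wr = {};
	struct ib_send_wr *bad_wr;
	int n;

	n = ib_map_mr_sg(mr, sg, sg_cnt, NULL, PAGE_SIZE);
	if (n < sg_cnt)
		return n < 0 ? n : -EINVAL;

	reg_wr.wr.opcode = IB_WR_REG_MR;
	reg_wr.wr.wr_cqe = reg_cqe;
	reg_wr.mr	 = mr;
	reg_wr.key	 = mr->rkey;
	reg_wr.access	 = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;

	return ib_post_send(qp, &reg_wr.wr, &bad_wr);
}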


+struct ibtrs_ib_dev *ibtrs_ib_dev_find_get(struct rdma_cm_id *cm_id)
+{
+   struct ibtrs_ib_dev *dev;
+   int err;
+
+   mutex_lock(&dev_list_mutex);
+   list_for_each_entry(dev, &dev_list, entry) {
+   if (dev->dev->node_guid == cm_id->device->node_guid &&
+   kref_get_unless_zero(&dev->ref))
+   goto out_unlock;
+   }
+   dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+   if (unlikely(!dev))
+   goto out_err;
+
+   kref_init(&dev->ref);
+   err = ibtrs_ib_dev_init(dev, cm_id->device);
+   if (unlikely(err))
+   goto out_free;
+   list_add(&dev->entry, &dev_list);
+out_unlock:
+   mutex_unlock(&dev_list_mutex);
+
+   return dev;
+
+out_free:
+   kfree(dev);
+out_err:
+   mutex_unlock(&dev_list_mutex);
+
+   return NULL;
+}
+EXPORT_SYMBOL_GPL(ibtrs_ib_dev_find_get);


Is it time to make this a common helper in rdma_cm?

...


+static void schedule_hb(struct ibtrs_sess *sess)
+{
+   queue_delayed_work(sess->hb_wq, &sess->hb_dwork,
+  msecs_to_jiffies(sess->hb_interval_ms));
+}


What does hb stand for?


+void ibtrs_send_hb_ack(struct ibtrs_sess *sess)
+{
+   struct ibtrs_con *usr_con = sess->con[0];
+   u32 imm;
+   int err;
+
+   imm = ibtrs_to_imm(IBTRS_HB_ACK_IMM, 0);
+   err = ibtrs_post_rdma_write_imm_empty(usr_con, sess->hb_cqe,
+ imm, IB_SEND_SIGNALED);
+   if (unlikely(err)) {
+   sess->hb_err_handler(usr_con, err);
+   return;
+   }
+}
+EXPORT_SYMBOL_GPL(ibtrs_send_hb_ack);


What is this?

What is all this hb stuff?


+
+static int ibtrs_str_ipv4_to_sockaddr(const char *addr, size_t len,
+ short port, struct sockaddr *dst)
+{
+   struct sockaddr_in *dst_sin = (struct sockaddr_in *)dst;
+   int ret;
+
+   ret = in4_pton(addr, len, (u8 *)&dst_sin->sin_addr.s_addr,
+  '\0', NULL);
+  

Re: [PATCH 2/5] blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS

2018-02-05 Thread Ming Lei
On Mon, Feb 05, 2018 at 07:54:29AM +0100, Hannes Reinecke wrote:
> On 02/03/2018 05:21 AM, Ming Lei wrote:
> > Quite a few HBAs (such as HPSA, megaraid, mpt3sas, ...) support multiple
> > reply queues, but tags are often HBA wide.
> > 
> > These HBAs have switched to use pci_alloc_irq_vectors(PCI_IRQ_AFFINITY)
> > for automatic affinity assignment.
> > 
> > Now 84676c1f21e8ff5 ("genirq/affinity: assign vectors to all possible CPUs")
> > has been merged into v4.16-rc, and it is easy for some irq vectors to end up
> > with only offline CPUs assigned; this can't be avoided even though the
> > allocation has been improved.
> > 
> > So all these drivers have to avoid asking the HBA to complete requests on a
> > reply queue which has no online CPUs assigned, and HPSA has been broken
> > with v4.15+:
> > 
> > https://marc.info/?l=linux-kernel&m=151748144730409&w=2
> > 
> > This issue can be solved generically and easily via blk-mq (scsi-mq) multiple
> > hw queues, by mapping each reply queue to an hctx, but one tricky thing is
> > the HBA-wide (instead of hw-queue-wide) tags.
> > 
> > This patch is based on the following Hannes's patch:
> > 
> > https://marc.info/?l=linux-block&m=149132580511346&w=2
> > 
> > One big difference from Hannes's patch is that this one only makes the tags
> > sbitmap and the active_queues data structure HBA wide; the others, such as
> > request, hctx and tags, keep their NUMA locality.
> > 
> > The following patch adds global-tags support to null_blk; performance data
> > is also provided, and no obvious performance loss is observed when the whole
> > hw queue depth is the same.
> > 
> > Cc: Hannes Reinecke 
> > Cc: Arun Easi 
> > Cc: Omar Sandoval ,
> > Cc: "Martin K. Petersen" ,
> > Cc: James Bottomley ,
> > Cc: Christoph Hellwig ,
> > Cc: Don Brace 
> > Cc: Kashyap Desai 
> > Cc: Peter Rivera 
> > Cc: Laurence Oberman 
> > Cc: Mike Snitzer 
> > Signed-off-by: Ming Lei 
> > ---
> >  block/blk-mq-debugfs.c |  1 +
> >  block/blk-mq-sched.c   |  2 +-
> >  block/blk-mq-tag.c | 23 ++-
> >  block/blk-mq-tag.h |  5 -
> >  block/blk-mq.c | 29 -
> >  block/blk-mq.h |  3 ++-
> >  include/linux/blk-mq.h |  2 ++
> >  7 files changed, 52 insertions(+), 13 deletions(-)
> > 
> > diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> > index 0dfafa4b655a..0f0fafe03f5d 100644
> > --- a/block/blk-mq-debugfs.c
> > +++ b/block/blk-mq-debugfs.c
> > @@ -206,6 +206,7 @@ static const char *const hctx_flag_name[] = {
> > HCTX_FLAG_NAME(SHOULD_MERGE),
> > HCTX_FLAG_NAME(TAG_SHARED),
> > HCTX_FLAG_NAME(SG_MERGE),
> > +   HCTX_FLAG_NAME(GLOBAL_TAGS),
> > HCTX_FLAG_NAME(BLOCKING),
> > HCTX_FLAG_NAME(NO_SCHED),
> >  };
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index 55c0a745b427..191d4bc95f0e 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -495,7 +495,7 @@ static int blk_mq_sched_alloc_tags(struct request_queue 
> > *q,
> > int ret;
> >  
> > hctx->sched_tags = blk_mq_alloc_rq_map(set, hctx_idx, q->nr_requests,
> > -  set->reserved_tags);
> > +  set->reserved_tags, false);
> > if (!hctx->sched_tags)
> > return -ENOMEM;
> >  
> > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > index 571797dc36cb..66377d09eaeb 100644
> > --- a/block/blk-mq-tag.c
> > +++ b/block/blk-mq-tag.c
> > @@ -379,9 +379,11 @@ static struct blk_mq_tags 
> > *blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
> > return NULL;
> >  }
> >  
> > -struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
> > +struct blk_mq_tags *blk_mq_init_tags(struct blk_mq_tag_set *set,
> > +unsigned int total_tags,
> >  unsigned int reserved_tags,
> > -int node, int alloc_policy)
> > +int node, int alloc_policy,
> > +bool global_tag)
> >  {
> > struct blk_mq_tags *tags;
> >  
> > @@ -397,6 +399,14 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int 
> > total_tags,
> > tags->nr_tags = total_tags;
> > tags->nr_reserved_tags = reserved_tags;
> >  
> > +   WARN_ON(global_tag && !set->global_tags);
> > +   if (global_tag && set->global_tags) {
> > +   tags->bitmap_tags = set->global_tags->bitmap_tags;
> > +   tags->breserved_tags = set->global_tags->breserved_tags;
> > +   tags->active_queues = set->global_tags->active_queues;
> > +   tags->global_tags = true;
> > +   return tags;
> > +   }
> > tags->bitmap_tags = &tags->__bitmap_tags;
> > 

Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-05 Thread Ming Lei
On Mon, Feb 05, 2018 at 07:58:29AM +0100, Hannes Reinecke wrote:
> On 02/03/2018 05:21 AM, Ming Lei wrote:
> > Hi All,
> > 
> > This patchset supports global tags which was started by Hannes originally:
> > 
> > https://marc.info/?l=linux-block&m=149132580511346&w=2
> > 
> > Also introduce 'force_blk_mq' in 'struct scsi_host_template', so that a
> > driver can avoid supporting two IO paths (legacy and blk-mq), especially
> > since recent discussion mentioned that SCSI_MQ will be enabled by default soon.
> > 
> > https://marc.info/?l=linux-scsi&m=151727684915589&w=2
> > 
> > With the above two changes, it should be easier to convert SCSI drivers'
> > reply queues into blk-mq hctxs; then the automatic irq affinity issue can
> > be solved easily, please see the detailed description in the commit log.
> > 
> > Also, drivers may require that requests be completed on the submission CPU
> > to avoid hard/soft deadlocks, which can be done easily with blk-mq too.
> > 
> > https://marc.info/?t=15160185141&r=1&w=2
> > 
> > The final patch uses the introduced 'force_blk_mq' to fix virtio_scsi
> > so that the IO hang issue can be avoided in the legacy IO path; this issue is
> > a bit generic, as at least HPSA and virtio-scsi are found broken with v4.15+.
> > 
> > Thanks
> > Ming
> > 
> > Ming Lei (5):
> >   blk-mq: tags: define several fields of tags as pointer
> >   blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS
> >   block: null_blk: introduce module parameter of 'g_global_tags'
> >   scsi: introduce force_blk_mq
> >   scsi: virtio_scsi: fix IO hang by irq vector automatic affinity
> > 
> >  block/bfq-iosched.c|  4 +--
> >  block/blk-mq-debugfs.c | 11 
> >  block/blk-mq-sched.c   |  2 +-
> >  block/blk-mq-tag.c | 67 
> > +-
> >  block/blk-mq-tag.h | 15 ---
> >  block/blk-mq.c | 31 -
> >  block/blk-mq.h |  3 ++-
> >  block/kyber-iosched.c  |  2 +-
> >  drivers/block/null_blk.c   |  6 +
> >  drivers/scsi/hosts.c   |  1 +
> >  drivers/scsi/virtio_scsi.c | 59 +++-
> >  include/linux/blk-mq.h |  2 ++
> >  include/scsi/scsi_host.h   |  3 +++
> >  13 files changed, 105 insertions(+), 101 deletions(-)
> > 
> Thanks Ming for picking this up.
> 
> I'll give it a shot and see how it behaves on other hardware.

Hi Hannes,

Thanks for looking at it.

I am working on V2, which fixes some issues and adds your patch
'scsi: Add template flag 'host_tagset'', but it causes an HPSA kernel
oops. Once that is fixed, I will post V2; then there will be a real
example of how to use global tags for converting reply queues to
blk-mq hctxs.

https://github.com/ming1/linux/commits/v4.15-for-next-global-tags-V2

Thanks,
Ming


Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-05 Thread Ming Lei
Hi Kashyap,

On Mon, Feb 05, 2018 at 12:35:13PM +0530, Kashyap Desai wrote:
> > -Original Message-
> > From: Hannes Reinecke [mailto:h...@suse.de]
> > Sent: Monday, February 5, 2018 12:28 PM
> > To: Ming Lei; Jens Axboe; linux-block@vger.kernel.org; Christoph Hellwig;
> > Mike Snitzer
> > Cc: linux-s...@vger.kernel.org; Arun Easi; Omar Sandoval; Martin K .
> > Petersen;
> > James Bottomley; Christoph Hellwig; Don Brace; Kashyap Desai; Peter
> > Rivera;
> > Paolo Bonzini; Laurence Oberman
> > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> > force_blk_mq
> >
> > On 02/03/2018 05:21 AM, Ming Lei wrote:
> > > Hi All,
> > >
> > > This patchset supports global tags which was started by Hannes
> > > originally:
> > >
> > >   https://marc.info/?l=linux-block&m=149132580511346&w=2
> > >
> > > Also introduce 'force_blk_mq' in 'struct scsi_host_template', so that a
> > > driver can avoid supporting two IO paths (legacy and blk-mq), especially
> > > since recent discussion mentioned that SCSI_MQ will be enabled by default
> > > soon.
> > >
> > >   https://marc.info/?l=linux-scsi&m=151727684915589&w=2
> > >
> > > With the above two changes, it should be easier to convert SCSI drivers'
> > > reply queues into blk-mq hctxs; then the automatic irq affinity issue
> > > can be solved easily, please see the detailed description in the commit log.
> > >
> > > Also, drivers may require that requests be completed on the submission CPU
> > > to avoid hard/soft deadlocks, which can be done easily with blk-mq too.
> > >
> > >   https://marc.info/?t=15160185141&r=1&w=2
> > >
> > > The final patch uses the introduced 'force_blk_mq' to fix virtio_scsi
> > > so that the IO hang issue can be avoided in the legacy IO path; this issue
> > > is a bit generic, as at least HPSA and virtio-scsi are found broken with
> > > v4.15+.
> > >
> > > Thanks
> > > Ming
> > >
> > > Ming Lei (5):
> > >   blk-mq: tags: define several fields of tags as pointer
> > >   blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS
> > >   block: null_blk: introduce module parameter of 'g_global_tags'
> > >   scsi: introduce force_blk_mq
> > >   scsi: virtio_scsi: fix IO hang by irq vector automatic affinity
> > >
> > >  block/bfq-iosched.c|  4 +--
> > >  block/blk-mq-debugfs.c | 11 
> > >  block/blk-mq-sched.c   |  2 +-
> > >  block/blk-mq-tag.c | 67
> > > +-
> > >  block/blk-mq-tag.h | 15 ---
> > >  block/blk-mq.c | 31 -
> > >  block/blk-mq.h |  3 ++-
> > >  block/kyber-iosched.c  |  2 +-
> > >  drivers/block/null_blk.c   |  6 +
> > >  drivers/scsi/hosts.c   |  1 +
> > >  drivers/scsi/virtio_scsi.c | 59
> > > +++-
> > >  include/linux/blk-mq.h |  2 ++
> > >  include/scsi/scsi_host.h   |  3 +++
> > >  13 files changed, 105 insertions(+), 101 deletions(-)
> > >
> > Thanks Ming for picking this up.
> >
> > I'll give it a shot and see how it behaves on other hardware.
> 
> Ming -
> 
> There is no way we can enable global tags from SCSI stack in this patch

This has been done in V2 of this patchset, which will be posted once
the HPSA issue is fixed:

https://github.com/ming1/linux/commits/v4.15-for-next-global-tags-V2

Global tags will be enabled easily via .host_tagset of scsi_host_template.
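To make that concrete, here is a hedged sketch of what a converted driver's
template could look like. Note that .host_tagset comes from Hannes' proposed
patch and .force_blk_mq from this series, so neither field exists upstream at
this point, and all example_* names are placeholders:

#include <linux/module.h>
#include <linux/blk-mq-pci.h>
#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>

struct example_hba {                    /* placeholder per-host data */
        struct pci_dev *pdev;
};

static int example_queuecommand(struct Scsi_Host *shost,
                                struct scsi_cmnd *cmd)
{
        /* A real driver would build and fire the HBA command here. */
        cmd->result = DID_NO_CONNECT << 16;
        cmd->scsi_done(cmd);
        return 0;
}

static int example_map_queues(struct Scsi_Host *shost)
{
        struct example_hba *hba = shost_priv(shost);

        /* Reuse the managed MSI-X affinity for the hctx/reply-queue map. */
        return blk_mq_pci_map_queues(&shost->tag_set, hba->pdev);
}

static struct scsi_host_template example_sht = {
        .module         = THIS_MODULE,
        .name           = "example_hba",
        .queuecommand   = example_queuecommand,
        .map_queues     = example_map_queues,
        .can_queue      = 1024, /* HBA-wide queue depth shared by all hctxs */
        .host_tagset    = 1,    /* proposed: one tag set for the whole host */
        .force_blk_mq   = 1,    /* proposed here: never use the legacy path */
};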

> series.  I still think we have no solution for the issue described below in
> this patch series.
> https://marc.info/?t=15160185141&r=1&w=2
> 
> What we will be doing is just use global tag HBA wide instead of h/w queue
> based.

Right, that is exactly what the first three patches are doing.

> We still have more than one reply queue ending up completing on one CPU.

pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) has to be used; that means
smp_affinity_enable has to be set to 1, but that seems to be the default
setting anyway.

Please see kernel/irq/affinity.c, especially irq_calc_affinity_vectors(),
which figures out an optimal number of vectors; the computation is now
based on cpumask_weight(cpu_possible_mask). If a reply queue is mapped
only to offline CPUs, that queue simply won't be active (no requests are
submitted to it). The PCI_IRQ_AFFINITY mechanism basically makes sure
that no two irq vectors are handled by the same CPU, and the irq vector
spread is done in irq_create_affinity_masks().
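As a concrete (hedged) illustration of the allocation side, a driver would
typically ask for one managed vector per reply queue and size its hw queues
from whatever the irq core grants; the names below are placeholders:

#include <linux/pci.h>
#include <linux/cpumask.h>

/*
 * Sketch only: allocate managed-affinity MSI-X vectors for the reply
 * queues.  irq_create_affinity_masks() spreads them over all possible
 * CPUs, so the driver must not assume every vector has an online CPU.
 */
static int example_setup_reply_queues(struct pci_dev *pdev,
                                      unsigned int max_reply_queues)
{
        int nvec;

        nvec = pci_alloc_irq_vectors(pdev, 1, max_reply_queues,
                                     PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
        if (nvec < 0)
                return nvec;

        /*
         * nvec is what irq_calc_affinity_vectors() decided to grant; use
         * it as nr_hw_queues so hctx <-> reply queue <-> vector stays 1:1.
         */
        pr_info("example_hba: %d reply queues (possible CPUs %u, online %u)\n",
                nvec, num_possible_cpus(), num_online_cpus());

        return nvec;
}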

> Try to reduce the MSI-x vector count of the megaraid_sas or mpt3sas driver
> via a module parameter to simulate the issue. We need more online CPUs than
> reply queues.

IMO, you don't need to simulate the issue; pci_alloc_irq_vectors(
PCI_IRQ_AFFINITY) will handle that for you. You can dump the returned
irq vector number, num_possible_cpus()/num_online_cpus(), and each
irq vector's affinity assignment.

> We may see completions redirected to the original CPU because of
> "QUEUE_FLAG_SAME_FORCE", but the ISR of the low-level driver can keep one
> CPU busy in its local ISR routine.

Could you dump each irq vector's affinity assignment of your 

Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Jinpu Wang
Hi Bart,

My another 2 cents:)
On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche  wrote:
> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>> o Simple configuration of IBNBD:
>>- Server side is completely passive: volumes do not need to be
>>  explicitly exported.
>
> That sounds like a security hole? I think the ability to configure whether or
> not an initiator is allowed to log in is essential and also which volumes an
> initiator has access to.
Our design targets a well-controlled production environment, so
security is handled in another layer.
On the server side, the admin can set the dev_search_path module
parameter to choose the parent directory; this is concatenated with the
path the client sends in the open message to open a block device.
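Purely as a sketch of what that concatenation amounts to on the server side
(hypothetical helper, not the actual IBNBD code; a real implementation also
has to reject path escapes such as "..", which is part of Bart's concern):

#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/string.h>

/*
 * Sketch only: build the device path to open from the admin-configured
 * parent directory and the name the client sent in the open message.
 * Returns a kmalloc'ed path, or NULL on invalid input or allocation
 * failure.
 */
static char *example_resolve_dev_path(const char *dev_search_path,
                                      const char *client_dev_name)
{
        if (strstr(client_dev_name, "..") || client_dev_name[0] == '/')
                return NULL;

        return kasprintf(GFP_KERNEL, "%s/%s", dev_search_path,
                         client_dev_name);
}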


>
>>- Only IB port GID and device path needed on client side to map
>>  a block device.
>
> I think IP addressing is preferred over GID addressing in RoCE networks.
> Additionally, have you noticed that GUID configuration support has been added
> to the upstream ib_srpt driver? Using GIDs has a very important disadvantage,
> namely that at least in IB networks the prefix will change if the subnet
> manager is reconfigured. Additionally, in IB networks it may happen that the
> target driver is loaded and configured before the GID has been assigned to
> all RDMA ports.
>
> Thanks,
>
> Bart.

Sorry, the above description is not accurate: IBNBD/IBTRS supports
GID/IPv4/IPv6 addressing.
We will adjust this in the next post.

Thanks,
-- 
Jack Wang
Linux Kernel Developer


Re: blkdev loop UAF

2018-02-05 Thread Jan Kara
On Thu 01-02-18 19:58:45, Eric Biggers wrote:
> On Thu, Jan 11, 2018 at 06:00:08PM +0100, Jan Kara wrote:
> > On Thu 11-01-18 19:22:39, Hou Tao wrote:
> > > Hi,
> > > 
> > > On 2018/1/11 16:24, Dan Carpenter wrote:
> > > > Thanks for your report and the patch.  I am sending it to the
> > > > linux-block devs since it's already public.
> > > > 
> > > > regards,
> > > > dan carpenter
> > > 
> > > The use-after-free problem is not specific to the loop device; it can also
> > > be reproduced on scsi devices, and there are more race problems caused by
> > > the race between bdev open and gendisk shutdown [1].
> > > 
> > > The cause of the UAF problem is that there are two instances of gendisk
> > > which share the same bdev. After the process owning the new gendisk
> > > increases bdev->bd_openers, the other process, which owns the older
> > > gendisk, will find bdev->bd_openers is not zero and will put the last
> > > reference of the older gendisk, causing the use-after-free.
> > > 
> > > I had proposed a patch for the problem, but it's still an incomplete fix
> > > for the race between gendisk shutdown and bdev opening.
> > > 
> > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > index 4a181fc..5ecdb9f 100644
> > > --- a/fs/block_dev.c
> > > +++ b/fs/block_dev.c
> > > @@ -1510,6 +1510,11 @@ static int __blkdev_get(struct block_device *bdev, 
> > > fmode_t mode, int for_part)
> > > if (bdev->bd_bdi == &noop_backing_dev_info)
> > > bdev->bd_bdi = 
> > > bdi_get(disk->queue->backing_dev_info);
> > > } else {
> > > +   if (bdev->bd_disk != disk) {
> > > +   ret = -ENXIO;
> > > +   goto out_unlock_bdev;
> > > +   }
> > > +
> > > if (bdev->bd_contains == bdev) {
> > > ret = 0;
> > > if (bdev->bd_disk->fops->open)
> > > 
> > > 
> > > As far as I know, Jan Kara is working on these problems. So, Jan, any 
> > > suggestions ?
> > 
> > Yeah, thanks for the ping. I was working today to get this fixed. I have
> > about 6 patches (attached) which should fix all the outstanding issues I'm
> > aware of.  So far they are only boot tested. I will give them more
> > stress-testing next week and then post them officially but if you have time,
> > feel free to try them out as well. Thanks!
> > 
> 
> Jan, are you still planning to fix this?  This was also reported by syzbot 
> here:
> https://groups.google.com/d/msg/syzkaller-bugs/9wXx4YULITQ/0SK8vqxhAgAJ.  Note
> that the C reproducer attached to that report doesn't work for me anymore
> following ae6650163c6 ("loop: fix concurrent lo_open/lo_release").  However,
> syzbot has still hit this on more recent kernels.  Here's a C reproducer that
> still works for me on Linus' tree as of today, though it can take 5-10 
> seconds:

Yes, I do plan to post the patches. I was just struggling with getting the
original failure reproduced (plus I had quite some other work, vacation) so
I couldn't really test whether my patches fix the reported problem. So
thanks for your reproducer, I'll try it out and see whether my patches fix
the problem.

Honza

> #include <fcntl.h>
> #include <linux/loop.h>
> #include <sys/ioctl.h>
> #include <unistd.h>
> 
> int main()
> {
>   int fd = open("/dev/loop-control", 0);
> 
>   if (!fork()) {
>   for (;;)
>   ioctl(fd, LOOP_CTL_ADD, 0);
>   }
> 
>   if (!fork()) {
>   for (;;)
>   ioctl(fd, LOOP_CTL_REMOVE, 0);
>   }
> 
>   fork();
>   for (;;) {
>   fd = open("/dev/loop0", O_EXCL);
>   close(fd);
>   }
> }
> 
> This is the KASAN splat I get from the above:
> 
> BUG: KASAN: use-after-free in disk_unblock_events+0x3e/0x40 block/genhd.c:1672
> Read of size 8 at addr 880030743670 by task systemd-udevd/165
> 
> CPU: 1 PID: 165 Comm: systemd-udevd Not tainted 4.15.0-08279-gfe53d1443a146 
> #38
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.11.0-20171110_100015-anatol 04/01/2014
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0xb3/0x145 lib/dump_stack.c:53
> perf: interrupt took too long (7334 > 4816), lowering 
> kernel.perf_event_max_sample_rate to 27000
>  print_address_description+0x73/0x290 mm/kasan/report.c:252
>  kasan_report_error mm/kasan/report.c:351 [inline]
>  kasan_report+0x235/0x350 mm/kasan/report.c:409
>  disk_unblock_events+0x3e/0x40 block/genhd.c:1672
>  __blkdev_get+0x41b/0xd30 fs/block_dev.c:1535
>  blkdev_get+0x283/0x980 fs/block_dev.c:1591
>  do_dentry_open.isra.1+0x6c0/0xaf0 fs/open.c:752
>  do_last fs/namei.c:3378 [inline]
>  path_openat+0x587/0x28b0 fs/namei.c:3518
>  do_filp_open+0x231/0x3a0 fs/namei.c:3553
>  do_sys_open+0x399/0x610 fs/open.c:1059
>  do_syscall_64+0x213/0x740 arch/x86/entry/common.c:285
>  entry_SYSCALL64_slow_path+0x25/0x25
> RIP: 

Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Jinpu Wang
On Fri, Feb 2, 2018 at 5:40 PM, Doug Ledford  wrote:
> On Fri, 2018-02-02 at 16:07 +, Bart Van Assche wrote:
>> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>> > Since the first version the following was changed:
>> >
>> >- Load-balancing and IO fail-over using multipath features were added.
>> >- Major parts of the code were rewritten, simplified and overall code
>> >  size was reduced by a quarter.
>>
>> That is interesting to know, but what happened to the feedback that Sagi and
>> I provided on v1? Has that feedback been addressed? See also
>> https://www.spinics.net/lists/linux-rdma/msg47819.html and
>> https://www.spinics.net/lists/linux-rdma/msg47879.html.
>>
>> Regarding multipath support: there are already two multipath implementations
>> upstream (dm-mpath and the multipath implementation in the NVMe initiator).
>> I'm not sure we want a third multipath implementation in the Linux kernel.
>
> There's more than that.  There was also md-multipath, and smc-r includes
> another version of multipath, plus I assume we support mptcp as well.
>
> But, to be fair, the different multipaths in this list serve different
> purposes and I'm not sure they could all be generalized out and served
> by a single multipath code.  Although, fortunately, md-multipath is
> deprecated, so no need to worry about it, and it is only dm-multipath
> and nvme multipath that deal directly with block devices and assume
> block semantics.  If I read the cover letter right (and I haven't dug
> into the code to confirm this), the ibtrs multipath has much more in
> common with smc-r multipath, where it doesn't really assume a block
> layer device sits on top of it, it's more of a pure network multipath,
> which the implementation of smc-r is and mptcp would be too.  I would
> like to see a core RDMA multipath implementation soon that would
> abstract out some of these multipath tasks, at least across RDMA links,
> and that didn't have the current limitations (smc-r only supports RoCE
> links, and it sounds like ibtrs only supports IB like links, but maybe
> I'm wrong there, I haven't looked at the patches yet).
Hi Doug, hi Bart,

Thanks for your valuable input, here is my 2 cents:

IBTRS multipath is indeed a network multipath, with a sysfs interface
to add and remove paths dynamically.
IBTRS is built on rdma-cm, so it is expected to support RoCE and iWARP,
but we have mainly tested it in an IB environment and only lightly
tested it on RXE.


Regards,
-- 
Jack Wang
Linux Kernel Developer