[PATCH v2 2/2] block/cfq: Fix memory leak without CFQ_GROUP_IOSCHED

2017-10-12 Thread Jeffy Chen
Currently we only unref the async cfqqs in cfq_pd_offline(), which is
never called when CONFIG_CFQ_GROUP_IOSCHED is disabled.

Kmemleak reported:
unreferenced object 0xffc0cd9fc000 (size 240):
  comm "kworker/3:1", pid 52, jiffies 4294673527 (age 97.149s)
  hex dump (first 32 bytes):
01 00 00 00 00 00 00 00 80 55 13 cf c0 ff ff ff  .U..
10 c0 9f cd c0 ff ff ff 00 00 00 00 00 00 00 00  
  backtrace:
[] kmemleak_alloc+0x58/0x8c
[] kmem_cache_alloc+0x184/0x268
[] cfq_get_queue.isra.11+0x144/0x2e0
[] cfq_set_request+0x1bc/0x444
[] elv_set_request+0x88/0x9c
[] get_request+0x494/0x914
[] blk_get_request+0xdc/0x160
[] scsi_execute+0x70/0x23c
[] scsi_test_unit_ready+0xf4/0x1ec

Fixes: 60a837077e2b ("cfq-iosched: charge async IOs to the appropriate blkcg's instead of the root")
Signed-off-by: Jeffy Chen 
---

Changes in v2:
Rewrite commit message and rename api.

 block/cfq-iosched.c | 28 ++--
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 9f342ef1ad42..d182e81a07af 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -401,6 +401,7 @@ struct cfq_data {
 
 static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
 static void cfq_put_queue(struct cfq_queue *cfqq);
+static void cfq_put_async_queues(struct cfq_group *cfqg);
 
 static struct cfq_rb_root *st_for(struct cfq_group *cfqg,
enum wl_class_t class,
@@ -1638,17 +1639,8 @@ static void cfq_pd_init(struct blkg_policy_data *pd)
 static void cfq_pd_offline(struct blkg_policy_data *pd)
 {
struct cfq_group *cfqg = pd_to_cfqg(pd);
-   int i;
-
-   for (i = 0; i < IOPRIO_BE_NR; i++) {
-   if (cfqg->async_cfqq[0][i])
-   cfq_put_queue(cfqg->async_cfqq[0][i]);
-   if (cfqg->async_cfqq[1][i])
-   cfq_put_queue(cfqg->async_cfqq[1][i]);
-   }
 
-   if (cfqg->async_idle_cfqq)
-   cfq_put_queue(cfqg->async_idle_cfqq);
+   cfq_put_async_queues(cfqg);
 
/*
 * @blkg is going offline and will be ignored by
@@ -3741,6 +3733,21 @@ static void cfq_init_cfqq(struct cfq_data *cfqd, struct 
cfq_queue *cfqq,
cfqq->pid = pid;
 }
 
+static void cfq_put_async_queues(struct cfq_group *cfqg)
+{
+   int i;
+
+   for (i = 0; i < IOPRIO_BE_NR; i++) {
+   if (cfqg->async_cfqq[0][i])
+   cfq_put_queue(cfqg->async_cfqq[0][i]);
+   if (cfqg->async_cfqq[1][i])
+   cfq_put_queue(cfqg->async_cfqq[1][i]);
+   }
+
+   if (cfqg->async_idle_cfqq)
+   cfq_put_queue(cfqg->async_idle_cfqq);
+}
+
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
 static void check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
 {
@@ -4564,6 +4571,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
	blkcg_deactivate_policy(q, &blkcg_policy_cfq);
 #else
+   cfq_put_async_queues(cfqd->root_group);
kfree(cfqd->root_group);
 #endif
kfree(cfqd);
-- 
2.11.0




Re: [PATCH V7 5/6] blk-mq-sched: improve dispatching from sw queue

2017-10-12 Thread Ming Lei
On Thu, Oct 12, 2017 at 11:48:22AM -0700, Bart Van Assche wrote:
> On 10/12/17 11:37, Ming Lei wrote:
> > [ ... ]
> > 
> > This patch improves dispatching from the sw queue by using the
> > .get_budget and .put_budget callbacks.
> > 
> > Reviewed-by: Omar Sandoval 
> > Reviewed-by: Bart Van Assche 
> > Signed-off-by: Ming Lei 
> > ---
> > 
> > Jens/Omar, once .get_budget is introduced, it should be unusual to
> > return busy from .queue_rq, so we can't use the previously discussed
> > approach of deciding whether to take requests one by one or to flush
> > them all. This patch therefore simply checks whether .get_budget is
> > implemented, which looks reasonable; in the future it is even possible
> > to take more requests by getting enough budget first.
> > [ ... ]
> 
> Version 7 of this patch differs significantly from the version I reviewed so
> I think you should remove my Reviewed-by tag.

OK, no problem. The only difference is in the newly introduced .get_budget().

-- 
Ming


Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops

2017-10-12 Thread Ming Lei
On Thu, Oct 12, 2017 at 12:46:24PM -0600, Jens Axboe wrote:
> On 10/12/2017 12:37 PM, Ming Lei wrote:
> > For SCSI devices, there is often a per-request-queue depth, which needs
> > to be respected before queuing one request.
> > 
> > The current blk-mq always dequeues one request first, then calls .queue_rq()
> > to dispatch the request to the LLD. One obvious issue with this approach is
> > that I/O merging may not be good, because when the per-request-queue depth
> > can't be respected, .queue_rq() has to return BLK_STS_RESOURCE, and this
> > request has to stay in the hctx->dispatch list, never getting a chance to
> > participate in I/O merging.
> > 
> > This patch introduces the .get_budget and .put_budget callbacks in
> > blk_mq_ops, so we can try to get a reserved budget first before dequeuing a
> > request. Once we can't get budget for queuing I/O, we don't need to dequeue
> > the request at all, and I/O merging can be improved a lot.
> 
> I can't help but think that it would be cleaner to just be able to
> reinsert the request into the scheduler properly, if we fail to
> dispatch it. Bart hinted at that earlier as well.

Actually when I started to investigate the issue, the first thing I tried
was reinserting, but that approach is even worse on qla2xxx.

Once a request is dequeued, the chance of an I/O merge drops a lot.
With the none scheduler, merging becomes impossible because
we only try to merge against the last 8 requests. With mq-deadline,
when one request is reinserted, another request may be dequeued
at the same time.

Not to mention the cost of acquiring/releasing the lock; all of that
is just useless work and wasted CPU.

> 
> With that, you would not need budget functions.
> 
> I don't feel _that_ strongly about it, though, and it is something
> we can do down the line as well. I'd just hate having to introduce
> budget ops only to kill them a little later.
> 
> If we stick with the budget, then add a parallel blk_mq_get_budget()
> like you have a blk_mq_put_budget(), so we don't have to litter the code
> with things like:
> 
> > +   if (q->mq_ops->get_budget && !q->mq_ops->get_budget(hctx))
> > +   return true;
> 
> all over the place. There are also a few places where you don't use
> blk_mq_put_budget() but open-code it instead, you should be consistent.

OK.


-- 
Ming


Re: [PATCH 1/8] lib: Introduce sgl_alloc() and sgl_free()

2017-10-12 Thread Bart Van Assche
On Thu, 2017-10-12 at 16:52 -0600, Jens Axboe wrote:
> On 10/12/2017 04:45 PM, Bart Van Assche wrote:
> > +++ b/include/linux/sgl_alloc.h
> > @@ -0,0 +1,16 @@
> > +#ifndef _SGL_ALLOC_H_
> > +#define _SGL_ALLOC_H_
> > +
> > +#include  /* bool, gfp_t */
> > +
> > +struct scatterlist;
> > +
> > +struct scatterlist *sgl_alloc_order(unsigned long long length,
> > +   unsigned int order, unsigned int *nent_p,
> > +   gfp_t gfp, bool chainable);
> > +struct scatterlist *sgl_alloc(unsigned long long length, unsigned int 
> > *nent_p,
> > + gfp_t gfp);
> > +void sgl_free_order(struct scatterlist *sgl, int order);
> > +void sgl_free(struct scatterlist *sgl);
> > +
> > +#endif /* _SGL_ALLOC_H_ */
> 
> Should this just go into linux/scatterlist.h instead of creating a new header
> file?

That's something I can change. I don't have a strong opinion about whether
these declarations should go into a new header file or into linux/scatterlist.h.

Bart.

Re: [PATCH 1/8] lib: Introduce sgl_alloc() and sgl_free()

2017-10-12 Thread Jens Axboe
On 10/12/2017 04:45 PM, Bart Van Assche wrote:
> Many kernel drivers contain code that allocate and free both a
> scatterlist and the pages that populate that scatterlist.
> Introduce functions under lib/ that perform these tasks instead
> of duplicating this functionality in multiple drivers.
> 
> Signed-off-by: Bart Van Assche 
> ---
>  include/linux/sgl_alloc.h |  16 
>  lib/Kconfig   |   4 ++
>  lib/Makefile  |   1 +
>  lib/sgl_alloc.c   | 102 
> ++
>  4 files changed, 123 insertions(+)
>  create mode 100644 include/linux/sgl_alloc.h
>  create mode 100644 lib/sgl_alloc.c
> 
> diff --git a/include/linux/sgl_alloc.h b/include/linux/sgl_alloc.h
> new file mode 100644
> index ..a0f719690c9e
> --- /dev/null
> +++ b/include/linux/sgl_alloc.h
> @@ -0,0 +1,16 @@
> +#ifndef _SGL_ALLOC_H_
> +#define _SGL_ALLOC_H_
> +
> +#include  /* bool, gfp_t */
> +
> +struct scatterlist;
> +
> +struct scatterlist *sgl_alloc_order(unsigned long long length,
> + unsigned int order, unsigned int *nent_p,
> + gfp_t gfp, bool chainable);
> +struct scatterlist *sgl_alloc(unsigned long long length, unsigned int 
> *nent_p,
> +   gfp_t gfp);
> +void sgl_free_order(struct scatterlist *sgl, int order);
> +void sgl_free(struct scatterlist *sgl);
> +
> +#endif /* _SGL_ALLOC_H_ */

Should this just go into linux/scatterlist.h instead of creating a new header
file?

-- 
Jens Axboe



[PATCH 3/8] nvmet/fc: Use sgl_alloc() and sgl_free()

2017-10-12 Thread Bart Van Assche
Use the sgl_alloc() and sgl_free() functions instead of open coding
these functions.

Signed-off-by: Bart Van Assche 
Cc: Keith Busch 
Cc: Christoph Hellwig 
Cc: James Smart 
Cc: Johannes Thumshirn 
Cc: Sagi Grimberg 
---
 drivers/nvme/target/Kconfig |  1 +
 drivers/nvme/target/fc.c| 37 +++--
 2 files changed, 4 insertions(+), 34 deletions(-)

diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig
index 03e4ab65fe77..4d9715630e21 100644
--- a/drivers/nvme/target/Kconfig
+++ b/drivers/nvme/target/Kconfig
@@ -39,6 +39,7 @@ config NVME_TARGET_FC
tristate "NVMe over Fabrics FC target driver"
depends on NVME_TARGET
depends on HAS_DMA
+   select SGL_ALLOC
help
  This enables the NVMe FC target support, which allows exporting NVMe
  devices over FC.
diff --git a/drivers/nvme/target/fc.c b/drivers/nvme/target/fc.c
index 58e010bdda3e..02084539de5b 100644
--- a/drivers/nvme/target/fc.c
+++ b/drivers/nvme/target/fc.c
@@ -16,6 +16,7 @@
  */
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1683,31 +1684,12 @@ static int
 nvmet_fc_alloc_tgt_pgs(struct nvmet_fc_fcp_iod *fod)
 {
struct scatterlist *sg;
-   struct page *page;
unsigned int nent;
-   u32 page_len, length;
-   int i = 0;
 
-   length = fod->total_length;
-   nent = DIV_ROUND_UP(length, PAGE_SIZE);
-   sg = kmalloc_array(nent, sizeof(struct scatterlist), GFP_KERNEL);
+	sg = sgl_alloc(fod->total_length, &nent, GFP_KERNEL);
if (!sg)
goto out;
 
-   sg_init_table(sg, nent);
-
-   while (length) {
-   page_len = min_t(u32, length, PAGE_SIZE);
-
-   page = alloc_page(GFP_KERNEL);
-   if (!page)
-   goto out_free_pages;
-
-		sg_set_page(&sg[i], page, page_len, 0);
-   length -= page_len;
-   i++;
-   }
-
fod->data_sg = sg;
fod->data_sg_cnt = nent;
fod->data_sg_cnt = fc_dma_map_sg(fod->tgtport->dev, sg, nent,
@@ -1717,14 +1699,6 @@ nvmet_fc_alloc_tgt_pgs(struct nvmet_fc_fcp_iod *fod)
 
return 0;
 
-out_free_pages:
-   while (i > 0) {
-   i--;
-		__free_page(sg_page(&sg[i]));
-   }
-   kfree(sg);
-   fod->data_sg = NULL;
-   fod->data_sg_cnt = 0;
 out:
return NVME_SC_INTERNAL;
 }
@@ -1732,18 +1706,13 @@ nvmet_fc_alloc_tgt_pgs(struct nvmet_fc_fcp_iod *fod)
 static void
 nvmet_fc_free_tgt_pgs(struct nvmet_fc_fcp_iod *fod)
 {
-   struct scatterlist *sg;
-   int count;
-
if (!fod->data_sg || !fod->data_sg_cnt)
return;
 
fc_dma_unmap_sg(fod->tgtport->dev, fod->data_sg, fod->data_sg_cnt,
((fod->io_dir == NVMET_FCP_WRITE) ?
DMA_FROM_DEVICE : DMA_TO_DEVICE));
-   for_each_sg(fod->data_sg, sg, fod->data_sg_cnt, count)
-   __free_page(sg_page(sg));
-   kfree(fod->data_sg);
+   sgl_free(fod->data_sg);
fod->data_sg = NULL;
fod->data_sg_cnt = 0;
 }
-- 
2.14.2



[PATCH 2/8] crypto: scompress - use sgl_alloc() and sgl_free()

2017-10-12 Thread Bart Van Assche
Use the sgl_alloc() and sgl_free() functions instead of open coding
these functions.

Signed-off-by: Bart Van Assche 
Cc: Ard Biesheuvel 
Cc: Herbert Xu 
---
 crypto/Kconfig |  1 +
 crypto/scompress.c | 52 +++-
 2 files changed, 4 insertions(+), 49 deletions(-)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 0a121f9ddf8e..a0667dd284ff 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -105,6 +105,7 @@ config CRYPTO_KPP
 config CRYPTO_ACOMP2
tristate
select CRYPTO_ALGAPI2
+   select SGL_ALLOC
 
 config CRYPTO_ACOMP
tristate
diff --git a/crypto/scompress.c b/crypto/scompress.c
index 2075e2c4e7df..ea4d5a3e01d1 100644
--- a/crypto/scompress.c
+++ b/crypto/scompress.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -140,53 +141,6 @@ static int crypto_scomp_init_tfm(struct crypto_tfm *tfm)
return ret;
 }
 
-static void crypto_scomp_sg_free(struct scatterlist *sgl)
-{
-   int i, n;
-   struct page *page;
-
-   if (!sgl)
-   return;
-
-   n = sg_nents(sgl);
-   for_each_sg(sgl, sgl, n, i) {
-   page = sg_page(sgl);
-   if (page)
-   __free_page(page);
-   }
-
-   kfree(sgl);
-}
-
-static struct scatterlist *crypto_scomp_sg_alloc(size_t size, gfp_t gfp)
-{
-   struct scatterlist *sgl;
-   struct page *page;
-   int i, n;
-
-   n = ((size - 1) >> PAGE_SHIFT) + 1;
-
-   sgl = kmalloc_array(n, sizeof(struct scatterlist), gfp);
-   if (!sgl)
-   return NULL;
-
-   sg_init_table(sgl, n);
-
-   for (i = 0; i < n; i++) {
-   page = alloc_page(gfp);
-   if (!page)
-   goto err;
-   sg_set_page(sgl + i, page, PAGE_SIZE, 0);
-   }
-
-   return sgl;
-
-err:
-   sg_mark_end(sgl + i);
-   crypto_scomp_sg_free(sgl);
-   return NULL;
-}
-
 static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir)
 {
struct crypto_acomp *tfm = crypto_acomp_reqtfm(req);
@@ -220,7 +174,7 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir)
 					  scratch_dst, &req->dlen, *ctx);
if (!ret) {
if (!req->dst) {
-   req->dst = crypto_scomp_sg_alloc(req->dlen, GFP_ATOMIC);
+   req->dst = sgl_alloc(req->dlen, NULL, GFP_ATOMIC);
if (!req->dst)
goto out;
}
@@ -274,7 +228,7 @@ int crypto_init_scomp_ops_async(struct crypto_tfm *tfm)
 
crt->compress = scomp_acomp_compress;
crt->decompress = scomp_acomp_decompress;
-   crt->dst_free = crypto_scomp_sg_free;
+   crt->dst_free = sgl_free;
crt->reqsize = sizeof(void *);
 
return 0;
-- 
2.14.2



[PATCH 8/8] scsi/pmcraid: Use sgl_alloc_order() and sgl_free_order()

2017-10-12 Thread Bart Van Assche
Use the sgl_alloc_order() and sgl_free_order() functions instead
of open coding these functions.

Signed-off-by: Bart Van Assche 
Cc: linux-s...@vger.kernel.org
Cc: Martin K. Petersen 
Cc: Anil Ravindranath 
---
 drivers/scsi/Kconfig   |  1 +
 drivers/scsi/pmcraid.c | 44 +---
 drivers/scsi/pmcraid.h |  2 +-
 3 files changed, 7 insertions(+), 40 deletions(-)

diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig
index d11e75e76195..632200ec36a6 100644
--- a/drivers/scsi/Kconfig
+++ b/drivers/scsi/Kconfig
@@ -1576,6 +1576,7 @@ config ZFCP
 config SCSI_PMCRAID
tristate "PMC SIERRA Linux MaxRAID adapter support"
depends on PCI && SCSI && NET
+   select SGL_ALLOC
---help---
  This driver supports the PMC SIERRA MaxRAID adapters.
 
diff --git a/drivers/scsi/pmcraid.c b/drivers/scsi/pmcraid.c
index b4d6cd8cd1ad..5f3ea3f3dc64 100644
--- a/drivers/scsi/pmcraid.c
+++ b/drivers/scsi/pmcraid.c
@@ -46,6 +46,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -3232,12 +3233,7 @@ static int pmcraid_build_ioadl(
  */
 static void pmcraid_free_sglist(struct pmcraid_sglist *sglist)
 {
-   int i;
-
-   for (i = 0; i < sglist->num_sg; i++)
-   __free_pages(sg_page(&(sglist->scatterlist[i])),
-sglist->order);
-
+   sgl_free_order(sglist->scatterlist, sglist->order);
kfree(sglist);
 }
 
@@ -3254,50 +3250,20 @@ static void pmcraid_free_sglist(struct pmcraid_sglist 
*sglist)
 static struct pmcraid_sglist *pmcraid_alloc_sglist(int buflen)
 {
struct pmcraid_sglist *sglist;
-   struct scatterlist *scatterlist;
-   struct page *page;
-   int num_elem, i, j;
int sg_size;
int order;
-   int bsize_elem;
 
sg_size = buflen / (PMCRAID_MAX_IOADLS - 1);
order = (sg_size > 0) ? get_order(sg_size) : 0;
-   bsize_elem = PAGE_SIZE * (1 << order);
-
-   /* Determine the actual number of sg entries needed */
-   if (buflen % bsize_elem)
-   num_elem = (buflen / bsize_elem) + 1;
-   else
-   num_elem = buflen / bsize_elem;
 
/* Allocate a scatter/gather list for the DMA */
-   sglist = kzalloc(sizeof(struct pmcraid_sglist) +
-(sizeof(struct scatterlist) * (num_elem - 1)),
-GFP_KERNEL);
-
+   sglist = kzalloc(sizeof(struct pmcraid_sglist), GFP_KERNEL);
if (sglist == NULL)
return NULL;
 
-   scatterlist = sglist->scatterlist;
-   sg_init_table(scatterlist, num_elem);
sglist->order = order;
-   sglist->num_sg = num_elem;
-   sg_size = buflen;
-
-   for (i = 0; i < num_elem; i++) {
-   page = alloc_pages(GFP_KERNEL|GFP_DMA|__GFP_ZERO, order);
-   if (!page) {
-   for (j = i - 1; j >= 0; j--)
-				__free_pages(sg_page(&scatterlist[j]), order);
-   kfree(sglist);
-   return NULL;
-   }
-
-		sg_set_page(&scatterlist[i], page,
-			    sg_size < bsize_elem ? sg_size : bsize_elem, 0);
-   sg_size -= bsize_elem;
-   }
+	sglist->scatterlist = sgl_alloc_order(buflen, order, &sglist->num_sg,
+				GFP_KERNEL | GFP_DMA | __GFP_ZERO, false);
+	if (!sglist->scatterlist) {
+		kfree(sglist);
+		return NULL;
+	}
 
return sglist;
 }
diff --git a/drivers/scsi/pmcraid.h b/drivers/scsi/pmcraid.h
index 44da91712115..754ef30927e2 100644
--- a/drivers/scsi/pmcraid.h
+++ b/drivers/scsi/pmcraid.h
@@ -542,7 +542,7 @@ struct pmcraid_sglist {
u32 order;
u32 num_sg;
u32 num_dma_sg;
-   struct scatterlist scatterlist[1];
+   struct scatterlist *scatterlist;
 };
 
 /* page D0 inquiry data of focal point resource */
-- 
2.14.2



[PATCH 6/8] scsi/ipr: Use sgl_alloc_order() and sgl_free_order()

2017-10-12 Thread Bart Van Assche
Use the sgl_alloc_order() and sgl_free_order() functions instead
of open coding these functions.

Signed-off-by: Bart Van Assche 
Cc: linux-s...@vger.kernel.org
Cc: Martin K. Petersen 
Cc: Brian King 
---
 drivers/scsi/Kconfig |  1 +
 drivers/scsi/ipr.c   | 50 +-
 drivers/scsi/ipr.h   |  2 +-
 3 files changed, 11 insertions(+), 42 deletions(-)

diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig
index 41366339b950..d11e75e76195 100644
--- a/drivers/scsi/Kconfig
+++ b/drivers/scsi/Kconfig
@@ -1058,6 +1058,7 @@ config SCSI_IPR
depends on PCI && SCSI && ATA
select FW_LOADER
select IRQ_POLL
+   select SGL_ALLOC
---help---
  This driver supports the IBM Power Linux family RAID adapters.
  This includes IBM pSeries 5712, 5703, 5709, and 570A, as well
diff --git a/drivers/scsi/ipr.c b/drivers/scsi/ipr.c
index f838bd73befa..b197995a6f34 100644
--- a/drivers/scsi/ipr.c
+++ b/drivers/scsi/ipr.c
@@ -76,6 +76,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -3815,10 +3816,8 @@ static struct device_attribute ipr_iopoll_weight_attr = {
  **/
 static struct ipr_sglist *ipr_alloc_ucode_buffer(int buf_len)
 {
-   int sg_size, order, bsize_elem, num_elem, i, j;
+   int sg_size, order;
struct ipr_sglist *sglist;
-   struct scatterlist *scatterlist;
-   struct page *page;
 
/* Get the minimum size per scatter/gather element */
sg_size = buf_len / (IPR_MAX_SGLIST - 1);
@@ -3826,45 +3825,18 @@ static struct ipr_sglist *ipr_alloc_ucode_buffer(int 
buf_len)
/* Get the actual size per element */
order = get_order(sg_size);
 
-   /* Determine the actual number of bytes per element */
-   bsize_elem = PAGE_SIZE * (1 << order);
-
-   /* Determine the actual number of sg entries needed */
-   if (buf_len % bsize_elem)
-   num_elem = (buf_len / bsize_elem) + 1;
-   else
-   num_elem = buf_len / bsize_elem;
-
/* Allocate a scatter/gather list for the DMA */
-   sglist = kzalloc(sizeof(struct ipr_sglist) +
-(sizeof(struct scatterlist) * (num_elem - 1)),
-GFP_KERNEL);
-
+   sglist = kzalloc(sizeof(struct ipr_sglist), GFP_KERNEL);
if (sglist == NULL) {
ipr_trace;
return NULL;
}
-
-   scatterlist = sglist->scatterlist;
-   sg_init_table(scatterlist, num_elem);
-
sglist->order = order;
-   sglist->num_sg = num_elem;
-
-   /* Allocate a bunch of sg elements */
-   for (i = 0; i < num_elem; i++) {
-   page = alloc_pages(GFP_KERNEL, order);
-   if (!page) {
-   ipr_trace;
-
-   /* Free up what we already allocated */
-   for (j = i - 1; j >= 0; j--)
-				__free_pages(sg_page(&scatterlist[j]), order);
-   kfree(sglist);
-   return NULL;
-   }
-
-		sg_set_page(&scatterlist[i], page, 0, 0);
+	sglist->scatterlist = sgl_alloc_order(buf_len, order, &sglist->num_sg,
+					      GFP_KERNEL, false);
+   if (!sglist->scatterlist) {
+   kfree(sglist);
+   return NULL;
}
 
return sglist;
@@ -3882,11 +3854,7 @@ static struct ipr_sglist *ipr_alloc_ucode_buffer(int 
buf_len)
  **/
 static void ipr_free_ucode_buffer(struct ipr_sglist *sglist)
 {
-   int i;
-
-   for (i = 0; i < sglist->num_sg; i++)
-		__free_pages(sg_page(&sglist->scatterlist[i]), sglist->order);
-
+   sgl_free_order(sglist->scatterlist, sglist->order);
kfree(sglist);
 }
 
diff --git a/drivers/scsi/ipr.h b/drivers/scsi/ipr.h
index c7f0e9e3cd7d..93570734cbfb 100644
--- a/drivers/scsi/ipr.h
+++ b/drivers/scsi/ipr.h
@@ -1454,7 +1454,7 @@ struct ipr_sglist {
u32 num_sg;
u32 num_dma_sg;
u32 buffer_len;
-   struct scatterlist scatterlist[1];
+   struct scatterlist *scatterlist;
 };
 
 enum ipr_sdt_state {
-- 
2.14.2



[PATCH 7/8] scsi/pmcraid: Remove an unused structure member

2017-10-12 Thread Bart Van Assche
Signed-off-by: Bart Van Assche 
Cc: linux-s...@vger.kernel.org
Cc: Martin K. Petersen 
Cc: Anil Ravindranath 
---
 drivers/scsi/pmcraid.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/scsi/pmcraid.h b/drivers/scsi/pmcraid.h
index 8bfac72a242b..44da91712115 100644
--- a/drivers/scsi/pmcraid.h
+++ b/drivers/scsi/pmcraid.h
@@ -542,7 +542,6 @@ struct pmcraid_sglist {
u32 order;
u32 num_sg;
u32 num_dma_sg;
-   u32 buffer_len;
struct scatterlist scatterlist[1];
 };
 
-- 
2.14.2



[PATCH 0/8] Introduce sgl_alloc() and sgl_free()

2017-10-12 Thread Bart Van Assche
Hello Jens,

As you probably know there are multiple drivers that both allocate a
scatter/gather list and populate that list with pages. This patch series moves
the code for allocating and freeing such scatterlists from these drivers into
a new source file, namely lib/sgl_alloc. Please consider this patch series for
kernel v4.15.
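
For reference, here is a minimal sketch of how a driver could use the new
helpers, based only on the prototypes introduced in patch 1 (the length and
the surrounding function are illustrative assumptions, not code from any
in-tree caller):

#include <linux/gfp.h>
#include <linux/scatterlist.h>
#include <linux/sgl_alloc.h>

/* Allocate backing pages for a 1 MiB transfer, then release them again. */
static int example_sgl_usage(void)
{
	struct scatterlist *sgl;
	unsigned int nents;

	sgl = sgl_alloc(1ULL << 20, &nents, GFP_KERNEL);
	if (!sgl)
		return -ENOMEM;

	/* ... dma_map_sg(), perform the transfer, dma_unmap_sg() ... */

	sgl_free(sgl);
	return 0;
}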

Notes:
- Only the ib_srpt driver changes have been tested but none of the other
  drivers have been retested.
- The next step is to introduce a caching mechanism for these scatterlists
  and make the nvmet/rdma and SCSI target drivers use that caching mechanism
  since for these drivers sgl allocation occurs in the hot path.

Thanks,

Bart.

Bart Van Assche (8):
  lib: Introduce sgl_alloc() and sgl_free()
  crypto: scompress - use sgl_alloc() and sgl_free()
  nvmet/fc: Use sgl_alloc() and sgl_free()
  nvmet/rdma: Use sgl_alloc() and sgl_free()
  target: Use sgl_alloc_order() and sgl_free()
  scsi/ipr: Use sgl_alloc_order() and sgl_free_order()
  scsi/pmcraid: Remove an unused structure member
  scsi/pmcraid: Use sgl_alloc_order() and sgl_free_order()

 crypto/Kconfig |   1 +
 crypto/scompress.c |  52 +
 drivers/nvme/target/Kconfig|   2 +
 drivers/nvme/target/fc.c   |  37 +---
 drivers/nvme/target/rdma.c |  64 ++---
 drivers/scsi/Kconfig   |   2 +
 drivers/scsi/ipr.c |  50 +++-
 drivers/scsi/ipr.h |   2 +-
 drivers/scsi/pmcraid.c |  44 ++
 drivers/scsi/pmcraid.h |   3 +-
 drivers/target/Kconfig |   1 +
 drivers/target/target_core_transport.c |  47 ++-
 include/linux/sgl_alloc.h  |  16 ++
 lib/Kconfig|   4 ++
 lib/Makefile   |   1 +
 lib/sgl_alloc.c| 102 +
 16 files changed, 161 insertions(+), 267 deletions(-)
 create mode 100644 include/linux/sgl_alloc.h
 create mode 100644 lib/sgl_alloc.c

-- 
2.14.2



[PATCH 5/8] target: Use sgl_alloc_order() and sgl_free()

2017-10-12 Thread Bart Van Assche
Use the sgl_alloc_order() and sgl_free() functions instead of open
coding these functions.

Signed-off-by: Bart Van Assche 
Cc: Nicholas A. Bellinger 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Sagi Grimberg 
---
 drivers/target/Kconfig |  1 +
 drivers/target/target_core_transport.c | 47 --
 2 files changed, 6 insertions(+), 42 deletions(-)

diff --git a/drivers/target/Kconfig b/drivers/target/Kconfig
index e2bc99980f75..4c44d7bed01a 100644
--- a/drivers/target/Kconfig
+++ b/drivers/target/Kconfig
@@ -5,6 +5,7 @@ menuconfig TARGET_CORE
select CONFIGFS_FS
select CRC_T10DIF
select BLK_SCSI_REQUEST # only for scsi_command_size_tbl..
+   select SGL_ALLOC
default n
help
Say Y or M here to enable the TCM Storage Engine and ConfigFS enabled
diff --git a/drivers/target/target_core_transport.c 
b/drivers/target/target_core_transport.c
index 836d552b0385..e8f0eabd25ab 100644
--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2293,13 +2294,7 @@ static void target_complete_ok_work(struct work_struct 
*work)
 
 void target_free_sgl(struct scatterlist *sgl, int nents)
 {
-   struct scatterlist *sg;
-   int count;
-
-   for_each_sg(sgl, sg, nents, count)
-   __free_page(sg_page(sg));
-
-   kfree(sgl);
+   sgl_free(sgl);
 }
 EXPORT_SYMBOL(target_free_sgl);
 
@@ -2423,42 +2418,10 @@ int
 target_alloc_sgl(struct scatterlist **sgl, unsigned int *nents, u32 length,
 bool zero_page, bool chainable)
 {
-   struct scatterlist *sg;
-   struct page *page;
-   gfp_t zero_flag = (zero_page) ? __GFP_ZERO : 0;
-   unsigned int nalloc, nent;
-   int i = 0;
-
-   nalloc = nent = DIV_ROUND_UP(length, PAGE_SIZE);
-   if (chainable)
-   nalloc++;
-   sg = kmalloc_array(nalloc, sizeof(struct scatterlist), GFP_KERNEL);
-   if (!sg)
-   return -ENOMEM;
+   gfp_t gfp = GFP_KERNEL | (zero_page ? __GFP_ZERO : 0);
 
-   sg_init_table(sg, nalloc);
-
-   while (length) {
-   u32 page_len = min_t(u32, length, PAGE_SIZE);
-   page = alloc_page(GFP_KERNEL | zero_flag);
-   if (!page)
-   goto out;
-
-		sg_set_page(&sg[i], page, page_len, 0);
-   length -= page_len;
-   i++;
-   }
-   *sgl = sg;
-   *nents = nent;
-   return 0;
-
-out:
-   while (i > 0) {
-   i--;
-		__free_page(sg_page(&sg[i]));
-   }
-   kfree(sg);
-   return -ENOMEM;
+   *sgl = sgl_alloc_order(length, 0, nents, gfp, chainable);
+   return *sgl ? 0 : -ENOMEM;
 }
 EXPORT_SYMBOL(target_alloc_sgl);
 
-- 
2.14.2



[PATCH 1/8] lib: Introduce sgl_alloc() and sgl_free()

2017-10-12 Thread Bart Van Assche
Many kernel drivers contain code that allocate and free both a
scatterlist and the pages that populate that scatterlist.
Introduce functions under lib/ that perform these tasks instead
of duplicating this functionality in multiple drivers.

Signed-off-by: Bart Van Assche 
---
 include/linux/sgl_alloc.h |  16 
 lib/Kconfig   |   4 ++
 lib/Makefile  |   1 +
 lib/sgl_alloc.c   | 102 ++
 4 files changed, 123 insertions(+)
 create mode 100644 include/linux/sgl_alloc.h
 create mode 100644 lib/sgl_alloc.c

diff --git a/include/linux/sgl_alloc.h b/include/linux/sgl_alloc.h
new file mode 100644
index ..a0f719690c9e
--- /dev/null
+++ b/include/linux/sgl_alloc.h
@@ -0,0 +1,16 @@
+#ifndef _SGL_ALLOC_H_
+#define _SGL_ALLOC_H_
+
+#include  /* bool, gfp_t */
+
+struct scatterlist;
+
+struct scatterlist *sgl_alloc_order(unsigned long long length,
+   unsigned int order, unsigned int *nent_p,
+   gfp_t gfp, bool chainable);
+struct scatterlist *sgl_alloc(unsigned long long length, unsigned int *nent_p,
+ gfp_t gfp);
+void sgl_free_order(struct scatterlist *sgl, int order);
+void sgl_free(struct scatterlist *sgl);
+
+#endif /* _SGL_ALLOC_H_ */
diff --git a/lib/Kconfig b/lib/Kconfig
index b1445b22a6de..8396c4cfa1ab 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -413,6 +413,10 @@ config HAS_DMA
depends on !NO_DMA
default y
 
+config SGL_ALLOC
+   bool
+   default n
+
 config DMA_NOOP_OPS
bool
depends on HAS_DMA && (!64BIT || ARCH_DMA_ADDR_T_64BIT)
diff --git a/lib/Makefile b/lib/Makefile
index dafa79613fb4..4a0e9caf3c0e 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -27,6 +27,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
+lib-$(CONFIG_SGL_ALLOC) += sgl_alloc.o
 lib-$(CONFIG_DMA_NOOP_OPS) += dma-noop.o
 lib-$(CONFIG_DMA_VIRT_OPS) += dma-virt.o
 
diff --git a/lib/sgl_alloc.c b/lib/sgl_alloc.c
new file mode 100644
index ..d96b395dd5c8
--- /dev/null
+++ b/lib/sgl_alloc.c
@@ -0,0 +1,102 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/**
+ * sgl_alloc_order - allocate a scatterlist and its pages
+ * @length: Length in bytes of the scatterlist. Must be at least one
+ * @order: Second argument for alloc_pages()
+ * @nent_p: [out] Number of entries in the scatterlist that have pages
+ * @gfp: Memory allocation flags
+ * @chainable: Whether or not to allocate an extra element in the scatterlist
+ * for scatterlist chaining purposes
+ *
+ * Returns: %NULL upon failure or a pointer to an initialized scatterlist.
+ */
+struct scatterlist *sgl_alloc_order(unsigned long long length,
+   unsigned int order, unsigned int *nent_p,
+   gfp_t gfp, bool chainable)
+{
+   struct scatterlist *sgl, *sg;
+   struct page *page;
+   unsigned int nent, nalloc;
+   u32 elem_len;
+
+   nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order);
+   nalloc = nent;
+   if (chainable) {
+   /* Check for integer overflow */
+   if (nalloc + 1 < nalloc)
+   return NULL;
+   nalloc++;
+   }
+   sgl = kmalloc_array(nalloc, sizeof(struct scatterlist),
+   (gfp & ~GFP_DMA) | __GFP_ZERO);
+   if (!sgl)
+   return NULL;
+
+   sg_init_table(sgl, nent);
+   sg = sgl;
+   while (length) {
+   elem_len = min_t(u64, length, PAGE_SIZE << order);
+   page = alloc_pages(gfp, order);
+   if (!page) {
+   sgl_free(sgl);
+   return NULL;
+   }
+
+   sg_set_page(sg, page, elem_len, 0);
+   length -= elem_len;
+   sg = sg_next(sg);
+   }
+   WARN_ON_ONCE(sg);
+   if (nent_p)
+   *nent_p = nent;
+   return sgl;
+}
+EXPORT_SYMBOL(sgl_alloc_order);
+
+/**
+ * sgl_alloc - allocate a scatterlist and its pages
+ * @length: Length in bytes of the scatterlist
+ * @nent_p: [out] Number of entries in the scatterlist
+ * @gfp: Memory allocation flags
+ */
+struct scatterlist *sgl_alloc(unsigned long long length, unsigned int *nent_p,
+ gfp_t gfp)
+{
+   return sgl_alloc_order(length, 0, nent_p, gfp, false);
+}
+EXPORT_SYMBOL(sgl_alloc);
+
+/**
+ * sgl_free_order - free a scatterlist and its pages
+ * @sg: Scatterlist with one or more elements
+ * @order: Second argument for __free_pages()
+ */
+void sgl_free_order(struct scatterlist *sgl, int order)
+{
+   struct scatterlist *sg;
+   struct page *page;
+
+   for (sg = sgl; sg; sg = sg_next(sg)) {
+   page = sg_page(sg);
+   if (page)
+

[bug report] block, bfq: decrease burst size when queues in burst exit

2017-10-12 Thread Dan Carpenter
Hello Paolo Valente,

This is a semi-automatic email about new static checker warnings.

The patch 7cb04004fa37: "block, bfq: decrease burst size when queues 
in burst exit" from Sep 21, 2017, leads to the following Smatch 
complaint:

block/bfq-iosched.c:3730 bfq_put_queue()
 error: we previously assumed 'bfqq->bfqd' could be null (see line 3720)

block/bfq-iosched.c
  3719  
  3720  if (bfqq->bfqd)
^^
Check for NULL.

  3721  bfq_log_bfqq(bfqq->bfqd, bfqq, "put_queue: %p %d",
  3722   bfqq, bfqq->ref);
  3723  
  3724  bfqq->ref--;
  3725  if (bfqq->ref)
  3726  return;
  3727  
  3728          if (bfq_bfqq_sync(bfqq) && !hlist_unhashed(&bfqq->burst_list_node)) {
  3729                  hlist_del_init(&bfqq->burst_list_node);
  3730  bfqq->bfqd->burst_size--;
^^
Patch adds an unchecked dereference.

  3731  }
  3732  
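
One possible shape of a fix, mirroring the existing NULL check above (a
sketch only, not a tested patch; whether bfqq->bfqd can really be NULL at
this point is for the author to confirm):

	if (bfq_bfqq_sync(bfqq) && !hlist_unhashed(&bfqq->burst_list_node)) {
		hlist_del_init(&bfqq->burst_list_node);
		if (bfqq->bfqd)
			bfqq->bfqd->burst_size--;
	}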

regards,
dan carpenter


Re: [PATCH v7 9/9] block, scsi: Make SCSI quiesce and resume work reliably

2017-10-12 Thread Martin Steigerwald
Bart Van Assche - 10.10.17, 15:27:
> On Tue, 2017-10-10 at 09:57 +0200, Martin Steigerwald wrote:
> 
> > Bart Van Assche - 09.10.17, 16:14:
> > 
> > > The contexts from which a SCSI device can be quiesced or resumed are:
> > > [ ... ]
> > 
> > 
> > Does this fix the issue as reliably as the patches from Ming? I mean in
> > *real world* scenarios? Or is it just about the same approach as Ming
> > has taken?
> > I ask because I don't see any Tested-By:'s here? I know I tested Ming's
> > patch series and I know it fixes the hang after resume from suspend
> > with blk-mq + BFQ issue for me. I have an uptime of 7 days and I didn't
> > see any uptime even remotely like that in a long time (before that issue
> > Intel gfx drivers caused hangs, but thankfully that seems fixed
> > meanwhile).
> > 
> > I'd be willing to test. Do you have a 4.14.x tree available with these
> > patches applied that I can just add as a remote and fetch from?

> A previous version of this patch series passed Oleksandr's tests. This
> series is close to that previous version so I think it is safe to assume
> that this version will also pass Oleksandr's tests. Anyway, more testing is
> definitely welcome so if you want to verify this series please start from
> the following tree (Jens' for-next branch + this series):
> https://github.com/bvanassche/linux/tree/blk-mq-pm-v7 

Several suspend to ram and resume, and suspend to disk and resume cycles 
without issues. So I think this is good.

Tested-By: Martin Steigerwald 

Thanks,
-- 
Martin


Re: [PATCH V7 5/6] blk-mq-sched: improve dispatching from sw queue

2017-10-12 Thread Bart Van Assche

On 10/12/17 11:37, Ming Lei wrote:

[ ... ]

This patch improves dispatching from the sw queue by using the
.get_budget and .put_budget callbacks.

Reviewed-by: Omar Sandoval 
Reviewed-by: Bart Van Assche 
Signed-off-by: Ming Lei 
---

Jens/Omar, once .get_budget is introduced, it should be unusual to
return busy from .queue_rq, so we can't use the previously discussed
approach of deciding whether to take requests one by one or to flush
them all. This patch therefore simply checks whether .get_budget is
implemented, which looks reasonable; in the future it is even possible
to take more requests by getting enough budget first.
[ ... ]


Version 7 of this patch differs significantly from the version I 
reviewed so I think you should remove my Reviewed-by tag.


Bart.


Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops

2017-10-12 Thread Jens Axboe
On 10/12/2017 12:37 PM, Ming Lei wrote:
> For SCSI devices, there is often a per-request-queue depth, which needs
> to be respected before queuing one request.
> 
> The current blk-mq always dequeues one request first, then calls .queue_rq()
> to dispatch the request to the LLD. One obvious issue with this approach is
> that I/O merging may not be good, because when the per-request-queue depth
> can't be respected, .queue_rq() has to return BLK_STS_RESOURCE, and this
> request has to stay in the hctx->dispatch list, never getting a chance to
> participate in I/O merging.
> 
> This patch introduces the .get_budget and .put_budget callbacks in
> blk_mq_ops, so we can try to get a reserved budget first before dequeuing a
> request. Once we can't get budget for queuing I/O, we don't need to dequeue
> the request at all, and I/O merging can be improved a lot.

I can't help but think that it would be cleaner to just be able to
reinsert the request into the scheduler properly, if we fail to
dispatch it. Bart hinted at that earlier as well.

With that, you would not need budget functions.

I don't feel _that_ strongly about it, though, and it is something
we can do down the line as well. I'd just hate having to introduce
budget ops only to kill them a little later.

If we stick with the budget, then add a parallel blk_mq_get_budget()
like you have a blk_mq_put_budget(), so we don't have to litter the code
with things like:

> + if (q->mq_ops->get_budget && !q->mq_ops->get_budget(hctx))
> + return true;

all over the place. There are also a few places where you don't use
blk_mq_put_budget() but open-code it instead, you should be consistent.
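
Something like the below is what I have in mind (just a sketch; the name
blk_mq_get_budget() is my suggestion, mirroring the blk_mq_put_budget()
helper this series already adds):

static inline bool blk_mq_get_budget(struct blk_mq_hw_ctx *hctx)
{
	struct request_queue *q = hctx->queue;

	/* No ->get_budget means there is nothing to reserve. */
	if (!q->mq_ops->get_budget)
		return true;

	return q->mq_ops->get_budget(hctx);
}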

-- 
Jens Axboe



[PATCH V7 0/6] blk-mq-sched: improve sequential I/O performance

2017-10-12 Thread Ming Lei
Hi Jens,

In Red Hat internal storage tests of the blk-mq scheduler, we
found that I/O performance is quite bad with mq-deadline, especially
for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
SRP...).

It turns out one big issue causes the performance regression: requests
are still dequeued from the sw queue/scheduler queue even when the LLD's
queue is busy, so I/O merges become quite difficult to make, and
sequential I/O degrades a lot.

This issue became one of the main reasons for reverting default SCSI_MQ
in v4.13.

These six patches improve the situation and bring back the lost
performance.

With this change, SCSI-MQ sequential I/O performance is
improved a lot. Paolo reported that mq-deadline performance
improved a lot[2] in his dbench test with V2, and a performance
improvement on lpfc/qla2xxx was also observed with V1.[1]

[1] http://marc.info/?l=linux-block=150151989915776=2
[2] https://marc.info/?l=linux-block=150217980602843=2

gitweb:

https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V7_part1

V7:
- introduce .get_budget/.put_budget, and IO merge gets improved
compared with V6, and in theory this approach is better than the way
taken in block legacy

- drop patch of 'blk-mq-sched: don't dequeue request until all in 
->dispatch are flushed'

V6:
- address comments from Christoph
- drop the 1st patch which changes blk_mq_request_bypass_insert(),
which should belong to dm-mpath's improvement
- move ' blk-mq-sched: move actual dispatching into one helper'
as 2nd patch, and use the introduced helper to simplify dispatch
logic
- merge two previous patches into one for improving dispatch from sw 
queue
- make comment/commit log line width as ~70, as suggested by
  Christoph

V5:
- address some comments from Omar
- add Tested-by & Reveiewed-by tag
- use direct issue for blk_mq_request_bypass_insert(), and
start to consider to improve sequential I/O for dm-mpath
- only include part 1(the original patch 1 ~ 6), as suggested
by Omar

V4:
- add Reviewed-by tag
- some trival change: typo fix in commit log or comment,
variable name, no actual functional change

V3:
- totally round robin for picking req from ctx, as suggested
by Bart
- remove one local variable in __sbitmap_for_each_set()
- drop patches of single dispatch list, which can improve
performance on mq-deadline, but cause a bit degrade on
none because all hctxs need to be checked after ->dispatch
is flushed. Will post it again once it is mature.
- rebase on v4.13-rc6 with block for-next

V2:
- dequeue request from sw queues in round roubin's style
as suggested by Bart, and introduces one helper in sbitmap
for this purpose
- improve bio merge via hash table from sw queue
- add comments about using DISPATCH_BUSY state in lockless way,
simplifying handling on busy state,
- hold ctx->lock when clearing ctx busy bit as suggested
by Bart

Ming Lei (6):
  blk-mq-sched: fix scheduler bad performance
  blk-mq-sched: move actual dispatching into one helper
  sbitmap: introduce __sbitmap_for_each_set()
  blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  blk-mq-sched: improve dispatching from sw queue
  SCSI: implement .get_budget and .put_budget for blk-mq

 block/blk-mq-sched.c| 121 +---
 block/blk-mq.c  |  74 +++--
 block/blk-mq.h  |   4 +-
 drivers/scsi/scsi_lib.c |  42 +
 include/linux/blk-mq.h  |  12 +
 include/linux/sbitmap.h |  64 ++---
 6 files changed, 268 insertions(+), 49 deletions(-)

-- 
2.9.5



[PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops

2017-10-12 Thread Ming Lei
For SCSI devices, there is often a per-request-queue depth, which needs
to be respected before queuing one request.

The current blk-mq always dequeues one request first, then calls .queue_rq()
to dispatch the request to the LLD. One obvious issue with this approach is
that I/O merging may not be good, because when the per-request-queue depth
can't be respected, .queue_rq() has to return BLK_STS_RESOURCE, and this
request has to stay in the hctx->dispatch list, never getting a chance to
participate in I/O merging.

This patch introduces the .get_budget and .put_budget callbacks in
blk_mq_ops, so we can try to get a reserved budget first before dequeuing a
request. Once we can't get budget for queuing I/O, we don't need to dequeue
the request at all, and I/O merging can be improved a lot.

Signed-off-by: Ming Lei 
---
 block/blk-mq-sched.c   | 39 +++
 block/blk-mq.c | 35 ---
 block/blk-mq.h |  2 +-
 include/linux/blk-mq.h | 10 ++
 4 files changed, 74 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index be29ba849408..bcf4773ee1ca 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -89,19 +89,32 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx 
*hctx)
return false;
 }
 
-static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
+static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 {
struct request_queue *q = hctx->queue;
struct elevator_queue *e = q->elevator;
LIST_HEAD(rq_list);
 
do {
-   struct request *rq = e->type->ops.mq.dispatch_request(hctx);
+   struct request *rq;
 
-   if (!rq)
+   if (e->type->ops.mq.has_work &&
+   !e->type->ops.mq.has_work(hctx))
break;
+
+   if (q->mq_ops->get_budget && !q->mq_ops->get_budget(hctx))
+   return true;
+
+   rq = e->type->ops.mq.dispatch_request(hctx);
+   if (!rq) {
+   if (q->mq_ops->put_budget)
+   q->mq_ops->put_budget(hctx);
+   break;
+   }
 		list_add(&rq->queuelist, &rq_list);
-	} while (blk_mq_dispatch_rq_list(q, &rq_list));
+	} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
+
+   return false;
 }
 
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
@@ -110,6 +123,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
struct elevator_queue *e = q->elevator;
const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
LIST_HEAD(rq_list);
+   bool run_queue = false;
 
/* RCU or SRCU read lock is needed before checking quiesced flag */
if (unlikely(blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)))
@@ -143,13 +157,22 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
 */
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
-		if (blk_mq_dispatch_rq_list(q, &rq_list) && has_sched_dispatch)
-			blk_mq_do_dispatch_sched(hctx);
+		if (blk_mq_dispatch_rq_list(q, &rq_list, false) &&
+   has_sched_dispatch)
+   run_queue = blk_mq_do_dispatch_sched(hctx);
} else if (has_sched_dispatch) {
-   blk_mq_do_dispatch_sched(hctx);
+   run_queue = blk_mq_do_dispatch_sched(hctx);
} else {
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
-		blk_mq_dispatch_rq_list(q, &rq_list);
+		blk_mq_dispatch_rq_list(q, &rq_list, false);
+   }
+
+   if (run_queue) {
+   if (!blk_mq_sched_needs_restart(hctx) &&
+		    !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state)) {
+   blk_mq_sched_mark_restart_hctx(hctx);
+   blk_mq_run_hw_queue(hctx, true);
+   }
}
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 076cbab9c3e0..e6e50a91f170 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1045,7 +1045,16 @@ static bool blk_mq_dispatch_wait_add(struct 
blk_mq_hw_ctx *hctx)
return true;
 }
 
-bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
+static void blk_mq_put_budget(struct blk_mq_hw_ctx *hctx, bool got_budget)
+{
+   struct request_queue *q = hctx->queue;
+
+   if (q->mq_ops->put_budget && got_budget)
+   q->mq_ops->put_budget(hctx);
+}
+
+bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
+   bool got_budget)
 {
struct blk_mq_hw_ctx *hctx;
struct request *rq;
@@ -1054,6 +1063,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list)
if (list_empty(list))
return false;
 
+   

[PATCH V7 6/6] SCSI: implement .get_budget and .put_budget for blk-mq

2017-10-12 Thread Ming Lei
We need to tell blk-mq to reserve resources before queuing
one request, so implement these two callbacks. Then blk-mq
can avoid dequeuing requests too early, and I/O merging can
be improved a lot.

Signed-off-by: Ming Lei 
---
 drivers/scsi/scsi_lib.c | 42 --
 1 file changed, 32 insertions(+), 10 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 9cf6a80fe297..d6273954a3b4 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1946,6 +1946,33 @@ static void scsi_mq_done(struct scsi_cmnd *cmd)
blk_mq_complete_request(cmd->request);
 }
 
+static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
+{
+   struct request_queue *q = hctx->queue;
+   struct scsi_device *sdev = q->queuedata;
+
+	atomic_dec(&sdev->device_busy);
+	put_device(&sdev->sdev_gendev);
+}
+
+static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
+{
+   struct request_queue *q = hctx->queue;
+   struct scsi_device *sdev = q->queuedata;
+
+	if (!get_device(&sdev->sdev_gendev))
+   goto out;
+   if (!scsi_dev_queue_ready(q, sdev))
+   goto out_put_device;
+
+   return true;
+
+out_put_device:
+	put_device(&sdev->sdev_gendev);
+out:
+   return false;
+}
+
 static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 const struct blk_mq_queue_data *bd)
 {
@@ -1962,13 +1989,8 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx 
*hctx,
goto out;
 
ret = BLK_STS_RESOURCE;
-	if (!get_device(&sdev->sdev_gendev))
-   goto out;
-
-   if (!scsi_dev_queue_ready(q, sdev))
-   goto out_put_device;
if (!scsi_target_queue_ready(shost, sdev))
-   goto out_dec_device_busy;
+   goto out_put_budget;
if (!scsi_host_queue_ready(q, shost, sdev))
goto out_dec_target_busy;
 
@@ -2003,10 +2025,8 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx 
*hctx,
 out_dec_target_busy:
if (scsi_target(sdev)->can_queue > 0)
 		atomic_dec(&scsi_target(sdev)->target_busy);
-out_dec_device_busy:
-	atomic_dec(&sdev->device_busy);
-out_put_device:
-	put_device(&sdev->sdev_gendev);
+out_put_budget:
+   scsi_mq_put_budget(hctx);
 out:
switch (ret) {
case BLK_STS_OK:
@@ -2211,6 +2231,8 @@ struct request_queue *scsi_old_alloc_queue(struct 
scsi_device *sdev)
 }
 
 static const struct blk_mq_ops scsi_mq_ops = {
+   .get_budget = scsi_mq_get_budget,
+   .put_budget = scsi_mq_put_budget,
.queue_rq   = scsi_queue_rq,
.complete   = scsi_softirq_done,
.timeout= scsi_timeout,
-- 
2.9.5



[PATCH V7 5/6] blk-mq-sched: improve dispatching from sw queue

2017-10-12 Thread Ming Lei
SCSI devices use a host-wide tagset, and the shared driver tag space is
often quite big. Meanwhile there is also a per-LUN queue depth
(.cmd_per_lun), which is often small; for example, on both lpfc and
qla2xxx, .cmd_per_lun is just 3.

So lots of requests may stay in the sw queue, and we always flush all
those belonging to the same hw queue and dispatch them all to the driver.
Unfortunately this easily makes the queue busy because of the small
.cmd_per_lun. Once these requests are flushed out, they have to stay in
hctx->dispatch, no bio merging can happen against these requests, and
sequential I/O performance is hurt a lot.

This patch introduces blk_mq_dequeue_from_ctx() for dequeuing requests
from the sw queue so that we can dispatch them in the scheduler's way and
avoid dequeuing too many requests from the sw queue when ->dispatch isn't
flushed completely.

This patch improves dispatching from the sw queue by using the
.get_budget and .put_budget callbacks.

Reviewed-by: Omar Sandoval 
Reviewed-by: Bart Van Assche 
Signed-off-by: Ming Lei 
---

Jens/Omar, once .get_budget is introduced, it should be unusual to
return busy from .queue_rq, so we can't use the previously discussed
approach of deciding whether to take requests one by one or to flush
them all. This patch therefore simply checks whether .get_budget is
implemented, which looks reasonable; in the future it is even possible
to take more requests by getting enough budget first.


 block/blk-mq-sched.c   | 63 +++---
 block/blk-mq.c | 39 +++
 block/blk-mq.h |  2 ++
 include/linux/blk-mq.h |  2 ++
 4 files changed, 103 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index bcf4773ee1ca..ccbbc7e108ea 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -117,6 +117,50 @@ static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx 
*hctx)
return false;
 }
 
+static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
+ struct blk_mq_ctx *ctx)
+{
+   unsigned idx = ctx->index_hw;
+
+   if (++idx == hctx->nr_ctx)
+   idx = 0;
+
+   return hctx->ctxs[idx];
+}
+
+static bool blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
+{
+   struct request_queue *q = hctx->queue;
+   LIST_HEAD(rq_list);
+   struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
+
+   do {
+   struct request *rq;
+
+		if (!sbitmap_any_bit_set(&hctx->ctx_map))
+   break;
+
+   if (q->mq_ops->get_budget && !q->mq_ops->get_budget(hctx))
+   return true;
+
+   rq = blk_mq_dequeue_from_ctx(hctx, ctx);
+   if (!rq) {
+   if (q->mq_ops->put_budget)
+   q->mq_ops->put_budget(hctx);
+   break;
+   }
+		list_add(&rq->queuelist, &rq_list);
+
+   /* round robin for fair dispatch */
+   ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
+
+	} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
+
+   WRITE_ONCE(hctx->dispatch_from, ctx);
+
+   return false;
+}
+
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 {
struct request_queue *q = hctx->queue;
@@ -157,11 +201,24 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
 */
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
-		if (blk_mq_dispatch_rq_list(q, &rq_list, false) &&
-		    has_sched_dispatch)
-			run_queue = blk_mq_do_dispatch_sched(hctx);
+		if (blk_mq_dispatch_rq_list(q, &rq_list, false)) {
+   if (has_sched_dispatch)
+   run_queue = blk_mq_do_dispatch_sched(hctx);
+   else
+   run_queue = blk_mq_do_dispatch_ctx(hctx);
+   }
} else if (has_sched_dispatch) {
run_queue = blk_mq_do_dispatch_sched(hctx);
+   } else if (q->mq_ops->get_budget) {
+   /*
+* If we need to get budget before queuing request, we
+* dequeue request one by one from sw queue for avoiding
+* to mess up I/O merge when dispatch runs out of resource.
+*
+* TODO: get more budgets, and dequeue more requests in
+* one time.
+*/
+   run_queue = blk_mq_do_dispatch_ctx(hctx);
} else {
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
 		blk_mq_dispatch_rq_list(q, &rq_list, false);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e6e50a91f170..9e2fc26a9018 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -911,6 +911,45 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, 
struct list_head *list)
 }
 

[PATCH V7 3/6] sbitmap: introduce __sbitmap_for_each_set()

2017-10-12 Thread Ming Lei
We need to iterate over the ctxs starting from any ctx in a round-robin
way, so introduce this helper.
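
A minimal usage sketch (illustrative only, not part of this patch): a
caller can visit every set bit exactly once while starting the scan from
an arbitrary bit and wrapping around.

#include <linux/sbitmap.h>

static bool count_set_bit(struct sbitmap *sb, unsigned int bitnr, void *data)
{
	unsigned int *count = data;

	(*count)++;	/* bitnr is a set bit */
	return true;	/* keep iterating */
}

static unsigned int count_set_from(struct sbitmap *sb, unsigned int start)
{
	unsigned int count = 0;

	/* starts at 'start', wraps around, and visits each set bit once */
	__sbitmap_for_each_set(sb, start, count_set_bit, &count);
	return count;
}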

Reviewed-by: Omar Sandoval 
Cc: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 include/linux/sbitmap.h | 64 -
 1 file changed, 47 insertions(+), 17 deletions(-)

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index a1904aadbc45..0dcc60e820de 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -211,10 +211,14 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb);
  */
 bool sbitmap_any_bit_clear(const struct sbitmap *sb);
 
+#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift)
+#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U))
+
 typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *);
 
 /**
- * sbitmap_for_each_set() - Iterate over each set bit in a  sbitmap.
+ * __sbitmap_for_each_set() - Iterate over each set bit in a  sbitmap.
+ * @start: Where to start the iteration.
  * @sb: Bitmap to iterate over.
  * @fn: Callback. Should return true to continue or false to break early.
  * @data: Pointer to pass to callback.
@@ -222,35 +226,61 @@ typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned 
int, void *);
  * This is inline even though it's non-trivial so that the function calls to 
the
  * callback will hopefully get optimized away.
  */
-static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
-   void *data)
+static inline void __sbitmap_for_each_set(struct sbitmap *sb,
+ unsigned int start,
+ sb_for_each_fn fn, void *data)
 {
-   unsigned int i;
+   unsigned int index;
+   unsigned int nr;
+   unsigned int scanned = 0;
 
-   for (i = 0; i < sb->map_nr; i++) {
-		struct sbitmap_word *word = &sb->map[i];
-   unsigned int off, nr;
+   if (start >= sb->depth)
+   start = 0;
+   index = SB_NR_TO_INDEX(sb, start);
+   nr = SB_NR_TO_BIT(sb, start);
 
-   if (!word->word)
-   continue;
+   while (scanned < sb->depth) {
+		struct sbitmap_word *word = &sb->map[index];
+   unsigned int depth = min_t(unsigned int, word->depth - nr,
+  sb->depth - scanned);
 
-   nr = 0;
-   off = i << sb->shift;
+   scanned += depth;
+   if (!word->word)
+   goto next;
+
+   /*
+* On the first iteration of the outer loop, we need to add the
+* bit offset back to the size of the word for find_next_bit().
+* On all other iterations, nr is zero, so this is a noop.
+*/
+   depth += nr;
while (1) {
-			nr = find_next_bit(&word->word, word->depth, nr);
-   if (nr >= word->depth)
+			nr = find_next_bit(&word->word, depth, nr);
+   if (nr >= depth)
break;
-
-   if (!fn(sb, off + nr, data))
+   if (!fn(sb, (index << sb->shift) + nr, data))
return;
 
nr++;
}
+next:
+   nr = 0;
+   if (++index >= sb->map_nr)
+   index = 0;
}
 }
 
-#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift)
-#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U))
+/**
+ * sbitmap_for_each_set() - Iterate over each set bit in a  sbitmap.
+ * @sb: Bitmap to iterate over.
+ * @fn: Callback. Should return true to continue or false to break early.
+ * @data: Pointer to pass to callback.
+ */
+static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
+   void *data)
+{
+   __sbitmap_for_each_set(sb, 0, fn, data);
+}
 
 static inline unsigned long *__sbitmap_word(struct sbitmap *sb,
unsigned int bitnr)
-- 
2.9.5



[PATCH V7 1/6] blk-mq-sched: fix scheduler bad performance

2017-10-12 Thread Ming Lei
When the hw queue is busy, we shouldn't take requests from the
scheduler queue any more; otherwise it is difficult to do
I/O merging.

This patch fixes the awful I/O performance on some
SCSI devices (lpfc, qla2xxx, ...) when mq-deadline/kyber
is used, by not taking requests if the hw queue is busy.

Reviewed-by: Omar Sandoval 
Reviewed-by: Bart Van Assche 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 block/blk-mq-sched.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 4ab69435708c..eca011fdfa0e 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -94,7 +94,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
struct request_queue *q = hctx->queue;
struct elevator_queue *e = q->elevator;
const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
-   bool did_work = false;
+   bool do_sched_dispatch = true;
LIST_HEAD(rq_list);
 
/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -125,18 +125,18 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
 */
 	if (!list_empty(&rq_list)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
-		did_work = blk_mq_dispatch_rq_list(q, &rq_list);
+		do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
 	} else if (!has_sched_dispatch) {
 		blk_mq_flush_busy_ctxs(hctx, &rq_list);
 		blk_mq_dispatch_rq_list(q, &rq_list);
}
 
/*
-* We want to dispatch from the scheduler if we had no work left
-* on the dispatch list, OR if we did have work but weren't able
-* to make progress.
+* We want to dispatch from the scheduler if there was nothing
+* on the dispatch list or we were able to dispatch from the
+* dispatch list.
 */
-   if (!did_work && has_sched_dispatch) {
+   if (do_sched_dispatch && has_sched_dispatch) {
do {
struct request *rq;
 
-- 
2.9.5



[PATCH V7 2/6] blk-mq-sched: move actual dispatching into one helper

2017-10-12 Thread Ming Lei
So that it becomes easy to support dispatching from the sw queue in the
following patch.

No functional change.

Reviewed-by: Bart Van Assche 
Reviewed-by: Omar Sandoval 
Suggested-by: Christoph Hellwig  # for simplifying dispatch logic
Signed-off-by: Ming Lei 
---
 block/blk-mq-sched.c | 43 ---
 1 file changed, 24 insertions(+), 19 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index eca011fdfa0e..be29ba849408 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -89,12 +89,26 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx 
*hctx)
return false;
 }
 
+static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
+{
+   struct request_queue *q = hctx->queue;
+   struct elevator_queue *e = q->elevator;
+   LIST_HEAD(rq_list);
+
+   do {
+   struct request *rq = e->type->ops.mq.dispatch_request(hctx);
+
+   if (!rq)
+   break;
+		list_add(&rq->queuelist, &rq_list);
+	} while (blk_mq_dispatch_rq_list(q, &rq_list));
+}
+
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 {
struct request_queue *q = hctx->queue;
struct elevator_queue *e = q->elevator;
const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
-   bool do_sched_dispatch = true;
LIST_HEAD(rq_list);
 
/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -122,30 +136,21 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
 * scheduler, we can no longer merge or sort them. So it's best to
 * leave them there for as long as we can. Mark the hw queue as
 * needing a restart in that case.
+*
+* We want to dispatch from the scheduler if there was nothing
+* on the dispatch list or we were able to dispatch from the
+* dispatch list.
 */
if (!list_empty(&rq_list)) {
blk_mq_sched_mark_restart_hctx(hctx);
-   do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
-   } else if (!has_sched_dispatch) {
+   if (blk_mq_dispatch_rq_list(q, &rq_list) && has_sched_dispatch)
+   blk_mq_do_dispatch_sched(hctx);
+   } else if (has_sched_dispatch) {
+   blk_mq_do_dispatch_sched(hctx);
+   } else {
blk_mq_flush_busy_ctxs(hctx, &rq_list);
blk_mq_dispatch_rq_list(q, &rq_list);
}
-
-   /*
-* We want to dispatch from the scheduler if there was nothing
-* on the dispatch list or we were able to dispatch from the
-* dispatch list.
-*/
-   if (do_sched_dispatch && has_sched_dispatch) {
-   do {
-   struct request *rq;
-
-   rq = e->type->ops.mq.dispatch_request(hctx);
-   if (!rq)
-   break;
-   list_add(&rq->queuelist, &rq_list);
-   } while (blk_mq_dispatch_rq_list(q, &rq_list));
-   }
 }
 
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
-- 
2.9.5
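
The two patches above boil down to one pattern: pull requests from the
scheduler one at a time and stop as soon as the driver reports busy, so
whatever is still queued stays visible to the scheduler for merging. Below is
a minimal standalone sketch of that pattern in plain C; driver_queue_rq and
sched_next are made-up stand-ins, not blk-mq symbols.

/* toy_dispatch.c - illustrative only.  Build: cc -o toy toy_dispatch.c */
#include <stdio.h>
#include <stdbool.h>

#define NR_REQS   8
#define HW_DEPTH  3               /* pretend .cmd_per_lun-like limit */

static int inflight;

/* stand-in for the driver ->queue_rq(): accept until the "hardware" is full */
static bool driver_queue_rq(int rq)
{
        if (inflight >= HW_DEPTH)
                return false;     /* think BLK_STS_RESOURCE */
        inflight++;
        printf("dispatched request %d\n", rq);
        return true;
}

/* stand-in for the scheduler ->dispatch_request() */
static int sched_next(void)
{
        static int next;
        return next < NR_REQS ? next++ : -1;
}

int main(void)
{
        int rq;

        /*
         * Pull one request at a time and stop as soon as the driver is
         * busy, instead of flushing everything and letting it pile up
         * where the scheduler can no longer merge it.
         */
        while ((rq = sched_next()) >= 0) {
                if (!driver_queue_rq(rq)) {
                        printf("driver busy at request %d, stop pulling from the scheduler\n", rq);
                        break;
                }
        }
        return 0;
}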



Re: [PATCH v8 03/10] md: Neither resync nor reshape while the system is frozen

2017-10-12 Thread Shaohua Li
On Thu, Oct 12, 2017 at 04:59:01PM +, Bart Van Assche wrote:
> On Wed, 2017-10-11 at 12:32 -0700, Shaohua Li wrote:
> > On Wed, Oct 11, 2017 at 05:17:56PM +, Bart Van Assche wrote:
> > > On Tue, 2017-10-10 at 18:48 -0700, Shaohua Li wrote:
> > > > The problem is __md_stop_writes set some bit like MD_RECOVERY_FROZEN, 
> > > > which
> > > > will prevent md_check_recovery restarting resync/reshape. I think we 
> > > > need
> > > > preserve mddev->recovery in suspend prepare and restore after resume
> > > 
> > > Hello Shaohua,
> > > 
> > > As far as I can see __md_stop_writes() sets MD_RECOVERY_FROZEN and can set
> > > MD_RECOVERY_INTR. Since md_check_recovery() clears MD_RECOVERY_INTR I 
> > > think
> > > it should be sufficient to save and restore the state of the
> > > MD_RECOVERY_FROZEN flag. However, when I ran the following test:
> > > * echo frozen > /sys/block/md0/md/sync_action
> > > * cat /sys/block/md0/md/sync_action
> > > * systemctl hibernate
> > > * (power on system again)
> > > * cat /sys/block/md0/md/sync_action
> > > 
> > > the output that appeared was as follows:
> > > 
> > > frozen
> > > idle 
> > > Does that mean that a hibernate or suspend followed by a resume is 
> > > sufficient
> > > to clear the MD_RECOVERY_FROZEN flag for the md drivers and hence that the
> > > patch at the start of this e-mail thread does not need any further
> > > modifications?
> > 
> > Have no idea why it shows 'idle'. From my understanding, after resume, 
> > previous
> > memory is stored and MD_RECOVERY_FROZEN bit should be set. Was trying to
> > understand what happens.
> 
> Hello Shaohua,
> 
> I think a user space action causes the MD_RECOVERY_FROZEN bit to be cleared.
> After I had added a few debug statements in the md driver this is what I
> found in the system log:
> 
> Oct 12 09:38:39 kernel: action_store: storing action check
> Oct 12 09:38:39 kernel: md: data-check of RAID array md0
> Oct 12 09:38:40 systemd[1]: Starting Hibernate...
> Oct 12 09:38:40 kernel: md: md0: data-check interrupted.
> [ powered on system manually ]
> Oct 12 09:39:05 kernel: action_store: storing action check
> Oct 12 09:39:05 kernel: md: data-check of RAID array md0
> Oct 12 09:39:11 kernel: action_store: storing action idle
> Oct 12 09:39:11 kernel: md: md0: data-check interrupted.
> 
> Anyway, how about the patch below? I have verified by adding a debug pr_info()
> statement that the newly added code really clears the MD_RECOVERY_FROZEN bit.

Ok, for patch 1-3:
Reviewed-by: Shaohua Li 

> Subject: [PATCH] md: Neither resync nor reshape while the system is frozen
> 
> Some people use the md driver on laptops and use the suspend and
> resume functionality. Since it is essential that submitting of
> new I/O requests stops before a hibernation image is created,
> interrupt the md resync and reshape actions if the system is
> being frozen. Note: the resync and reshape will restart after
> the system is resumed and a message similar to the following
> will appear in the system log:
> 
> md: md0: data-check interrupted.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Shaohua Li 
> Cc: linux-r...@vger.kernel.org
> Cc: Ming Lei 
> Cc: Christoph Hellwig 
> Cc: Hannes Reinecke 
> Cc: Johannes Thumshirn 
> ---
>  drivers/md/md.c | 50 +-
>  drivers/md/md.h |  8 
>  2 files changed, 57 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index b99584e5d6b1..8b2fc750f97f 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -66,6 +66,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  #include 
>  #include "md.h"
> @@ -7439,6 +7441,7 @@ static int md_thread(void *arg)
>*/
>  
>   allow_signal(SIGKILL);
> + set_freezable();
>   while (!kthread_should_stop()) {
>  
>   /* We need to wait INTERRUPTIBLE so that
> @@ -7449,7 +7452,7 @@ static int md_thread(void *arg)
>   if (signal_pending(current))
>   flush_signals(current);
>  
> - wait_event_interruptible_timeout
> + wait_event_freezable_timeout
>   (thread->wqueue,
>test_bit(THREAD_WAKEUP, &thread->flags)
>|| kthread_should_stop() || kthread_should_park(),
> @@ -8944,6 +8947,8 @@ static void md_stop_all_writes(void)
>   int need_delay = 0;
>  
>   for_each_mddev(mddev, tmp) {
> + mddev->frozen_before_suspend =
> + test_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>   if (mddev_trylock(mddev)) {
>   if (mddev->pers)
>   __md_stop_writes(mddev);
> @@ -8963,6 +8968,47 @@ static void md_stop_all_writes(void)
>   mdelay(1000*1);
>  }
>  
> +static void md_restore_frozen_flag(void)
> +{
> + struct 

Re: [PATCH v8 09/10] block, scsi: Make SCSI quiesce and resume work reliably

2017-10-12 Thread Bart Van Assche
On Fri, 2017-10-13 at 00:02 +0800, Ming Lei wrote:
> On Tue, Oct 10, 2017 at 02:03:45PM -0700, Bart Van Assche wrote:
> > [ ... ]
> > +   /* If the SCSI device has already been quiesced, do nothing. */
> > +   if (blk_set_preempt_only(q))
> > +   return 0;
> 
> Last time, I have mentioned that this way isn't safe, especially
> the SCSI domain command can be sent concurrently, also there might
> be issue wrt. runtime PM vs. system suspend.
> 
> The issue should have been avoided simply by setting the flag
> and continue to freeze queue.

I think the pm_autosleep_lock() / pm_autosleep_unlock() calls in the power
management code prevent that the power management code can trigger
concurrent scsi_device_quiesce() calls.

> > +
> > +   blk_mq_freeze_queue(q);
> > +   /*
> > +* Ensure that the effect of blk_set_preempt_only() will be visible
> > +* for percpu_ref_tryget() callers that occur after the queue
> > +* unfreeze even if the queue was already frozen before this function
> > +* was called. See also https://lwn.net/Articles/573497/.
> > +*/
> > +   synchronize_rcu();
> 
> No, it is a bad idea to do two synchronize_rcu() which can
> be very slow, especially in big machine, this way may slow down
> suspend much.

scsi_device_quiesce() is only used by the SCSI power management code and by
the SPI DV code. The number of users of the SCSI parallel code is small. And
the SCSI power management code is used on laptops but not on big machines.
Anyway, if you really care about this optimization, that's something that can
be done easily as a follow-up patch.

Bart.

Re: [PATCH v2] blk-throttle: track read and write request individually

2017-10-12 Thread Shaohua Li
On Thu, Oct 12, 2017 at 05:15:46PM +0800, Joseph Qi wrote:
> From: Joseph Qi 
> 
> In a mixed read/write workload on SSD, write latency is much lower than
> read latency. But currently we only track and record read latency and
> then use it as the threshold base for both read and write io latency
> accounting. As a result, write io latency will always be considered
> good and bad_bio_cnt stays much smaller than 20% of bio_cnt. That means
> the tg being checked will be treated as idle most of the time and will
> keep letting others dispatch more ios, even when it is truly running
> under its low limit and wants that low limit to be guaranteed, which is
> not what we expect. So track read and write requests individually,
> which brings more precise latency control for low limit idle detection.
> 
> Signed-off-by: Joseph Qi 

Reviewed-by: Shaohua Li 
> ---
>  block/blk-throttle.c | 134 
> ++-
>  1 file changed, 79 insertions(+), 55 deletions(-)
> 
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 17816a0..0897378 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -215,9 +215,9 @@ struct throtl_data
>  
>   unsigned int scale;
>  
> - struct latency_bucket tmp_buckets[LATENCY_BUCKET_SIZE];
> - struct avg_latency_bucket avg_buckets[LATENCY_BUCKET_SIZE];
> - struct latency_bucket __percpu *latency_buckets;
> + struct latency_bucket tmp_buckets[2][LATENCY_BUCKET_SIZE];
> + struct avg_latency_bucket avg_buckets[2][LATENCY_BUCKET_SIZE];
> + struct latency_bucket __percpu *latency_buckets[2];
>   unsigned long last_calculate_time;
>   unsigned long filtered_latency;
>  
> @@ -2040,10 +2040,10 @@ static void blk_throtl_update_idletime(struct 
> throtl_grp *tg)
>  #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
>  static void throtl_update_latency_buckets(struct throtl_data *td)
>  {
> - struct avg_latency_bucket avg_latency[LATENCY_BUCKET_SIZE];
> - int i, cpu;
> - unsigned long last_latency = 0;
> - unsigned long latency;
> + struct avg_latency_bucket avg_latency[2][LATENCY_BUCKET_SIZE];
> + int i, cpu, rw;
> + unsigned long last_latency[2] = { 0 };
> + unsigned long latency[2];
>  
>   if (!blk_queue_nonrot(td->queue))
>   return;
> @@ -2052,56 +2052,67 @@ static void throtl_update_latency_buckets(struct 
> throtl_data *td)
>   td->last_calculate_time = jiffies;
>  
>   memset(avg_latency, 0, sizeof(avg_latency));
> - for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
> - struct latency_bucket *tmp = &td->tmp_buckets[i];
> -
> - for_each_possible_cpu(cpu) {
> - struct latency_bucket *bucket;
> -
> - /* this isn't race free, but ok in practice */
> - bucket = per_cpu_ptr(td->latency_buckets, cpu);
> - tmp->total_latency += bucket[i].total_latency;
> - tmp->samples += bucket[i].samples;
> - bucket[i].total_latency = 0;
> - bucket[i].samples = 0;
> - }
> + for (rw = READ; rw <= WRITE; rw++) {
> + for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
> + struct latency_bucket *tmp = &td->tmp_buckets[rw][i];
> +
> + for_each_possible_cpu(cpu) {
> + struct latency_bucket *bucket;
> +
> + /* this isn't race free, but ok in practice */
> + bucket = per_cpu_ptr(td->latency_buckets[rw],
> + cpu);
> + tmp->total_latency += bucket[i].total_latency;
> + tmp->samples += bucket[i].samples;
> + bucket[i].total_latency = 0;
> + bucket[i].samples = 0;
> + }
>  
> - if (tmp->samples >= 32) {
> - int samples = tmp->samples;
> + if (tmp->samples >= 32) {
> + int samples = tmp->samples;
>  
> - latency = tmp->total_latency;
> + latency[rw] = tmp->total_latency;
>  
> - tmp->total_latency = 0;
> - tmp->samples = 0;
> - latency /= samples;
> - if (latency == 0)
> - continue;
> - avg_latency[i].latency = latency;
> + tmp->total_latency = 0;
> + tmp->samples = 0;
> + latency[rw] /= samples;
> + if (latency[rw] == 0)
> + continue;
> + avg_latency[rw][i].latency = latency[rw];
> + }
>   }
>   }
>  
> - for (i = 0; i < 

Re: [PATCH v8 03/10] md: Neither resync nor reshape while the system is frozen

2017-10-12 Thread Bart Van Assche
On Wed, 2017-10-11 at 12:32 -0700, Shaohua Li wrote:
> On Wed, Oct 11, 2017 at 05:17:56PM +, Bart Van Assche wrote:
> > On Tue, 2017-10-10 at 18:48 -0700, Shaohua Li wrote:
> > > The problem is __md_stop_writes set some bit like MD_RECOVERY_FROZEN, 
> > > which
> > > will prevent md_check_recovery restarting resync/reshape. I think we need
> > > preserve mddev->recovery in suspend prepare and restore after resume
> > 
> > Hello Shaohua,
> > 
> > As far as I can see __md_stop_writes() sets MD_RECOVERY_FROZEN and can set
> > MD_RECOVERY_INTR. Since md_check_recovery() clears MD_RECOVERY_INTR I think
> > it should be sufficient to save and restore the state of the
> > MD_RECOVERY_FROZEN flag. However, when I ran the following test:
> > * echo frozen > /sys/block/md0/md/sync_action
> > * cat /sys/block/md0/md/sync_action
> > * systemctl hibernate
> > * (power on system again)
> > * cat /sys/block/md0/md/sync_action
> > 
> > the output that appeared was as follows:
> > 
> > frozen
> > idle 
> > Does that mean that a hibernate or suspend followed by a resume is 
> > sufficient
> > to clear the MD_RECOVERY_FROZEN flag for the md drivers and hence that the
> > patch at the start of this e-mail thread does not need any further
> > modifications?
> 
> Have no idea why it shows 'idle'. From my understanding, after resume, 
> previous
> memory is stored and MD_RECOVERY_FROZEN bit should be set. Was trying to
> understand what happens.

Hello Shaohua,

I think a user space action causes the MD_RECOVERY_FROZEN bit to be cleared.
After I had added a few debug statements in the md driver this is what I
found in the system log:

Oct 12 09:38:39 kernel: action_store: storing action check
Oct 12 09:38:39 kernel: md: data-check of RAID array md0
Oct 12 09:38:40 systemd[1]: Starting Hibernate...
Oct 12 09:38:40 kernel: md: md0: data-check interrupted.
[ powered on system manually ]
Oct 12 09:39:05 kernel: action_store: storing action check
Oct 12 09:39:05 kernel: md: data-check of RAID array md0
Oct 12 09:39:11 kernel: action_store: storing action idle
Oct 12 09:39:11 kernel: md: md0: data-check interrupted.

Anyway, how about the patch below? I have verified by adding a debug pr_info()
statement that the newly added code really clears the MD_RECOVERY_FROZEN bit.

Thanks,

Bart.


Subject: [PATCH] md: Neither resync nor reshape while the system is frozen

Some people use the md driver on laptops and use the suspend and
resume functionality. Since it is essential that submitting of
new I/O requests stops before a hibernation image is created,
interrupt the md resync and reshape actions if the system is
being frozen. Note: the resync and reshape will restart after
the system is resumed and a message similar to the following
will appear in the system log:

md: md0: data-check interrupted.

Signed-off-by: Bart Van Assche 
Cc: Shaohua Li 
Cc: linux-r...@vger.kernel.org
Cc: Ming Lei 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
---
 drivers/md/md.c | 50 +-
 drivers/md/md.h |  8 
 2 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index b99584e5d6b1..8b2fc750f97f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -66,6 +66,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include "md.h"
@@ -7439,6 +7441,7 @@ static int md_thread(void *arg)
 */
 
allow_signal(SIGKILL);
+   set_freezable();
while (!kthread_should_stop()) {
 
/* We need to wait INTERRUPTIBLE so that
@@ -7449,7 +7452,7 @@ static int md_thread(void *arg)
if (signal_pending(current))
flush_signals(current);
 
-   wait_event_interruptible_timeout
+   wait_event_freezable_timeout
(thread->wqueue,
 test_bit(THREAD_WAKEUP, &thread->flags)
 || kthread_should_stop() || kthread_should_park(),
@@ -8944,6 +8947,8 @@ static void md_stop_all_writes(void)
int need_delay = 0;
 
for_each_mddev(mddev, tmp) {
+   mddev->frozen_before_suspend =
+   test_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
if (mddev_trylock(mddev)) {
if (mddev->pers)
__md_stop_writes(mddev);
@@ -8963,6 +8968,47 @@ static void md_stop_all_writes(void)
mdelay(1000*1);
 }
 
+static void md_restore_frozen_flag(void)
+{
+   struct list_head *tmp;
+   struct mddev *mddev;
+
+   for_each_mddev(mddev, tmp) {
+   if (mddev->frozen_before_suspend)
+   set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+   else
+   clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+   }
+}
+
+/*
+ * Ensure that neither 

Re: [PATCH v8 09/10] block, scsi: Make SCSI quiesce and resume work reliably

2017-10-12 Thread Ming Lei
On Tue, Oct 10, 2017 at 02:03:45PM -0700, Bart Van Assche wrote:
> The contexts from which a SCSI device can be quiesced or resumed are:
> * Writing into /sys/class/scsi_device/*/device/state.
> * SCSI parallel (SPI) domain validation.
> * The SCSI device power management methods. See also scsi_bus_pm_ops.
> 
> It is essential during suspend and resume that neither the filesystem
> state nor the filesystem metadata in RAM changes. This is why while
> the hibernation image is being written or restored that SCSI devices
> are quiesced. The SCSI core quiesces devices through scsi_device_quiesce()
> and scsi_device_resume(). In the SDEV_QUIESCE state execution of
> non-preempt requests is deferred. This is realized by returning
> BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced SCSI
> devices. Avoid that a full queue prevents power management requests
> to be submitted by deferring allocation of non-preempt requests for
> devices in the quiesced state. This patch has been tested by running
> the following commands and by verifying that after resume the fio job
> is still running:
> 
> for d in /sys/class/block/sd*[a-z]; do
>   hcil=$(readlink "$d/device")
>   hcil=${hcil#../../../}
>   echo 4 > "$d/queue/nr_requests"
>   echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth"
> done
> bdev=$(readlink /dev/disk/by-uuid/5217d83f-213e-4b42-b86e-20013325ba6c)
> bdev=${bdev#../../}
> hcil=$(readlink "/sys/block/$bdev/device")
> hcil=${hcil#../../../}
> fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 
> --rw=randread \
>   --ioengine=libaio --numjobs=4 --iodepth=16 --iodepth_batch=1 --thread \
>   --loops=$((2**31)) &
> pid=$!
> sleep 1
> systemctl hibernate
> sleep 10
> kill $pid
> 
> Reported-by: Oleksandr Natalenko 
> References: "I/O hangs after resuming from suspend-to-ram" 
> (https://marc.info/?l=linux-block&m=150340235201348).
> Signed-off-by: Bart Van Assche 
> Cc: Martin K. Petersen 
> Cc: Ming Lei 
> Cc: Christoph Hellwig 
> Cc: Hannes Reinecke 
> Cc: Johannes Thumshirn 
> ---
>  block/blk-core.c| 42 +++---
>  block/blk-mq.c  |  4 ++--
>  block/blk-timeout.c |  2 +-
>  drivers/scsi/scsi_lib.c | 29 +++--
>  fs/block_dev.c  |  4 ++--
>  include/linux/blkdev.h  |  2 +-
>  6 files changed, 60 insertions(+), 23 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index ed992cbd107f..3847ea42e341 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -372,6 +372,7 @@ void blk_clear_preempt_only(struct request_queue *q)
>  
>   spin_lock_irqsave(q->queue_lock, flags);
>   queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
> + wake_up_all(&q->mq_freeze_wq);
>   spin_unlock_irqrestore(q->queue_lock, flags);
>  }
>  EXPORT_SYMBOL_GPL(blk_clear_preempt_only);
> @@ -793,15 +794,40 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
>  }
>  EXPORT_SYMBOL(blk_alloc_queue);
>  
> -int blk_queue_enter(struct request_queue *q, bool nowait)
> +/**
> + * blk_queue_enter() - try to increase q->q_usage_counter
> + * @q: request queue pointer
> + * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
> + */
> +int blk_queue_enter(struct request_queue *q, unsigned int flags)
>  {
> + const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
> +
>   while (true) {
> + bool success = false;
>   int ret;
>  
> - if (percpu_ref_tryget_live(&q->q_usage_counter))
> + rcu_read_lock_sched();
> + if (percpu_ref_tryget_live(&q->q_usage_counter)) {
> + /*
> +  * The code that sets the PREEMPT_ONLY flag is
> +  * responsible for ensuring that that flag is globally
> +  * visible before the queue is unfrozen.
> +  */
> + if (preempt || !blk_queue_preempt_only(q)) {
> + success = true;
> + } else {
> + percpu_ref_put(&q->q_usage_counter);
> + WARN_ONCE("%s: Attempt to allocate non-preempt 
> request in preempt-only mode.\n",
> +   kobject_name(q->kobj.parent));
> + }
> + }
> + rcu_read_unlock_sched();
> +
> + if (success)
>   return 0;
>  
> - if (nowait)
> + if (flags & BLK_MQ_REQ_NOWAIT)
>   return -EBUSY;
>  
>   /*
> @@ -814,7 +840,8 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
>   smp_rmb();
>  
>   ret = wait_event_interruptible(q->mq_freeze_wq,
> - !atomic_read(&q->mq_freeze_depth) ||
> + 

Re: [PATCH V6 4/5] blk-mq-sched: improve dispatching from sw queue

2017-10-12 Thread Ming Lei
On Thu, Oct 12, 2017 at 09:37:11AM -0600, Jens Axboe wrote:
> On 10/12/2017 09:33 AM, Bart Van Assche wrote:
> > On Thu, 2017-10-12 at 18:01 +0800, Ming Lei wrote:
> >> Even EWMA approach isn't good on SCSI-MQ too, because
> >> some SCSI's .cmd_per_lun is very small, such as 3 on
> >> lpfc and qla2xxx, and one full flush will trigger
> >> BLK_STS_RESOURCE easily.
> >>
> >> So I suggest to use the way of q->queue_depth first, since we
> >> don't get performance degrade report on other devices(!q->queue_depth)
> >> with blk-mq. We can improve this way in the future if we
> >> have better approach.
> > 
> > Measurements have shown that even with this patch series applied sequential
> > I/O performance is still below that of the legacy block and SCSI layers. So
> > this patch series is not the final solution. (See also John Garry's e-mail
> > of October 10th - https://lkml.org/lkml/2017/10/10/401). I have been
> > wondering what could be causing that performance difference. Maybe it's
> > because requests can reside for a while in the hctx dispatch queue and hence
> > are unvisible for the scheduler while in the hctx dispatch queue? Should we
> > modify blk_mq_dispatch_rq_list() such that it puts back requests that have
> > not been accepted by .queue_rq() onto the scheduler queue(s) instead of to
> > the hctx dispatch queue? If we would make that change, would it allow us to
> > drop patch "blk-mq-sched: improve dispatching from sw queue"?
> 
> Yes, it's clear that even with the full series, we're not completely there
> yet. We are closer, though, and I do want to close that gap up as much
> as we can. I think everybody will be more motivated and have an easier time
> getting the last bit of the way there, once we have a good foundation in.
> 
> It may be the reason that you hint at, if we do see a lot of requeueing
> or BUSY in the test case. That would prematurely move requests from the
> schedulers knowledge and into the hctx->dispatch holding area. It'd be
> useful to have a standard SATA test run and see if we're missing merging
> in that case (since merging is what it boils down to). If we are, then
> it's not hctx->dispatch issues.

From Garry's test result on the .get_budget()/.put_budget() patches [1],
the sequential I/O performance is still not good, which means the
issue may not be in IO merge, because .get_budget/.put_budget is
more helpful for I/O merge than the legacy block path.

Actually in my virtio-scsi test, blk-mq has been better than block legacy
with the way of .get_budget()/.put_budget().


[1] 
https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V6.2_test


-- 
Ming
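
Since .get_budget()/.put_budget() keeps coming up in this thread, here is a
small sketch of the idea as discussed, not of the final blk-mq interface:
acquire one unit of per-queue budget before dequeuing from the scheduler, and
hand it back if nothing ends up being dispatched.

/* toy_budget.c - sketch of budget-gated dispatch; "queue_depth" plays the
 * role of a per-request-queue limit such as SCSI's .cmd_per_lun. */
#include <stdio.h>
#include <stdbool.h>

struct toy_queue {
        int queue_depth;       /* max requests the LLD will take */
        int in_flight;
};

static bool get_budget(struct toy_queue *q)
{
        if (q->in_flight >= q->queue_depth)
                return false;
        q->in_flight++;
        return true;
}

static void put_budget(struct toy_queue *q)
{
        q->in_flight--;
}

static int sched_next(int *pool)
{
        return *pool > 0 ? (*pool)-- : -1;   /* pretend scheduler queue */
}

int main(void)
{
        struct toy_queue q = { .queue_depth = 3 };
        int pool = 8;          /* eight requests sitting in the scheduler */

        for (;;) {
                int rq;

                if (!get_budget(&q)) {
                        printf("no budget, stop dequeuing (%d left queued)\n", pool);
                        break;
                }
                rq = sched_next(&pool);
                if (rq < 0) {
                        put_budget(&q);      /* nothing to dispatch, hand it back */
                        break;
                }
                printf("dispatching request %d, in_flight=%d\n", rq, q.in_flight);
        }
        return 0;
}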


Re: [PATCH V6 4/5] blk-mq-sched: improve dispatching from sw queue

2017-10-12 Thread Jens Axboe
On 10/12/2017 09:33 AM, Bart Van Assche wrote:
> On Thu, 2017-10-12 at 18:01 +0800, Ming Lei wrote:
>> Even EWMA approach isn't good on SCSI-MQ too, because
>> some SCSI's .cmd_per_lun is very small, such as 3 on
>> lpfc and qla2xxx, and one full flush will trigger
>> BLK_STS_RESOURCE easily.
>>
>> So I suggest to use the way of q->queue_depth first, since we
>> don't get performance degrade report on other devices(!q->queue_depth)
>> with blk-mq. We can improve this way in the future if we
>> have better approach.
> 
> Measurements have shown that even with this patch series applied sequential
> I/O performance is still below that of the legacy block and SCSI layers. So
> this patch series is not the final solution. (See also John Garry's e-mail
> of October 10th - https://lkml.org/lkml/2017/10/10/401). I have been
> wondering what could be causing that performance difference. Maybe it's
> because requests can reside for a while in the hctx dispatch queue and hence
> are unvisible for the scheduler while in the hctx dispatch queue? Should we
> modify blk_mq_dispatch_rq_list() such that it puts back requests that have
> not been accepted by .queue_rq() onto the scheduler queue(s) instead of to
> the hctx dispatch queue? If we would make that change, would it allow us to
> drop patch "blk-mq-sched: improve dispatching from sw queue"?

Yes, it's clear that even with the full series, we're not completely there
yet. We are closer, though, and I do want to close that gap up as much
as we can. I think everybody will be more motivated and have an easier time
getting the last bit of the way there, once we have a good foundation in.

It may be the reason that you hint at, if we do see a lot of requeueing
or BUSY in the test case. That would prematurely move requests from the
schedulers knowledge and into the hctx->dispatch holding area. It'd be
useful to have a standard SATA test run and see if we're missing merging
in that case (since merging is what it boils down to). If we are, then
it's not hctx->dispatch issues.

-- 
Jens Axboe



Re: [PATCH V6 4/5] blk-mq-sched: improve dispatching from sw queue

2017-10-12 Thread Bart Van Assche
On Thu, 2017-10-12 at 18:01 +0800, Ming Lei wrote:
> Even EWMA approach isn't good on SCSI-MQ too, because
> some SCSI's .cmd_per_lun is very small, such as 3 on
> lpfc and qla2xxx, and one full flush will trigger
> BLK_STS_RESOURCE easily.
> 
> So I suggest to use the way of q->queue_depth first, since we
> don't get performance degrade report on other devices(!q->queue_depth)
> with blk-mq. We can improve this way in the future if we
> have better approach.

Measurements have shown that even with this patch series applied sequential
I/O performance is still below that of the legacy block and SCSI layers. So
this patch series is not the final solution. (See also John Garry's e-mail
of October 10th - https://lkml.org/lkml/2017/10/10/401). I have been
wondering what could be causing that performance difference. Maybe it's
because requests can reside for a while in the hctx dispatch queue and hence
are unvisible for the scheduler while in the hctx dispatch queue? Should we
modify blk_mq_dispatch_rq_list() such that it puts back requests that have
not been accepted by .queue_rq() onto the scheduler queue(s) instead of to
the hctx dispatch queue? If we would make that change, would it allow us to
drop patch "blk-mq-sched: improve dispatching from sw queue"?

Bart.

Re: [PATCH V6 4/5] blk-mq-sched: improve dispatching from sw queue

2017-10-12 Thread Jens Axboe
On 10/12/2017 09:22 AM, Ming Lei wrote:
> On Thu, Oct 12, 2017 at 08:52:12AM -0600, Jens Axboe wrote:
>> On 10/12/2017 04:01 AM, Ming Lei wrote:
>>> On Tue, Oct 10, 2017 at 11:23:45AM -0700, Omar Sandoval wrote:
 On Mon, Oct 09, 2017 at 07:24:23PM +0800, Ming Lei wrote:
> SCSI devices use host-wide tagset, and the shared driver tag space is
> often quite big. Meantime there is also queue depth for each lun(
> .cmd_per_lun), which is often small, for example, on both lpfc and
> qla2xxx, .cmd_per_lun is just 3.
>
> So lots of requests may stay in sw queue, and we always flush all
> belonging to same hw queue and dispatch them all to driver, unfortunately
> it is easy to cause queue busy because of the small .cmd_per_lun.
> Once these requests are flushed out, they have to stay in hctx->dispatch,
> and no bio merge can participate into these requests, and sequential IO
> performance is hurt a lot.
>
> This patch introduces blk_mq_dequeue_from_ctx for dequeuing request from
> sw queue so that we can dispatch them in scheduler's way, then we can
> avoid to dequeue too many requests from sw queue when ->dispatch isn't
> flushed completely.
>
> This patch improves dispatching from sw queue when there is 
> per-request-queue
> queue depth by taking request one by one from sw queue, just like the way
> of IO scheduler.

 This still didn't address Jens' concern about using q->queue_depth as
 the heuristic for whether to do the full sw queue flush or one-by-one
 dispatch. The EWMA approach is a bit too complex for now, can you please
 try the heuristic of whether the driver ever returned BLK_STS_RESOURCE?
>>>
>>> That can be done easily, but I am not sure if it is good.
>>>
>>> For example, inside queue rq path of NVMe, kmalloc(GFP_ATOMIC) is
>>> often used, if kmalloc() returns NULL just once, BLK_STS_RESOURCE
>>> will be returned to blk-mq, then blk-mq will never do full sw
>>> queue flush even when kmalloc() always succeed from that time
>>> on.
>>
>> Have it be a bit more than a single bit, then. Reset it every x IOs or
>> something like that, that'll be more representative of transient busy
>> conditions anyway.
> 
> OK, that can be done via a simplified EWMA by considering
> the dispatch result only.

Yes, if it's kept simple enough, then that would be fine. I'm not totally
against EWMA, I just don't want to have any of this over-engineered.
Especially not when it's a pretty simple thing, we don't care about
averages, basically only if we ever see BLK_STS_RESOURCE in any kind
of recurring fashion.

-- 
Jens Axboe
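
The "simplified EWMA by considering the dispatch result only" could look
roughly like the sketch below; the 1/8 weight and the threshold are
assumptions for illustration, not values from any merged code. Each dispatch
outcome nudges a running busy average, and the full sw-queue flush is only
used while that average stays low.

/* toy_ewma.c - sketch only.  Build: cc -o toy toy_ewma.c */
#include <stdio.h>
#include <stdbool.h>

#define EWMA_SHIFT 3              /* new sample contributes 1/8 */
#define BUSY_THRESHOLD 16         /* scaled: avg above ~12% busy */

static unsigned int busy_ewma;    /* scaled by 128 to stay integer-only */

static void note_dispatch_result(bool was_busy)
{
        unsigned int sample = was_busy ? 128 : 0;

        /* ewma = ewma - ewma/8 + sample/8 */
        busy_ewma = busy_ewma - (busy_ewma >> EWMA_SHIFT) + (sample >> EWMA_SHIFT);
}

static bool prefer_full_flush(void)
{
        return busy_ewma <= BUSY_THRESHOLD;
}

int main(void)
{
        /* a run of successful dispatches, then a burst of busy
         * (BLK_STS_RESOURCE-style) results, then recovery */
        bool results[] = { 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
        unsigned int i;

        for (i = 0; i < sizeof(results) / sizeof(results[0]); i++) {
                note_dispatch_result(results[i]);
                printf("sample %2u: ewma=%3u -> %s\n", i, busy_ewma,
                       prefer_full_flush() ? "flush all" : "one by one");
        }
        return 0;
}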



Re: [PATCH V6 4/5] blk-mq-sched: improve dispatching from sw queue

2017-10-12 Thread Ming Lei
On Thu, Oct 12, 2017 at 08:52:12AM -0600, Jens Axboe wrote:
> On 10/12/2017 04:01 AM, Ming Lei wrote:
> > On Tue, Oct 10, 2017 at 11:23:45AM -0700, Omar Sandoval wrote:
> >> On Mon, Oct 09, 2017 at 07:24:23PM +0800, Ming Lei wrote:
> >>> SCSI devices use host-wide tagset, and the shared driver tag space is
> >>> often quite big. Meantime there is also queue depth for each lun(
> >>> .cmd_per_lun), which is often small, for example, on both lpfc and
> >>> qla2xxx, .cmd_per_lun is just 3.
> >>>
> >>> So lots of requests may stay in sw queue, and we always flush all
> >>> belonging to same hw queue and dispatch them all to driver, unfortunately
> >>> it is easy to cause queue busy because of the small .cmd_per_lun.
> >>> Once these requests are flushed out, they have to stay in hctx->dispatch,
> >>> and no bio merge can participate into these requests, and sequential IO
> >>> performance is hurt a lot.
> >>>
> >>> This patch introduces blk_mq_dequeue_from_ctx for dequeuing request from
> >>> sw queue so that we can dispatch them in scheduler's way, then we can
> >>> avoid to dequeue too many requests from sw queue when ->dispatch isn't
> >>> flushed completely.
> >>>
> >>> This patch improves dispatching from sw queue when there is 
> >>> per-request-queue
> >>> queue depth by taking request one by one from sw queue, just like the way
> >>> of IO scheduler.
> >>
> >> This still didn't address Jens' concern about using q->queue_depth as
> >> the heuristic for whether to do the full sw queue flush or one-by-one
> >> dispatch. The EWMA approach is a bit too complex for now, can you please
> >> try the heuristic of whether the driver ever returned BLK_STS_RESOURCE?
> > 
> > That can be done easily, but I am not sure if it is good.
> > 
> > For example, inside queue rq path of NVMe, kmalloc(GFP_ATOMIC) is
> > often used, if kmalloc() returns NULL just once, BLK_STS_RESOURCE
> > will be returned to blk-mq, then blk-mq will never do full sw
> > queue flush even when kmalloc() always succeed from that time
> > on.
> 
> Have it be a bit more than a single bit, then. Reset it every x IOs or
> something like that, that'll be more representative of transient busy
> conditions anyway.

OK, that can be done via a simplified EWMA by considering
the dispatch result only.

I will address it in V6.


-- 
Ming


Re: [PATCH V6 4/5] blk-mq-sched: improve dispatching from sw queue

2017-10-12 Thread Ming Lei
On Tue, Oct 10, 2017 at 11:23:45AM -0700, Omar Sandoval wrote:
> On Mon, Oct 09, 2017 at 07:24:23PM +0800, Ming Lei wrote:
> > SCSI devices use host-wide tagset, and the shared driver tag space is
> > often quite big. Meantime there is also queue depth for each lun(
> > .cmd_per_lun), which is often small, for example, on both lpfc and
> > qla2xxx, .cmd_per_lun is just 3.
> > 
> > So lots of requests may stay in sw queue, and we always flush all
> > belonging to same hw queue and dispatch them all to driver, unfortunately
> > it is easy to cause queue busy because of the small .cmd_per_lun.
> > Once these requests are flushed out, they have to stay in hctx->dispatch,
> > and no bio merge can participate into these requests, and sequential IO
> > performance is hurt a lot.
> > 
> > This patch introduces blk_mq_dequeue_from_ctx for dequeuing request from
> > sw queue so that we can dispatch them in scheduler's way, then we can
> > avoid to dequeue too many requests from sw queue when ->dispatch isn't
> > flushed completely.
> > 
> > This patch improves dispatching from sw queue when there is 
> > per-request-queue
> > queue depth by taking request one by one from sw queue, just like the way
> > of IO scheduler.
> 
> This still didn't address Jens' concern about using q->queue_depth as
> the heuristic for whether to do the full sw queue flush or one-by-one
> dispatch. The EWMA approach is a bit too complex for now, can you please
> try the heuristic of whether the driver ever returned BLK_STS_RESOURCE?

That can be done easily, but I am not sure if it is good.

For example, inside queue rq path of NVMe, kmalloc(GFP_ATOMIC) is
often used, if kmalloc() returns NULL just once, BLK_STS_RESOURCE
will be returned to blk-mq, then blk-mq will never do full sw
queue flush even when kmalloc() always succeed from that time
on.

Even EWMA approach isn't good on SCSI-MQ too, because
some SCSI's .cmd_per_lun is very small, such as 3 on
lpfc and qla2xxx, and one full flush will trigger
BLK_STS_RESOURCE easily.

So I suggest to use the way of q->queue_depth first, since we
don't get performance degrade report on other devices(!q->queue_depth)
with blk-mq. We can improve this way in the future if we
have better approach.

What do you think about it?


-- 
Ming


[PATCH v2] blk-throttle: track read and write request individually

2017-10-12 Thread Joseph Qi
From: Joseph Qi 

In a mixed read/write workload on SSD, write latency is much lower than
read latency. But currently we only track and record read latency and
then use it as the threshold base for both read and write io latency
accounting. As a result, write io latency will always be considered
good and bad_bio_cnt stays much smaller than 20% of bio_cnt. That means
the tg being checked will be treated as idle most of the time and will
keep letting others dispatch more ios, even when it is truly running
under its low limit and wants that low limit to be guaranteed, which is
not what we expect. So track read and write requests individually,
which brings more precise latency control for low limit idle detection.

Signed-off-by: Joseph Qi 
---
 block/blk-throttle.c | 134 ++-
 1 file changed, 79 insertions(+), 55 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 17816a0..0897378 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -215,9 +215,9 @@ struct throtl_data
 
unsigned int scale;
 
-   struct latency_bucket tmp_buckets[LATENCY_BUCKET_SIZE];
-   struct avg_latency_bucket avg_buckets[LATENCY_BUCKET_SIZE];
-   struct latency_bucket __percpu *latency_buckets;
+   struct latency_bucket tmp_buckets[2][LATENCY_BUCKET_SIZE];
+   struct avg_latency_bucket avg_buckets[2][LATENCY_BUCKET_SIZE];
+   struct latency_bucket __percpu *latency_buckets[2];
unsigned long last_calculate_time;
unsigned long filtered_latency;
 
@@ -2040,10 +2040,10 @@ static void blk_throtl_update_idletime(struct throtl_grp *tg)
 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
 static void throtl_update_latency_buckets(struct throtl_data *td)
 {
-   struct avg_latency_bucket avg_latency[LATENCY_BUCKET_SIZE];
-   int i, cpu;
-   unsigned long last_latency = 0;
-   unsigned long latency;
+   struct avg_latency_bucket avg_latency[2][LATENCY_BUCKET_SIZE];
+   int i, cpu, rw;
+   unsigned long last_latency[2] = { 0 };
+   unsigned long latency[2];
 
if (!blk_queue_nonrot(td->queue))
return;
@@ -2052,56 +2052,67 @@ static void throtl_update_latency_buckets(struct throtl_data *td)
td->last_calculate_time = jiffies;
 
memset(avg_latency, 0, sizeof(avg_latency));
-   for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
-   struct latency_bucket *tmp = &td->tmp_buckets[i];
-
-   for_each_possible_cpu(cpu) {
-   struct latency_bucket *bucket;
-
-   /* this isn't race free, but ok in practice */
-   bucket = per_cpu_ptr(td->latency_buckets, cpu);
-   tmp->total_latency += bucket[i].total_latency;
-   tmp->samples += bucket[i].samples;
-   bucket[i].total_latency = 0;
-   bucket[i].samples = 0;
-   }
+   for (rw = READ; rw <= WRITE; rw++) {
+   for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+   struct latency_bucket *tmp = &td->tmp_buckets[rw][i];
+
+   for_each_possible_cpu(cpu) {
+   struct latency_bucket *bucket;
+
+   /* this isn't race free, but ok in practice */
+   bucket = per_cpu_ptr(td->latency_buckets[rw],
+   cpu);
+   tmp->total_latency += bucket[i].total_latency;
+   tmp->samples += bucket[i].samples;
+   bucket[i].total_latency = 0;
+   bucket[i].samples = 0;
+   }
 
-   if (tmp->samples >= 32) {
-   int samples = tmp->samples;
+   if (tmp->samples >= 32) {
+   int samples = tmp->samples;
 
-   latency = tmp->total_latency;
+   latency[rw] = tmp->total_latency;
 
-   tmp->total_latency = 0;
-   tmp->samples = 0;
-   latency /= samples;
-   if (latency == 0)
-   continue;
-   avg_latency[i].latency = latency;
+   tmp->total_latency = 0;
+   tmp->samples = 0;
+   latency[rw] /= samples;
+   if (latency[rw] == 0)
+   continue;
+   avg_latency[rw][i].latency = latency[rw];
+   }
}
}
 
-   for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
-   if (!avg_latency[i].latency) {
-   if (td->avg_buckets[i].latency < last_latency)
-   
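
To make the motivation above concrete, here is a deliberately simplified
standalone model, with made-up latencies and a plain average instead of
blk-throttle's percpu buckets and bad_bio_cnt ratio, of why a read-derived
threshold never flags a slow write while a per-direction threshold does.

/* toy_latency.c - sketch only; the numbers are invented to illustrate an SSD
 * whose writes complete much faster than its reads. */
#include <stdio.h>

enum { R, W };

int main(void)
{
        /* pretend sampled completion latencies in microseconds */
        unsigned int read_lat[]  = { 140, 160, 150, 170, 155 };
        unsigned int write_lat[] = {  25,  30,  40, 120,  35 };
        unsigned int n_r = 5, n_w = 5, i;
        unsigned long sum = 0;
        unsigned int threshold[2], bad[2] = { 0, 0 };

        for (i = 0; i < n_r; i++)
                sum += read_lat[i];
        threshold[R] = sum / n_r;      /* read-derived threshold, ~155us */

        sum = 0;
        for (i = 0; i < n_w; i++)
                sum += write_lat[i];
        threshold[W] = sum / n_w;      /* write-derived threshold, ~50us */

        /* shared threshold: the 120us write, slow for this device,
         * still looks "good" against the read-derived base */
        for (i = 0; i < n_w; i++)
                if (write_lat[i] > threshold[R])
                        bad[R]++;

        /* per-direction threshold: the slow write is actually flagged */
        for (i = 0; i < n_w; i++)
                if (write_lat[i] > threshold[W])
                        bad[W]++;

        printf("bad writes vs read threshold (%uus):  %u\n", threshold[R], bad[R]);
        printf("bad writes vs write threshold (%uus): %u\n", threshold[W], bad[W]);
        return 0;
}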

Re: [PATCH V8 00/14] mmc: Add Command Queue support

2017-10-12 Thread Ulf Hansson
On 12 October 2017 at 10:08, Linus Walleij  wrote:
> On Wed, Oct 11, 2017 at 3:58 PM, Ulf Hansson  wrote:
>
>> Actually completing the request in the ->done callback, may still be
>> possible, because in principle it only needs to inform the other
>> prepared request that it may start, before it continues to post
>> process/completes the current one.
>
> I try to do that in this patch:
> https://marc.info/?l=linux-mmc&m=148665460925587&w=2

Yeah, something like that!

>
> I'm trying to rebase this work now.

Great!

>
>> However, by looking at for example how mmci.c works, it actually holds
>> its spinlock while it calls mmc_request_done(). The same spinlock is
>> taken in the ->request() function, but not in the ->post_req()
>> function. In other words, completing the request in the ->done()
>> callback, would make mmci to keep the spinlock held throughout the
>> post processing cycle, which then prevents the next request from being
>> started.
>
> It seems my patch would not deliver in some drivers until we look
> into the locking semantics in the drivers.

Yes!

As a matter of fact the above change may actually introduce
performance regressions, in case the behavior of the driver doesn't
follow the expectations from the core. For that reason, either we have
to invent a new MMC cap to allow drivers to opt-in using the
"shortcut" or fix host drivers first.

Kind regards
Uffe
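
A stripped-down illustration of the locking concern described above, in
generic C rather than the mmci driver: if the host lock is still held when
the completion callback runs, the callback cannot start the next prepared
request, because starting it needs the same lock.

/* toy_mmc_done.c - sketch only; host/request names are illustrative.
 * Build: cc -pthread -o toy toy_mmc_done.c */
#include <stdio.h>
#include <pthread.h>

static pthread_mutex_t host_lock = PTHREAD_MUTEX_INITIALIZER;

static void start_next_request(void)
{
        /* the ->request() path takes the host lock, like mmci does */
        pthread_mutex_lock(&host_lock);
        printf("next prepared request started\n");
        pthread_mutex_unlock(&host_lock);
}

static void request_done(void)
{
        /* completing early from ->done() wants to kick the next request */
        start_next_request();
        printf("previous request post-processed\n");
}

int main(void)
{
        /* irq handler path: update host state under the lock ... */
        pthread_mutex_lock(&host_lock);
        printf("irq: command finished\n");
        pthread_mutex_unlock(&host_lock);

        /* ... but drop the lock before calling the completion callback;
         * calling request_done() with host_lock still held would block
         * start_next_request() (here it would simply self-deadlock). */
        request_done();
        return 0;
}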


Re: [PATCH V8 00/14] mmc: Add Command Queue support

2017-10-12 Thread Linus Walleij
On Wed, Oct 11, 2017 at 3:58 PM, Ulf Hansson  wrote:

> Actually completing the request in the ->done callback, may still be
> possible, because in principle it only needs to inform the other
> prepared request that it may start, before it continues to post
> process/completes the current one.

I try to do that in this patch:
https://marc.info/?l=linux-mmc&m=148665460925587&w=2

I'm trying to rebase this work now.

> However, by looking at for example how mmci.c works, it actually holds
> its spinlock while it calls mmc_request_done(). The same spinlock is
> taken in the ->request() function, but not in the ->post_req()
> function. In other words, completing the request in the ->done()
> callback, would make mmci to keep the spinlock held throughout the
> post processing cycle, which then prevents the next request from being
> started.

It seems my patch would not deliver in some drivers until we look
into the locking semantics in the drivers.

Yours,
Linus Walleij


Re: [PATCH v3] blkcg: add sanity check for blkcg policy operations

2017-10-12 Thread Johannes Thumshirn
Looks good,
Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


[PATCH v3] blkcg: add sanity check for blkcg policy operations

2017-10-12 Thread weiping zhang
blkcg policy should keep cpd/pd's alloc_fn and free_fn in pairs;
otherwise registration of the policy will fail.

Signed-off-by: weiping zhang 
---
 block/blk-cgroup.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index e7ec676..6c5f5f3 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1419,6 +1419,13 @@ int blkcg_policy_register(struct blkcg_policy *pol)
if (i >= BLKCG_MAX_POLS)
goto err_unlock;
 
+   /* Make sure cpd_alloc_fn and cpd_free_fn in pairs */
+   if (!pol->cpd_alloc_fn ^ !pol->cpd_free_fn)
+   goto err_unlock;
+
+   if (!pol->pd_alloc_fn ^ !pol->pd_free_fn)
+   goto err_unlock;
+
/* register @pol */
pol->plid = i;
blkcg_policy[pol->plid] = pol;
-- 
2.9.4
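
The '!a ^ !b' test in the patch above is terse; here is a tiny standalone
demonstration of its truth table, purely illustrative and unrelated to the
kernel build.

/* toy_pair_check.c - shows that (!a ^ !b) is true exactly when one of the
 * two callbacks is set and the other is not, i.e. an unpaired alloc/free. */
#include <stdio.h>

static void cb(void) { }

static void check(void (*alloc_fn)(void), void (*free_fn)(void))
{
        printf("alloc=%s free=%s -> %s\n",
               alloc_fn ? "set" : "NULL",
               free_fn  ? "set" : "NULL",
               (!alloc_fn ^ !free_fn) ? "rejected" : "ok");
}

int main(void)
{
        check(NULL, NULL);   /* neither: ok, the policy skips these hooks */
        check(cb,   cb);     /* both: ok */
        check(cb,   NULL);   /* unpaired: rejected */
        check(NULL, cb);     /* unpaired: rejected */
        return 0;
}

Only the unpaired cases are rejected, which is exactly the sanity check the
commit message describes.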



Re: [PATCH v2] blkcg: add sanity check for blkcg policy operations

2017-10-12 Thread weiping zhang
On Wed, Oct 11, 2017 at 10:51:32AM -0600, Jens Axboe wrote:
> On 10/11/2017 03:46 AM, weiping zhang wrote:
> > blkcg policy should keep cpd/pd's alloc_fn and free_fn in pairs;
> > otherwise registration of the policy will fail.
> > 
> > Signed-off-by: weiping zhang 
> > ---
> >  block/blk-cgroup.c | 10 ++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > index e7ec676..67b01c5 100644
> > --- a/block/blk-cgroup.c
> > +++ b/block/blk-cgroup.c
> > @@ -1419,6 +1419,16 @@ int blkcg_policy_register(struct blkcg_policy *pol)
> > if (i >= BLKCG_MAX_POLS)
> > goto err_unlock;
> >  
> > +   /* Make sure cpd_alloc_fn and cpd_free_fn in pairs */
> > +   if ((pol->cpd_alloc_fn && !pol->cpd_free_fn) ||
> > +   (!pol->cpd_alloc_fn && pol->cpd_free_fn))
> > +   goto err_unlock;
> 
> This might be cleaner (and more readable) as:
> 
> if (!pol->cpd_alloc_fn ^ !pol->cpd_free_fn)
> goto err_unlock;
> 
> Ditto for the pd part.
> 
Really nice, I'll send v3.

Thanks