Re: [PATCH 01/12] iommu-common: move to arch/sparc

2018-04-16 Thread Anshuman Khandual
On 04/16/2018 07:28 PM, David Miller wrote:
> From: Anshuman Khandual 
> Date: Mon, 16 Apr 2018 14:26:07 +0530
> 
>> On 04/15/2018 08:29 PM, Christoph Hellwig wrote:
>>> This code is only used by sparc, and all new iommu drivers should use the
>>> drivers/iommu/ framework.  Also remove the unused exports.
>>>
>>> Signed-off-by: Christoph Hellwig 
>>
>> Right, these functions are used only by the SPARC architecture. A simple
>> git grep confirms it as well. Hence it makes sense to move them into
>> arch code instead.
> 
> Well, we put these into a common location and used type friendly for
> powerpc because we hoped powerpc would convert over to using this
> common piece of code as well.
> 
> But nobody did the powerpc work.
> 
> If you look at the powerpc iommu support, it's the same code basically
> for entry allocation.

I understand. But there are some differences in the iommu_table structure,
in how the regular and large IOMMU pools are initialized, etc. So if moving
the code into SPARC helps clean up these generic config options in general,
I guess we should do that. But I will leave it up to others who have more
experience in this area.

+mpe
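
For reference, the overlap David mentions is easy to see in the pool
structure both implementations define. A hedged sketch of the shared shape
(field names as in the sparc iommu-common and powerpc iommu code; the
comments are illustrative):

    /*
     * Sketch only: the table is carved into pools, each with its own
     * lock and allocation hint, so CPUs can allocate entries from
     * different pools concurrently.
     */
    struct iommu_pool {
            unsigned long   start;  /* first entry of this pool */
            unsigned long   end;    /* last entry + 1 */
            unsigned long   hint;   /* where the next search starts */
            spinlock_t      lock;
    };

Entry allocation then picks a pool (e.g. hashed from the CPU), takes that
pool's lock, and searches the bitmap from the hint onward -- which is the
part that is "basically the same" in both trees.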



Re: [PATCH] blk-mq: start request gstate with gen 1

2018-04-16 Thread Ming Lei
On Tue, Apr 17, 2018 at 11:46:20AM +0800, Jianchao Wang wrote:
> rq->gstate and rq->aborted_gstate are both zero before rqs are
> allocated. If we have a small timeout, when the timer fires,
> there could be rqs that are never allocated, and there could also
> be rqs that have been allocated but not yet initialized and started. At
> that moment, rq->gstate and rq->aborted_gstate are both 0, thus
> blk_mq_terminate_expired() will identify the rq as timed out and
> invoke .timeout early.
> 
> For scsi, this will cause scsi_times_out to be invoked before the
> scsi_cmnd is initialized; scsi_cmnd->device is still NULL at
> that moment, so we will get a crash.
> 
> Cc: Bart Van Assche 
> Cc: Tejun Heo 
> Cc: Ming Lei 
> Cc: Martin Steigerwald 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Jianchao Wang 
> ---
>  block/blk-core.c | 4 
>  block/blk-mq.c   | 7 +++
>  2 files changed, 11 insertions(+)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index abcb868..ce62681 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -201,6 +201,10 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
>   rq->part = NULL;
>   seqcount_init(&rq->gstate_seq);
>   u64_stats_init(&rq->aborted_gstate_sync);
> + /*
> +  * See comment of blk_mq_init_request
> +  */
> + WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
>  }
>  EXPORT_SYMBOL(blk_rq_init);
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f5c7dbc..d62030a 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2069,6 +2069,13 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
>  
>   seqcount_init(&rq->gstate_seq);
>   u64_stats_init(&rq->aborted_gstate_sync);
> + /*
> +  * start gstate with gen 1 instead of 0, otherwise it will be equal
> +  * to aborted_gstate, and be identified timed out by
> +  * blk_mq_terminate_expired.
> +  */
> + WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
> +
>   return 0;
>  }

Good catch, blk_mq_check_expired() is bypassed, but it is still hit
by blk_mq_terminate_expired().

Reviewed-by: Ming Lei 

-- 
Ming
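
To see concretely why starting at generation 1 closes the hole, here is a
small userspace model of the check (a sketch only; MQ_RQ_STATE_BITS and
MQ_RQ_GEN_INC mirror the blk-mq definitions of that era, and
looks_timed_out() stands in for the real gstate/aborted_gstate comparison):

    #include <assert.h>
    #include <stdint.h>

    #define MQ_RQ_STATE_BITS 2
    #define MQ_RQ_GEN_INC    (1 << MQ_RQ_STATE_BITS)

    /* blk_mq_terminate_expired() fires the timeout when the current
     * generation matches the recorded abort generation. */
    static int looks_timed_out(uint64_t gstate, uint64_t aborted_gstate)
    {
            return gstate == aborted_gstate;
    }

    int main(void)
    {
            /* Before the fix: both fields start at 0 -> false positive. */
            assert(looks_timed_out(0, 0));
            /* After the fix: gstate starts one generation ahead. */
            assert(!looks_timed_out(MQ_RQ_GEN_INC, 0));
            return 0;
    }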


Re: [PATCH] blk-mq: start request gstate with gen 1

2018-04-16 Thread Jens Axboe
On 4/16/18 9:46 PM, Jianchao Wang wrote:
> rq->gstate and rq->aborted_gstate are both zero before rqs are
> allocated. If we have a small timeout, when the timer fires,
> there could be rqs that are never allocated, and there could also
> be rqs that have been allocated but not yet initialized and started. At
> that moment, rq->gstate and rq->aborted_gstate are both 0, thus
> blk_mq_terminate_expired() will identify the rq as timed out and
> invoke .timeout early.
> 
> For scsi, this will cause scsi_times_out to be invoked before the
> scsi_cmnd is initialized; scsi_cmnd->device is still NULL at
> that moment, so we will get a crash.

Oops, this looks good to me. Applied.

-- 
Jens Axboe



Re: [PATCH v2 00/10] SWIM driver fixes

2018-04-16 Thread Jens Axboe
On Tue, Apr 17 2018, Finn Thain wrote:
> On Wed, 11 Apr 2018, I wrote:
> 
> > This patch series has fixes for bugs in the SWIM floppy disk controller 
> > driver, including an oops and a soft lockup.
> > 
> 
> Apparently no-one is authorized to push this series intact.
> 
> Geert, would you please take just the first two patches?
> 
> Jens, would you please take the remaining 8 patches?
> 
> I have confirmed that no merge conflicts or bisection issues arise from 
> splitting up this series in this way. However, I can re-send these patches 
> as separate submission(s) if need be...

I've picked up 3-10, thanks.

-- 
Jens Axboe



Re: [PATCH V4 0/2] blk-mq: fix race between completion and BLK_EH_RESET_TIMER

2018-04-16 Thread jianchao.wang
Hi Bart,

Thanks for your kind response.
I have sent out the patch. Please refer to
https://marc.info/?l=linux-block&m=152393666517449&w=2

Thanks
Jianchao

On 04/17/2018 08:15 AM, Bart Van Assche wrote:
> On Tue, 2018-04-17 at 00:04 +0800, jianchao.wang wrote:
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 16e83e6..be9b435 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -2077,6 +2077,7 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
>>  
>> seqcount_init(&rq->gstate_seq);
>> u64_stats_init(&rq->aborted_gstate_sync);
>> +   WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
>> return 0;
>>  }
> 
> Hello Jianchao,
> 
> Your approach looks interesting to me. Can you send an official patch to Jens?
> 
> Thanks,
> 
> Bart.
> 
> 
> 
> 


[PATCH] blk-mq: start request gstate with gen 1

2018-04-16 Thread Jianchao Wang
rq->gstate and rq->aborted_gstate are both zero before rqs are
allocated. If we have a small timeout, when the timer fires,
there could be rqs that are never allocated, and there could also
be rqs that have been allocated but not yet initialized and started. At
that moment, rq->gstate and rq->aborted_gstate are both 0, thus
blk_mq_terminate_expired() will identify the rq as timed out and
invoke .timeout early.

For scsi, this will cause scsi_times_out to be invoked before the
scsi_cmnd is initialized; scsi_cmnd->device is still NULL at
that moment, so we will get a crash.

Cc: Bart Van Assche 
Cc: Tejun Heo 
Cc: Ming Lei 
Cc: Martin Steigerwald 
Cc: sta...@vger.kernel.org
Signed-off-by: Jianchao Wang 
---
 block/blk-core.c | 4 
 block/blk-mq.c   | 7 +++
 2 files changed, 11 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index abcb868..ce62681 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -201,6 +201,10 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
	rq->part = NULL;
	seqcount_init(&rq->gstate_seq);
	u64_stats_init(&rq->aborted_gstate_sync);
+	/*
+	 * See comment of blk_mq_init_request
+	 */
+	WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f5c7dbc..d62030a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2069,6 +2069,13 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 
	seqcount_init(&rq->gstate_seq);
	u64_stats_init(&rq->aborted_gstate_sync);
+	/*
+	 * start gstate with gen 1 instead of 0, otherwise it will be equal
+	 * to aborted_gstate, and be identified timed out by
+	 * blk_mq_terminate_expired.
+	 */
+	WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
+
return 0;
 }
 
-- 
2.7.4



Re: usercopy whitelist woe in scsi_sense_cache

2018-04-16 Thread Kees Cook
On Mon, Apr 16, 2018 at 1:44 PM, Kees Cook  wrote:
> On Thu, Apr 12, 2018 at 8:02 PM, Kees Cook  wrote:
>> On Thu, Apr 12, 2018 at 3:47 PM, Kees Cook  wrote:
>>> After fixing up some build issues in the middle of the 4.16 cycle, I
>>> get an unhelpful bisect result of commit 0a4b6e2f80aa ("Merge branch
>>> 'for-4.16/block'"). Instead of letting the test run longer, I'm going
>>> to switch to doing several shorter test boots per kernel and see if
>>> that helps. One more bisect coming...
>>
>> Okay, so I can confirm the bisect points at the _merge_ itself, not a
>> specific patch. I'm not sure how to proceed here. It looks like some
>> kind of interaction between separate trees? Jens, do you have
>> suggestions on how to track this down?
>
> Turning off HARDENED_USERCOPY and turning on KASAN, I see the same report:
>
> [   38.274106] BUG: KASAN: slab-out-of-bounds in _copy_to_user+0x42/0x60
> [   38.274841] Read of size 22 at addr 8800122b8c4b by task smartctl/1064
> [   38.275630]
> [   38.275818] CPU: 2 PID: 1064 Comm: smartctl Not tainted 4.17.0-rc1-ARCH+ #266
> [   38.276631] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
> BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [   38.277690] Call Trace:
> [   38.277988]  dump_stack+0x71/0xab
> [   38.278397]  ? _copy_to_user+0x42/0x60
> [   38.278833]  print_address_description+0x6a/0x270
> [   38.279368]  ? _copy_to_user+0x42/0x60
> [   38.279800]  kasan_report+0x243/0x360
> [   38.280221]  _copy_to_user+0x42/0x60
> [   38.280635]  sg_io+0x459/0x660
> ...
>
> Though we get slightly more details (some we already knew):
>
> [   38.301330] Allocated by task 329:
> [   38.301734]  kmem_cache_alloc_node+0xca/0x220
> [   38.302239]  scsi_mq_init_request+0x64/0x130 [scsi_mod]
> [   38.302821]  blk_mq_alloc_rqs+0x2cf/0x370
> [   38.303265]  blk_mq_sched_alloc_tags.isra.4+0x7d/0xb0
> [   38.303820]  blk_mq_init_sched+0xf0/0x220
> [   38.304268]  elevator_switch+0x17a/0x2c0
> [   38.304705]  elv_iosched_store+0x173/0x220
> [   38.305171]  queue_attr_store+0x72/0xb0
> [   38.305602]  kernfs_fop_write+0x188/0x220
> [   38.306049]  __vfs_write+0xb6/0x330
> [   38.306436]  vfs_write+0xe9/0x240
> [   38.306804]  ksys_write+0x98/0x110
> [   38.307181]  do_syscall_64+0x6d/0x1d0
> [   38.307590]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [   38.308142]
> [   38.308316] Freed by task 0:
> [   38.308652] (stack is not available)
> [   38.309060]
> [   38.309243] The buggy address belongs to the object at 8800122b8c00
> [   38.309243]  which belongs to the cache scsi_sense_cache of size 96
> [   38.310625] The buggy address is located 75 bytes inside of
> [   38.310625]  96-byte region [8800122b8c00, 8800122b8c60)

With a hardware watchpoint, I've isolated the corruption to here:

bfq_dispatch_request+0x2be/0x1610:
__bfq_dispatch_request at block/bfq-iosched.c:3902
3900		if (rq) {
3901	inc_in_driver_start_rq:
3902			bfqd->rq_in_driver++;
3903	start_rq:
3904			rq->rq_flags |= RQF_STARTED;
3905		}

Through some race condition(?), rq_in_driver is also sense_buffer, and
it can get incremented. Specifically, I am doing:

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 25c14c58385c..f50d5053d732 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -6,6 +6,8 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/blk-mq.h>
+#include <linux/hw_breakpoint.h>
+#include <linux/perf_event.h>
 
 #include <trace/events/block.h>

@@ -428,6 +430,18 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *const hctx)
}
 }

+static void sample_hbp_handler(struct perf_event *bp,
+   struct perf_sample_data *data,
+   struct pt_regs *regs)
+{
+printk(KERN_INFO "sense_buffer value is changed\n");
+dump_stack();
+printk(KERN_INFO "Dump stack from sample_hbp_handler\n");
+}
+
+struct perf_event * __percpu *sample_hbp;
+int ok_to_go;
+
 void blk_mq_sched_insert_request(struct request *rq, bool at_head,
 bool run_queue, bool async)
 {
@@ -435,6 +449,16 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
struct elevator_queue *e = q->elevator;
struct blk_mq_ctx *ctx = rq->mq_ctx;
struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+   struct perf_event_attr attr;
+   struct scsi_request *req = scsi_req(rq);
+
+   if (ok_to_go) {
+   hw_breakpoint_init(&attr);
+   attr.bp_addr = (uintptr_t)&(req->sense);
+   attr.bp_len = HW_BREAKPOINT_LEN_8;
+   attr.bp_type = HW_BREAKPOINT_W;
+   sample_hbp = register_wide_hw_breakpoint(&attr, sample_hbp_handler, NULL);
+   }

/* flush rq in flush machinery need to be dispatched directly */
if (!(rq->rq_flags & RQF_FLUSH_SEQ) && op_is_flush(rq->cmd_flags)) {
@@ -461,6 +485,9 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
 run:

[PATCH] nvme: lightnvm: add granby support

2018-04-16 Thread Wei Xu
Add a new lightnvm quirk to identify CNEX’s Granby controller.

Signed-off-by: Wei Xu 
---
 drivers/nvme/host/pci.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index cb73bc8..9419e88 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2529,6 +2529,8 @@ static const struct pci_device_id nvme_id_table[] = {
.driver_data = NVME_QUIRK_LIGHTNVM, },
{ PCI_DEVICE(0x1d1d, 0x2807),   /* CNEX WL */
.driver_data = NVME_QUIRK_LIGHTNVM, },
+   { PCI_DEVICE(0x1d1d, 0x2601),   /* CNEX Granby */
+   .driver_data = NVME_QUIRK_LIGHTNVM, },
{ PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xff) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2003) },
-- 
2.7.4



[PATCH] block: Avoid executing a report or reset zones while a queue is frozen

2018-04-16 Thread Bart Van Assche
This patch by itself does not change the behavior of either ioctl.
However, it is necessary to prevent these ioctls from failing
with -EIO if sd_revalidate_disk() is called while they are
in progress, because the current zoned block command code temporarily
clears data that is needed by these ioctls. See also commit
3ed05a987e0f ("blk-zoned: implement ioctls").

Signed-off-by: Bart Van Assche 
Cc: Martin K. Petersen 
Cc: Damien Le Moal 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Shaun Tancheff 
Cc: sta...@vger.kernel.org # v4.10
---
 block/blk-zoned.c | 42 +-
 1 file changed, 29 insertions(+), 13 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 20bfc37e1852..acc71e8c3473 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -127,15 +127,19 @@ int blkdev_report_zones(struct block_device *bdev,
if (!q)
return -ENXIO;
 
+   blk_queue_enter(q, 0);
+
+   ret = -EOPNOTSUPP;
if (!blk_queue_is_zoned(q))
-   return -EOPNOTSUPP;
+   goto exit_queue;
 
+   ret = 0;
if (!nrz)
-   return 0;
+   goto exit_queue;
 
if (sector > bdev->bd_part->nr_sects) {
*nr_zones = 0;
-   return 0;
+   goto exit_queue;
}
 
/*
@@ -154,9 +158,10 @@ int blkdev_report_zones(struct block_device *bdev,
nr_pages = min_t(unsigned int, nr_pages,
 queue_max_segments(q));
 
+   ret = -ENOMEM;
bio = bio_alloc(gfp_mask, nr_pages);
if (!bio)
-   return -ENOMEM;
+   goto exit_queue;
 
bio_set_dev(bio, bdev);
bio->bi_iter.bi_sector = blk_zone_start(q, sector);
@@ -166,7 +171,7 @@ int blkdev_report_zones(struct block_device *bdev,
page = alloc_page(gfp_mask);
if (!page) {
ret = -ENOMEM;
-   goto out;
+   goto put_bio;
}
if (!bio_add_page(bio, page, PAGE_SIZE, 0)) {
__free_page(page);
@@ -179,7 +184,7 @@ int blkdev_report_zones(struct block_device *bdev,
else
ret = submit_bio_wait(bio);
if (ret)
-   goto out;
+   goto put_bio;
 
/*
 * Process the report result: skip the header and go through the
@@ -222,11 +227,14 @@ int blkdev_report_zones(struct block_device *bdev,
}
 
*nr_zones = nz;
-out:
+put_bio:
bio_for_each_segment_all(bv, bio, i)
__free_page(bv->bv_page);
bio_put(bio);
 
+exit_queue:
+   blk_queue_exit(q);
+
return ret;
 }
 EXPORT_SYMBOL_GPL(blkdev_report_zones);
@@ -256,21 +264,25 @@ int blkdev_reset_zones(struct block_device *bdev,
if (!q)
return -ENXIO;
 
+   blk_queue_enter(q, 0);
+
+   ret = -EOPNOTSUPP;
if (!blk_queue_is_zoned(q))
-   return -EOPNOTSUPP;
+   goto out;
 
+   ret = -EINVAL;
if (end_sector > bdev->bd_part->nr_sects)
/* Out of range */
-   return -EINVAL;
+   goto out;
 
/* Check alignment (handle eventual smaller last zone) */
zone_sectors = blk_queue_zone_sectors(q);
if (sector & (zone_sectors - 1))
-   return -EINVAL;
+   goto out;
 
if ((nr_sectors & (zone_sectors - 1)) &&
end_sector != bdev->bd_part->nr_sects)
-   return -EINVAL;
+   goto out;
 
while (sector < end_sector) {
 
@@ -283,7 +295,7 @@ int blkdev_reset_zones(struct block_device *bdev,
bio_put(bio);
 
if (ret)
-   return ret;
+   goto out;
 
sector += zone_sectors;
 
@@ -292,7 +304,11 @@ int blkdev_reset_zones(struct block_device *bdev,
 
}
 
-   return 0;
+   ret = 0;
+
+out:
+   blk_queue_exit(q);
+   return ret;
 }
 EXPORT_SYMBOL_GPL(blkdev_reset_zones);
 
-- 
2.16.3
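
The change above reduces to one pattern: hold a queue reference for the
whole duration of the zone command and funnel every error path through
blk_queue_exit(), so a concurrent freeze (e.g. from sd_revalidate_disk())
cannot pull data out from under the ioctl. A minimal sketch of that
pattern follows; do_zone_work() is a hypothetical stand-in for the
report/reset body, not an existing function:

    static int zoned_cmd_sketch(struct request_queue *q)
    {
            int ret;

            blk_queue_enter(q, 0);          /* pairs with blk_queue_exit() */

            ret = -EOPNOTSUPP;
            if (!blk_queue_is_zoned(q))
                    goto out;

            ret = do_zone_work(q);          /* hypothetical helper */
    out:
            blk_queue_exit(q);              /* reached on every path */
            return ret;
    }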



Re: [PATCH V4 0/2] blk-mq: fix race between completion and BLK_EH_RESET_TIMER

2018-04-16 Thread Bart Van Assche
On Tue, 2018-04-17 at 00:04 +0800, jianchao.wang wrote:
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 16e83e6..be9b435 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2077,6 +2077,7 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
>  
> seqcount_init(&rq->gstate_seq);
> u64_stats_init(&rq->aborted_gstate_sync);
> +   WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
> return 0;
>  }

Hello Jianchao,

Your approach looks interesting to me. Can you send an official patch to Jens?

Thanks,

Bart.






Re: [PATCH v2 00/10] SWIM driver fixes

2018-04-16 Thread Finn Thain
On Wed, 11 Apr 2018, I wrote:

> This patch series has fixes for bugs in the SWIM floppy disk controller 
> driver, including an oops and a soft lockup.
> 

Apparently no-one is authorized to push this series intact.

Geert, would you please take just the first two patches?

Jens, would you please take the remaining 8 patches?

I have confirmed that no merge conflicts or bisection issues arise from 
splitting up this series in this way. However, I can re-send these patches 
as separate submission(s) if need be...

-- 


Re: [RESEND PATCH v1 1/2] trace: events: scsi: Add tag in SCSI trace events

2018-04-16 Thread Steven Rostedt
On Mon, 16 Apr 2018 21:30:54 +
Bart Van Assche  wrote:

> Hello Steve,
> 
> The tool I'm most concerned about is blktrace. I'm not sure though how this
> tool receives event data from the block layer core.

Yeah, blktrace is "special": it registers its own callbacks on the
tracepoints and writes the data to its own relay buffer. As it does not
rely on the output from the tracing directory, additional fields being
added shouldn't affect it.

Looking at the trace event "block_rq_requeue" we have in the blktrace
kernel code:

static void blk_register_tracepoints(void)
{
int ret;

ret = register_trace_block_rq_insert(blk_add_trace_rq_insert, NULL);


Where the callback blk_add_trace_rq_insert() gets called when the
trace event is hit.

static void blk_add_trace_rq_insert(void *ignore,
struct request_queue *q, struct request *rq)
{
blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_INSERT,
 blk_trace_request_get_cgid(q, rq));
}

Where:

static void blk_add_trace_rq(struct request *rq, int error,
 unsigned int nr_bytes, u32 what,
 union kernfs_node_id *cgid)
{

calls

__blk_add_trace(bt, blk_rq_trace_sector(rq), nr_bytes, req_op(rq),
rq->cmd_flags, what, error, 0, NULL, cgid);

Which calls either the ftrace tracing file or its own relay buffer.

Looking at the code from
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/blktrace.git

It appears that it does not rely on the ftrace ring buffers.

So I'm guessing blktrace is not affected by this patch.

-- Steve
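
For illustration, the registration pattern Steve describes boils down to
attaching a probe directly to the tracepoint; a hedged sketch, with the
probe signature following the block_rq_insert prototype quoted above:

    static void my_rq_insert_probe(void *ignore,
                                   struct request_queue *q,
                                   struct request *rq)
    {
            /* consumes the raw tracepoint arguments; the TP_printk()
             * format string is never involved on this path */
    }

    static int attach_probe(void)
    {
            return register_trace_block_rq_insert(my_rq_insert_probe, NULL);
    }

This is why adding fields to the printed format cannot break blktrace: its
probes see the raw (q, rq) arguments, not the formatted text.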


Re: [RESEND PATCH v1 1/2] trace: events: scsi: Add tag in SCSI trace events

2018-04-16 Thread Bart Van Assche
On Mon, 2018-04-16 at 17:24 -0400, Steven Rostedt wrote:
> On Mon, 16 Apr 2018 20:49:12 +
> Bart Van Assche  wrote:
> 
> > Which tools process these strings? Has it been verified whether or not
> > the tools that process these strings still work fine with this patch
> > applied?
> 
> Ideally, tools shouldn't process trace event strings, but I'm sure some
> do. :-/
> 
> Getting libtraceevent out as a library is a high priority of mine, and
> hopefully I should get something out in a couple of months.
> 
> Ideally, tools should parse the raw data and then new fields will not
> affect them.

Hello Steve,

The tool I'm most concerned about is blktrace. I'm not sure though how this
tool receives event data from the block layer core.

Thanks,

Bart.




Re: [RESEND PATCH v1 1/2] trace: events: scsi: Add tag in SCSI trace events

2018-04-16 Thread Steven Rostedt
On Mon, 16 Apr 2018 20:49:12 +
Bart Van Assche  wrote:

> Which tools process these strings? Has it been verified whether or not
> the tools that process these strings still work fine with this patch
> applied?

Ideally, tools shouldn't process trace event strings, but I'm sure some
do. :-/

Getting libtraceevent out as a library is a high priority of mine, and
hopefully I should get something out in a couple of months.

Ideally, tools should parse the raw data and then new fields will not
affect them.

-- Steve


Re: [RESEND PATCH v1 1/2] trace: events: scsi: Add tag in SCSI trace events

2018-04-16 Thread Bart Van Assche
On Mon, 2018-04-16 at 21:10 +, Bean Huo (beanhuo) wrote:
> Hi, Bart
> 
> > mi...@redhad.com; linux-block@vger.kernel.org; raja...@google.com
> > Subject: [EXT] Re: [RESEND PATCH v1 1/2] trace: events: scsi: Add tag in 
> > SCSI
> > trace events
> > 
> > On Mon, 2018-04-16 at 14:31 +, Bean Huo (beanhuo) wrote:
> > >   TP_printk("host_no=%u channel=%u id=%u lun=%u data_sgl=%u
> > 
> > prot_sgl=%u" \
> > > -   " prot_op=%s cmnd=(%s %s raw=%s)",
> > > +   " prot_op=%s tag=%d cmnd=(%s %s raw=%s)",
> > > 
> > > [ ... ]
> > >   TP_printk("host_no=%u channel=%u id=%u lun=%u data_sgl=%u
> > 
> > prot_sgl=%u" \
> > > -   " prot_op=%s cmnd=(%s %s raw=%s) rtn=%d",
> > > +   " prot_op=%s tag=%d cmnd=(%s %s raw=%s) rtn=%d",
> > > [ ... ]
> > >   TP_printk("host_no=%u channel=%u id=%u lun=%u data_sgl=%u " \
> > > -   "prot_sgl=%u prot_op=%s cmnd=(%s %s raw=%s)
> > 
> > result=(driver=" \
> > > -   "%s host=%s message=%s status=%s)",
> > > +   "prot_sgl=%u prot_op=%s tag=%d cmnd=(%s %s raw=%s) " \
> > > +   "result=(driver=%s host=%s message=%s status=%s)",
> > 
> > Which tools process these strings? Has it been verified whether or not the
> > tools that process these strings still work fine with this patch applied?
> > 
> 
> I don't use any particular tool to analyze these strings; I am using 
> ftrace with events.
> With the tag information, I can see how many tasks are in the storage 
> device and can easily trace each request under a multi-threaded workload. 
> Even if someone is using a certain tool to analyze these strings, after 
> adding the additional tag printout they would, in my opinion, be happy 
> to see it there. 

Since you want to change these strings it is your job to determine which
user space tools parse these strings and also whether or not this change
will break any of these tools.

Thanks,

Bart.





Re: [RESEND PATCH v1 1/2] trace: events: scsi: Add tag in SCSI trace events

2018-04-16 Thread Bean Huo (beanhuo)
Hi, Bart

>mi...@redhad.com; linux-block@vger.kernel.org; raja...@google.com
>Subject: [EXT] Re: [RESEND PATCH v1 1/2] trace: events: scsi: Add tag in SCSI
>trace events
>
>On Mon, 2018-04-16 at 14:31 +, Bean Huo (beanhuo) wrote:
>>  TP_printk("host_no=%u channel=%u id=%u lun=%u data_sgl=%u
>prot_sgl=%u" \
>> -  " prot_op=%s cmnd=(%s %s raw=%s)",
>> +  " prot_op=%s tag=%d cmnd=(%s %s raw=%s)",
>>
>> [ ... ]
>>  TP_printk("host_no=%u channel=%u id=%u lun=%u data_sgl=%u
>prot_sgl=%u" \
>> -  " prot_op=%s cmnd=(%s %s raw=%s) rtn=%d",
>> +  " prot_op=%s tag=%d cmnd=(%s %s raw=%s) rtn=%d",
>> [ ... ]
>>  TP_printk("host_no=%u channel=%u id=%u lun=%u data_sgl=%u " \
>> -  "prot_sgl=%u prot_op=%s cmnd=(%s %s raw=%s)
>result=(driver=" \
>> -  "%s host=%s message=%s status=%s)",
>> +  "prot_sgl=%u prot_op=%s tag=%d cmnd=(%s %s raw=%s) " \
>> +  "result=(driver=%s host=%s message=%s status=%s)",
>
>Which tools process these strings? Has it been verified whether or not the
>tools that process these strings still work fine with this patch applied?
> 

I don't use any particular tool to analyze these strings; I am using ftrace 
with events.
With the tag information, I can see how many tasks are in the storage device 
and can easily trace each request under a multi-threaded workload. 
Even if someone is using a certain tool to analyze these strings, after 
adding the additional tag printout they would, in my opinion, be happy to 
see it there. 

Again, you said that if we add a new feature to the legacy block layer, we 
must also add it to blk-mq.
I still don't understand this point. Will "include/trace/events/block.h ... 
scsi.h" be changed?
If yes, how? Because blk-mq is still using the events defined in 
include/trace/events/block.h.

Bean Huo

>Thanks,
>
>Bart.
>



[PATCH] blktests: regression test "block: do not use interruptible wait anywhere"

2018-04-16 Thread Alan Jenkins
> Without this fix, I get an IO error in this test:
>
> # dd if=/dev/sda of=/dev/null iflag=direct & \
>   while killall -SIGUSR1 dd; do sleep 0.1; done & \
>   echo mem > /sys/power/state ; \
>   sleep 5; killall dd  # stop after 5 seconds

linux-block insisted they wanted this, based on my reproducer above.
If you start wondering why you wouldn't base it on scsi_debug with a new
"quiesce" option, that makes two of us.

Thread: http://lkml.kernel.org/r/891e334c-cf19-032c-b996-59ac166fc...@gmail.com

Signed-off-by: Alan Jenkins 
---
v2: fix sense of conditional test for HAVE_BARE_METAL_SCSI.
Also I had a few single-bracket tests, which wanted to be replaced
with double-brackets to match the coding style.

 tests/scsi/004 | 235 +
 tests/scsi/004.out |   7 ++
 2 files changed, 242 insertions(+)
 create mode 100755 tests/scsi/004
 create mode 100644 tests/scsi/004.out

diff --git a/tests/scsi/004 b/tests/scsi/004
new file mode 100755
index 000..ef42033
--- /dev/null
+++ b/tests/scsi/004
@@ -0,0 +1,235 @@
+#!/bin/bash
+#
+# Regression test for patch "block: do not use interruptible wait anywhere".
+#
+# > Without this fix, I get an IO error in this test:
+# >
+# > # dd if=/dev/sda of=/dev/null iflag=direct & \
+# >   while killall -SIGUSR1 dd; do sleep 0.1; done & \
+# >   echo mem > /sys/power/state ; \
+# >   sleep 5; killall dd  # stop after 5 seconds
+#
+# AJ: linux-block insisted they wanted this, based on my reproducer above.
+# If you start wondering why you wouldn't base it on scsi_debug with a new
+# "quiesce" option, that makes two of us.
+# Thread: 
http://lkml.kernel.org/r/891e334c-cf19-032c-b996-59ac166fc...@gmail.com
+#
+#
+# RATIONALE for a suspend test:
+#
+# The original root cause issue was the behaviour around blk_queue_freeze().
+# It put tasks into an interruptible wait, which is wrong for block devices.
+#
+# The freeze feature is not directly exposed to userspace, so I can not test
+# it directly :(.  (It's used to "guarantee no request is in use, so we can
+# change any data structure of the queue afterward".  I.e. freeze, modify the
+# queue structure, unfreeze).
+#
+# However, this led to a regression with a decent reproducer.  In v4.15 the
+# same interruptible wait was also used for SCSI suspend/resume.  SCSI resume
+# can take a second or so... hence we like to do it asynchronously.  This
+# means we can observe the wait at resume time, and we can test if it is
+# interruptible.
+#
+# Note `echo quiesce > /sys/class/scsi_device/*/device/state` can *not*
+# trigger the specific wait in the block layer.  That code path only
+# sets the SCSI device state; it does not set any block device state.
+# (It does not call into blk_queue_freeze() or blk_set_preempt_only();
+#  it literally just sets sdev->sdev_state to SDEV_QUIESCE).
+#
+#
+# Copyright (C) 2018 Alan Jenkins
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+
+DESCRIPTION="check SCSI blockdev suspend is not interruptible"
+
+QUICK=1
+
+requires() {
+   # I can't expect to hit the window using bash, if the device is
+   # emulated by cpu.
+   #
+   # Maybe annoying for Xen dom0, but I'm guessing it's not common.
+   if grep -q ^flags.*\ hypervisor /proc/cpuinfo &&
+  (( !HAVE_BARE_METAL_SCSI )); then
+   SKIP_REASON=\
+"Hypervisor detected, but this test wants bare-metal SCSI timings.
+If you have a pass-through device, you may set HAVE_BARE_METAL_SCSI=1."
+   return 1
+   fi
+
+   # "If a user has disabled async probing a likely reason
+   #  is due to a storage enclosure that does not inject
+   #  staggered spin-ups. For safety, make resume
+   #  synchronous as well in that case."
+   if ! scan="$(cat /sys/module/scsi_mod/parameters/scan)"; then
+   SKIP_REASON="Could not read 
'/sys/module/scsi_mod/parameters/scan'"
+   return 1
+   fi
+   if [[ "$scan" != async ]]; then
+   SKIP_REASON="This test does not work if you have set 
'scsi_mod.scan=sync'"
+   return 1
+   fi
+
+   if ! cat /sys/power/pm_test > /dev/null; then
+   SKIP_REASON="Error reading pm_test.  Maybe kernel lacks 
CONFIG_PM_TEST?"
+   return 1
+   fi
+
+   _have_fio
+}
+
+do_test_device() ( # run whole function in a subshell

Re: [RESEND PATCH v1 2/2] trace: events: block: Add tag in block trace events

2018-04-16 Thread Bart Van Assche
On Mon, 2018-04-16 at 14:33 +, Bean Huo (beanhuo) wrote:
> [ ... ]
> - TP_printk("%d,%d %s (%s) %llu + %u [%d]",
> + TP_printk("%d,%d %s (%s) %llu + %u tag=%d [%d]",
> [ ... ]
> - TP_printk("%d,%d %s (%s) %llu + %u [%d]",
> + TP_printk("%d,%d %s (%s) %llu + %u tag=%d [%d]",
> [ ... ]

Which tools process these strings? Has it been verified whether or not
the tools that process these strings still work fine with this patch
applied?

Thanks,

Bart.





Re: [RESEND PATCH v1 1/2] trace: events: scsi: Add tag in SCSI trace events

2018-04-16 Thread Bart Van Assche
On Mon, 2018-04-16 at 14:31 +, Bean Huo (beanhuo) wrote:
>   TP_printk("host_no=%u channel=%u id=%u lun=%u data_sgl=%u prot_sgl=%u" \
> -   " prot_op=%s cmnd=(%s %s raw=%s)",
> +   " prot_op=%s tag=%d cmnd=(%s %s raw=%s)",
> 
> [ ... ]
>   TP_printk("host_no=%u channel=%u id=%u lun=%u data_sgl=%u prot_sgl=%u" \
> -   " prot_op=%s cmnd=(%s %s raw=%s) rtn=%d",
> +   " prot_op=%s tag=%d cmnd=(%s %s raw=%s) rtn=%d",
> [ ... ]
>   TP_printk("host_no=%u channel=%u id=%u lun=%u data_sgl=%u " \
> -   "prot_sgl=%u prot_op=%s cmnd=(%s %s raw=%s) result=(driver=" \
> -   "%s host=%s message=%s status=%s)",
> +   "prot_sgl=%u prot_op=%s tag=%d cmnd=(%s %s raw=%s) " \
> +   "result=(driver=%s host=%s message=%s status=%s)",

Which tools process these strings? Has it been verified whether or not
the tools that process these strings still work fine with this patch
applied?

Thanks,

Bart.




Re: usercopy whitelist woe in scsi_sense_cache

2018-04-16 Thread Kees Cook
On Thu, Apr 12, 2018 at 8:02 PM, Kees Cook  wrote:
> On Thu, Apr 12, 2018 at 3:47 PM, Kees Cook  wrote:
>> After fixing up some build issues in the middle of the 4.16 cycle, I
>> get an unhelpful bisect result of commit 0a4b6e2f80aa ("Merge branch
>> 'for-4.16/block'"). Instead of letting the test run longer, I'm going
>> to switch to doing several shorter test boots per kernel and see if
>> that helps. One more bisect coming...
>
> Okay, so I can confirm the bisect points at the _merge_ itself, not a
> specific patch. I'm not sure how to proceed here. It looks like some
> kind of interaction between separate trees? Jens, do you have
> suggestions on how to track this down?

Turning off HARDENED_USERCOPY and turning on KASAN, I see the same report:

[   38.274106] BUG: KASAN: slab-out-of-bounds in _copy_to_user+0x42/0x60
[   38.274841] Read of size 22 at addr 8800122b8c4b by task smartctl/1064
[   38.275630]
[   38.275818] CPU: 2 PID: 1064 Comm: smartctl Not tainted 4.17.0-rc1-ARCH+ #266
[   38.276631] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   38.277690] Call Trace:
[   38.277988]  dump_stack+0x71/0xab
[   38.278397]  ? _copy_to_user+0x42/0x60
[   38.278833]  print_address_description+0x6a/0x270
[   38.279368]  ? _copy_to_user+0x42/0x60
[   38.279800]  kasan_report+0x243/0x360
[   38.280221]  _copy_to_user+0x42/0x60
[   38.280635]  sg_io+0x459/0x660
...

Though we get slightly more details (some we already knew):

[   38.301330] Allocated by task 329:
[   38.301734]  kmem_cache_alloc_node+0xca/0x220
[   38.302239]  scsi_mq_init_request+0x64/0x130 [scsi_mod]
[   38.302821]  blk_mq_alloc_rqs+0x2cf/0x370
[   38.303265]  blk_mq_sched_alloc_tags.isra.4+0x7d/0xb0
[   38.303820]  blk_mq_init_sched+0xf0/0x220
[   38.304268]  elevator_switch+0x17a/0x2c0
[   38.304705]  elv_iosched_store+0x173/0x220
[   38.305171]  queue_attr_store+0x72/0xb0
[   38.305602]  kernfs_fop_write+0x188/0x220
[   38.306049]  __vfs_write+0xb6/0x330
[   38.306436]  vfs_write+0xe9/0x240
[   38.306804]  ksys_write+0x98/0x110
[   38.307181]  do_syscall_64+0x6d/0x1d0
[   38.307590]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   38.308142]
[   38.308316] Freed by task 0:
[   38.308652] (stack is not available)
[   38.309060]
[   38.309243] The buggy address belongs to the object at 8800122b8c00
[   38.309243]  which belongs to the cache scsi_sense_cache of size 96
[   38.310625] The buggy address is located 75 bytes inside of
[   38.310625]  96-byte region [8800122b8c00, 8800122b8c60)


-Kees

-- 
Kees Cook
Pixel Security


Re: [RESEND PATCH v1 1/2] trace: events: scsi: Add tag in SCSI trace events

2018-04-16 Thread Bart Van Assche
On Mon, 2018-04-16 at 20:27 +, Bean Huo (beanhuo) wrote:
> By the way, these patches do not add a new feature; they just print the
> tag along with the other request parameters that are already printed.

Are you aware that there are two tag fields in struct request, namely "tag"
and "internal_tag"? Are you aware that how these fields are used depends on
whether or not a scheduler is attached to a request queue? Have you verified
that the "tag" field contains a meaningful value for blk-mq for every blk-mq
tracepoint?

Thanks,

Bart.
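
For context, a sketch of the distinction Bart is pointing at: with an I/O
scheduler attached, rq->tag stays -1 until the request is actually
dispatched, and only rq->internal_tag (the scheduler tag) is valid before
that. trace_tag() below is a hypothetical helper, not kernel API:

    static inline int trace_tag(struct request *rq)
    {
            /* before dispatch only the scheduler tag identifies the
             * request; after dispatch the driver tag does */
            return rq->tag != -1 ? rq->tag : rq->internal_tag;
    }

A tracepoint that prints rq->tag alone would therefore show -1 for blk-mq
events that fire before dispatch whenever a scheduler is in use.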





[PATCH] blktests: regression test "block: do not use interruptible wait anywhere"

2018-04-16 Thread Alan Jenkins
> Without this fix, I get an IO error in this test:
>
> # dd if=/dev/sda of=/dev/null iflag=direct & \
>   while killall -SIGUSR1 dd; do sleep 0.1; done & \
>   echo mem > /sys/power/state ; \
>   sleep 5; killall dd  # stop after 5 seconds

linux-block insisted they wanted this, based on my reproducer above.
If you start wondering why you wouldn't base it on scsi_debug with a new
"quiesce" option, that makes two of us.

Thread: http://lkml.kernel.org/r/891e334c-cf19-032c-b996-59ac166fc...@gmail.com

Signed-off-by: Alan Jenkins 
---
 tests/scsi/004 | 235 +
 tests/scsi/004.out |   7 ++
 2 files changed, 242 insertions(+)
 create mode 100755 tests/scsi/004
 create mode 100644 tests/scsi/004.out

diff --git a/tests/scsi/004 b/tests/scsi/004
new file mode 100755
index 000..791c76a
--- /dev/null
+++ b/tests/scsi/004
@@ -0,0 +1,235 @@
+#!/bin/bash
+#
+# Regression test for patch "block: do not use interruptible wait anywhere".
+#
+# > Without this fix, I get an IO error in this test:
+# >
+# > # dd if=/dev/sda of=/dev/null iflag=direct & \
+# >   while killall -SIGUSR1 dd; do sleep 0.1; done & \
+# >   echo mem > /sys/power/state ; \
+# >   sleep 5; killall dd  # stop after 5 seconds
+#
+# AJ: linux-block insisted they wanted this, based on my reproducer above.
+# If you start wondering why you wouldn't base it on scsi_debug with a new
+# "quiesce" option, that makes two of us.
+# Thread: 
http://lkml.kernel.org/r/891e334c-cf19-032c-b996-59ac166fc...@gmail.com
+#
+#
+# RATIONALE for a suspend test:
+#
+# The original root cause issue was the behaviour around blk_queue_freeze().
+# It put tasks into an interruptible wait, which is wrong for block devices.
+#
+# The freeze feature is not directly exposed to userspace, so I can not test
+# it directly :(.  (It's used to "guarantee no request is in use, so we can
+# change any data structure of the queue afterward".  I.e. freeze, modify the
+# queue structure, unfreeze).
+#
+# However, this led to a regression with a decent reproducer.  In v4.15 the
+# same interruptible wait was also used for SCSI suspend/resume.  SCSI resume
+# can take a second or so... hence we like to do it asynchronously.  This
+# means we can observe the wait at resume time, and we can test if it is
+# interruptible.
+#
+# Note `echo quiesce > /sys/class/scsi_device/*/device/state` can *not*
+# trigger the specific wait in the block layer.  That code path only
+# sets the SCSI device state; it does not set any block device state.
+# (It does not call into blk_queue_freeze() or blk_set_preempt_only();
+#  it literally just sets sdev->sdev_state to SDEV_QUIESCE).
+#
+#
+# Copyright (C) 2018 Alan Jenkins
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+
+DESCRIPTION="check SCSI blockdev suspend is not interruptible"
+
+QUICK=1
+
+requires() {
+   # I can't expect to hit the window using bash, if the device is
+   # emulated by cpu.
+   #
+   # Maybe annoying for Xen dom0, but I'm guessing it's not common.
+   if grep -q ^flags.*\ hypervisor /proc/cpuinfo &&
+  ! [[ -e "$HAVE_BARE_METAL_SCSI" ]]; then
+   SKIP_REASON=\
+"Hypervisor detected, but this test wants bare-metal SCSI timings.
+If you have a pass-through device, you may set HAVE_BARE_METAL_SCSI=1."
+   return 1
+   fi
+
+   # "If a user has disabled async probing a likely reason
+   #  is due to a storage enclosure that does not inject
+   #  staggered spin-ups. For safety, make resume
+   #  synchronous as well in that case."
+   if ! scan="$(cat /sys/module/scsi_mod/parameters/scan)"; then
+   SKIP_REASON="Could not read 
'/sys/module/scsi_mod/parameters/scan'"
+   return 1
+   fi
+   if [[ "$scan" != async ]]; then
+   SKIP_REASON="This test does not work if you have set 
'scsi_mod.scan=sync'"
+   return 1
+   fi
+
+   if ! cat /sys/power/pm_test > /dev/null; then
+   SKIP_REASON="Error reading pm_test.  Maybe kernel lacks 
CONFIG_PM_TEST?"
+   return 1
+   fi
+
+   _have_fio
+}
+
+do_test_device() ( # run whole function in a subshell
+
+   sysfs_pm_test_delay=/sys/module/suspend/parameters/pm_test_delay
+   saved_pm_test_delay=
+   dd_pid=
+   subshell_pid=
+
+

Re: [RESEND PATCH v1 1/2] trace: events: scsi: Add tag in SCSI trace events

2018-04-16 Thread Bean Huo (beanhuo)
>>> This patch is not acceptable because it adds support for tag tracing
>>> to the legacy block layer only. Any patch that adds a new feature to
>>> the legacy block layer must also add it to blk-mq.
>>>
>> To be honest, I don't understand your point, can you give me more
>explanation?
>
>The legacy block layer will be removed as soon as blk-mq provides all the
>functionality of the legacy block layer and as soon as it performs at least as
>well as the legacy block layer for all use cases. If new features are added to
>the legacy block layer but not to blk-mq that prevents removal of the legacy
>block layer. Hence the requirement I explained in my previous e-mail.
>
>Bart.

Thanks for the information.
I have a few more questions.
When will the legacy block layer be replaced by blk-mq? 
Will "include/trace/events/block.h .. scsi.h" also be changed? 
Do you have a related git repo or mailing list for this topic?
This seems like a great big change; I am very interested in it and want to 
have a look.

By the way, these patches do not add a new feature; they just print the tag 
along with the other request parameters that are already printed.
blk-mq is still using the trace events defined in 
"include/trace/events/block.h".

For example: 
void blk_mq_start_request(struct request *rq)  
{
...
...
trace_block_rq_issue(q, rq);
...
...
}
Do you mean that this will also be removed/replaced by someone else?

Bean Huo


Re: [PATCH 08/12] mmc: reduce use of block bounce buffers (fwd)

2018-04-16 Thread Julia Lawall
There is a duplicated test on line 360.

julia

-- Forwarded message --
Date: Mon, 16 Apr 2018 23:04:18 +0800
From: kbuild test robot <l...@intel.com>
To: kbu...@01.org
Cc: Julia Lawall <julia.law...@lip6.fr>
Subject: Re: [PATCH 08/12] mmc: reduce use of block bounce buffers

CC: kbuild-...@01.org
In-Reply-To: <20180416085032.7367-9-...@lst.de>
References: <20180416085032.7367-9-...@lst.de>
TO: Christoph Hellwig <h...@lst.de>
CC: io...@lists.linux-foundation.org, linux-a...@vger.kernel.org, 
linux-block@vger.kernel.org, linux-...@vger.kernel.org, 
linux-s...@vger.kernel.org, net...@vger.kernel.org
CC: linux-ker...@vger.kernel.org

Hi Christoph,

I love your patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.17-rc1 next-20180416]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Christoph-Hellwig/iscsi_tcp-don-t-set-a-bounce-limit/20180416-172618
:: branch date: 6 hours ago
:: commit date: 6 hours ago

>> drivers/mmc/core/queue.c:360:5-29: duplicated argument to && or ||

# 
https://github.com/0day-ci/linux/commit/6620a69f0eea8e8b7586f08f721c95a336022497
git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout 6620a69f0eea8e8b7586f08f721c95a336022497
vim +360 drivers/mmc/core/queue.c

81196976 Adrian Hunter 2017-11-29  350
c8b5fd03 Adrian Hunter 2017-09-22  351  static void mmc_setup_queue(struct mmc_queue *mq, struct mmc_card *card)
c8b5fd03 Adrian Hunter 2017-09-22  352  {
c8b5fd03 Adrian Hunter 2017-09-22  353  	struct mmc_host *host = card->host;
c8b5fd03 Adrian Hunter 2017-09-22  354
8b904b5b Bart Van Assche   2018-03-07  355  	blk_queue_flag_set(QUEUE_FLAG_NONROT, mq->queue);
8b904b5b Bart Van Assche   2018-03-07  356  	blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, mq->queue);
c8b5fd03 Adrian Hunter 2017-09-22  357  	if (mmc_can_erase(card))
c8b5fd03 Adrian Hunter 2017-09-22  358  		mmc_queue_setup_discard(mq->queue, card);
c8b5fd03 Adrian Hunter 2017-09-22  359
6620a69f Christoph Hellwig 2018-04-16 @360  	if (!mmc_dev(host)->dma_mask || !mmc_dev(host)->dma_mask)
6620a69f Christoph Hellwig 2018-04-16  361  		blk_queue_bounce_limit(mq->queue, BLK_BOUNCE_HIGH);
c8b5fd03 Adrian Hunter 2017-09-22  362  	blk_queue_max_hw_sectors(mq->queue,
c8b5fd03 Adrian Hunter 2017-09-22  363  		min(host->max_blk_count, host->max_req_size / 512));
c8b5fd03 Adrian Hunter 2017-09-22  364  	blk_queue_max_segments(mq->queue, host->max_segs);
c8b5fd03 Adrian Hunter 2017-09-22  365  	blk_queue_max_segment_size(mq->queue, host->max_seg_size);
c8b5fd03 Adrian Hunter 2017-09-22  366
1e8e55b6 Adrian Hunter 2017-11-29  367  	INIT_WORK(&mq->recovery_work, mmc_mq_recovery_handler);
81196976 Adrian Hunter 2017-11-29  368  	INIT_WORK(&mq->complete_work, mmc_blk_mq_complete_work);
81196976 Adrian Hunter 2017-11-29  369
81196976 Adrian Hunter 2017-11-29  370  	mutex_init(&mq->complete_lock);
81196976 Adrian Hunter 2017-11-29  371
81196976 Adrian Hunter 2017-11-29  372  	init_waitqueue_head(&mq->wait);
81196976 Adrian Hunter 2017-11-29  373  }
81196976 Adrian Hunter 2017-11-29  374

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation
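
A hedged guess at what the flagged condition was meant to say: one side
presumably intended to dereference the mask, falling back to bouncing
highmem pages when the device has no usable DMA mask. Sketch only, not the
author's confirmed fix:

    /* hypothetical corrected form of the duplicated test at line 360 */
    if (!mmc_dev(host)->dma_mask || !*mmc_dev(host)->dma_mask)
            blk_queue_bounce_limit(mq->queue, BLK_BOUNCE_HIGH);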


Re: [PATCH] lightnvm: pblk: add CNEX Granby to the pblk device list

2018-04-16 Thread Matias Bjørling

On 4/16/18 11:19 AM, Wei Xu wrote:


Add a new lightnvm quirk to identify CNEX’s Granby controller.

Signed-off-by: Wei Xu 
---
  drivers/nvme/host/pci.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index cb73bc8..9419e88 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2529,6 +2529,8 @@ static const struct pci_device_id nvme_id_table[] = {
.driver_data = NVME_QUIRK_LIGHTNVM, },
{ PCI_DEVICE(0x1d1d, 0x2807),   /* CNEX WL */
.driver_data = NVME_QUIRK_LIGHTNVM, },
+   { PCI_DEVICE(0x1d1d, 0x2601),   /* CNEX Granby */
+   .driver_data = NVME_QUIRK_LIGHTNVM, },
{ PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xff) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2003) },

Hi Wei,

The patch looks good, but the title is not descriptive of the change.

Could the title be changed to "nvme: lightnvm: add granby support"? 
Also, please send it to the nvme mailing list and CC Keith Busch. It is 
a pure NVMe device driver change, and can go directly up through the 
NVMe maintainers. I'll add a Reviewed-by: when you've sent it.


-Matias


Re: [RESEND PATCH v1 1/2] trace: events: scsi: Add tag in SCSI trace events

2018-04-16 Thread Bart Van Assche

On 04/16/18 10:05, Bean Huo (beanhuo) wrote:

On Mon, 2018-04-16 at 09:41 -0700, Rajat Jain wrote:

On Mon, Apr 16, 2018 at 8:28 AM, Steven Rostedt 

wrote:

On Mon, 16 Apr 2018 14:31:49 +
"Bean Huo (beanhuo)"  wrote:


Print the request tag along with other information while tracing a
command.

Signed-off-by: Bean Huo 


Acked-by: Rajat Jain 


This patch is not acceptable because it adds support for tag tracing to the
legacy block layer only. Any patch that adds a new feature to the legacy block
layer must also add it to blk-mq.


To be honest, I don't understand your point, can you give me more explanation?


The legacy block layer will be removed as soon as blk-mq provides all 
the functionality of the legacy block layer and as soon as it performs 
at least as well as the legacy block layer for all use cases. If new 
features are added to the legacy block layer but not to blk-mq that 
prevents removal of the legacy block layer. Hence the requirement I 
explained in my previous e-mail.


Bart.


Re: [RESEND PATCH v1 1/2] trace: events: scsi: Add tag in SCSI trace events

2018-04-16 Thread Bean Huo (beanhuo)
Hi, Bart

>On Mon, 2018-04-16 at 09:41 -0700, Rajat Jain wrote:
>> On Mon, Apr 16, 2018 at 8:28 AM, Steven Rostedt 
>wrote:
>> > On Mon, 16 Apr 2018 14:31:49 +
>> > "Bean Huo (beanhuo)"  wrote:
>> >
>> > > Print the request tag along with other information while tracing a
>> > > command.
>> > >
>> > > Signed-off-by: Bean Huo 
>>
>> Acked-by: Rajat Jain 
>
>This patch is not acceptable because it adds support for tag tracing to the
>legacy block layer only. Any patch that adds a new feature to the legacy block
>layer must also add it to blk-mq.
>
To be honest, I don't understand your point, can you give me more explanation?

>Bart.
>
>



[PATCH 0/2] pblk bugfixes

2018-04-16 Thread Hans Holmberg
From: Hans Holmberg 

This is a couple of bugfixes, nothing very urgent.

Hans Holmberg (2):
  lightnvm: pblk: only try to recover lines with written smeta
  lightnvm: pblk: kick writer on new flush points

 drivers/lightnvm/pblk-cache.c| 10 ++
 drivers/lightnvm/pblk-core.c |  2 +-
 drivers/lightnvm/pblk-rb.c   |  4 ++--
 drivers/lightnvm/pblk-recovery.c | 30 +-
 drivers/lightnvm/pblk.h  |  1 +
 5 files changed, 31 insertions(+), 16 deletions(-)

-- 
2.7.4



Re: [RESEND PATCH v1 1/2] trace: events: scsi: Add tag in SCSI trace events

2018-04-16 Thread Bart Van Assche
On Mon, 2018-04-16 at 09:41 -0700, Rajat Jain wrote:
> On Mon, Apr 16, 2018 at 8:28 AM, Steven Rostedt  wrote:
> > On Mon, 16 Apr 2018 14:31:49 +
> > "Bean Huo (beanhuo)"  wrote:
> > 
> > > Print the request tag along with other information
> > > while tracing a command.
> > > 
> > > Signed-off-by: Bean Huo 
> 
> Acked-by: Rajat Jain 

This patch is not acceptable because it adds support for tag tracing to the
legacy block layer only. Any patch that adds a new feature to the legacy block
layer must also add it to blk-mq.

Bart.





[PATCH 1/2] lightnvm: pblk: only try to recover lines with written smeta

2018-04-16 Thread Hans Holmberg
From: Hans Holmberg 

When switching between different lun configurations, there is no
guarantee that all lines that contain closed/open chunks have some
valid data to recover.

Check that the smeta chunk has been written to instead. Also
skip bad lines (those that do not have enough good chunks).

Signed-off-by: Hans Holmberg 
---
 drivers/lightnvm/pblk-recovery.c | 30 +-
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index 3e079c2..9cb6d5d 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -865,18 +865,30 @@ static void pblk_recov_wa_counters(struct pblk *pblk,
 }
 
 static int pblk_line_was_written(struct pblk_line *line,
-   struct pblk_line_meta *lm)
+   struct pblk *pblk)
 {
 
-   int i;
-   int state_mask = NVM_CHK_ST_OFFLINE | NVM_CHK_ST_FREE;
+   struct pblk_line_meta *lm = &pblk->lm;
+   struct nvm_tgt_dev *dev = pblk->dev;
+   struct nvm_geo *geo = &dev->geo;
+   struct nvm_chk_meta *chunk;
+   struct ppa_addr bppa;
+   int smeta_blk;
 
-   for (i = 0; i < lm->blk_per_line; i++) {
-   if (!(line->chks[i].state & state_mask))
-   return 1;
-   }
+   if (line->state == PBLK_LINESTATE_BAD)
+   return 0;
 
-   return 0;
+   smeta_blk = find_first_zero_bit(line->blk_bitmap, lm->blk_per_line);
+   if (smeta_blk >= lm->blk_per_line)
+   return 0;
+
+   bppa = pblk->luns[smeta_blk].bppa;
+   chunk = &line->chks[pblk_ppa_to_pos(geo, bppa)];
+
+   if (chunk->state & NVM_CHK_ST_FREE)
+   return 0;
+
+   return 1;
 }
 
 struct pblk_line *pblk_recov_l2p(struct pblk *pblk)
@@ -915,7 +927,7 @@ struct pblk_line *pblk_recov_l2p(struct pblk *pblk)
line->lun_bitmap = ((void *)(smeta_buf)) +
sizeof(struct line_smeta);
 
-   if (!pblk_line_was_written(line, lm))
+   if (!pblk_line_was_written(line, pblk))
continue;
 
/* Lines that cannot be read are assumed as not written here */
-- 
2.7.4



[PATCH 2/2] lightnvm: pblk: kick writer on new flush points

2018-04-16 Thread Hans Holmberg
From: Hans Holmberg 

Unless we kick the writer directly when setting a new flush point, the
user risks having to wait for up to one second (the default timeout for
the write thread to be kicked) for the IO to complete.

Signed-off-by: Hans Holmberg 
---
 drivers/lightnvm/pblk-cache.c | 10 ++
 drivers/lightnvm/pblk-core.c  |  2 +-
 drivers/lightnvm/pblk-rb.c|  4 ++--
 drivers/lightnvm/pblk.h   |  1 +
 4 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/lightnvm/pblk-cache.c b/drivers/lightnvm/pblk-cache.c
index 29a2311..b1c6d7e 100644
--- a/drivers/lightnvm/pblk-cache.c
+++ b/drivers/lightnvm/pblk-cache.c
@@ -44,13 +44,15 @@ int pblk_write_to_cache(struct pblk *pblk, struct bio *bio, unsigned long flags)
 goto out;
 }
 
-   if (unlikely(!bio_has_data(bio)))
-   goto out;
-
 pblk_ppa_set_empty(&w_ctx.ppa);
w_ctx.flags = flags;
-   if (bio->bi_opf & REQ_PREFLUSH)
+   if (bio->bi_opf & REQ_PREFLUSH) {
w_ctx.flags |= PBLK_FLUSH_ENTRY;
+   pblk_write_kick(pblk);
+   }
+
+   if (unlikely(!bio_has_data(bio)))
+   goto out;
 
for (i = 0; i < nr_entries; i++) {
void *data = bio_data(bio);
diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 94d5d97..ceacd10 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -320,7 +320,7 @@ int pblk_bio_add_pages(struct pblk *pblk, struct bio *bio, gfp_t flags,
 return -1;
 }
 
-static void pblk_write_kick(struct pblk *pblk)
+void pblk_write_kick(struct pblk *pblk)
 {
 wake_up_process(pblk->writer_ts);
 mod_timer(&pblk->wtimer, jiffies + msecs_to_jiffies(1000));
diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index 52fdd85..7a63291 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -350,7 +350,7 @@ void pblk_rb_write_entry_gc(struct pblk_rb *rb, void *data,
 }
 
 static int pblk_rb_flush_point_set(struct pblk_rb *rb, struct bio *bio,
- unsigned int pos)
+  unsigned int pos)
 {
struct pblk_rb_entry *entry;
unsigned int sync, flush_point;
@@ -420,7 +420,7 @@ void pblk_rb_flush(struct pblk_rb *rb)
if (pblk_rb_flush_point_set(rb, NULL, mem))
return;
 
-   pblk_write_should_kick(pblk);
+   pblk_write_kick(pblk);
 }
 
 static int pblk_rb_may_write_flush(struct pblk_rb *rb, unsigned int nr_entries,
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 9c682ac..856ac41 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -837,6 +837,7 @@ void pblk_map_rq(struct pblk *pblk, struct nvm_rq *rqd, unsigned int sentry,
 int pblk_write_ts(void *data);
 void pblk_write_timer_fn(struct timer_list *t);
 void pblk_write_should_kick(struct pblk *pblk);
+void pblk_write_kick(struct pblk *pblk);
 
 /*
  * pblk read path
-- 
2.7.4



Re: [RESEND PATCH v1 1/2] trace: events: scsi: Add tag in SCSI trace events

2018-04-16 Thread Rajat Jain
On Mon, Apr 16, 2018 at 8:28 AM, Steven Rostedt  wrote:
> On Mon, 16 Apr 2018 14:31:49 +
> "Bean Huo (beanhuo)"  wrote:
>
>> Print the request tag along with other information
>> while tracing a command.
>>
>> Signed-off-by: Bean Huo 
Acked-by: Rajat Jain 

>> ---
>
> I don't see any issue with the tracing part.
>
> Acked-by: Steven Rostedt (VMware) 
>
> Others need to check the content.
>
> -- Steve


Re: [PATCH 2/2] tracing/events: block: dev_t via driver core for plug and unplug events

2018-04-16 Thread Steffen Maier

Hi Greg,

On 04/15/2018 10:31 AM, Greg Kroah-Hartman wrote:

On Fri, Apr 13, 2018 at 03:07:18PM +0200, Steffen Maier wrote:

Complements v2.6.31 commit 55782138e47d ("tracing/events: convert block
trace points to TRACE_EVENT()") to be equivalent to traditional blktrace
output. Also this allows event filtering to not always get all (un)plug
events.

NB: The NULL pointer check for q->kobj.parent is certainly racy and
I don't have enough experience if it's good enough for a trace event.
The change did work for my cases (block device read/write I/O on
zfcp-attached SCSI disks and dm-mpath on top).

While I haven't seen any prior art using driver core (parent) relations
for trace events, there are other cases using this when no direct pointer
exists between objects, such as:
  #define to_scsi_target(d) container_of(d, struct scsi_target, dev)
  static inline struct scsi_target *scsi_target(struct scsi_device *sdev)
  {
return to_scsi_target(sdev->sdev_gendev.parent);
  }


That is because you "know" the parent of a target device is a
scsi_target.


true


This is the object model we make use of here:

struct gendisk {
 struct hd_struct {
 struct device {  /*container_of*/
 struct kobject kobj; <--+
 dev_t  devt; /*deref*/  |
 } __dev;|
 } part0;|
 struct request_queue *queue; ..+|
}  :|
:|
struct request_queue {  <..+|
 /* queue kobject */ |
 struct kobject {|
 struct kobject *parent; +


Are you sure about this?


I double checked it with crash on a running system chasing pointers and 
looking at structure debug symbols.

But of course I cannot guarantee it's always been like this and will be.


 } kobj;
}




The difference to blktrace parsed output is that block events don't use the
partition's minor number but the containing block device's minor number:


Why do you want the block device's minor number here?  What is wrong
with the partition's minor number?  I would think you want that instead.


No change introduced with my patch. I just describe state of the art 
since the mentioned 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=55782138e47d.
It (or even its predecessor) used request_queue as trace function 
argument (plus mostly either request or bio). So that's the currently 
available context for these events. My change is consistent with that.
But then again, it's not much of a problem as we do have the remap event 
which shows the mapping from partition to blockdev.


blktrace, hooking with callbacks on the block trace events, has its own 
context information [struct blk_trace] and can get to e.g. the dev_t 
with its own real pointers without using driver core relations. But I 
had the impression that's only wired if one uses the blktrace IOCTL or 
the blk tracer [do_blk_trace_setup()], not for "pure" block events.

static void blk_add_trace_plug(void *ignore, struct request_queue *q)
{
struct blk_trace *bt = q->blk_trace;

   


if (bt)
__blk_add_trace(bt, 0, 0, 0, 0, BLK_TA_PLUG, 0, 0, NULL, NULL);
}

static void blk_add_trace_unplug(void *ignore, struct request_queue *q,
unsigned int depth, bool explicit)
{
struct blk_trace *bt = q->blk_trace;

   


if (bt) {
__be64 rpdu = cpu_to_be64(depth);
u32 what;

if (explicit)
what = BLK_TA_UNPLUG_IO;
else
what = BLK_TA_UNPLUG_TIMER;

__blk_add_trace(bt, 0, 0, 0, 0, what, 0, sizeof(rpdu), , 
NULL);
}
}



struct blk_trace {
int trace_state;
struct rchan *rchan;
unsigned long __percpu *sequence;
unsigned char __percpu *msg_data;
u16 act_mask;
u64 start_lba;
u64 end_lba;
u32 pid;
u32 dev;

^^^

struct dentry *dir;
struct dentry *dropped_file;
struct dentry *msg_file;
struct list_head running_list;
atomic_t dropped;
};





$ dd if=/dev/sdf1 count=1

$ cat /sys/kernel/debug/tracing/trace
block_bio_remap: 8,80 R 2048 + 32 <- (8,81) 0
block_bio_queue: 8,80 R 2048 + 32 [dd]
block_getrq: 8,80 R 2048 + 32 [dd]
block_plug: 8,80 [dd]
 
block_rq_insert: 8,80 R 16384 () 2048 + 32 [dd]
block_unplug: 8,80 [dd] 1 explicit
   
block_rq_issue: 8,80 R 16384 () 2048 + 32 [dd]
block_rq_complete: 8,80 R () 2048 + 32 [0]



diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 

Re: [RESEND PATCH v1 2/2] trace: events: block: Add tag in block trace events

2018-04-16 Thread Steven Rostedt
On Mon, 16 Apr 2018 14:33:29 +
"Bean Huo (beanhuo)"  wrote:

> Print the request tag along with other information in block trace events
> when tracing request , and unplug type (Sync / Async).
> 
> Signed-off-by: Bean Huo 

I don't see any issue with the tracing part.

Acked-by: Steven Rostedt (VMware) 

Others need to check the content.

-- Steve


> ---
>  include/trace/events/block.h | 36 +---
>  1 file changed, 25 insertions(+), 11 deletions(-)
> 


Re: [RESEND PATCH v1 1/2] trace: events: scsi: Add tag in SCSI trace events

2018-04-16 Thread Steven Rostedt
On Mon, 16 Apr 2018 14:31:49 +
"Bean Huo (beanhuo)"  wrote:

> Print the request tag along with other information
> while tracing a command.
> 
> Signed-off-by: Bean Huo 
> ---

I don't see any issue with the tracing part.

Acked-by: Steven Rostedt (VMware) 

Others need to check the content.

-- Steve


[RESEND PATCH v1 1/2] trace: events: scsi: Add tag in SCSI trace events

2018-04-16 Thread Bean Huo (beanhuo)
Print the request tag along with other information
while tracing a command.

Signed-off-by: Bean Huo 
---
 include/trace/events/scsi.h | 20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/scsi.h b/include/trace/events/scsi.h
index f624969..a4ada90 100644
--- a/include/trace/events/scsi.h
+++ b/include/trace/events/scsi.h
@@ -210,6 +210,7 @@ TRACE_EVENT(scsi_dispatch_cmd_start,
__field( unsigned int,  lun )
__field( unsigned int,  opcode  )
__field( unsigned int,  cmd_len )
+   __field( int,   tag )
__field( unsigned int,  data_sglen )
__field( unsigned int,  prot_sglen )
__field( unsigned char, prot_op )
@@ -223,6 +224,7 @@ TRACE_EVENT(scsi_dispatch_cmd_start,
__entry->lun= cmd->device->lun;
__entry->opcode = cmd->cmnd[0];
__entry->cmd_len= cmd->cmd_len;
+   __entry->tag= cmd->request->tag;
__entry->data_sglen = scsi_sg_count(cmd);
__entry->prot_sglen = scsi_prot_sg_count(cmd);
__entry->prot_op= scsi_get_prot_op(cmd);
@@ -230,10 +232,10 @@ TRACE_EVENT(scsi_dispatch_cmd_start,
),
 
TP_printk("host_no=%u channel=%u id=%u lun=%u data_sgl=%u prot_sgl=%u" \
- " prot_op=%s cmnd=(%s %s raw=%s)",
+ " prot_op=%s tag=%d cmnd=(%s %s raw=%s)",
  __entry->host_no, __entry->channel, __entry->id,
  __entry->lun, __entry->data_sglen, __entry->prot_sglen,
- show_prot_op_name(__entry->prot_op),
+ show_prot_op_name(__entry->prot_op), __entry->tag,
  show_opcode_name(__entry->opcode),
  __parse_cdb(__get_dynamic_array(cmnd), __entry->cmd_len),
  __print_hex(__get_dynamic_array(cmnd), __entry->cmd_len))
@@ -253,6 +255,7 @@ TRACE_EVENT(scsi_dispatch_cmd_error,
__field( int,   rtn )
__field( unsigned int,  opcode  )
__field( unsigned int,  cmd_len )
+   __field( int,   tag )
__field( unsigned int,  data_sglen )
__field( unsigned int,  prot_sglen )
__field( unsigned char, prot_op )
@@ -267,6 +270,7 @@ TRACE_EVENT(scsi_dispatch_cmd_error,
__entry->rtn= rtn;
__entry->opcode = cmd->cmnd[0];
__entry->cmd_len= cmd->cmd_len;
+   __entry->tag= cmd->request->tag;
__entry->data_sglen = scsi_sg_count(cmd);
__entry->prot_sglen = scsi_prot_sg_count(cmd);
__entry->prot_op= scsi_get_prot_op(cmd);
@@ -274,10 +278,10 @@ TRACE_EVENT(scsi_dispatch_cmd_error,
),
 
TP_printk("host_no=%u channel=%u id=%u lun=%u data_sgl=%u prot_sgl=%u" \
- " prot_op=%s cmnd=(%s %s raw=%s) rtn=%d",
+ " prot_op=%s tag=%d cmnd=(%s %s raw=%s) rtn=%d",
  __entry->host_no, __entry->channel, __entry->id,
  __entry->lun, __entry->data_sglen, __entry->prot_sglen,
- show_prot_op_name(__entry->prot_op),
+ show_prot_op_name(__entry->prot_op), __entry->tag,
  show_opcode_name(__entry->opcode),
  __parse_cdb(__get_dynamic_array(cmnd), __entry->cmd_len),
  __print_hex(__get_dynamic_array(cmnd), __entry->cmd_len),
@@ -298,6 +302,7 @@ DECLARE_EVENT_CLASS(scsi_cmd_done_timeout_template,
__field( int,   result  )
__field( unsigned int,  opcode  )
__field( unsigned int,  cmd_len )
+   __field( int,   tag )
__field( unsigned int,  data_sglen )
__field( unsigned int,  prot_sglen )
__field( unsigned char, prot_op )
@@ -312,6 +317,7 @@ DECLARE_EVENT_CLASS(scsi_cmd_done_timeout_template,
__entry->result = cmd->result;
__entry->opcode = cmd->cmnd[0];
__entry->cmd_len= cmd->cmd_len;
+   __entry->tag= cmd->request->tag;
__entry->data_sglen = scsi_sg_count(cmd);
__entry->prot_sglen = scsi_prot_sg_count(cmd);
__entry->prot_op= scsi_get_prot_op(cmd);
@@ -319,11 +325,11 @@ DECLARE_EVENT_CLASS(scsi_cmd_done_timeout_template,
),
 
TP_printk("host_no=%u channel=%u id=%u lun=%u data_sgl=%u " \
- "prot_sgl=%u prot_op=%s cmnd=(%s %s raw=%s) result=(driver=" \
- "%s host=%s message=%s status=%s)",
+ "prot_sgl=%u prot_op=%s tag=%d cmnd=(%s %s raw=%s) " \
+ "result=(driver=%s host=%s message=%s 

[RESEND PATCH v1 0/2] Print the request tag in Block/SCSI trace events

2018-04-16 Thread Bean Huo (beanhuo)
These patches add the printout of the request tag in Block/SCSI trace
events when tracing a request or command. This is very useful for tracing
task execution status on storage devices that support multiple command
queues.

As for the first patch, "Add tag in SCSI trace events", it is copied from
Rajat Jain [1]; I am just re-sending it here.
 
[1]https://patchwork.kernel.org/patch/6399661/

-- Resend since these patches need to go through the block and SCSI
subsystems and need the related maintainers' confirmation.

Bean Huo (2):
  trace: events: scsi: Add tag in SCSI trace events
  trace: events: block: Add tag in block trace events

 include/trace/events/block.h | 36 +---
 include/trace/events/scsi.h  | 20 +---
 2 files changed, 38 insertions(+), 18 deletions(-)

-- 
2.7.4


Re: [PATCH 01/12] iommu-common: move to arch/sparc

2018-04-16 Thread David Miller
From: Anshuman Khandual 
Date: Mon, 16 Apr 2018 14:26:07 +0530

> On 04/15/2018 08:29 PM, Christoph Hellwig wrote:
>> This code is only used by sparc, and all new iommu drivers should use the
>> drivers/iommu/ framework.  Also remove the unused exports.
>> 
>> Signed-off-by: Christoph Hellwig 
> 
> Right, these functions are used only from SPARC architecture. Simple
> git grep confirms it as well. Hence it makes sense to move them into
> arch code instead.

Well, we put these into a common location and used type friendly for
powerpc because we hoped powerpc would convert over to using this
common piece of code as well.

But nobody did the powerpc work.

If you look at the powerpc iommu support, it's the same code basically
for entry allocation.


Re: [PATCH V4 0/2] blk-mq: fix race between completion and BLK_EH_RESET_TIMER

2018-04-16 Thread Martin Steigerwald
Ming Lei - 16.04.18, 02:45:
> On Sun, Apr 15, 2018 at 06:31:44PM +0200, Martin Steigerwald wrote:
> > Hi Ming.
> > 
> > Ming Lei - 15.04.18, 17:43:
> > > Hi Jens,
> > > 
> > > This two patches fixes the recently discussed race between
> > > completion
> > > and BLK_EH_RESET_TIMER.
> > > 
> > > Israel & Martin, this one is a simpler fix on this issue and can
> > > cover the potencial hang of MQ_RQ_COMPLETE_IN_TIMEOUT request,
> > > could
> > > you test V4 and see if your issue can be fixed?
> > 
> > In replacement of all the three other patches I applied?
> > 
> > - '[PATCH] blk-mq_Directly schedule q->timeout_work when aborting a
> > request.mbox'
> > 
> > - '[PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair
> > into rcu_read_{lock,unlock}().mbox'
> > 
> > - '[PATCH v4] blk-mq_Fix race conditions in request timeout
> > handling.mbox'
> 
> You only need to replace the above one '[PATCH v4] blk-mq_Fix race
> conditions in request timeout' with V4 in this thread.

Ming, a 4.16.2 with the patches:

'[PATCH] blk-mq_Directly schedule q->timeout_work when aborting a 
request.mbox'
'[PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair into 
rcu_read_{lock,unlock}().mbox'
'[PATCH V4 1_2] blk-mq_set RQF_MQ_TIMEOUT_EXPIRED when the rq'\''s 
timeout isn'\''t handled.mbox'
'[PATCH V4 2_2] blk-mq_fix race between complete and 
BLK_EH_RESET_TIMER.mbox'

hung on boot 3 out of 4 times.

See

[Possible REGRESSION, 4.16-rc4] Error updating SMART data during runtime 
and boot failures with blk_mq_terminate_expired in backtrace
https://bugzilla.kernel.org/show_bug.cgi?id=199077#c13

I tried to add your mail address to Cc of the bug report, but Bugzilla 
did not know it.

Fortunately it booted on the fourth attempt, cause I forgot my GRUB 
password.

Reverting back to previous 4.16.1 kernel with patches from Bart.

> > These patches worked reliably so far both for the hang on boot and
> > error reading SMART data.
> 
> And you may see the reason in the following thread:
> 
> https://marc.info/?l=linux-block=152366441625786=2

So requests could never be completed?

> > I´d compile a kernel tomorrow or Tuesday I think.

Thanks,
-- 
Martin




Re: [PATCH 08/12] mmc: reduce use of block bounce buffers

2018-04-16 Thread Robin Murphy

On 16/04/18 09:50, Christoph Hellwig wrote:

We can rely on the dma-mapping code to handle any DMA limits that are
bigger than the ISA DMA mask for us (either using an iommu or swiotlb),
so remove setting the block layer bounce limit for anything but bouncing
for highmem pages.

Signed-off-by: Christoph Hellwig 
---
  drivers/mmc/core/queue.c | 7 ++-
  1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 56e9a803db21..60a02a763d01 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -351,17 +351,14 @@ static const struct blk_mq_ops mmc_mq_ops = {
  static void mmc_setup_queue(struct mmc_queue *mq, struct mmc_card *card)
  {
struct mmc_host *host = card->host;
-   u64 limit = BLK_BOUNCE_HIGH;
-
-   if (mmc_dev(host)->dma_mask && *mmc_dev(host)->dma_mask)
-   limit = (u64)dma_max_pfn(mmc_dev(host)) << PAGE_SHIFT;
  
  	blk_queue_flag_set(QUEUE_FLAG_NONROT, mq->queue);

blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, mq->queue);
if (mmc_can_erase(card))
mmc_queue_setup_discard(mq->queue, card);
  
-	blk_queue_bounce_limit(mq->queue, limit);

+   if (!mmc_dev(host)->dma_mask || !mmc_dev(host)->dma_mask)


I'm almost surprised that GCC doesn't warn about "x || x", but 
nevertheless I think you've lost a "*" here...


Robin.


+   blk_queue_bounce_limit(mq->queue, BLK_BOUNCE_HIGH);
blk_queue_max_hw_sectors(mq->queue,
min(host->max_blk_count, host->max_req_size / 512));
blk_queue_max_segments(mq->queue, host->max_segs);
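
Robin's catch above is the classic pointer-versus-value test. A minimal
userspace sketch of the bug class, with hypothetical names rather than the
actual mmc code:

#include <stdio.h>

/* dma_mask is a pointer to a mask; both a NULL pointer and a zero
 * mask value mean "no usable mask" */
struct dev { unsigned long long *dma_mask; };

static int has_dma_mask(const struct dev *d)
{
        /* buggy form: d->dma_mask || d->dma_mask tests the pointer twice */
        /* correct form: test the pointer, then dereference for the value */
        return d->dma_mask && *d->dma_mask;
}

int main(void)
{
        unsigned long long mask = 0;
        struct dev d = { &mask };

        printf("%d\n", has_dma_mask(&d));  /* 0: pointer set, mask zero */
        return 0;
}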



[PATCH 00/11] lightnvm: pblk: small fixes

2018-04-16 Thread Javier González
A bunch of small fixes and extra checks for pblk. None is critical, though
"lightnvm: pblk: check for chunk size before allocating it" might be nice to
get into 4.17 as it is a fix for the 2.0 pblk patches.

Javier

Javier González (11):
  lightnvm: pblk: fail gracefully on line alloc. failure
  lightnvm: pblk: recheck for bad lines at runtime
  lightnvm: pblk: check read lba on gc path
  lightnvm: pblk: improve error msg on corrupted LBAs
  lightnvm: pblk: warn in case of corrupted write buffer
  lightnvm: pblk: return NVM_ error on failed submission
  lightnvm: pblk: remove unnecessary indirection
  lightnvm: pblk: remove unnecessary argument
  lightnvm: pblk: check for chunk size before allocating it
  lightnvm: pass flag on graceful teardown to targets
  lightnvm: pblk: remove dead function

 drivers/lightnvm/core.c  | 10 +++---
 drivers/lightnvm/pblk-core.c | 86 ++--
 drivers/lightnvm/pblk-gc.c   | 10 +++---
 drivers/lightnvm/pblk-init.c | 38 
 drivers/lightnvm/pblk-map.c  | 33 -
 drivers/lightnvm/pblk-rb.c   |  5 ++-
 drivers/lightnvm/pblk-read.c | 73 -
 drivers/lightnvm/pblk.h  |  7 ++--
 include/linux/lightnvm.h |  2 +-
 9 files changed, 173 insertions(+), 91 deletions(-)

-- 
2.7.4



[PATCH 03/11] lightnvm: pblk: check read lba on gc path

2018-04-16 Thread Javier González
Check that the lba stored in the LBA metadata is correct in the GC path
too. This requires a new helper function to check random reads in the
vector read.

Signed-off-by: Javier González 
---
 drivers/lightnvm/pblk-read.c | 39 +--
 1 file changed, 33 insertions(+), 6 deletions(-)

diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
index 9eee10f69df0..0d45d4ffc370 100644
--- a/drivers/lightnvm/pblk-read.c
+++ b/drivers/lightnvm/pblk-read.c
@@ -113,15 +113,14 @@ static int pblk_submit_read_io(struct pblk *pblk, struct 
nvm_rq *rqd)
return NVM_IO_OK;
 }
 
-static void pblk_read_check(struct pblk *pblk, struct nvm_rq *rqd,
-  sector_t blba)
+static void pblk_read_check_seq(struct pblk *pblk, void *meta_list,
+   sector_t blba, int nr_lbas)
 {
-   struct pblk_sec_meta *meta_list = rqd->meta_list;
-   int nr_lbas = rqd->nr_ppas;
+   struct pblk_sec_meta *meta_lba_list = meta_list;
int i;
 
for (i = 0; i < nr_lbas; i++) {
-   u64 lba = le64_to_cpu(meta_list[i].lba);
+   u64 lba = le64_to_cpu(meta_lba_list[i].lba);
 
if (lba == ADDR_EMPTY)
continue;
@@ -130,6 +129,32 @@ static void pblk_read_check(struct pblk *pblk, struct 
nvm_rq *rqd,
}
 }
 
+/*
+ * There can be holes in the lba list.
+ */
+static void pblk_read_check_rand(struct pblk *pblk, void *meta_list,
+   u64 *lba_list, int nr_lbas)
+{
+   struct pblk_sec_meta *meta_lba_list = meta_list;
+   int i, j;
+
+   for (i = 0, j = 0; i < nr_lbas; i++) {
+   u64 lba = lba_list[i];
+   u64 meta_lba;
+
+   if (lba == ADDR_EMPTY)
+   continue;
+
+   meta_lba = le64_to_cpu(meta_lba_list[j++].lba);
+
+   if (lba != meta_lba) {
+   pr_err("pblk: corrupted read LBA (%llu/%llu)\n",
+   lba, meta_lba);
+   WARN_ON(1);
+   }
+   }
+}
+
 static void pblk_read_put_rqd_kref(struct pblk *pblk, struct nvm_rq *rqd)
 {
struct ppa_addr *ppa_list;
@@ -172,7 +197,7 @@ static void __pblk_end_io_read(struct pblk *pblk, struct 
nvm_rq *rqd,
WARN_ONCE(bio->bi_status, "pblk: corrupted read error\n");
 #endif
 
-   pblk_read_check(pblk, rqd, r_ctx->lba);
+   pblk_read_check_seq(pblk, rqd->meta_list, r_ctx->lba, rqd->nr_ppas);
 
bio_put(bio);
if (r_ctx->private)
@@ -585,6 +610,8 @@ int pblk_submit_read_gc(struct pblk *pblk, struct 
pblk_gc_rq *gc_rq)
goto err_free_bio;
}
 
+   pblk_read_check_rand(pblk, rqd.meta_list, gc_rq->lba_list, rqd.nr_ppas);
+
atomic_dec(&pblk->inflight_io);
 
if (rqd.error) {
-- 
2.7.4
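
The random-read check in the patch above keeps two indices because the
on-media metadata is dense while the requested LBA list may contain holes
(ADDR_EMPTY slots). A minimal userspace sketch of that walk, with toy
values standing in for pblk's types:

#include <stdio.h>

#define ADDR_EMPTY 0xffffffffffffffffULL

int main(void)
{
        /* requested LBAs, with one hole in the middle */
        unsigned long long lba_list[] = { 100, ADDR_EMPTY, 102, 103 };
        /* device-reported LBAs: dense, one per sector actually read */
        unsigned long long meta_list[] = { 100, 102, 999 };
        int i, j = 0;

        for (i = 0; i < 4; i++) {
                if (lba_list[i] == ADDR_EMPTY)
                        continue;               /* hole: no metadata entry */
                if (lba_list[i] != meta_list[j])
                        printf("corrupted read LBA (%llu/%llu)\n",
                               lba_list[i], meta_list[j]);
                j++;                            /* advance the dense index */
        }
        return 0;
}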



[PATCH 02/11] lightnvm: pblk: recheck for bad lines at runtime

2018-04-16 Thread Javier González
Bad blocks can grow at runtime. Check that the number of valid blocks in
a line is within the sanity threshold before allocating the line for
new writes.

Signed-off-by: Javier González 
---
 drivers/lightnvm/pblk-core.c | 38 --
 drivers/lightnvm/pblk-init.c | 11 +++
 2 files changed, 35 insertions(+), 14 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index ceacd10a043e..128101f9e606 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1174,7 +1174,8 @@ static int pblk_prepare_new_line(struct pblk *pblk, 
struct pblk_line *line)
 static int pblk_line_prepare(struct pblk *pblk, struct pblk_line *line)
 {
struct pblk_line_meta *lm = >lm;
-   int blk_to_erase;
+   int blk_in_line = atomic_read(&line->blk_in_line);
+   int blk_to_erase, ret;
 
line->map_bitmap = kzalloc(lm->sec_bitmap_len, GFP_ATOMIC);
if (!line->map_bitmap)
@@ -1183,8 +1184,8 @@ static int pblk_line_prepare(struct pblk *pblk, struct 
pblk_line *line)
/* will be initialized using bb info from map_bitmap */
line->invalid_bitmap = kmalloc(lm->sec_bitmap_len, GFP_ATOMIC);
if (!line->invalid_bitmap) {
-   kfree(line->map_bitmap);
-   return -ENOMEM;
+   ret = -ENOMEM;
+   goto fail_free_map_bitmap;
}
 
/* Bad blocks do not need to be erased */
@@ -1199,16 +1200,19 @@ static int pblk_line_prepare(struct pblk *pblk, struct 
pblk_line *line)
blk_to_erase = pblk_prepare_new_line(pblk, line);
line->state = PBLK_LINESTATE_FREE;
} else {
-   blk_to_erase = atomic_read(&line->blk_in_line);
+   blk_to_erase = blk_in_line;
+   }
+
+   if (blk_in_line < lm->min_blk_line) {
+   ret = -EAGAIN;
+   goto fail_free_invalid_bitmap;
}
 
if (line->state != PBLK_LINESTATE_FREE) {
-   kfree(line->map_bitmap);
-   kfree(line->invalid_bitmap);
-   spin_unlock(&line->lock);
WARN(1, "pblk: corrupted line %d, state %d\n",
line->id, line->state);
-   return -EAGAIN;
+   ret = -EINTR;
+   goto fail_free_invalid_bitmap;
}
 
line->state = PBLK_LINESTATE_OPEN;
@@ -1222,6 +1226,16 @@ static int pblk_line_prepare(struct pblk *pblk, struct 
pblk_line *line)
kref_init(&line->ref);
 
return 0;
+
+fail_free_invalid_bitmap:
+   spin_unlock(&line->lock);
+   kfree(line->invalid_bitmap);
+   line->invalid_bitmap = NULL;
+fail_free_map_bitmap:
+   kfree(line->map_bitmap);
+   line->map_bitmap = NULL;
+
+   return ret;
 }
 
 int pblk_line_recov_alloc(struct pblk *pblk, struct pblk_line *line)
@@ -1292,10 +1306,14 @@ struct pblk_line *pblk_line_get(struct pblk *pblk)
 
ret = pblk_line_prepare(pblk, line);
if (ret) {
-   if (ret == -EAGAIN) {
+   switch (ret) {
+   case -EAGAIN:
+   list_add(&line->list, &l_mg->bad_list);
+   goto retry;
+   case -EINTR:
list_add(&line->list, &l_mg->corrupt_list);
goto retry;
-   } else {
+   default:
pr_err("pblk: failed to prepare line %d\n", line->id);
list_add(&line->list, &l_mg->free_list);
l_mg->nr_free_lines++;
diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index dee64f91227d..8f8c9abd14fc 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -127,10 +127,8 @@ static int pblk_l2p_recover(struct pblk *pblk, bool 
factory_init)
if (!line) {
/* Configure next line for user data */
line = pblk_line_get_first_data(pblk);
-   if (!line) {
-   pr_err("pblk: line list corrupted\n");
+   if (!line)
return -EFAULT;
-   }
}
 
return 0;
@@ -141,6 +139,7 @@ static int pblk_l2p_init(struct pblk *pblk, bool 
factory_init)
sector_t i;
struct ppa_addr ppa;
size_t map_size;
+   int ret = 0;
 
map_size = pblk_trans_map_size(pblk);
pblk->trans_map = vmalloc(map_size);
@@ -152,7 +151,11 @@ static int pblk_l2p_init(struct pblk *pblk, bool 
factory_init)
for (i = 0; i < pblk->rl.nr_secs; i++)
pblk_trans_map_set(pblk, i, ppa);
 
-   return pblk_l2p_recover(pblk, factory_init);
+   ret = pblk_l2p_recover(pblk, factory_init);
+   if (ret)
+   vfree(pblk->trans_map);
+
+   return ret;
 }
 
 static void pblk_rwb_free(struct pblk *pblk)
-- 
2.7.4
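
The reworked error path in the patch above is the usual kernel goto-unwind
idiom: every failure jumps to a label that frees exactly what was allocated
before the failure point. A minimal userspace sketch of the shape, with
malloc standing in for the bitmap allocations:

#include <stdlib.h>

struct line { void *map_bitmap, *invalid_bitmap; };

static int line_prepare(struct line *l, size_t len)
{
        int ret;

        l->map_bitmap = calloc(1, len);
        if (!l->map_bitmap)
                return -1;

        l->invalid_bitmap = malloc(len);
        if (!l->invalid_bitmap) {
                ret = -1;
                goto fail_free_map_bitmap;
        }

        /* later sanity checks bail out through the same unwind */
        return 0;

fail_free_map_bitmap:
        free(l->map_bitmap);
        l->map_bitmap = NULL;
        return ret;
}

int main(void)
{
        struct line l;
        return line_prepare(&l, 64) ? 1 : 0;    /* caller frees on success */
}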



[PATCH 04/11] lightnvm: pblk: improve error msg on corrupted LBAs

2018-04-16 Thread Javier González
In the event of a mismatch between the read LBA and the metadata pointer
reported by the device, improve the error message so that the offending
physical address (PPA) mapped to the corrupted LBA can be identified.

Signed-off-by: Javier González 
---
 drivers/lightnvm/pblk-read.c | 42 --
 1 file changed, 32 insertions(+), 10 deletions(-)

diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
index 0d45d4ffc370..89aed634333a 100644
--- a/drivers/lightnvm/pblk-read.c
+++ b/drivers/lightnvm/pblk-read.c
@@ -113,10 +113,11 @@ static int pblk_submit_read_io(struct pblk *pblk, struct 
nvm_rq *rqd)
return NVM_IO_OK;
 }
 
-static void pblk_read_check_seq(struct pblk *pblk, void *meta_list,
-   sector_t blba, int nr_lbas)
+static void pblk_read_check_seq(struct pblk *pblk, struct nvm_rq *rqd,
+   sector_t blba)
 {
-   struct pblk_sec_meta *meta_lba_list = meta_list;
+   struct pblk_sec_meta *meta_lba_list = rqd->meta_list;
+   int nr_lbas = rqd->nr_ppas;
int i;
 
for (i = 0; i < nr_lbas; i++) {
@@ -125,17 +126,27 @@ static void pblk_read_check_seq(struct pblk *pblk, void 
*meta_list,
if (lba == ADDR_EMPTY)
continue;
 
-   WARN(lba != blba + i, "pblk: corrupted read LBA\n");
+   if (lba != blba + i) {
+#ifdef CONFIG_NVM_DEBUG
+   struct ppa_addr *p;
+
+   p = (nr_lbas == 1) ? &rqd->ppa_list[i] : &rqd->ppa_addr;
+   print_ppa(&pblk->dev->geo, p, "seq", i);
+#endif
+   pr_err("pblk: corrupted read LBA (%llu/%llu)\n",
+   lba, (u64)blba + i);
+   WARN_ON(1);
+   }
}
 }
 
 /*
  * There can be holes in the lba list.
  */
-static void pblk_read_check_rand(struct pblk *pblk, void *meta_list,
-   u64 *lba_list, int nr_lbas)
+static void pblk_read_check_rand(struct pblk *pblk, struct nvm_rq *rqd,
+u64 *lba_list, int nr_lbas)
 {
-   struct pblk_sec_meta *meta_lba_list = meta_list;
+   struct pblk_sec_meta *meta_lba_list = rqd->meta_list;
int i, j;
 
for (i = 0, j = 0; i < nr_lbas; i++) {
@@ -145,14 +156,25 @@ static void pblk_read_check_rand(struct pblk *pblk, void 
*meta_list,
if (lba == ADDR_EMPTY)
continue;
 
-   meta_lba = le64_to_cpu(meta_lba_list[j++].lba);
+   meta_lba = le64_to_cpu(meta_lba_list[j].lba);
 
if (lba != meta_lba) {
+#ifdef CONFIG_NVM_DEBUG
+   struct ppa_addr *p;
+   int nr_ppas = rqd->nr_ppas;
+
+   p = (nr_ppas == 1) ? &rqd->ppa_list[j] : &rqd->ppa_addr;
+   print_ppa(&pblk->dev->geo, p, "seq", j);
+#endif
pr_err("pblk: corrupted read LBA (%llu/%llu)\n",
lba, meta_lba);
WARN_ON(1);
}
+
+   j++;
}
+
+   WARN_ONCE(j != rqd->nr_ppas, "pblk: corrupted random request\n");
 }
 
 static void pblk_read_put_rqd_kref(struct pblk *pblk, struct nvm_rq *rqd)
@@ -197,7 +219,7 @@ static void __pblk_end_io_read(struct pblk *pblk, struct 
nvm_rq *rqd,
WARN_ONCE(bio->bi_status, "pblk: corrupted read error\n");
 #endif
 
-   pblk_read_check_seq(pblk, rqd->meta_list, r_ctx->lba, rqd->nr_ppas);
+   pblk_read_check_seq(pblk, rqd, r_ctx->lba);
 
bio_put(bio);
if (r_ctx->private)
@@ -610,7 +632,7 @@ int pblk_submit_read_gc(struct pblk *pblk, struct 
pblk_gc_rq *gc_rq)
goto err_free_bio;
}
 
-   pblk_read_check_rand(pblk, rqd.meta_list, gc_rq->lba_list, rqd.nr_ppas);
+   pblk_read_check_rand(pblk, &rqd, gc_rq->lba_list, gc_rq->nr_secs);
 
atomic_dec(&pblk->inflight_io);
 
-- 
2.7.4



[PATCH 05/11] lightnvm: pblk: warn in case of corrupted write buffer

2018-04-16 Thread Javier González
When cleaning up buffer entries as we wrap up, their state should be
"completed". If any of the entries is in "submitted" state, it means
that something bad has happened. Trigger a warning immediately instead of
waiting for the state flag to eventually be updated, thus hiding the
issue.

Signed-off-by: Javier González 
---
 drivers/lightnvm/pblk-rb.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index 7a632913475f..024a366a995c 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -142,10 +142,9 @@ static void clean_wctx(struct pblk_w_ctx *w_ctx)
 {
int flags;
 
-try:
flags = READ_ONCE(w_ctx->flags);
-   if (!(flags & PBLK_SUBMITTED_ENTRY))
-   goto try;
+   WARN_ONCE(!(flags & PBLK_SUBMITTED_ENTRY),
+   "pblk: overwriting unsubmitted data\n");
 
/* Release flags on context. Protect from writes and reads */
smp_store_release(&w_ctx->flags, PBLK_WRITABLE_ENTRY);
-- 
2.7.4



[PATCH 09/11] lightnvm: pblk: check for chunk size before allocating it

2018-04-16 Thread Javier González
Do the check for the chunk state after making sure that the chunk type
is supported.

Fixes: 32ef9412c114 ("lightnvm: pblk: implement get log report chunk")
Signed-off-by: Javier González 
---
 drivers/lightnvm/pblk-init.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index b52855f9336b..9e3a43346d4c 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -751,14 +751,14 @@ static int pblk_setup_line_meta_20(struct pblk *pblk, 
struct pblk_line *line,
chunk->cnlb = chunk_meta->cnlb;
chunk->wp = chunk_meta->wp;
 
-   if (!(chunk->state & NVM_CHK_ST_OFFLINE))
-   continue;
-
if (chunk->type & NVM_CHK_TP_SZ_SPEC) {
WARN_ONCE(1, "pblk: custom-sized chunks unsupported\n");
continue;
}
 
+   if (!(chunk->state & NVM_CHK_ST_OFFLINE))
+   continue;
+
set_bit(pos, line->blk_bitmap);
nr_bad_chks++;
}
-- 
2.7.4



[PATCH 10/11] lightnvm: pass flag on graceful teardown to targets

2018-04-16 Thread Javier González
If the namespace is unregistered before the LightNVM target is removed
(e.g., on hot unplug), it is too late for the target to store any metadata
on the device - any attempt to write to the device will fail. In this
case, pass on a "graceful teardown" flag to the target to let it know
when this happens.

In the case of pblk, we pad the open line (close all open chunks) to
improve data retention. In the event of an ungraceful shutdown, avoid
this part and just clean up.

Signed-off-by: Javier González 
---
 drivers/lightnvm/core.c  | 10 +-
 drivers/lightnvm/pblk-core.c | 13 -
 drivers/lightnvm/pblk-gc.c   | 10 ++
 drivers/lightnvm/pblk-init.c | 14 --
 drivers/lightnvm/pblk.h  |  4 +++-
 include/linux/lightnvm.h |  2 +-
 6 files changed, 35 insertions(+), 18 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index 63171cdce270..60aa7bc5a630 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -431,7 +431,7 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct 
nvm_ioctl_create *create)
return 0;
 err_sysfs:
if (tt->exit)
-   tt->exit(targetdata);
+   tt->exit(targetdata, true);
 err_init:
blk_cleanup_queue(tqueue);
tdisk->queue = NULL;
@@ -446,7 +446,7 @@ static int nvm_create_tgt(struct nvm_dev *dev, struct 
nvm_ioctl_create *create)
return ret;
 }
 
-static void __nvm_remove_target(struct nvm_target *t)
+static void __nvm_remove_target(struct nvm_target *t, bool graceful)
 {
struct nvm_tgt_type *tt = t->type;
struct gendisk *tdisk = t->disk;
@@ -459,7 +459,7 @@ static void __nvm_remove_target(struct nvm_target *t)
tt->sysfs_exit(tdisk);
 
if (tt->exit)
-   tt->exit(tdisk->private_data);
+   tt->exit(tdisk->private_data, graceful);
 
nvm_remove_tgt_dev(t->dev, 1);
put_disk(tdisk);
@@ -489,7 +489,7 @@ static int nvm_remove_tgt(struct nvm_dev *dev, struct 
nvm_ioctl_remove *remove)
mutex_unlock(&dev->mlock);
return 1;
}
-   __nvm_remove_target(t);
+   __nvm_remove_target(t, true);
mutex_unlock(&dev->mlock);
 
return 0;
@@ -963,7 +963,7 @@ void nvm_unregister(struct nvm_dev *dev)
list_for_each_entry_safe(t, tmp, &dev->targets, list) {
if (t->dev->parent != dev)
continue;
-   __nvm_remove_target(t);
+   __nvm_remove_target(t, false);
}
mutex_unlock(&dev->mlock);
 
diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 1b871441d816..cc34d5d9652d 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1461,7 +1461,7 @@ static void pblk_line_close_meta_sync(struct pblk *pblk)
flush_workqueue(pblk->close_wq);
 }
 
-void pblk_pipeline_stop(struct pblk *pblk)
+void __pblk_pipeline_flush(struct pblk *pblk)
 {
struct pblk_line_mgmt *l_mg = &pblk->l_mg;
int ret;
@@ -1486,6 +1486,11 @@ void pblk_pipeline_stop(struct pblk *pblk)
 
flush_workqueue(pblk->bb_wq);
pblk_line_close_meta_sync(pblk);
+}
+
+void __pblk_pipeline_stop(struct pblk *pblk)
+{
+   struct pblk_line_mgmt *l_mg = &pblk->l_mg;
 
spin_lock(&l_mg->free_lock);
pblk->state = PBLK_STATE_STOPPED;
@@ -1494,6 +1499,12 @@ void pblk_pipeline_stop(struct pblk *pblk)
spin_unlock(&l_mg->free_lock);
 }
 
+void pblk_pipeline_stop(struct pblk *pblk)
+{
+   __pblk_pipeline_flush(pblk);
+   __pblk_pipeline_stop(pblk);
+}
+
 struct pblk_line *pblk_line_replace_data(struct pblk *pblk)
 {
struct pblk_line_mgmt *l_mg = &pblk->l_mg;
diff --git a/drivers/lightnvm/pblk-gc.c b/drivers/lightnvm/pblk-gc.c
index 6851a5c67189..b0cc277bf972 100644
--- a/drivers/lightnvm/pblk-gc.c
+++ b/drivers/lightnvm/pblk-gc.c
@@ -649,7 +649,7 @@ int pblk_gc_init(struct pblk *pblk)
return ret;
 }
 
-void pblk_gc_exit(struct pblk *pblk)
+void pblk_gc_exit(struct pblk *pblk, bool graceful)
 {
struct pblk_gc *gc = &pblk->gc;
 
@@ -663,10 +663,12 @@ void pblk_gc_exit(struct pblk *pblk)
if (gc->gc_reader_ts)
kthread_stop(gc->gc_reader_ts);
 
-   flush_workqueue(gc->gc_reader_wq);
+   if (graceful) {
+   flush_workqueue(gc->gc_reader_wq);
+   flush_workqueue(gc->gc_line_reader_wq);
+   }
+
destroy_workqueue(gc->gc_reader_wq);
-
-   flush_workqueue(gc->gc_line_reader_wq);
destroy_workqueue(gc->gc_line_reader_wq);
 
if (gc->gc_writer_ts)
diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 9e3a43346d4c..bfc488d0dda9 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -1118,23 +1118,25 @@ static void pblk_free(struct pblk *pblk)
kfree(pblk);
 }
 
-static void pblk_tear_down(struct pblk *pblk)
+static void pblk_tear_down(struct pblk *pblk, bool graceful)
 {
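
The series threads a single bool through the exit callbacks so a target can
tell a clean, explicit removal apart from a device that is already gone. A
minimal sketch of that callback shape, with hypothetical names:

#include <stdbool.h>
#include <stdio.h>

struct tgt_type {
        /* graceful == false: device already gone, do not touch the media */
        void (*exit)(void *private, bool graceful);
};

static void my_exit(void *private, bool graceful)
{
        if (graceful)
                printf("pad open line, flush metadata\n");
        printf("free in-memory resources\n");
}

int main(void)
{
        struct tgt_type tt = { .exit = my_exit };

        tt.exit(NULL, true);    /* explicit target removal */
        tt.exit(NULL, false);   /* hot unplug: writes would fail anyway */
        return 0;
}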

[PATCH 07/11] lightnvm: pblk: remove unnecessary indirection

2018-04-16 Thread Javier González
Remove unnecessary indirection on the read path.

Signed-off-by: Javier González 
---
 drivers/lightnvm/pblk-read.c | 12 +---
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
index 89aed634333a..2f8224354c62 100644
--- a/drivers/lightnvm/pblk-read.c
+++ b/drivers/lightnvm/pblk-read.c
@@ -102,16 +102,6 @@ static void pblk_read_ppalist_rq(struct pblk *pblk, struct 
nvm_rq *rqd,
 #endif
 }
 
-static int pblk_submit_read_io(struct pblk *pblk, struct nvm_rq *rqd)
-{
-   int err;
-
-   err = pblk_submit_io(pblk, rqd);
-   if (err)
-   return NVM_IO_ERR;
-
-   return NVM_IO_OK;
-}
 
 static void pblk_read_check_seq(struct pblk *pblk, struct nvm_rq *rqd,
sector_t blba)
@@ -485,7 +475,7 @@ int pblk_submit_read(struct pblk *pblk, struct bio *bio)
rqd->bio = int_bio;
r_ctx->private = bio;
 
-   ret = pblk_submit_read_io(pblk, rqd);
+   ret = pblk_submit_io(pblk, rqd);
if (ret) {
pr_err("pblk: read IO submission failed\n");
if (int_bio)
-- 
2.7.4



[PATCH 06/11] lightnvm: pblk: return NVM_ error on failed submission

2018-04-16 Thread Javier González
Return a meaningful error when the sanity vector I/O check fails.

Signed-off-by: Javier González 
---
 drivers/lightnvm/pblk-core.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 128101f9e606..6bc0c7f61aac 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -467,16 +467,13 @@ int pblk_submit_io(struct pblk *pblk, struct nvm_rq *rqd)
 {
struct nvm_tgt_dev *dev = pblk->dev;
 
+   atomic_inc(&pblk->inflight_io);
+
 #ifdef CONFIG_NVM_DEBUG
-   int ret;
-
-   ret = pblk_check_io(pblk, rqd);
-   if (ret)
-   return ret;
+   if (pblk_check_io(pblk, rqd))
+   return NVM_IO_ERR;
 #endif
 
-   atomic_inc(&pblk->inflight_io);
-
return nvm_submit_io(dev, rqd);
 }
 
@@ -484,16 +481,13 @@ int pblk_submit_io_sync(struct pblk *pblk, struct nvm_rq 
*rqd)
 {
struct nvm_tgt_dev *dev = pblk->dev;
 
+   atomic_inc(&pblk->inflight_io);
+
 #ifdef CONFIG_NVM_DEBUG
-   int ret;
-
-   ret = pblk_check_io(pblk, rqd);
-   if (ret)
-   return ret;
+   if (pblk_check_io(pblk, rqd))
+   return NVM_IO_ERR;
 #endif
 
-   atomic_inc(&pblk->inflight_io);
-
return nvm_submit_io_sync(dev, rqd);
 }
 
-- 
2.7.4



[PATCH 08/11] lightnvm: pblk: remove unnecessary argument

2018-04-16 Thread Javier González
Remove unnecessary argument on pblk_line_free()

Signed-off-by: Javier González 
---
 drivers/lightnvm/pblk-core.c | 6 +++---
 drivers/lightnvm/pblk-init.c | 2 +-
 drivers/lightnvm/pblk.h  | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 6bc0c7f61aac..1b871441d816 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1337,7 +1337,7 @@ static struct pblk_line *pblk_line_retry(struct pblk 
*pblk,
retry_line->emeta = line->emeta;
retry_line->meta_line = line->meta_line;
 
-   pblk_line_free(pblk, line);
+   pblk_line_free(line);
l_mg->data_line = retry_line;
spin_unlock(&l_mg->free_lock);
 
@@ -1562,7 +1562,7 @@ struct pblk_line *pblk_line_replace_data(struct pblk 
*pblk)
return new;
 }
 
-void pblk_line_free(struct pblk *pblk, struct pblk_line *line)
+void pblk_line_free(struct pblk_line *line)
 {
kfree(line->map_bitmap);
kfree(line->invalid_bitmap);
@@ -1584,7 +1584,7 @@ static void __pblk_line_put(struct pblk *pblk, struct 
pblk_line *line)
WARN_ON(line->state != PBLK_LINESTATE_GC);
line->state = PBLK_LINESTATE_FREE;
line->gc_group = PBLK_LINEGC_NONE;
-   pblk_line_free(pblk, line);
+   pblk_line_free(line);
spin_unlock(>lock);
 
atomic_dec(>pipeline_gc);
diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 8f8c9abd14fc..b52855f9336b 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -509,7 +509,7 @@ static void pblk_lines_free(struct pblk *pblk)
for (i = 0; i < l_mg->nr_lines; i++) {
line = &pblk->lines[i];
 
-   pblk_line_free(pblk, line);
+   pblk_line_free(line);
pblk_line_meta_free(line);
}
spin_unlock(&l_mg->free_lock);
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 856ac41e1201..ff0e5e6421dd 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -766,7 +766,7 @@ struct pblk_line *pblk_line_get_data(struct pblk *pblk);
 struct pblk_line *pblk_line_get_erase(struct pblk *pblk);
 int pblk_line_erase(struct pblk *pblk, struct pblk_line *line);
 int pblk_line_is_full(struct pblk_line *line);
-void pblk_line_free(struct pblk *pblk, struct pblk_line *line);
+void pblk_line_free(struct pblk_line *line);
 void pblk_line_close_meta(struct pblk *pblk, struct pblk_line *line);
 void pblk_line_close(struct pblk *pblk, struct pblk_line *line);
 void pblk_line_close_ws(struct work_struct *work);
-- 
2.7.4



Change device block count from userspace?

2018-04-16 Thread Manuel Lauss
Hello,

I have a SATA SSD which suddenly reports its size as 2.2TB, 0xffffffff
block count:

[   30.620086] sd 10:0:0:0: [sdc] 4294967295 512-byte logical blocks:
(2.20 TB/2.00 TiB)
[   30.620327] sd 10:0:0:0: [sdc] Write Protect is off
[   30.620329] sd 10:0:0:0: [sdc] Mode Sense: 43 00 00 00
[   30.620771] sd 10:0:0:0: [sdc] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA
[   34.170412] GPT:Primary header thinks Alt. header is not at the end
of the disk.
[   34.170413] GPT:4004704367 != 4294967294
[   34.170414] GPT:Alternate GPT header not at the end of the disk.
[   34.170414] GPT:4004704367 != 4294967294
[   34.170414] GPT: Use GNU Parted to correct GPT errors.
[   34.170419]  sdc: sdc1
[   34.172408] sd 10:0:0:0: [sdc] Attached SCSI disk
[   64.293698] sd 10:0:0:0: [sdc] tag#15 uas_eh_abort_handler 0
uas-tag 1 inflight: CMD IN
[   64.293702] sd 10:0:0:0: [sdc] tag#15 CDB: Read(10) 28 00 ff ff ff
00 00 00 08 00

It's impossible to access this device with either dd or any other
tools because apparently
the kernel tries to access higher blocks which, one after the other,
time out after around 5 minutes.  It's possible to access the disk on
Windows with an Ext4 driver, data seems fine but retrieval is very
slow.

Is there a way from userspace to tell the block layer that this device
has only 4004704368 blocks, and that it should not touch the higher
ones?

Thanks!
  Manuel
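
For what it's worth, 4294967295 is 0xffffffff, the READ CAPACITY(10)
overflow sentinel: a device returns it when (it believes) the capacity does
not fit in 32 bits, and the host is then expected to issue READ
CAPACITY(16). Assuming that is what this drive (or the USB-UAS bridge,
given the uas messages above) is doing, the kernel has no better number to
use; the GPT alternate header is what gives away the real size. A
device-mapper linear target over the first 4004704368 sectors is one way to
expose a correctly sized view from userspace. A small sketch of the
sentinel rule:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint32_t last_lba10 = 4294967295u;      /* what the drive reports */

        if (last_lba10 == 0xffffffffu)
                printf("RC(10) overflow sentinel: use READ CAPACITY(16)\n");

        /* the real end of the disk, per the GPT alternate header */
        printf("real last LBA: %u\n", 4004704367u);
        return 0;
}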


[PATCH] lightnvm: pblk: remove unnecessary bio_get/put

2018-04-16 Thread Javier González
In the read path, pblk gets a reference to the incoming bio and puts it
after ending the bio. Though this behavior is correct, it is unnecessary
since pblk is the one putting the bio, therefore, it cannot disappear
underneath it.

Removing this reference allows us to clean up rqd->bio and avoid pointer
bouncing for the different read paths. Now, the incoming bio always
resides in the read context and pblk's internal bios (if any) reside in
rqd->bio.

Signed-off-by: Javier González 
---
 drivers/lightnvm/pblk-read.c | 57 +++-
 1 file changed, 24 insertions(+), 33 deletions(-)

diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
index 2f8224354c62..5464e4177c87 100644
--- a/drivers/lightnvm/pblk-read.c
+++ b/drivers/lightnvm/pblk-read.c
@@ -39,10 +39,10 @@ static int pblk_read_from_cache(struct pblk *pblk, struct 
bio *bio,
 }
 
 static void pblk_read_ppalist_rq(struct pblk *pblk, struct nvm_rq *rqd,
-sector_t blba, unsigned long *read_bitmap)
+struct bio *bio, sector_t blba,
+unsigned long *read_bitmap)
 {
struct pblk_sec_meta *meta_list = rqd->meta_list;
-   struct bio *bio = rqd->bio;
struct ppa_addr ppas[PBLK_MAX_REQ_ADDRS];
int nr_secs = rqd->nr_ppas;
bool advanced_bio = false;
@@ -189,7 +189,6 @@ static void pblk_end_user_read(struct bio *bio)
WARN_ONCE(bio->bi_status, "pblk: corrupted read bio\n");
 #endif
bio_endio(bio);
-   bio_put(bio);
 }
 
 static void __pblk_end_io_read(struct pblk *pblk, struct nvm_rq *rqd,
@@ -197,23 +196,18 @@ static void __pblk_end_io_read(struct pblk *pblk, struct 
nvm_rq *rqd,
 {
struct nvm_tgt_dev *dev = pblk->dev;
struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
-   struct bio *bio = rqd->bio;
+   struct bio *int_bio = rqd->bio;
unsigned long start_time = r_ctx->start_time;
 
generic_end_io_acct(dev->q, READ, &pblk->disk->part0, start_time);
 
if (rqd->error)
pblk_log_read_err(pblk, rqd);
-#ifdef CONFIG_NVM_DEBUG
-   else
-   WARN_ONCE(bio->bi_status, "pblk: corrupted read error\n");
-#endif
 
pblk_read_check_seq(pblk, rqd, r_ctx->lba);
 
-   bio_put(bio);
-   if (r_ctx->private)
-   pblk_end_user_read((struct bio *)r_ctx->private);
+   if (int_bio)
+   bio_put(int_bio);
 
if (put_line)
pblk_read_put_rqd_kref(pblk, rqd);
@@ -230,16 +224,19 @@ static void __pblk_end_io_read(struct pblk *pblk, struct 
nvm_rq *rqd,
 static void pblk_end_io_read(struct nvm_rq *rqd)
 {
struct pblk *pblk = rqd->private;
+   struct pblk_g_ctx *r_ctx = nvm_rq_to_pdu(rqd);
+   struct bio *bio = (struct bio *)r_ctx->private;
 
+   pblk_end_user_read(bio);
__pblk_end_io_read(pblk, rqd, true);
 }
 
-static int pblk_partial_read_bio(struct pblk *pblk, struct nvm_rq *rqd,
-unsigned int bio_init_idx,
-unsigned long *read_bitmap)
+static int pblk_partial_read(struct pblk *pblk, struct nvm_rq *rqd,
+struct bio *orig_bio, unsigned int bio_init_idx,
+unsigned long *read_bitmap)
 {
-   struct bio *new_bio, *bio = rqd->bio;
struct pblk_sec_meta *meta_list = rqd->meta_list;
+   struct bio *new_bio;
struct bio_vec src_bv, dst_bv;
void *ppa_ptr = NULL;
void *src_p, *dst_p;
@@ -319,7 +316,7 @@ static int pblk_partial_read_bio(struct pblk *pblk, struct 
nvm_rq *rqd,
meta_list[hole].lba = lba_list_media[i];
 
src_bv = new_bio->bi_io_vec[i++];
-   dst_bv = bio->bi_io_vec[bio_init_idx + hole];
+   dst_bv = orig_bio->bi_io_vec[bio_init_idx + hole];
 
src_p = kmap_atomic(src_bv.bv_page);
dst_p = kmap_atomic(dst_bv.bv_page);
@@ -338,28 +335,26 @@ static int pblk_partial_read_bio(struct pblk *pblk, 
struct nvm_rq *rqd,
 
bio_put(new_bio);
 
-   /* Complete the original bio and associated request */
-   bio_endio(bio);
-   rqd->bio = bio;
+   /* restore original request */
+   rqd->bio = NULL;
rqd->nr_ppas = nr_secs;
 
__pblk_end_io_read(pblk, rqd, false);
-   return NVM_IO_OK;
+   return NVM_IO_DONE;
 
 err:
pr_err("pblk: failed to perform partial read\n");
 
/* Free allocated pages in new bio */
-   pblk_bio_free_pages(pblk, bio, 0, new_bio->bi_vcnt);
+   pblk_bio_free_pages(pblk, orig_bio, 0, new_bio->bi_vcnt);
__pblk_end_io_read(pblk, rqd, false);
return NVM_IO_ERR;
 }
 
-static void pblk_read_rq(struct pblk *pblk, struct nvm_rq *rqd,
+static void pblk_read_rq(struct pblk *pblk, struct nvm_rq *rqd, struct bio 
*bio,
 sector_t lba, unsigned long *read_bitmap)
 {
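
The pair being removed was redundant because of the usual refcount
ownership rule: whoever already holds a reference cannot have the object
freed underneath it, so taking and dropping an extra reference around its
own use adds nothing. A minimal userspace sketch with a toy refcount:

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct obj { atomic_int ref; };

static void obj_get(struct obj *o) { atomic_fetch_add(&o->ref, 1); }

static void obj_put(struct obj *o)
{
        if (atomic_fetch_sub(&o->ref, 1) == 1) {        /* last reference */
                printf("freeing\n");
                free(o);
        }
}

int main(void)
{
        struct obj *o = malloc(sizeof(*o));

        atomic_init(&o->ref, 1);        /* the creator owns one reference */

        obj_get(o);     /* redundant pair: the owner's reference already */
        obj_put(o);     /* keeps the object alive for this whole scope   */

        obj_put(o);     /* the owner's put is what actually frees */
        return 0;
}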
  

[PATCH] lightnvm: pblk: take bitmap alloc. out of critical section

2018-04-16 Thread Javier González
Allocate line bitmaps outside of the line lock during line preparation.

Signed-off-by: Javier González 
---
 drivers/lightnvm/pblk-core.c | 96 +---
 1 file changed, 55 insertions(+), 41 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 5f960a6609c8..7762e89984ee 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1058,6 +1058,25 @@ static int pblk_line_init_metadata(struct pblk *pblk, 
struct pblk_line *line,
return 1;
 }
 
+static int pblk_line_alloc_bitmaps(struct pblk *pblk, struct pblk_line *line)
+{
+   struct pblk_line_meta *lm = &pblk->lm;
+
+   line->map_bitmap = kzalloc(lm->sec_bitmap_len, GFP_KERNEL);
+   if (!line->map_bitmap)
+   return -ENOMEM;
+
+   /* will be initialized using bb info from map_bitmap */
+   line->invalid_bitmap = kmalloc(lm->sec_bitmap_len, GFP_KERNEL);
+   if (!line->invalid_bitmap) {
+   kfree(line->map_bitmap);
+   line->map_bitmap = NULL;
+   return -ENOMEM;
+   }
+
+   return 0;
+}
+
 /* For now lines are always assumed full lines. Thus, smeta former and current
  * lun bitmaps are omitted.
  */
@@ -1162,18 +1181,7 @@ static int pblk_line_prepare(struct pblk *pblk, struct 
pblk_line *line)
 {
struct pblk_line_meta *lm = &pblk->lm;
int blk_in_line = atomic_read(&line->blk_in_line);
-   int blk_to_erase, ret;
-
-   line->map_bitmap = kzalloc(lm->sec_bitmap_len, GFP_ATOMIC);
-   if (!line->map_bitmap)
-   return -ENOMEM;
-
-   /* will be initialized using bb info from map_bitmap */
-   line->invalid_bitmap = kmalloc(lm->sec_bitmap_len, GFP_ATOMIC);
-   if (!line->invalid_bitmap) {
-   ret = -ENOMEM;
-   goto fail_free_map_bitmap;
-   }
+   int blk_to_erase;
 
/* Bad blocks do not need to be erased */
bitmap_copy(line->erase_bitmap, line->blk_bitmap, lm->blk_per_line);
@@ -1191,15 +1199,15 @@ static int pblk_line_prepare(struct pblk *pblk, struct 
pblk_line *line)
}
 
if (blk_in_line < lm->min_blk_line) {
-   ret = -EAGAIN;
-   goto fail_free_invalid_bitmap;
+   spin_unlock(&line->lock);
+   return -EAGAIN;
}
 
if (line->state != PBLK_LINESTATE_FREE) {
WARN(1, "pblk: corrupted line %d, state %d\n",
line->id, line->state);
-   ret = -EINTR;
-   goto fail_free_invalid_bitmap;
+   spin_unlock(&line->lock);
+   return -EINTR;
}
 
line->state = PBLK_LINESTATE_OPEN;
@@ -1213,16 +1221,6 @@ static int pblk_line_prepare(struct pblk *pblk, struct 
pblk_line *line)
kref_init(&line->ref);
 
return 0;
-
-fail_free_invalid_bitmap:
-   spin_unlock(&line->lock);
-   kfree(line->invalid_bitmap);
-   line->invalid_bitmap = NULL;
-fail_free_map_bitmap:
-   kfree(line->map_bitmap);
-   line->map_bitmap = NULL;
-
-   return ret;
 }
 
 int pblk_line_recov_alloc(struct pblk *pblk, struct pblk_line *line)
@@ -1242,13 +1240,15 @@ int pblk_line_recov_alloc(struct pblk *pblk, struct 
pblk_line *line)
}
spin_unlock(&l_mg->free_lock);
 
-   pblk_rl_free_lines_dec(&pblk->rl, line, true);
+   if (pblk_line_alloc_bitmaps(pblk, line))
+   return -EINTR;
 
if (!pblk_line_init_bb(pblk, line, 0)) {
list_add(&line->list, &l_mg->free_list);
return -EINTR;
}
 
+   pblk_rl_free_lines_dec(&pblk->rl, line, true);
return 0;
 }
 
@@ -1260,6 +1260,24 @@ void pblk_line_recov_close(struct pblk *pblk, struct 
pblk_line *line)
line->emeta = NULL;
 }
 
+static void pblk_line_clear(struct pblk_line *line)
+{
+   *line->vsc = cpu_to_le32(EMPTY_ENTRY);
+
+   line->map_bitmap = NULL;
+   line->invalid_bitmap = NULL;
+   line->smeta = NULL;
+   line->emeta = NULL;
+}
+
+void pblk_line_free(struct pblk_line *line)
+{
+   kfree(line->map_bitmap);
+   kfree(line->invalid_bitmap);
+
+   pblk_line_clear(line);
+}
+
 struct pblk_line *pblk_line_get(struct pblk *pblk)
 {
struct pblk_line_mgmt *l_mg = &pblk->l_mg;
@@ -1326,11 +1344,14 @@ static struct pblk_line *pblk_line_retry(struct pblk 
*pblk,
return NULL;
}
 
+   retry_line->map_bitmap = line->map_bitmap;
+   retry_line->invalid_bitmap = line->invalid_bitmap;
retry_line->smeta = line->smeta;
retry_line->emeta = line->emeta;
retry_line->meta_line = line->meta_line;
 
-   pblk_line_free(line);
+   pblk_line_clear(line);
+
l_mg->data_line = retry_line;
spin_unlock(&l_mg->free_lock);
 
@@ -1383,6 +1404,9 @@ struct pblk_line *pblk_line_get_first_data(struct pblk 
*pblk)
}
spin_unlock(&l_mg->free_lock);
 
+   if (pblk_line_alloc_bitmaps(pblk, line))
+  
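
Moving the two bitmap allocations out of the line lock is what lets them
drop GFP_ATOMIC for GFP_KERNEL: a sleeping allocation must never run under
a spinlock. A userspace sketch of the same shape using pthreads, with
hypothetical names:

#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static void *shared;

static int prepare(size_t len)
{
        /* allocate first, while blocking is still allowed ... */
        void *buf = malloc(len);

        if (!buf)
                return -1;

        /* ... then take the lock only to publish the result */
        pthread_mutex_lock(&lock);
        shared = buf;
        pthread_mutex_unlock(&lock);
        return 0;
}

int main(void)
{
        return prepare(64) ? 1 : 0;
}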

Re: [PATCH 06/12] dma-mapping: move the NEED_DMA_MAP_STATE config symbol to lib/Kconfig

2018-04-16 Thread Anshuman Khandual
On 04/15/2018 08:29 PM, Christoph Hellwig wrote:
> This way we have one central definition of it, and user can select it as
> needed.  Note that we now also always select it when CONFIG_DMA_API_DEBUG
> is select, which fixes some incorrect checks in a few network drivers.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Anshuman Khandual 



Re: [PATCH 05/12] scatterlist: move the NEED_SG_DMA_LENGTH config symbol to lib/Kconfig

2018-04-16 Thread Anshuman Khandual
On 04/15/2018 08:29 PM, Christoph Hellwig wrote:
> This way we have one central definition of it, and user can select it as
> needed.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Anshuman Khandual 



[PATCH] target: fix crash with iscsi target and dvd

2018-04-16 Thread Ming Lei
When the current page can't be added to the bio, a new bio should be
created so the page can be added again, instead of the page being ignored.

This patch fixes a kernel crash with the iSCSI target and DVD, as reported
by Wakko.

Cc: Wakko Warner 
Cc: Bart Van Assche 
Cc: target-de...@vger.kernel.org
Cc: linux-s...@vger.kernel.org
Cc: "Nicholas A. Bellinger" 
Cc: Christoph Hellwig 
Fixes: 84c8590646d5b35804 ("target: avoid accessing .bi_vcnt directly")
Signed-off-by: Ming Lei 
---
 drivers/target/target_core_pscsi.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/target/target_core_pscsi.c 
b/drivers/target/target_core_pscsi.c
index 0d99b242e82e..6cb933ecc084 100644
--- a/drivers/target/target_core_pscsi.c
+++ b/drivers/target/target_core_pscsi.c
@@ -890,6 +890,7 @@ pscsi_map_sg(struct se_cmd *cmd, struct scatterlist *sgl, 
u32 sgl_nents,
bytes = min(bytes, data_len);
 
if (!bio) {
+new_bio:
nr_vecs = min_t(int, BIO_MAX_PAGES, nr_pages);
nr_pages -= nr_vecs;
/*
@@ -931,6 +932,7 @@ pscsi_map_sg(struct se_cmd *cmd, struct scatterlist *sgl, 
u32 sgl_nents,
 * be allocated with pscsi_get_bio() above.
 */
bio = NULL;
+   goto new_bio;
}
 
data_len -= bytes;
-- 
2.9.5
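
The one-line goto turns "page did not fit" into "start a new container and
retry the same page" instead of silently dropping it. A minimal userspace
sketch of that control flow, with a toy fixed-capacity container standing
in for the bio:

#include <stdio.h>

#define CAP 2

struct bio { int n; int pages[CAP]; };

static int bio_add(struct bio *b, int page)
{
        if (!b || b->n == CAP)
                return 0;               /* no bio, or bio full: add fails */
        b->pages[b->n++] = page;
        return 1;
}

int main(void)
{
        struct bio bios[8];
        struct bio *bio = NULL;
        int nbios = 0, page;

        for (page = 0; page < 5; page++) {
retry:
                if (!bio) {             /* allocate a fresh container */
                        bio = &bios[nbios++];
                        bio->n = 0;
                }
                if (!bio_add(bio, page)) {
                        bio = NULL;     /* "submit" the full bio ... */
                        goto retry;     /* ... and retry the SAME page */
                }
        }
        printf("used %d bios for 5 pages\n", nbios);    /* prints 3 */
        return 0;
}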



Re: [PATCH 04/12] iommu-helper: move the IOMMU_HELPER config symbol to lib/

2018-04-16 Thread Anshuman Khandual
On 04/15/2018 08:29 PM, Christoph Hellwig wrote:
> This way we have one central definition of it, and user can select it as
> needed.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Anshuman Khandual 



Re: [PATCH] bcache: not access dc->bdev when backing device is offline

2018-04-16 Thread tang . junhui
Hi Coly,

>On 2018/4/16 4:56 PM, tang.jun...@zte.com.cn wrote:
>> Hi Coly,
>> 
>>> When backing device is offline, memory object pointed by dc->bdev might be
>>> released in some conditions. I have a bug report that bdevname() triggers
>>> a NULL pointer dereference panic inside bch_cached_dev_error(), where
>>> dc->bdev is NULL.
>>>
>>> This patch adds char backing_dev_name[BDEVNAME_SIZE] in struct cached_dev,
>>> and stores the backing device name in backing_dev_name during backing device
>>> registration. Then when the backing device is offline, pr_err() can still
>>> print a message with the backing device name without calling bdevname(). And
>>> backing_dev_name is also helpful when debugging crash code.
>>>
>>> Fixes: c7b7bd07404c ("bcache: add io_disable to struct cached_dev")
>>> Signed-off-by: Coly Li 
>>> ---
>>> drivers/md/bcache/bcache.h |  2 ++
>>> drivers/md/bcache/super.c  | 10 --
>>> 2 files changed, 6 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
>>> index d338b7086013..9af0ad7f6243 100644
>>> --- a/drivers/md/bcache/bcache.h
>>> +++ b/drivers/md/bcache/bcache.h
>>> @@ -392,6 +392,8 @@ struct cached_dev {
>>> #define DEFAULT_CACHED_DEV_ERROR_LIMIT64
>>> atomic_tio_errors;
>>> unsignederror_limit;
>>> +
>>> +charbacking_dev_name[BDEVNAME_SIZE];
>>> };
>>>
>>> enum alloc_reserve {
>>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>>> index d90d9e59ca00..f2512e3e67e6 100644
>>> --- a/drivers/md/bcache/super.c
>>> +++ b/drivers/md/bcache/super.c
>>> @@ -1225,13 +1225,13 @@ static void register_bdev(struct cache_sb *sb, 
>>> struct page *sb_page,
>>>  struct block_device *bdev,
>>>  struct cached_dev *dc)
>>> {
>>> -char name[BDEVNAME_SIZE];
>>> const char *err = "cannot allocate memory";
>>> struct cache_set *c;
>>>
>>> memcpy(&dc->sb, sb, sizeof(struct cache_sb));
>>> dc->bdev = bdev;
>>> dc->bdev->bd_holder = dc;
>>> +bdevname(bdev, dc->backing_dev_name);
>>>
>>> bio_init(&dc->sb_bio, dc->sb_bio.bi_inline_vecs, 1);
>>> bio_first_bvec_all(&dc->sb_bio)->bv_page = sb_page;
>>> @@ -1247,7 +1247,7 @@ static void register_bdev(struct cache_sb *sb, struct 
>>> page *sb_page,
>>> if (bch_cache_accounting_add_kobjs(&dc->accounting, &dc->disk.kobj))
>>> goto err;
>>>
>>> -pr_info("registered backing device %s", bdevname(bdev, name));
>>> +pr_info("registered backing device %s", dc->backing_dev_name);
>>>
>>> list_add(&dc->list, &uncached_devices);
>>> list_for_each_entry(c, &bch_cache_sets, list)
>>> @@ -1259,7 +1259,7 @@ static void register_bdev(struct cache_sb *sb, struct 
>>> page *sb_page,
>>>
>>> return;
>>> err:
>>> -pr_notice("error %s: %s", bdevname(bdev, name), err);
>>> +pr_notice("error %s: %s", dc->backing_dev_name, err);
>>> bcache_device_stop(&dc->disk);
>>> }
>>>
>>> @@ -1367,8 +1367,6 @@ int bch_flash_dev_create(struct cache_set *c, 
>>> uint64_t size)
>>>
>>> bool bch_cached_dev_error(struct cached_dev *dc)
>>> {
>>> -char name[BDEVNAME_SIZE];
>>> -
>>> if (!dc || test_bit(BCACHE_DEV_CLOSING, &dc->disk.flags))
>>> return false;
>>>
>>> @@ -1377,7 +1375,7 @@ bool bch_cached_dev_error(struct cached_dev *dc)
>>> smp_mb();
>>>
>>> pr_err("stop %s: too many IO errors on backing device %s\n",
>>> -dc->disk.disk->disk_name, bdevname(dc->bdev, name));
>>> +dc->disk.disk->disk_name, dc->backing_dev_name);
>>>
>>> bcache_device_stop(&dc->disk);
>>> return true;
>>> -- 
>>> 2.16.2
>> 
>> Since you have written it in dc->backing_dev_name, could you change all
>> bdevname(dc->bdev, buf) into dc->backing_dev_name in bcache?
>
>Hi Junhui,
>
>Yes, this is on my pipeline. So far I just want to make a minimal
>change, since the problematic patch is in 4.17 pre-rc1 already. Now this
>patch is under our internal testing already, once the fix is verified, I
>will ask Michael and Jens to pick it as a minor fix.
>
>The changes as you suggested will be a bit larger and not in emergent
>pipe for this moment.

OK, LGTM.
Reviewed-by: Tang Junhui 

Thanks.
Tang Junhui
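
The fix is a snapshot-at-registration pattern: copy the name into memory
whose lifetime the owner controls, so late error paths never chase a
pointer that may already be gone. A minimal userspace sketch:

#include <stdio.h>
#include <string.h>

#define BDEVNAME_SIZE 32

struct bdev { char name[BDEVNAME_SIZE]; };

struct cached_dev {
        struct bdev *bdev;                      /* may become NULL offline */
        char backing_dev_name[BDEVNAME_SIZE];   /* owned copy, always valid */
};

static void register_bdev(struct cached_dev *dc, struct bdev *bdev)
{
        dc->bdev = bdev;
        /* snapshot the name while the device is certainly alive */
        snprintf(dc->backing_dev_name, BDEVNAME_SIZE, "%s", bdev->name);
}

int main(void)
{
        struct bdev b = { "sdc" };
        struct cached_dev dc;

        register_bdev(&dc, &b);
        dc.bdev = NULL;                         /* backing device went away */
        printf("stop: too many IO errors on %s\n", dc.backing_dev_name);
        return 0;
}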


[PATCH] lightnvm: pblk: add CNEX Granby to the pblk device list

2018-04-16 Thread Wei Xu
Add a new lightnvm quirk to identify CNEX’s Granby controller.

Signed-off-by: Wei Xu 
---
 drivers/nvme/host/pci.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index cb73bc8..9419e88 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2529,6 +2529,8 @@ static const struct pci_device_id nvme_id_table[] = {
.driver_data = NVME_QUIRK_LIGHTNVM, },
{ PCI_DEVICE(0x1d1d, 0x2807),   /* CNEX WL */
.driver_data = NVME_QUIRK_LIGHTNVM, },
+   { PCI_DEVICE(0x1d1d, 0x2601),   /* CNEX Granby */
+   .driver_data = NVME_QUIRK_LIGHTNVM, },
{ PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xff) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2003) },
-- 
2.7.4
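
The one-liner works because probe walks a vendor/device table and applies
the quirk bits carried in driver_data. A minimal userspace sketch of that
match-and-quirk lookup; the table below is a toy modeled on the IDs in the
patch, not the full nvme one:

#include <stdio.h>

#define QUIRK_LIGHTNVM 0x1u

struct dev_id { unsigned int vendor, device, quirks; };

static const struct dev_id id_table[] = {
        { 0x1d1d, 0x2807, QUIRK_LIGHTNVM },     /* CNEX WL */
        { 0x1d1d, 0x2601, QUIRK_LIGHTNVM },     /* CNEX Granby (new) */
        { 0 }                                   /* table terminator */
};

static unsigned int quirks_for(unsigned int vendor, unsigned int device)
{
        const struct dev_id *p;

        for (p = id_table; p->vendor; p++)
                if (p->vendor == vendor && p->device == device)
                        return p->quirks;
        return 0;
}

int main(void)
{
        printf("granby quirks: %#x\n", quirks_for(0x1d1d, 0x2601));
        return 0;
}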



Re: [PATCH] bcache: not access dc->bdev when backing device is offline

2018-04-16 Thread Coly Li
On 2018/4/16 4:56 PM, tang.jun...@zte.com.cn wrote:
> Hi Coly,
> 
>> When backing device is offline, memory object pointed by dc->bdev might be
>> released in some conditions. I have a bug report that bdevname() triggers
>> a NULL pointer dereference panic inside bch_cached_dev_error(), where
>> dc->bdev is NULL.
>>
>> This patch adds char backing_dev_name[BDEVNAME_SIZE] in struct cached_dev,
>> and stores the backing device name in backing_dev_name during backing device
>> registration. Then when the backing device is offline, pr_err() can still
>> print a message with the backing device name without calling bdevname(). And
>> backing_dev_name is also helpful when debugging crash code.
>>
>> Fixes: c7b7bd07404c ("bcache: add io_disable to struct cached_dev")
>> Signed-off-by: Coly Li 
>> ---
>> drivers/md/bcache/bcache.h |  2 ++
>> drivers/md/bcache/super.c  | 10 --
>> 2 files changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
>> index d338b7086013..9af0ad7f6243 100644
>> --- a/drivers/md/bcache/bcache.h
>> +++ b/drivers/md/bcache/bcache.h
>> @@ -392,6 +392,8 @@ struct cached_dev {
>> #define DEFAULT_CACHED_DEV_ERROR_LIMIT64
>> atomic_tio_errors;
>> unsignederror_limit;
>> +
>> +charbacking_dev_name[BDEVNAME_SIZE];
>> };
>>
>> enum alloc_reserve {
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index d90d9e59ca00..f2512e3e67e6 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -1225,13 +1225,13 @@ static void register_bdev(struct cache_sb *sb, 
>> struct page *sb_page,
>>  struct block_device *bdev,
>>  struct cached_dev *dc)
>> {
>> -char name[BDEVNAME_SIZE];
>> const char *err = "cannot allocate memory";
>> struct cache_set *c;
>>
>> memcpy(&dc->sb, sb, sizeof(struct cache_sb));
>> dc->bdev = bdev;
>> dc->bdev->bd_holder = dc;
>> +bdevname(bdev, dc->backing_dev_name);
>>
>> bio_init(&dc->sb_bio, dc->sb_bio.bi_inline_vecs, 1);
>> bio_first_bvec_all(&dc->sb_bio)->bv_page = sb_page;
>> @@ -1247,7 +1247,7 @@ static void register_bdev(struct cache_sb *sb, struct 
>> page *sb_page,
>> if (bch_cache_accounting_add_kobjs(&dc->accounting, &dc->disk.kobj))
>> goto err;
>>
>> -pr_info("registered backing device %s", bdevname(bdev, name));
>> +pr_info("registered backing device %s", dc->backing_dev_name);
>>
>> list_add(&dc->list, &uncached_devices);
>> list_for_each_entry(c, &bch_cache_sets, list)
>> @@ -1259,7 +1259,7 @@ static void register_bdev(struct cache_sb *sb, struct 
>> page *sb_page,
>>
>> return;
>> err:
>> -pr_notice("error %s: %s", bdevname(bdev, name), err);
>> +pr_notice("error %s: %s", dc->backing_dev_name, err);
>> bcache_device_stop(&dc->disk);
>> }
>>
>> @@ -1367,8 +1367,6 @@ int bch_flash_dev_create(struct cache_set *c, uint64_t 
>> size)
>>
>> bool bch_cached_dev_error(struct cached_dev *dc)
>> {
>> -char name[BDEVNAME_SIZE];
>> -
>> if (!dc || test_bit(BCACHE_DEV_CLOSING, &dc->disk.flags))
>> return false;
>>
>> @@ -1377,7 +1375,7 @@ bool bch_cached_dev_error(struct cached_dev *dc)
>> smp_mb();
>>
>> pr_err("stop %s: too many IO errors on backing device %s\n",
>> -dc->disk.disk->disk_name, bdevname(dc->bdev, name));
>> +dc->disk.disk->disk_name, dc->backing_dev_name);
>>
>> bcache_device_stop(&dc->disk);
>> return true;
>> -- 
>> 2.16.2
> 
> Since you have written it in dc->backing_dev_name, could you change all
> bdevname(dc->bdev, buf) into dc->backing_dev_name in bcache?

Hi Junhui,

Yes, this is in my pipeline. So far I just want to make a minimal
change, since the problematic patch is in 4.17 pre-rc1 already. This
patch is under our internal testing now; once the fix is verified, I
will ask Michael and Jens to pick it up as a minor fix.

The changes you suggested will be a bit larger, and they are not in the
urgent pipeline at the moment.

Coly Li
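
For readers outside bcache, the pattern this fix relies on reduces to a few
lines: snapshot the device name while the device is guaranteed alive, and
let every error path print the snapshot instead of chasing a pointer that
hot-unplug may already have torn down. A minimal userspace sketch with
hypothetical names (not bcache code):

#include <stdio.h>

#define NAME_SZ 32

struct backing_dev {
        char *kernel_name;              /* may go away on hot-unplug */
};

struct cached_dev {
        struct backing_dev *bdev;       /* NULL once the device is offline */
        char backing_dev_name[NAME_SZ]; /* stable copy, safe to print */
};

static void register_dev(struct cached_dev *dc, struct backing_dev *bdev)
{
        dc->bdev = bdev;
        /* copy the name while bdev is guaranteed valid */
        snprintf(dc->backing_dev_name, NAME_SZ, "%s", bdev->kernel_name);
}

static void report_error(struct cached_dev *dc)
{
        /* never dereferences dc->bdev, so it is safe after hot-unplug */
        fprintf(stderr, "too many IO errors on backing device %s\n",
                dc->backing_dev_name);
}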



Re: [PATCH 02/12] iommu-helper: unexport iommu_area_alloc

2018-04-16 Thread Anshuman Khandual
On 04/15/2018 08:29 PM, Christoph Hellwig wrote:
> This function is only used by built-in code.
> 
> Reviewed-by: Christoph Hellwig 

Reviewed-by: Anshuman Khandual 



Re: [PATCH v2] block: ratelimite pr_err on IO path

2018-04-16 Thread Sreekanth Reddy
On Mon, Apr 16, 2018 at 1:46 PM, Jinpu Wang  wrote:
> On Fri, Apr 13, 2018 at 6:59 PM, Martin K. Petersen
>  wrote:
>>
>> Jinpu,
>>
>> [CC:ed the mpt3sas maintainers]
>>
>> The ratelimit patch is just an attempt to treat the symptom, not the
>> cause.
> Agree. If we can fix the root cause, it will be great.
>>
>>> Thanks for asking, we updated mpt3sas driver which enables DIX support
>>> (prot_mask=0x7f), all disks are SATA SSDs, no DIF support.
>>> After reboot, the kernel reports IO errors from all the drives behind the
>>> HBA, seemingly for almost every read IO, which makes the system unusable:
>>> [   13.079375] sda: ref tag error at location 0 (rcvd 143196159)
>>> [   13.079989] sda: ref tag error at location 937702912 (rcvd 143196159)
>>> [   13.080233] sda: ref tag error at location 937703072 (rcvd 143196159)
>>> [   13.080407] sda: ref tag error at location 0 (rcvd 143196159)
>>> [   13.080594] sda: ref tag error at location 8 (rcvd 143196159)
>>
>> That sounds like a bug in the mpt3sas driver or firmware. I guess the
>> HBA could conceivably be operating a SATA device as DIX Type 0 and strip
>> the PI on the drive side. But that doesn't seem to be a particularly
>> useful mode of operation.
>>
>> Jinpu: Which firmware are you running? Also, please send us the output
>> of:
>>
>> sg_readcap -l /dev/sda
>> sg_inq -x /dev/sda
>> sg_vpd /dev/sda
>>
> Disks are INTEL SSDSC2BX48, directly attached to HBA.
> LSISAS3008: FWVersion(13.00.00.00), ChipRevision(0x02), 
> BiosVersion(08.11.00.00)
> mpt3sas_cm2: Protocol=(Initiator,Target),
> Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set
> Full,NCQ)
>
> jwang@x:~$ sudo sg_vpd /dev/sdz
> Supported VPD pages VPD page:
>   Supported VPD pages [sv]
>   Unit serial number [sn]
>   Device identification [di]
>   Mode page policy [mpp]
>   ATA information (SAT) [ai]
>   Block limits (SBC) [bl]
>   Block device characteristics (SBC) [bdc]
>   Logical block provisioning (SBC) [lbpv]
> jwang@x:~$ sudo sg_inq -x /dev/sdz
> VPD INQUIRY: extended INQUIRY data page
> inquiry: field in cdb illegal (page not supported)
> jwang@x:~$ sudo sg_readcap -l /dev/sdz
> Read Capacity results:
>Protection: prot_en=0, p_type=0, p_i_exponent=0
>Logical block provisioning: lbpme=1, lbprz=1
>Last logical block address=937703087 (0x37e436af), Number of
> logical blocks=937703088
>Logical block length=512 bytes
>Logical blocks per physical block exponent=3 [so physical block
> length=4096 bytes]
>Lowest aligned logical block address=0
> Hence:
>Device size: 480103981056 bytes, 457862.8 MiB, 480.10 GB
>
>
>> Broadcom: How is DIX supposed to work for SATA drives behind an mpt3sas
>> controller?

[Sreekanth] The current upstream mpt3sas driver doesn't have DIX support
capabilities; it supports only the DIF feature.


Thanks,
Sreekanth
>>
>> --
>> Martin K. Petersen  Oracle Linux Engineering
>
>
> Thanks!
>
> --
> Jack Wang
> Linux Kernel Developer
>
> ProfitBricks GmbH
> Greifswalder Str. 207
> D - 10405 Berlin
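
For context on the log lines quoted above: in T10 DIF Type 1 protection, the
expected reference tag of each logical block is the low 32 bits of its LBA,
so "ref tag error at location 0 (rcvd 143196159)" means the initiator
expected 0 and got 143196159 back. A self-contained sketch of that check
(illustration only, not driver code):

#include <stdint.h>
#include <stdio.h>

/* Type 1 PI: expected ref tag = lower 32 bits of the target LBA */
static int ref_tag_ok(uint64_t lba, uint32_t rcvd)
{
        return rcvd == (uint32_t)lba;
}

int main(void)
{
        if (!ref_tag_ok(0, 143196159))
                printf("ref tag error at location 0 (rcvd 143196159)\n");
        return 0;
}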


Re: v4.16-rc1 + dm-mpath + BFQ

2018-04-16 Thread Paolo Valente


> On 01 Apr 2018, at 10:56, Paolo Valente
>  wrote:
> 
> 
> 
>> On 30 Mar 2018, at 18:57, Bart Van Assche
>>  wrote:
>> 
>> On Fri, 2018-03-30 at 10:23 +0200, Paolo Valente wrote:
>>> Still 4.16-rc1, being that the version for which you reported this
>>> issue in the first place.
>> 
>> A vanilla v4.16-rc1 kernel is not sufficient to run the srp-test software
>> since RDMA/CM support for the SRP target driver is missing from that kernel.
>> That's why I asked you to use the for-next branch from my github repository
>> in a previous e-mail.
> 
> Yep, that's the branch/top commit I used (as you suggested):
> 190943ce1824 [bvanassche/for-next] scsi: mpt3sas: fix oops in error handlers 
> after shutdown/unload
> with
> bvanassche https://github.com/bvanassche/linux.git
> 
> The kernel in that branch presents itself as 4.16-rc1, but, as you
> point out, it should contain the needed support.
> 
>> Anyway, since the necessary patches are now in
>> linux-next, the srp-test software can also be run against linux-next. Here
>> are the results that I obtained with label next-20180329 and the kernel
>> config attached to your previous e-mail:
>> 
>> # while ./srp-test/run_tests -c -d -r 10 -e bfq; do :; done
>> 
>> BUG: unable to handle kernel NULL pointer dereference at 0200
>> PGD 0 P4D 0 
>> Oops: 0002 [#1] SMP PTI
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>> 1.0.0-prebuilt.qemu-project.org 04/01/2014
>> RIP: 0010:rb_erase+0x284/0x380
>> Call Trace:
>> 
>> elv_rb_del+0x24/0x30
>> bfq_remove_request+0x9a/0x2e0 [bfq]
>> ? rcu_read_lock_sched_held+0x64/0x70
>> ? update_load_avg+0x72b/0x760
>> bfq_finish_requeue_request+0x2e1/0x3b0 [bfq]
>> ? __lock_is_held+0x5a/0xa0
>> blk_mq_free_request+0x5f/0x1a0
>> blk_put_request+0x23/0x60
>> multipath_release_clone+0xe/0x10
>> dm_softirq_done+0xe3/0x270
>> __blk_mq_complete_request_remote+0x18/0x20
>> flush_smp_call_function_queue+0xa1/0x150
>> generic_smp_call_function_single_interrupt+0x13/0x30
>> smp_call_function_single_interrupt+0x4d/0x220
>> call_function_single_interrupt+0xf/0x20
>> 
>> 
> 
> This new trace just confirms my suspicions.  Looking forward to some
> feedback from Mike or Jens.  Otherwise I'll try to look into it
> myself, although I don't think I am the right person to suggest the
> best cure for this cloning issue.
> 

Hi Bart,
I tried to investigate this further, but the corruption of a cloned
request (or some other mishap) that then causes this failure occurs
somewhere earlier, in the cloning phase; and, as I feared, I was not
able to spot the mistake in that part of the code, especially because
I'm not able to reproduce the failure itself.

I might possibly have more luck after some hints from knowledgeable
people.

Otherwise, if, in your test, this failure occurs immediately after you
start the test, and if you are willing to repeat this test with my
development version of bfq, then we may have hope to get a detailed
trace of what happens under the hood.

Thanks,
Paolo



> Thanks,
> Paolo
> 
>> Bart.



Re: [PATCH] bcache: not access dc->bdev when backing device is offline

2018-04-16 Thread tang . junhui
Hi Coly,

>When the backing device is offline, the memory object pointed to by dc->bdev
>might be released in some conditions. I have a bug report that bdevname()
>triggers a NULL pointer dereference panic inside bch_cached_dev_error(),
>where dc->bdev is NULL.
>
>This patch adds char backing_dev_name[BDEVNAME_SIZE] to struct cached_dev,
>and stores the backing device name in backing_dev_name during backing device
>registration. Then when the backing device is offline, pr_err() can still
>print a message with the backing device name without calling bdevname(). And
>backing_dev_name is also helpful when debugging a crash.
>
>Fixes: c7b7bd07404c ("bcache: add io_disable to struct cached_dev")
>Signed-off-by: Coly Li 
>---
> drivers/md/bcache/bcache.h |  2 ++
> drivers/md/bcache/super.c  | 10 --
> 2 files changed, 6 insertions(+), 6 deletions(-)
>
>diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
>index d338b7086013..9af0ad7f6243 100644
>--- a/drivers/md/bcache/bcache.h
>+++ b/drivers/md/bcache/bcache.h
>@@ -392,6 +392,8 @@ struct cached_dev {
> #define DEFAULT_CACHED_DEV_ERROR_LIMIT64
> atomic_tio_errors;
> unsignederror_limit;
>+
>+charbacking_dev_name[BDEVNAME_SIZE];
> };
> 
> enum alloc_reserve {
>diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>index d90d9e59ca00..f2512e3e67e6 100644
>--- a/drivers/md/bcache/super.c
>+++ b/drivers/md/bcache/super.c
>@@ -1225,13 +1225,13 @@ static void register_bdev(struct cache_sb *sb, struct 
>page *sb_page,
>  struct block_device *bdev,
>  struct cached_dev *dc)
> {
>-char name[BDEVNAME_SIZE];
> const char *err = "cannot allocate memory";
> struct cache_set *c;
> 
> memcpy(>sb, sb, sizeof(struct cache_sb));
> dc->bdev = bdev;
> dc->bdev->bd_holder = dc;
>+bdevname(bdev, dc->backing_dev_name);
> 
> bio_init(>sb_bio, dc->sb_bio.bi_inline_vecs, 1);
> bio_first_bvec_all(>sb_bio)->bv_page = sb_page;
>@@ -1247,7 +1247,7 @@ static void register_bdev(struct cache_sb *sb, struct 
>page *sb_page,
> if (bch_cache_accounting_add_kobjs(>accounting, >disk.kobj))
> goto err;
> 
>-pr_info("registered backing device %s", bdevname(bdev, name));
>+pr_info("registered backing device %s", dc->backing_dev_name);
> 
> list_add(>list, _devices);
> list_for_each_entry(c, _cache_sets, list)
>@@ -1259,7 +1259,7 @@ static void register_bdev(struct cache_sb *sb, struct 
>page *sb_page,
> 
> return;
> err:
>-pr_notice("error %s: %s", bdevname(bdev, name), err);
>+pr_notice("error %s: %s", dc->backing_dev_name, err);
> bcache_device_stop(>disk);
> }
> 
>@@ -1367,8 +1367,6 @@ int bch_flash_dev_create(struct cache_set *c, uint64_t 
>size)
> 
> bool bch_cached_dev_error(struct cached_dev *dc)
> {
>-char name[BDEVNAME_SIZE];
>-
> if (!dc || test_bit(BCACHE_DEV_CLOSING, >disk.flags))
> return false;
> 
>@@ -1377,7 +1375,7 @@ bool bch_cached_dev_error(struct cached_dev *dc)
> smp_mb();
> 
> pr_err("stop %s: too many IO errors on backing device %s\n",
>-dc->disk.disk->disk_name, bdevname(dc->bdev, name));
>+dc->disk.disk->disk_name, dc->backing_dev_name);
> 
> bcache_device_stop(>disk);
> return true;
>-- 
>2.16.2

Since you have written it into dc->backing_dev_name, could you change all
bdevname(dc->bdev, buf) calls into dc->backing_dev_name in bcache?

Thanks.
Tang Junhui


Re: [PATCH 01/12] iommu-common: move to arch/sparc

2018-04-16 Thread Anshuman Khandual
On 04/15/2018 08:29 PM, Christoph Hellwig wrote:
> This code is only used by sparc, and all new iommu drivers should use the
> drivers/iommu/ framework.  Also remove the unused exports.
> 
> Signed-off-by: Christoph Hellwig 

Right, these functions are used only from SPARC architecture. Simple
git grep confirms it as well. Hence it makes sense to move them into
arch code instead.

git grep iommu_tbl_pool_init


arch/sparc/include/asm/iommu-common.h:extern void iommu_tbl_pool_init(struct 
iommu_map_table *iommu,
arch/sparc/kernel/iommu-common.c:void iommu_tbl_pool_init(struct 
iommu_map_table *iommu,
arch/sparc/kernel/iommu.c:  iommu_tbl_pool_init(>tbl, 
num_tsb_entries, IO_PAGE_SHIFT,
arch/sparc/kernel/ldc.c:iommu_tbl_pool_init(iommu, num_tsb_entries, 
PAGE_SHIFT,
arch/sparc/kernel/pci_sun4v.c:  iommu_tbl_pool_init(>tbl, num_iotte, 
IO_PAGE_SHIFT,
arch/sparc/kernel/pci_sun4v.c:  iommu_tbl_pool_init(>tbl, 
num_tsb_entries, IO_PAGE_SHIFT,

git grep iommu_tbl_range_alloc
--

arch/sparc/include/asm/iommu-common.h:extern unsigned long 
iommu_tbl_range_alloc(struct device *dev,
arch/sparc/kernel/iommu-common.c:unsigned long iommu_tbl_range_alloc(struct 
device *dev,
arch/sparc/kernel/iommu.c:  entry = iommu_tbl_range_alloc(dev, >tbl, 
npages, NULL,
arch/sparc/kernel/iommu.c:  entry = iommu_tbl_range_alloc(dev, 
>tbl, npages,
arch/sparc/kernel/ldc.c:entry = iommu_tbl_range_alloc(NULL, 
>iommu_map_table,
arch/sparc/kernel/pci_sun4v.c:  entry = iommu_tbl_range_alloc(dev, tbl, npages, 
NULL,
arch/sparc/kernel/pci_sun4v.c:  entry = iommu_tbl_range_alloc(dev, tbl, npages, 
NULL,
arch/sparc/kernel/pci_sun4v.c:  entry = iommu_tbl_range_alloc(dev, tbl, 
npages,

git grep iommu_tbl_range_free
-

arch/sparc/include/asm/iommu-common.h:extern void iommu_tbl_range_free(struct 
iommu_map_table *iommu,
arch/sparc/kernel/iommu-common.c:void iommu_tbl_range_free(struct 
iommu_map_table *iommu, u64 dma_addr,
arch/sparc/kernel/iommu.c:  iommu_tbl_range_free(>tbl, dvma, npages, 
IOMMU_ERROR_CODE);
arch/sparc/kernel/iommu.c:  iommu_tbl_range_free(>tbl, bus_addr, 
npages, IOMMU_ERROR_CODE);
arch/sparc/kernel/iommu.c:  
iommu_tbl_range_free(>tbl, vaddr, npages,
arch/sparc/kernel/iommu.c:  iommu_tbl_range_free(>tbl, 
dma_handle, npages,
arch/sparc/kernel/ldc.c:iommu_tbl_range_free(>iommu_map_table, 
cookie, npages, entry);
arch/sparc/kernel/pci_sun4v.c:  iommu_tbl_range_free(tbl, *dma_addrp, npages, 
IOMMU_ERROR_CODE);
arch/sparc/kernel/pci_sun4v.c:  iommu_tbl_range_free(tbl, dvma, npages, 
IOMMU_ERROR_CODE);
arch/sparc/kernel/pci_sun4v.c:  iommu_tbl_range_free(tbl, bus_addr, npages, 
IOMMU_ERROR_CODE);
arch/sparc/kernel/pci_sun4v.c:  iommu_tbl_range_free(tbl, bus_addr, npages, 
IOMMU_ERROR_CODE);
arch/sparc/kernel/pci_sun4v.c:  iommu_tbl_range_free(tbl, 
vaddr, npages,
arch/sparc/kernel/pci_sun4v.c:  iommu_tbl_range_free(tbl, dma_handle, 
npages,

Reviewed-by: Anshuman Khandual 



remove PCI_DMA_BUS_IS_PHYS

2018-04-16 Thread Christoph Hellwig
Hi all,

this series tries to get rid of the global PCI_DMA_BUS_IS_PHYS flag,
which causes the block layer and networking code to bounce buffer memory
above the dma mask in some cases.  It is a leftover from i386 + highmem
days and is obsolete now that we have swiotlb or iommus so that the
dma ops implementations can always (well minus the ISA DMA case which
will require further attention) handle memory passed to them.
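
The per-driver pattern the series removes can be summarized in one
before/after sketch (hypothetical driver names; none of the real drivers is
exactly this shape):

#include <linux/blkdev.h>
#include <linux/dma-mapping.h>

/* before: the driver duplicated its DMA capability as a bounce limit */
static int olddrv_init_queue(struct request_queue *q, struct device *dev)
{
        if (dma_set_mask(dev, DMA_BIT_MASK(32)))
                return -EIO;
        blk_queue_bounce_limit(q, DMA_BIT_MASK(32));    /* now redundant */
        return 0;
}

/* after: the dma-mapping layer (iommu or swiotlb) bounces as needed */
static int newdrv_init_queue(struct request_queue *q, struct device *dev)
{
        return dma_set_mask(dev, DMA_BIT_MASK(32));
}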


[PATCH 02/12] storvsc: don't set a bounce limit

2018-04-16 Thread Christoph Hellwig
The default already is to never bounce, so the call is a no-op.

Signed-off-by: Christoph Hellwig 
---
 drivers/scsi/storvsc_drv.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 8c51d628b52e..5f2d177c3bd9 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1382,9 +1382,6 @@ static int storvsc_device_alloc(struct scsi_device 
*sdevice)
 
 static int storvsc_device_configure(struct scsi_device *sdevice)
 {
-
-   blk_queue_bounce_limit(sdevice->request_queue, BLK_BOUNCE_ANY);
-
blk_queue_rq_timeout(sdevice->request_queue, (storvsc_timeout * HZ));
 
/* Ensure there are no gaps in presented sgls */
-- 
2.17.0



[PATCH 04/12] DAC960: don't use block layer bounce buffers

2018-04-16 Thread Christoph Hellwig
DAC960 just sets the block bounce limit to the dma mask, which means
that the iommu or swiotlb already take care of the bounce buffering,
and the block bouncing can be removed.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/DAC960.c | 9 ++---
 drivers/block/DAC960.h | 1 -
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/block/DAC960.c b/drivers/block/DAC960.c
index f781eff7d23e..c9ba48519d0f 100644
--- a/drivers/block/DAC960.c
+++ b/drivers/block/DAC960.c
@@ -1179,7 +1179,6 @@ static bool 
DAC960_V1_EnableMemoryMailboxInterface(DAC960_Controller_T
 
   if (pci_set_dma_mask(Controller->PCIDevice, DMA_BIT_MASK(32)))
return DAC960_Failure(Controller, "DMA mask out of range");
-  Controller->BounceBufferLimit = DMA_BIT_MASK(32);
 
   if ((hw_type == DAC960_PD_Controller) || (hw_type == DAC960_P_Controller)) {
 CommandMailboxesSize =  0;
@@ -1380,11 +1379,8 @@ static bool 
DAC960_V2_EnableMemoryMailboxInterface(DAC960_Controller_T
   dma_addr_t   CommandMailboxDMA;
   DAC960_V2_CommandStatus_T CommandStatus;
 
-   if (!pci_set_dma_mask(Controller->PCIDevice, DMA_BIT_MASK(64)))
-   Controller->BounceBufferLimit = DMA_BIT_MASK(64);
-   else if (!pci_set_dma_mask(Controller->PCIDevice, DMA_BIT_MASK(32)))
-   Controller->BounceBufferLimit = DMA_BIT_MASK(32);
-   else
+   if (pci_set_dma_mask(Controller->PCIDevice, DMA_BIT_MASK(64)) &&
+   pci_set_dma_mask(Controller->PCIDevice, DMA_BIT_MASK(32)))
return DAC960_Failure(Controller, "DMA mask out of range");
 
   /* This is a temporary dma mapping, used only in the scope of this function 
*/
@@ -2540,7 +2536,6 @@ static bool 
DAC960_RegisterBlockDevice(DAC960_Controller_T *Controller)
continue;
}
Controller->RequestQueue[n] = RequestQueue;
-   blk_queue_bounce_limit(RequestQueue, Controller->BounceBufferLimit);
RequestQueue->queuedata = Controller;
blk_queue_max_segments(RequestQueue, 
Controller->DriverScatterGatherLimit);
blk_queue_max_hw_sectors(RequestQueue, Controller->MaxBlocksPerCommand);
diff --git a/drivers/block/DAC960.h b/drivers/block/DAC960.h
index 21aff470d268..1439e651928b 100644
--- a/drivers/block/DAC960.h
+++ b/drivers/block/DAC960.h
@@ -2295,7 +2295,6 @@ typedef struct DAC960_Controller
   unsigned short MaxBlocksPerCommand;
   unsigned short ControllerScatterGatherLimit;
   unsigned short DriverScatterGatherLimit;
-  u64  BounceBufferLimit;
   unsigned int CombinedStatusBufferLength;
   unsigned int InitialStatusLength;
   unsigned int CurrentStatusLength;
-- 
2.17.0
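
As a side note, the try-64-bit-then-32-bit idiom the hunk above switches to
(and which the sata_nv patch later in the series spells out with
dma_set_mask_and_coherent) looks like this in isolation — a hedged sketch
with a hypothetical driver name:

#include <linux/pci.h>
#include <linux/dma-mapping.h>

static int mydrv_set_dma_mask(struct pci_dev *pdev)
{
        /* prefer 64-bit addressing; fall back to 32-bit, else fail */
        if (!dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64)))
                return 0;
        return dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
}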



[PATCH 03/12] mtip32xx: don't use block layer bounce buffers

2018-04-16 Thread Christoph Hellwig
mtip32xx just sets the block bounce limit to the dma mask, which means
that the iommu or swiotlb already take care of the bounce buffering,
and the block bouncing can be removed.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/mtip32xx/mtip32xx.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/block/mtip32xx/mtip32xx.c 
b/drivers/block/mtip32xx/mtip32xx.c
index 769c551e3d71..b03bb27dcc58 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -3862,7 +3862,6 @@ static int mtip_block_initialize(struct driver_data *dd)
blk_queue_max_hw_sectors(dd->queue, 0x);
blk_queue_max_segment_size(dd->queue, 0x40);
blk_queue_io_min(dd->queue, 4096);
-   blk_queue_bounce_limit(dd->queue, dd->pdev->dma_mask);
 
/* Signal trim support */
if (dd->trim_supp == true) {
-- 
2.17.0



[PATCH 01/12] iscsi_tcp: don't set a bounce limit

2018-04-16 Thread Christoph Hellwig
The default already is to never bounce, so the call is a no-op.

Signed-off-by: Christoph Hellwig 
---
 drivers/scsi/iscsi_tcp.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index 2ba4b68fdb73..b025a0b74341 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -962,7 +962,6 @@ static int iscsi_sw_tcp_slave_configure(struct scsi_device 
*sdev)
if (conn->datadgst_en)
sdev->request_queue->backing_dev_info->capabilities
|= BDI_CAP_STABLE_WRITES;
-   blk_queue_bounce_limit(sdev->request_queue, BLK_BOUNCE_ANY);
blk_queue_dma_alignment(sdev->request_queue, 0);
return 0;
 }
-- 
2.17.0



[PATCH 05/12] sata_nv: don't use block layer bounce buffers

2018-04-16 Thread Christoph Hellwig
sata_nv sets the block bounce limit to the reduced dma mask for ATAPI
devices, which means that the iommu or swiotlb already take care of
the bounce buffering, and the block bouncing can be removed.

Signed-off-by: Christoph Hellwig 
---
 drivers/ata/sata_nv.c | 62 +--
 1 file changed, 24 insertions(+), 38 deletions(-)

diff --git a/drivers/ata/sata_nv.c b/drivers/ata/sata_nv.c
index 8c683ddd0f58..b6e9ad6d33c9 100644
--- a/drivers/ata/sata_nv.c
+++ b/drivers/ata/sata_nv.c
@@ -740,32 +740,16 @@ static int nv_adma_slave_config(struct scsi_device *sdev)
sdev1 = ap->host->ports[1]->link.device[0].sdev;
if ((port0->flags & NV_ADMA_ATAPI_SETUP_COMPLETE) ||
(port1->flags & NV_ADMA_ATAPI_SETUP_COMPLETE)) {
-   /** We have to set the DMA mask to 32-bit if either port is in
-   ATAPI mode, since they are on the same PCI device which is
-   used for DMA mapping. If we set the mask we also need to set
-   the bounce limit on both ports to ensure that the block
-   layer doesn't feed addresses that cause DMA mapping to
-   choke. If either SCSI device is not allocated yet, it's OK
-   since that port will discover its correct setting when it
-   does get allocated.
-   Note: Setting 32-bit mask should not fail. */
-   if (sdev0)
-   blk_queue_bounce_limit(sdev0->request_queue,
-  ATA_DMA_MASK);
-   if (sdev1)
-   blk_queue_bounce_limit(sdev1->request_queue,
-  ATA_DMA_MASK);
-
-   dma_set_mask(>dev, ATA_DMA_MASK);
+   /*
+* We have to set the DMA mask to 32-bit if either port is in
+* ATAPI mode, since they are on the same PCI device which is
+* used for DMA mapping.  If either SCSI device is not allocated
+* yet, it's OK since that port will discover its correct
+* setting when it does get allocated.
+*/
+   rc = dma_set_mask(>dev, ATA_DMA_MASK);
} else {
-   /** This shouldn't fail as it was set to this value before */
-   dma_set_mask(>dev, pp->adma_dma_mask);
-   if (sdev0)
-   blk_queue_bounce_limit(sdev0->request_queue,
-  pp->adma_dma_mask);
-   if (sdev1)
-   blk_queue_bounce_limit(sdev1->request_queue,
-  pp->adma_dma_mask);
+   rc = dma_set_mask(>dev, pp->adma_dma_mask);
}
 
blk_queue_segment_boundary(sdev->request_queue, segment_boundary);
@@ -1131,12 +1115,11 @@ static int nv_adma_port_start(struct ata_port *ap)
 
VPRINTK("ENTER\n");
 
-   /* Ensure DMA mask is set to 32-bit before allocating legacy PRD and
-  pad buffers */
-   rc = dma_set_mask(>dev, DMA_BIT_MASK(32));
-   if (rc)
-   return rc;
-   rc = dma_set_coherent_mask(>dev, DMA_BIT_MASK(32));
+   /*
+* Ensure DMA mask is set to 32-bit before allocating legacy PRD and
+* pad buffers.
+*/
+   rc = dma_set_mask_and_coherent(>dev, DMA_BIT_MASK(32));
if (rc)
return rc;
 
@@ -1156,13 +1139,16 @@ static int nv_adma_port_start(struct ata_port *ap)
pp->notifier_clear_block = pp->gen_block +
   NV_ADMA_NOTIFIER_CLEAR + (4 * ap->port_no);
 
-   /* Now that the legacy PRD and padding buffer are allocated we can
-  safely raise the DMA mask to allocate the CPB/APRD table.
-  These are allowed to fail since we store the value that ends up
-  being used to set as the bounce limit in slave_config later if
-  needed. */
-   dma_set_mask(>dev, DMA_BIT_MASK(64));
-   dma_set_coherent_mask(>dev, DMA_BIT_MASK(64));
+   /*
+* Now that the legacy PRD and padding buffer are allocated we can
+* try to raise the DMA mask to allocate the CPB/APRD table.
+*/
+   rc = dma_set_mask_and_coherent(>dev, DMA_BIT_MASK(64));
+   if (rc) {
+   rc = dma_set_mask_and_coherent(>dev, DMA_BIT_MASK(32));
+   if (rc)
+   return rc;
+   }
pp->adma_dma_mask = *dev->dma_mask;
 
mem = dmam_alloc_coherent(dev, NV_ADMA_PORT_PRIV_DMA_SZ,
-- 
2.17.0



[PATCH 08/12] mmc: reduce use of block bounce buffers

2018-04-16 Thread Christoph Hellwig
We can rely on the dma-mapping code to handle any DMA limit that is
bigger than the ISA DMA mask for us (either using an iommu or swiotlb),
so remove setting the block layer bounce limit for anything but bouncing
for highmem pages.

Signed-off-by: Christoph Hellwig 
---
 drivers/mmc/core/queue.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 56e9a803db21..60a02a763d01 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -351,17 +351,14 @@ static const struct blk_mq_ops mmc_mq_ops = {
 static void mmc_setup_queue(struct mmc_queue *mq, struct mmc_card *card)
 {
struct mmc_host *host = card->host;
-   u64 limit = BLK_BOUNCE_HIGH;
-
-   if (mmc_dev(host)->dma_mask && *mmc_dev(host)->dma_mask)
-   limit = (u64)dma_max_pfn(mmc_dev(host)) << PAGE_SHIFT;
 
blk_queue_flag_set(QUEUE_FLAG_NONROT, mq->queue);
blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, mq->queue);
if (mmc_can_erase(card))
mmc_queue_setup_discard(mq->queue, card);
 
-   blk_queue_bounce_limit(mq->queue, limit);
+   if (!mmc_dev(host)->dma_mask || !*mmc_dev(host)->dma_mask)
+   blk_queue_bounce_limit(mq->queue, BLK_BOUNCE_HIGH);
blk_queue_max_hw_sectors(mq->queue,
min(host->max_blk_count, host->max_req_size / 512));
blk_queue_max_segments(mq->queue, host->max_segs);
-- 
2.17.0



[PATCH 06/12] memstick: don't call blk_queue_bounce_limit

2018-04-16 Thread Christoph Hellwig
All in-tree host drivers set up a proper dma mask and use the dma-mapping
helpers.  This means they will be able to deal with any address that we
are throwing at them.

Signed-off-by: Christoph Hellwig 
---
 drivers/memstick/core/ms_block.c| 5 -
 drivers/memstick/core/mspro_block.c | 5 -
 2 files changed, 10 deletions(-)

diff --git a/drivers/memstick/core/ms_block.c b/drivers/memstick/core/ms_block.c
index 57b13dfbd21e..b2d025f42d14 100644
--- a/drivers/memstick/core/ms_block.c
+++ b/drivers/memstick/core/ms_block.c
@@ -2096,12 +2096,8 @@ static int msb_init_disk(struct memstick_dev *card)
struct msb_data *msb = memstick_get_drvdata(card);
struct memstick_host *host = card->host;
int rc;
-   u64 limit = BLK_BOUNCE_HIGH;
unsigned long capacity;
 
-   if (host->dev.dma_mask && *(host->dev.dma_mask))
-   limit = *(host->dev.dma_mask);
-
mutex_lock(_disk_lock);
msb->disk_id = idr_alloc(_disk_idr, card, 0, 256, GFP_KERNEL);
mutex_unlock(_disk_lock);
@@ -2123,7 +2119,6 @@ static int msb_init_disk(struct memstick_dev *card)
 
msb->queue->queuedata = card;
 
-   blk_queue_bounce_limit(msb->queue, limit);
blk_queue_max_hw_sectors(msb->queue, MS_BLOCK_MAX_PAGES);
blk_queue_max_segments(msb->queue, MS_BLOCK_MAX_SEGS);
blk_queue_max_segment_size(msb->queue,
diff --git a/drivers/memstick/core/mspro_block.c 
b/drivers/memstick/core/mspro_block.c
index 8897962781bb..a2fadc605750 100644
--- a/drivers/memstick/core/mspro_block.c
+++ b/drivers/memstick/core/mspro_block.c
@@ -1175,12 +1175,8 @@ static int mspro_block_init_disk(struct memstick_dev 
*card)
struct mspro_sys_info *sys_info = NULL;
struct mspro_sys_attr *s_attr = NULL;
int rc, disk_id;
-   u64 limit = BLK_BOUNCE_HIGH;
unsigned long capacity;
 
-   if (host->dev.dma_mask && *(host->dev.dma_mask))
-   limit = *(host->dev.dma_mask);
-
for (rc = 0; msb->attr_group.attrs[rc]; ++rc) {
s_attr = mspro_from_sysfs_attr(msb->attr_group.attrs[rc]);
 
@@ -1219,7 +1215,6 @@ static int mspro_block_init_disk(struct memstick_dev 
*card)
 
msb->queue->queuedata = card;
 
-   blk_queue_bounce_limit(msb->queue, limit);
blk_queue_max_hw_sectors(msb->queue, MSPRO_BLOCK_MAX_PAGES);
blk_queue_max_segments(msb->queue, MSPRO_BLOCK_MAX_SEGS);
blk_queue_max_segment_size(msb->queue,
-- 
2.17.0



[PATCH 07/12] scsi: reduce use of block bounce buffers

2018-04-16 Thread Christoph Hellwig
We can rely on the dma-mapping code to handle any DMA limit that is
bigger than the ISA DMA mask for us (either using an iommu or swiotlb),
so remove setting the block layer bounce limit for anything but the
unchecked_isa_dma case, or the bouncing for highmem pages.

Signed-off-by: Christoph Hellwig 
---
 drivers/scsi/scsi_lib.c | 24 ++--
 1 file changed, 2 insertions(+), 22 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 0dfec0dedd5e..ebe2cbb48b80 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2142,27 +2142,6 @@ static int scsi_map_queues(struct blk_mq_tag_set *set)
return blk_mq_map_queues(set);
 }
 
-static u64 scsi_calculate_bounce_limit(struct Scsi_Host *shost)
-{
-   struct device *host_dev;
-   u64 bounce_limit = 0x;
-
-   if (shost->unchecked_isa_dma)
-   return BLK_BOUNCE_ISA;
-   /*
-* Platforms with virtual-DMA translation
-* hardware have no practical limit.
-*/
-   if (!PCI_DMA_BUS_IS_PHYS)
-   return BLK_BOUNCE_ANY;
-
-   host_dev = scsi_get_device(shost);
-   if (host_dev && host_dev->dma_mask)
-   bounce_limit = (u64)dma_max_pfn(host_dev) << PAGE_SHIFT;
-
-   return bounce_limit;
-}
-
 void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
 {
struct device *dev = shost->dma_dev;
@@ -2182,7 +2161,8 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct 
request_queue *q)
}
 
blk_queue_max_hw_sectors(q, shost->max_sectors);
-   blk_queue_bounce_limit(q, scsi_calculate_bounce_limit(shost));
+   if (shost->unchecked_isa_dma)
+   blk_queue_bounce_limit(q, BLK_BOUNCE_ISA);
blk_queue_segment_boundary(q, shost->dma_boundary);
dma_set_seg_boundary(dev, shost->dma_boundary);
 
-- 
2.17.0



[PATCH 10/12] ide: remove the PCI_DMA_BUS_IS_PHYS check

2018-04-16 Thread Christoph Hellwig
We now have ways to deal with drainage in the block layer, and libata has
been using it for ages.  We also want to get rid of PCI_DMA_BUS_IS_PHYS
now, so just reduce the PCI transfer size for ide - anyone who cares for
performance on PCI controllers should have switched to libata long ago.

Signed-off-by: Christoph Hellwig 
---
 drivers/ide/ide-probe.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/ide/ide-probe.c b/drivers/ide/ide-probe.c
index 8d8ed036ca0a..56d7bc228cb3 100644
--- a/drivers/ide/ide-probe.c
+++ b/drivers/ide/ide-probe.c
@@ -796,8 +796,7 @@ static int ide_init_queue(ide_drive_t *drive)
 * This will be fixed once we teach pci_map_sg() about our boundary
 * requirements, hopefully soon. *FIXME*
 */
-   if (!PCI_DMA_BUS_IS_PHYS)
-   max_sg_entries >>= 1;
+   max_sg_entries >>= 1;
 #endif /* CONFIG_PCI */
 
blk_queue_max_segments(q, max_sg_entries);
-- 
2.17.0



[PATCH 09/12] ide: kill ide_toggle_bounce

2018-04-16 Thread Christoph Hellwig
ide_toggle_bounce did select various strange block bounce limits, including
not bouncing at all as soon as an iommu is present in the system.  Given
that the dma_map routines now handle any required bounce buffering except
for ISA DMA, and the ide code already must handle either ISA DMA or highmem
at least for iommu equipped systems we can get rid of the block layer
bounce limit setting entirely.

Signed-off-by: Christoph Hellwig 
---
 drivers/ide/ide-dma.c   |  2 --
 drivers/ide/ide-lib.c   | 26 --
 drivers/ide/ide-probe.c |  3 ---
 include/linux/ide.h |  2 --
 4 files changed, 33 deletions(-)

diff --git a/drivers/ide/ide-dma.c b/drivers/ide/ide-dma.c
index 54d4d78ca46a..6f344654ef22 100644
--- a/drivers/ide/ide-dma.c
+++ b/drivers/ide/ide-dma.c
@@ -180,7 +180,6 @@ EXPORT_SYMBOL_GPL(ide_dma_unmap_sg);
 void ide_dma_off_quietly(ide_drive_t *drive)
 {
drive->dev_flags &= ~IDE_DFLAG_USING_DMA;
-   ide_toggle_bounce(drive, 0);
 
drive->hwif->dma_ops->dma_host_set(drive, 0);
 }
@@ -211,7 +210,6 @@ EXPORT_SYMBOL(ide_dma_off);
 void ide_dma_on(ide_drive_t *drive)
 {
drive->dev_flags |= IDE_DFLAG_USING_DMA;
-   ide_toggle_bounce(drive, 1);
 
drive->hwif->dma_ops->dma_host_set(drive, 1);
 }
diff --git a/drivers/ide/ide-lib.c b/drivers/ide/ide-lib.c
index e1180fa46196..78cb79eddc8b 100644
--- a/drivers/ide/ide-lib.c
+++ b/drivers/ide/ide-lib.c
@@ -6,32 +6,6 @@
 #include 
 #include 
 
-/**
- * ide_toggle_bounce   -   handle bounce buffering
- * @drive: drive to update
- * @on: on/off boolean
- *
- * Enable or disable bounce buffering for the device. Drives move
- * between PIO and DMA and that changes the rules we need.
- */
-
-void ide_toggle_bounce(ide_drive_t *drive, int on)
-{
-   u64 addr = BLK_BOUNCE_HIGH; /* dma64_addr_t */
-
-   if (!PCI_DMA_BUS_IS_PHYS) {
-   addr = BLK_BOUNCE_ANY;
-   } else if (on && drive->media == ide_disk) {
-   struct device *dev = drive->hwif->dev;
-
-   if (dev && dev->dma_mask)
-   addr = *dev->dma_mask;
-   }
-
-   if (drive->queue)
-   blk_queue_bounce_limit(drive->queue, addr);
-}
-
 u64 ide_get_lba_addr(struct ide_cmd *cmd, int lba48)
 {
struct ide_taskfile *tf = >tf;
diff --git a/drivers/ide/ide-probe.c b/drivers/ide/ide-probe.c
index 2019e66eada7..8d8ed036ca0a 100644
--- a/drivers/ide/ide-probe.c
+++ b/drivers/ide/ide-probe.c
@@ -805,9 +805,6 @@ static int ide_init_queue(ide_drive_t *drive)
/* assign drive queue */
drive->queue = q;
 
-   /* needs drive->queue to be set */
-   ide_toggle_bounce(drive, 1);
-
return 0;
 }
 
diff --git a/include/linux/ide.h b/include/linux/ide.h
index ca9d34feb572..11f0dd03a4b4 100644
--- a/include/linux/ide.h
+++ b/include/linux/ide.h
@@ -1508,8 +1508,6 @@ static inline void ide_set_hwifdata (ide_hwif_t * hwif, 
void *data)
hwif->hwif_data = data;
 }
 
-extern void ide_toggle_bounce(ide_drive_t *drive, int on);
-
 u64 ide_get_lba_addr(struct ide_cmd *, int);
 u8 ide_dump_status(ide_drive_t *, const char *, u8);
 
-- 
2.17.0



[PATCH 12/12] PCI: remove PCI_DMA_BUS_IS_PHYS

2018-04-16 Thread Christoph Hellwig
This was used by the ide, scsi and networking code in the past to
determine if they should bounce payloads.  Now that the dma mapping
code always has to support dma to all physical memory (thanks to swiotlb
for non-iommu systems) there is no need for this crude hack any more.

Signed-off-by: Christoph Hellwig 
---
 arch/alpha/include/asm/pci.h  |  5 -
 arch/arc/include/asm/pci.h|  6 --
 arch/arm/include/asm/pci.h|  7 ---
 arch/arm64/include/asm/pci.h  |  5 -
 arch/h8300/include/asm/pci.h  |  2 --
 arch/hexagon/kernel/dma.c |  1 -
 arch/ia64/hp/common/sba_iommu.c   |  3 ---
 arch/ia64/include/asm/pci.h   | 17 -
 arch/ia64/kernel/setup.c  | 12 
 arch/ia64/sn/kernel/io_common.c   |  5 -
 arch/m68k/include/asm/pci.h   |  6 --
 arch/microblaze/include/asm/pci.h |  6 --
 arch/mips/include/asm/pci.h   |  7 ---
 arch/parisc/include/asm/pci.h | 23 ---
 arch/parisc/kernel/setup.c|  5 -
 arch/powerpc/include/asm/pci.h| 18 --
 arch/riscv/include/asm/pci.h  |  3 ---
 arch/s390/include/asm/pci.h   |  2 --
 arch/s390/pci/pci_dma.c   |  2 --
 arch/sh/include/asm/pci.h |  6 --
 arch/sh/kernel/dma-nommu.c|  1 -
 arch/sparc/include/asm/pci_32.h   |  4 
 arch/sparc/include/asm/pci_64.h   |  6 --
 arch/x86/include/asm/pci.h|  3 ---
 arch/x86/kernel/pci-nommu.c   |  1 -
 arch/xtensa/include/asm/pci.h |  2 --
 drivers/parisc/ccio-dma.c |  2 --
 drivers/parisc/sba_iommu.c|  2 --
 include/asm-generic/pci.h |  8 
 include/linux/dma-mapping.h   |  1 -
 lib/dma-direct.c  |  1 -
 tools/virtio/linux/dma-mapping.h  |  2 --
 32 files changed, 174 deletions(-)

diff --git a/arch/alpha/include/asm/pci.h b/arch/alpha/include/asm/pci.h
index b9ec55351924..cf6bc1e64d66 100644
--- a/arch/alpha/include/asm/pci.h
+++ b/arch/alpha/include/asm/pci.h
@@ -56,11 +56,6 @@ struct pci_controller {
 
 /* IOMMU controls.  */
 
-/* The PCI address space does not equal the physical memory address space.
-   The networking and block device layers use this boolean for bounce buffer
-   decisions.  */
-#define PCI_DMA_BUS_IS_PHYS  0
-
 /* TODO: integrate with include/asm-generic/pci.h ? */
 static inline int pci_get_legacy_ide_irq(struct pci_dev *dev, int channel)
 {
diff --git a/arch/arc/include/asm/pci.h b/arch/arc/include/asm/pci.h
index ba56c23c1b20..4ff53c041c64 100644
--- a/arch/arc/include/asm/pci.h
+++ b/arch/arc/include/asm/pci.h
@@ -16,12 +16,6 @@
 #define PCIBIOS_MIN_MEM 0x10
 
 #define pcibios_assign_all_busses()1
-/*
- * The PCI address space does equal the physical memory address space.
- * The networking and block device layers use this boolean for bounce
- * buffer decisions.
- */
-#define PCI_DMA_BUS_IS_PHYS1
 
 #endif /* __KERNEL__ */
 
diff --git a/arch/arm/include/asm/pci.h b/arch/arm/include/asm/pci.h
index 1f0de808d111..0abd389cf0ec 100644
--- a/arch/arm/include/asm/pci.h
+++ b/arch/arm/include/asm/pci.h
@@ -19,13 +19,6 @@ static inline int pci_proc_domain(struct pci_bus *bus)
 }
 #endif /* CONFIG_PCI_DOMAINS */
 
-/*
- * The PCI address space does equal the physical memory address space.
- * The networking and block device layers use this boolean for bounce
- * buffer decisions.
- */
-#define PCI_DMA_BUS_IS_PHYS (1)
-
 #define HAVE_PCI_MMAP
 #define ARCH_GENERIC_PCI_MMAP_RESOURCE
 
diff --git a/arch/arm64/include/asm/pci.h b/arch/arm64/include/asm/pci.h
index 8747f7c5e0e7..9e690686e8aa 100644
--- a/arch/arm64/include/asm/pci.h
+++ b/arch/arm64/include/asm/pci.h
@@ -18,11 +18,6 @@
 #define pcibios_assign_all_busses() \
(pci_has_flag(PCI_REASSIGN_ALL_BUS))
 
-/*
- * PCI address space differs from physical memory address space
- */
-#define PCI_DMA_BUS_IS_PHYS(0)
-
 #define ARCH_GENERIC_PCI_MMAP_RESOURCE 1
 
 extern int isa_dma_bridge_buggy;
diff --git a/arch/h8300/include/asm/pci.h b/arch/h8300/include/asm/pci.h
index 7c9e55d62215..d4d345a52092 100644
--- a/arch/h8300/include/asm/pci.h
+++ b/arch/h8300/include/asm/pci.h
@@ -15,6 +15,4 @@ static inline void pcibios_penalize_isa_irq(int irq, int 
active)
/* We don't do dynamic PCI IRQ allocation */
 }
 
-#define PCI_DMA_BUS_IS_PHYS(1)
-
 #endif /* _ASM_H8300_PCI_H */
diff --git a/arch/hexagon/kernel/dma.c b/arch/hexagon/kernel/dma.c
index ad8347c29dcf..77459df34e2e 100644
--- a/arch/hexagon/kernel/dma.c
+++ b/arch/hexagon/kernel/dma.c
@@ -208,7 +208,6 @@ const struct dma_map_ops hexagon_dma_ops = {
.sync_single_for_cpu = hexagon_sync_single_for_cpu,
.sync_single_for_device = hexagon_sync_single_for_device,
.mapping_error  = hexagon_mapping_error,
-   .is_phys= 1,
 };
 
 void __init hexagon_dma_init(void)
diff --git a/arch/ia64/hp/common/sba_iommu.c b/arch/ia64/hp/common/sba_iommu.c
index aec4a3354abe..6f05aba9012f 100644
--- 

[PATCH 11/12] net: remove the PCI_DMA_BUS_IS_PHYS check in illegal_highdma

2018-04-16 Thread Christoph Hellwig
These days the dma mapping routines must be able to handle any address
supported by the device, be that using an iommu, or swiotlb if none is
supported.  With that the PCI_DMA_BUS_IS_PHYS check in illegal_highdma
is not needed and can be removed.

Signed-off-by: Christoph Hellwig 
---
 net/core/dev.c | 20 +---
 1 file changed, 1 insertion(+), 19 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 969462ebb296..5802cf177b07 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2884,11 +2884,7 @@ void netdev_rx_csum_fault(struct net_device *dev)
 EXPORT_SYMBOL(netdev_rx_csum_fault);
 #endif
 
-/* Actually, we should eliminate this check as soon as we know, that:
- * 1. IOMMU is present and allows to map all the memory.
- * 2. No high memory really exists on this machine.
- */
-
+/* XXX: check that highmem exists at all on the given machine. */
 static int illegal_highdma(struct net_device *dev, struct sk_buff *skb)
 {
 #ifdef CONFIG_HIGHMEM
@@ -2902,20 +2898,6 @@ static int illegal_highdma(struct net_device *dev, 
struct sk_buff *skb)
return 1;
}
}
-
-   if (PCI_DMA_BUS_IS_PHYS) {
-   struct device *pdev = dev->dev.parent;
-
-   if (!pdev)
-   return 0;
-   for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
-   skb_frag_t *frag = _shinfo(skb)->frags[i];
-   dma_addr_t addr = page_to_phys(skb_frag_page(frag));
-
-   if (!pdev->dma_mask || addr + PAGE_SIZE - 1 > 
*pdev->dma_mask)
-   return 1;
-   }
-   }
 #endif
return 0;
 }
-- 
2.17.0



Re: [PATCH] blk-mq: mark hctx RESTART when get budget fails

2018-04-16 Thread jianchao.wang
Hi Ming

Thanks for your kindly response.

On 04/16/2018 04:15 PM, Ming Lei wrote:
>> -if (!blk_mq_get_dispatch_budget(hctx))
>> +if (!blk_mq_get_dispatch_budget(hctx)) {
>> +blk_mq_sched_mark_restart_hctx(hctx);
> The RESTART flag still may not take effect if all requests are
> completed before calling blk_mq_sched_mark_restart_hctx().
> 

Yes, this is indeed a very tricky scenario.

Sincerely
Jianchao


Re: [PATCH] blk-mq: mark hctx RESTART when get budget fails

2018-04-16 Thread Ming Lei
On Mon, Apr 16, 2018 at 03:55:36PM +0800, Jianchao Wang wrote:
> When getting the budget fails, blk_mq_sched_dispatch_requests does not do
> anything to ensure the hctx will be restarted. We can survive
> this, because only SCSI implements .get_budget, and it always
> runs the hctx queues when a request is completed.

Exactly, that is why we don't handle the failure before dequeuing
the request.

I suggest to just document this usage now.

> 
> Signed-off-by: Jianchao Wang 
> ---
>  block/blk-mq-sched.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 25c14c5..49e7cd9 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -102,8 +102,10 @@ static void blk_mq_do_dispatch_sched(struct 
> blk_mq_hw_ctx *hctx)
>   !e->type->ops.mq.has_work(hctx))
>   break;
>  
> - if (!blk_mq_get_dispatch_budget(hctx))
> + if (!blk_mq_get_dispatch_budget(hctx)) {
> + blk_mq_sched_mark_restart_hctx(hctx);

The RESTART flag still may not take effect if all requests are
completed before calling blk_mq_sched_mark_restart_hctx().

>   break;
> + }
>  
>   rq = e->type->ops.mq.dispatch_request(hctx);
>   if (!rq) {
> @@ -148,8 +150,10 @@ static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx 
> *hctx)
>   if (!sbitmap_any_bit_set(>ctx_map))
>   break;
>  
> - if (!blk_mq_get_dispatch_budget(hctx))
> + if (!blk_mq_get_dispatch_budget(hctx)) {
> + blk_mq_sched_mark_restart_hctx(hctx);

Same with above.

Thanks,
Ming
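
The window Ming describes generalizes to the classic lost-wakeup problem: if
the consumer marks "restart me" only after a failed resource grab, a release
that lands in between is never noticed. A self-contained userspace sketch of
the failure and of the usual mark-then-recheck fix (hypothetical names, not
blk-mq code):

#include <stdatomic.h>
#include <stdbool.h>

atomic_int budget;
atomic_bool restart_marked;

static bool get_budget(void)
{
        int old = atomic_load(&budget);

        while (old > 0)
                if (atomic_compare_exchange_weak(&budget, &old, old - 1))
                        return true;
        return false;
}

/* completion side: release budget, then kick whoever marked restart */
static void put_budget(void)
{
        atomic_fetch_add(&budget, 1);
        if (atomic_exchange(&restart_marked, false)) {
                /* rerun_queue(): re-run the stalled dispatch side */
        }
}

/* racy: if put_budget() runs between the failed get_budget() and the
 * store below, its kick sees no mark and the queue stalls forever */
static void dispatch_racy(void)
{
        if (!get_budget()) {
                atomic_store(&restart_marked, true);
                return;
        }
        /* ... dispatch one request ... */
}

/* safe: mark first, then re-check, so a release that raced with the
 * failed grab is caught by the second get_budget() */
static void dispatch_safe(void)
{
        if (!get_budget()) {
                atomic_store(&restart_marked, true);
                if (!get_budget())
                        return;
                atomic_store(&restart_marked, false);
        }
        /* ... dispatch one request ... */
}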


Re: [PATCH v2] block: ratelimite pr_err on IO path

2018-04-16 Thread Jinpu Wang
On Fri, Apr 13, 2018 at 6:59 PM, Martin K. Petersen
 wrote:
>
> Jinpu,
>
> [CC:ed the mpt3sas maintainers]
>
> The ratelimit patch is just an attempt to treat the symptom, not the
> cause.
Agree. If we can fix the root cause, it will be great.
>
>> Thanks for asking, we updated mpt3sas driver which enables DIX support
>> (prot_mask=0x7f), all disks are SATA SSDs, no DIF support.
>> After reboot, the kernel reports IO errors from all the drives behind the
>> HBA, seemingly for almost every read IO, which makes the system unusable:
>> [   13.079375] sda: ref tag error at location 0 (rcvd 143196159)
>> [   13.079989] sda: ref tag error at location 937702912 (rcvd 143196159)
>> [   13.080233] sda: ref tag error at location 937703072 (rcvd 143196159)
>> [   13.080407] sda: ref tag error at location 0 (rcvd 143196159)
>> [   13.080594] sda: ref tag error at location 8 (rcvd 143196159)
>
> That sounds like a bug in the mpt3sas driver or firmware. I guess the
> HBA could conceivably be operating a SATA device as DIX Type 0 and strip
> the PI on the drive side. But that doesn't seem to be a particularly
> useful mode of operation.
>
> Jinpu: Which firmware are you running? Also, please send us the output
> of:
>
> sg_readcap -l /dev/sda
> sg_inq -x /dev/sda
> sg_vpd /dev/sda
>
Disks are INTEL SSDSC2BX48, directly attached to HBA.
LSISAS3008: FWVersion(13.00.00.00), ChipRevision(0x02), BiosVersion(08.11.00.00)
mpt3sas_cm2: Protocol=(Initiator,Target),
Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set
Full,NCQ)

jwang@x:~$ sudo sg_vpd /dev/sdz
Supported VPD pages VPD page:
  Supported VPD pages [sv]
  Unit serial number [sn]
  Device identification [di]
  Mode page policy [mpp]
  ATA information (SAT) [ai]
  Block limits (SBC) [bl]
  Block device characteristics (SBC) [bdc]
  Logical block provisioning (SBC) [lbpv]
jwang@x:~$ sudo sg_inq -x /dev/sdz
VPD INQUIRY: extended INQUIRY data page
inquiry: field in cdb illegal (page not supported)
jwang@x:~$ sudo sg_readcap -l /dev/sdz
Read Capacity results:
   Protection: prot_en=0, p_type=0, p_i_exponent=0
   Logical block provisioning: lbpme=1, lbprz=1
   Last logical block address=937703087 (0x37e436af), Number of
logical blocks=937703088
   Logical block length=512 bytes
   Logical blocks per physical block exponent=3 [so physical block
length=4096 bytes]
   Lowest aligned logical block address=0
Hence:
   Device size: 480103981056 bytes, 457862.8 MiB, 480.10 GB


> Broadcom: How is DIX supposed to work for SATA drives behind an mpt3sas
> controller?
>
> --
> Martin K. Petersen  Oracle Linux Engineering


Thanks!

-- 
Jack Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin


[PATCH] blk-mq: mark hctx RESTART when get budget fails

2018-04-16 Thread Jianchao Wang
When getting the budget fails, blk_mq_sched_dispatch_requests does not do
anything to ensure the hctx will be restarted. We can survive
this, because only SCSI implements .get_budget, and it always
runs the hctx queues when a request is completed.

Signed-off-by: Jianchao Wang 
---
 block/blk-mq-sched.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 25c14c5..49e7cd9 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -102,8 +102,10 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx 
*hctx)
!e->type->ops.mq.has_work(hctx))
break;
 
-   if (!blk_mq_get_dispatch_budget(hctx))
+   if (!blk_mq_get_dispatch_budget(hctx)) {
+   blk_mq_sched_mark_restart_hctx(hctx);
break;
+   }
 
rq = e->type->ops.mq.dispatch_request(hctx);
if (!rq) {
@@ -148,8 +150,10 @@ static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx 
*hctx)
if (!sbitmap_any_bit_set(>ctx_map))
break;
 
-   if (!blk_mq_get_dispatch_budget(hctx))
+   if (!blk_mq_get_dispatch_budget(hctx)) {
+   blk_mq_sched_mark_restart_hctx(hctx);
break;
+   }
 
rq = blk_mq_dequeue_from_ctx(hctx, ctx);
if (!rq) {
-- 
2.7.4



Re: [PATCH v2 1/4] bcache: finish incremental GC

2018-04-16 Thread Coly Li
On 2018/4/16 2:33 PM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui 
> 
> In the GC thread, we record the latest GC key in gc_done, which is expected
> to be used for incremental GC, but the current code doesn't realize that.
> When GC runs, front-side IO is blocked until the GC is over, which can be
> a long time if there are a lot of btree nodes.
> 
> This patch realizes incremental GC; the main idea is that, when there are
> front-side I/Os, after GC'ing some nodes (100), we stop GC, release the
> lock of the btree node, and go process the front-side I/Os for some time
> (100 ms), then go back to GC again.
> 
> With this patch, I/Os are not blocked all the time while GC is running,
> and there is no obvious I/O-drops-to-zero problem any more.
> 
> Patch v2: Rename some variables and macros name as Coly suggested.
> 
> Signed-off-by: Tang Junhui 

Reviewed-by: Coly Li 

Thanks.

Coly Li

> ---
>  drivers/md/bcache/bcache.h  |  5 +
>  drivers/md/bcache/btree.c   | 15 ++-
>  drivers/md/bcache/request.c |  3 +++
>  3 files changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 843877e..5d52be8 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -445,6 +445,7 @@ struct cache {
>  
>  struct gc_stat {
>   size_t  nodes;
> + size_t  nodes_pre;
>   size_t  key_bytes;
>  
>   size_t  nkeys;
> @@ -568,6 +569,10 @@ struct cache_set {
>*/
>   atomic_trescale;
>   /*
> +  * used for GC, identify if any front side I/Os is inflight
> +  */
> + atomic_tsearch_inflight;
> + /*
>* When we invalidate buckets, we use both the priority and the amount
>* of good data to determine which buckets to reuse first - to weight
>* those together consistently we keep track of the smallest nonzero
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index 81e8dc3..6edb00e 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -90,6 +90,9 @@
>  
>  #define MAX_NEED_GC  64
>  #define MAX_SAVE_PRIO72
> +#define MIN_GC_NODES 100
> +#define GC_SLEEP_MS  100
> +
>  
>  #define PTR_DIRTY_BIT(((uint64_t) 1 << 36))
>  
> @@ -1581,6 +1584,13 @@ static int btree_gc_recurse(struct btree *b, struct 
> btree_op *op,
>   memmove(r + 1, r, sizeof(r[0]) * (GC_MERGE_NODES - 1));
>   r->b = NULL;
>  
> + if (atomic_read(>c->search_inflight) &&
> + gc->nodes >= gc->nodes_pre + MIN_GC_NODES) {
> + gc->nodes_pre =  gc->nodes;
> + ret = -EAGAIN;
> + break;
> + }
> +
>   if (need_resched()) {
>   ret = -EAGAIN;
>   break;
> @@ -1748,7 +1758,10 @@ static void bch_btree_gc(struct cache_set *c)
>   closure_sync();
>   cond_resched();
>  
> - if (ret && ret != -EAGAIN)
> + if (ret == -EAGAIN)
> + schedule_timeout_interruptible(msecs_to_jiffies
> +(GC_SLEEP_MS));
> + else if (ret)
>   pr_warn("gc failed!");
>   } while (ret);
>  
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 643c3021..a12afbc 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -637,7 +637,9 @@ static void do_bio_hook(struct search *s, struct bio 
> *orig_bio)
>  static void search_free(struct closure *cl)
>  {
>   struct search *s = container_of(cl, struct search, cl);
> +
>   bio_complete(s);
> + atomic_dec(>d->c->search_inflight);
>  
>   if (s->iop.bio)
>   bio_put(s->iop.bio);
> @@ -655,6 +657,7 @@ static inline struct search *search_alloc(struct bio *bio,
>  
>   closure_init(>cl, NULL);
>   do_bio_hook(s, bio);
> + atomic_inc(>c->search_inflight);
>  
>   s->orig_bio = bio;
>   s->cache_miss   = NULL;
> 



[PATCH v2 1/4] bcache: finish incremental GC

2018-04-16 Thread tang . junhui
From: Tang Junhui 

In the GC thread, we record the latest GC key in gc_done, which is expected
to be used for incremental GC, but the current code doesn't realize that.
When GC runs, front-side IO is blocked until the GC is over, which can be
a long time if there are a lot of btree nodes.

This patch realizes incremental GC; the main idea is that, when there are
front-side I/Os, after GC'ing some nodes (100), we stop GC, release the
lock of the btree node, and go process the front-side I/Os for some time
(100 ms), then go back to GC again.

With this patch, I/Os are not blocked all the time while GC is running,
and there is no obvious I/O-drops-to-zero problem any more.

Patch v2: Rename some variables and macros name as Coly suggested.

Signed-off-by: Tang Junhui 
---
 drivers/md/bcache/bcache.h  |  5 +
 drivers/md/bcache/btree.c   | 15 ++-
 drivers/md/bcache/request.c |  3 +++
 3 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 843877e..5d52be8 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -445,6 +445,7 @@ struct cache {
 
 struct gc_stat {
size_t  nodes;
+   size_t  nodes_pre;
size_t  key_bytes;
 
size_t  nkeys;
@@ -568,6 +569,10 @@ struct cache_set {
 */
atomic_trescale;
/*
+* used for GC, identify if any front side I/Os is inflight
+*/
+   atomic_tsearch_inflight;
+   /*
 * When we invalidate buckets, we use both the priority and the amount
 * of good data to determine which buckets to reuse first - to weight
 * those together consistently we keep track of the smallest nonzero
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 81e8dc3..6edb00e 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -90,6 +90,9 @@
 
 #define MAX_NEED_GC64
 #define MAX_SAVE_PRIO  72
+#define MIN_GC_NODES   100
+#define GC_SLEEP_MS100
+
 
 #define PTR_DIRTY_BIT  (((uint64_t) 1 << 36))
 
@@ -1581,6 +1584,13 @@ static int btree_gc_recurse(struct btree *b, struct 
btree_op *op,
memmove(r + 1, r, sizeof(r[0]) * (GC_MERGE_NODES - 1));
r->b = NULL;
 
+   if (atomic_read(>c->search_inflight) &&
+   gc->nodes >= gc->nodes_pre + MIN_GC_NODES) {
+   gc->nodes_pre =  gc->nodes;
+   ret = -EAGAIN;
+   break;
+   }
+
if (need_resched()) {
ret = -EAGAIN;
break;
@@ -1748,7 +1758,10 @@ static void bch_btree_gc(struct cache_set *c)
closure_sync();
cond_resched();
 
-   if (ret && ret != -EAGAIN)
+   if (ret == -EAGAIN)
+   schedule_timeout_interruptible(msecs_to_jiffies
+  (GC_SLEEP_MS));
+   else if (ret)
pr_warn("gc failed!");
} while (ret);
 
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 643c3021..a12afbc 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -637,7 +637,9 @@ static void do_bio_hook(struct search *s, struct bio 
*orig_bio)
 static void search_free(struct closure *cl)
 {
struct search *s = container_of(cl, struct search, cl);
+
bio_complete(s);
+   atomic_dec(>d->c->search_inflight);
 
if (s->iop.bio)
bio_put(s->iop.bio);
@@ -655,6 +657,7 @@ static inline struct search *search_alloc(struct bio *bio,
 
closure_init(>cl, NULL);
do_bio_hook(s, bio);
+   atomic_inc(>c->search_inflight);
 
s->orig_bio = bio;
s->cache_miss   = NULL;
-- 
1.8.3.1
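
Stripped of the bcache types, the control flow the patch adds is a simple
cooperative-yield loop: count in-flight front-side requests, and after every
batch of nodes, sleep briefly if anyone is waiting. A userspace sketch under
those assumptions (hypothetical reduction, not bcache code):

#include <stdatomic.h>
#include <unistd.h>

#define MIN_GC_NODES    100
#define GC_SLEEP_MS     100

atomic_int search_inflight;     /* ++ in search_alloc, -- in search_free */

static void gc_walk(int total_nodes)
{
        int done = 0, last_yield = 0;

        while (done < total_nodes) {
                /* ... garbage-collect one btree node ... */
                done++;
                if (atomic_load(&search_inflight) &&
                    done >= last_yield + MIN_GC_NODES) {
                        last_yield = done;
                        usleep(GC_SLEEP_MS * 1000); /* let front I/O run */
                }
        }
}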



[PATCH v2 0/4] bcache: incremental GC and dirty data init

2018-04-16 Thread tang . junhui
Hi maintainers and folks,

Some patches of this patch set have been sent before; they are not merged
yet, and I have added two new patches to solve some issues I found while
testing. Since they are interdependent, I made them into a patch set and am
resending them.

[PATCH 1/4] bcache: finish incremental GC
[PATCH 2/4] bcache: calculate the number of incremental GC nodes according to
the total of btree nodes
[PATCH 3/4] bcache: notify allocator to prepare for GC
[PATCH 4/4] bcache: fix I/O significant decline while backend devices 
registering

These patches are useful to prevent I/O fluctuations, or even drops to zero,
while GC runs or cached devices are registering. I have tested them for some
time; I hope somebody could review these patches, and any comment would be
welcome.

Patch v2: modify patch as Coly's suggestion.


[PATCH v2 3/4] bcache: notify allocator to prepare for GC

2018-04-16 Thread tang . junhui
From: Tang Junhui 

Since no new bucket can be allocated during GC, front-side I/Os would run
out of all the buckets. So notify the allocator to pack the free_inc queue
full of buckets before GC; then we have enough buckets for front-side I/Os
during the GC period.

The main idea of this patch is:

GC thread                            |  allocator thread
==>triggered by sectors_to_gc        |
set ca->prepare_gc to GC_PREPARING   |
to notify allocator thread to        |
prepare for GC                       |==>detect ca->prepare_gc is
                                     |   GC_PREPARING,
                                     |   do invalidate_buckets(),
==>waiting for allocator             |   and fill free_inc queue with
thread to prepare over               |   reclaimable buckets, after
                                     |   that, set ca->prepare_gc to
                                     |   GC_PREPARED to notify GC
                                     |   thread being prepared
==>detect ca->prepare_gc is          |
prepared, set                        |
ca->prepare_gc back to               |
GC_PREPARE_NONE, and continue GC     |

Patch v2: Refactoring code

Signed-off-by: Tang Junhui 
---
 drivers/md/bcache/alloc.c  | 11 -
 drivers/md/bcache/bcache.h |  2 ++
 drivers/md/bcache/btree.c  | 59 +++---
 drivers/md/bcache/btree.h  |  4 
 4 files changed, 72 insertions(+), 4 deletions(-)

diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index a0cc1bc..85020cc 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -323,7 +323,8 @@ static int bch_allocator_thread(void *arg)
 * possibly issue discards to them, then we add the bucket to
 * the free list:
 */
-   while (!fifo_empty(>free_inc)) {
+   while (!fifo_empty(>free_inc) &&
+  ca->prepare_gc != GC_PREPARING) {
long bucket;
 
fifo_pop(>free_inc, bucket);
@@ -353,6 +354,14 @@ static int bch_allocator_thread(void *arg)
invalidate_buckets(ca);
 
/*
+* Let GC continue
+*/
+   if (ca->prepare_gc == GC_PREPARING) {
+   ca->prepare_gc = GC_PREPARED;
+   wake_up_gc(ca->set);
+   }
+
+   /*
 * Now, we write their new gens to disk so we can start writing
 * new stuff to them:
 */
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 5d52be8..e6d5391 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -428,6 +428,8 @@ struct cache {
 * cpu
 */
unsignedinvalidate_needs_gc;
+   /* used to notify allocator to prepare GC*/
+   unsigned intprepare_gc;
 
booldiscard; /* Get rid of? */
 
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 2ad0731e..0478821 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1805,19 +1805,72 @@ static void bch_btree_gc(struct cache_set *c)
bch_moving_gc(c);
 }
 
+unsigned int get_cache_gc_prepare_status(struct cache_set *c)
+{
+   struct cache *ca;
+   unsigned int i;
+   unsigned int status = GC_PREPARE_NONE;
+
+   for_each_cache(ca, c, i) {
+   if (ca->prepare_gc == GC_PREPARING)
+   return GC_PREPARING;
+
+   status = ca->prepare_gc;
+   }
+
+   return status;
+}
+
+static void set_cache_gc_prepare_status(struct cache_set *c,
+   unsigned int status)
+{
+   struct cache *ca;
+   unsigned int i;
+
+   for_each_cache(ca, c, i)
+   ca->prepare_gc = status;
+}
+
 static bool gc_should_run(struct cache_set *c)
 {
struct cache *ca;
unsigned i;
+   bool ret = false;
 
for_each_cache(ca, c, i)
if (ca->invalidate_needs_gc)
return true;
 
-   if (atomic_read(>sectors_to_gc) < 0)
-   return true;
+   if (atomic_read(>sectors_to_gc) < 0) {
+   unsigned int status;
 
-   return false;
+   mutex_lock(>bucket_lock);
+   status = get_cache_gc_prepare_status(c);
+   switch (status) {
+ 
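
The message is truncated above; the handshake it implements can be sketched
in a few lines (the GC_* names mirror the patch, the rest is a hypothetical
userspace reduction, not bcache code):

#include <stdatomic.h>

enum { GC_PREPARE_NONE, GC_PREPARING, GC_PREPARED };

atomic_int prepare_gc = GC_PREPARE_NONE;

static void gc_thread_prepare(void)
{
        /* 1. ask the allocator to fill free_inc before GC starts */
        atomic_store(&prepare_gc, GC_PREPARING);
        while (atomic_load(&prepare_gc) != GC_PREPARED)
                ;       /* in the kernel: sleep, woken via wake_up_gc() */
        atomic_store(&prepare_gc, GC_PREPARE_NONE);
        /* 3. run GC with a full free_inc queue */
}

static void allocator_loop_iteration(void)
{
        /* 2. on GC_PREPARING: invalidate_buckets(), fill free_inc
         *    with reclaimable buckets, then signal GC */
        if (atomic_load(&prepare_gc) == GC_PREPARING) {
                /* ... fill free_inc ... */
                atomic_store(&prepare_gc, GC_PREPARED);
        }
}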

[PATCH v2 2/4] bcache: calculate the number of incremental GC nodes according to the total of btree nodes

2018-04-16 Thread tang . junhui
From: Tang Junhui 

This patch base on "[PATCH] bcache: finish incremental GC".

Since incremental GC stops for 100 ms whenever front-side I/O comes in,
when there are many btree nodes, if GC only processes a constant number
(100) of nodes each time, GC lasts a long time, the front-side I/Os run out
of the buckets (since no new bucket can be allocated during GC), and I/Os
are blocked again.

So GC should not process a constant number of nodes, but a number that
varies with the total of btree nodes. In this patch, GC is divided into a
constant number (100) of slices, so when there are many btree nodes, GC can
process more nodes each time; otherwise GC will process fewer nodes each
time (but no fewer than MIN_GC_NODES).

Signed-off-by: Tang Junhui 
---
 drivers/md/bcache/btree.c | 37 ++++++++++++++++++++++++++++++++++---
 1 file changed, 34 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 6edb00e..2ad0731e 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -90,10 +90,10 @@
 
 #define MAX_NEED_GC64
 #define MAX_SAVE_PRIO  72
+#define MAX_GC_TIMES   100
 #define MIN_GC_NODES   100
 #define GC_SLEEP_MS100
 
-
 #define PTR_DIRTY_BIT  (((uint64_t) 1 << 36))
 
 #define PTR_HASH(c, k) \
@@ -1519,6 +1519,31 @@ static unsigned btree_gc_count_keys(struct btree *b)
return ret;
 }
 
+static size_t btree_gc_min_nodes(struct cache_set *c)
+{
+   size_t  min_nodes;
+
+   /*
+* Incremental GC stops for 100ms whenever front side
+* I/O comes in, so if GC only processed a constant
+* number (100) of nodes each time, it would last a
+* long time when there are many btree nodes, and the
+* front side I/Os would run out of buckets (since no
+* new bucket can be allocated during GC) and be
+* blocked again. So GC should process a number of
+* nodes that varies with the total btree node count,
+* realized by dividing GC into a constant number (100)
+* of passes: with many btree nodes GC processes more
+* nodes each time, otherwise fewer (but no less
+* than MIN_GC_NODES).
+*/
+   min_nodes = c->gc_stats.nodes / MAX_GC_TIMES;
+   if (min_nodes < MIN_GC_NODES)
+   min_nodes = MIN_GC_NODES;
+
+   return min_nodes;
+}
+
 static int btree_gc_recurse(struct btree *b, struct btree_op *op,
struct closure *writes, struct gc_stat *gc)
 {
@@ -1585,7 +1610,7 @@ static int btree_gc_recurse(struct btree *b, struct btree_op *op,
r->b = NULL;
 
if (atomic_read(&b->c->search_inflight) &&
-   gc->nodes >= gc->nodes_pre + MIN_GC_NODES) {
+   gc->nodes >= gc->nodes_pre + btree_gc_min_nodes(b->c)) {
gc->nodes_pre =  gc->nodes;
ret = -EAGAIN;
break;
@@ -1841,8 +1866,14 @@ static int bch_btree_check_recurse(struct btree *b, struct btree_op *op)
do {
k = bch_btree_iter_next_filter(&iter, &b->keys,
   bch_ptr_bad);
-   if (k)
+   if (k) {
btree_node_prefetch(b, k);
+   /*
+* initialize c->gc_stats.nodes
+* for incremental GC
+*/
+   b->c->gc_stats.nodes++;
+   }
 
if (p)
ret = btree(check_recurse, p, b, op);
-- 
1.8.3.1



[PATCH v2 4/4] bcache: fix I/O significant decline while backend devices registering

2018-04-16 Thread tang . junhui
From: Tang Junhui 

I attached several backing devices to the same cache set and produced
lots of dirty data by running small random I/O writes over a long time.
Then, while I/O kept running on the other cached devices, I stopped one
cached device and re-registered it after a while. The running I/O on
the other cached devices dropped significantly, sometimes even to zero.

In the current code, bcache traverses every key and btree node to count
the dirty data while holding the btree read lock, so the write threads
cannot take the btree write lock. When the registering device holds a
lot of keys and btree nodes, this can last several seconds, and the
write I/Os on the other cached devices are blocked and decline
significantly.

In this patch, when a device registers to a cache set in which other
cached devices already have I/O running, we count the device's dirty
data in an incremental way, so the other cached devices are not
blocked the whole time.
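
As a rough illustration (numbers made up): with INIT_KEYS_EACH_TIME ==
50 and INIT_KEYS_SLEEP_MS == 100, a registering device whose btree
holds 100,000 keys yields at most 100000 / 50 = 2000 times, i.e. at
most ~200 seconds of added sleep, and only while front side I/O is
actually in flight; with no in-flight I/O the scan runs straight
through as before.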

Patch v2: Rename some variable and macro names as Coly suggested.

Signed-off-by: Tang Junhui 
Reviewed-by: Coly Li 
---
 drivers/md/bcache/writeback.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 56a3788..71ba861 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -488,10 +488,14 @@ static int bch_writeback_thread(void *arg)
 }
 
 /* Init */
+#define INIT_KEYS_EACH_TIME50
+#define INIT_KEYS_SLEEP_MS 100
 
 struct sectors_dirty_init {
struct btree_op op;
unsigned        inode;
+   size_t  count;
+   struct bkey start;
 };
 
 static int sectors_dirty_init_fn(struct btree_op *_op, struct btree *b,
@@ -506,18 +510,38 @@ static int sectors_dirty_init_fn(struct btree_op *_op, struct btree *b,
bcache_dev_sectors_dirty_add(b->c, KEY_INODE(k),
 KEY_START(k), KEY_SIZE(k));
 
+   op->count++;
if (atomic_read(&b->c->search_inflight) &&
+  !(op->count % INIT_KEYS_EACH_TIME)) {
bkey_copy_key(&op->start, k);
+   return -EAGAIN;
+   }
+
return MAP_CONTINUE;
 }
 
 void bch_sectors_dirty_init(struct bcache_device *d)
 {
struct sectors_dirty_init op;
+   int ret;
 
bch_btree_op_init(&op.op, -1);
op.inode = d->id;
+   op.count = 0;
+   op.start = KEY(op.inode, 0, 0);
 
-   bch_btree_map_keys(&op.op, d->c, &KEY(op.inode, 0, 0),
+   do {
+   ret = bch_btree_map_keys(&op.op, d->c, &op.start,
   sectors_dirty_init_fn, 0);
+   if (ret == -EAGAIN)
+   schedule_timeout_interruptible(
+   msecs_to_jiffies(INIT_KEYS_SLEEP_MS));
+   else if (ret < 0) {
+   pr_warn("sectors dirty init failed, ret=%d!", ret);
+   break;
+   }
+   } while (ret == -EAGAIN);
+
 }
 
 void bch_cached_dev_writeback_init(struct cached_dev *dc)
-- 
1.8.3.1
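
The batch-and-resume pattern above is generic beyond bcache. Below is a
minimal standalone sketch of the same "process a batch, yield, resume
from a saved cursor" idea; all names are hypothetical, not bcache APIs,
and it is written as userspace C so it can be compiled and run directly:

        #include <errno.h>
        #include <stdio.h>
        #include <unistd.h>

        #define BATCH_SIZE  50              /* like INIT_KEYS_EACH_TIME */
        #define SLEEP_US    (100 * 1000)    /* 100ms, like INIT_KEYS_SLEEP_MS */
        #define TOTAL_KEYS  1000            /* pretend workload */

        struct cursor {
                int next;                   /* resume point, like op.start */
        };

        /*
         * Process up to BATCH_SIZE items starting from the saved cursor.
         * Returns -EAGAIN when the batch limit is hit with work remaining,
         * 0 when everything has been processed.
         */
        static int scan_batch(struct cursor *cur)
        {
                int n;

                for (n = 0; n < BATCH_SIZE; n++) {
                        if (cur->next >= TOTAL_KEYS)
                                return 0;   /* done */
                        cur->next++;        /* "count one key's dirty sectors" */
                }
                return -EAGAIN;             /* yield; resume from cur->next */
        }

        int main(void)
        {
                struct cursor cur = { .next = 0 };
                int ret;

                do {
                        ret = scan_batch(&cur);
                        if (ret == -EAGAIN)
                                usleep(SLEEP_US);   /* let foreground work run */
                } while (ret == -EAGAIN);

                printf("scanned %d keys\n", cur.next);
                return ret;
        }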