Re: [PATCH V3] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-13 Thread Jens Axboe
On 4/12/18 5:59 AM, Ming Lei wrote:
> The normal request completion can happen before or during handling of
> BLK_EH_RESET_TIMER, and this race may cause the request to never be
> completed, since the driver's .timeout() may always return
> BLK_EH_RESET_TIMER.
> 
> This issue can't be fixed completely in the driver, since the normal
> completion can happen between .timeout() returning and
> BLK_EH_RESET_TIMER being handled.
> 
> This patch fixes the race by introducing the rq state
> MQ_RQ_COMPLETE_IN_RESET, and by reading/writing the rq's state while
> holding the queue lock. The lock could actually be per-request, but
> it isn't necessary to introduce one lock for such an unusual event.
> 
> Also, when .timeout() returns BLK_EH_HANDLED, sync with the normal
> completion path before finally completing this timed-out rq, to avoid
> having its state touched by a normal completion.
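
A minimal userspace model of the described scheme (illustrative names,
simplified to a single request and no timer; this is a sketch of the idea,
not the posted patch):

#include <pthread.h>
#include <stdio.h>

enum rq_state { IN_FLIGHT, IN_RESET, COMPLETE_IN_RESET, COMPLETE };

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static enum rq_state state = IN_RESET;	/* .timeout() is being handled */

static void normal_complete(void)	/* normal completion path */
{
	pthread_mutex_lock(&queue_lock);
	if (state == IN_RESET)
		state = COMPLETE_IN_RESET;	/* defer to the timeout path */
	else
		state = COMPLETE;
	pthread_mutex_unlock(&queue_lock);
}

static void timeout_reset_timer(void)	/* after BLK_EH_RESET_TIMER */
{
	pthread_mutex_lock(&queue_lock);
	if (state == COMPLETE_IN_RESET)
		state = COMPLETE;	/* the race happened: complete now */
	else
		state = IN_FLIGHT;	/* no race: re-arm the timer */
	pthread_mutex_unlock(&queue_lock);
}

int main(void)
{
	normal_complete();	/* completion arrives mid-reset */
	timeout_reset_timer();	/* timeout path notices and finishes it */
	printf("final state: %d (%d == COMPLETE)\n", state, COMPLETE);
	return 0;
}

Because both paths take the same lock, exactly one of them ends up
completing the request, which is the point of the patch.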

I like this approach since it keeps the cost outside of the fast
path. And it's fine to reuse the queue lock for this, instead of
adding a special lock for something we consider a rare occurrence.

From a quick look this looks sane, but I'll take a closer look
tomorrow and add some testing too.

-- 
Jens Axboe



Re: [PATCH] nbd: do not update block size if file system is mounted

2018-04-13 Thread Josef Bacik
Yeah sorry I screwed that up again.  I’m wondering if we can just drop this 
altogether and leave the zero setting in the config put that we already have.  
Does doing that fix it for you?  Thanks,

Josef

Sent from my iPhone

> On Apr 13, 2018, at 10:26 AM, Michael Tretter  
> wrote:
> 
>> On Fri, 16 Mar 2018 09:36:45 -0700, Jens Axboe wrote:
>>> On 3/15/18 3:00 AM, Michael Tretter wrote:
 On Wed, 28 Feb 2018 10:54:56 +0100, Michael Tretter wrote:  
 If a file system is mounted on the nbd during a disconnect, resetting
 the size to 0 might change the block size and destroy the buffer_head
 mappings. This will cause an infinite loop when the file system looks for
 the buffer_heads for flushing.
 
 Only set the file size to 0, if we are the only opener of the nbd.
 
 Signed-off-by: Michael Tretter 
 ---
 drivers/block/nbd.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
 
 diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
 index 5f2a4240a204..0d1dbb60430b 100644
 --- a/drivers/block/nbd.c
 +++ b/drivers/block/nbd.c
 @@ -1118,7 +1118,8 @@ static int nbd_start_device_ioctl(struct nbd_device 
 *nbd, struct block_device *b
if (ret)
sock_shutdown(nbd);
mutex_lock(&nbd->config_lock);
 -bd_set_size(bdev, 0);
 +if (bdev->bd_openers <= 1)
 +bd_set_size(bdev, 0);
/* user requested, ignore socket errors */
if (test_bit(NBD_DISCONNECT_REQUESTED, &config->runtime_flags))
ret = 0;  
>>> 
>>> Are there any comments on this patch?  
>> 
>> Josef?
>> 
> 
> Maybe some history helps for reviewing the patch:
> 
> Commit abbbdf12497d ("nbd: replace kill_bdev() with
> __invalidate_device()") already fixed this issue once by changing
> nbd_size_clear() to only set the size to 0 if the device is only opened
> once. The line was moved several times during the rework for the
> netlink interface and the check for the number of openers was dropped,
> reintroducing the issue.
> 
> Michael


Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-13 Thread Ming Lei
On Tue, Apr 10, 2018 at 03:01:57PM -0600, Bart Van Assche wrote:
> The blk-mq timeout handling code ignores completions that occur after
> blk_mq_check_expired() has been called and before blk_mq_rq_timed_out()
> has reset rq->aborted_gstate. If a block driver timeout handler always
> returns BLK_EH_RESET_TIMER then the result will be that the request
> never terminates.

We should fix this issue in the block layer because:

1) when the normal completion happens after the rq's state is updated
to COMPLETE during timeout handling, this patch still follows the
previous behaviour and resets the timer, instead of completing the
request immediately.

2) if the driver's .timeout() deals with this situation by returning
RESET_TIMER the first time, it is very likely to return RESET_TIMER
again on the next timeout, because neither the hw nor the sw state of
this request has changed since the 1st .timeout(). So it is quite
possible for this request never to be completed.

3) the normal completion may happen just after .timeout() returns, so
this issue can't be avoided by fixing the driver alone.

And the simple approach in the following link can fix the issue
without introducing any cost in the fast path:

https://marc.info/?l=linux-block&m=152353441722852&w=2

> 
> Since the request state can be updated from two different contexts,
> namely regular completion and request timeout, this race cannot be
> fixed with RCU synchronization only. Fix this race as follows:
> - Use the deadline instead of the request generation to detect whether
>   or not a request timer fired after reinitialization of a request.
> - Store the request state in the lowest two bits of the deadline instead
>   of the lowest two bits of 'gstate'.
> - Rename MQ_RQ_STATE_MASK into RQ_STATE_MASK and change it from an
>   enumeration member into a #define such that its type can be changed
>   into unsigned long. That allows to write & ~RQ_STATE_MASK instead of
>   ~(unsigned long)RQ_STATE_MASK.
> - Remove all request member variables that became superfluous due to
>   this change: gstate, aborted_gstate, gstate_seq and aborted_gstate_sync.
> - Remove the request state information that became superfluous due to this
>   patch, namely RQF_MQ_TIMEOUT_EXPIRED.
> - Remove the hctx member that became superfluous due to these changes,
>   namely nr_expired.
> - Remove the code that became superfluous due to this change, namely
>   the RCU lock and unlock statements in blk_mq_complete_request() and
>   also the synchronize_rcu() call in the timeout handler.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Tejun Heo 
> Cc: Christoph Hellwig 
> Cc: Ming Lei 
> Cc: Sagi Grimberg 
> Cc: Israel Rukshin ,
> Cc: Max Gurtovoy 
> Cc:  # v4.16
> ---
> 
> Changes compared to v4:
> - Addressed multiple review comments from Christoph. The most important are
>   that atomic_long_cmpxchg() has been changed into cmpxchg() and also that
>   there is now a nice and clean split between the legacy and blk-mq versions
>   of blk_add_timer().
> - Changed the patch name and modified the patch description because there is
>   disagreement about whether or not the v4.16 blk-mq core can complete a
>   single request twice. Kept the "Cc: stable" tag because of
>   https://bugzilla.kernel.org/show_bug.cgi?id=199077.
> 
> Changes compared to v3 (see also 
> https://www.mail-archive.com/linux-block@vger.kernel.org/msg20073.html):
> - Removed the spinlock again that was introduced to protect the request state.
>   v4 uses atomic_long_cmpxchg() instead.
> - Split __deadline into two variables - one for the legacy block layer and one
>   for blk-mq.
> 
> Changes compared to v2 
> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18338.html):
> - Rebased and retested on top of kernel v4.16.
> 
> Changes compared to v1 
> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18089.html):
> - Removed the gstate and aborted_gstate members of struct request and used
>   the __deadline member to encode both the generation and state information.
> 
>  block/blk-core.c   |   2 -
>  block/blk-mq-debugfs.c |   1 -
>  block/blk-mq.c | 174 
> +
>  block/blk-mq.h |  65 ++
>  block/blk-timeout.c|  89 ++---
>  block/blk.h|  13 ++--
>  include/linux/blk-mq.h |   1 -
>  include/linux/blkdev.h |  30 ++---
>  8 files changed, 118 insertions(+), 257 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 8625ec929fe5..181b1a688a5b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -200,8 +200,6 @@ void blk_rq_init(struct request_queue *q, struct request 
> *rq)
>   rq->start_time = jiffies;
>   set_start_time_ns(rq);
>   rq->part = NULL;
> - seqcount_init(&rq->gstate_seq);
> - 
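
As an aside on the mechanism Bart describes above (the request state kept
in the two lowest bits of the deadline and updated with cmpxchg), here is
a standalone sketch of the packing idea; names and values are illustrative,
not the kernel's:

#include <stdatomic.h>
#include <stdio.h>

#define RQ_STATE_MASK 0x3UL

enum rq_state { RQ_IDLE = 0, RQ_IN_FLIGHT = 1, RQ_COMPLETE = 2 };

/* deadline in the upper bits, state in the lowest two */
static unsigned long pack(unsigned long deadline, enum rq_state state)
{
	return (deadline & ~RQ_STATE_MASK) | state;
}

/* returns 1 if we won the transition, 0 if another context beat us */
static int change_state(_Atomic unsigned long *das,
			enum rq_state old_state, enum rq_state new_state)
{
	unsigned long old = atomic_load(das);

	if ((old & RQ_STATE_MASK) != old_state)
		return 0;
	return atomic_compare_exchange_strong(das, &old,
					      pack(old, new_state));
}

int main(void)
{
	_Atomic unsigned long das = pack(10000, RQ_IN_FLIGHT);

	/* completion and timeout race: exactly one of them can win */
	printf("first complete:  %d\n",
	       change_state(&das, RQ_IN_FLIGHT, RQ_COMPLETE));
	printf("second complete: %d\n",
	       change_state(&das, RQ_IN_FLIGHT, RQ_COMPLETE));
	return 0;
}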

[PATCHSET 0/2] loop: dio fixup

2018-04-13 Thread Jens Axboe
Wanted to do a v2 of the loop dio short read fix, since I figured
we should retain the zero fill for the case where we have to give up
doing reads. Also noticed that we had cmd->rq for no real reason, so
killed that off as a cleanup.

-- 
Jens Axboe



[PATCH 1/2] loop: remove cmd->rq member

2018-04-13 Thread Jens Axboe
We can always get at the request from the payload, no need to store
a pointer to it.
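
The layout that makes this work, as a small userspace sketch (simplified
structs; this mirrors how blk_mq_rq_to_pdu()/blk_mq_rq_from_pdu() derive
one pointer from the other):

#include <stdio.h>
#include <stdlib.h>

struct request { int tag; };
struct loop_cmd { int use_aio; };

/* blk-mq places the driver pdu directly behind struct request */
static void *rq_to_pdu(struct request *rq) { return rq + 1; }
static struct request *rq_from_pdu(void *pdu)
{
	return (struct request *)pdu - 1;
}

int main(void)
{
	struct request *rq =
		malloc(sizeof(struct request) + sizeof(struct loop_cmd));
	struct loop_cmd *cmd = rq_to_pdu(rq);

	printf("round trip ok: %d\n", rq_from_pdu(cmd) == rq);
	free(rq);
	return 0;
}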

Signed-off-by: Jens Axboe 
---
 drivers/block/loop.c | 36 +++-
 drivers/block/loop.h |  1 -
 2 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c9d04497a415..8b2fde2109fc 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -452,9 +452,9 @@ static void lo_complete_rq(struct request *rq)
 {
struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
 
-   if (unlikely(req_op(cmd->rq) == REQ_OP_READ && cmd->use_aio &&
-cmd->ret >= 0 && cmd->ret < blk_rq_bytes(cmd->rq))) {
-   struct bio *bio = cmd->rq->bio;
+   if (unlikely(req_op(rq) == REQ_OP_READ && cmd->use_aio &&
+cmd->ret >= 0 && cmd->ret < blk_rq_bytes(rq))) {
+   struct bio *bio = rq->bio;
 
bio_advance(bio, cmd->ret);
zero_fill_bio(bio);
@@ -465,11 +465,13 @@ static void lo_complete_rq(struct request *rq)
 
 static void lo_rw_aio_do_completion(struct loop_cmd *cmd)
 {
+   struct request *rq = blk_mq_rq_from_pdu(cmd);
+
if (!atomic_dec_and_test(&cmd->ref))
return;
kfree(cmd->bvec);
cmd->bvec = NULL;
-   blk_mq_complete_request(cmd->rq);
+   blk_mq_complete_request(rq);
 }
 
 static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
@@ -487,7 +489,7 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
 {
struct iov_iter iter;
struct bio_vec *bvec;
-   struct request *rq = cmd->rq;
+   struct request *rq = blk_mq_rq_from_pdu(cmd);
struct bio *bio = rq->bio;
struct file *file = lo->lo_backing_file;
unsigned int offset;
@@ -1702,15 +1704,16 @@ EXPORT_SYMBOL(loop_unregister_transfer);
 static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
const struct blk_mq_queue_data *bd)
 {
-   struct loop_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
-   struct loop_device *lo = cmd->rq->q->queuedata;
+   struct request *rq = bd->rq;
+   struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
+   struct loop_device *lo = rq->q->queuedata;
 
-   blk_mq_start_request(bd->rq);
+   blk_mq_start_request(rq);
 
if (lo->lo_state != Lo_bound)
return BLK_STS_IOERR;
 
-   switch (req_op(cmd->rq)) {
+   switch (req_op(rq)) {
case REQ_OP_FLUSH:
case REQ_OP_DISCARD:
case REQ_OP_WRITE_ZEROES:
@@ -1723,8 +1726,8 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx 
*hctx,
 
/* always use the first bio's css */
 #ifdef CONFIG_BLK_CGROUP
-   if (cmd->use_aio && cmd->rq->bio && cmd->rq->bio->bi_css) {
-   cmd->css = cmd->rq->bio->bi_css;
+   if (cmd->use_aio && rq->bio && rq->bio->bi_css) {
+   cmd->css = rq->bio->bi_css;
css_get(cmd->css);
} else
 #endif
@@ -1736,8 +1739,9 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx 
*hctx,
 
 static void loop_handle_cmd(struct loop_cmd *cmd)
 {
-   const bool write = op_is_write(req_op(cmd->rq));
-   struct loop_device *lo = cmd->rq->q->queuedata;
+   struct request *rq = blk_mq_rq_from_pdu(cmd);
+   const bool write = op_is_write(req_op(rq));
+   struct loop_device *lo = rq->q->queuedata;
int ret = 0;
 
if (write && (lo->lo_flags & LO_FLAGS_READ_ONLY)) {
@@ -1745,12 +1749,12 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
goto failed;
}
 
-   ret = do_req_filebacked(lo, cmd->rq);
+   ret = do_req_filebacked(lo, rq);
  failed:
/* complete non-aio request */
if (!cmd->use_aio || ret) {
cmd->ret = ret ? -EIO : 0;
-   blk_mq_complete_request(cmd->rq);
+   blk_mq_complete_request(rq);
}
 }
 
@@ -1767,9 +1771,7 @@ static int loop_init_request(struct blk_mq_tag_set *set, 
struct request *rq,
 {
struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
 
-   cmd->rq = rq;
kthread_init_work(>work, loop_queue_work);
-
return 0;
 }
 
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 0f45416e4fcf..b78de9879f4f 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -66,7 +66,6 @@ struct loop_device {
 
 struct loop_cmd {
struct kthread_work work;
-   struct request *rq;
bool use_aio; /* use AIO interface to handle I/O */
atomic_t ref; /* only for aio */
long ret;
-- 
2.7.4



[PATCH 2/2] loop: handle short DIO reads

2018-04-13 Thread Jens Axboe
We ran into an issue with loop and btrfs, where btrfs would complain about
checksum errors. It turns out that is because we don't handle short reads
at all, we just zero fill the remainder. Worse than that, we don't handle
the filling properly, which results in loop trying to advance a single
bio by much more than its size, since it doesn't take chaining into
account.

Handle short reads appropriately, by simply retrying at the new correct
offset. End the remainder of the request with EIO, if we get a 0 read.
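
To make the retry arithmetic concrete, a worked example (the numbers are
illustrative, not from the patch):

/* 64 KiB READ, DIO returns 16 KiB  =>  cmd->ret == 16384.
 * blk_update_request(rq, BLK_STS_OK, 16384) completes the first
 * 16 KiB and advances the request past it; blk_mq_requeue_request()
 * then reissues the remaining 48 KiB at the new, correct offset.
 * A 0-byte read instead zero fills what is left and ends the request
 * with BLK_STS_IOERR. */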

Signed-off-by: Jens Axboe 
---
 drivers/block/loop.c | 34 +++---
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 8b2fde2109fc..5d4e31655d96 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -451,16 +451,36 @@ static int lo_req_flush(struct loop_device *lo, struct 
request *rq)
 static void lo_complete_rq(struct request *rq)
 {
struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
+   blk_status_t ret = BLK_STS_OK;
 
-   if (unlikely(req_op(rq) == REQ_OP_READ && cmd->use_aio &&
-cmd->ret >= 0 && cmd->ret < blk_rq_bytes(rq))) {
-   struct bio *bio = rq->bio;
-
-   bio_advance(bio, cmd->ret);
-   zero_fill_bio(bio);
+   if (!cmd->use_aio || cmd->ret < 0 || cmd->ret == blk_rq_bytes(rq) ||
+   req_op(rq) != REQ_OP_READ) {
+   if (cmd->ret < 0)
+   ret = BLK_STS_IOERR;
+   goto end_io;
}
 
-   blk_mq_end_request(rq, cmd->ret < 0 ? BLK_STS_IOERR : BLK_STS_OK);
+   /*
+* Short READ - if we got some data, advance our request and
+* retry it. If we got no data, end the rest with EIO.
+*/
+   if (cmd->ret) {
+   blk_update_request(rq, BLK_STS_OK, cmd->ret);
+   cmd->ret = 0;
+   blk_mq_requeue_request(rq, true);
+   } else {
+   if (cmd->use_aio) {
+   struct bio *bio = rq->bio;
+
+   while (bio) {
+   zero_fill_bio(bio);
+   bio = bio->bi_next;
+   }
+   }
+   ret = BLK_STS_IOERR;
+end_io:
+   blk_mq_end_request(rq, ret);
+   }
 }
 
 static void lo_rw_aio_do_completion(struct loop_cmd *cmd)
-- 
2.7.4



Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-04-13 Thread Stephen Bates
 
>  I'll see if I can get our PCI SIG people to follow this through 

Hi Jonathan

Can you let me know if this moves forward within PCI-SIG? I would like to track 
it. I can see this being doable between Root Ports that reside in the same Root 
Complex but might become more challenging to standardize for RPs that reside in 
different RCs in the same (potentially multi-socket) system. I know in the past 
we have seen MemWr TLPs cross the QPI bus in Intel systems but I am sure that 
is not something that would work in all systems and must fall outside the remit 
of PCI-SIG ;-).

I agree such a capability bit would be very useful but it's going to be quite 
some time before we can rely on hardware being available that supports it.

Stephen




[GIT PULL] Followup pull request for block

2018-04-13 Thread Jens Axboe
Hi Linus,

Followup fixes for this merge window. This pull request contains:

- Series from Ming, fixing corner cases in our CPU <-> queue mapping.
  This triggered repeated warnings on especially s390, but I also hit it
  in cpu hot plug/unplug testing while doing IO on NVMe on x86-64.

- Another fix from Ming, ensuring that we always order budget and driver
  tag identically, avoiding a deadlock on QD=1 devices.

- Loop locking regression fix from this merge window, from Omar.

- Another loop locking fix, this time missing an unlock, from Tetsuo
  Handa.

- Fix for racing IO submission with device removal from Bart.

- sr reference fix from me, fixing a case where disk change or getevents
  can race with device removal.

- Set of nvme fixes by way of Keith, from various contributors.

Please pull!


  git://git.kernel.dk/linux-block.git tags/for-linus-20180413



Arnd Bergmann (1):
  nvme: target: fix buffer overflow

Bart Van Assche (1):
  blk-mq: Avoid that submitting a bio concurrently with device removal 
triggers a crash

Daniel Verkamp (1):
  nvmet: fix space padding in serial number

James Smart (1):
  nvme: expand nvmf_check_if_ready checks

Jens Axboe (1):
  sr: get/drop reference to device in revalidate and check_events

Johannes Thumshirn (2):
  nvme: unexport nvme_start_keep_alive
  nvme: don't send keep-alives to the discovery controller

Keith Busch (4):
  nvme-pci: Skip queue deletion if there are no queues
  nvme-pci: Remove unused queue parameter
  nvme-pci: Separate IO and admin queue IRQ vectors
  nvme: Use admin command effects for admin commands

Mathieu Malaterre (1):
  backing: silence compiler warning using __printf

Matias Bjørling (1):
  nvme: enforce 64bit offset for nvme_get_log_ext fn

Max Gurtovoy (1):
  nvme: check return value of init_srcu_struct function

Ming Lei (11):
  blk-mq: order getting budget and driver tag
  blk-mq: make sure that correct hctx->next_cpu is set
  blk-mq: don't keep offline CPUs mapped to hctx 0
  blk-mq: avoid to write intermediate result to hctx->next_cpu
  blk-mq: introduce blk_mq_hw_queue_first_cpu() to figure out first cpu
  blk-mq: remove blk_mq_delay_queue()
  blk-mq: don't check queue mapped in __blk_mq_delay_run_hw_queue()
  blk-mq: reimplement blk_mq_hw_queue_mapped
  blk-mq: remove code for dealing with remapping queue
  blk-mq: Revert "blk-mq: reimplement blk_mq_hw_queue_mapped"
  nvme-loop: fix kernel oops in case of unhandled command

Omar Sandoval (1):
  loop: fix LOOP_GET_STATUS lock imbalance

Rodrigo R. Galvao (1):
  nvmet: Fix nvmet_execute_write_zeroes sector count

Tetsuo Handa (1):
  block/loop: fix deadlock after loop_set_status

 block/blk-core.c|  35 ++--
 block/blk-mq-cpumap.c   |   5 --
 block/blk-mq-debugfs.c  |   1 -
 block/blk-mq.c  | 122 +++-
 drivers/block/loop.c|  45 ---
 drivers/nvme/host/core.c|  33 +++
 drivers/nvme/host/fabrics.c |  83 ++-
 drivers/nvme/host/fabrics.h |  33 +--
 drivers/nvme/host/fc.c  |  12 +---
 drivers/nvme/host/nvme.h|   4 +-
 drivers/nvme/host/pci.c |  35 +++-
 drivers/nvme/host/rdma.c|  14 +
 drivers/nvme/target/admin-cmd.c |   1 +
 drivers/nvme/target/discovery.c |   2 +-
 drivers/nvme/target/io-cmd.c|   4 +-
 drivers/nvme/target/loop.c  |  20 ++-
 drivers/scsi/sr.c   |  19 +--
 include/linux/backing-dev.h |   1 +
 include/linux/blk-mq.h  |   2 -
 19 files changed, 245 insertions(+), 226 deletions(-)

-- 
Jens Axboe



loop: handle short DIO reads

2018-04-13 Thread Jens Axboe
We ran into an issue with loop and btrfs, where btrfs would complain about
checksum errors. It turns out that is because we don't handle short reads
at all, we just zero fill the remainder. Worse than that, we don't handle
the filling properly, which results in loop trying to advance a single
bio by much more than its size, since it doesn't take chaining into
account.

Handle short reads appropriately, by simply retrying at the new correct
offset. End the remainder of the request with EIO, if we get a 0 read.

Signed-off-by: Jens Axboe 

---

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c9d04497a415..66975c4fb6f1 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -451,16 +451,28 @@ static int lo_req_flush(struct loop_device *lo, struct 
request *rq)
 static void lo_complete_rq(struct request *rq)
 {
struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
+   blk_status_t ret = BLK_STS_OK;
 
-   if (unlikely(req_op(cmd->rq) == REQ_OP_READ && cmd->use_aio &&
-cmd->ret >= 0 && cmd->ret < blk_rq_bytes(cmd->rq))) {
-   struct bio *bio = cmd->rq->bio;
-
-   bio_advance(bio, cmd->ret);
-   zero_fill_bio(bio);
+   if (!cmd->use_aio || cmd->ret < 0 || cmd->ret == blk_rq_bytes(rq) ||
+   req_op(cmd->rq) != REQ_OP_READ) {
+   if (cmd->ret < 0)
+   ret = BLK_STS_IOERR;
+   goto end_io;
}
 
-   blk_mq_end_request(rq, cmd->ret < 0 ? BLK_STS_IOERR : BLK_STS_OK);
+   /*
+* Short READ - if we got some data, advance our request and
+* retry it. If we got no data, end the rest with EIO.
+*/
+   if (cmd->ret) {
+   blk_update_request(cmd->rq, BLK_STS_OK, cmd->ret);
+   cmd->ret = 0;
+   blk_mq_requeue_request(cmd->rq, true);
+   } else {
+   ret = BLK_STS_IOERR;
+end_io:
+   blk_mq_end_request(rq, ret);
+   }
 }
 
 static void lo_rw_aio_do_completion(struct loop_cmd *cmd)

-- 
Jens Axboe



Re: [PATCH] loop: fix aio/dio end of request clearing

2018-04-13 Thread Jens Axboe
On 4/12/18 12:52 PM, Jens Axboe wrote:
> If we read less than the user asked for, we zero fill the last
> part. But the current code assumes that the request has just
> one bio, and since that's not guaranteed to be true, we can
> run into a situation where we attempt to advance a bio by
> a bigger amount than its size.
> 
> Handle the zero filling appropriately, regardless of what bio
> ends up needing to be cleared.
> 
> Signed-off-by: Jens Axboe 
> 
> ---
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index c9d04497a415..d3aa96a2f369 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -455,9 +455,22 @@ static void lo_complete_rq(struct request *rq)
>   if (unlikely(req_op(cmd->rq) == REQ_OP_READ && cmd->use_aio &&
>cmd->ret >= 0 && cmd->ret < blk_rq_bytes(cmd->rq))) {
>   struct bio *bio = cmd->rq->bio;
> + long left = cmd->ret;
>  
> - bio_advance(bio, cmd->ret);
> - zero_fill_bio(bio);
> + while (left >= bio->bi_iter.bi_size) {
> + left -= bio->bi_iter.bi_size;
> + bio = bio->bi_next;
> + if (WARN_ON_ONCE(!bio))
> + break;
> + }
> + while (bio) {
> + if (left) {
> + bio_advance(bio, left);
> + left = 0;
> + }
> + zero_fill_bio(bio);
> + bio = bio->bi_next;
> + }
>   }

Ignore this one, new version coming.

-- 
Jens Axboe



Re: [PATCH v2 0/2] tracing/events: block: bring more on a par with blktrace

2018-04-13 Thread Steven Rostedt
On Fri, 13 Apr 2018 18:36:20 +0200
Steffen Maier  wrote:

> I had the need to understand I/O request processing in detail.
> But I also had the need to enrich block traces with other trace events
> including my own dynamic kprobe events. So I preferred block trace events
> over blktrace to get everything nicely sorted into one ftrace output.
> However, I missed device filtering for (un)plug events and also
> the difference between the two flavors of unplug.
> 
> The first two patches bring block trace events closer to their
> counterpart in blktrace tooling.
> 
> The last patch is just an RFC. I still kept it in this patch set because
> it is inspired by PATCH 2/2.
> 
> Changes since v1:
> [PATCH v2 1/2]
> Use 0 and 1 instead of false and true for __print_symbolic table.
> Now "trace-cmd report" can decode it. [Steven Rostedt]
> 
> Steffen Maier (3):
>   tracing/events: block: track and print if unplug was explicit or
> schedule
>   tracing/events: block: dev_t via driver core for plug and unplug
> events
>   tracing/events: block: also try to get dev_t via driver core for some
> events
> 
>  include/trace/events/block.h | 33 -
>  1 file changed, 28 insertions(+), 5 deletions(-)
> 

From just the tracing point of view:

Reviewed-by: Steven Rostedt (VMware) 

as for the race conditions in the block layer, I'll let someone else
decided that.

-- Steve


Re: [PATCH v2] block: ratelimite pr_err on IO path

2018-04-13 Thread Martin K. Petersen

Jinpu,

[CC:ed the mpt3sas maintainers]

The ratelimit patch is just an attempt to treat the symptom, not the
cause.

> Thanks for asking: we updated the mpt3sas driver to enable DIX support
> (prot_mask=0x7f); all disks are SATA SSDs with no DIF support.
> After reboot, the kernel reports IO errors from all the drives behind
> the HBA, seemingly for almost every read IO, which makes the system unusable:
> [   13.079375] sda: ref tag error at location 0 (rcvd 143196159)
> [   13.079989] sda: ref tag error at location 937702912 (rcvd 143196159)
> [   13.080233] sda: ref tag error at location 937703072 (rcvd 143196159)
> [   13.080407] sda: ref tag error at location 0 (rcvd 143196159)
> [   13.080594] sda: ref tag error at location 8 (rcvd 143196159)

That sounds like a bug in the mpt3sas driver or firmware. I guess the
HBA could conceivably be operating a SATA device as DIX Type 0 and strip
the PI on the drive side. But that doesn't seem to be a particularly
useful mode of operation.

Jinpu: Which firmware are you running? Also, please send us the output
of:

sg_readcap -l /dev/sda
sg_inq -x /dev/sda
sg_vpd /dev/sda

Broadcom: How is DIX supposed to work for SATA drives behind an mpt3sas
controller?

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] target: Fix Fortify_panic kernel exception

2018-04-13 Thread Christoph Hellwig
The patch looks fine, but in general I think descriptions of what
you fixed in the code are more important than starting out with
a backtrace.

E.g. please explain what was wrong, how you fixed it and only after
that mention how it was caught.  (preferably without the whole trace)


[PATCH v2 2/2] tracing/events: block: dev_t via driver core for plug and unplug events

2018-04-13 Thread Steffen Maier
Complements v2.6.31 commit 55782138e47d ("tracing/events: convert block
trace points to TRACE_EVENT()") to be equivalent to traditional blktrace
output. Also this allows event filtering to not always get all (un)plug
events.

NB: The NULL pointer check for q->kobj.parent is certainly racy and
I don't have enough experience to judge whether it's good enough for a trace event.
The change did work for my cases (block device read/write I/O on
zfcp-attached SCSI disks and dm-mpath on top).

While I haven't seen any prior art using driver core (parent) relations
for trace events, there are other cases using this when no direct pointer
exists between objects, such as:
 #define to_scsi_target(d)  container_of(d, struct scsi_target, dev)
 static inline struct scsi_target *scsi_target(struct scsi_device *sdev)
 {
return to_scsi_target(sdev->sdev_gendev.parent);
 }

This is the object model we make use of here:

struct gendisk {
struct hd_struct {
struct device {  /*container_of*/
struct kobject kobj; <--+
dev_t  devt; /*deref*/  |
} __dev;|
} part0;|
struct request_queue *queue; ..+|
}  :|
   :|
struct request_queue {  <..+|
/* queue kobject */ |
struct kobject {|
struct kobject *parent; +
} kobj;
}

The parent pointer comes from:
 #define disk_to_dev(disk)  (&(disk)->part0.__dev)
int blk_register_queue(struct gendisk *disk)
struct device *dev = disk_to_dev(disk);
struct request_queue *q = disk->queue;
ret = kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue");
^^^parent

$ ls -d /sys/block/sdf/queue
/sys/block/sdf/queue
$ cat /sys/block/sdf/dev
8:80

A partition does not have its own request queue:

$ cat /sys/block/sdf/sdf1/dev
8:81
$ ls -d /sys/block/sdf/sdf1/queue
ls: cannot access '/sys/block/sdf/sdf1/queue': No such file or directory

The difference to blktrace parsed output is that block events don't use the
partition's minor number but the containing block device's minor number:

$ dd if=/dev/sdf1 count=1

$ cat /sys/kernel/debug/tracing/trace
block_bio_remap: 8,80 R 2048 + 32 <- (8,81) 0
block_bio_queue: 8,80 R 2048 + 32 [dd]
block_getrq: 8,80 R 2048 + 32 [dd]
block_plug: 8,80 [dd]

block_rq_insert: 8,80 R 16384 () 2048 + 32 [dd]
block_unplug: 8,80 [dd] 1 explicit
  
block_rq_issue: 8,80 R 16384 () 2048 + 32 [dd]
block_rq_complete: 8,80 R () 2048 + 32 [0]

$ btrace /dev/sdf1
  8,80   11 0.0 240240  A   R 2048 + 32 <- (8,81) 0
  8,81   12 0.000220890 240240  Q   R 2048 + 32 [dd]
  8,81   13 0.000229639 240240  G   R 2048 + 32 [dd]
  8,81   14 0.000231805 240240  P   N [dd]
^^
  8,81   15 0.000234671 240240  I   R 2048 + 32 [dd]
  8,81   16 0.000236365 240240  U   N [dd] 1
^^
  8,81   17 0.000238527 240240  D   R 2048 + 32 [dd]
  8,81   22 0.000613741 0  C   R 2048 + 32 [0]

Signed-off-by: Steffen Maier 
---
 include/trace/events/block.h | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index e90bb6eb8097..6ea5a3899c2e 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -460,14 +460,18 @@ TRACE_EVENT(block_plug,
TP_ARGS(q),
 
TP_STRUCT__entry(
+   __field( dev_t, dev )
__array( char,  comm,   TASK_COMM_LEN   )
),
 
TP_fast_assign(
+   __entry->dev = q->kobj.parent ?
+   container_of(q->kobj.parent, struct device, kobj)->devt : 0;
memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
),
 
-   TP_printk("[%s]", __entry->comm)
+   TP_printk("%d,%d [%s]",
+ MAJOR(__entry->dev), MINOR(__entry->dev), __entry->comm)
 );
 
 #define show_block_unplug_explicit(val)\
@@ -482,18 +486,23 @@ DECLARE_EVENT_CLASS(block_unplug,
TP_ARGS(q, depth, explicit),
 
TP_STRUCT__entry(
+   __field( dev_t, dev )
__field( int,   nr_rq   )
__field( bool,  explicit)
__array( char,  comm,   TASK_COMM_LEN   )
),
 
TP_fast_assign(
+   __entry->dev   = q->kobj.parent ?
+   container_of(q->kobj.parent, struct device, kobj)->devt : 0;
__entry->nr_rq = depth;
__entry->explicit = explicit;
memcpy(__entry->comm, current->comm, TASK_COMM_LEN);

[PATCH v2 1/2] tracing/events: block: track and print if unplug was explicit or schedule

2018-04-13 Thread Steffen Maier
Just like blktrace distinguishes explicit and schedule by means of
BLK_TA_UNPLUG_IO and BLK_TA_UNPLUG_TIMER, actually make use of the
existing argument "explicit" to distinguish the two cases in the one
common tracepoint block_unplug.

Complements v2.6.39 commit 49cac01e1fa7 ("block: make unplug timer trace
event correspond to the schedule() unplug") and commit d9c978331790
("block: remove block_unplug_timer() trace point").

Signed-off-by: Steffen Maier 
---

Changes since v1:
Use 0 and 1 instead of false and true for __print_symbolic table.
Now "trace-cmd report" can decode it. [Steven Rostedt]

 include/trace/events/block.h | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 81b43f5bdf23..e90bb6eb8097 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -470,6 +470,11 @@ TRACE_EVENT(block_plug,
TP_printk("[%s]", __entry->comm)
 );
 
+#define show_block_unplug_explicit(val)\
+   __print_symbolic(val,   \
+{0, "schedule"},   \
+{1, "explicit"})
+
 DECLARE_EVENT_CLASS(block_unplug,
 
TP_PROTO(struct request_queue *q, unsigned int depth, bool explicit),
@@ -478,15 +483,18 @@ DECLARE_EVENT_CLASS(block_unplug,
 
TP_STRUCT__entry(
__field( int,   nr_rq   )
+   __field( bool,  explicit)
__array( char,  comm,   TASK_COMM_LEN   )
),
 
TP_fast_assign(
__entry->nr_rq = depth;
+   __entry->explicit = explicit;
memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
),
 
-   TP_printk("[%s] %d", __entry->comm, __entry->nr_rq)
+   TP_printk("[%s] %d %s", __entry->comm, __entry->nr_rq,
+ show_block_unplug_explicit(__entry->explicit))
 );
 
 /**
-- 
2.13.5



[PATCH v2 0/2] tracing/events: block: bring more on a par with blktrace

2018-04-13 Thread Steffen Maier
I had the need to understand I/O request processing in detail.
But I also had the need to enrich block traces with other trace events
including my own dynamic kprobe events. So I preferred block trace events
over blktrace to get everything nicely sorted into one ftrace output.
However, I missed device filtering for (un)plug events and also
the difference between the two flavors of unplug.

The first two patches bring block trace events closer to their
counterpart in blktrace tooling.

The last patch is just an RFC. I still kept it in this patch set because
it is inspired by PATCH 2/2.

Changes since v1:
[PATCH v2 1/2]
Use 0 and 1 instead of false and true for __print_symbolic table.
Now "trace-cmd report" can decode it. [Steven Rostedt]

Steffen Maier (3):
  tracing/events: block: track and print if unplug was explicit or
schedule
  tracing/events: block: dev_t via driver core for plug and unplug
events
  tracing/events: block: also try to get dev_t via driver core for some
events

 include/trace/events/block.h | 33 -
 1 file changed, 28 insertions(+), 5 deletions(-)

-- 
2.13.5



[RFC v2] tracing/events: block: also try to get dev_t via driver core for some events

2018-04-13 Thread Steffen Maier
Complements v2.6.31 commit 55782138e47d ("tracing/events: convert block
trace points to TRACE_EVENT()") for cases where rq->rq_disk == NULL:
  block_rq_requeue, block_rq_insert, block_rq_issue;
and for cases where bio == NULL:
  block_getrq, block_sleeprq.

NB: The NULL pointer check for q->kobj.parent is certainly racy and
I don't have enough experience to judge whether it's good enough for a trace event.

Since I don't know when above cases would occur,
I'm not sure this is worth it.

Signed-off-by: Steffen Maier 
---
 include/trace/events/block.h | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 6ea5a3899c2e..001e4f369df9 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -86,7 +86,9 @@ TRACE_EVENT(block_rq_requeue,
),
 
TP_fast_assign(
-   __entry->dev   = rq->rq_disk ? disk_devt(rq->rq_disk) : 0;
+   __entry->dev   = rq->rq_disk ? disk_devt(rq->rq_disk) :
+   (q->kobj.parent ?
+container_of(q->kobj.parent, struct device, kobj)->devt : 0);
__entry->sector= blk_rq_trace_sector(rq);
__entry->nr_sector = blk_rq_trace_nr_sectors(rq);
 
@@ -162,7 +164,9 @@ DECLARE_EVENT_CLASS(block_rq,
),
 
TP_fast_assign(
-   __entry->dev   = rq->rq_disk ? disk_devt(rq->rq_disk) : 0;
+   __entry->dev   = rq->rq_disk ? disk_devt(rq->rq_disk) :
+   (q->kobj.parent ?
+container_of(q->kobj.parent, struct device, kobj)->devt : 0);
__entry->sector= blk_rq_trace_sector(rq);
__entry->nr_sector = blk_rq_trace_nr_sectors(rq);
__entry->bytes = blk_rq_bytes(rq);
@@ -397,7 +401,9 @@ DECLARE_EVENT_CLASS(block_get_rq,
 ),
 
TP_fast_assign(
-   __entry->dev= bio ? bio_dev(bio) : 0;
+   __entry->dev= bio ? bio_dev(bio) :
+   (q->kobj.parent ?
+container_of(q->kobj.parent, struct device, kobj)->devt : 0);
__entry->sector = bio ? bio->bi_iter.bi_sector : 0;
__entry->nr_sector  = bio ? bio_sectors(bio) : 0;
blk_fill_rwbs(__entry->rwbs,
-- 
2.13.5



Re: [PATCH 1/3] nbd: del_gendisk after we cleanup the queue

2018-04-13 Thread Bart Van Assche
On Fri, 2018-04-13 at 12:21 -0400, Josef Bacik wrote:
> On Fri, Apr 13, 2018 at 04:16:18PM +, Bart Van Assche wrote:
> > On Fri, 2018-04-13 at 12:03 -0400, Josef Bacik wrote:
> > > This fixes a use-after-free bug: we need to do the del_gendisk after we
> > > clean up the queue on the device.
> > 
> > Hello Josef,
> > 
> > Which use-after-free bug does this patch fix? Are you aware that most block
> > drivers call blk_cleanup_queue() before calling del_gendisk()? Why do you
> > think that the nbd driver should use the opposite order?
> 
> I'm confused, that's what this patch does.  Before I had del_gendisk() first 
> and
> then the blk_cleanup_queue(), which was bugging out when I was testing stuff
> with a null pointer deref whenever I rmmod'ed the nbd.  Swapping it to the way
> everybody else did it fixed the problem.  Thanks,

Hello Josef,

Oops, I swapped "blk_cleanup_queue()" and "del_gendisk()" in my e-mail.

Can you share the call stack of the NULL pointer deref and also the translation
of the crash address into a source code line?

Thanks,

Bart.




Re: [PATCH 1/3] nbd: del_gendisk after we cleanup the queue

2018-04-13 Thread Josef Bacik
On Fri, Apr 13, 2018 at 04:16:18PM +, Bart Van Assche wrote:
> On Fri, 2018-04-13 at 12:03 -0400, Josef Bacik wrote:
> > This fixes a use-after-free bug: we need to do the del_gendisk after we
> > clean up the queue on the device.
> 
> Hello Josef,
> 
> Which use-after-free bug does this patch fix? Are you aware that most block
> drivers call blk_cleanup_queue() before calling del_gendisk()? Why do you
> think that the nbd driver should use the opposite order?
> 

I'm confused, that's what this patch does.  Before I had del_gendisk() first and
then the blk_cleanup_queue(), which was bugging out when I was testing stuff
with a null pointer deref whenever I rmmod'ed the nbd.  Swapping it to the way
everybody else did it fixed the problem.  Thanks,

Josef


Re: [PATCH 1/3] nbd: del_gendisk after we cleanup the queue

2018-04-13 Thread Bart Van Assche
On Fri, 2018-04-13 at 12:03 -0400, Josef Bacik wrote:
> This fixes a use-after-free bug: we need to do the del_gendisk after we
> clean up the queue on the device.

Hello Josef,

Which use-after-free bug does this patch fix? Are you aware that most block
drivers call blk_cleanup_queue() before calling del_gendisk()? Why do you
think that the nbd driver should use the opposite order?

Thanks,

Bart.



[PATCH 3/3] nbd: use bd_set_size when updating disk size

2018-04-13 Thread Josef Bacik
From: Josef Bacik 

When we stopped relying on the bdev everywhere I broke updating the
block device size on the fly, which ceph relies on.  We can't just do
set_capacity, we also have to do bd_set_size so things like parted will
notice the device size change.

Fixes: 29eaadc ("nbd: stop using the bdev everywhere")
cc: sta...@vger.kernel.org
Signed-off-by: Josef Bacik 
---
 drivers/block/nbd.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 1520383b12f6..61dd95aa47fa 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -231,9 +231,15 @@ static void nbd_size_clear(struct nbd_device *nbd)
 static void nbd_size_update(struct nbd_device *nbd)
 {
struct nbd_config *config = nbd->config;
+   struct block_device *bdev = bdget_disk(nbd->disk, 0);
+
blk_queue_logical_block_size(nbd->disk->queue, config->blksize);
blk_queue_physical_block_size(nbd->disk->queue, config->blksize);
set_capacity(nbd->disk, config->bytesize >> 9);
+   if (bdev) {
+   bd_set_size(bdev, config->bytesize);
+   bdput(bdev);
+   }
kobject_uevent(&nbd_to_dev(nbd)->kobj, KOBJ_CHANGE);
 }
 
@@ -1111,7 +1117,6 @@ static int nbd_start_device_ioctl(struct nbd_device *nbd, 
struct block_device *b
if (ret)
return ret;
 
-   bd_set_size(bdev, config->bytesize);
if (max_part)
bdev->bd_invalidated = 1;
mutex_unlock(>config_lock);
-- 
2.14.3



[PATCH 1/3] nbd: del_gendisk after we cleanup the queue

2018-04-13 Thread Josef Bacik
From: Josef Bacik 

This fixes a use-after-free bug: we need to do the del_gendisk after we
clean up the queue on the device.

Fixes: c6a4759ea0c9 ("nbd: add device refcounting")
cc: sta...@vger.kernel.org
Signed-off-by: Josef Bacik 
---
 drivers/block/nbd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 86258b00a1d4..e33da3e6aa20 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -174,8 +174,8 @@ static void nbd_dev_remove(struct nbd_device *nbd)
 {
struct gendisk *disk = nbd->disk;
if (disk) {
-   del_gendisk(disk);
blk_cleanup_queue(disk->queue);
+   del_gendisk(disk);
blk_mq_free_tag_set(>tag_set);
disk->private_data = NULL;
put_disk(disk);
-- 
2.14.3



[PATCH 2/3] nbd: update size when connected

2018-04-13 Thread Josef Bacik
From: Josef Bacik 

I messed up changing the size of an NBD device while it was connected by
not actually updating the device or doing the uevent.  Fix this by
updating everything if we're connected and we change the size.

cc: sta...@vger.kernel.org
Fixes: 639812a ("nbd: don't set the device size until we're connected")
Signed-off-by: Josef Bacik 
---
 drivers/block/nbd.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index e33da3e6aa20..1520383b12f6 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -243,6 +243,8 @@ static void nbd_size_set(struct nbd_device *nbd, loff_t 
blocksize,
struct nbd_config *config = nbd->config;
config->blksize = blocksize;
config->bytesize = blocksize * nr_blocks;
+   if (nbd->task_recv != NULL)
+   nbd_size_update(nbd);
 }
 
 static void nbd_complete_rq(struct request *req)
-- 
2.14.3



Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-13 Thread Ming Lei
On Thu, Apr 12, 2018 at 06:57:12AM -0700, Tejun Heo wrote:
> On Thu, Apr 12, 2018 at 07:05:13AM +0800, Ming Lei wrote:
> > > Not really because aborted_gstate right now doesn't have any memory
> > > barrier around it, so nothing ensures blk_add_timer() actually appears
> > > before.  We can either add the matching barriers in aborted_gstate
> > > update and when it's read in the normal completion path, or we can
> > > wait for the update to be visible everywhere by waiting for rcu grace
> > > period (because the reader is rcu protected).
> > 
> > Seems not necessary.
> > 
> > Suppose it is out of order, the only side-effect is that the new
> > recycled request is timed out as a bit late, I think that is what
> > we can survive, right?
> 
> It at least can mess up the timeout duration for the next recycle
> instance because there can be two competing blk_add_timer() instances.
> I'm not sure whether there can be other consequences.  When ownership
> isn't clear, it becomes really difficult to reason about these things
> and can lead to subtle failures.  I think it'd be best to always
> establish who owns what.

Please see the code of blk_add_timer() for blk-mq:

blk_rq_set_deadline(req, jiffies + req->timeout);
req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;

if (!timer_pending(&q->timeout) ||
    time_before(expiry, q->timeout.expires))
        mod_timer(&q->timeout, expiry);

If this rq is recycled, blk_add_timer() only touches rq->deadline
and the EXPIRED flags, and the only effect is that the timeout
may be handled a bit late, but the timeout monitor won't be lost.

And this thing shouldn't be difficult to avoid, as you mentioned,
synchronize_rcu() can be added between blk_add_timer() and
resetting aborted gstate for avoiding it.
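
In code terms, the suggested ordering would land in the BLK_EH_RESET_TIMER
arm of the timeout handler, roughly like this (a sketch of the suggestion,
not a posted patch):

	case BLK_EH_RESET_TIMER:
		blk_add_timer(req);
		/* suggested: make the new deadline visible to all RCU
		 * readers before the request can be recycled */
		synchronize_rcu();
		blk_mq_rq_update_aborted_gstate(req, 0);
		break;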


thanks,
Ming


Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-13 Thread t...@kernel.org
Hello, Bart.

On Thu, Apr 12, 2018 at 10:40:52PM +, Bart Van Assche wrote:
> Is something like the patch below perhaps what you had in mind?

Yeah, exactly.  It'd be really great to have the lockdep asserts added
in the right places.
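
For reference, the kind of assertion being discussed (a generic
illustration with a hypothetical helper, not the actual patch):

	static void example_queue_teardown(struct request_queue *q)
	{
		/* warns (with a stack trace) if the caller doesn't hold it */
		lockdep_assert_held(q->queue_lock);

		/* ... teardown work that requires the queue lock ... */
	}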

Thanks.

-- 
tejun


Re: [PATCH] nbd: do not update block size if file system is mounted

2018-04-13 Thread Michael Tretter
On Fri, 16 Mar 2018 09:36:45 -0700, Jens Axboe wrote:
> On 3/15/18 3:00 AM, Michael Tretter wrote:
> > On Wed, 28 Feb 2018 10:54:56 +0100, Michael Tretter wrote:  
> >> If a file system is mounted on the nbd during a disconnect, resetting
> >> the size to 0 might change the block size and destroy the buffer_head
> >> mappings. This will cause an infinite loop when the file system looks for
> >> the buffer_heads for flushing.
> >>
> >> Only set the file size to 0, if we are the only opener of the nbd.
> >>
> >> Signed-off-by: Michael Tretter 
> >> ---
> >>  drivers/block/nbd.c | 3 ++-
> >>  1 file changed, 2 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> >> index 5f2a4240a204..0d1dbb60430b 100644
> >> --- a/drivers/block/nbd.c
> >> +++ b/drivers/block/nbd.c
> >> @@ -1118,7 +1118,8 @@ static int nbd_start_device_ioctl(struct nbd_device 
> >> *nbd, struct block_device *b
> >>if (ret)
> >>sock_shutdown(nbd);
> >>mutex_lock(&nbd->config_lock);
> >> -  bd_set_size(bdev, 0);
> >> +  if (bdev->bd_openers <= 1)
> >> +  bd_set_size(bdev, 0);
> >>/* user requested, ignore socket errors */
> >>if (test_bit(NBD_DISCONNECT_REQUESTED, &config->runtime_flags))
> >>ret = 0;  
> > 
> > Are there any comments on this patch?  
> 
> Josef?
> 

Maybe some history helps for reviewing the patch:

Commit abbbdf12497d ("nbd: replace kill_bdev() with
__invalidate_device()") already fixed this issue once by changing
nbd_size_clear() to only set the size to 0 if the device is only opened
once. The line was moved several times during the rework for the
netlink interface and the check for the number of openers was dropped,
reintroducing the issue.

Michael


Re: [PATCH 1/2] tracing/events: block: track and print if unplug was explicit or schedule

2018-04-13 Thread Steven Rostedt
On Fri, 13 Apr 2018 15:07:17 +0200
Steffen Maier  wrote:

> Just like blktrace distinguishes explicit and schedule by means of
> BLK_TA_UNPLUG_IO and BLK_TA_UNPLUG_TIMER, actually make use of the
> existing argument "explicit" to distinguish the two cases in the one
> common tracepoint block_unplug.
> 
> Complements v2.6.39 commit 49cac01e1fa7 ("block: make unplug timer trace
> event correspond to the schedule() unplug") and commit d9c978331790
> ("block: remove block_unplug_timer() trace point").
> 
> Signed-off-by: Steffen Maier 
> ---
>  include/trace/events/block.h | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/include/trace/events/block.h b/include/trace/events/block.h
> index 81b43f5bdf23..a13613d27cee 100644
> --- a/include/trace/events/block.h
> +++ b/include/trace/events/block.h
> @@ -470,6 +470,11 @@ TRACE_EVENT(block_plug,
>   TP_printk("[%s]", __entry->comm)
>  );
>  
> +#define show_block_unplug_explicit(val)  \
> + __print_symbolic(val,   \
> +  {false, "schedule"},   \
> +  {true,  "explicit"})

That's new. I haven't seen "true"/"false" values used for
print_symbolic before. But could you please use 1 and 0 instead, because
perf and trace-cmd won't be able to parse that. I could update
libtraceevent to handle it, but really, the first parameter is supposed
to be numeric.

-- Steve

> +
>  DECLARE_EVENT_CLASS(block_unplug,
>  
>   TP_PROTO(struct request_queue *q, unsigned int depth, bool explicit),
> @@ -478,15 +483,18 @@ DECLARE_EVENT_CLASS(block_unplug,
>  
>   TP_STRUCT__entry(
>   __field( int,   nr_rq   )
> + __field( bool,  explicit)
>   __array( char,  comm,   TASK_COMM_LEN   )
>   ),
>  
>   TP_fast_assign(
>   __entry->nr_rq = depth;
> + __entry->explicit = explicit;
>   memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
>   ),
>  
> - TP_printk("[%s] %d", __entry->comm, __entry->nr_rq)
> + TP_printk("[%s] %d %s", __entry->comm, __entry->nr_rq,
> +   show_block_unplug_explicit(__entry->explicit))
>  );
>  
>  /**



[PATCH 2/2] tracing/events: block: dev_t via driver core for plug and unplug events

2018-04-13 Thread Steffen Maier
Complements v2.6.31 commit 55782138e47d ("tracing/events: convert block
trace points to TRACE_EVENT()") to be equivalent to traditional blktrace
output. Also this allows event filtering to not always get all (un)plug
events.

NB: The NULL pointer check for q->kobj.parent is certainly racy and
I don't have enough experience to judge whether it's good enough for a trace event.
The change did work for my cases (block device read/write I/O on
zfcp-attached SCSI disks and dm-mpath on top).

While I haven't seen any prior art using driver core (parent) relations
for trace events, there are other cases using this when no direct pointer
exists between objects, such as:
 #define to_scsi_target(d)  container_of(d, struct scsi_target, dev)
 static inline struct scsi_target *scsi_target(struct scsi_device *sdev)
 {
return to_scsi_target(sdev->sdev_gendev.parent);
 }

This is the object model we make use of here:

struct gendisk {
struct hd_struct {
struct device {  /*container_of*/
struct kobject kobj; <--+
dev_t  devt; /*deref*/  |
} __dev;|
} part0;|
struct request_queue *queue; ..+|
}  :|
   :|
struct request_queue {  <..+|
/* queue kobject */ |
struct kobject {|
struct kobject *parent; +
} kobj;
}

The parent pointer comes from:
 #define disk_to_dev(disk)  (&(disk)->part0.__dev)
int blk_register_queue(struct gendisk *disk)
struct device *dev = disk_to_dev(disk);
struct request_queue *q = disk->queue;
ret = kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue");
^^^parent

$ ls -d /sys/block/sdf/queue
/sys/block/sdf/queue
$ cat /sys/block/sdf/dev
8:80

A partition does not have its own request queue:

$ cat /sys/block/sdf/sdf1/dev
8:81
$ ls -d /sys/block/sdf/sdf1/queue
ls: cannot access '/sys/block/sdf/sdf1/queue': No such file or directory

The difference to blktrace parsed output is that block events don't use the
partition's minor number but the containing block device's minor number:

$ dd if=/dev/sdf1 count=1

$ cat /sys/kernel/debug/tracing/trace
block_bio_remap: 8,80 R 2048 + 32 <- (8,81) 0
block_bio_queue: 8,80 R 2048 + 32 [dd]
block_getrq: 8,80 R 2048 + 32 [dd]
block_plug: 8,80 [dd]

block_rq_insert: 8,80 R 16384 () 2048 + 32 [dd]
block_unplug: 8,80 [dd] 1 explicit
  
block_rq_issue: 8,80 R 16384 () 2048 + 32 [dd]
block_rq_complete: 8,80 R () 2048 + 32 [0]

$ btrace /dev/sdf1
  8,80   11 0.0 240240  A   R 2048 + 32 <- (8,81) 0
  8,81   12 0.000220890 240240  Q   R 2048 + 32 [dd]
  8,81   13 0.000229639 240240  G   R 2048 + 32 [dd]
  8,81   14 0.000231805 240240  P   N [dd]
^^
  8,81   15 0.000234671 240240  I   R 2048 + 32 [dd]
  8,81   16 0.000236365 240240  U   N [dd] 1
^^
  8,81   17 0.000238527 240240  D   R 2048 + 32 [dd]
  8,81   22 0.000613741 0  C   R 2048 + 32 [0]

Signed-off-by: Steffen Maier 
---
 include/trace/events/block.h | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index a13613d27cee..cffedc26e8a3 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -460,14 +460,18 @@ TRACE_EVENT(block_plug,
TP_ARGS(q),
 
TP_STRUCT__entry(
+   __field( dev_t, dev )
__array( char,  comm,   TASK_COMM_LEN   )
),
 
TP_fast_assign(
+   __entry->dev = q->kobj.parent ?
+   container_of(q->kobj.parent, struct device, kobj)->devt : 0;
memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
),
 
-   TP_printk("[%s]", __entry->comm)
+   TP_printk("%d,%d [%s]",
+ MAJOR(__entry->dev), MINOR(__entry->dev), __entry->comm)
 );
 
 #define show_block_unplug_explicit(val)\
@@ -482,18 +486,23 @@ DECLARE_EVENT_CLASS(block_unplug,
TP_ARGS(q, depth, explicit),
 
TP_STRUCT__entry(
+   __field( dev_t, dev )
__field( int,   nr_rq   )
__field( bool,  explicit)
__array( char,  comm,   TASK_COMM_LEN   )
),
 
TP_fast_assign(
+   __entry->dev   = q->kobj.parent ?
+   container_of(q->kobj.parent, struct device, kobj)->devt : 0;
__entry->nr_rq = depth;
__entry->explicit = explicit;
memcpy(__entry->comm, current->comm, TASK_COMM_LEN);

[PATCH 1/2] tracing/events: block: track and print if unplug was explicit or schedule

2018-04-13 Thread Steffen Maier
Just like blktrace distinguishes explicit and schedule by means of
BLK_TA_UNPLUG_IO and BLK_TA_UNPLUG_TIMER, actually make use of the
existing argument "explicit" to distinguish the two cases in the one
common tracepoint block_unplug.

Complements v2.6.39 commit 49cac01e1fa7 ("block: make unplug timer trace
event correspond to the schedule() unplug") and commit d9c978331790
("block: remove block_unplug_timer() trace point").

Signed-off-by: Steffen Maier 
---
 include/trace/events/block.h | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 81b43f5bdf23..a13613d27cee 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -470,6 +470,11 @@ TRACE_EVENT(block_plug,
TP_printk("[%s]", __entry->comm)
 );
 
+#define show_block_unplug_explicit(val)\
+   __print_symbolic(val,   \
+{false, "schedule"},   \
+{true,  "explicit"})
+
 DECLARE_EVENT_CLASS(block_unplug,
 
TP_PROTO(struct request_queue *q, unsigned int depth, bool explicit),
@@ -478,15 +483,18 @@ DECLARE_EVENT_CLASS(block_unplug,
 
TP_STRUCT__entry(
__field( int,   nr_rq   )
+   __field( bool,  explicit)
__array( char,  comm,   TASK_COMM_LEN   )
),
 
TP_fast_assign(
__entry->nr_rq = depth;
+   __entry->explicit = explicit;
memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
),
 
-   TP_printk("[%s] %d", __entry->comm, __entry->nr_rq)
+   TP_printk("[%s] %d %s", __entry->comm, __entry->nr_rq,
+ show_block_unplug_explicit(__entry->explicit))
 );
 
 /**
-- 
2.13.5



[PATCH 0/2] tracing/events: block: bring more on a par with blktrace

2018-04-13 Thread Steffen Maier
I had the need to understand I/O request processing in detail.
But I also had the need to enrich block traces with other trace events
including my own dynamic kprobe events. So I preferred block trace events
over blktrace to get everything nicely sorted into one ftrace output.
However, I missed device filtering for (un)plug events and also
the difference between the two flavors of unplug.

The first two patches bring block trace events closer to their
counterpart in blktrace tooling.

The last patch is just an RFC. I still kept it in this patch set because
it is inspired by PATCH 2/2.

Steffen Maier (3):
  tracing/events: block: track and print if unplug was explicit or
schedule
  tracing/events: block: dev_t via driver core for plug and unplug
events
  tracing/events: block: also try to get dev_t via driver core for some
events

 include/trace/events/block.h | 33 -
 1 file changed, 28 insertions(+), 5 deletions(-)

-- 
2.13.5



[RFC] tracing/events: block: also try to get dev_t via driver core for some events

2018-04-13 Thread Steffen Maier
Complements v2.6.31 commit 55782138e47d ("tracing/events: convert block
trace points to TRACE_EVENT()") for cases where rq->rq_disk == NULL:
  block_rq_requeue, block_rq_insert, block_rq_issue;
and for cases where bio == NULL:
  block_getrq, block_sleeprq.

NB: The NULL pointer check for q->kobj.parent is certainly racy and
I don't have enough experience to judge whether it's good enough for a trace event.

Since I don't know when above cases would occur,
I'm not sure this is worth it.

Signed-off-by: Steffen Maier 
---
 include/trace/events/block.h | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index cffedc26e8a3..c476b7af404b 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -86,7 +86,9 @@ TRACE_EVENT(block_rq_requeue,
),
 
TP_fast_assign(
-   __entry->dev   = rq->rq_disk ? disk_devt(rq->rq_disk) : 0;
+   __entry->dev   = rq->rq_disk ? disk_devt(rq->rq_disk) :
+   (q->kobj.parent ?
+container_of(q->kobj.parent, struct device, kobj)->devt : 0);
__entry->sector= blk_rq_trace_sector(rq);
__entry->nr_sector = blk_rq_trace_nr_sectors(rq);
 
@@ -162,7 +164,9 @@ DECLARE_EVENT_CLASS(block_rq,
),
 
TP_fast_assign(
-   __entry->dev   = rq->rq_disk ? disk_devt(rq->rq_disk) : 0;
+   __entry->dev   = rq->rq_disk ? disk_devt(rq->rq_disk) :
+   (q->kobj.parent ?
+container_of(q->kobj.parent, struct device, kobj)->devt : 0);
__entry->sector= blk_rq_trace_sector(rq);
__entry->nr_sector = blk_rq_trace_nr_sectors(rq);
__entry->bytes = blk_rq_bytes(rq);
@@ -397,7 +401,9 @@ DECLARE_EVENT_CLASS(block_get_rq,
 ),
 
TP_fast_assign(
-   __entry->dev= bio ? bio_dev(bio) : 0;
+   __entry->dev= bio ? bio_dev(bio) :
+   (q->kobj.parent ?
+container_of(q->kobj.parent, struct device, kobj)->devt : 0);
__entry->sector = bio ? bio->bi_iter.bi_sector : 0;
__entry->nr_sector  = bio ? bio_sectors(bio) : 0;
blk_fill_rwbs(__entry->rwbs,
-- 
2.13.5



Re: [PATCH 1/2] bcache: remove unncessary code in bch_btree_keys_init()

2018-04-13 Thread tang . junhui
Hi Coly,

>Function bch_btree_keys_init() initializes b->set[].size and
>b->set[].data to zero. As the code comments indicate, this code is indeed
>unnecessary, because both struct btree_keys and struct bset_tree are
>embedded into struct btree; when struct btree is filled with 0
>bits by kzalloc() in mca_bucket_alloc(), b->set[].size and
>b->set[].data are initialized to 0 (a.k.a NULL) already.
>
Eh, you need to pay attention to mca_alloc()==>mca_reap(),
which gets btree node memory from the btree_cache_freeable or
btree_cache_freed lists. That memory was used as a btree node before,
and may not be filled with 0.
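
In outline, the two allocation paths in question (a simplified sketch of
the existing flow, not a patch):

	/* mca_alloc(): prefer recycling an existing node */
	b = mca_reap(...);	/* from btree_cache_freeable/freed: the memory
				 * was a btree node before and is NOT zeroed */
	if (!b)
		b = mca_bucket_alloc(c, k, gfp);  /* fresh kzalloc(): zeroed */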
>This patch removes the redundant code, and add comments in
>bch_btree_keys_init() and mca_bucket_alloc() to explain why it's safe.
>
>Signed-off-by: Coly Li 
>---
> drivers/md/bcache/bset.c  | 13 ++---
> drivers/md/bcache/btree.c |  4 
> 2 files changed, 10 insertions(+), 7 deletions(-)
>
>diff --git a/drivers/md/bcache/bset.c b/drivers/md/bcache/bset.c
>index 579c696a5fe0..343f4e9428e0 100644
>--- a/drivers/md/bcache/bset.c
>+++ b/drivers/md/bcache/bset.c
>@@ -352,15 +352,14 @@ void bch_btree_keys_init(struct btree_keys *b, const 
>struct btree_keys_ops *ops,
> b->nsets = 0;
> b->last_set_unwritten = 0;
> 
>-/* XXX: shouldn't be needed */
>-for (i = 0; i < MAX_BSETS; i++)
>-b->set[i].size = 0;
> /*
>- * Second loop starts at 1 because b->keys[0]->data is the memory we
>- * allocated
>+ * struct btree_keys is embedded in struct btree, and struct
>+ * bset_tree is embedded in struct btree_keys. They are all
>+ * initialized as 0 by kzalloc() in mca_bucket_alloc(), and
>+ * b->set[0].data is allocated in bch_btree_keys_alloc(), so we
>+ * don't have to initialize b->set[].size and b->set[].data here
>+ * any more.
>  */
>-for (i = 1; i < MAX_BSETS; i++)
>-b->set[i].data = NULL;
> }
> EXPORT_SYMBOL(bch_btree_keys_init);
> 
>diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
>index 17936b2dc7d6..344641e23415 100644
>--- a/drivers/md/bcache/btree.c
>+++ b/drivers/md/bcache/btree.c
>@@ -600,6 +600,10 @@ static void mca_data_alloc(struct btree *b, struct bkey *k, gfp_t gfp)
> static struct btree *mca_bucket_alloc(struct cache_set *c,
> 				      struct bkey *k, gfp_t gfp)
> {
>+	/*
>+	 * kzalloc() is necessary here for initialization,
>+	 * see code comments in bch_btree_keys_init().
>+	 */
> 	struct btree *b = kzalloc(sizeof(struct btree), gfp);
> 	if (!b)
> 		return NULL;
>-- 
>2.16.2


Thanks.
Tang Junhui


Re: [PATCH 1/4] bcache: finish incremental GC

2018-04-13 Thread tang . junhui
Hi Coly,

> Hi Junhui,

> How about sending out the v2 patch so I can confirm it with a Reviewed-by?

Certainly it's OK.

Thanks.
Tang Junhui


Re: [PATCH 1/4] bcache: finish incremental GC

2018-04-13 Thread Coly Li
On 2018/4/13 4:37 PM, tang.jun...@zte.com.cn wrote:
> Hi Coly,
>>>
>>> Hello Coly,
>>>
 On 2018/4/12 2:38 PM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui 
>
> In the GC thread, we record the latest GC key in gc_done, which is expected
> to be used for incremental GC, but in the current code we don't make use of
> it. When GC runs, front-side I/O is blocked until the GC is over, which can
> be a long time if there are a lot of btree nodes.
>
> This patch implements incremental GC; the main idea is that, when there
> are front-side I/Os, after GC'ing some nodes (100) we stop GC, release the
> lock on the btree node, and go process the front-side I/Os for some time
> (100 ms), then go back to GC again.
>
> With this patch, I/Os are no longer blocked for the whole time GC is
> running, and the I/O rate no longer visibly drops to zero.
>
> Signed-off-by: Tang Junhui 

 Hi Junhui,

 I have replied with my comments inline,
>>> Thanks for your comments in advance.

> ---
>  drivers/md/bcache/bcache.h  |  5 +
>  drivers/md/bcache/btree.c   | 15 ++-
>  drivers/md/bcache/request.c |  3 +++
>  3 files changed, 22 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 843877e..ab4c9ca 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -445,6 +445,7 @@ struct cache {
>  
>  struct gc_stat {
>  	size_t		nodes;
> +	size_t		nodes_pre;
>  	size_t		key_bytes;
>  
>  	size_t		nkeys;
> @@ -568,6 +569,10 @@ struct cache_set {
>  	 */
>  	atomic_t	rescale;
>  	/*
> +	 * used for GC, to identify if any front-side I/O is inflight
> +	 */
> +	atomic_t	io_inflight;

 Could you please rename the above variable to something like
 search_inflight? Just to be more explicit, considering we have many
 types of I/O requests.
>>> OK, It looks better.

> +/*
>   * When we invalidate buckets, we use both the priority and the amount
>   * of good data to determine which buckets to reuse first - to weight
>   * those together consistently we keep track of the smallest nonzero
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index 81e8dc3..b36d276 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -90,6 +90,9 @@
>  
>  #define MAX_NEED_GC		64
>  #define MAX_SAVE_PRIO		72
> +#define MIN_GC_NODES		100
> +#define GC_SLEEP_TIME		100

 How about naming the above macro GC_SLEEP_MS? Then people may
 understand the sleep time is 100ms without checking more context.
>>> OK, It looks better.
> +
>  
>  #define PTR_DIRTY_BIT		(((uint64_t) 1 << 36))
>  
> @@ -1581,6 +1584,13 @@ static int btree_gc_recurse(struct btree *b, struct btree_op *op,
>  		memmove(r + 1, r, sizeof(r[0]) * (GC_MERGE_NODES - 1));
>  		r->b = NULL;
>  
> +		if (atomic_read(&b->c->io_inflight) &&
> +		    gc->nodes >= gc->nodes_pre + MIN_GC_NODES) {

 On 32-bit machines, gc->nodes is a 32-bit count; if it overflows, the
 above check will be false for a very long time, and in some conditions
 it might stay false after gc->nodes has overflowed.

 How about adding the following check before the if() statement?
>>>
 /* in case gc->nodes is overflowed */
 if (gc->nodes_pre < gc->nodes)
 	gc->nodes_pre = gc->nodes;

>>> I think 32 bits is big enough; it can express values in the billions,
>>> while the number of nodes is usually only in the tens of thousands.
>>> Also, gc->nodes and gc->nodes_pre are set to zero each time GC begins.
>>>
>>> What do you think?
>>
>> Oh, I see. Then I don't worry here. The patch is OK to me.
> Thanks Coly. Would you mind if I add your Reviewed-by to the v2 patch?

Hi Junhui,

How about sending out the v2 patch and letting me confirm it with a Reviewed-by?

Coly Li
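
For readers following the thread, the change under discussion amounts to
roughly the following (a sketch only, using Coly's suggested names; the
retry loop in the caller is an assumption about how the unwind would be
wired up, and the actual v2 patch may differ):

	/* in btree_gc_recurse(), after finishing one btree node */
	if (atomic_read(&b->c->search_inflight) &&
	    gc->nodes >= gc->nodes_pre + MIN_GC_NODES) {
		gc->nodes_pre = gc->nodes;	/* remember where we paused */
		ret = -EAGAIN;			/* unwind and drop node locks */
		break;
	}

	/* in the GC thread, assumed shape of the retry loop */
	do {
		ret = btree_root(gc_root, c, &op, &writes, &stats);
		if (ret == -EAGAIN)
			schedule_timeout_interruptible(
					msecs_to_jiffies(GC_SLEEP_MS));
	} while (ret == -EAGAIN);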



Re: [PATCH v2] block: ratelimite pr_err on IO path

2018-04-13 Thread Jinpu Wang
On Thu, Apr 12, 2018 at 11:20 PM, Martin K. Petersen wrote:
>
> Jack,
>
>> +			pr_err_ratelimited("%s: ref tag error at location %llu (rcvd %u)\n",
>
> I'm a bit concerned about dropping records of potential data loss.
>
> Also, what are you doing that compels all these to be logged? This
> should be a very rare occurrence.
>
> --
> Martin K. Petersen  Oracle Linux Engineering
Hi Martin,

Thanks for asking. We updated the mpt3sas driver, which enables DIX support
(prot_mask=0x7f); all disks are SATA SSDs with no DIF support.
After reboot, the kernel reports I/O errors from all the drives behind the
HBA, seemingly for almost every read I/O, which makes the system unusable:
[   13.079375] sda: ref tag error at location 0 (rcvd 143196159)
[   13.079989] sda: ref tag error at location 937702912 (rcvd 143196159)
[   13.080233] sda: ref tag error at location 937703072 (rcvd 143196159)
[   13.080407] sda: ref tag error at location 0 (rcvd 143196159)
[   13.080594] sda: ref tag error at location 8 (rcvd 143196159)
[   13.080996] sda: ref tag error at location 0 (rcvd 143196159)
[   13.089878] sdb: ref tag error at location 0 (rcvd 143196159)
[   13.090275] sdb: ref tag error at location 937702912 (rcvd 277413887)
[   13.090448] sdb: ref tag error at location 937703072 (rcvd 143196159)
[   13.090655] sdb: ref tag error at location 0 (rcvd 143196159)
[   13.090823] sdb: ref tag error at location 8 (rcvd 277413887)
[   13.091218] sdb: ref tag error at location 0 (rcvd 143196159)
[   13.095412] sdc: ref tag error at location 0 (rcvd 143196159)
[   13.095859] sdc: ref tag error at location 937702912 (rcvd 143196159)
[   13.096058] sdc: ref tag error at location 937703072 (rcvd 143196159)
[   13.096228] sdc: ref tag error at location 0 (rcvd 143196159)
[   13.096445] sdc: ref tag error at location 8 (rcvd 143196159)
[   13.096833] sdc: ref tag error at location 0 (rcvd 277413887)
[   13.097187] sds: ref tag error at location 0 (rcvd 277413887)
[   13.097707] sds: ref tag error at location 937702912 (rcvd 143196159)
[   13.097855] sds: ref tag error at location 937703072 (rcvd 277413887)

This is with kernels 4.15 and 4.14.28. I scanned the commits upstream and
haven't found anything relevant; in 4.4.112 there are no such errors.
Disabling DIX support (prot_mask=0x7) in mpt3sas fixes the problem.

Regards,
-- 
Jack Wang
Linux Kernel Developer  ProfitBricks GmbH
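
For context on what ratelimiting would and would not drop here:
pr_err_ratelimited() keeps a per-call-site ratelimit state, by default on
the order of 10 messages per 5-second interval, and the ratelimit code
logs how many callbacks were suppressed. Roughly what it expands to,
simplified from include/linux/printk.h (details vary by kernel version):

#define pr_err_ratelimited(fmt, ...)					\
({									\
	static DEFINE_RATELIMIT_STATE(_rs,				\
				      DEFAULT_RATELIMIT_INTERVAL,	\
				      DEFAULT_RATELIMIT_BURST);		\
									\
	if (__ratelimit(&_rs))						\
		pr_err(fmt, ##__VA_ARGS__);				\
})

Note the state is per call site, not per disk: with a flood like the one
above, only a small burst per interval would be logged across all disks,
which is the basis for Martin's concern about dropping individual records
of potential data loss.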


Re: [PATCH 1/4] bcache: finish incremental GC

2018-04-13 Thread tang . junhui
Hi Coly,
>> 
>> Hello Coly,
>> 
>>> On 2018/4/12 2:38 PM, tang.jun...@zte.com.cn wrote:
 From: Tang Junhui 

 In the GC thread, we record the latest GC key in gc_done, which is expected
 to be used for incremental GC, but in the current code we don't make use of
 it. When GC runs, front-side I/O is blocked until the GC is over, which can
 be a long time if there are a lot of btree nodes.

 This patch implements incremental GC; the main idea is that, when there
 are front-side I/Os, after GC'ing some nodes (100) we stop GC, release the
 lock on the btree node, and go process the front-side I/Os for some time
 (100 ms), then go back to GC again.

 With this patch, I/Os are no longer blocked for the whole time GC is
 running, and the I/O rate no longer visibly drops to zero.

 Signed-off-by: Tang Junhui 
>>>
>>> Hi Junhui,
>>>
>>> I have replied with my comments inline,
>> Thanks for your comments in advance.
>>>
 ---
  drivers/md/bcache/bcache.h  |  5 +
  drivers/md/bcache/btree.c   | 15 ++-
  drivers/md/bcache/request.c |  3 +++
  3 files changed, 22 insertions(+), 1 deletion(-)

 diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
 index 843877e..ab4c9ca 100644
 --- a/drivers/md/bcache/bcache.h
 +++ b/drivers/md/bcache/bcache.h
 @@ -445,6 +445,7 @@ struct cache {
  
  struct gc_stat {
  	size_t		nodes;
 +	size_t		nodes_pre;
  	size_t		key_bytes;
  
  	size_t		nkeys;
 @@ -568,6 +569,10 @@ struct cache_set {
  	 */
  	atomic_t	rescale;
  	/*
 +	 * used for GC, to identify if any front-side I/O is inflight
 +	 */
 +	atomic_t	io_inflight;
>>>
>>> Could you please rename the above variable to something like
>>> search_inflight? Just to be more explicit, considering we have many
>>> types of I/O requests.
>> OK, It looks better.
>>>
 +/*
   * When we invalidate buckets, we use both the priority and the amount
   * of good data to determine which buckets to reuse first - to weight
   * those together consistently we keep track of the smallest nonzero
 diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
 index 81e8dc3..b36d276 100644
 --- a/drivers/md/bcache/btree.c
 +++ b/drivers/md/bcache/btree.c
 @@ -90,6 +90,9 @@
  
  #define MAX_NEED_GC		64
  #define MAX_SAVE_PRIO		72
 +#define MIN_GC_NODES		100
 +#define GC_SLEEP_TIME		100
>>>
>>> How about naming the above macro GC_SLEEP_MS? Then people may
>>> understand the sleep time is 100ms without checking more context.
>> OK, It looks better.
 +
  
  #define PTR_DIRTY_BIT		(((uint64_t) 1 << 36))
  
 @@ -1581,6 +1584,13 @@ static int btree_gc_recurse(struct btree *b, struct btree_op *op,
  		memmove(r + 1, r, sizeof(r[0]) * (GC_MERGE_NODES - 1));
  		r->b = NULL;
  
 +		if (atomic_read(&b->c->io_inflight) &&
 +		    gc->nodes >= gc->nodes_pre + MIN_GC_NODES) {
>>>
>>> On 32-bit machines, gc->nodes is a 32-bit count; if it overflows, the
>>> above check will be false for a very long time, and in some conditions
>>> it might stay false after gc->nodes has overflowed.
>>>
>>> How about adding the following check before the if() statement?
>> 
>>> /* in case gc->nodes is overflowed */
>>> if (gc->nodes_pre < gc->nodes)
>>> 	gc->nodes_pre = gc->nodes;
>>>
>> I think 32 bits is big enough; it can express values in the billions,
>> while the number of nodes is usually only in the tens of thousands.
>> Also, gc->nodes and gc->nodes_pre are set to zero each time GC begins.
>> 
>> What do you think?
>
>Oh, I see. Then I don't worry here. The patch is OK to me.
Thanks Coly. Would you mind if I add your Reviewed-by to the v2 patch?

Thanks.
Tang Junhui


Re: [PATCH v2] block: do not use interruptible wait anywhere

2018-04-13 Thread Johannes Thumshirn
Hi Alan,

On Thu, 2018-04-12 at 19:11 +0100, Alan Jenkins wrote:
> # dd if=/dev/sda of=/dev/null iflag=direct & \
>   while killall -SIGUSR1 dd; do sleep 0.1; done & \
>   echo mem > /sys/power/state ; \
>   sleep 5; killall dd  # stop after 5 seconds

Can you please also add a regression test to blktests[1] for this?

[1] https://github.com/osandov/blktests

Thanks,
Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
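
For anyone picking this up, a rough skeleton of such a blktests case (the
test group/number and helper usage are illustrative and should be checked
against the blktests tree, not copied verbatim):

#!/bin/bash
# Hypothetical tests/block/xxx: I/O with pending signals survives suspend.

DESCRIPTION="direct I/O with signals pending survives system suspend"
TIMED=1

requires() {
	_have_program rtcwake
}

test_device() {
	echo "Running ${TEST_NAME}"

	dd if="$TEST_DEV" of=/dev/null iflag=direct status=none &
	dd_pid=$!
	while kill -SIGUSR1 "$dd_pid" 2>/dev/null; do sleep 0.1; done &

	rtcwake -m mem -s 5	# suspend to RAM, auto-resume after 5 seconds

	sleep 5
	kill "$dd_pid" 2>/dev/null
	wait

	echo "Test complete"
}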


Re: [PATCH 3/4] bcache: notify allocator to prepare for GC

2018-04-13 Thread tang . junhui
Hi Coly

Thanks for you comments.
> Hi Junhui,
> 
> This method is complicated IMHO. 
The main idea of this patch is:

GC thread                              allocator thread
==> triggered by sectors_to_gc,
    set ca->prepare_gc to GC_PREPARING
    to notify allocator thread to
    prepare for GC
                                       ==> detect ca->prepare_gc is
                                           GC_PREPARING, go to
                                           invalidate_buckets(), and fill
                                           the free_inc queue with
                                           reclaimable buckets; after
                                           that, set ca->prepare_gc to
                                           GC_PREPARED to notify the GC
                                           thread that preparing is done
==> waiting for allocator
    thread to prepare over
==> detect ca->prepare_gc is
    GC_PREPARED, set
    ca->prepare_gc back to
    GC_PREPARE_NONE, and continue GC

> Why not in bch_allocator_thread()
> simply call wake_up_gc(ca->set) after invalidate_buckets()?
In this patch, GC is triggered by sectors_to_gc while front-side I/Os
keep arriving, so we notify the allocator to pack the free_inc queue
full of buckets before GC; that way we have enough buckets for
front-side I/Os during the GC period.
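
In rough pseudo-C, the handshake above amounts to this (a sketch of the
idea only; the GC_PREPAR* states are from the diagram, while the wait
queues and their names are assumptions, not the patch itself):

	/* GC thread, before starting the actual mark and sweep */
	atomic_set(&ca->prepare_gc, GC_PREPARING);
	wake_up(&c->alloc_wait);		/* nudge the allocator */
	wait_event(c->gc_wait,
		   atomic_read(&ca->prepare_gc) == GC_PREPARED);
	atomic_set(&ca->prepare_gc, GC_PREPARE_NONE);

	/* allocator thread, in its main loop */
	if (atomic_read(&ca->prepare_gc) == GC_PREPARING) {
		invalidate_buckets(ca);		/* refill free_inc */
		atomic_set(&ca->prepare_gc, GC_PREPARED);
		wake_up(&c->gc_wait);
	}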

Maybe the code is somewhat redundant; I will try to refactor it
in the next patch.

 
Thanks.  
Tang Junhui