Re: [PATCH V3] blk-mq: fix race between complete and BLK_EH_RESET_TIMER
On 4/12/18 5:59 AM, Ming Lei wrote:
> The normal request completion can be done before or during handling
> BLK_EH_RESET_TIMER, and this race may cause the request to never be
> completed since the driver's .timeout() may always return
> BLK_EH_RESET_TIMER.
>
> This issue can't be fixed completely by the driver, since the normal
> completion can happen between returning from .timeout() and handling
> BLK_EH_RESET_TIMER.
>
> This patch fixes the race by introducing the rq state
> MQ_RQ_COMPLETE_IN_RESET, and reading/writing the rq's state while
> holding the queue lock. The lock could actually be per-request, but it
> isn't necessary to introduce one lock for such an unusual event.
>
> Also, when .timeout() returns BLK_EH_HANDLED, sync with the normal
> completion path before finally completing this timed-out rq, to avoid
> this rq's state being touched by the normal completion.

I like this approach since it keeps the cost outside of the fast path. And it's fine to reuse the queue lock for this, instead of adding a special lock for something we consider a rare occurrence.

From a quick look this looks sane, but I'll take a closer look tomorrow and add some testing too.

-- Jens Axboe
Re: [PATCH] nbd: do not update block size if file system is mounted
Yeah sorry I screwed that up again. I'm wondering if we can just drop this altogether and leave the zero setting in the config put that we already have. Does doing that fix it for you? Thanks,

Josef

Sent from my iPhone

> On Apr 13, 2018, at 10:26 AM, Michael Tretter wrote:
>
>> On Fri, 16 Mar 2018 09:36:45 -0700, Jens Axboe wrote:
>>> On 3/15/18 3:00 AM, Michael Tretter wrote:
>>>> On Wed, 28 Feb 2018 10:54:56 +0100, Michael Tretter wrote:
>>>>> If a file system is mounted on the nbd during a disconnect,
>>>>> resetting the size to 0 might change the block size and destroy
>>>>> the buffer_head mappings. This will cause an infinite loop when
>>>>> the file system looks for the buffer_heads for flushing.
>>>>>
>>>>> Only set the file size to 0 if we are the only opener of the nbd.
>>>>>
>>>>> Signed-off-by: Michael Tretter
>>>>> ---
>>>>>  drivers/block/nbd.c | 3 ++-
>>>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
>>>>> index 5f2a4240a204..0d1dbb60430b 100644
>>>>> --- a/drivers/block/nbd.c
>>>>> +++ b/drivers/block/nbd.c
>>>>> @@ -1118,7 +1118,8 @@ static int nbd_start_device_ioctl(struct nbd_device *nbd, struct block_device *b
>>>>>  	if (ret)
>>>>>  		sock_shutdown(nbd);
>>>>>  	mutex_lock(&nbd->config_lock);
>>>>> -	bd_set_size(bdev, 0);
>>>>> +	if (bdev->bd_openers <= 1)
>>>>> +		bd_set_size(bdev, 0);
>>>>>  	/* user requested, ignore socket errors */
>>>>>  	if (test_bit(NBD_DISCONNECT_REQUESTED, &nbd->runtime_flags))
>>>>>  		ret = 0;
>>>
>>> Are there any comments on this patch?
>>
>> Josef?
>
> Maybe some history helps for reviewing the patch:
>
> Commit abbbdf12497d ("nbd: replace kill_bdev() with
> __invalidate_device()") already fixed this issue once by changing
> nbd_size_clear() to only set the size to 0 if the device is only opened
> once. The line was moved several times during the rework for the
> netlink interface and the check for the number of openers was dropped,
> reintroducing the issue.
>
> Michael
Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER
On Tue, Apr 10, 2018 at 03:01:57PM -0600, Bart Van Assche wrote:
> The blk-mq timeout handling code ignores completions that occur after
> blk_mq_check_expired() has been called and before blk_mq_rq_timed_out()
> has reset rq->aborted_gstate. If a block driver timeout handler always
> returns BLK_EH_RESET_TIMER then the result will be that the request
> never terminates.

We should fix this issue in the block layer because:

1) When the normal completion is done after the rq's state is updated to
COMPLETE during timeout handling, this patch still follows the previous
behaviour of resetting the timer, instead of completing this request
immediately.

2) If the driver's .timeout() deals with this situation by returning
RESET_TIMER the first time, it is very likely to handle it as RESET_TIMER
again on the next timeout, because both the hw and sw state wrt. this
request are unchanged compared with the 1st .timeout(). So it is very
possible for this request to never be completed.

3) The normal completion may be done just after returning from .timeout(),
so this issue may not be avoidable by fixing the driver.

And the simple approach in the following link can fix the issue without
introducing any cost in the fast path:

https://marc.info/?l=linux-block&m=152353441722852&w=2

> Since the request state can be updated from two different contexts,
> namely regular completion and request timeout, this race cannot be
> fixed with RCU synchronization only. Fix this race as follows:
> - Use the deadline instead of the request generation to detect whether
>   or not a request timer fired after reinitialization of a request.
> - Store the request state in the lowest two bits of the deadline instead
>   of the lowest two bits of 'gstate'.
> - Rename MQ_RQ_STATE_MASK into RQ_STATE_MASK and change it from an
>   enumeration member into a #define such that its type can be changed
>   into unsigned long. That allows writing & ~RQ_STATE_MASK instead of
>   ~(unsigned long)RQ_STATE_MASK.
> - Remove all request member variables that became superfluous due to
>   this change: gstate, aborted_gstate, gstate_seq and aborted_gstate_sync.
> - Remove the request state information that became superfluous due to this
>   patch, namely RQF_MQ_TIMEOUT_EXPIRED.
> - Remove the hctx member that became superfluous due to these changes,
>   namely nr_expired.
> - Remove the code that became superfluous due to this change, namely
>   the RCU lock and unlock statements in blk_mq_complete_request() and
>   also the synchronize_rcu() call in the timeout handler.
>
> Signed-off-by: Bart Van Assche
> Cc: Tejun Heo
> Cc: Christoph Hellwig
> Cc: Ming Lei
> Cc: Sagi Grimberg
> Cc: Israel Rukshin
> Cc: Max Gurtovoy
> Cc: stable # v4.16
> ---
>
> Changes compared to v4:
> - Addressed multiple review comments from Christoph. The most important are
>   that atomic_long_cmpxchg() has been changed into cmpxchg() and also that
>   there is now a nice and clean split between the legacy and blk-mq versions
>   of blk_add_timer().
> - Changed the patch name and modified the patch description because there is
>   disagreement about whether or not the v4.16 blk-mq core can complete a
>   single request twice. Kept the "Cc: stable" tag because of
>   https://bugzilla.kernel.org/show_bug.cgi?id=199077.
>
> Changes compared to v3 (see also
> https://www.mail-archive.com/linux-block@vger.kernel.org/msg20073.html):
> - Removed the spinlock again that was introduced to protect the request state.
>   v4 uses atomic_long_cmpxchg() instead.
> - Split __deadline into two variables - one for the legacy block layer and one
>   for blk-mq.
>
> Changes compared to v2
> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18338.html):
> - Rebased and retested on top of kernel v4.16.
>
> Changes compared to v1
> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18089.html):
> - Removed the gstate and aborted_gstate members of struct request and used
>   the __deadline member to encode both the generation and state information.
>
>  block/blk-core.c       |   2 -
>  block/blk-mq-debugfs.c |   1 -
>  block/blk-mq.c         | 174 ++++++++++-----------
>  block/blk-mq.h         |  65 ++++++--
>  block/blk-timeout.c    |  89 ++---------
>  block/blk.h            |  13 ++--
>  include/linux/blk-mq.h |   1 -
>  include/linux/blkdev.h |  30 ++---
>  8 files changed, 118 insertions(+), 257 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 8625ec929fe5..181b1a688a5b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -200,8 +200,6 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
>  	rq->start_time = jiffies;
>  	set_start_time_ns(rq);
>  	rq->part = NULL;
> -	seqcount_init(&rq->gstate_seq);
> -
[PATCHSET 0/2] loop: dio fixup
Wanted to do a v2 of the loop dio short read fix, since I figured we should retain the zero fill for the case where we have to give up doing reads. Also noticed that we have cmd->rq for some reason, so killed that off as a cleanup.

-- Jens Axboe
[PATCH 1/2] loop: remove cmd->rq member
We can always get at the request from the payload, no need to store a
pointer to it.

Signed-off-by: Jens Axboe
---
 drivers/block/loop.c | 36 +++++++++++++++++++-----------------
 drivers/block/loop.h |  1 -
 2 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c9d04497a415..8b2fde2109fc 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -452,9 +452,9 @@ static void lo_complete_rq(struct request *rq)
 {
 	struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
 
-	if (unlikely(req_op(cmd->rq) == REQ_OP_READ && cmd->use_aio &&
-		     cmd->ret >= 0 && cmd->ret < blk_rq_bytes(cmd->rq))) {
-		struct bio *bio = cmd->rq->bio;
+	if (unlikely(req_op(rq) == REQ_OP_READ && cmd->use_aio &&
+		     cmd->ret >= 0 && cmd->ret < blk_rq_bytes(rq))) {
+		struct bio *bio = rq->bio;
 
 		bio_advance(bio, cmd->ret);
 		zero_fill_bio(bio);
@@ -465,11 +465,13 @@ static void lo_complete_rq(struct request *rq)
 
 static void lo_rw_aio_do_completion(struct loop_cmd *cmd)
 {
+	struct request *rq = blk_mq_rq_from_pdu(cmd);
+
 	if (!atomic_dec_and_test(&cmd->ref))
 		return;
 	kfree(cmd->bvec);
 	cmd->bvec = NULL;
-	blk_mq_complete_request(cmd->rq);
+	blk_mq_complete_request(rq);
 }
 
 static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
@@ -487,7 +489,7 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
 {
 	struct iov_iter iter;
 	struct bio_vec *bvec;
-	struct request *rq = cmd->rq;
+	struct request *rq = blk_mq_rq_from_pdu(cmd);
 	struct bio *bio = rq->bio;
 	struct file *file = lo->lo_backing_file;
 	unsigned int offset;
@@ -1702,15 +1704,16 @@ EXPORT_SYMBOL(loop_unregister_transfer);
 static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 		const struct blk_mq_queue_data *bd)
 {
-	struct loop_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
-	struct loop_device *lo = cmd->rq->q->queuedata;
+	struct request *rq = bd->rq;
+	struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
+	struct loop_device *lo = rq->q->queuedata;
 
-	blk_mq_start_request(bd->rq);
+	blk_mq_start_request(rq);
 
 	if (lo->lo_state != Lo_bound)
 		return BLK_STS_IOERR;
 
-	switch (req_op(cmd->rq)) {
+	switch (req_op(rq)) {
 	case REQ_OP_FLUSH:
 	case REQ_OP_DISCARD:
 	case REQ_OP_WRITE_ZEROES:
@@ -1723,8 +1726,8 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 	/* always use the first bio's css */
 #ifdef CONFIG_BLK_CGROUP
-	if (cmd->use_aio && cmd->rq->bio && cmd->rq->bio->bi_css) {
-		cmd->css = cmd->rq->bio->bi_css;
+	if (cmd->use_aio && rq->bio && rq->bio->bi_css) {
+		cmd->css = rq->bio->bi_css;
 		css_get(cmd->css);
 	} else
 #endif
@@ -1736,8 +1739,9 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 static void loop_handle_cmd(struct loop_cmd *cmd)
 {
-	const bool write = op_is_write(req_op(cmd->rq));
-	struct loop_device *lo = cmd->rq->q->queuedata;
+	struct request *rq = blk_mq_rq_from_pdu(cmd);
+	const bool write = op_is_write(req_op(rq));
+	struct loop_device *lo = rq->q->queuedata;
 	int ret = 0;
 
 	if (write && (lo->lo_flags & LO_FLAGS_READ_ONLY)) {
@@ -1745,12 +1749,12 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
 		goto failed;
 	}
 
-	ret = do_req_filebacked(lo, cmd->rq);
+	ret = do_req_filebacked(lo, rq);
 failed:
 	/* complete non-aio request */
 	if (!cmd->use_aio || ret) {
 		cmd->ret = ret ? -EIO : 0;
-		blk_mq_complete_request(cmd->rq);
+		blk_mq_complete_request(rq);
 	}
 }
 
@@ -1767,9 +1771,7 @@ static int loop_init_request(struct blk_mq_tag_set *set, struct request *rq,
 {
 	struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
 
-	cmd->rq = rq;
 	kthread_init_work(&cmd->work, loop_queue_work);
-
 	return 0;
 }
 
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 0f45416e4fcf..b78de9879f4f 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -66,7 +66,6 @@ struct loop_device {
 
 struct loop_cmd {
 	struct kthread_work work;
-	struct request *rq;
 	bool use_aio; /* use AIO interface to handle I/O */
 	atomic_t ref; /* only for aio */
 	long ret;
-- 
2.7.4
[PATCH 2/2] loop: handle short DIO reads
We ran into an issue with loop and btrfs, where btrfs would complain about
checksum errors. It turns out that is because we don't handle short reads
at all, we just zero fill the remainder. Worse than that, we don't handle
the filling properly, which results in loop trying to advance a single bio
by much more than its size, since it doesn't take chaining into account.

Handle short reads appropriately, by simply retrying at the new correct
offset. End the remainder of the request with EIO, if we get a 0 read.

Signed-off-by: Jens Axboe
---
 drivers/block/loop.c | 34 +++++++++++++++++++++++++++-------
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 8b2fde2109fc..5d4e31655d96 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -451,16 +451,36 @@ static int lo_req_flush(struct loop_device *lo, struct request *rq)
 static void lo_complete_rq(struct request *rq)
 {
 	struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
+	blk_status_t ret = BLK_STS_OK;
 
-	if (unlikely(req_op(rq) == REQ_OP_READ && cmd->use_aio &&
-		     cmd->ret >= 0 && cmd->ret < blk_rq_bytes(rq))) {
-		struct bio *bio = rq->bio;
-
-		bio_advance(bio, cmd->ret);
-		zero_fill_bio(bio);
+	if (!cmd->use_aio || cmd->ret < 0 || cmd->ret == blk_rq_bytes(rq) ||
+	    req_op(rq) != REQ_OP_READ) {
+		if (cmd->ret < 0)
+			ret = BLK_STS_IOERR;
+		goto end_io;
 	}
 
-	blk_mq_end_request(rq, cmd->ret < 0 ? BLK_STS_IOERR : BLK_STS_OK);
+	/*
+	 * Short READ - if we got some data, advance our request and
+	 * retry it. If we got no data, end the rest with EIO.
+	 */
+	if (cmd->ret) {
+		blk_update_request(rq, BLK_STS_OK, cmd->ret);
+		cmd->ret = 0;
+		blk_mq_requeue_request(rq, true);
+	} else {
+		if (cmd->use_aio) {
+			struct bio *bio = rq->bio;
+
+			while (bio) {
+				zero_fill_bio(bio);
+				bio = bio->bi_next;
+			}
+		}
+		ret = BLK_STS_IOERR;
+end_io:
+		blk_mq_end_request(rq, ret);
+	}
 }
 
 static void lo_rw_aio_do_completion(struct loop_cmd *cmd)
-- 
2.7.4
Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory
> I'll see if I can get our PCI SIG people to follow this through

Hi Jonathan,

Can you let me know if this moves forward within PCI-SIG? I would like to track it.

I can see this being doable between Root Ports that reside in the same Root Complex, but it might become more challenging to standardize for RPs that reside in different RCs in the same (potentially multi-socket) system. I know in the past we have seen MemWr TLPs cross the QPI bus in Intel systems, but I am sure that is not something that would work in all systems and must fall outside the remit of PCI-SIG ;-).

I agree such a capability bit would be very useful, but it's going to be quite some time before we can rely on hardware being available that supports it.

Stephen
[GIT PULL] Followup pull request for block
Hi Linus,

Followup fixes for this merge window. This pull request contains:

- Series from Ming, fixing corner cases in our CPU <-> queue mapping.
  This triggered repeated warnings especially on s390, but I also hit it
  in CPU hot plug/unplug testing while doing IO on NVMe on x86-64.

- Another fix from Ming, ensuring that we always order budget and driver
  tag identically, avoiding a deadlock on QD=1 devices.

- Loop locking regression fix from this merge window, from Omar.

- Another loop locking fix, this time a missing unlock, from Tetsuo Handa.

- Fix for racing IO submission with device removal, from Bart.

- sr reference fix from me, fixing a case where disk change or getevents
  can race with device removal.

- Set of nvme fixes by way of Keith, from various contributors.

Please pull!

  git://git.kernel.dk/linux-block.git tags/for-linus-20180413

Arnd Bergmann (1):
      nvme: target: fix buffer overflow

Bart Van Assche (1):
      blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash

Daniel Verkamp (1):
      nvmet: fix space padding in serial number

James Smart (1):
      nvme: expand nvmf_check_if_ready checks

Jens Axboe (1):
      sr: get/drop reference to device in revalidate and check_events

Johannes Thumshirn (2):
      nvme: unexport nvme_start_keep_alive
      nvme: don't send keep-alives to the discovery controller

Keith Busch (4):
      nvme-pci: Skip queue deletion if there are no queues
      nvme-pci: Remove unused queue parameter
      nvme-pci: Separate IO and admin queue IRQ vectors
      nvme: Use admin command effects for admin commands

Mathieu Malaterre (1):
      backing: silence compiler warning using __printf

Matias Bjørling (1):
      nvme: enforce 64bit offset for nvme_get_log_ext fn

Max Gurtovoy (1):
      nvme: check return value of init_srcu_struct function

Ming Lei (11):
      blk-mq: order getting budget and driver tag
      blk-mq: make sure that correct hctx->next_cpu is set
      blk-mq: don't keep offline CPUs mapped to hctx 0
      blk-mq: avoid to write intermediate result to hctx->next_cpu
      blk-mq: introduce blk_mq_hw_queue_first_cpu() to figure out first cpu
      blk-mq: remove blk_mq_delay_queue()
      blk-mq: don't check queue mapped in __blk_mq_delay_run_hw_queue()
      blk-mq: reimplement blk_mq_hw_queue_mapped
      blk-mq: remove code for dealing with remapping queue
      blk-mq: Revert "blk-mq: reimplement blk_mq_hw_queue_mapped"
      nvme-loop: fix kernel oops in case of unhandled command

Omar Sandoval (1):
      loop: fix LOOP_GET_STATUS lock imbalance

Rodrigo R. Galvao (1):
      nvmet: Fix nvmet_execute_write_zeroes sector count

Tetsuo Handa (1):
      block/loop: fix deadlock after loop_set_status

 block/blk-core.c                |  35 ++--
 block/blk-mq-cpumap.c           |   5 --
 block/blk-mq-debugfs.c          |   1 -
 block/blk-mq.c                  | 122 +++++++++-----
 drivers/block/loop.c            |  45 +++---
 drivers/nvme/host/core.c        |  33 ++--
 drivers/nvme/host/fabrics.c     |  83 ++++++++--
 drivers/nvme/host/fabrics.h     |  33 ++--
 drivers/nvme/host/fc.c          |  12 +--
 drivers/nvme/host/nvme.h        |   4 +-
 drivers/nvme/host/pci.c         |  35 ++++-
 drivers/nvme/host/rdma.c        |  14 +--
 drivers/nvme/target/admin-cmd.c |   1 +
 drivers/nvme/target/discovery.c |   2 +-
 drivers/nvme/target/io-cmd.c    |   4 +-
 drivers/nvme/target/loop.c      |  20 ++-
 drivers/scsi/sr.c               |  19 +--
 include/linux/backing-dev.h     |   1 +
 include/linux/blk-mq.h          |   2 -
 19 files changed, 245 insertions(+), 226 deletions(-)

-- Jens Axboe
loop: handle short DIO reads
We ran into an issue with loop and btrfs, where btrfs would complain about
checksum errors. It turns out that is because we don't handle short reads
at all, we just zero fill the remainder. Worse than that, we don't handle
the filling properly, which results in loop trying to advance a single bio
by much more than its size, since it doesn't take chaining into account.

Handle short reads appropriately, by simply retrying at the new correct
offset. End the remainder of the request with EIO, if we get a 0 read.

Signed-off-by: Jens Axboe
---

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c9d04497a415..66975c4fb6f1 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -451,16 +451,28 @@ static int lo_req_flush(struct loop_device *lo, struct request *rq)
 static void lo_complete_rq(struct request *rq)
 {
 	struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
+	blk_status_t ret = BLK_STS_OK;
 
-	if (unlikely(req_op(cmd->rq) == REQ_OP_READ && cmd->use_aio &&
-		     cmd->ret >= 0 && cmd->ret < blk_rq_bytes(cmd->rq))) {
-		struct bio *bio = cmd->rq->bio;
-
-		bio_advance(bio, cmd->ret);
-		zero_fill_bio(bio);
+	if (!cmd->use_aio || cmd->ret < 0 || cmd->ret == blk_rq_bytes(rq) ||
+	    req_op(cmd->rq) != REQ_OP_READ) {
+		if (cmd->ret < 0)
+			ret = BLK_STS_IOERR;
+		goto end_io;
 	}
 
-	blk_mq_end_request(rq, cmd->ret < 0 ? BLK_STS_IOERR : BLK_STS_OK);
+	/*
+	 * Short READ - if we got some data, advance our request and
+	 * retry it. If we got no data, end the rest with EIO.
+	 */
+	if (cmd->ret) {
+		blk_update_request(cmd->rq, BLK_STS_OK, cmd->ret);
+		cmd->ret = 0;
+		blk_mq_requeue_request(cmd->rq, true);
+	} else {
+		ret = BLK_STS_IOERR;
+end_io:
+		blk_mq_end_request(rq, ret);
+	}
 }
 
 static void lo_rw_aio_do_completion(struct loop_cmd *cmd)

-- Jens Axboe
Re: [PATCH] loop: fix aio/dio end of request clearing
On 4/12/18 12:52 PM, Jens Axboe wrote:
> If we read more than the user asked for, we zero fill the last
> part. But the current code assumes that the request has just
> one bio, and since that's not guaranteed to be true, we can
> run into a situation where we attempt to advance a bio by
> a bigger amount than its size.
>
> Handle the zero filling appropriately, regardless of what bio
> ends up needing to be cleared.
>
> Signed-off-by: Jens Axboe
>
> ---
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index c9d04497a415..d3aa96a2f369 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -455,9 +455,22 @@ static void lo_complete_rq(struct request *rq)
>  	if (unlikely(req_op(cmd->rq) == REQ_OP_READ && cmd->use_aio &&
>  		     cmd->ret >= 0 && cmd->ret < blk_rq_bytes(cmd->rq))) {
>  		struct bio *bio = cmd->rq->bio;
> +		long left = cmd->ret;
>  
> -		bio_advance(bio, cmd->ret);
> -		zero_fill_bio(bio);
> +		while (left >= bio->bi_iter.bi_size) {
> +			left -= bio->bi_iter.bi_size;
> +			bio = bio->bi_next;
> +			if (WARN_ON_ONCE(!bio))
> +				break;
> +		}
> +		while (bio) {
> +			if (left) {
> +				bio_advance(bio, left);
> +				left = 0;
> +			}
> +			zero_fill_bio(bio);
> +			bio = bio->bi_next;
> +		}
>  	}

Ignore this one, new version coming.

-- Jens Axboe
Re: [PATCH v2 0/2] tracing/events: block: bring more on a par with blktrace
On Fri, 13 Apr 2018 18:36:20 +0200, Steffen Maier wrote:

> I had the need to understand I/O request processing in detail.
> But I also had the need to enrich block traces with other trace events
> including my own dynamic kprobe events. So I preferred block trace events
> over blktrace to get everything nicely sorted into one ftrace output.
> However, I missed device filtering for (un)plug events and also
> the difference between the two flavors of unplug.
>
> The first two patches bring block trace events closer to their
> counterpart in blktrace tooling.
>
> The last patch is just an RFC. I still kept it in this patch set because
> it is inspired by PATCH 2/2.
>
> Changes since v1:
> [PATCH v2 1/2]
> Use 0 and 1 instead of false and true for __print_symbolic table.
> Now "trace-cmd report" can decode it. [Steven Rostedt]
>
> Steffen Maier (3):
>   tracing/events: block: track and print if unplug was explicit or
>     schedule
>   tracing/events: block: dev_t via driver core for plug and unplug
>     events
>   tracing/events: block: also try to get dev_t via driver core for some
>     events
>
>  include/trace/events/block.h | 33 +++++++++++++++++++++++++-----
>  1 file changed, 28 insertions(+), 5 deletions(-)

From just the tracing point of view:

Reviewed-by: Steven Rostedt (VMware)

As for the race conditions in the block layer, I'll let someone else decide that.

-- Steve
Re: [PATCH v2] block: ratelimit pr_err on IO path
Jinpu,

[CC:ed the mpt3sas maintainers]

The ratelimit patch is just an attempt to treat the symptom, not the cause.

> Thanks for asking, we updated the mpt3sas driver which enables DIX support
> (prot_mask=0x7f), all disks are SATA SSDs, no DIF support.
> After reboot, the kernel reports IO errors from all the drives behind the
> HBA, seemingly for almost every read IO, which renders the system unusable:
> [   13.079375] sda: ref tag error at location 0 (rcvd 143196159)
> [   13.079989] sda: ref tag error at location 937702912 (rcvd 143196159)
> [   13.080233] sda: ref tag error at location 937703072 (rcvd 143196159)
> [   13.080407] sda: ref tag error at location 0 (rcvd 143196159)
> [   13.080594] sda: ref tag error at location 8 (rcvd 143196159)

That sounds like a bug in the mpt3sas driver or firmware. I guess the HBA could conceivably be operating a SATA device as DIX Type 0 and strip the PI on the drive side. But that doesn't seem to be a particularly useful mode of operation.

Jinpu: Which firmware are you running? Also, please send us the output of:

  sg_readcap -l /dev/sda
  sg_inq -x /dev/sda
  sg_vpd /dev/sda

Broadcom: How is DIX supposed to work for SATA drives behind an mpt3sas controller?

-- Martin K. Petersen, Oracle Linux Engineering
Re: [PATCH] target: Fix Fortify_panic kernel exception
The patch looks fine, but in general I think descriptions of what you fixed in the code are more important than starting out with a backtrace. E.g. please explain what was wrong, how you fixed it, and only after that mention how it was caught (preferably without the whole trace).
[PATCH v2 2/2] tracing/events: block: dev_t via driver core for plug and unplug events
Complements v2.6.31 commit 55782138e47d ("tracing/events: convert block
trace points to TRACE_EVENT()") to be equivalent to traditional blktrace
output. Also, this allows event filtering so one does not always get all
(un)plug events.

NB: The NULL pointer check for q->kobj.parent is certainly racy and I
don't have enough experience to say if it's good enough for a trace
event. The change did work for my cases (block device read/write I/O on
zfcp-attached SCSI disks and dm-mpath on top).

While I haven't seen any prior art using driver core (parent) relations
for trace events, there are other cases using this when no direct
pointer exists between objects, such as:

#define to_scsi_target(d)	container_of(d, struct scsi_target, dev)
static inline struct scsi_target *scsi_target(struct scsi_device *sdev)
{
	return to_scsi_target(sdev->sdev_gendev.parent);
}

This is the object model we make use of here:

struct gendisk {
    struct hd_struct {
        struct device {              /* container_of */
            struct kobject kobj; <-------------------+
            dev_t devt;              /* deref */     |
        } __dev;                                     |
    } part0;                                         |
    struct request_queue *queue; ...+                |
}                                   :                |
                                    :                |
struct request_queue {  <...........+                |
    /* queue kobject */                              |
    struct kobject {                                 |
        struct kobject *parent; ---------------------+
    } kobj;
}

The parent pointer comes from:

#define disk_to_dev(disk)	(&(disk)->part0.__dev)

int blk_register_queue(struct gendisk *disk)
	struct device *dev = disk_to_dev(disk);
	struct request_queue *q = disk->queue;
	ret = kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue");
	                                         ^^^ parent

$ ls -d /sys/block/sdf/queue
/sys/block/sdf/queue
$ cat /sys/block/sdf/dev
8:80

A partition does not have its own request queue:

$ cat /sys/block/sdf/sdf1/dev
8:81
$ ls -d /sys/block/sdf/sdf1/queue
ls: cannot access '/sys/block/sdf/sdf1/queue': No such file or directory

The difference to blktrace parsed output is that block events don't use
the partition's minor number but the containing block device's minor
number:

$ dd if=/dev/sdf1 count=1
$ cat /sys/kernel/debug/tracing/trace
  block_bio_remap: 8,80 R 2048 + 32 <- (8,81) 0
  block_bio_queue: 8,80 R 2048 + 32 [dd]
  block_getrq: 8,80 R 2048 + 32 [dd]
  block_plug: 8,80 [dd]
  block_rq_insert: 8,80 R 16384 () 2048 + 32 [dd]
  block_unplug: 8,80 [dd] 1 explicit
  block_rq_issue: 8,80 R 16384 () 2048 + 32 [dd]
  block_rq_complete: 8,80 R () 2048 + 32 [0]

$ btrace /dev/sdf1
  8,80  11 0.0          240240  A   R 2048 + 32 <- (8,81) 0
  8,81  12 0.000220890  240240  Q   R 2048 + 32 [dd]
  8,81  13 0.000229639  240240  G   R 2048 + 32 [dd]
  8,81  14 0.000231805  240240  P   N [dd]
                                ^^
  8,81  15 0.000234671  240240  I   R 2048 + 32 [dd]
  8,81  16 0.000236365  240240  U   N [dd] 1
                                ^^
  8,81  17 0.000238527  240240  D   R 2048 + 32 [dd]
  8,81  22 0.000613741       0  C   R 2048 + 32 [0]

Signed-off-by: Steffen Maier
---
 include/trace/events/block.h | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index e90bb6eb8097..6ea5a3899c2e 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -460,14 +460,18 @@ TRACE_EVENT(block_plug,
 	TP_ARGS(q),
 
 	TP_STRUCT__entry(
+		__field( dev_t,	dev				)
 		__array( char,		comm,	TASK_COMM_LEN	)
 	),
 
 	TP_fast_assign(
+		__entry->dev = q->kobj.parent ?
+			container_of(q->kobj.parent, struct device, kobj)->devt : 0;
 		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
 	),
 
-	TP_printk("[%s]", __entry->comm)
+	TP_printk("%d,%d [%s]",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->comm)
 );
 
 #define show_block_unplug_explicit(val)			\
@@ -482,18 +486,23 @@ DECLARE_EVENT_CLASS(block_unplug,
 	TP_ARGS(q, depth, explicit),
 
 	TP_STRUCT__entry(
+		__field( dev_t,	dev				)
 		__field( int,		nr_rq			)
 		__field( bool,		explicit		)
 		__array( char,		comm,	TASK_COMM_LEN	)
 	),
 
 	TP_fast_assign(
+		__entry->dev = q->kobj.parent ?
+			container_of(q->kobj.parent, struct device, kobj)->devt : 0;
 		__entry->nr_rq = depth;
 		__entry->explicit = explicit;
 		memcpy(__entry->comm,
[PATCH v2 1/2] tracing/events: block: track and print if unplug was explicit or schedule
Just like blktrace distinguishes explicit and schedule by means of
BLK_TA_UNPLUG_IO and BLK_TA_UNPLUG_TIMER, actually make use of the
existing argument "explicit" to distinguish the two cases in the one
common tracepoint block_unplug.

Complements v2.6.39 commit 49cac01e1fa7 ("block: make unplug timer trace
event correspond to the schedule() unplug") and commit d9c978331790
("block: remove block_unplug_timer() trace point").

Signed-off-by: Steffen Maier
---
Changes since v1:
Use 0 and 1 instead of false and true for __print_symbolic table.
Now "trace-cmd report" can decode it. [Steven Rostedt]

 include/trace/events/block.h | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 81b43f5bdf23..e90bb6eb8097 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -470,6 +470,11 @@ TRACE_EVENT(block_plug,
 	TP_printk("[%s]", __entry->comm)
 );
 
+#define show_block_unplug_explicit(val)			\
+	__print_symbolic(val,				\
+			 {0, "schedule"},		\
+			 {1, "explicit"})
+
 DECLARE_EVENT_CLASS(block_unplug,
 
 	TP_PROTO(struct request_queue *q, unsigned int depth, bool explicit),
@@ -478,15 +483,18 @@ DECLARE_EVENT_CLASS(block_unplug,
 
 	TP_STRUCT__entry(
 		__field( int,		nr_rq			)
+		__field( bool,		explicit		)
 		__array( char,		comm,	TASK_COMM_LEN	)
 	),
 
 	TP_fast_assign(
 		__entry->nr_rq = depth;
+		__entry->explicit = explicit;
 		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
 	),
 
-	TP_printk("[%s] %d", __entry->comm, __entry->nr_rq)
+	TP_printk("[%s] %d %s", __entry->comm, __entry->nr_rq,
+		  show_block_unplug_explicit(__entry->explicit))
 );
 
 /**
-- 
2.13.5
[PATCH v2 0/2] tracing/events: block: bring more on a par with blktrace
I had the need to understand I/O request processing in detail.
But I also had the need to enrich block traces with other trace events
including my own dynamic kprobe events. So I preferred block trace events
over blktrace to get everything nicely sorted into one ftrace output.
However, I missed device filtering for (un)plug events and also
the difference between the two flavors of unplug.

The first two patches bring block trace events closer to their
counterpart in blktrace tooling.

The last patch is just an RFC. I still kept it in this patch set because
it is inspired by PATCH 2/2.

Changes since v1:
[PATCH v2 1/2]
Use 0 and 1 instead of false and true for __print_symbolic table.
Now "trace-cmd report" can decode it. [Steven Rostedt]

Steffen Maier (3):
  tracing/events: block: track and print if unplug was explicit or
    schedule
  tracing/events: block: dev_t via driver core for plug and unplug
    events
  tracing/events: block: also try to get dev_t via driver core for some
    events

 include/trace/events/block.h | 33 +++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

-- 
2.13.5
[RFC v2] tracing/events: block: also try to get dev_t via driver core for some events
Complements v2.6.31 commit 55782138e47d ("tracing/events: convert block
trace points to TRACE_EVENT()") for cases where rq->rq_disk == NULL:
block_rq_requeue, block_rq_insert, block_rq_issue; and for cases where
bio == NULL: block_getrq, block_sleeprq.

NB: The NULL pointer check for q->kobj.parent is certainly racy and I
don't have enough experience to say if it's good enough for a trace
event.

Since I don't know when the above cases would occur, I'm not sure this
is worth it.

Signed-off-by: Steffen Maier
---
 include/trace/events/block.h | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 6ea5a3899c2e..001e4f369df9 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -86,7 +86,9 @@ TRACE_EVENT(block_rq_requeue,
 	),
 
 	TP_fast_assign(
-		__entry->dev	   = rq->rq_disk ? disk_devt(rq->rq_disk) : 0;
+		__entry->dev	   = rq->rq_disk ? disk_devt(rq->rq_disk) :
+			(q->kobj.parent ?
+			 container_of(q->kobj.parent, struct device, kobj)->devt : 0);
 		__entry->sector    = blk_rq_trace_sector(rq);
 		__entry->nr_sector = blk_rq_trace_nr_sectors(rq);
 
@@ -162,7 +164,9 @@ DECLARE_EVENT_CLASS(block_rq,
 	),
 
 	TP_fast_assign(
-		__entry->dev	   = rq->rq_disk ? disk_devt(rq->rq_disk) : 0;
+		__entry->dev	   = rq->rq_disk ? disk_devt(rq->rq_disk) :
+			(q->kobj.parent ?
+			 container_of(q->kobj.parent, struct device, kobj)->devt : 0);
 		__entry->sector    = blk_rq_trace_sector(rq);
 		__entry->nr_sector = blk_rq_trace_nr_sectors(rq);
 		__entry->bytes     = blk_rq_bytes(rq);
@@ -397,7 +401,9 @@ DECLARE_EVENT_CLASS(block_get_rq,
 	),
 
 	TP_fast_assign(
-		__entry->dev		= bio ? bio_dev(bio) : 0;
+		__entry->dev		= bio ? bio_dev(bio) :
+			(q->kobj.parent ?
+			 container_of(q->kobj.parent, struct device, kobj)->devt : 0);
 		__entry->sector		= bio ? bio->bi_iter.bi_sector : 0;
 		__entry->nr_sector	= bio ? bio_sectors(bio) : 0;
 		blk_fill_rwbs(__entry->rwbs,
-- 
2.13.5
Re: [PATCH 1/3] nbd: del_gendisk after we cleanup the queue
On Fri, 2018-04-13 at 12:21 -0400, Josef Bacik wrote: > On Fri, Apr 13, 2018 at 04:16:18PM +, Bart Van Assche wrote: > > On Fri, 2018-04-13 at 12:03 -0400, Josef Bacik wrote: > > > This fixes a use after free bug, we need to do the del_gendisk after we > > > cleanup the queue on the device. > > > > Hello Josef, > > > > Which use-after-free bug does this patch fix? Are you aware that most block > > drivers call blk_cleanup_queue() before calling del_gendisk()? Why do you > > think that the nbd driver should use the opposite order? > > I'm confused, that's what this patch does. Before I had del_gendisk() first > and > then the blk_cleanup_queue(), which was bugging out when I was testing stuff > with a null pointer deref whenever I rmmod'ed the nbd. Swapping it to the way > everybody else did it fixed the problem. Thanks, Hello Josef, Oops, I swapped "blk_cleanup_queue()" and "del_gendisk()" in my e-mail. Can you share the call stack of the NULL pointer deref and also the translation of the crash address into a source code line? Thanks, Bart.
Re: [PATCH 1/3] nbd: del_gendisk after we cleanup the queue
On Fri, Apr 13, 2018 at 04:16:18PM +, Bart Van Assche wrote: > On Fri, 2018-04-13 at 12:03 -0400, Josef Bacik wrote: > > This fixes a use after free bug, we need to do the del_gendisk after we > > cleanup the queue on the device. > > Hello Josef, > > Which use-after-free bug does this patch fix? Are you aware that most block > drivers call blk_cleanup_queue() before calling del_gendisk()? Why do you > think that the nbd driver should use the opposite order? > I'm confused, that's what this patch does. Before I had del_gendisk() first and then the blk_cleanup_queue(), which was bugging out when I was testing stuff with a null pointer deref whenever I rmmod'ed the nbd. Swapping it to the way everybody else did it fixed the problem. Thanks, Josef
Re: [PATCH 1/3] nbd: del_gendisk after we cleanup the queue
On Fri, 2018-04-13 at 12:03 -0400, Josef Bacik wrote: > This fixes a use after free bug, we need to do the del_gendisk after we > cleanup the queue on the device. Hello Josef, Which use-after-free bug does this patch fix? Are you aware that most block drivers call blk_cleanup_queue() before calling del_gendisk()? Why do you think that the nbd driver should use the opposite order? Thanks, Bart.
[PATCH 3/3] nbd: use bd_set_size when updating disk size
From: Josef Bacik

When we stopped relying on the bdev everywhere I broke updating the block device size on the fly, which ceph relies on. We can't just do set_capacity, we also have to do bd_set_size so things like parted will notice the device size change.

Fixes: 29eaadc ("nbd: stop using the bdev everywhere") cc: sta...@vger.kernel.org Signed-off-by: Josef Bacik --- drivers/block/nbd.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c index 1520383b12f6..61dd95aa47fa 100644 --- a/drivers/block/nbd.c +++ b/drivers/block/nbd.c @@ -231,9 +231,15 @@ static void nbd_size_clear(struct nbd_device *nbd) static void nbd_size_update(struct nbd_device *nbd) { struct nbd_config *config = nbd->config; + struct block_device *bdev = bdget_disk(nbd->disk, 0); + blk_queue_logical_block_size(nbd->disk->queue, config->blksize); blk_queue_physical_block_size(nbd->disk->queue, config->blksize); set_capacity(nbd->disk, config->bytesize >> 9); + if (bdev) { + bd_set_size(bdev, config->bytesize); + bdput(bdev); + } kobject_uevent(&nbd_to_dev(nbd)->kobj, KOBJ_CHANGE); } @@ -,7 +1117,6 @@ static int nbd_start_device_ioctl(struct nbd_device *nbd, struct block_device *b if (ret) return ret; - bd_set_size(bdev, config->bytesize); if (max_part) bdev->bd_invalidated = 1; mutex_unlock(&nbd->config_lock); -- 2.14.3
[PATCH 1/3] nbd: del_gendisk after we cleanup the queue
From: Josef Bacik

This fixes a use after free bug; we need to do the del_gendisk after we cleanup the queue on the device.

Fixes: c6a4759ea0c9 ("nbd: add device refcounting") cc: sta...@vger.kernel.org Signed-off-by: Josef Bacik --- drivers/block/nbd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c index 86258b00a1d4..e33da3e6aa20 100644 --- a/drivers/block/nbd.c +++ b/drivers/block/nbd.c @@ -174,8 +174,8 @@ static void nbd_dev_remove(struct nbd_device *nbd) { struct gendisk *disk = nbd->disk; if (disk) { - del_gendisk(disk); blk_cleanup_queue(disk->queue); + del_gendisk(disk); blk_mq_free_tag_set(&nbd->tag_set); disk->private_data = NULL; put_disk(disk); -- 2.14.3
[PATCH 2/3] nbd: update size when connected
From: Josef Bacik

I messed up changing the size of an NBD device while it was connected by not actually updating the device or doing the uevent. Fix this by updating everything if we're connected and we change the size.

cc: sta...@vger.kernel.org Fixes: 639812a ("nbd: don't set the device size until we're connected") Signed-off-by: Josef Bacik --- drivers/block/nbd.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c index e33da3e6aa20..1520383b12f6 100644 --- a/drivers/block/nbd.c +++ b/drivers/block/nbd.c @@ -243,6 +243,8 @@ static void nbd_size_set(struct nbd_device *nbd, loff_t blocksize, struct nbd_config *config = nbd->config; config->blksize = blocksize; config->bytesize = blocksize * nr_blocks; + if (nbd->task_recv != NULL) + nbd_size_update(nbd); } static void nbd_complete_rq(struct request *req) -- 2.14.3
Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER
On Thu, Apr 12, 2018 at 06:57:12AM -0700, Tejun Heo wrote: > On Thu, Apr 12, 2018 at 07:05:13AM +0800, Ming Lei wrote: > > > Not really because aborted_gstate right now doesn't have any memory > > > barrier around it, so nothing ensures blk_add_timer() actually appears > > > before. We can either add the matching barriers in aborted_gstate > > > update and when it's read in the normal completion path, or we can > > > wait for the update to be visible everywhere by waiting for rcu grace > > > period (because the reader is rcu protected). > > > > Seems not necessary. > > > > Suppose it is out of order, the only side-effect is that the new > > recycled request is timed out a bit late, I think that is what > > we can survive, right? > > It at least can mess up the timeout duration for the next recycle > instance because there can be two competing blk_add_timer() instances. > I'm not sure whether there can be other consequences. When ownership > isn't clear, it becomes really difficult to reason about these things > and can lead to subtle failures. I think it'd be best to always > establish who owns what.

Please see the code of blk_add_timer() for blk-mq:

	blk_rq_set_deadline(req, jiffies + req->timeout);
	req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;

	if (!timer_pending(&q->timeout) ||
	    time_before(expiry, q->timeout.expires))
		mod_timer(&q->timeout, expiry);

If this rq is recycled, blk_add_timer() only touches rq->deadline and the EXPIRED flag, and the only effect is that the timeout may be handled a bit late, but the timeout monitor won't be lost. And this thing shouldn't be difficult to avoid: as you mentioned, synchronize_rcu() can be added between blk_add_timer() and resetting aborted gstate for avoiding it.

thanks, Ming
Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller
Hello, Bart. On Thu, Apr 12, 2018 at 10:40:52PM +, Bart Van Assche wrote: > Is something like the patch below perhaps what you had in mind? Yeah, exactly. It'd be really great to have the lockdep asserts add to the right places. Thanks. -- tejun
Re: [PATCH] nbd: do not update block size if file system is mounted
On Fri, 16 Mar 2018 09:36:45 -0700, Jens Axboe wrote: > On 3/15/18 3:00 AM, Michael Tretter wrote: > > On Wed, 28 Feb 2018 10:54:56 +0100, Michael Tretter wrote: > >> If a file system is mounted on the nbd during a disconnect, resetting > >> the size to 0 might change the block size and destroy the buffer_head > >> mappings. This will cause an infinite loop when the file system looks for > >> the buffer_heads for flushing. > >> > >> Only set the file size to 0 if we are the only opener of the nbd. > >> > >> Signed-off-by: Michael Tretter> >> --- > >> drivers/block/nbd.c | 3 ++- > >> 1 file changed, 2 insertions(+), 1 deletion(-) > >> > >> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c > >> index 5f2a4240a204..0d1dbb60430b 100644 > >> --- a/drivers/block/nbd.c > >> +++ b/drivers/block/nbd.c > >> @@ -1118,7 +1118,8 @@ static int nbd_start_device_ioctl(struct nbd_device > >> *nbd, struct block_device *b > >>	if (ret) > >>		sock_shutdown(nbd); > >>	mutex_lock(&nbd->config_lock); > >> -	bd_set_size(bdev, 0); > >> +	if (bdev->bd_openers <= 1) > >> +		bd_set_size(bdev, 0); > >>	/* user requested, ignore socket errors */ > >>	if (test_bit(NBD_DISCONNECT_REQUESTED, &config->runtime_flags)) > >>		ret = 0; > > > > Are there any comments on this patch? > > Josef?

Maybe some history helps for reviewing the patch: Commit abbbdf12497d ("nbd: replace kill_bdev() with __invalidate_device()") already fixed this issue once by changing nbd_size_clear() to only set the size to 0 if the device is only opened once. The line was moved several times during the rework for the netlink interface and the check for the number of openers was dropped, reintroducing the issue.

Michael
Re: [PATCH 1/2] tracing/events: block: track and print if unplug was explicit or schedule
On Fri, 13 Apr 2018 15:07:17 +0200 Steffen Maier wrote: > Just like blktrace distinguishes explicit and schedule by means of > BLK_TA_UNPLUG_IO and BLK_TA_UNPLUG_TIMER, actually make use of the > existing argument "explicit" to distinguish the two cases in the one > common tracepoint block_unplug. > > Complements v2.6.39 commit 49cac01e1fa7 ("block: make unplug timer trace > event correspond to the schedule() unplug") and commit d9c978331790 > ("block: remove block_unplug_timer() trace point"). > > Signed-off-by: Steffen Maier > --- > include/trace/events/block.h | 10 +- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/include/trace/events/block.h b/include/trace/events/block.h > index 81b43f5bdf23..a13613d27cee 100644 > --- a/include/trace/events/block.h > +++ b/include/trace/events/block.h > @@ -470,6 +470,11 @@ TRACE_EVENT(block_plug, > TP_printk("[%s]", __entry->comm) > ); > > +#define show_block_unplug_explicit(val) \ > + __print_symbolic(val, \ > + {false, "schedule"}, \ > + {true, "explicit"}) That's new. I haven't seen "true"/"false" values used for print_symbolic before. But could you please use 1 and 0 instead, because perf and trace-cmd won't be able to parse that. I could update libtraceevent to handle it, but really, the first parameter is supposed to be numeric. -- Steve > + > DECLARE_EVENT_CLASS(block_unplug, > > TP_PROTO(struct request_queue *q, unsigned int depth, bool explicit), > @@ -478,15 +483,18 @@ DECLARE_EVENT_CLASS(block_unplug, > > TP_STRUCT__entry( > __field( int, nr_rq ) > + __field( bool, explicit) > __array( char, comm, TASK_COMM_LEN ) > ), > > TP_fast_assign( > __entry->nr_rq = depth; > + __entry->explicit = explicit; > memcpy(__entry->comm, current->comm, TASK_COMM_LEN); > ), > > - TP_printk("[%s] %d", __entry->comm, __entry->nr_rq) > + TP_printk("[%s] %d %s", __entry->comm, __entry->nr_rq, > + show_block_unplug_explicit(__entry->explicit)) > ); > > /**
[PATCH 2/2] tracing/events: block: dev_t via driver core for plug and unplug events
Complements v2.6.31 commit 55782138e47d ("tracing/events: convert block trace points to TRACE_EVENT()") to be equivalent to traditional blktrace output. Also this allows event filtering to not always get all (un)plug events.

NB: The NULL pointer check for q->kobj.parent is certainly racy and I don't have enough experience if it's good enough for a trace event. The change did work for my cases (block device read/write I/O on zfcp-attached SCSI disks and dm-mpath on top).

While I haven't seen any prior art using driver core (parent) relations for trace events, there are other cases using this when no direct pointer exists between objects, such as:

#define to_scsi_target(d)	container_of(d, struct scsi_target, dev)
static inline struct scsi_target *scsi_target(struct scsi_device *sdev)
{
	return to_scsi_target(sdev->sdev_gendev.parent);
}

This is the object model we make use of here:

struct gendisk {
    struct hd_struct {
        struct device {            /* container_of */
            struct kobject kobj; <------+
            dev_t devt;   /* deref */   |
        } __dev;                        |
    } part0;                            |
    struct request_queue *queue; ...+   |
}                                   :   |
                                    :   |
struct request_queue {  <...........+   |
    /* queue kobject */                 |
    struct kobject {                    |
        struct kobject *parent; --------+
    } kobj;
}

The parent pointer comes from:

#define disk_to_dev(disk)	(&(disk)->part0.__dev)

int blk_register_queue(struct gendisk *disk)
	struct device *dev = disk_to_dev(disk);
	struct request_queue *q = disk->queue;
	ret = kobject_add(&q->kobj, kobject_get(&dev->kobj), "%s", "queue");
	                                        ^^^parent

$ ls -d /sys/block/sdf/queue /sys/block/sda/queue
$ cat /sys/block/sdf/dev
8:80

A partition does not have its own request queue:

$ cat /sys/block/sdf/sdf1/dev
8:81
$ ls -d /sys/block/sdf/sdf1/queue
ls: cannot access '/sys/block/sdf/sdf1/queue': No such file or directory

The difference to blktrace parsed output is that block events don't use the partition's minor number but the containing block device's minor number:

$ dd if=/dev/sdf1 count=1
$ cat /sys/kernel/debug/tracing/trace
block_bio_remap: 8,80 R 2048 + 32 <- (8,81) 0
block_bio_queue: 8,80 R 2048 + 32 [dd]
block_getrq: 8,80 R 2048 + 32 [dd]
block_plug: 8,80 [dd]
block_rq_insert: 8,80 R 16384 () 2048 + 32 [dd]
block_unplug: 8,80 [dd] 1 explicit
block_rq_issue: 8,80 R 16384 () 2048 + 32 [dd]
block_rq_complete: 8,80 R () 2048 + 32 [0]

$ btrace /dev/sdf1
8,80 11 0.0 240240 A R 2048 + 32 <- (8,81) 0
8,81 12 0.000220890 240240 Q R 2048 + 32 [dd]
8,81 13 0.000229639 240240 G R 2048 + 32 [dd]
8,81 14 0.000231805 240240 P N [dd]
                           ^^
8,81 15 0.000234671 240240 I R 2048 + 32 [dd]
8,81 16 0.000236365 240240 U N [dd] 1
                           ^^
8,81 17 0.000238527 240240 D R 2048 + 32 [dd]
8,81 22 0.000613741 0 C R 2048 + 32 [0]

Signed-off-by: Steffen Maier --- include/trace/events/block.h | 13 +++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/include/trace/events/block.h b/include/trace/events/block.h index a13613d27cee..cffedc26e8a3 100644 --- a/include/trace/events/block.h +++ b/include/trace/events/block.h @@ -460,14 +460,18 @@ TRACE_EVENT(block_plug, TP_ARGS(q), TP_STRUCT__entry( + __field( dev_t, dev ) __array( char, comm, TASK_COMM_LEN ) ), TP_fast_assign( + __entry->dev = q->kobj.parent ? + container_of(q->kobj.parent, struct device, kobj)->devt : 0; memcpy(__entry->comm, current->comm, TASK_COMM_LEN); ), - TP_printk("[%s]", __entry->comm) + TP_printk("%d,%d [%s]", + MAJOR(__entry->dev), MINOR(__entry->dev), __entry->comm) ); #define show_block_unplug_explicit(val)\ @@ -482,18 +486,23 @@ DECLARE_EVENT_CLASS(block_unplug, TP_ARGS(q, depth, explicit), TP_STRUCT__entry( + __field( dev_t, dev ) __field( int, nr_rq ) __field( bool, explicit) __array( char, comm, TASK_COMM_LEN ) ), TP_fast_assign( + __entry->dev = q->kobj.parent ? + container_of(q->kobj.parent, struct device, kobj)->devt : 0; __entry->nr_rq = depth; __entry->explicit = explicit; memcpy(__entry->comm,
[PATCH 1/2] tracing/events: block: track and print if unplug was explicit or schedule
Just like blktrace distinguishes explicit and schedule by means of BLK_TA_UNPLUG_IO and BLK_TA_UNPLUG_TIMER, actually make use of the existing argument "explicit" to distinguish the two cases in the one common tracepoint block_unplug. Complements v2.6.39 commit 49cac01e1fa7 ("block: make unplug timer trace event correspond to the schedule() unplug") and commit d9c978331790 ("block: remove block_unplug_timer() trace point"). Signed-off-by: Steffen Maier--- include/trace/events/block.h | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/include/trace/events/block.h b/include/trace/events/block.h index 81b43f5bdf23..a13613d27cee 100644 --- a/include/trace/events/block.h +++ b/include/trace/events/block.h @@ -470,6 +470,11 @@ TRACE_EVENT(block_plug, TP_printk("[%s]", __entry->comm) ); +#define show_block_unplug_explicit(val)\ + __print_symbolic(val, \ +{false, "schedule"}, \ +{true, "explicit"}) + DECLARE_EVENT_CLASS(block_unplug, TP_PROTO(struct request_queue *q, unsigned int depth, bool explicit), @@ -478,15 +483,18 @@ DECLARE_EVENT_CLASS(block_unplug, TP_STRUCT__entry( __field( int, nr_rq ) + __field( bool, explicit) __array( char, comm, TASK_COMM_LEN ) ), TP_fast_assign( __entry->nr_rq = depth; + __entry->explicit = explicit; memcpy(__entry->comm, current->comm, TASK_COMM_LEN); ), - TP_printk("[%s] %d", __entry->comm, __entry->nr_rq) + TP_printk("[%s] %d %s", __entry->comm, __entry->nr_rq, + show_block_unplug_explicit(__entry->explicit)) ); /** -- 2.13.5
[PATCH 0/2] tracing/events: block: bring more on a par with blktrace
I had the need to understand I/O request processing in detail. But I also had the need to enrich block traces with other trace events including my own dynamic kprobe events. So I preferred block trace events over blktrace to get everything nicely sorted into one ftrace output. However, I missed device filtering for (un)plug events and also the difference between the two flavors of unplug. The first two patches bring block trace events closer to their counterpart in blktrace tooling. The last patch is just an RFC. I still kept it in this patch set because it is inspired by PATCH 2/2. Steffen Maier (3): tracing/events: block: track and print if unplug was explicit or schedule tracing/events: block: dev_t via driver core for plug and unplug events tracing/events: block: also try to get dev_t via driver core for some events include/trace/events/block.h | 33 - 1 file changed, 28 insertions(+), 5 deletions(-) -- 2.13.5
[RFC] tracing/events: block: also try to get dev_t via driver core for some events
Complements v2.6.31 commit 55782138e47d ("tracing/events: convert block trace points to TRACE_EVENT()") for cases where rq->rq_disk == NULL: block_rq_requeue, block_rq_insert, block_rq_issue; and for cases where bio == NULL: block_getrq, block_sleeprq. NB: The NULL pointer check for q->kobj.parent is certainly racy and I don't have enough experience if it's good enough for a trace event. Since I don't know when above cases would occur, I'm not sure this is worth it. Signed-off-by: Steffen Maier--- include/trace/events/block.h | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/include/trace/events/block.h b/include/trace/events/block.h index cffedc26e8a3..c476b7af404b 100644 --- a/include/trace/events/block.h +++ b/include/trace/events/block.h @@ -86,7 +86,9 @@ TRACE_EVENT(block_rq_requeue, ), TP_fast_assign( - __entry->dev = rq->rq_disk ? disk_devt(rq->rq_disk) : 0; + __entry->dev = rq->rq_disk ? disk_devt(rq->rq_disk) : + (q->kobj.parent ? +container_of(q->kobj.parent, struct device, kobj)->devt : 0); __entry->sector= blk_rq_trace_sector(rq); __entry->nr_sector = blk_rq_trace_nr_sectors(rq); @@ -162,7 +164,9 @@ DECLARE_EVENT_CLASS(block_rq, ), TP_fast_assign( - __entry->dev = rq->rq_disk ? disk_devt(rq->rq_disk) : 0; + __entry->dev = rq->rq_disk ? disk_devt(rq->rq_disk) : + (q->kobj.parent ? +container_of(q->kobj.parent, struct device, kobj)->devt : 0); __entry->sector= blk_rq_trace_sector(rq); __entry->nr_sector = blk_rq_trace_nr_sectors(rq); __entry->bytes = blk_rq_bytes(rq); @@ -397,7 +401,9 @@ DECLARE_EVENT_CLASS(block_get_rq, ), TP_fast_assign( - __entry->dev= bio ? bio_dev(bio) : 0; + __entry->dev= bio ? bio_dev(bio) : + (q->kobj.parent ? +container_of(q->kobj.parent, struct device, kobj)->devt : 0); __entry->sector = bio ? bio->bi_iter.bi_sector : 0; __entry->nr_sector = bio ? bio_sectors(bio) : 0; blk_fill_rwbs(__entry->rwbs, -- 2.13.5
Re: [PATCH 1/2] bcache: remove unncessary code in bch_btree_keys_init()
Hi Coly, >Function bch_btree_keys_init() initializes b->set[].size and >b->set[].data to zero. As the code comment indicates, this code is >indeed unnecessary, because both struct btree_keys and struct bset_tree are >nested embedded into struct btree; when struct btree is filled with 0 >bits by kzalloc() in mca_bucket_alloc(), b->set[].size and >b->set[].data are initialized to 0 (a.k.a NULL) already. > Eh, you need to pay attention to mca_alloc()==>mca_reap(), which gets btree node memory from the btree_cache_freeable or btree_cache_freed lists; that memory was used as btree nodes before, and may not be filled with 0. >This patch removes the redundant code, and add comments in >bch_btree_keys_init() and mca_bucket_alloc() to explain why it's safe. > >Signed-off-by: Coly Li>--- > drivers/md/bcache/bset.c | 13 ++--- > drivers/md/bcache/btree.c | 4 > 2 files changed, 10 insertions(+), 7 deletions(-) > >diff --git a/drivers/md/bcache/bset.c b/drivers/md/bcache/bset.c >index 579c696a5fe0..343f4e9428e0 100644 >--- a/drivers/md/bcache/bset.c >+++ b/drivers/md/bcache/bset.c >@@ -352,15 +352,14 @@ void bch_btree_keys_init(struct btree_keys *b, const >struct btree_keys_ops *ops, > b->nsets = 0; > b->last_set_unwritten = 0; > >-/* XXX: shouldn't be needed */ >-for (i = 0; i < MAX_BSETS; i++) >-b->set[i].size = 0; > /* >- * Second loop starts at 1 because b->keys[0]->data is the memory we >- * allocated >+ * struct btree_keys is embedded in struct btree, and struct >+ * bset_tree is embedded into struct btree_keys. They are all >+ * initialized as 0 by kzalloc() in mca_bucket_alloc(), and >+ * b->set[0].data is allocated in bch_btree_keys_alloc(), so we >+ * don't have to initialize b->set[].size and b->set[].data here >+ * any more. 
> */ >-for (i = 1; i < MAX_BSETS; i++) >-b->set[i].data = NULL; > } > EXPORT_SYMBOL(bch_btree_keys_init); > >diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c >index 17936b2dc7d6..344641e23415 100644 >--- a/drivers/md/bcache/btree.c >+++ b/drivers/md/bcache/btree.c >@@ -600,6 +600,10 @@ static void mca_data_alloc(struct btree *b, struct bkey >*k, gfp_t gfp) > static struct btree *mca_bucket_alloc(struct cache_set *c, > struct bkey *k, gfp_t gfp) > { >+/* >+ * kzalloc() is necessary here for initialization, >+ * see code comments in bch_btree_keys_init(). >+ */ > struct btree *b = kzalloc(sizeof(struct btree), gfp); > if (!b) > return NULL; >-- >2.16.2 Thanks. Tang Junhui
Re: [PATCH 1/4] bcache: finish incremental GC
Hi Coly, > Hi Junhui, > How about send out v2 patch and let me confirm it with a Reviewed-by ? Certainly it's OK. Thanks. Tang Junhui
Re: [PATCH 1/4] bcache: finish incremental GC
On 2018/4/13 4:37 PM, tang.jun...@zte.com.cn wrote: > Hi Coly, >>> >>> Hello Coly, >>> On 2018/4/12 2:38 PM, tang.jun...@zte.com.cn wrote: > From: Tang Junhui> > In GC thread, we record the latest GC key in gc_done, which is expected > to be used for incremental GC, but in the current code, we didn't realize > it. When GC runs, front side IO would be blocked until the GC is over, which > would be a long time if there are a lot of btree nodes. > > This patch realizes incremental GC; the main idea is that, when there > are front side I/Os, after GC of some nodes (100), we stop GC, release the lock > of the btree node, and go to process the front side I/Os for some time > (100 ms), then go back to GC again. > > By this patch, when we are doing GC, I/Os are not blocked all the time, and > there is no obvious I/O zero jump problem any more. > > Signed-off-by: Tang Junhui Hi Junhui, I reply my comments in line, >>> Thanks for your comments in advance. > --- > drivers/md/bcache/bcache.h | 5 + > drivers/md/bcache/btree.c | 15 ++- > drivers/md/bcache/request.c | 3 +++ > 3 files changed, 22 insertions(+), 1 deletion(-) > > diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h > index 843877e..ab4c9ca 100644 > --- a/drivers/md/bcache/bcache.h > +++ b/drivers/md/bcache/bcache.h > @@ -445,6 +445,7 @@ struct cache { > > struct gc_stat { > size_t nodes; > + size_t nodes_pre; > size_t key_bytes; > > size_t nkeys; > @@ -568,6 +569,10 @@ struct cache_set { > */ > atomic_t rescale; > /* > + * used for GC, identify if any front side I/Os is inflight > + */ > + atomic_t io_inflight; Could you please rename the above variable to something like search_inflight ? Just to be more explicit, considering we have many types of io requests. >>> OK, It looks better. 
> +/* > * When we invalidate buckets, we use both the priority and the amount > * of good data to determine which buckets to reuse first - to weight > * those together consistently we keep track of the smallest nonzero > diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c > index 81e8dc3..b36d276 100644 > --- a/drivers/md/bcache/btree.c > +++ b/drivers/md/bcache/btree.c > @@ -90,6 +90,9 @@ > > #define MAX_NEED_GC 64 > #define MAX_SAVE_PRIO 72 > +#define MIN_GC_NODES 100 > +#define GC_SLEEP_TIME 100 How about naming the above macro as GC_SLEEP_MS ? So people may understand the sleep time is 100ms without checking more context. >>> OK, It looks better. > + > > #define PTR_DIRTY_BIT (((uint64_t) 1 << 36)) > > @@ -1581,6 +1584,13 @@ static int btree_gc_recurse(struct btree *b, > struct btree_op *op, > memmove(r + 1, r, sizeof(r[0]) * (GC_MERGE_NODES - 1)); > r->b = NULL; > > + if (atomic_read(&b->c->io_inflight) && > + gc->nodes >= gc->nodes_pre + MIN_GC_NODES) { On 32bit machines, gc->nodes is a 32bit count, if it is overflowed the above check will be false for a very long time, and in some condition it might be always false after gc->nodes overflowed. How about adding the following check before the if() statement ? >>> /* in case gc->nodes is overflowed */ if (gc->nodes_pre < gc->nodes) gc->nodes_pre = gc->nodes; >>> I think 32bit is big enough, which can express a value of billions, >>> but the number of nodes is usually around tens of thousands. >>> Also gc->nodes and gc->nodes_pre were set to zero when GC begins each time. >>> >>> What do you think? >> >> Oh, I see. Then I don't worry here. The patch is OK to me. > Thanks Coly, Would you mind if I add you as a Reviewed-by in the v2 patch? Hi Junhui, How about sending out the v2 patch and letting me confirm it with a Reviewed-by ? Coly Li
Re: [PATCH v2] block: ratelimite pr_err on IO path
On Thu, Apr 12, 2018 at 11:20 PM, Martin K. Petersen wrote: > > Jack, > >> + pr_err_ratelimited("%s: ref tag error at >> location %llu (rcvd %u)\n", > > I'm a bit concerned about dropping records of potential data loss. > > Also, what are you doing that compels all these to be logged? This > should be a very rare occurrence. > > -- > Martin K. Petersen, Oracle Linux Engineering

Hi Martin,

Thanks for asking. We updated the mpt3sas driver, which enables DIX support (prot_mask=0x7f); all disks are SATA SSDs with no DIF support. After reboot, the kernel reports these IO errors from all the drives behind the HBA, seemingly for almost every read IO, which makes the system unusable:

[ 13.079375] sda: ref tag error at location 0 (rcvd 143196159)
[ 13.079989] sda: ref tag error at location 937702912 (rcvd 143196159)
[ 13.080233] sda: ref tag error at location 937703072 (rcvd 143196159)
[ 13.080407] sda: ref tag error at location 0 (rcvd 143196159)
[ 13.080594] sda: ref tag error at location 8 (rcvd 143196159)
[ 13.080996] sda: ref tag error at location 0 (rcvd 143196159)
[ 13.089878] sdb: ref tag error at location 0 (rcvd 143196159)
[ 13.090275] sdb: ref tag error at location 937702912 (rcvd 277413887)
[ 13.090448] sdb: ref tag error at location 937703072 (rcvd 143196159)
[ 13.090655] sdb: ref tag error at location 0 (rcvd 143196159)
[ 13.090823] sdb: ref tag error at location 8 (rcvd 277413887)
[ 13.091218] sdb: ref tag error at location 0 (rcvd 143196159)
[ 13.095412] sdc: ref tag error at location 0 (rcvd 143196159)
[ 13.095859] sdc: ref tag error at location 937702912 (rcvd 143196159)
[ 13.096058] sdc: ref tag error at location 937703072 (rcvd 143196159)
[ 13.096228] sdc: ref tag error at location 0 (rcvd 143196159)
[ 13.096445] sdc: ref tag error at location 8 (rcvd 143196159)
[ 13.096833] sdc: ref tag error at location 0 (rcvd 277413887)
[ 13.097187] sds: ref tag error at location 0 (rcvd 277413887)
[ 13.097707] sds: ref tag error at location 937702912 (rcvd 143196159)
[ 13.097855] sds: ref tag error at location 937703072 (rcvd 277413887)

Kernel versions 4.15 and 4.14.28; I scanned the commits upstream and haven't found anything relevant. In 4.4.112 there are no such errors. Disabling DIX support (prot_mask=0x7) in mpt3sas fixes the problem.

Regards, -- Jack Wang, Linux Kernel Developer, ProfitBricks GmbH
Re: [PATCH 1/4] bcache: finish incremental GC
Hi Coly, >> >> Hello Coly, >> >>> On 2018/4/12 2:38 PM, tang.jun...@zte.com.cn wrote: From: Tang Junhui

In GC thread, we record the latest GC key in gc_done, which is expected to be used for incremental GC, but in the current code, we didn't realize it. When GC runs, front side IO would be blocked until the GC is over, which would be a long time if there are a lot of btree nodes. This patch realizes incremental GC; the main idea is that, when there are front side I/Os, after GC of some nodes (100), we stop GC, release the lock of the btree node, and go to process the front side I/Os for some time (100 ms), then go back to GC again. By this patch, when we are doing GC, I/Os are not blocked all the time, and there is no obvious I/O zero jump problem any more. Signed-off-by: Tang Junhui >>> >>> Hi Junhui, >>> >>> I reply my comments in line, >> Thanks for your comments in advance. >>> --- drivers/md/bcache/bcache.h | 5 + drivers/md/bcache/btree.c | 15 ++- drivers/md/bcache/request.c | 3 +++ 3 files changed, 22 insertions(+), 1 deletion(-) diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index 843877e..ab4c9ca 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -445,6 +445,7 @@ struct cache { struct gc_stat { size_t nodes; + size_t nodes_pre; size_t key_bytes; size_t nkeys; @@ -568,6 +569,10 @@ struct cache_set { */ atomic_t rescale; /* + * used for GC, identify if any front side I/Os is inflight + */ + atomic_t io_inflight; >>> >>> Could you please rename the above variable to something like >>> search_inflight ? Just to be more explicit, considering we have many >>> types of io requests. >> OK, It looks better. 
>>> + /* * When we invalidate buckets, we use both the priority and the amount * of good data to determine which buckets to reuse first - to weight * those together consistently we keep track of the smallest nonzero diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 81e8dc3..b36d276 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -90,6 +90,9 @@ #define MAX_NEED_GC 64 #define MAX_SAVE_PRIO 72 +#define MIN_GC_NODES 100 +#define GC_SLEEP_TIME 100 >>> >>> How about naming the above macro as GC_SLEEP_MS ? So people may >>> understand the sleep time is 100ms without checking more context. >> OK, It looks better. + #define PTR_DIRTY_BIT (((uint64_t) 1 << 36)) @@ -1581,6 +1584,13 @@ static int btree_gc_recurse(struct btree *b, struct btree_op *op, memmove(r + 1, r, sizeof(r[0]) * (GC_MERGE_NODES - 1)); r->b = NULL; + if (atomic_read(&b->c->io_inflight) && + gc->nodes >= gc->nodes_pre + MIN_GC_NODES) { >>> On 32bit machines, gc->nodes is a 32bit count, if it is overflowed the >>> above check will be false for a very long time, and in some condition it >>> might be always false after gc->nodes overflowed. >>> How about adding the following check before the if() statement ? >> >>> /* in case gc->nodes is overflowed */ >>> if (gc->nodes_pre < gc->nodes) >>> gc->nodes_pre = gc->nodes; >>> >> I think 32bit is big enough, which can express a value of billions, >> but the number of nodes is usually around tens of thousands. >> Also gc->nodes and gc->nodes_pre were set to zero when GC begins each time. >> >> What do you think? > >Oh, I see. Then I don't worry here. The patch is OK to me. Thanks Coly, Would you mind if I add you as a Reviewed-by in the v2 patch? Thanks. Tang Junhui
Re: [PATCH v2] block: do not use interruptible wait anywhere
Hi Alan, On Thu, 2018-04-12 at 19:11 +0100, Alan Jenkins wrote: > # dd if=/dev/sda of=/dev/null iflag=direct & \ > while killall -SIGUSR1 dd; do sleep 0.1; done & \ > echo mem > /sys/power/state ; \ > sleep 5; killall dd # stop after 5 seconds Can you please also add a regression test to blktests[1] for this? [1] https://github.com/osandov/blktests Thanks, Johannes -- Johannes Thumshirn, Storage jthumsh...@suse.de +49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH 3/4] bcache: notify allocator to prepare for GC
Hi Coly,

Thanks for your comments. > Hi Junhui, > > This method is complicated IMHO.

The main idea of this patch is the following handshake:

GC thread:
  ==> triggered by sectors_to_gc, set ca->prepare_gc to GC_PREPARING
      to notify the allocator thread to prepare for GC
  ==> wait for the allocator thread to finish preparing

Allocator thread:
  ==> detect that ca->prepare_gc is GC_PREPARING, go to
      invalidate_buckets() and fill the free_inc queue with
      reclaimable buckets; after that, set ca->prepare_gc to
      GC_PREPARED to notify the GC thread that preparation is done

GC thread:
  ==> detect that ca->prepare_gc is GC_PREPARED, set ca->prepare_gc
      back to GC_PREPARE_NONE, and continue GC

> Why not in bch_allocator_thread() > simply call wake_up_gc(ca->set) after invalidate_buckets() ? In this patch, GC is triggered by sectors_to_gc while front-side I/Os keep arriving, so we notify the allocator to pack the free_inc queue full of buckets before GC; then we have enough buckets for front-side I/Os during the GC period. Maybe the code is somewhat redundant; I will try refactoring it in the next patch.

Thanks. Tang Junhui