Re: [dm-devel] bug: using smp_processor_id() in preemptible code in rr_select_path()
On 12:06 Fri 05 Aug , Mike Snitzer wrote: > On Fri, Aug 05 2016 at 11:54am -0400, > Jens Axboe <ax...@kernel.dk> wrote: > > > On 08/05/2016 09:42 AM, Mike Snitzer wrote: > > >On Fri, Aug 05 2016 at 11:33P -0400, > > >Jens Axboe <ax...@kernel.dk> wrote: > > > > > >>On 08/05/2016 09:27 AM, Mike Snitzer wrote: > > >>>On Wed, Aug 03 2016 at 11:35am -0400, > > >>>Benjamin Block <bbl...@linux.vnet.ibm.com> wrote: > > >>> > > >>>>Hej Mike, > > >>>> > > >>>>when running a debug-kernel today with several multipath-devices using > > >>>>the round-robin path selector I noticed that the kernel throws these > > >>>>warnings here: > > >>>> > > >>>>BUG: using smp_processor_id() in preemptible [] code: > > >>>>kdmwork-252:0/881 > > >>>>caller is rr_select_path+0x36/0x108 [dm_round_robin] > > >>>>CPU: 1 PID: 881 Comm: kdmwork-252:0 Not tainted 4.7.0-debug #4 > > >>>> 617679b8 61767a48 0002 > > >>>> > > >>>> 61767ae8 61767a60 61767a60 > > >>>> 001145d0 > > >>>> 00b962ae 00bb291e > > >>>> 000b > > >>>> 61767aa8 61767a48 > > >>>> > > >>>> 07b962ae 001145d0 61767a48 > > >>>> 61767aa8 > > >>>>Call Trace: > > >>>>([<001144a2>] show_trace+0x8a/0xe0) > > >>>>([<00114586>] show_stack+0x8e/0xf0) > > >>>>([<006c7fdc>] dump_stack+0x9c/0xe0) > > >>>>([<006fbbc0>] check_preemption_disabled+0x108/0x130) > > >>>>([<03ff80268646>] rr_select_path+0x36/0x108 [dm_round_robin]) > > >>>>([<03ff80259a42>] choose_path_in_pg+0x42/0xc8 [dm_multipath]) > > >>>>([<03ff80259b62>] choose_pgpath+0x9a/0x1a0 [dm_multipath]) > > >>>>([<03ff8025b51a>] __multipath_map.isra.5+0x72/0x228 [dm_multipath]) > > >>>>([<03ff8025b75e>] multipath_map+0x3e/0x50 [dm_multipath]) > > >>>>([<03ff80225eb6>] map_request+0x66/0x458 [dm_mod]) > > >>>>([<03ff802262ec>] map_tio_request+0x44/0x70 [dm_mod]) > > >>>>([<0016835a>] kthread_worker_fn+0xf2/0x1d8) > > >>>>([<001681da>] kthread+0x112/0x120) > > >>>>([<0098378a>] kernel_thread_starter+0x6/0xc) > > >>>>([<00983784>] kernel_thread_starter+0x0/0xc) > > >>>>no locks held by kdmwork-252:0/881. > > >>>> [:snip:] > > > > I always forget the details (if this confuses lockdep or not), but you > > could potentially turn it into: > > > > local_irq_save(flags); > > x = this_cpu_ptr(); > > [...] > > > > spin_lock(>lock); > > [...] > > > > instead. > > Cool, I've coded up the patch (compile tested only). > > Benjamin, any chance you could test this against your v4.7 kernel and > report back? > > Thanks, > Mike > > diff --git a/drivers/md/dm-round-robin.c b/drivers/md/dm-round-robin.c > index 4ace1da..ed446f8 100644 > --- a/drivers/md/dm-round-robin.c > +++ b/drivers/md/dm-round-robin.c > @@ -210,14 +210,17 @@ static struct dm_path *rr_select_path(struct > path_selector *ps, size_t nr_bytes) > struct path_info *pi = NULL; > struct dm_path *current_path = NULL; > > + local_irq_save(flags); > current_path = *this_cpu_ptr(s->current_path); > if (current_path) { > percpu_counter_dec(>repeat_count); > - if (percpu_counter_read_positive(>repeat_count) > 0) > + if (percpu_counter_read_positive(>repeat_count) > 0) { > + local_irq_restore(flags); > return current_path; > + } > } > > - spin_lock_irqsave(>lock, flags); > + spin_lock(>lock); > if (!list_empty(>valid_paths)) { > pi = list_entry(s->valid_paths.next, struct path_info, list); > list_move_tail(>list, >valid_paths); > @@ -225,7 +228,8 @@ static struct dm_path *rr_select_path(struct &g
Re: bug: using smp_processor_id() in preemptible code in rr_select_path()
On 12:06 Fri 05 Aug , Mike Snitzer wrote: > On Fri, Aug 05 2016 at 11:54am -0400, > Jens Axboe <ax...@kernel.dk> wrote: > > > On 08/05/2016 09:42 AM, Mike Snitzer wrote: > > >On Fri, Aug 05 2016 at 11:33P -0400, > > >Jens Axboe <ax...@kernel.dk> wrote: > > > > > >>On 08/05/2016 09:27 AM, Mike Snitzer wrote: > > >>>On Wed, Aug 03 2016 at 11:35am -0400, > > >>>Benjamin Block <bbl...@linux.vnet.ibm.com> wrote: > > >>> > > >>>>Hej Mike, > > >>>> > > >>>>when running a debug-kernel today with several multipath-devices using > > >>>>the round-robin path selector I noticed that the kernel throws these > > >>>>warnings here: > > >>>> > > >>>>BUG: using smp_processor_id() in preemptible [] code: > > >>>>kdmwork-252:0/881 > > >>>>caller is rr_select_path+0x36/0x108 [dm_round_robin] > > >>>>CPU: 1 PID: 881 Comm: kdmwork-252:0 Not tainted 4.7.0-debug #4 > > >>>> 617679b8 61767a48 0002 > > >>>> > > >>>> 61767ae8 61767a60 61767a60 > > >>>> 001145d0 > > >>>> 00b962ae 00bb291e > > >>>> 000b > > >>>> 61767aa8 61767a48 > > >>>> > > >>>> 07b962ae 001145d0 61767a48 > > >>>> 61767aa8 > > >>>>Call Trace: > > >>>>([<001144a2>] show_trace+0x8a/0xe0) > > >>>>([<00114586>] show_stack+0x8e/0xf0) > > >>>>([<006c7fdc>] dump_stack+0x9c/0xe0) > > >>>>([<006fbbc0>] check_preemption_disabled+0x108/0x130) > > >>>>([<03ff80268646>] rr_select_path+0x36/0x108 [dm_round_robin]) > > >>>>([<03ff80259a42>] choose_path_in_pg+0x42/0xc8 [dm_multipath]) > > >>>>([<03ff80259b62>] choose_pgpath+0x9a/0x1a0 [dm_multipath]) > > >>>>([<03ff8025b51a>] __multipath_map.isra.5+0x72/0x228 [dm_multipath]) > > >>>>([<03ff8025b75e>] multipath_map+0x3e/0x50 [dm_multipath]) > > >>>>([<03ff80225eb6>] map_request+0x66/0x458 [dm_mod]) > > >>>>([<03ff802262ec>] map_tio_request+0x44/0x70 [dm_mod]) > > >>>>([<0016835a>] kthread_worker_fn+0xf2/0x1d8) > > >>>>([<001681da>] kthread+0x112/0x120) > > >>>>([<0098378a>] kernel_thread_starter+0x6/0xc) > > >>>>([<00983784>] kernel_thread_starter+0x0/0xc) > > >>>>no locks held by kdmwork-252:0/881. > > >>>> > > >>>> > > >>>>There are several more of these warnings, but all have the same > > >>>>stack-trace (this is on s390x, but this looks like its only common code) > > >>>>- sometimes the process-context is multipath. > > >>>> > > >>>>Looking at the changes in this function, it rather looks like this is > > >>>>caused by changes made in commit > > >>>>b0b477c7e0dd93f8916d106018ded1331b81bf61 (dm round robin: use percpu > > >>>>'repeat_count' and 'current_path'). > > >>>> > > >>>>The kernel is a stock v4.7 with some debug options enabled (prominently > > >>>>DEBUG_PREEMPT). Need any more info? > > >>> > > >>>As you can see from commit b0b477c7e0dd9 the round-robin path selector > > >>>is now using percpu data (pointer) and a percpu_counter. > > >>> > > >>>I'm really not sure how else to access this percpu data. > > >>> > > >>>Cc'ing LKML, linux-block, Jens and hch to cast a wider net in the hopes > > >>>of getting an answer to how to fix this. > > >> > > >>From a quick look, looks like you are using this_cpu_ptr() without > > >>having preemption disabled. > > > > > >Right, that is what it looked like to me too. > > > > > >I'm just not sure on what the proper pattern is to fix this. > > > > > >I'll look closer though. > > > > I always forget the details (if this confuses lockdep or not), b
Re: [PATCH v4 0/6] Avoid that scsi-mq and dm-mq queue processing stalls sporadically
On Fri, Apr 07, 2017 at 11:16:48AM -0700, Bart Van Assche wrote: > Hello Jens, > > The six patches in this patch series fix the queue lockup I reported > recently on the linux-block mailing list. Please consider these patches > for inclusion in the upstream kernel. > Hey Bart, just out of curiosity. Is this maybe related to similar stuff happening when CPUs are hot plugged - at least in that the stack gets stuck? Like in this thread here: https://www.mail-archive.com/linux-block@vger.kernel.org/msg06057.html Would be interesting, because we recently saw similar stuff happening. Beste Grüße / Best regards, - Benjamin Block -- Linux on z Systems Development / IBM Systems & Technology Group IBM Deutschland Research & Development GmbH Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294
Re: [PATCH v4 0/6] Avoid that scsi-mq and dm-mq queue processing stalls sporadically
On Wed, Apr 12, 2017 at 06:11:25PM +, Bart Van Assche wrote: > On Wed, 2017-04-12 at 12:55 +0200, Benjamin Block wrote: > > On Fri, Apr 07, 2017 at 11:16:48AM -0700, Bart Van Assche wrote: > > > The six patches in this patch series fix the queue lockup I reported > > > recently on the linux-block mailing list. Please consider these patches > > > for inclusion in the upstream kernel. > > > > just out of curiosity. Is this maybe related to similar stuff happening > > when CPUs are hot plugged - at least in that the stack gets stuck? Like > > in this thread here: > > https://www.mail-archive.com/linux-block@vger.kernel.org/msg06057.html > > > > Would be interesting, because we recently saw similar stuff happening. > > Hello Benjamin, > > My proposal is to repeat that test with Jens' for-next branch. If the issue > still occurs with that tree then please check the contents of > /sys/kernel/debug/block/*/mq/*/{dispatch,*/rq_list}. That will allow to > determine whether or not any block layer requests are still pending. If > running the command below resolves the deadlock then it means that a > trigger to run a block layer queue is still missing somewhere: > > for a in /sys/kernel/debug/block/*/mq/state; do echo run >$a; done > > See also git://git.kernel.dk/linux-block.git. > Thx for the hint! I'll forward that and see if the affected folks are willing to reproduce. Beste Grüße / Best regards, - Benjamin Block -- Linux on z Systems Development / IBM Systems & Technology Group IBM Deutschland Research & Development GmbH Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294
Re: [RFC PATCH 1/6] bsg: fix kernel panic resulting from missing allocation of a reply-buffer
d: GB 4.13.0-rc4-bsg-regression+ #2 Hardware name: IBM 2964 N96 702 (z/VM 6.4.0) Call Trace: ([<00117532>] show_stack+0x8a/0xe0) [<00bcbaee>] dump_stack+0x96/0xd8 [<003cd5fc>] slab_err+0xac/0xc0 [<003d68e4>] free_debug_processing+0x554/0x570 [<003d69ae>] __slab_free+0xae/0x618 [<003d7dce>] kfree+0x44e/0x4a0 [<00859b4e>] blk_done_softirq+0x146/0x160 [<00bf4ec0>] __do_softirq+0x3d0/0x840 [<001662a6>] run_ksoftirqd+0x3e/0xb8 [<001957fc>] smpboot_thread_fn+0x2f4/0x318 [<0018f6f6>] kthread+0x166/0x178 [<00bf3cf2>] kernel_thread_starter+0x6/0xc [<00bf3cec>] kernel_thread_starter+0x0/0xc INFO: lockdep is turned off. FIX kmalloc-1024: Object at 0x4ad9d650 not freed = BUG kmalloc-1024 (Tainted: GB ): Invalid object pointer 0x4ad9eb90 - INFO: Slab 0x03d1012b6600 objects=24 used=17 fp=0x4ad99548 flags=0x3fffc008101 CPU: 2 PID: 20 Comm: ksoftirqd/2 Tainted: GB 4.13.0-rc4-bsg-regression+ #2 Hardware name: IBM 2964 N96 702 (z/VM 6.4.0) Call Trace: ([<00117532>] show_stack+0x8a/0xe0) [<00bcbaee>] dump_stack+0x96/0xd8 [<003cd5fc>] slab_err+0xac/0xc0 [<003d68e4>] free_debug_processing+0x554/0x570 [<003d69ae>] __slab_free+0xae/0x618 [<003d7dce>] kfree+0x44e/0x4a0 [<00859b4e>] blk_done_softirq+0x146/0x160 [<00bf4ec0>] __do_softirq+0x3d0/0x840 [<001662a6>] run_ksoftirqd+0x3e/0xb8 [<001957fc>] smpboot_thread_fn+0x2f4/0x318 [<0018f6f6>] kthread+0x166/0x178 [<00bf3cf2>] kernel_thread_starter+0x6/0xc [<00bf3cec>] kernel_thread_starter+0x0/0xc INFO: lockdep is turned off. FIX kmalloc-1024: Object at 0x4ad9eb90 not freed > > --- > From 4cd32ee48e334b62b55bff0d380833b978454040 Mon Sep 17 00:00:00 2001 > From: Christoph Hellwig <h...@lst.de> > Date: Fri, 11 Aug 2017 11:03:29 +0200 > Subject: bsg-lib: allocate sense data for each request > > Since we split the scsi_request out of the request the driver is supposed > to provide storage for the sense buffer. The bsg-lib code failed to do so, > though and will crash anytime it is used. > > This patch moves bsg-lib to allocate and setup the bsg_job ahead of time, > and allocate the sense data, which is used as reply buffer in bsg. > > Reported-by: Steffen Maier <ma...@linux.vnet.ibm.com> > Signed-off-by: Benjamin Block <bbl...@linux.vnet.ibm.com> > Fixes: 82ed4db499b8 ("block: split scsi_request out of struct request") > Cc: <sta...@vger.kernel.org> #4.11+ > --- > block/bsg-lib.c | 53 > +++-- > include/linux/blkdev.h | 1 - > include/linux/bsg-lib.h | 2 ++ > 3 files changed, 31 insertions(+), 25 deletions(-) > > diff --git a/block/bsg-lib.c b/block/bsg-lib.c > index c4513b23f57a..215893dbd038 100644 > --- a/block/bsg-lib.c > +++ b/block/bsg-lib.c > @@ -100,7 +100,7 @@ EXPORT_SYMBOL_GPL(bsg_job_done); > */ > static void bsg_softirq_done(struct request *rq) > { > - struct bsg_job *job = rq->special; > + struct bsg_job *job = blk_mq_rq_to_pdu(rq); > > bsg_job_put(job); > } > @@ -129,26 +129,9 @@ static int bsg_map_buffer(struct bsg_buffer *buf, struct > request *req) > static int bsg_create_job(struct device *dev, struct request *req) > { > struct request *rsp = req->next_rq; > - struct request_queue *q = req->q; > - struct scsi_request *rq = scsi_req(req); > - struct bsg_job *job; > + struct bsg_job *job = blk_mq_rq_to_pdu(req); > int ret; > > - BUG_ON(req->special); > - > - job = kzalloc(sizeof(struct bsg_job) + q->bsg_job_size, GFP_KERNEL); > - if (!job) > - return -ENOMEM; > - > - req->special = job; > - job->req = req; > - if (q->bsg_job_size) > - job->dd_data = (void *)[1]; > - job->request = rq->cmd; > - job->request_len = rq->cmd_len; > - job->reply = rq->sense; > - job->reply_len = SCSI_SENSE_BUFFERSIZE; /* Size of sense buffer > - * allocated */ > if (req->bio) { > ret = bsg_map_buffer(>request_payload, req); > if (ret) > @@ -187,7 +170,6 @@ static void bsg_request_fn(struct request_queue *q) > { > struct device *dev = q->qu
Re: [RFC PATCH 4/6] bsg: refactor ioctl to use regular BSG-command infrastructure for SG_IO
On Thu, Aug 10, 2017 at 10:24:56AM +0200, Johannes Thumshirn wrote: > On Wed, Aug 09, 2017 at 04:11:18PM +0200, Benjamin Block wrote: > > + return 0 == (bc->hdr.flags & BSG_FLAG_Q_AT_TAIL); > > return !(bc->hdr.flags & BSG_FLAG_Q_AT_TAIL); and make the function return > bool? I have to admit, this is the 1st time I have seen the above construct. > Hmm yeah, odd. To be honest, I just copied the expression so that it is obvious that no behavior changed, but just the place. I'll replace it with what you suggest. Beste Grüße / Best regards, - Benjamin Block -- Linux on z Systems Development / IBM Systems & Technology Group IBM Deutschland Research & Development GmbH Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294
Re: [RFC PATCH 5/6] bsg: reduce unnecessary arguments for bsg_map_hdr()
On Thu, Aug 10, 2017 at 11:35:31AM +0200, Christoph Hellwig wrote: > On Wed, Aug 09, 2017 at 04:11:19PM +0200, Benjamin Block wrote: > > Since struct bsg_command is now used in every calling case, we don't > > need separation of arguments anymore that are contained in the same > > bsg_command. > > > > Signed-off-by: Benjamin Block <bbl...@linux.vnet.ibm.com> > > --- > > block/bsg.c | 13 ++--- > > 1 file changed, 6 insertions(+), 7 deletions(-) > > > > diff --git a/block/bsg.c b/block/bsg.c > > index 8517361a9b3f..6ee2ffca808a 100644 > > --- a/block/bsg.c > > +++ b/block/bsg.c > > @@ -212,11 +212,12 @@ bsg_validate_sgv4_hdr(struct sg_io_v4 *hdr, int *op) > > * map sg_io_v4 to a request. > > */ > > static struct request * > > -bsg_map_hdr(struct bsg_device *bd, struct sg_io_v4 *hdr, fmode_t > > has_write_perm, > > - u8 *reply_buffer) > > +bsg_map_hdr(struct bsg_device *bd, struct bsg_command *bc, > > + fmode_t has_write_perm) > > I wish we could just rename the argument to mode and pass on the > whole file->f_mode while you are cleaning up this code. That should > be a separate patch, though. > Hmm, I did a quick pass through the code and the only place this seems to be used, is to pass it to blk_verify_command() if the subcommand used in the BSG request is a SCSI Command. And this has the same semantics. So I guess this would require adjustments to the whole stack, as this is also used from the 'normal' SG side of the world. Beste Grüße / Best regards, - Benjamin Block -- Linux on z Systems Development / IBM Systems & Technology Group IBM Deutschland Research & Development GmbH Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294
Re: [RFC PATCH 1/6] bsg: fix kernel panic resulting from missing allocation of a reply-buffer
On Fri, Aug 11, 2017 at 12:10:38AM +0200, Benjamin Block wrote: > On Thu, Aug 10, 2017 at 11:32:17AM +0200, Christoph Hellwig wrote: > > We can't use an on-stack buffer for the sense data, as drivers will > > dma to it. So we should reuse the SCSI init_rq_fn() for the BSG > > queues and/or implement the same scheme. > > > ... > > struct sg_io_v4 > +--+ > | | > | request>++ > | + _len | || > | (A)| | BSG Request| > | | | e.g. struct fc_bsg_request | Depends on BSG > implementation > | | || FC vs. iSCSI vs. ... > | | ++ > | | > | response--->++ Used as _Output_ > | + max_len | || User doesn't initialize > | (B)| | BSG Reply | User provides (optional) > | | | e.g. struct fc_bsg_reply | memory; May be NULL. > | | || > | | ++ > | | > | dout_xferp->+---+ Stuff send on the wire by > | + _len | | | the LLD > | (C)| | Transport Protocol IU | Aligned on PAGE_SIZE > | | | e.g. FC-GS-7 CT_IU| > | | | | > | | +---+ > | | > | din_xferp-->+---+ Buffer for response data by > | + _len | | | the LLD > | (D)| | Transport Protocol IU | Aligned on PAGE_SIZE > | | | e.g. FC-GS-7 CT_IU| > | | | | > | | +---+ > +--+ > ... > > struct request (E) > +--+ > | | struct scsi_request > | scsi_request--->+-+ > | | | | > | | | cmd-> Copy of (A) > | | | + _len | Space in struct or kzalloc > | | | (G)| > | | | | > | | | sense---> Space for BSG Reply > | | | + _len | Same Data-Structure as (B) > | | | (H)| NOT actually pointer (B) > | | | | 'reply_buffer' in my patch > | | +-+ > | | > | bio> Mapped via blk_rq_map_user() to (C) dout_xferp > | | > | next_rq-+ > | | | > +--+ | >| > struct request (F)|(if used) > +--+<-+ > | | > | scsi_request---> Unused here > | | > | bio> Mapped via blk_rq_map_user() to (D) din_xferp > | | > +--+ > ... > > struct bsg_job > +-+ > | | > | request---> (G) scsi_request->cmd -> Copy of (A) > | + _len| e.g. struct fc_bsg_request > | | > | reply-> (H) scsi_request->sense -> 'reply_buffer' > | + _len| e.g. struct fc_bsg_reply > | | > | request_payload---> struct scatterlist ... map (E)->bio > | + _len| > | (I) | > | | > | reply_payload-> struct scatterlist ... map (F)->bio > | + _len| > | (J) | > | | > +-+ > > > This worked till it broke. Right now every driver that tries to access > (H) will panic the system, or cause very undefined behavior. I suspect > no driver right now tries to do any DMA into (H); before the regression, > this has been also an on-stack variable (I suspect since BSG was > introduced, haven't checked though). > > The asymmetries between the first struct request (E) and the following > (F) also makes it hard to use the same scheme as in other drivers, where > init_rq_fn() gets to initialize each request in the same'ish way. Or? > Just looking at it right now, this would require some bigger rework that > is not appropriate for a stable bug-fix. > Just some more brain-dump here. One more problem for direct DMA into (H) in the current BSG setup is probably, that the transport classes have each their own private format for the BSG reply (
Re: [RFC PATCH 1/6] bsg: fix kernel panic resulting from missing allocation of a reply-buffer
Hey Christoph, I looked over the patch in detail, see below. > From f5b03b82df0569c035022c1c2535696186907f1a Mon Sep 17 00:00:00 2001 > From: Christoph Hellwig <h...@lst.de> > Date: Fri, 11 Aug 2017 11:03:29 +0200 > Subject: bsg-lib: allocate sense data for each request > > Since we split the scsi_request out of the request the driver is supposed > to provide storage for the sense buffer. The bsg-lib code failed to do so, > though and will crash anytime it is used. > > This patch moves bsg-lib to allocate and setup the bsg_job ahead of time, > and allocate the sense data, which is used as reply buffer in bsg. > > Reported-by: Steffen Maier <ma...@linux.vnet.ibm.com> > Signed-off-by: Benjamin Block <bbl...@linux.vnet.ibm.com> > Fixes: 82ed4db499b8 ("block: split scsi_request out of struct request") > Cc: <sta...@vger.kernel.org> #4.11+ > --- > block/bsg-lib.c | 51 > +++-- > include/linux/blkdev.h | 1 - > include/linux/bsg-lib.h | 2 ++ > 3 files changed, 30 insertions(+), 24 deletions(-) > > diff --git a/block/bsg-lib.c b/block/bsg-lib.c > index c4513b23f57a..c07333c3b785 100644 > --- a/block/bsg-lib.c > +++ b/block/bsg-lib.c > @@ -37,13 +37,11 @@ static void bsg_destroy_job(struct kref *kref) > struct bsg_job *job = container_of(kref, struct bsg_job, kref); > struct request *rq = job->req; > > - blk_end_request_all(rq, BLK_STS_OK); > - > put_device(job->dev); /* release reference for the request */ > > kfree(job->request_payload.sg_list); > kfree(job->reply_payload.sg_list); > - kfree(job); > + blk_end_request_all(rq, BLK_STS_OK); What is the reason for moving that last line? Just wondering whether that might change the behavior somehow, although it doesn't look like it from the code. > } > > void bsg_job_put(struct bsg_job *job) > @@ -100,7 +98,7 @@ EXPORT_SYMBOL_GPL(bsg_job_done); > */ > static void bsg_softirq_done(struct request *rq) > { > - struct bsg_job *job = rq->special; > + struct bsg_job *job = blk_mq_rq_to_pdu(rq); > > bsg_job_put(job); > } > @@ -129,24 +127,11 @@ static int bsg_map_buffer(struct bsg_buffer *buf, > struct request *req) > static int bsg_create_job(struct device *dev, struct request *req) > { > struct request *rsp = req->next_rq; > - struct request_queue *q = req->q; > + struct bsg_job *job = blk_mq_rq_to_pdu(req); > struct scsi_request *rq = scsi_req(req); > - struct bsg_job *job; > int ret; > > - BUG_ON(req->special); > - > - job = kzalloc(sizeof(struct bsg_job) + q->bsg_job_size, GFP_KERNEL); > - if (!job) > - return -ENOMEM; > - > - req->special = job; > - job->req = req; > - if (q->bsg_job_size) > - job->dd_data = (void *)[1]; > - job->request = rq->cmd; > job->request_len = rq->cmd_len; > - job->reply = rq->sense; > job->reply_len = SCSI_SENSE_BUFFERSIZE; /* Size of sense buffer >* allocated */ > if (req->bio) { > @@ -187,7 +172,6 @@ static void bsg_request_fn(struct request_queue *q) > { > struct device *dev = q->queuedata; > struct request *req; > - struct bsg_job *job; > int ret; > > if (!get_device(dev)) > @@ -207,8 +191,7 @@ static void bsg_request_fn(struct request_queue *q) > continue; > } > > - job = req->special; > - ret = q->bsg_job_fn(job); > + ret = q->bsg_job_fn(blk_mq_rq_to_pdu(req)); > spin_lock_irq(q->queue_lock); > if (ret) > break; > @@ -219,6 +202,27 @@ static void bsg_request_fn(struct request_queue *q) > spin_lock_irq(q->queue_lock); > } > > +static int bsg_init_rq(struct request_queue *q, struct request *req, gfp_t > gfp) > +{ > + struct bsg_job *job = blk_mq_rq_to_pdu(req); > + > + memset(job, 0, sizeof(*job)); > + job->req = req; > + job->request = job->sreq.cmd; That doesn't work with bsg.c if the request submitted by the user is bigger than BLK_MAX_CDB. There is code in blk_fill_sgv4_hdr_rq() that will reassign the req->cmd point in that case to something else. This is maybe wrong in the same vein as my Patch 1 is. If not for the legacy code in bsg.c, setting this here, will miss changes to that pointer between request-allocation and job-submission. So maybe just move this to bsg_create_job(). > + job->dd_data = job + 1; > + job
Re: [RFC PATCH 1/6] bsg: fix kernel panic resulting from missing allocation of a reply-buffer
On Sun, Aug 13, 2017 at 04:39:40PM +0200, Christoph Hellwig wrote: > On Fri, Aug 11, 2017 at 06:01:42PM +0200, Benjamin Block wrote: > > When the BSG interface is used with bsg-lib, and the user sends a > > Bidirectional command - so when he gives an input- and output-buffer > > (most users of our interface will likely do that, if they wanna get the > > transport-level response data) - bsg will allocate two requests from the > > queue. The first request's bio is used to map the input and the second > > request's bio for the output (see bsg_map_hdr() in the if-statement with > > (op == REQ_OP_SCSI_OUT && hdr->din_xfer_len)). > > > > When we now allocate the full space of bsg_job, sense, dd_data for each > > request, these will be wasted on the (linked) second request. They will > > go unused all the time, as only the first request's bsg_job, sense and > > dd_data is used by the LLDs and BSG itself. > > > > Right now, because we don't allocate this on each request, those spaces > > are only allocated for the first request in bsg-lib. > > > > Maybe we can ignore this, if it gets to complicated, I don't wanne > > prolong this unnecessary. > > We have the same 'issue' with bidirection scsi commands - it's a side > effect of having two request structures for these commands, and the > only real fix would be to have a single request structure, which would > be nice especially if we can't do it without growing struct request. > Alright, I was not aware of that. That is fair then. Thx. Beste Grüße / Best regards, - Benjamin Block -- Linux on z Systems Development / IBM Systems & Technology Group IBM Deutschland Research & Development GmbH Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294
[RFC PATCH 2/6] bsg: assign sense_len instead of fixed SCSI_SENSE_BUFFERSIZE
We do set rq->sense_len when we assigne the reply-buffer in blk_fill_sgv4_hdr_rq(). No point in possibly deviating from this value later on. bsg-lib.h specifies: unsigned int reply_len; /* * On entry : reply_len indicates the buffer size allocated for * the reply. * * ... */ Signed-off-by: Benjamin Block <bbl...@linux.vnet.ibm.com> --- block/bsg-lib.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/block/bsg-lib.c b/block/bsg-lib.c index c4513b23f57a..c7c2c6bbb5ae 100644 --- a/block/bsg-lib.c +++ b/block/bsg-lib.c @@ -147,8 +147,8 @@ static int bsg_create_job(struct device *dev, struct request *req) job->request = rq->cmd; job->request_len = rq->cmd_len; job->reply = rq->sense; - job->reply_len = SCSI_SENSE_BUFFERSIZE; /* Size of sense buffer -* allocated */ + job->reply_len = rq->sense_len; + if (req->bio) { ret = bsg_map_buffer(>request_payload, req); if (ret) -- 2.12.2
[RFC PATCH 5/6] bsg: reduce unnecessary arguments for bsg_map_hdr()
Since struct bsg_command is now used in every calling case, we don't need separation of arguments anymore that are contained in the same bsg_command. Signed-off-by: Benjamin Block <bbl...@linux.vnet.ibm.com> --- block/bsg.c | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/block/bsg.c b/block/bsg.c index 8517361a9b3f..6ee2ffca808a 100644 --- a/block/bsg.c +++ b/block/bsg.c @@ -212,11 +212,12 @@ bsg_validate_sgv4_hdr(struct sg_io_v4 *hdr, int *op) * map sg_io_v4 to a request. */ static struct request * -bsg_map_hdr(struct bsg_device *bd, struct sg_io_v4 *hdr, fmode_t has_write_perm, - u8 *reply_buffer) +bsg_map_hdr(struct bsg_device *bd, struct bsg_command *bc, + fmode_t has_write_perm) { struct request_queue *q = bd->queue; struct request *rq, *next_rq = NULL; + struct sg_io_v4 *hdr = >hdr; int ret; unsigned int op, dxfer_len; void __user *dxferp = NULL; @@ -244,7 +245,7 @@ bsg_map_hdr(struct bsg_device *bd, struct sg_io_v4 *hdr, fmode_t has_write_perm, if (IS_ERR(rq)) return rq; - ret = blk_fill_sgv4_hdr_rq(q, rq, hdr, bd, reply_buffer, + ret = blk_fill_sgv4_hdr_rq(q, rq, hdr, bd, bc->reply_buffer, has_write_perm); if (ret) goto out; @@ -633,8 +634,7 @@ static int __bsg_write(struct bsg_device *bd, const char __user *buf, /* * get a request, fill in the blanks, and add to request queue */ - rq = bsg_map_hdr(bd, >hdr, has_write_perm, -bc->reply_buffer); + rq = bsg_map_hdr(bd, bc, has_write_perm); if (IS_ERR(rq)) { ret = PTR_ERR(rq); rq = NULL; @@ -934,8 +934,7 @@ static long bsg_ioctl(struct file *file, unsigned int cmd, unsigned long arg) goto sg_io_out; } - bc->rq = bsg_map_hdr(bd, >hdr, file->f_mode & FMODE_WRITE, -bc->reply_buffer); + bc->rq = bsg_map_hdr(bd, bc, file->f_mode & FMODE_WRITE); if (IS_ERR(bc->rq)) { ret = PTR_ERR(bc->rq); goto sg_io_out; -- 2.12.2
[RFC PATCH 0/6] bsg: fix regression resulting in panics when sending commands via BSG and some sanity cleanups
Hello all, Steffen noticed recently that we have a regression in the BSG code that prevents us from sending any traffic over this interface. After I researched this a bit, it turned out that this affects not only zFCP, but likely all LLDs that implements the BSG API. This was introduced in 4.11 (details in Patch 1 of this series). I imagine the regression happened because of some very "unfortunate" variable- namings. I can not fix them, as they involve the base struct request, but I tried to add some cleanups that should make the relationships between stuff more visible in the future I hope. Patch 1 - Regression Fix; Also tagged for stable Patch 2-6 - Cleanups I tagged this as RFC. Patches 2-6 are a 'nice to have' IMO, Patch 1 is obviously necessary, and if it is OK, I can re-send it separately if necessary. If you don't like the changes in the other patches, I don't mind dropping them. I am not sure about Patch 4. It certainly works, but it changes user-visible behavior, in what I believe is within the behavior described by the SG interface. It makes the different methods of how BSG passes commands down to the LLDs more conform with each other - even though I can't make them the exact same. More details in the patch description. I rebased the series on Jens' for-next and I have function-tested the series on s390x's zFCP with the tools provided in the zfcp HBA API library (https://www.ibm.com/developerworks/linux/linux390/zfcp-hbaapi.html) and some custom code to test the read/write interface of BSG. Reviews are more than welcome :) Beste Grüße / Best regards, - Benjamin Block Benjamin Block (6): bsg: fix kernel panic resulting from missing allocation of a reply-buffer bsg: assign sense_len instead of fixed SCSI_SENSE_BUFFERSIZE bsg: scsi-transport: add compile-tests to prevent reply-buffer overflows bsg: refactor ioctl to use regular BSG-command infrastructure for SG_IO bsg: reduce unecessary arguments for bsg_map_hdr() bsg: reduce unecessary arguments for blk_complete_sgv4_hdr_rq() block/bsg-lib.c | 4 +- block/bsg.c | 90 ++--- drivers/scsi/scsi_transport_fc.c| 3 ++ drivers/scsi/scsi_transport_iscsi.c | 3 ++ include/linux/bsg-lib.h | 2 + 5 files changed, 65 insertions(+), 37 deletions(-) -- 2.12.2
[RFC PATCH 3/6] bsg: scsi-transport: add compile-tests to prevent reply-buffer overflows
The BSG implementations use the bsg_job's reply buffer as storage for their own custom reply structures (e.g.: struct fc_bsg_reply or struct iscsi_bsg_reply). The size of bsg_job's reply buffer and those of the implementations is not dependent in any way the compiler can currently check. To make it easier to notice accidental violations add an explicit compile- time check that tests whether the implementations' reply buffer is at most as large as bsg_job's. To do so, we have to move the size-define from bsg.c to a common header. Signed-off-by: Benjamin Block <bbl...@linux.vnet.ibm.com> --- block/bsg.c | 3 +-- drivers/scsi/scsi_transport_fc.c| 3 +++ drivers/scsi/scsi_transport_iscsi.c | 3 +++ include/linux/bsg-lib.h | 2 ++ 4 files changed, 9 insertions(+), 2 deletions(-) diff --git a/block/bsg.c b/block/bsg.c index 285b1b8126c3..b924f1c23c58 100644 --- a/block/bsg.c +++ b/block/bsg.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include @@ -74,8 +75,6 @@ static int bsg_major; static struct kmem_cache *bsg_cmd_cachep; -#define BSG_COMMAND_REPLY_BUFFERSIZE SCSI_SENSE_BUFFERSIZE - /* * our internal command type */ diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c index 892fbd9800d9..ce6654b5d329 100644 --- a/drivers/scsi/scsi_transport_fc.c +++ b/drivers/scsi/scsi_transport_fc.c @@ -3736,6 +3736,9 @@ static int fc_bsg_dispatch(struct bsg_job *job) { struct Scsi_Host *shost = fc_bsg_to_shost(job); + BUILD_BUG_ON(sizeof(struct fc_bsg_reply) > +BSG_COMMAND_REPLY_BUFFERSIZE); + if (scsi_is_fc_rport(job->dev)) return fc_bsg_rport_dispatch(shost, job); else diff --git a/drivers/scsi/scsi_transport_iscsi.c b/drivers/scsi/scsi_transport_iscsi.c index a424eaeafeb0..4e021c949ad7 100644 --- a/drivers/scsi/scsi_transport_iscsi.c +++ b/drivers/scsi/scsi_transport_iscsi.c @@ -1483,6 +1483,9 @@ static int iscsi_bsg_host_dispatch(struct bsg_job *job) int cmdlen = sizeof(uint32_t); /* start with length of msgcode */ int ret; + BUILD_BUG_ON(sizeof(struct iscsi_bsg_reply) > +BSG_COMMAND_REPLY_BUFFERSIZE); + /* check if we have the msgcode value at least */ if (job->request_len < sizeof(uint32_t)) { ret = -ENOMSG; diff --git a/include/linux/bsg-lib.h b/include/linux/bsg-lib.h index e34dde2da0ef..85d7c7678cc6 100644 --- a/include/linux/bsg-lib.h +++ b/include/linux/bsg-lib.h @@ -25,6 +25,8 @@ #include +#define BSG_COMMAND_REPLY_BUFFERSIZE SCSI_SENSE_BUFFERSIZE + struct request; struct device; struct scatterlist; -- 2.12.2
[RFC PATCH 6/6] bsg: reduce unnecessary arguments for blk_complete_sgv4_hdr_rq()
Since struct bsg_command is now used in every calling case, we don't need separation of arguments anymore that are contained in the same bsg_command. Signed-off-by: Benjamin Block <bbl...@linux.vnet.ibm.com> --- block/bsg.c | 20 +--- 1 file changed, 9 insertions(+), 11 deletions(-) diff --git a/block/bsg.c b/block/bsg.c index 6ee2ffca808a..09f767cdf816 100644 --- a/block/bsg.c +++ b/block/bsg.c @@ -399,13 +399,14 @@ static struct bsg_command *bsg_get_done_cmd(struct bsg_device *bd) return bc; } -static int blk_complete_sgv4_hdr_rq(struct request *rq, struct sg_io_v4 *hdr, - struct bio *bio, struct bio *bidi_bio) +static int blk_complete_sgv4_hdr_rq(struct bsg_command *bc) { + struct sg_io_v4 *hdr = >hdr; + struct request *rq = bc->rq; struct scsi_request *req = scsi_req(rq); int ret = 0; - dprintk("rq %p bio %p 0x%x\n", rq, bio, req->result); + dprintk("rq %p bio %p 0x%x\n", rq, bc->bio, req->result); /* * fill in all the output members */ @@ -432,7 +433,7 @@ static int blk_complete_sgv4_hdr_rq(struct request *rq, struct sg_io_v4 *hdr, if (rq->next_rq) { hdr->dout_resid = req->resid_len; hdr->din_resid = scsi_req(rq->next_rq)->resid_len; - blk_rq_unmap_user(bidi_bio); + blk_rq_unmap_user(bc->bidi_bio); blk_put_request(rq->next_rq); } else if (rq_data_dir(rq) == READ) hdr->din_resid = req->resid_len; @@ -448,7 +449,7 @@ static int blk_complete_sgv4_hdr_rq(struct request *rq, struct sg_io_v4 *hdr, if (!ret && req->result < 0) ret = req->result; - blk_rq_unmap_user(bio); + blk_rq_unmap_user(bc->bio); scsi_req_free_cmd(req); blk_put_request(rq); @@ -507,8 +508,7 @@ static int bsg_complete_all_commands(struct bsg_device *bd) if (IS_ERR(bc)) break; - tret = blk_complete_sgv4_hdr_rq(bc->rq, >hdr, bc->bio, - bc->bidi_bio); + tret = blk_complete_sgv4_hdr_rq(bc); if (!ret) ret = tret; @@ -542,8 +542,7 @@ __bsg_read(char __user *buf, size_t count, struct bsg_device *bd, * after completing the request. so do that here, * bsg_complete_work() cannot do that for us */ - ret = blk_complete_sgv4_hdr_rq(bc->rq, >hdr, bc->bio, - bc->bidi_bio); + ret = blk_complete_sgv4_hdr_rq(bc); if (copy_to_user(buf, >hdr, sizeof(bc->hdr))) ret = -EFAULT; @@ -944,8 +943,7 @@ static long bsg_ioctl(struct file *file, unsigned int cmd, unsigned long arg) blk_execute_rq(bd->queue, NULL, bc->rq, at_head); bc->hdr.duration = jiffies_to_msecs(jiffies - bc->hdr.duration); - ret = blk_complete_sgv4_hdr_rq(bc->rq, >hdr, bc->bio, - bc->bidi_bio); + ret = blk_complete_sgv4_hdr_rq(bc); if (copy_to_user(uarg, >hdr, sizeof(bc->hdr))) ret = -EFAULT; -- 2.12.2
[RFC PATCH 4/6] bsg: refactor ioctl to use regular BSG-command infrastructure for SG_IO
Before, the SG_IO ioctl for BSG devices used to use its own on-stack data to assemble and send the specified command. The read and write calls use their own infrastructure build around the struct bsg_command and a custom slab-pool for that. Rafactor this, so that SG_IO ioctl also uses struct bsg_command and the surrounding infrastructure. This way we use global defines like BSG_COMMAND_REPLY_BUFFERSIZE only in one place, rather than two, the handling of BSG commands gets more consistent, and it reduces some code- duplications (the bio-pointer handling). It also reduces the stack footprint by 320 to 384 bytes (depending on how large pointers are), and uses the active slab-implementation for efficient alloc/free. There are two other side-effects: - the 'duration' field in the sg header is now also filled for SG_IO calls, unlike before were it was always zero. - the BSG device queue-limit is also applied to SG_IO, unlike before were you could flood one BSG device with as many commands as you'd like. If one can trust older SG documentation this limit is applicable to either normal writes, or SG_IO calls; but this wasn't enforced before for SG_IO. A complete unification is not possible, as it then would also enqueue SG_IO commands in the BGS devices's command list, but this is only for the read- and write-calls. Signed-off-by: Benjamin Block <bbl...@linux.vnet.ibm.com> --- block/bsg.c | 60 1 file changed, 36 insertions(+), 24 deletions(-) diff --git a/block/bsg.c b/block/bsg.c index b924f1c23c58..8517361a9b3f 100644 --- a/block/bsg.c +++ b/block/bsg.c @@ -320,6 +320,17 @@ static void bsg_rq_end_io(struct request *rq, blk_status_t status) wake_up(>wq_done); } +static int bsg_prep_add_command(struct bsg_command *bc, struct request *rq) +{ + bc->rq = rq; + bc->bio = rq->bio; + if (rq->next_rq) + bc->bidi_bio = rq->next_rq->bio; + bc->hdr.duration = jiffies; + + return 0 == (bc->hdr.flags & BSG_FLAG_Q_AT_TAIL); +} + /* * do final setup of a 'bc' and submit the matching 'rq' to the block * layer for io @@ -327,16 +338,11 @@ static void bsg_rq_end_io(struct request *rq, blk_status_t status) static void bsg_add_command(struct bsg_device *bd, struct request_queue *q, struct bsg_command *bc, struct request *rq) { - int at_head = (0 == (bc->hdr.flags & BSG_FLAG_Q_AT_TAIL)); + int at_head = bsg_prep_add_command(bc, rq); /* * add bc command to busy queue and submit rq for io */ - bc->rq = rq; - bc->bio = rq->bio; - if (rq->next_rq) - bc->bidi_bio = rq->next_rq->bio; - bc->hdr.duration = jiffies; spin_lock_irq(>lock); list_add_tail(>list, >busy_list); spin_unlock_irq(>lock); @@ -916,31 +922,37 @@ static long bsg_ioctl(struct file *file, unsigned int cmd, unsigned long arg) return scsi_cmd_ioctl(bd->queue, NULL, file->f_mode, cmd, uarg); } case SG_IO: { - struct request *rq; - u8 reply_buffer[BSG_COMMAND_REPLY_BUFFERSIZE] = { 0, }; - struct bio *bio, *bidi_bio = NULL; - struct sg_io_v4 hdr; + struct bsg_command *bc; int at_head; - if (copy_from_user(, uarg, sizeof(hdr))) - return -EFAULT; + bc = bsg_alloc_command(bd); + if (IS_ERR(bc)) + return PTR_ERR(bc); - rq = bsg_map_hdr(bd, , file->f_mode & FMODE_WRITE, -reply_buffer); - if (IS_ERR(rq)) - return PTR_ERR(rq); + if (copy_from_user(>hdr, uarg, sizeof(bc->hdr))) { + ret = -EFAULT; + goto sg_io_out; + } - bio = rq->bio; - if (rq->next_rq) - bidi_bio = rq->next_rq->bio; + bc->rq = bsg_map_hdr(bd, >hdr, file->f_mode & FMODE_WRITE, +bc->reply_buffer); + if (IS_ERR(bc->rq)) { + ret = PTR_ERR(bc->rq); + goto sg_io_out; + } - at_head = (0 == (hdr.flags & BSG_FLAG_Q_AT_TAIL)); - blk_execute_rq(bd->queue, NULL, rq, at_head); - ret = blk_complete_sgv4_hdr_rq(rq, , bio, bidi_bio); + at_head = bsg_prep_add_command(bc, bc->rq); + blk_execute_rq(bd->queue, NULL, bc->rq, at_head); + bc->hdr.duration = jiffies_to_msecs(jiffies - bc->hdr.duration); - if (copy_to_user(uarg, , sizeof(hdr))) -
[RFC PATCH 1/6] bsg: fix kernel panic resulting from missing allocation of a reply-buffer
In contrast to the normal SCSI-lib, the BSG block-queue doesn't make use of any extra init_rq_fn() to make additional allocations during request-creation, and the request sense-pointer is not used to transport SCSI sense data, but is used as backing for the bsg_job->reply pointer; that in turn is used in the LLDs to store protocol IUs or similar stuff. This 're-purposing' of the sense-pointer is done in the BSG blk-lib (bsg_create_job()), during the queue-processing. Failing to allocate/assign it results in illegal dereferences because LLDs use this pointer unquestioned, as can be seen in the various BSG-implementations: drivers/scsi/libfc/fc_lport.c: fc_lport_bsg_request() drivers/scsi/qla2xxx/qla_bsg.c: qla24xx_bsg_request() drivers/scsi/qla4xxx/ql4_bsg.c: qla4xxx_process_vendor_specific() drivers/s390/scsi/zfcp_fc.c:zfcp_fc_ct_els_job_handler() ... An example panic on s390x, using the zFCP driver, looks like this (I had debugging on, otherwise NULL-pointer dereferences wouldn't even panic on s390x): Unable to handle kernel pointer dereference in virtual kernel address space Failing address: 6b6b6b6b6b6b6000 TEID: 6b6b6b6b6b6b6403 Fault in home space mode while using kernel ASCE. AS:01590007 R3:0024 Oops: 0038 ilc:2 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.12.0-bsg-regression+ #3 Hardware name: IBM 2964 N96 702 (z/VM 6.4.0) task: 65cb0100 task.stack: 65cb4000 Krnl PSW : 0704e0018000 03ff801e4156 (zfcp_fc_ct_els_job_handler+0x16/0x58 [zfcp]) R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3 Krnl GPRS: 0001 5fa9d0d0 5fa9d078 00e16866 03ff0290 6b6b6b6b6b6b6b6b 59f78f00 000f 593a0958 593a0958 60d88800 5ddd4c38 58b50100 0700659cba08 03ff801e8556 659cb9a8 Krnl Code: 03ff801e4146: e3102054lg %r1,80(%r2) 03ff801e414c: 58402040 l %r4,64(%r2) #03ff801e4150: e3502024 lg %r5,32(%r2) >03ff801e4156: 50405004 st %r4,4(%r5) 03ff801e415a: e54c5008 mvhi8(%r5),0 03ff801e4160: e33010280012 lt %r3,40(%r1) 03ff801e4166: a718fffb lhi %r1,-5 03ff801e416a: 1803 lr %r0,%r3 Call Trace: ([<03ff801e8556>] zfcp_fsf_req_complete+0x726/0x768 [zfcp]) [<03ff801ea82a>] zfcp_fsf_reqid_check+0x102/0x180 [zfcp] [<03ff801eb980>] zfcp_qdio_int_resp+0x230/0x278 [zfcp] [<009b91b6>] qdio_kick_handler+0x2ae/0x2c8 [<009b9e3e>] __tiqdio_inbound_processing+0x406/0xc10 [<001684c2>] tasklet_action+0x15a/0x1d8 [<00bd28ec>] __do_softirq+0x3ec/0x848 [<001675a4>] irq_exit+0x74/0xf8 [<0010dd6a>] do_IRQ+0xba/0xf0 [<00bd19e8>] io_int_handler+0x104/0x2d4 [<001033b6>] enabled_wait+0xb6/0x188 ([<0010339e>] enabled_wait+0x9e/0x188) [<0010396a>] arch_cpu_idle+0x32/0x50 [<00bd0112>] default_idle_call+0x52/0x68 [<001cd0fa>] do_idle+0x102/0x188 [<001cd41e>] cpu_startup_entry+0x3e/0x48 [<00118c64>] smp_start_secondary+0x11c/0x130 [<00bd2016>] restart_int_handler+0x62/0x78 [<>] (null) INFO: lockdep is turned off. Last Breaking-Event-Address: [<03ff801e41d6>] zfcp_fc_ct_job_handler+0x3e/0x48 [zfcp] Kernel panic - not syncing: Fatal exception in interrupt To prevent this, allocate a buffer when the BSG blk-request is setup, and before it is queued for LLD processing. Reported-by: Steffen Maier <ma...@linux.vnet.ibm.com> Signed-off-by: Benjamin Block <bbl...@linux.vnet.ibm.com> Fixes: 82ed4db499b8 ("block: split scsi_request out of struct request") Cc: <sta...@vger.kernel.org> #4.11+ --- block/bsg.c | 22 +- 1 file changed, 17 insertions(+), 5 deletions(-) diff --git a/block/bsg.c b/block/bsg.c index 37663b664666..285b1b8126c3 100644 --- a/block/bsg.c +++ b/block/bsg.c @@ -74,6 +74,8 @@ static int bsg_major; static struct kmem_cache *bsg_cmd_cachep; +#define BSG_COMMAND_REPLY_BUFFERSIZE SCSI_SENSE_BUFFERSIZE + /* * our internal command type */ @@ -85,6 +87,7 @@ struct bsg_command { struct bio *bidi_bio; int err; struct sg_io_v4 hdr; + u8 reply_buffer[BSG_COMMAND_REPLY_BUFFERSIZE]; }; static void bsg_free_command(struct bsg_command *bc) @@ -137,7 +140,7 @@ static inline struct hlist_head *bsg_dev_idx_hash(int index) static int blk_fill_sgv4_hdr_rq(struct request_queue *q, struct request *rq, struct sg_io_v4 *hdr, struct bsg_devic
Re: Is the possible cross-talking between unrelated file-descriptors on bsg-device by design?
On Tue, Sep 19, 2017 at 02:16:26PM -0400, Douglas Gilbert wrote: > On 2017-09-19 10:56 AM, Benjamin Block wrote: > > Hello linux-block, > > > > I wrote some tests recently to test patches against bsg.c and bsg-lib.c, > > and while writing those I noticed something strange: > > > > When you use the write() and read() call on multiple file-descriptors > > for a single bsg-device (FC or SCSI), it is possible that you get > > cross-talk between those different file-descriptors. > > > > E.g.: You have two independent processes open a FD on /dev/bsg/fc_host0 > > and you send commands via write() in both processes, when they both do > > read() later on - to read the response for the commands they send before > > -, it is possible that process A gets the response (Sense-Data, > > FC-Protocol-Data, ...) for a command that process B wrote and vis-a-vis. > > > > I noticed this because in my tests I 'tag' each command I send with > > write() via a value I write into the usr_ptr field of sg_io_v4. When I > > later user read() to receive responses I check for this tag in a > > hash-table and so can look up the original command. When I used this and > > spawned multiple processes with the same target bsg-device, I got > > responses for commands I don't have tags for in my hash-table, so they > > were written by an other process. This never happend when I only spawn > > one test-process. > > > > This seems awfully contra-productive.. so much that I don't see any > > point in allowing users to open more than one FD per bsg-device, because > > that would inevitably lead to very hard to debug 'bugs'. And I wonder if > > this is really by design or some regression that happend over time. > > > > I looked at the code in bsg.c and yes, it seems this is what is coded > > right now. We have a hash-table which manages all 'struct bsg_device's > > who are indexed by device-minors, so one hash-table entry per > > device-node. > > > > So eh, long talk short question, is that intended? > > Hi, > About once a year I point out that major misfeature in the bsg driver > and no-one seems to care. Most recently I pointed it out in a > discussion about SCSI SG CVE-2017-0794 6 days ago: > " ... Last time I looked at the bsg driver, a SCSI command > could be issued on one file descriptor and its data-in block > and SCSI status delivered to another file descriptor to the > same block device (potentially in another process). IOW chaos" > > It is hard to imagine any sane application relying on this bizarre > behaviour so fixing it should not break any applications. My guess > is that it would require a non-trivial patch set to fix. Would you > like to volunteer? > Interesting. So this is not a regression then. Well, personally I am intrigued to try to fix this, but professionally it is not really up to me to choose to work on this. Especially because as you say, this looks not trivial - at least if you want to let multiple applications open a FD for a single BSG device node. Lets see. But thank you for elaborating on this! Beste Grüße / Best regards, - Benjamin Block -- Linux on z Systems Development / IBM Systems & Technology Group IBM Deutschland Research & Development GmbH Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294
[PATCH 1/1] bsg-lib: fix use-after-free under memory-pressure
When under memory-pressure it is possible that the mempool which backs the 'struct request_queue' will make use of up to BLKDEV_MIN_RQ count emergency buffers - in case it can't get a regular allocation. These buffers are preallocated and once they are also used, they are re-supplied with old finished requests from the same request_queue (see mempool_free()). The bug is, when re-supplying the emergency pool, the old requests are not again ran through the callback mempool_t->alloc(), and thus also not through the callback bsg_init_rq(). Thus we skip initialization, and while the sense-buffer still should be good, scsi_request->cmd might have become to be an invalid pointer in the meantime. When the request is initialized in bsg.c, and the user's CDB is larger than BLK_MAX_CDB, bsg will replace it with a custom allocated buffer, which is freed when the user's command is finished, thus it dangles afterwards. When next a command is sent by the user that has a smaller/similar CDB as BLK_MAX_CDB, bsg will assume that scsi_request->cmd is backed by scsi_request->__cmd, will not make a custom allocation, and write into undefined memory. Fix this by splitting bsg_init_rq() into two functions: - bsg_init_job() directly replace bsg_init_rq() and only does the allocation of the sense-buffer, which is used to back the bsg job's reply buffer. This pointer should never change during the lifetime of a scsi_request, so it doesn't need re-initialization. - bsg_init_rq() is a new function that make use of 'struct request_queue's initialize_rq_fn callback (which was introduced in v4.12). This is always called before the request is given out via blk_get_request(). This function does the remaining initialization that was previously done in bsg_init_rq(), and will also do it when the request is taken from the emergency-pool of the backing mempool. Also rename bsg_exit_rq() into bsg_exit_job(), to make it fit the name-scheme. Fixes: 50b4d485528d ("bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer") Cc: <sta...@vger.kernel.org> # 4.11+ Signed-off-by: Benjamin Block <bbl...@linux.vnet.ibm.com> --- Notes: I did test this on zFCP with FC CT commands send via the ioctl() and write() system-call. That did work fine. But I would very much appreciate if anyone could run this against an other HBA or even an other implementer of bsg-lib, such as now SAS, because I have no access to such hardware here. This should make no difference to the normal cases - where each request is allocated via slab - with- or without this patch; if I didn't miss anything. Only the order is a bit mixed up - the memset is done after the sense-allocation, so I have to buffer the sense-pointer for that. But otherwise there is no difference I am aware of, so it should behave the same (does for me). I could not reproduce the memory-pressure case here in the lab.. I don't see any reason why it should work now, but I am open to suggestions :) Beste Grüße / Best regards, - Benjamin Block block/bsg-lib.c | 41 - 1 file changed, 28 insertions(+), 13 deletions(-) diff --git a/block/bsg-lib.c b/block/bsg-lib.c index c82408c7cc3c..634d1557da38 100644 --- a/block/bsg-lib.c +++ b/block/bsg-lib.c @@ -203,28 +203,42 @@ static void bsg_request_fn(struct request_queue *q) spin_lock_irq(q->queue_lock); } -static int bsg_init_rq(struct request_queue *q, struct request *req, gfp_t gfp) +static int bsg_init_job(struct request_queue *q, struct request *req, gfp_t gfp) { struct bsg_job *job = blk_mq_rq_to_pdu(req); struct scsi_request *sreq = >sreq; - memset(job, 0, sizeof(*job)); + /* called right after the request is allocated for the request_queue */ - scsi_req_init(sreq); - sreq->sense_len = SCSI_SENSE_BUFFERSIZE; - sreq->sense = kzalloc(sreq->sense_len, gfp); + sreq->sense = kzalloc(SCSI_SENSE_BUFFERSIZE, gfp); if (!sreq->sense) return -ENOMEM; - job->req = req; - job->reply = sreq->sense; - job->reply_len = sreq->sense_len; - job->dd_data = job + 1; - return 0; } -static void bsg_exit_rq(struct request_queue *q, struct request *req) +static void bsg_init_rq(struct request *req) +{ + struct bsg_job *job = blk_mq_rq_to_pdu(req); + struct scsi_request *sreq = >sreq; + void *sense = sreq->sense; + + /* called right before the request is given to the request_queue user */ + + memset(job, 0, sizeof(*job)); + + scsi_req_init(sreq); + + sreq->sense = sense; + sreq->sense_len = SCSI_SENSE_BUFFERSIZE; + + job->req = req; + job->reply = sense;
Is the possible cross-talking between unrelated file-descriptors on bsg-device by design?
Hello linux-block, I wrote some tests recently to test patches against bsg.c and bsg-lib.c, and while writing those I noticed something strange: When you use the write() and read() call on multiple file-descriptors for a single bsg-device (FC or SCSI), it is possible that you get cross-talk between those different file-descriptors. E.g.: You have two independent processes open a FD on /dev/bsg/fc_host0 and you send commands via write() in both processes, when they both do read() later on - to read the response for the commands they send before -, it is possible that process A gets the response (Sense-Data, FC-Protocol-Data, ...) for a command that process B wrote and vis-a-vis. I noticed this because in my tests I 'tag' each command I send with write() via a value I write into the usr_ptr field of sg_io_v4. When I later user read() to receive responses I check for this tag in a hash-table and so can look up the original command. When I used this and spawned multiple processes with the same target bsg-device, I got responses for commands I don't have tags for in my hash-table, so they were written by an other process. This never happend when I only spawn one test-process. This seems awfully contra-productive.. so much that I don't see any point in allowing users to open more than one FD per bsg-device, because that would inevitably lead to very hard to debug 'bugs'. And I wonder if this is really by design or some regression that happend over time. I looked at the code in bsg.c and yes, it seems this is what is coded right now. We have a hash-table which manages all 'struct bsg_device's who are indexed by device-minors, so one hash-table entry per device-node. So eh, long talk short question, is that intended? Beste Grüße / Best regards, - Benjamin Block -- Linux on z Systems Development / IBM Systems & Technology Group IBM Deutschland Research & Development GmbH Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294
Re: [PATCH 8/9] block: pass full fmode_t to blk_verify_command
0; > > return -EPERM; > @@ -234,7 +234,7 @@ static int blk_fill_sghdr_rq(struct request_queue *q, > struct request *rq, > > if (copy_from_user(req->cmd, hdr->cmdp, hdr->cmd_len)) > return -EFAULT; > - if (blk_verify_command(req->cmd, mode & FMODE_WRITE)) > + if (blk_verify_command(req->cmd, mode)) > return -EPERM; > > /* > @@ -469,7 +469,7 @@ int sg_scsi_ioctl(struct request_queue *q, struct gendisk > *disk, fmode_t mode, > if (in_len && copy_from_user(buffer, sic->data + cmdlen, in_len)) > goto error; > > - err = blk_verify_command(req->cmd, mode & FMODE_WRITE); > + err = blk_verify_command(req->cmd, mode); > if (err) > goto error; > > diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c > index 0419c2298eab..92fd870e1315 100644 > --- a/drivers/scsi/sg.c > +++ b/drivers/scsi/sg.c > @@ -217,7 +217,7 @@ static int sg_allow_access(struct file *filp, unsigned > char *cmd) > if (sfp->parentdp->device->type == TYPE_SCANNER) > return 0; > > - return blk_verify_command(cmd, filp->f_mode & FMODE_WRITE); > + return blk_verify_command(cmd, filp->f_mode); > } > > static int > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > index 02fa42d24b52..75fe9d45ead1 100644 > --- a/include/linux/blkdev.h > +++ b/include/linux/blkdev.h > @@ -1371,7 +1371,7 @@ static inline int sb_issue_zeroout(struct super_block > *sb, sector_t block, > gfp_mask, 0); > } > > -extern int blk_verify_command(unsigned char *cmd, fmode_t has_write_perm); > +extern int blk_verify_command(unsigned char *cmd, fmode_t mode); > > enum blk_default_limits { > BLK_MAX_SEGMENTS= 128, > -- > 2.14.1 > Reviewed-by: Benjamin Block <bbl...@linux.vnet.ibm.com> Beste Grüße / Best regards, - Benjamin Block -- Linux on z Systems Development / IBM Systems & Technology Group IBM Deutschland Research & Development GmbH Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294
Re: [PATCH 6/9] bsg-lib: introduce a timeout field in struct bsg_job
On Tue, Oct 03, 2017 at 12:48:42PM +0200, Christoph Hellwig wrote: > The zfcp driver wants to know the timeout for a bsg job, so add a field > to struct bsg_job for it in preparation of not exposing the request > to the bsg-lib users. > > Signed-off-by: Christoph Hellwig <h...@lst.de> > --- > block/bsg-lib.c | 1 + > drivers/s390/scsi/zfcp_fc.c | 4 ++-- > include/linux/bsg-lib.h | 2 ++ > 3 files changed, 5 insertions(+), 2 deletions(-) > > diff --git a/block/bsg-lib.c b/block/bsg-lib.c > index 15d25ccd51a5..0d5bbf6a2ddd 100644 > --- a/block/bsg-lib.c > +++ b/block/bsg-lib.c > @@ -132,6 +132,7 @@ static int bsg_prepare_job(struct device *dev, struct > request *req) > struct bsg_job *job = blk_mq_rq_to_pdu(req); > int ret; > > + job->timeout = req->timeout; > job->request = rq->cmd; > job->request_len = rq->cmd_len; > > diff --git a/drivers/s390/scsi/zfcp_fc.c b/drivers/s390/scsi/zfcp_fc.c > index 8210645c2111..9d6f69b92c81 100644 > --- a/drivers/s390/scsi/zfcp_fc.c > +++ b/drivers/s390/scsi/zfcp_fc.c > @@ -960,7 +960,7 @@ static int zfcp_fc_exec_els_job(struct bsg_job *job, > d_id = ntoh24(bsg_request->rqst_data.h_els.port_id); > > els->handler = zfcp_fc_ct_els_job_handler; > - return zfcp_fsf_send_els(adapter, d_id, els, job->req->timeout / HZ); > + return zfcp_fsf_send_els(adapter, d_id, els, job->timeout / HZ); > } > > static int zfcp_fc_exec_ct_job(struct bsg_job *job, > @@ -979,7 +979,7 @@ static int zfcp_fc_exec_ct_job(struct bsg_job *job, > return ret; > > ct->handler = zfcp_fc_ct_job_handler; > - ret = zfcp_fsf_send_ct(wka_port, ct, NULL, job->req->timeout / HZ); > + ret = zfcp_fsf_send_ct(wka_port, ct, NULL, job->timeout / HZ); > if (ret) > zfcp_fc_wka_port_put(wka_port); > > diff --git a/include/linux/bsg-lib.h b/include/linux/bsg-lib.h > index b1be0233ce35..402223c95ce1 100644 > --- a/include/linux/bsg-lib.h > +++ b/include/linux/bsg-lib.h > @@ -44,6 +44,8 @@ struct bsg_job { > > struct kref kref; > > + unsigned int timeout; > + > /* Transport/driver specific request/reply structs */ > void *request; > void *reply; > -- > 2.14.1 > Reviewed-by: Benjamin Block <bbl...@linux.vnet.ibm.com> Beste Grüße / Best regards, - Benjamin Block -- Linux on z Systems Development / IBM Systems & Technology Group IBM Deutschland Research & Development GmbH Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294
Re: [PATCH 1/2] bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer
On Wed, Sep 06, 2017 at 08:07:43AM -0600, Jens Axboe wrote: > On 09/06/2017 07:44 AM, Christoph Hellwig wrote: > > From: Benjamin Block <bbl...@linux.vnet.ibm.com> > > > > Since we split the scsi_request out of struct request bsg fails to > > provide a reply-buffer for the drivers. This was done via the pointer > > for sense-data, that is not preallocated anymore. > > Confused, this is already in master, was included before 4.13 was > finalized. > I think this is the backport for 4.11 and 4.12 as requested by Greg. I just send a mail as answer that I would do it, but I guess Christoph already had something in the pipe. Or? I'll take a look. Beste Grüße / Best regards, - Benjamin Block -- Linux on z Systems Development / IBM Systems & Technology Group IBM Deutschland Research & Development GmbH Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294
[PATCH v2 1/1] bsg-lib: fix use-after-free under memory-pressure
When under memory-pressure it is possible that the mempool which backs the 'struct request_queue' will make use of up to BLKDEV_MIN_RQ count emergency buffers - in case it can't get a regular allocation. These buffers are preallocated and once they are also used, they are re-supplied with old finished requests from the same request_queue (see mempool_free()). The bug is, when re-supplying the emergency pool, the old requests are not again ran through the callback mempool_t->alloc(), and thus also not through the callback bsg_init_rq(). Thus we skip initialization, and while the sense-buffer still should be good, scsi_request->cmd might have become to be an invalid pointer in the meantime. When the request is initialized in bsg.c, and the user's CDB is larger than BLK_MAX_CDB, bsg will replace it with a custom allocated buffer, which is freed when the user's command is finished, thus it dangles afterwards. When next a command is sent by the user that has a smaller/similar CDB as BLK_MAX_CDB, bsg will assume that scsi_request->cmd is backed by scsi_request->__cmd, will not make a custom allocation, and write into undefined memory. Fix this by splitting bsg_init_rq() into two functions: - bsg_init_rq() is changed to only do the allocation of the sense-buffer, which is used to back the bsg job's reply buffer. This pointer should never change during the lifetime of a scsi_request, so it doesn't need re-initialization. - bsg_initialize_rq() is a new function that makes use of 'struct request_queue's initialize_rq_fn callback (which was introduced in v4.12). This is always called before the request is given out via blk_get_request(). This function does the remaining initialization that was previously done in bsg_init_rq(), and will also do it when the request is taken from the emergency-pool of the backing mempool. Fixes: 50b4d485528d ("bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer") Cc: <sta...@vger.kernel.org> # 4.11+ Reviewed-by: Christoph Hellwig <h...@lst.de> Signed-off-by: Benjamin Block <bbl...@linux.vnet.ibm.com> --- Notes: I did test this on zFCP with FC CT commands send via the ioctl() and write() system-call. That did work fine. But I would very much appreciate if anyone could run this against an other HBA or even an other implementer of bsg-lib, such as now SAS, because I have no access to such hardware here. This should make no difference to the normal cases - where each request is allocated via slab - with- or without this patch; if I didn't miss anything. Only the order is a bit mixed up - the memset is done after the sense-allocation, so I have to buffer the sense-pointer for that. But otherwise there is no difference I am aware of, so it should behave the same (does for me). I could not reproduce the memory-pressure case here in the lab.. I don't see any reason why it should work now, but I am open to suggestions :) Beste Grüße / Best regards, - Benjamin Block Changes in v2: - Renamed function names to reflect the function-pointer names in request_queue block/bsg-lib.c | 29 ++--- 1 file changed, 22 insertions(+), 7 deletions(-) diff --git a/block/bsg-lib.c b/block/bsg-lib.c index c82408c7cc3c..ff5a158b22fa 100644 --- a/block/bsg-lib.c +++ b/block/bsg-lib.c @@ -208,20 +208,34 @@ static int bsg_init_rq(struct request_queue *q, struct request *req, gfp_t gfp) struct bsg_job *job = blk_mq_rq_to_pdu(req); struct scsi_request *sreq = >sreq; - memset(job, 0, sizeof(*job)); + /* called right after the request is allocated for the request_queue */ - scsi_req_init(sreq); - sreq->sense_len = SCSI_SENSE_BUFFERSIZE; - sreq->sense = kzalloc(sreq->sense_len, gfp); + sreq->sense = kzalloc(SCSI_SENSE_BUFFERSIZE, gfp); if (!sreq->sense) return -ENOMEM; + return 0; +} + +static void bsg_initialize_rq(struct request *req) +{ + struct bsg_job *job = blk_mq_rq_to_pdu(req); + struct scsi_request *sreq = >sreq; + void *sense = sreq->sense; + + /* called right before the request is given to the request_queue user */ + + memset(job, 0, sizeof(*job)); + + scsi_req_init(sreq); + + sreq->sense = sense; + sreq->sense_len = SCSI_SENSE_BUFFERSIZE; + job->req = req; - job->reply = sreq->sense; + job->reply = sense; job->reply_len = sreq->sense_len; job->dd_data = job + 1; - - return 0; } static void bsg_exit_rq(struct request_queue *q, struct request *req) @@ -252,6 +266,7 @@ struct request_queue *bsg_setup_queue(struct device *dev, const char *name, q->cmd_size = si
Re: [PATCH 1/1] bsg-lib: fix use-after-free under memory-pressure
On Mon, Sep 25, 2017 at 08:53:07AM -0700, Christoph Hellwig wrote: > > if (!q) > > return ERR_PTR(-ENOMEM); > > q->cmd_size = sizeof(struct bsg_job) + dd_job_size; > > - q->init_rq_fn = bsg_init_rq; > > - q->exit_rq_fn = bsg_exit_rq; > > + q->init_rq_fn = bsg_init_job; > > + q->exit_rq_fn = bsg_exit_job; > > + q->initialize_rq_fn = bsg_init_rq; > > Please use function names that match the method names, that is keep > the existing names and name the new helper bsg_initialize_rq; > OK, I can change that. Beste Grüße / Best regards, - Benjamin Block > > Except for that the patch looks fine to me: > > Reviewed-by: Christoph Hellwig <h...@lst.de> > -- Linux on z Systems Development / IBM Systems & Technology Group IBM Deutschland Research & Development GmbH Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294
Re: [PATCH v2 1/1] bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer
On Thu, Aug 24, 2017 at 10:45:56AM +0200, Christoph Hellwig wrote: > > /** > > - * bsg_destroy_job - routine to teardown/delete a bsg job > > + * bsg_teardown_job - routine to teardown a bsg job > > * @job: bsg_job that is to be torn down > > */ > > -static void bsg_destroy_job(struct kref *kref) > > +static void bsg_teardown_job(struct kref *kref) > > Why this rename? The destroy name seems to be one of the most > common patterns for the kref_put callbacks. > Hmm, I did it mostly so it is symmetric with bsg_prepare_job() and it doesn't really itself destroy the job-struct anymore. If there are other thing amiss I can change that along with them, if it bothers poeple. Beste Grüße / Best regards, - Benjamin Block > > Otherwise this looks fine: > > Reviewed-by: Christoph Hellwig <h...@lst.de> > -- Linux on z Systems Development / IBM Systems & Technology Group IBM Deutschland Research & Development GmbH Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294
[PATCH v2 0/1] bsg: fix regression resulting in panics when sending commands via BSG
Hello all, This is the second try for fixing the regression in the BSG-interface that exists since v4.11 (for more infos see the first series). I separated my other changes from the bug-fix so that it is easier to apply if judged good. I will rebase my cleanups I sent in v1 and send them when I get a bit more time. But the regression-fix is more important, so here's that. I did some more tests on it than on v1, including some heavy parallel I/O on the same blk-queue using both BSG and the normal SCSI-stack at the same time (throwing some intentional bad commands in it too). That seemed to work all well enough - i.e. it didn't crash and got the expected results. I haven't done any external error-inject, but IMO that would be beyond the scope right now. The fix is based on Christoph's idea, I discussed this with him off-list already. I rebased the series on Jens' for-next. Reviews are more than welcome :) Beste Grüße / Best regards, - Benjamin Block Benjamin Block (1): bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer block/bsg-lib.c | 74 + include/linux/blkdev.h | 1 - include/linux/bsg-lib.h | 2 ++ 3 files changed, 46 insertions(+), 31 deletions(-) -- Linux on z Systems Development / IBM Systems & Technology Group IBM Deutschland Research & Development GmbH Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294
[PATCH v2 1/1] bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer
Since we split the scsi_request out of struct request bsg fails to provide a reply-buffer for the drivers. This was done via the pointer for sense-data, that is not preallocated anymore. Failing to allocate/assign it results in illegal dereferences because LLDs use this pointer unquestioned. An example panic on s390x, using the zFCP driver, looks like this (I had debugging on, otherwise NULL-pointer dereferences wouldn't even panic on s390x): Unable to handle kernel pointer dereference in virtual kernel address space Failing address: 6b6b6b6b6b6b6000 TEID: 6b6b6b6b6b6b6403 Fault in home space mode while using kernel ASCE. AS:01590007 R3:0024 Oops: 0038 ilc:2 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.12.0-bsg-regression+ #3 Hardware name: IBM 2964 N96 702 (z/VM 6.4.0) task: 65cb0100 task.stack: 65cb4000 Krnl PSW : 0704e0018000 03ff801e4156 (zfcp_fc_ct_els_job_handler+0x16/0x58 [zfcp]) R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3 Krnl GPRS: 0001 5fa9d0d0 5fa9d078 00e16866 03ff0290 6b6b6b6b6b6b6b6b 59f78f00 000f 593a0958 593a0958 60d88800 5ddd4c38 58b50100 0700659cba08 03ff801e8556 659cb9a8 Krnl Code: 03ff801e4146: e3102054lg %r1,80(%r2) 03ff801e414c: 58402040 l %r4,64(%r2) #03ff801e4150: e3502024 lg %r5,32(%r2) >03ff801e4156: 50405004 st %r4,4(%r5) 03ff801e415a: e54c5008 mvhi8(%r5),0 03ff801e4160: e33010280012 lt %r3,40(%r1) 03ff801e4166: a718fffb lhi %r1,-5 03ff801e416a: 1803 lr %r0,%r3 Call Trace: ([<03ff801e8556>] zfcp_fsf_req_complete+0x726/0x768 [zfcp]) [<03ff801ea82a>] zfcp_fsf_reqid_check+0x102/0x180 [zfcp] [<03ff801eb980>] zfcp_qdio_int_resp+0x230/0x278 [zfcp] [<009b91b6>] qdio_kick_handler+0x2ae/0x2c8 [<009b9e3e>] __tiqdio_inbound_processing+0x406/0xc10 [<001684c2>] tasklet_action+0x15a/0x1d8 [<00bd28ec>] __do_softirq+0x3ec/0x848 [<001675a4>] irq_exit+0x74/0xf8 [<0010dd6a>] do_IRQ+0xba/0xf0 [<00bd19e8>] io_int_handler+0x104/0x2d4 [<001033b6>] enabled_wait+0xb6/0x188 ([<0010339e>] enabled_wait+0x9e/0x188) [<0010396a>] arch_cpu_idle+0x32/0x50 [<00bd0112>] default_idle_call+0x52/0x68 [<001cd0fa>] do_idle+0x102/0x188 [<001cd41e>] cpu_startup_entry+0x3e/0x48 [<00118c64>] smp_start_secondary+0x11c/0x130 [<00bd2016>] restart_int_handler+0x62/0x78 [<>] (null) INFO: lockdep is turned off. Last Breaking-Event-Address: [<03ff801e41d6>] zfcp_fc_ct_job_handler+0x3e/0x48 [zfcp] Kernel panic - not syncing: Fatal exception in interrupt This patch moves bsg-lib to allocate and setup struct bsg_job ahead of time, including the allocation of a buffer for the reply-data. This means, struct bsg_job is not allocated separately anymore, but as part of struct request allocation - similar to struct scsi_cmd. Reflect this in the function names that used to handle creation/destruction of struct bsg_job. Reported-by: Steffen Maier <ma...@linux.vnet.ibm.com> Suggested-by: Christoph Hellwig <h...@lst.de> Signed-off-by: Benjamin Block <bbl...@linux.vnet.ibm.com> Fixes: 82ed4db499b8 ("block: split scsi_request out of struct request") Cc: <sta...@vger.kernel.org> #4.11+ --- block/bsg-lib.c | 74 + include/linux/blkdev.h | 1 - include/linux/bsg-lib.h | 2 ++ 3 files changed, 46 insertions(+), 31 deletions(-) diff --git a/block/bsg-lib.c b/block/bsg-lib.c index c4513b23f57a..dd56d7460cb9 100644 --- a/block/bsg-lib.c +++ b/block/bsg-lib.c @@ -29,26 +29,25 @@ #include /** - * bsg_destroy_job - routine to teardown/delete a bsg job + * bsg_teardown_job - routine to teardown a bsg job * @job: bsg_job that is to be torn down */ -static void bsg_destroy_job(struct kref *kref) +static void bsg_teardown_job(struct kref *kref) { struct bsg_job *job = container_of(kref, struct bsg_job, kref); struct request *rq = job->req; - blk_end_request_all(rq, BLK_STS_OK); - put_device(job->dev); /* release reference for the request */ kfree(job->request_payload.sg_list); kfree(job->reply_payload.sg_list); - kfree(job); + + blk_end_request_all(rq, BLK_STS_OK); } void bsg_job_put(struct bsg_job *job) { - kref_put(>kref, bsg_destroy_job); + kref_p
Re: [PATCH 9/9] bsg: split handling of SCSI CDBs vs transport requeues
> +static int bsg_transport_complete_rq(struct request *rq, struct sg_io_v4 > *hdr) > +{ > + struct bsg_job *job = blk_mq_rq_to_pdu(rq); > + int ret = 0; > + > + /* > + * The assignments below don't make much sense, but are kept for > + * bug by bug backwards compatibility: > + */ > + hdr->device_status = job->result & 0xff; > + hdr->transport_status = host_byte(job->result); > + hdr->driver_status = driver_byte(job->result); > + hdr->info = 0; > + if (hdr->device_status || hdr->transport_status || hdr->driver_status) > + hdr->info |= SG_INFO_CHECK; > + hdr->response_len = 0; > + > + if (job->result < 0) { > + /* we're only returning the result field in the reply */ > + job->reply_len = sizeof(u32); > + ret = job->result; > + } > + > + if (job->reply_len && hdr->response) { > + int len = min(hdr->max_response_len, job->reply_len); > + > + if (copy_to_user(ptr64(hdr->response), job->reply, len)) > + ret = -EFAULT; > + else > + hdr->response_len = len; > + } > + > + /* we assume all request payload was transferred, residual == 0 */ > + hdr->dout_resid = 0; > + > + if (rq->next_rq) { > + unsigned int rsp_len = blk_rq_bytes(rq->next_rq); > + > + if (WARN_ON(job->reply_payload_rcv_len > rsp_len)) This gives my a lot of new Warnings when running my tests on zFCP, non of which happen when I run on 4.13. After browsing the source some, I figured this is because in comparison to the old code, blk_rq_bytes() is now called after we finished the blk-request already and blk_update_bidi_request()+blk_update_request() was already called. This will update rq->__data_len, and thus the call here to get rsp_len will get a wrong value. Thus the warnings, and the following calculation is actually wrong. I figure you can just replace this call for blk_rq_bytes() with 'job->reply_payload.payload_len'. Its essentially the same value, but the later is not changed after the job is committed as far as I could see in the source. Driver makes use of it, but only reading as far as I could make out after browsing the code for a bit. I did a quick test with that change in place and that seems to work fine now. As far as my tests go, they behave as they did before. Beste Grüße / Best regards, - Benjamin Block > + hdr->din_resid = 0; > + else > + hdr->din_resid = rsp_len - job->reply_payload_rcv_len; > + } else { > + hdr->din_resid = 0; > + } > + > + return ret; > +} > + -- Linux on z Systems Development / IBM Systems & Technology Group IBM Deutschland Research & Development GmbH Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294
Re: [PATCH 9/9] bsg: split handling of SCSI CDBs vs transport requeues
b.c > @@ -2105,8 +2105,6 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct > request_queue *q) > { > struct device *dev = shost->dma_dev; > > - queue_flag_set_unlocked(QUEUE_FLAG_SCSI_PASSTHROUGH, q); > - > /* >* this limit is imposed by hardware restrictions >*/ > @@ -2202,6 +2200,7 @@ struct request_queue *scsi_old_alloc_queue(struct > scsi_device *sdev) > } > > __scsi_init_queue(shost, q); > + queue_flag_set_unlocked(QUEUE_FLAG_SCSI_PASSTHROUGH, q); > blk_queue_prep_rq(q, scsi_prep_fn); > blk_queue_unprep_rq(q, scsi_unprep_fn); > blk_queue_softirq_done(q, scsi_softirq_done); > @@ -2231,6 +2230,7 @@ struct request_queue *scsi_mq_alloc_queue(struct > scsi_device *sdev) > > sdev->request_queue->queuedata = sdev; > __scsi_init_queue(sdev->host, sdev->request_queue); > + queue_flag_set_unlocked(QUEUE_FLAG_SCSI_PASSTHROUGH, > sdev->request_queue); > return sdev->request_queue; > } > > diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c > index bf53356f41f0..bfb4e6875643 100644 > --- a/drivers/scsi/scsi_sysfs.c > +++ b/drivers/scsi/scsi_sysfs.c > @@ -1262,8 +1262,7 @@ int scsi_sysfs_add_sdev(struct scsi_device *sdev) > transport_add_device(>sdev_gendev); > sdev->is_visible = 1; > > - error = bsg_register_queue(rq, >sdev_gendev, NULL, NULL); > - > + error = bsg_scsi_register_queue(rq, >sdev_gendev); > if (error) > /* we're treating error on bsg register as non-fatal, >* so pretend nothing went wrong */ > diff --git a/drivers/scsi/scsi_transport_sas.c > b/drivers/scsi/scsi_transport_sas.c > index 736a1f4f9676..4889bd432382 100644 > --- a/drivers/scsi/scsi_transport_sas.c > +++ b/drivers/scsi/scsi_transport_sas.c > @@ -228,7 +228,6 @@ static int sas_bsg_initialize(struct Scsi_Host *shost, > struct sas_rphy *rphy) >*/ > blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH); > queue_flag_set_unlocked(QUEUE_FLAG_BIDI, q); > - queue_flag_set_unlocked(QUEUE_FLAG_SCSI_PASSTHROUGH, q); > return 0; > } > > diff --git a/include/linux/bsg-lib.h b/include/linux/bsg-lib.h > index 08762d297cbd..28a7ccc55c89 100644 > --- a/include/linux/bsg-lib.h > +++ b/include/linux/bsg-lib.h > @@ -38,7 +38,6 @@ struct bsg_buffer { > }; > > struct bsg_job { > - struct scsi_request sreq; > struct device *dev; > > struct kref kref; > @@ -64,6 +63,9 @@ struct bsg_job { > struct bsg_buffer request_payload; > struct bsg_buffer reply_payload; > > + int result; > + unsigned int reply_payload_rcv_len; > + > void *dd_data; /* Used for driver-specific storage */ > }; > > diff --git a/include/linux/bsg.h b/include/linux/bsg.h > index 7173f6e9d2dd..ed98946a8324 100644 > --- a/include/linux/bsg.h > +++ b/include/linux/bsg.h > @@ -1,33 +1,42 @@ > -#ifndef BSG_H > -#define BSG_H > +#ifndef _LINUX_BSG_H > +#define _LINUX_BSG_H > > #include > > +struct request; > + > +#ifdef CONFIG_BLK_DEV_BSG > +struct bsg_ops { > + int (*check_proto)(struct sg_io_v4 *hdr); > + int (*fill_hdr)(struct request *rq, struct sg_io_v4 *hdr, > + fmode_t mode); > + int (*complete_rq)(struct request *rq, struct sg_io_v4 *hdr); > + void(*free_rq)(struct request *rq); > +}; > > -#if defined(CONFIG_BLK_DEV_BSG) > struct bsg_class_device { > struct device *class_dev; > struct device *parent; > int minor; > struct request_queue *queue; > struct kref ref; > + const struct bsg_ops *ops; > void (*release)(struct device *); > }; > > -extern int bsg_register_queue(struct request_queue *q, > - struct device *parent, const char *name, > - void (*release)(struct device *)); > -extern void bsg_unregister_queue(struct request_queue *); > +int bsg_register_queue(struct request_queue *q, struct device *parent, > + const char *name, const struct bsg_ops *ops, > + void (*release)(struct device *)); > +int bsg_scsi_register_queue(struct request_queue *q, struct device *parent); > +void bsg_unregister_queue(struct request_queue *); > #else > -static inline int bsg_register_queue(struct request_queue *q, > - struct device *parent, const char *name, > - void (*release)(struct device *)) > +static inline int bsg_scsi_register_queue(struct request_queue *q, > + struct device *parent) > { > return 0; > } > static inline void bsg_unregister_queue(struct request_queue *q) > { > } > -#endif > - > -#endif > +#endif /* CONFIG_BLK_DEV_BSG */ > +#endif /* _LINUX_BSG_H */ > -- > 2.14.1 > Otherwise I haven't seen anything. I'll run it through my tests on zFCP and look whether I see something strange happening. I like the idea of splitting things up here, it gets rid of some code-duplications and unnecessary double book-keeping. Although I have to say, looking at functions like bsg_transport_complete_rq() and bsg_scsi_complete_rq() it also creates some new duplications that haven't been there before. Beste Grüße / Best regards, - Benjamin Block -- Linux on z Systems Development / IBM Systems & Technology Group IBM Deutschland Research & Development GmbH Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294
Re: [PATCH 9/9] bsg: split handling of SCSI CDBs vs transport requeues
On Fri, Oct 20, 2017 at 06:26:30PM +0200, Christoph Hellwig wrote: > On Thu, Oct 19, 2017 at 05:59:33PM +0200, Benjamin Block wrote: > > > +#define ptr64(val) ((void __user *)(uintptr_t)(val)) > > > > Better to reflect the special property, that it is a user pointer, in > > the name of the macro. Maybe something like user_ptr(64). The same > > comment for the same macro in bsg.c. > > Not sure it's worth it especially now that Martin has merged the patch. He did? I only saw a mail that he picked patches 2-5. So all the bsg changes are still open I think. (Maybe I just missed that, I haven't exactly followed the list very closely as of late) > But given how many interface we have all over the kernel that use a u64 > to store a user pointer in ioctls and similar it might make sense to > lift a helper like this to a generic header. In that case we'll need > a more descriptive name for sure. > > > > +static int bsg_transport_check_proto(struct sg_io_v4 *hdr) > > > +{ > > > + if (hdr->protocol != BSG_PROTOCOL_SCSI || > > > + hdr->subprotocol != BSG_SUB_PROTOCOL_SCSI_TRANSPORT) > > > + return -EINVAL; > > > + if (!capable(CAP_SYS_RAWIO)) > > > + return -EPERM; > > > > Any particular reason why this is not symmetric with bsg_scsi? IOW > > permission checking done in bsg_transport_fill_hdr(), like it is done in > > bsg_scsi_fill_hdr()? > > > > We might save some time copying memory with this (we also only talk > > about ~20 bytes here), but on the other hand the interface would be more > > clean otherwise IMO (if we already do restructure the interface) - > > similar callbacks have similar responsibilities. > > I could move the capable check around, no sure why I had done it that > way, it's been a while. Probably because blk_verify_command needs the > CDB while a simple capable() check does not. That was my guess, too. I just though it would be more consistent otherwise. Its not a big thing, really. Beste Grüße / Best regards, - Benjamin Block -- Linux on z Systems Development / IBM Systems & Technology Group IBM Deutschland Research & Development GmbH Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294