[BUG] scsi/fcoe: Sleep-in-atomic bugs in fcoe driver
According to fcoe_ctlr.c, the driver may sleep under an RCU lock, and
the function call paths are:

fcoe_ctlr_disc_stop_locked (acquire the RCU lock)
  fc_rport_logoff
    mutex_lock --> may sleep

fcoe_ctlr_vn_disc
  fc_rport_login
    mutex_lock --> may sleep

fcoe_ctlr_vn_age
  fc_rport_logoff
    mutex_lock --> may sleep

These bugs are found by my static analysis tool and my code review.

Thanks,
Jia-Ju Bai
[BUG] scsi/libfc: Sleep-in-atomic bugs in libfc
According to fc_disc.c, the driver may sleep under an RCU lock, and
the function call paths are:

fc_disc_done (acquire the RCU lock)
  fc_rport_logoff
    mutex_lock --> may sleep

fc_disc_done (acquire the RCU lock)
  fc_rport_login
    mutex_lock --> may sleep

fc_disc_stop_rports
  fc_rport_logoff
    mutex_lock --> may sleep

These bugs are found by my static analysis tool and my code review.

Thanks,
Jia-Ju Bai
Re: [PATCH] ibmvscsis: Fix write_pending failure path
Bryant,

> For write_pending if the queue is down or client failed then return
> -EIO so that LIO can properly process the completed command. Prior we
> returned 0 since LIO could not handle it properly. Now with: target:
> Fix unknown fabric callback queue-full errors that patch addresses
> LIO's ability to handle things right.

Applied to 4.14/scsi-fixes. thanks!

--
Martin K. Petersen	Oracle Linux Engineering
[PATCH] scsi/fnic: Fix a sleep-in-atomic bug in fnic_handle_event
The driver may sleep under a spinlock, and the function call path is:

fnic_handle_event (acquire the spinlock)
  fnic_fcoe_start_fcf_disc
    fcoe_ctlr_link_up
      mutex_lock --> may sleep

To fix it, the spinlock can be released before fnic_fcoe_start_fcf_disc,
and acquired again after this function.

This bug is found by my static analysis tool and my code review.

Signed-off-by: Jia-Ju Bai
---
 drivers/scsi/fnic/fnic_fcs.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/scsi/fnic/fnic_fcs.c b/drivers/scsi/fnic/fnic_fcs.c
index 999fc75..4c99c96 100644
--- a/drivers/scsi/fnic/fnic_fcs.c
+++ b/drivers/scsi/fnic/fnic_fcs.c
@@ -265,7 +265,9 @@ void fnic_handle_event(struct work_struct *work)
 	case FNIC_EVT_START_FCF_DISC:
 		FNIC_FCS_DBG(KERN_DEBUG, fnic->lport->host,
 			  "Start FCF Discovery\n");
+		spin_unlock_irqrestore(&fnic->fnic_lock, flags);
 		fnic_fcoe_start_fcf_disc(fnic);
+		spin_lock_irqsave(&fnic->fnic_lock, flags);
 		break;
 	default:
 		FNIC_FCS_DBG(KERN_DEBUG, fnic->lport->host,
--
1.7.9.5
Re: [PATCH v2 00/21] lpfc updates for 11.4.0.4
James,

> This patch set provides a number of bug fixes and additions to
> the driver.
>
> The patches were cut against the Martin's 4.14/scsi-queue tree.
> There are no outside dependencies.

Applied to 4.15/scsi-queue. I got a conflict in patch 8 in the "Process
all the event on FCP fast-path EQ" hunk. Please verify.

--
Martin K. Petersen	Oracle Linux Engineering
Re: [Patch v2 1/2] libiscsi: Fix use-after-free race during iscsi_session_teardown
Khazhismel,

> Noticed these don't seem to be in 4.14/scsi-queue

Not sure what happened there. I apologize. They are now in
4.14/scsi-fixes.

--
Martin K. Petersen	Oracle Linux Engineering
Re: [PATCH 0/3] smartpqi updates
Don,

> These patches are based on Linus's tree
>
> The changes are:
> - update list of controllers
> - cleanup warning message
> - change driver version to 1.1.2-126

Applied to 4.15/scsi-queue.

--
Martin K. Petersen	Oracle Linux Engineering
Re: [PATCH 00/18] use ARRAY_SIZE macro
On Mon, 2 Oct 2017 15:22:24 -0400
bfie...@fieldses.org (J. Bruce Fields) wrote:

> Mainly I'd just like to know which you're asking for. Do you want me to
> apply this, or to ACK it so someone else can? If it's sent as a series
> I tend to assume the latter.
>
> But in this case I'm assuming it's the former, so I'll pick up the nfsd
> one

Could you apply the NFSD patch ("nfsd: use ARRAY_SIZE") with the
Reviewed-by: Jeff Layton tag, please? This patch is an individual patch
and it should not have been sent in a series (sorry about that).

Thank you,
Jérémy
Re: [PATCH V6 13/14] block: mq-deadline: Limit write request dispatch for zoned block devices
Bart,

On 10/3/17 08:44, Bart Van Assche wrote:
> On Mon, 2017-10-02 at 16:15 +0900, Damien Le Moal wrote:
>> When dispatching write requests to a zoned block device, only allow
>> requests targeting an unlocked zone. Requests targeting a locked zone
>> are left in the scheduler queue to preserve the initial write order.
>> If no write request can be dispatched, allow reads to be dispatched
>> even if the write batch is not done.
>>
>> To ensure that the search for an appropriate write request is atomic
>> in deadline_fifo_request() and deadline_next_request() with reagrd to
>                                                              ^^
> regard?

Will fix.

>> write requests zone lock state, introduce the spinlock zone_lock.
>> Holding this lock while doing the search in these functions as well as
>> when unlocking the target zone of a completed write request in
>> dd_completed_request() ensure that the search does not pickup a write
>> request in the middle of a zone queued write sequence.
>
> Since there is already a spinlock in the mq-deadline scheduler that
> serializes most operations, do we really need a new spinlock?

The dd->lock spinlock is used without IRQ disabling. So it does not
protect against request completion method calls. That spinlock being a
"big lock", I did not want to use it with IRQ disabled.

>> /*
>>  * Write unlock the target zone of a write request.
>> + * Clearing the target zone write lock bit is done with the scheduler zone_lock
>> + * spinlock held so that deadline_next_request() and deadline_fifo_request()
>> + * cannot see the lock state of a zone change due to a request completion during
>> + * their eventual search for an appropriate write request. Otherwise, for a zone
>> + * with multiple write requests queued, a non sequential write request
>> + * can be chosen.
>>  */
>> static void deadline_wunlock_zone(struct deadline_data *dd,
>> 				  struct request *rq)
>> {
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&dd->zone_lock, flags);
>> +
>> 	WARN_ON_ONCE(!test_and_clear_bit(blk_rq_zone_no(rq), dd->zones_wlock));
>> 	deadline_clear_request_zone_wlock(rq);
>> +
>> +	spin_unlock_irqrestore(&dd->zone_lock, flags);
>> }
>
> Is a request removed from the sort and fifo lists before being
> dispatched? If so, does that mean that obtaining zone_lock in the
> above function is not necessary?

Yes, a request is removed from the sort tree and fifo list before
dispatching. But the dd->zone_lock spinlock is not there to protect
that, dd->lock protects the sort tree and fifo list. dd->zone_lock was
added to prevent the completed_request() method from changing a zone
lock state while deadline_fifo_requests() or deadline_next_request()
are running.

Imagine this case: write request A for a zone Z is being executed (it
was dispatched) so Z is locked. Dispatch runs and inspects the next
requests in sort order. Let's say we have the sequential writes B, C,
D, E queued for the same zone Z. First B is inspected and cannot be
dispatched (zone locked). Inspection moves on to C, but before that, A
completes and Z is unlocked. Then C will be OK to go as the zone is now
unlocked. But it is the wrong choice as it will result in an out of
order write. B must be the next request dispatched after A.
dd->zone_lock prevents this from happening. Without this spinlock, the
bad example case above happens very easily.
>> static struct request *
>> deadline_fifo_request(struct deadline_data *dd, int data_dir)
>> {
>> +	struct request *rq;
>> +	unsigned long flags;
>> +
>> 	if (WARN_ON_ONCE(data_dir != READ && data_dir != WRITE))
>> 		return NULL;
>>
>> 	if (list_empty(&dd->fifo_list[data_dir]))
>> 		return NULL;
>>
>> -	return rq_entry_fifo(dd->fifo_list[data_dir].next);
>> +	if (!dd->zones_wlock || data_dir == READ)
>> +		return rq_entry_fifo(dd->fifo_list[data_dir].next);
>> +
>> +	spin_lock_irqsave(&dd->zone_lock, flags);
>> +
>> +	list_for_each_entry(rq, &dd->fifo_list[WRITE], queuelist) {
>> +		if (deadline_can_dispatch_request(dd, rq))
>> +			goto out;
>> +	}
>> +	rq = NULL;
>> +
>> +out:
>> +	spin_unlock_irqrestore(&dd->zone_lock, flags);
>> +
>> +	return rq;
>> }
>
> Same question here: is it really necessary to obtain zone_lock?

See above. Same comment.

Best regards.

--
Damien Le Moal, Western Digital
Re: [PATCH V6 13/14] block: mq-deadline: Limit write request dispatch for zoned block devices
On Mon, 2017-10-02 at 16:15 +0900, Damien Le Moal wrote:
> When dispatching write requests to a zoned block device, only allow
> requests targeting an unlocked zone. Requests targeting a locked zone
> are left in the scheduler queue to preserve the initial write order.
> If no write request can be dispatched, allow reads to be dispatched
> even if the write batch is not done.
>
> To ensure that the search for an appropriate write request is atomic
> in deadline_fifo_request() and deadline_next_request() with reagrd to
                                                             ^^
                                                             regard?
> write requests zone lock state, introduce the spinlock zone_lock.
> Holding this lock while doing the search in these functions as well as
> when unlocking the target zone of a completed write request in
> dd_completed_request() ensure that the search does not pickup a write
> request in the middle of a zone queued write sequence.

Since there is already a spinlock in the mq-deadline scheduler that
serializes most operations, do we really need a new spinlock?

> /*
>  * Write unlock the target zone of a write request.
> + * Clearing the target zone write lock bit is done with the scheduler zone_lock
> + * spinlock held so that deadline_next_request() and deadline_fifo_request()
> + * cannot see the lock state of a zone change due to a request completion during
> + * their eventual search for an appropriate write request. Otherwise, for a zone
> + * with multiple write requests queued, a non sequential write request
> + * can be chosen.
>  */
> static void deadline_wunlock_zone(struct deadline_data *dd,
> 				  struct request *rq)
> {
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&dd->zone_lock, flags);
> +
> 	WARN_ON_ONCE(!test_and_clear_bit(blk_rq_zone_no(rq), dd->zones_wlock));
> 	deadline_clear_request_zone_wlock(rq);
> +
> +	spin_unlock_irqrestore(&dd->zone_lock, flags);
> }

Is a request removed from the sort and fifo lists before being
dispatched? If so, does that mean that obtaining zone_lock in the above
function is not necessary?
> static struct request *
> deadline_fifo_request(struct deadline_data *dd, int data_dir)
> {
> +	struct request *rq;
> +	unsigned long flags;
> +
> 	if (WARN_ON_ONCE(data_dir != READ && data_dir != WRITE))
> 		return NULL;
>
> 	if (list_empty(&dd->fifo_list[data_dir]))
> 		return NULL;
>
> -	return rq_entry_fifo(dd->fifo_list[data_dir].next);
> +	if (!dd->zones_wlock || data_dir == READ)
> +		return rq_entry_fifo(dd->fifo_list[data_dir].next);
> +
> +	spin_lock_irqsave(&dd->zone_lock, flags);
> +
> +	list_for_each_entry(rq, &dd->fifo_list[WRITE], queuelist) {
> +		if (deadline_can_dispatch_request(dd, rq))
> +			goto out;
> +	}
> +	rq = NULL;
> +
> +out:
> +	spin_unlock_irqrestore(&dd->zone_lock, flags);
> +
> +	return rq;
> }

Same question here: is it really necessary to obtain zone_lock?

Thanks,

Bart.
Re: [PATCH V6 12/14] block: mq-deadline: Introduce zone locking support
On Mon, 2017-10-02 at 16:15 +0900, Damien Le Moal wrote:
> For a write request to a zoned block device, lock the request target
> zone upon request displatch. The zone is unlocked either when the
                    ^
                    dispatch?
> request completes or when the request is requeued (inserted).

Anyway:

Reviewed-by: Bart Van Assche
Re: [PATCH V6 10/14] block: mq-deadline: Add zoned block device data
On Mon, 2017-10-02 at 16:15 +0900, Damien Le Moal wrote:
> +	pr_info("mq-deadline: %s: zones write locking enabled\n",
> +		dev_name(q->backing_dev_info->dev));

Should this message perhaps be changed into "... zone write locking ..."
or "... per-zone write locking ..."?

Anyway:

Reviewed-by: Bart Van Assche
Re: [PATCH V5 02/14] scsi: sd_zbc: Fix comments and indentation
On Mon, 2017-09-25 at 15:14 +0900, Damien Le Moal wrote:
> + * sd_zbc_zone_no - Get the number of the zone conataining a sector.

Please fix the spelling in the above comment ("conataining").

Anyway:

Reviewed-by: Bart Van Assche
Re: [PATCH 00/18] use ARRAY_SIZE macro
On Mon, Oct 02, 2017 at 07:35:54AM +0200, Greg KH wrote:
> On Sun, Oct 01, 2017 at 08:52:20PM -0400, Jérémy Lefaure wrote:
> > On Mon, 2 Oct 2017 09:01:31 +1100
> > "Tobin C. Harding" wrote:
> > > > In order to reduce the size of the To: and Cc: lines, each patch of the
> > > > series is sent only to the maintainers and lists concerned by the patch.
> > > > This cover letter is sent to every list concerned by this series.
> > >
> > > Why don't you just send individual patches for each subsystem? I'm not a
> > > maintainer but I don't see how any one person is going to be able to
> > > apply this whole series, it is making it hard for maintainers if they
> > > have to pick patches out from among the series (if indeed any will
> > > bother doing that).
> > Yeah, maybe it would have been better to send individual patches.
> >
> > From my point of view it's a series because the patches are related (I
> > did a git format-patch from my local branch). But for the maintainers
> > point of view, they are individual patches.
>
> And the maintainers view is what matters here, if you wish to get your
> patches reviewed and accepted...

Mainly I'd just like to know which you're asking for. Do you want me to
apply this, or to ACK it so someone else can? If it's sent as a series
I tend to assume the latter.

But in this case I'm assuming it's the former, so I'll pick up the nfsd
one

--b.
[PATCH] ibmvscsis: Fix write_pending failure path
From: "Bryant G. Ly"

For write_pending if the queue is down or client failed then return
-EIO so that LIO can properly process the completed command. Prior we
returned 0 since LIO could not handle it properly. Now with:

target: Fix unknown fabric callback queue-full errors

that patch addresses LIO's ability to handle things right.

Signed-off-by: Bryant G. Ly
---
 drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c b/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c
index 1f75d0380516f..fe5b9d7bc06df 100644
--- a/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c
+++ b/drivers/scsi/ibmvscsi_tgt/ibmvscsi_tgt.c
@@ -3767,7 +3767,7 @@ static int ibmvscsis_write_pending(struct se_cmd *se_cmd)
 	 */
 	if ((vscsi->flags & (CLIENT_FAILED | RESPONSE_Q_DOWN))) {
 		pr_err("write_pending failed since: %d\n", vscsi->flags);
-		return 0;
+		return -EIO;
 	}

 	rc = srp_transfer_data(cmd, &vio_iu(iue)->srp.cmd, ibmvscsis_rdma,
--
2.13.5 (Apple Git-94)
Re: [PATCH V6 08/14] scsi: sd_zbc: Limit zone write locking to sequential zones
On Mon, 2017-10-02 at 16:15 +0900, Damien Le Moal wrote:
> + * There is no write constraints on conventional zones. So any write
           ^^
           are?

Anyway:

Reviewed-by: Bart Van Assche
Re: [PATCH V6 07/14] scsi: sd_zbc: Initialize device request queue zoned data
On Mon, 2017-10-02 at 16:15 +0900, Damien Le Moal wrote:
> Initialize the seq_zone_bitmap and nr_zones fields of the disk request
> queue on disk revalidate. As the seq_zone_bitmap allocation is
> identical to the allocation of the zone write lock bitmap, introduce
> the helper sd_zbc_alloc_zone_bitmap(). Using this helper, wait for the
> disk capacity and number of zones to stabilize on the second
> revalidation pass to allocate and initialize the bitmaps.

Reviewed-by: Bart Van Assche
Re: [PATCH V6 02/14] scsi: sd_zbc: Fix comments and indentation
On Mon, 2017-10-02 at 16:15 +0900, Damien Le Moal wrote:
> Fix comments style (use kernel-doc style) and content to clarify some
> functions. Also fix some functions signature indentation and remove a
> useless blank line in sd_zbc_read_zones().

Reviewed-by: Bart Van Assche
Re: [PATCH 00/18] use ARRAY_SIZE macro
On Sun, 1 Oct 2017 20:52:20 -0400
Jérémy Lefaure wrote:

> Anyway, I can tell to each maintainer that they can apply the patches
> they're concerned about and next time I may send individual patches.

In the case of media, we'll handle it as if they were individual ones.

Thanks,
Mauro
Re: [PATCH 1/3] scsi: ufs: Change HCI marco to actual bit position
On Sat, 2017-09-30 at 11:52 +0530, Alim Akhtar wrote:
> Currently UFS HCI uses UFS_BIT() macro to get various bit
> position for the hardware registers status bits. Which makes
> code longer instead of shorter. This macro does not improve
> code readability as well.
> Lets re-write these macro definition with the actual bit position.

Can the three patches in this series be combined into a single patch?

> -#define UFS_BIT(x)	(1L << (x))
> [ ... ]
> +#define UFS_BIT(x)	(1L << (x))

If you want to keep the three patches separate, please make sure that
this patch does not modify the UFS_BIT() definition. Please also fix
the spelling of the title of this patch (marco -> macro).

Thanks,

Bart.
Re: system hung up when offlining CPUs
On 09/16/2017 11:02 AM, Thomas Gleixner wrote:
> On Sat, 16 Sep 2017, Thomas Gleixner wrote:
>> On Thu, 14 Sep 2017, YASUAKI ISHIMATSU wrote:
>>> Here are one irq's info of megasas:
>>>
>>> - Before offline CPU
>>> /proc/irq/70/smp_affinity_list
>>> 24-29
>>>
>>> /proc/irq/70/effective_affinity
>>> ,,,,,,,,,,,,,,,3f00
>>>
>>> /sys/kernel/debug/irq/irqs/70
>>> handler: handle_edge_irq
>>> status: 0x4000
>>> istate: 0x
>>> ddepth: 0
>>> wdepth: 0
>>> dstate: 0x00609200
>>>          IRQD_ACTIVATED
>>>          IRQD_IRQ_STARTED
>>>          IRQD_MOVE_PCNTXT
>>>          IRQD_AFFINITY_SET
>>>          IRQD_AFFINITY_MANAGED
>>
>> So this uses managed affinity, which means that once the last CPU in the
>> affinity mask goes offline, the interrupt is shut down by the irq core
>> code, which is the case:
>>
>>> dstate: 0x00a39000
>>>          IRQD_IRQ_DISABLED
>>>          IRQD_IRQ_MASKED
>>>          IRQD_MOVE_PCNTXT
>>>          IRQD_AFFINITY_SET
>>>          IRQD_AFFINITY_MANAGED
>>>          IRQD_MANAGED_SHUTDOWN  <---
>>
>> So the irq core code works as expected, but something in the
>> driver/scsi/block stack seems to fiddle with that shut down queue.
>>
>> I only can tell about the inner workings of the irq code, but I have no
>> clue about the rest.
>
> Though there is something wrong here:
>
>> affinity: 24-29
>> effectiv: 24-29
>
> and after offlining:
>
>> affinity: 29
>> effectiv: 29
>
> But that should be:
>
>   affinity: 24-29
>   effectiv: 29
>
> because the irq core code preserves 'affinity'. It merily updates
> 'effective', which is where your interrupts are routed to.
>
> Is the driver issuing any set_affinity() calls? If so, that's wrong.
>
> Which driver are we talking about?

We are talking about megasas driver. So I added linux-scsi and
maintainers of megasas into the thread.

Thanks,
Yasuaki Ishimatsu

> Thanks,
>
> 	tglx
Re: [PATCHv6 2/5] scsi: Export blacklist flags to sysfs
On Mon, 2017-10-02 at 16:26 +0200, Hannes Reinecke wrote:
> Each scsi device is scanned according to the found blacklist flags,
> but this information is never presented to sysfs.
> This makes it quite hard to figure out if blacklisting worked as
> expected.
> With this patch we're exporting an additional attribute 'blacklist'
> containing the blacklist flags for this device.

Reviewed-by: Bart Van Assche
Re: [PATCH v2 1/1] bsg-lib: fix use-after-free under memory-pressure
Looks good: Reviewed-by: Christoph Hellwig
Re: [PATCH V7 5/6] block: support PREEMPT_ONLY
On Sat, 2017-09-30 at 14:12 +0800, Ming Lei wrote:
> +void blk_set_preempt_only(struct request_queue *q, bool preempt_only)
> +{
> +	blk_mq_freeze_queue(q);
> +	if (preempt_only)
> +		queue_flag_set_unlocked(QUEUE_FLAG_PREEMPT_ONLY, q);
> +	else
> +		queue_flag_clear_unlocked(QUEUE_FLAG_PREEMPT_ONLY, q);
> +	blk_mq_unfreeze_queue(q);
> +}
> +EXPORT_SYMBOL(blk_set_preempt_only);
> +
> /**
>  * __blk_run_queue_uncond - run a queue whether or not it has been stopped
>  * @q:	The queue to run
> @@ -771,9 +782,18 @@ int blk_queue_enter(struct request_queue *q, unsigned flags)
> 	while (true) {
> 		int ret;
>
> +		/*
> +		 * preempt_only flag has to be set after queue is frozen,
> +		 * so it can be checked here lockless and safely
> +		 */
> +		if (blk_queue_preempt_only(q)) {
> +			if (!(flags & BLK_REQ_PREEMPT))
> +				goto slow_path;
> +		}
> +
> 		if (percpu_ref_tryget_live(&q->q_usage_counter))
> 			return 0;

Sorry but I don't think that it is possible with these changes to
prevent that a non-preempt request gets allocated after a (SCSI) queue
has been quiesced. If the CPU that calls blk_queue_enter() observes the
set of the PREEMPT_ONLY flag after the queue has been unfrozen and
after the SCSI device state has been changed into QUIESCED then
blk_queue_enter() can succeed for a non-preempt request. I think this
is exactly the scenario we want to avoid. This is why a
synchronize_rcu() call is present in my patch before the queue is
unfrozen and also why in my patch the percpu_ref_tryget_live() call
occurs before the test of the PREEMPT_ONLY flag.

Bart.
[PATCH] bsg: update bsg_device_list to utilize static hash table implementation.
Updates bsg_device_list to utilize static hashtable implementation.
This is done in an effort to standardize the hashtables used across
the kernel.

Signed-off-by: Tim Hansen
---
 block/bsg.c | 20 ++--
 1 file changed, 6 insertions(+), 14 deletions(-)

diff --git a/block/bsg.c b/block/bsg.c
index ee1335c68de7..ecb93d733b56 100644
--- a/block/bsg.c
+++ b/block/bsg.c
@@ -66,8 +66,8 @@ enum {
 static DEFINE_MUTEX(bsg_mutex);
 static DEFINE_IDR(bsg_minor_idr);

-#define BSG_LIST_ARRAY_SIZE	8
-static struct hlist_head bsg_device_list[BSG_LIST_ARRAY_SIZE];
+#define BSG_HASHTABLE_BIT_SIZE	3
+static DEFINE_HASHTABLE(bsg_device_list, BSG_HASHTABLE_BIT_SIZE);

 static struct class *bsg_class;
 static int bsg_major;
@@ -130,11 +130,6 @@ static struct bsg_command *bsg_alloc_command(struct bsg_device *bd)
 	return bc;
 }

-static inline struct hlist_head *bsg_dev_idx_hash(int index)
-{
-	return &bsg_device_list[index & (BSG_LIST_ARRAY_SIZE - 1)];
-}
-
 static int blk_fill_sgv4_hdr_rq(struct request_queue *q, struct request *rq,
 				struct sg_io_v4 *hdr, struct bsg_device *bd,
 				fmode_t has_write_perm)
@@ -716,7 +711,7 @@ static int bsg_put_device(struct bsg_device *bd)
 		goto out;
 	}

-	hlist_del(&bd->dev_list);
+	hash_del(&bd->dev_list);
 	mutex_unlock(&bsg_mutex);

 	dprintk("%s: tearing down\n", bd->name);
@@ -770,7 +765,7 @@ static struct bsg_device *bsg_add_device(struct inode *inode,
 	atomic_set(&bd->ref_count, 1);
 	mutex_lock(&bsg_mutex);
-	hlist_add_head(&bd->dev_list, bsg_dev_idx_hash(iminor(inode)));
+	hash_add(bsg_device_list, &bd->dev_list, iminor(inode));

 	strncpy(bd->name, dev_name(rq->bsg_dev.class_dev), sizeof(bd->name) - 1);
 	dprintk("bound to <%s>, max queue %d\n",
@@ -786,7 +781,7 @@ static struct bsg_device *__bsg_get_device(int minor, struct request_queue *q)

 	mutex_lock(&bsg_mutex);

-	hlist_for_each_entry(bd, bsg_dev_idx_hash(minor), dev_list) {
+	hash_for_each_possible(bsg_device_list, bd, dev_list, minor) {
 		if (bd->queue == q) {
 			atomic_inc(&bd->ref_count);
 			goto found;
@@ -1042,7 +1037,7 @@ static char *bsg_devnode(struct device *dev, umode_t *mode)

 static int __init bsg_init(void)
 {
-	int ret, i;
+	int ret;
 	dev_t devid;

 	bsg_cmd_cachep = kmem_cache_create("bsg_cmd",
@@ -1052,9 +1047,6 @@ static int __init bsg_init(void)
 		return -ENOMEM;
 	}

-	for (i = 0; i < BSG_LIST_ARRAY_SIZE; i++)
-		INIT_HLIST_HEAD(&bsg_device_list[i]);
-
 	bsg_class = class_create(THIS_MODULE, "bsg");
 	if (IS_ERR(bsg_class)) {
 		ret = PTR_ERR(bsg_class);
--
2.14.2
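[Editorial note] The kernel's DEFINE_HASHTABLE()/hash_add()/hash_for_each_possible() API used by this patch boils down to a fixed power-of-two array of bucket chains, where a lookup only scans the bucket the key hashes to. A small userspace analogue with 2^3 buckets, matching BSG_HASHTABLE_BIT_SIZE 3 (illustrative names, not the kernel implementation):

```c
#include <assert.h>
#include <stddef.h>

/* Userspace sketch (assumption: miniature analogue of the kernel
 * hashtable.h API, not the real thing).  2^3 = 8 buckets; each bucket
 * is a singly linked chain, and collisions are handled by chaining. */

#define HASH_BITS 3
#define HASH_SIZE (1u << HASH_BITS)

struct bsg_dev {
	int minor;		/* hash key, as iminor(inode) is */
	struct bsg_dev *next;	/* chain within one bucket */
};

static struct bsg_dev *table[HASH_SIZE];

static unsigned int hash_minor(int minor)
{
	return (unsigned int)minor & (HASH_SIZE - 1);
}

/* Analogue of hash_add(): push onto the head of the key's bucket. */
static void hash_add_dev(struct bsg_dev *bd)
{
	unsigned int b = hash_minor(bd->minor);

	bd->next = table[b];
	table[b] = bd;
}

/* Analogue of hash_for_each_possible(): scan only one bucket. */
static struct bsg_dev *hash_find_dev(int minor)
{
	struct bsg_dev *bd;

	for (bd = table[hash_minor(minor)]; bd; bd = bd->next)
		if (bd->minor == minor)
			return bd;
	return NULL;
}
```

The point of the conversion is that the open-coded bucket array and bsg_dev_idx_hash() helper disappear behind a shared, already-debugged API, with no behavior change.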
Re: [PATCH V7 2/6] block: tracking request allocation with q_usage_counter
On Sat, 2017-09-30 at 14:12 +0800, Ming Lei wrote:
> @@ -1395,16 +1401,21 @@ static struct request *blk_old_get_request(struct request_queue *q,
> 					   unsigned int op, gfp_t gfp_mask)
> {
> 	struct request *rq;
> +	int ret = 0;
>
> 	WARN_ON_ONCE(q->mq_ops);
>
> 	/* create ioc upfront */
> 	create_io_context(gfp_mask, q->node);
>
> +	ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
> +	if (ret)
> +		return ERR_PTR(ret);

Can the above blk_queue_enter() call block if the REQ_NOWAIT flag has
been set in the op argument and e.g. gfp_mask == GFP_KERNEL? If so,
isn't that a bug?

Bart.
[PATCHv6 4/5] scsi_devinfo: Whitespace fixes
Signed-off-by: Hannes Reinecke --- drivers/scsi/scsi_devinfo.c | 44 ++-- 1 file changed, 22 insertions(+), 22 deletions(-) diff --git a/drivers/scsi/scsi_devinfo.c b/drivers/scsi/scsi_devinfo.c index 28fea83..6858ad8 100644 --- a/drivers/scsi/scsi_devinfo.c +++ b/drivers/scsi/scsi_devinfo.c @@ -304,8 +304,8 @@ static void scsi_strcpy_devinfo(char *name, char *to, size_t to_length, */ to[from_length] = '\0'; } else { - /* -* space pad the string if it is short. + /* +* space pad the string if it is short. */ strncpy(&to[from_length], spaces, to_length - from_length); @@ -325,10 +325,10 @@ static void scsi_strcpy_devinfo(char *name, char *to, size_t to_length, * @flags: if strflags NULL, use this flag value * * Description: - * Create and add one dev_info entry for @vendor, @model, @strflags or - * @flag. If @compatible, add to the tail of the list, do not space - * pad, and set devinfo->compatible. The scsi_static_device_list entries - * are added with @compatible 1 and @clfags NULL. + * Create and add one dev_info entry for @vendor, @model, @strflags or + * @flag. If @compatible, add to the tail of the list, do not space + * pad, and set devinfo->compatible. The scsi_static_device_list entries + * are added with @compatible 1 and @clfags NULL. * * Returns: 0 OK, -error on failure. **/ @@ -350,11 +350,11 @@ static int scsi_dev_info_list_add(int compatible, char *vendor, char *model, * @key: specify list to use * * Description: - * Create and add one dev_info entry for @vendor, @model, - * @strflags or @flag in list specified by @key. If @compatible, - * add to the tail of the list, do not space pad, and set - * devinfo->compatible. The scsi_static_device_list entries are - * added with @compatible 1 and @clfags NULL. + * Create and add one dev_info entry for @vendor, @model, + * @strflags or @flag in list specified by @key. If @compatible, + * add to the tail of the list, do not space pad, and set + * devinfo->compatible. 
The scsi_static_device_list entries are + * added with @compatible 1 and @clfags NULL. * * Returns: 0 OK, -error on failure. **/ @@ -405,7 +405,7 @@ int scsi_dev_info_list_add_keyed(int compatible, char *vendor, char *model, * * Description: * Finds the first dev_info entry matching @vendor, @model - * in list specified by @key. + * in list specified by @key. * * Returns: pointer to matching entry, or ERR_PTR on failure. **/ @@ -467,9 +467,9 @@ static struct scsi_dev_info_list *scsi_dev_info_list_find(const char *vendor, return devinfo; } else { if (!memcmp(devinfo->vendor, vendor, -sizeof(devinfo->vendor)) && -!memcmp(devinfo->model, model, - sizeof(devinfo->model))) + sizeof(devinfo->vendor)) && + !memcmp(devinfo->model, model, + sizeof(devinfo->model))) return devinfo; } } @@ -508,10 +508,10 @@ int scsi_dev_info_list_del_keyed(char *vendor, char *model, int key) * @dev_list: string of device flags to add * * Description: - * Parse dev_list, and add entries to the scsi_dev_info_list. - * dev_list is of the form "vendor:product:flag,vendor:product:flag". - * dev_list is modified via strsep. Can be called for command line - * addition, for proc or mabye a sysfs interface. + * Parse dev_list, and add entries to the scsi_dev_info_list. + * dev_list is of the form "vendor:product:flag,vendor:product:flag". + * dev_list is modified via strsep. Can be called for command line + * addition, for proc or mabye a sysfs interface. * * Returns: 0 if OK, -error on failure. **/ @@ -701,7 +701,7 @@ static int proc_scsi_devinfo_open(struct inode *inode, struct file *file) return seq_open(file, &scsi_devinfo_seq_ops); } -/* +/* * proc_scsi_dev_info_write - allow additions to scsi_dev_info_list via /proc. * * Description: Adds a black/white list entry for vendor and model with an @@ -840,8 +840,8 @@ int scsi_dev_info_remove_list(int key) * scsi_init_devinfo - set up the dynamic device list. 
* * Description: - * Add command line entries from scsi_dev_flags, then add - * scsi_static_device_list entries to the scsi device info list. + * Add command line entries from scsi_dev_flags, then add + * scsi_static_device_list entries to the scsi device info list. */ int __init scsi_init_devinfo(void) { -- 1.8
[PATCHv6 0/5] scsi: Fixup blacklist handling
Hi all,

the SCSI blacklist handling seems to be a rather tricky issue; every
time a fix is included it tends to break other devices. This patchset
attempts to simplify the devlist handling yet again, but this time
implementing the framework for regression testing, too. A patch adding
a regression test to the blktest suite will follow.

As usual, comments and reviews are welcome.

Changes to v1:
- Implement exact match for vendor string as suggested by Bart
- Straighten out issues pointed out by Alan Stern
- Reshuffle patches

Changes to v2:
- Simplify code as indicated by Alan Stern
- Display blacklist flags verbatim
- Reformat blacklist flags definition for better readability

Changes to v3:
- Add reviews from Alan Stern
- Generate blacklist flag definitions
- Include reviews from Martin Petersen

Changes to v4:
- Add reviews from Bart
- Rework blacklist flag string generation

Changes to v5:
- Add bounds checking for sysfs attribute
- Add review from Doug Gilbert

Hannes Reinecke (5):
  scsi_debug: allow to specify inquiry vendor and model
  scsi: Export blacklist flags to sysfs
  scsi_devinfo: Reformat blacklist flags
  scsi_devinfo: Whitespace fixes
  scsi_devinfo: fixup string compare

 drivers/scsi/Makefile       |  8 +
 drivers/scsi/scsi_debug.c   | 25 ++-
 drivers/scsi/scsi_devinfo.c | 67 ---
 drivers/scsi/scsi_scan.c    |  1 +
 drivers/scsi/scsi_sysfs.c   | 37 ++
 include/scsi/scsi_devinfo.h | 76 +
 6 files changed, 148 insertions(+), 66 deletions(-)

--
1.8.5.6
[PATCHv6 1/5] scsi_debug: allow to specify inquiry vendor and model
For testing purposes we need to be able to pass in the inquiry vendor
and model.

Signed-off-by: Hannes Reinecke
Reviewed-by: Bart Van Assche
Acked-by: Doug Gilbert
---
 drivers/scsi/scsi_debug.c | 25 +
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 09ba494..3c15f6b 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -953,9 +953,9 @@ static int fetch_to_dev_buffer(struct scsi_cmnd *scp, unsigned char *arr,
 }

-static const char * inq_vendor_id = "Linux ";
-static const char * inq_product_id = "scsi_debug ";
-static const char *inq_product_rev = "0186";	/* version less '.' */
+static char sdebug_inq_vendor_id[9] = "Linux ";
+static char sdebug_inq_product_id[17] = "scsi_debug ";
+static char sdebug_inq_product_rev[5] = "0186";	/* version less '.' */
 /* Use some locally assigned NAAs for SAS addresses. */
 static const u64 naa3_comp_a = 0x3220ULL;
 static const u64 naa3_comp_b = 0x3330ULL;
@@ -975,8 +975,8 @@ static int inquiry_vpd_83(unsigned char *arr, int port_group_id,
 	arr[0] = 0x2;	/* ASCII */
 	arr[1] = 0x1;
 	arr[2] = 0x0;
-	memcpy(&arr[4], inq_vendor_id, 8);
-	memcpy(&arr[12], inq_product_id, 16);
+	memcpy(&arr[4], sdebug_inq_vendor_id, 8);
+	memcpy(&arr[12], sdebug_inq_product_id, 16);
 	memcpy(&arr[28], dev_id_str, dev_id_str_len);
 	num = 8 + 16 + dev_id_str_len;
 	arr[3] = num;
@@ -1408,9 +1408,9 @@ static int resp_inquiry(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		arr[6] = 0x10;	/* claim: MultiP */
 	/* arr[6] |= 0x40; ... claim: EncServ (enclosure services) */
 	arr[7] = 0xa;	/* claim: LINKED + CMDQUE */
-	memcpy(&arr[8], inq_vendor_id, 8);
-	memcpy(&arr[16], inq_product_id, 16);
-	memcpy(&arr[32], inq_product_rev, 4);
+	memcpy(&arr[8], sdebug_inq_vendor_id, 8);
+	memcpy(&arr[16], sdebug_inq_product_id, 16);
+	memcpy(&arr[32], sdebug_inq_product_rev, 4);
 	/* version descriptors (2 bytes each) follow */
 	put_unaligned_be16(0xc0, arr + 58);   /* SAM-6 no version claimed */
 	put_unaligned_be16(0x5c0, arr + 60);  /* SPC-5 no version claimed */
@@ -4151,6 +4151,12 @@ static int schedule_resp(struct scsi_cmnd *cmnd, struct sdebug_dev_info *devip,
 module_param_named(fake_rw, sdebug_fake_rw, int, S_IRUGO | S_IWUSR);
 module_param_named(guard, sdebug_guard, uint, S_IRUGO);
 module_param_named(host_lock, sdebug_host_lock, bool, S_IRUGO | S_IWUSR);
+module_param_string(inq_vendor, sdebug_inq_vendor_id,
+		    sizeof(sdebug_inq_vendor_id), S_IRUGO|S_IWUSR);
+module_param_string(inq_product, sdebug_inq_product_id,
+		    sizeof(sdebug_inq_product_id), S_IRUGO|S_IWUSR);
+module_param_string(inq_rev, sdebug_inq_product_rev,
+		    sizeof(sdebug_inq_product_rev), S_IRUGO|S_IWUSR);
 module_param_named(lbpu, sdebug_lbpu, int, S_IRUGO);
 module_param_named(lbpws, sdebug_lbpws, int, S_IRUGO);
 module_param_named(lbpws10, sdebug_lbpws10, int, S_IRUGO);
@@ -4202,6 +4208,9 @@ static int schedule_resp(struct scsi_cmnd *cmnd, struct sdebug_dev_info *devip,
 MODULE_PARM_DESC(fake_rw, "fake reads/writes instead of copying (def=0)");
 MODULE_PARM_DESC(guard, "protection checksum: 0=crc, 1=ip (def=0)");
 MODULE_PARM_DESC(host_lock, "host_lock is ignored (def=0)");
+MODULE_PARM_DESC(inq_vendor, "SCSI INQUIRY vendor string (def=\"Linux\")");
+MODULE_PARM_DESC(inq_product, "SCSI INQUIRY product string (def=\"scsi_debug\")");
+MODULE_PARM_DESC(inq_rev, "SCSI INQUIRY revision string (def=\"0186\")");
 MODULE_PARM_DESC(lbpu, "enable LBP, support UNMAP command (def=0)");
 MODULE_PARM_DESC(lbpws, "enable LBP, support WRITE SAME(16) with UNMAP bit (def=0)");
 MODULE_PARM_DESC(lbpws10, "enable LBP, support WRITE SAME(10) with UNMAP bit (def=0)");
--
1.8.5.6
[PATCHv6 2/5] scsi: Export blacklist flags to sysfs
Each scsi device is scanned according to the found blacklist flags, but this information is never presented to sysfs. This makes it quite hard to figure out if blacklisting worked as expected. With this patch we're exporting an additional attribute 'blacklist' containing the blacklist flags for this device. Signed-off-by: Hannes Reinecke --- drivers/scsi/Makefile | 8 drivers/scsi/scsi_scan.c | 1 + drivers/scsi/scsi_sysfs.c | 37 + 3 files changed, 46 insertions(+) diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile index 93dbe58..c4298c7 100644 --- a/drivers/scsi/Makefile +++ b/drivers/scsi/Makefile @@ -191,6 +191,14 @@ clean-files := 53c700_d.h 53c700_u.h $(obj)/53c700.o $(MODVERDIR)/$(obj)/53c700.ver: $(obj)/53c700_d.h +$(obj)/scsi_sysfs.o: $(obj)/scsi_devinfo_tbl.c + +quiet_cmd_bflags = GEN $@ + cmd_bflags = sed -n 's/.*BLIST_\([A-Z0-9_]*\) *.*/BLIST_FLAG_NAME(\1),/p' $< > $@ + +$(obj)/scsi_devinfo_tbl.c: include/scsi/scsi_devinfo.h + $(call if_changed,bflags) + # If you want to play with the firmware, uncomment # GENERATE_FIRMWARE := 1 diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c index e7818af..26edd61 100644 --- a/drivers/scsi/scsi_scan.c +++ b/drivers/scsi/scsi_scan.c @@ -984,6 +984,7 @@ static int scsi_add_lun(struct scsi_device *sdev, unsigned char *inq_result, scsi_attach_vpd(sdev); sdev->max_queue_depth = sdev->queue_depth; + sdev->sdev_bflags = *bflags; /* * Ok, the device is now all set up, we can diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c index bf53356..fde3f44 100644 --- a/drivers/scsi/scsi_sysfs.c +++ b/drivers/scsi/scsi_sysfs.c @@ -20,6 +20,7 @@ #include #include #include +#include #include "scsi_priv.h" #include "scsi_logging.h" @@ -966,6 +967,41 @@ static DEVICE_ATTR(queue_depth, S_IRUGO | S_IWUSR, sdev_show_queue_depth, } static DEVICE_ATTR(wwid, S_IRUGO, sdev_show_wwid, NULL); +#define BLIST_FLAG_NAME(name) [ilog2(BLIST_##name)] = #name +static const char *const sdev_bflags_name[] = { +#include 
"scsi_devinfo_tbl.c" +}; +#undef BLIST_FLAG_NAME + +static ssize_t +sdev_show_blacklist(struct device *dev, struct device_attribute *attr, + char *buf) +{ + struct scsi_device *sdev = to_scsi_device(dev); + int i; + ssize_t len = 0; + + for (i = 0; i < sizeof(sdev->sdev_bflags) * BITS_PER_BYTE; i++) { + const char *name = NULL; + + if (!(sdev->sdev_bflags & BIT(i))) + continue; + if (i < ARRAY_SIZE(sdev_bflags_name) && sdev_bflags_name[i]) + name = sdev_bflags_name[i]; + + if (name) + len += snprintf(buf + len, PAGE_SIZE - len, + "%s%s", len ? " " : "", name); + else + len += snprintf(buf + len, PAGE_SIZE - len, + "%sINVALID_BIT(%d)", len ? " " : "", i); + } + if (len) + len += snprintf(buf + len, PAGE_SIZE - len, "\n"); + return len; +} +static DEVICE_ATTR(blacklist, S_IRUGO, sdev_show_blacklist, NULL); + #ifdef CONFIG_SCSI_DH static ssize_t sdev_show_dh_state(struct device *dev, struct device_attribute *attr, @@ -1151,6 +1187,7 @@ static umode_t scsi_sdev_bin_attr_is_visible(struct kobject *kobj, &dev_attr_queue_depth.attr, &dev_attr_queue_type.attr, &dev_attr_wwid.attr, + &dev_attr_blacklist.attr, #ifdef CONFIG_SCSI_DH &dev_attr_dh_state.attr, &dev_attr_access_state.attr, -- 1.8.5.6
[PATCHv6 3/5] scsi_devinfo: Reformat blacklist flags
Reformat blacklist flags to make the values easier to read and to enhance error checking. Signed-off-by: Hannes Reinecke Reviewed-by: Bart Van Assche --- include/scsi/scsi_devinfo.h | 76 + 1 file changed, 50 insertions(+), 26 deletions(-) diff --git a/include/scsi/scsi_devinfo.h b/include/scsi/scsi_devinfo.h index 9592570..7a2329c 100644 --- a/include/scsi/scsi_devinfo.h +++ b/include/scsi/scsi_devinfo.h @@ -3,31 +3,55 @@ /* * Flags for SCSI devices that need special treatment */ -#define BLIST_NOLUN 0x001 /* Only scan LUN 0 */ -#define BLIST_FORCELUN 0x002 /* Known to have LUNs, force scanning, - deprecated: Use max_luns=N */ -#define BLIST_BORKEN 0x004 /* Flag for broken handshaking */ -#define BLIST_KEY 0x008 /* unlock by special command */ -#define BLIST_SINGLELUN 0x010 /* Do not use LUNs in parallel */ -#define BLIST_NOTQ 0x020 /* Buggy Tagged Command Queuing */ -#define BLIST_SPARSELUN 0x040 /* Non consecutive LUN numbering */ -#define BLIST_MAX5LUN 0x080 /* Avoid LUNS >= 5 */ -#define BLIST_ISROM 0x100 /* Treat as (removable) CD-ROM */ -#define BLIST_LARGELUN 0x200 /* LUNs past 7 on a SCSI-2 device */ -#define BLIST_INQUIRY_36 0x400 /* override additional length field */ -#define BLIST_NOSTARTONADD 0x1000 /* do not do automatic start on add */ -#define BLIST_REPORTLUN2 0x20000 /* try REPORT_LUNS even for SCSI-2 devs - (if HBA supports more than 8 LUNs) */ -#define BLIST_NOREPORTLUN 0x40000 /* don't try REPORT_LUNS scan (SCSI-3 devs) */ -#define BLIST_NOT_LOCKABLE 0x80000 /* don't use PREVENT-ALLOW commands */ -#define BLIST_NO_ULD_ATTACH 0x100000 /* device is actually for RAID config */ -#define BLIST_SELECT_NO_ATN 0x200000 /* select without ATN */ -#define BLIST_RETRY_HWERROR 0x400000 /* retry HARDWARE_ERROR */ -#define BLIST_MAX_512 0x800000 /* maximum 512 sector cdb length */ -#define BLIST_NO_DIF 0x2000000 /* Disable T10 PI (DIF) */ -#define BLIST_SKIP_VPD_PAGES 0x4000000 /* Ignore SBC-3 VPD pages */ -#define BLIST_TRY_VPD_PAGES 0x10000000 /* Attempt to read VPD pages */ -#define BLIST_NO_RSOC 0x20000000 /* don't try to issue RSOC */ -#define BLIST_MAX_1024 0x40000000 /* maximum 1024 sector cdb length */ + +/* Only scan LUN 0 */ +#define BLIST_NOLUN ((__force __u32 __bitwise)(1 << 0)) +/* Known to have LUNs, force scanning. + * DEPRECATED: Use max_luns=N */ +#define BLIST_FORCELUN ((__force __u32 __bitwise)(1 << 1)) +/* Flag for broken handshaking */ +#define BLIST_BORKEN ((__force __u32 __bitwise)(1 << 2)) +/* unlock by special command */ +#define BLIST_KEY ((__force __u32 __bitwise)(1 << 3)) +/* Do not use LUNs in parallel */ +#define BLIST_SINGLELUN ((__force __u32 __bitwise)(1 << 4)) +/* Buggy Tagged Command Queuing */ +#define BLIST_NOTQ ((__force __u32 __bitwise)(1 << 5)) +/* Non consecutive LUN numbering */ +#define BLIST_SPARSELUN ((__force __u32 __bitwise)(1 << 6)) +/* Avoid LUNS >= 5 */ +#define BLIST_MAX5LUN ((__force __u32 __bitwise)(1 << 7)) +/* Treat as (removable) CD-ROM */ +#define BLIST_ISROM ((__force __u32 __bitwise)(1 << 8)) +/* LUNs past 7 on a SCSI-2 device */ +#define BLIST_LARGELUN ((__force __u32 __bitwise)(1 << 9)) +/* override additional length field */ +#define BLIST_INQUIRY_36 ((__force __u32 __bitwise)(1 << 10)) +/* do not do automatic start on add */ +#define BLIST_NOSTARTONADD ((__force __u32 __bitwise)(1 << 12)) +/* try REPORT_LUNS even for SCSI-2 devs (if HBA supports more than 8 LUNs) */ +#define BLIST_REPORTLUN2 ((__force __u32 __bitwise)(1 << 17)) +/* don't try REPORT_LUNS scan (SCSI-3 devs) */ +#define BLIST_NOREPORTLUN ((__force __u32 __bitwise)(1 << 18)) +/* don't use PREVENT-ALLOW commands */ +#define BLIST_NOT_LOCKABLE ((__force __u32 __bitwise)(1 << 19)) +/* device is actually for RAID config */ +#define BLIST_NO_ULD_ATTACH ((__force __u32 __bitwise)(1 << 20)) +/* select without ATN */ +#define BLIST_SELECT_NO_ATN ((__force __u32 __bitwise)(1 << 21)) +/* retry HARDWARE_ERROR */ +#define BLIST_RETRY_HWERROR ((__force __u32 __bitwise)(1 << 22)) +/* maximum 512 sector cdb length */ +#define BLIST_MAX_512 ((__force __u32 __bitwise)(1 << 23)) +/* Disable T10 PI (DIF) */ +#define BLIST_NO_DIF ((__force __u32 __bitwise)(1 << 25)) +/* Ignore SBC-3 VPD pages */ +#define BLIST_SKIP_VPD_PAGES ((__force __u32 __bitwise)(1 << 26)) +/* Attempt to read VPD pages */ +#define BLIST_TRY_VPD_PAGES ((__force __u32 __bitwise)(1 << 28)) +/* don't try to issue RSOC */ +#define BLIST_NO_RSOC ((__force __u32 __bitwise)(1 << 29)) +/* maximum 1024 sector cdb length */ +#define BLIST_MAX_1024 ((__force __u32 __bitwise)(1 << 30)) -- 1.8.5.6
[PATCHv6 5/5] scsi_devinfo: fixup string compare
When checking the model and vendor strings we need to compare using the shorter of the two lengths, otherwise we'll miss out on wildcard matches. And we should take care when matching zero-size strings, as the results might be unpredictable. With this patch the rules for matching devinfo strings are as follows: - Vendor strings must match exactly - Empty Model strings will only match if the devinfo model is also empty - Model strings shorter than the devinfo model string will not match Fixes: 5e7ff2c ("SCSI: fix new bug in scsi_dev_info_list string matching") Signed-off-by: Hannes Reinecke Reviewed-by: Alan Stern Reviewed-by: Bart Van Assche --- drivers/scsi/scsi_devinfo.c | 23 +-- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/drivers/scsi/scsi_devinfo.c b/drivers/scsi/scsi_devinfo.c index 6858ad8..d39b27c 100644 --- a/drivers/scsi/scsi_devinfo.c +++ b/drivers/scsi/scsi_devinfo.c @@ -399,8 +399,8 @@ int scsi_dev_info_list_add_keyed(int compatible, char *vendor, char *model, /** * scsi_dev_info_list_find - find a matching dev_info list entry. - * @vendor: vendor string - * @model: model (product) string + * @vendor: full vendor string + * @model: full model (product) string * @key: specify list to use * * Description: @@ -415,7 +415,7 @@ static struct scsi_dev_info_list *scsi_dev_info_list_find(const char *vendor, struct scsi_dev_info_list *devinfo; struct scsi_dev_info_list_table *devinfo_table = scsi_devinfo_lookup_by_key(key); - size_t vmax, mmax; + size_t vmax, mmax, mlen; const char *vskip, *mskip; if (IS_ERR(devinfo_table)) @@ -454,15 +454,18 @@ static struct scsi_dev_info_list *scsi_dev_info_list_find(const char *vendor, dev_info_list) { if (devinfo->compatible) { /* -* Behave like the older version of get_device_flags. 
+* vendor strings must be an exact match */ - if (memcmp(devinfo->vendor, vskip, vmax) || - (vmax < sizeof(devinfo->vendor) && - devinfo->vendor[vmax])) + if (vmax != strlen(devinfo->vendor) || + memcmp(devinfo->vendor, vskip, vmax)) continue; - if (memcmp(devinfo->model, mskip, mmax) || - (mmax < sizeof(devinfo->model) && - devinfo->model[mmax])) + + /* +* @model specifies the full string, and +* must be larger or equal to devinfo->model +*/ + mlen = strlen(devinfo->model); + if (mmax < mlen || memcmp(devinfo->model, mskip, mlen)) continue; return devinfo; } else { -- 1.8.5.6
Re: mptsas driver cannot detect hotplugging disk with the LSI SCSI SAS1068 controller in Ubuntu guest on VMware
On 09/27/2017 09:33 AM, Gavin Guo wrote: > There is a problem in the latest upstream kernel with the device: > > $ grep -i lsi lspci > 03:00.0 Serial Attached SCSI controller [0107]: LSI Logic / Symbios > Logic SAS1068 PCI-X Fusion-MPT SAS [1000:0054] (rev 01) > > The device is simulated by the VMware ESXi 5.5 > > When hotplugging a new disk to the Guest Ubuntu OS, the latest kernel > cannot automatically probe the disk. However, on the v3.19.0-80.88 > kernel, the disk can be dynamically probed and show the following > info message: > > mptsas: ioc0: attaching ssp device: fw_channel 0, fw_id 1, phy 1, > sas_addr 0x5000c29a6bdae0f5 > scsi 2:0:1:0: Direct-Access VMware Virtual disk 1.0 PQ: 0 > ANSI: 2 > sd 2:0:1:0: Attached scsi generic sg2 type 0 > sd 2:0:1:0: [sdb] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB) > sd 2:0:1:0: [sdb] Write Protect is off > sd 2:0:1:0: [sdb] Mode Sense: 61 00 00 00 > sd 2:0:1:0: [sdb] Cache data unavailable > sd 2:0:1:0: [sdb] Assuming drive cache: write through > sdb: unknown partition table > sd 2:0:1:0: [sdb] Attached SCSI disk > > After looking up the message: > mptsas: ioc0: attaching ssp device: fw_channel 0, fw_id 1, phy 1, > sas_addr 0x5000c29a6bdae0f5 > > I found it comes from the path: > mptsas_firmware_event_work -> mptsas_send_sas_event -> > mptsas_hotplug_work -> mptsas_add_end_device > > I'll appreciate if anyone can give the idea: If it's possible that the > irq from the simulated LSI SAS controller didn't come in to trigger > the event? However, it can work on the v3.19 kernel so if there is > any driver implementation issue in the latest kernel. > This is an issue with the mptsas driver, which originally assumed that no system would have direct-attached SAS devices. VMware chose to implement exactly that, so the hotplug detection logic is flawed here. I'll be sending a patch. Cheers, Hannes -- Dr. Hannes Reinecke Teamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 
5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg)
Re: [PATCH V5 2/7] blk-mq-sched: fix scheduler bad performance
Looks fine, Reviewed-by: Christoph Hellwig
Re: [PATCH V5 5/7] blk-mq-sched: move actual dispatching into one helper
Can you move this to the beginning of your series, just after the other edits to blk_mq_sched_dispatch_requests? > +static void blk_mq_do_dispatch_sched(struct request_queue *q, > + struct elevator_queue *e, > + struct blk_mq_hw_ctx *hctx) No need to pass anything but the hctx here, the other two can be trivially derived from it. > +{ > + LIST_HEAD(rq_list); > + > + do { > + struct request *rq; > + > + rq = e->type->ops.mq.dispatch_request(hctx); how about shortening this to: struct request *rq = e->type->ops.mq.dispatch_request(hctx); > void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx) > { > struct request_queue *q = hctx->queue; > @@ -136,16 +152,8 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx > *hctx) >* on the dispatch list or we were able to dispatch from the >* dispatch list. >*/ > - if (do_sched_dispatch && has_sched_dispatch) { > - do { > - struct request *rq; > - > - rq = e->type->ops.mq.dispatch_request(hctx); > - if (!rq) > - break; > - list_add(&rq->queuelist, &rq_list); > - } while (blk_mq_dispatch_rq_list(q, &rq_list)); > - } > + if (do_sched_dispatch && has_sched_dispatch) > + blk_mq_do_dispatch_sched(q, e, hctx); > } Please use this new helper to simplify the logic. E.g.: if (!list_empty(&rq_list)) { blk_mq_sched_mark_restart_hctx(hctx); if (blk_mq_dispatch_rq_list(q, &rq_list) && has_sched_dispatch) blk_mq_do_dispatch_sched(hctx); } else if (has_sched_dispatch) { blk_mq_do_dispatch_sched(hctx); } else { blk_mq_flush_busy_ctxs(hctx, &rq_list); blk_mq_dispatch_rq_list(q, &rq_list); }
Re: [PATCH V7 6/6] SCSI: set block queue at preempt only when SCSI device is put into quiesce
> + /* > + * Simply quiescing the SCSI device isn't safe, it is easy > + * to use up requests because all these allocated requests > + * can't be dispatched when the device is put in QUIESCE. > + * Then no request can be allocated and we may hang > + * somewhere, such as system suspend/resume. > + * > + * So we set block queue in preempt only first, no new > + * normal request can enter queue any more, and all pending > + * requests are drained once blk_set_preempt_only() > + * returns. Only RQF_PREEMPT is allowed in preempt only mode. > + */ > + blk_set_preempt_only(sdev->request_queue, true); > + > mutex_lock(&sdev->state_mutex); > err = scsi_device_set_state(sdev, SDEV_QUIESCE); Why is this not under state_mutex so that we guarantee it's atomic vs sdev state changes? > @@ -2964,6 +2981,8 @@ void scsi_device_resume(struct scsi_device *sdev) > scsi_device_set_state(sdev, SDEV_RUNNING) == 0) > scsi_run_queue(sdev->request_queue); > mutex_unlock(&sdev->state_mutex); > + > + blk_set_preempt_only(sdev->request_queue, false); Same here.
Re: [PATCH V7 5/6] block: support PREEMPT_ONLY
> +void blk_set_preempt_only(struct request_queue *q, bool preempt_only) > +{ > + blk_mq_freeze_queue(q); > + if (preempt_only) > + queue_flag_set_unlocked(QUEUE_FLAG_PREEMPT_ONLY, q); > + else > + queue_flag_clear_unlocked(QUEUE_FLAG_PREEMPT_ONLY, q); > + blk_mq_unfreeze_queue(q); > +} > +EXPORT_SYMBOL(blk_set_preempt_only); What do you need the queue freeze for? > + /* > + * preempt_only flag has to be set after queue is frozen, > + * so it can be checked here lockless and safely > + */ > + if (blk_queue_preempt_only(q)) { We can always check a single bit flag safely, so I really don't understand that comment. > + if (!(flags & BLK_REQ_PREEMPT)) > + goto slow_path; > + } > + > if (percpu_ref_tryget_live(&q->q_usage_counter)) > return 0; > - > + slow_path: Also this goto looks rather spaghetti-like; why not: if (!blk_queue_preempt_only(q) || (flags & BLK_MQ_REQ_PREEMPT)) { if (percpu_ref_tryget_live(&q->q_usage_counter)) return 0; }
Re: [PATCH V7 4/6] block: prepare for passing RQF_PREEMPT to request allocation
As mentioned in the reply to Bart, I'd much rather use BLK_MQ_REQ_* for the sane blk_get_request version (which should have the same prototype as blk_mq_alloc_request) instead of having another flags namespace.
Re: [PATCH V7 3/6] block: pass flags to blk_queue_enter()
On Sat, Sep 30, 2017 at 02:12:11PM +0800, Ming Lei wrote: > We need to pass PREEMPT flags to blk_queue_enter() > for allocating request with RQF_PREEMPT in the > following patch. I don't like having another name space for flags. It seems like we should simply pass the flags on, mapping the nowait and RQF_PREEMPT allocations as done in the series from Bart.
Re: [PATCH V7 2/6] block: tracking request allocation with q_usage_counter
I think I already gave it for basically the same patch as queued up by Bart, but here it is again: Reviewed-by: Christoph Hellwig
Re: [PATCH V7 1/6] blk-mq: only run hw queues for blk-mq
Looks fine, Reviewed-by: Christoph Hellwig
Re: [PATCH v2 20/21] lpfc: correct nvme sg segment count check
Looks good, Reviewed-by: Johannes Thumshirn -- Johannes Thumshirn Storage jthumsh...@suse.de+49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH v2 19/21] lpfc: Fix oops of nvme host during driver unload.
Looks good, Reviewed-by: Johannes Thumshirn -- Johannes Thumshirn Storage jthumsh...@suse.de+49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH v2 09/21] lpfc: Fix FCP hba_wqidx assignment
Looks good, Reviewed-by: Johannes Thumshirn -- Johannes Thumshirn Storage jthumsh...@suse.de+49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH v2 07/21] lpfc: Make ktime sampling more accurate
Looks good, Reviewed-by: Johannes Thumshirn -- Johannes Thumshirn Storage jthumsh...@suse.de+49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH v2 02/21] lpfc: fix pci hot plug crash in list_add call
Looks good, Reviewed-by: Johannes Thumshirn -- Johannes Thumshirn Storage jthumsh...@suse.de+49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH v2 01/21] lpfc: fix pci hot plug crash in timer management routines
Looks good, Reviewed-by: Johannes Thumshirn -- Johannes Thumshirn Storage jthumsh...@suse.de+49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
[PATCH V6 12/14] block: mq-deadline: Introduce zone locking support
For a write request to a zoned block device, lock the request target zone upon request dispatch. The zone is unlocked either when the request completes or when the request is requeued (inserted). To indicate that a request has locked its target zone, use the first pointer of the request elevator private data to store the value RQ_ZONE_WLOCKED. Testing for this value allows quick decisions in dd_insert_request() and dd_completed_request() regarding the need for unlocking the target zone of a request. Signed-off-by: Damien Le Moal --- block/mq-deadline.c | 111 1 file changed, 111 insertions(+) diff --git a/block/mq-deadline.c b/block/mq-deadline.c index 6b7b84ee8f82..93a1aede5dd0 100644 --- a/block/mq-deadline.c +++ b/block/mq-deadline.c @@ -177,6 +177,91 @@ deadline_move_request(struct deadline_data *dd, struct request *rq) } /* + * Return true if a request is a write request that needs zone + * write locking. + */ +static inline bool deadline_request_needs_zone_wlock(struct deadline_data *dd, +struct request *rq) +{ + + if (!dd->zones_wlock) + return false; + + if (blk_rq_is_passthrough(rq)) + return false; + + /* +* REQ_OP_SCSI_* and REQ_OP_DRV_* are already handled with +* the previous check. Add them again here so that all request +* operations defined by enum req_opf are handled (so that a compiler +* warning shows up if/when request operation definitions change). +*/ + switch (req_op(rq)) { + case REQ_OP_WRITE_ZEROES: + case REQ_OP_WRITE_SAME: + case REQ_OP_WRITE: + return blk_rq_zone_is_seq(rq); + default: + return false; + } +} + +/* + * Abuse the elv.priv[0] pointer to indicate if a request has write + * locked its target zone. Only write requests to a zoned block device + * can own a zone write lock. 
+ */ +enum rq_zone_lock { + RQ_ZONE_NO_WLOCK = 0UL, + RQ_ZONE_WLOCKED = 1UL, +}; + +static inline void deadline_set_request_zone_wlock(struct request *rq) +{ + rq->elv.priv[0] = (void *)RQ_ZONE_WLOCKED; +} + +static inline void deadline_clear_request_zone_wlock(struct request *rq) +{ + rq->elv.priv[0] = (void *)RQ_ZONE_NO_WLOCK; +} + +static inline bool deadline_request_has_zone_wlock(struct request *rq) +{ + return rq->elv.priv[0] == (void *)RQ_ZONE_WLOCKED; +} + +/* + * Write lock the target zone of a write request. + */ +static void deadline_wlock_zone(struct deadline_data *dd, + struct request *rq) +{ + WARN_ON_ONCE(deadline_request_has_zone_wlock(rq)); + WARN_ON_ONCE(test_and_set_bit(blk_rq_zone_no(rq), dd->zones_wlock)); + deadline_set_request_zone_wlock(rq); +} + +/* + * Write unlock the target zone of a write request. + */ +static void deadline_wunlock_zone(struct deadline_data *dd, + struct request *rq) +{ + WARN_ON_ONCE(!test_and_clear_bit(blk_rq_zone_no(rq), dd->zones_wlock)); + deadline_clear_request_zone_wlock(rq); +} + +/* + * Test the write lock state of the target zone of a write request. + */ +static inline bool deadline_zone_is_wlocked(struct deadline_data *dd, + struct request *rq) +{ + return test_bit(blk_rq_zone_no(rq), dd->zones_wlock); +} + +/* * deadline_check_fifo returns 0 if there are no expired requests on the fifo, * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir]) */ @@ -315,6 +400,11 @@ static struct request *__dd_dispatch_request(struct blk_mq_hw_ctx *hctx) dd->batching++; deadline_move_request(dd, rq); done: + /* +* If the request needs its target zone locked, do it. 
+*/ + if (deadline_request_needs_zone_wlock(dd, rq)) + deadline_wlock_zone(dd, rq); rq->rq_flags |= RQF_STARTED; return rq; } @@ -464,6 +554,13 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq, struct deadline_data *dd = q->elevator->elevator_data; const int data_dir = rq_data_dir(rq); + /* +* This may be a requeue of a write request that has locked its +* target zone. If this is the case, release the zone lock. +*/ + if (deadline_request_has_zone_wlock(rq)) + deadline_wunlock_zone(dd, rq); + if (blk_mq_sched_try_insert_merge(q, rq)) return; @@ -508,6 +605,19 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx, spin_unlock(&dd->lock); } +/* + * For zoned block devices, write unlock the target zone of + * completed write requests. + */ +static void dd_completed_request(struct request *rq) +{ + if (deadline_request_has_zone_wlock(rq)) { + struct deadline_data *dd = rq->q->elevator->elevator_data; + + deadline_wu
[PATCH V6 04/14] scsi: sd_zbc: Use well defined macros
Instead of open coding, use the min() macro to calculate a report zones reply buffer length in sd_zbc_check_zone_size() and the round_up() macro for calculating the number of zones in sd_zbc_setup(). No functional change is introduced by this patch. Signed-off-by: Damien Le Moal Reviewed-by: Johannes Thumshirn Reviewed-by: Bart Van Assche Reviewed-by: Christoph Hellwig --- drivers/scsi/sd_zbc.c | 12 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c index 7dbaf920679e..bbad851c1789 100644 --- a/drivers/scsi/sd_zbc.c +++ b/drivers/scsi/sd_zbc.c @@ -475,7 +475,7 @@ static int sd_zbc_check_capacity(struct scsi_disk *sdkp, unsigned char *buf) return 0; } -#define SD_ZBC_BUF_SIZE 131072 +#define SD_ZBC_BUF_SIZE 131072U /** * sd_zbc_check_zone_size - Check the device zone sizes @@ -526,10 +526,7 @@ static int sd_zbc_check_zone_size(struct scsi_disk *sdkp) /* Parse REPORT ZONES header */ list_length = get_unaligned_be32(&buf[0]) + 64; rec = buf + 64; - if (list_length < SD_ZBC_BUF_SIZE) - buf_len = list_length; - else - buf_len = SD_ZBC_BUF_SIZE; + buf_len = min(list_length, SD_ZBC_BUF_SIZE); /* Parse zone descriptors */ while (rec < buf + buf_len) { @@ -599,9 +596,8 @@ static int sd_zbc_setup(struct scsi_disk *sdkp) /* chunk_sectors indicates the zone size */ blk_queue_chunk_sectors(sdkp->disk->queue, logical_to_sectors(sdkp->device, sdkp->zone_blocks)); - sdkp->nr_zones = sdkp->capacity >> sdkp->zone_shift; - if (sdkp->capacity & (sdkp->zone_blocks - 1)) - sdkp->nr_zones++; + sdkp->nr_zones = + round_up(sdkp->capacity, sdkp->zone_blocks) >> sdkp->zone_shift; if (!sdkp->zones_wlock) { sdkp->zones_wlock = kcalloc(BITS_TO_LONGS(sdkp->nr_zones), -- 2.13.6
[PATCH V6 09/14] scsi: sd_zbc: Disable zone write locking with scsi-mq
In the case of a ZBC disk used with scsi-mq, zone write locking does not prevent write reordering in sequential zones. Unlike the legacy case, zone locking is done after the command request is removed from the scheduler dispatch queue. That is, at the time of zone locking, the write command may already be out of order, making locking ineffective. Write order guarantees can only be provided by an adapted I/O scheduler. Disable zone write locking in sd_zbc_write_lock_zone() if the disk is used with scsi-mq. As the disk zones_wlock bitmap is then not necessary, do not allocate it. Signed-off-by: Damien Le Moal Reviewed-by: Bart Van Assche Reviewed-by: Johannes Thumshirn Reviewed-by: Christoph Hellwig --- drivers/scsi/sd_zbc.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c index 4cb271619fc6..f56ed1bc3eaa 100644 --- a/drivers/scsi/sd_zbc.c +++ b/drivers/scsi/sd_zbc.c @@ -287,6 +287,7 @@ int sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd) int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd) { struct request *rq = cmd->request; + struct request_queue *q = rq->q; struct scsi_disk *sdkp = scsi_disk(rq->rq_disk); sector_t sector = blk_rq_pos(rq); sector_t zone_sectors = sd_zbc_zone_sectors(sdkp); @@ -297,10 +298,14 @@ int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd) */ /* Do not allow zone boundaries crossing on host-managed drives */ - if (blk_queue_zoned_model(rq->q) == BLK_ZONED_HM && + if (blk_queue_zoned_model(q) == BLK_ZONED_HM && (sector & (zone_sectors - 1)) + blk_rq_sectors(rq) > zone_sectors) return BLKPREP_KILL; + /* No write locking with scsi-mq */ + if (q->mq_ops) + return BLKPREP_OK; + /* * There is no write constraints on conventional zones. So any write * command can be sent. 
But do not issue more than one write command @@ -711,7 +716,7 @@ static int sd_zbc_setup(struct scsi_disk *sdkp) if (sdkp->first_scan) return 0; - if (!sdkp->zones_wlock) { + if (!q->mq_ops && !sdkp->zones_wlock) { sdkp->zones_wlock = sd_zbc_alloc_zone_bitmap(sdkp); if (!sdkp->zones_wlock) return -ENOMEM; -- 2.13.6
[PATCH V6 03/14] scsi: sd_zbc: Rearrange code
Rearrange sd_zbc_setup() to include use_16_for_rw and use_10_for_rw assignments and move the calculation of sdkp->zone_shift together with the assignment of the verified zone_blocks value in sd_zbc_check_zone_size(). No functional change is introduced by this patch. Signed-off-by: Damien Le Moal Reviewed-by: Christoph Hellwig Reviewed-by: Bart Van Assche --- drivers/scsi/sd_zbc.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c index 023f705ae235..7dbaf920679e 100644 --- a/drivers/scsi/sd_zbc.c +++ b/drivers/scsi/sd_zbc.c @@ -584,6 +584,7 @@ static int sd_zbc_check_zone_size(struct scsi_disk *sdkp) } sdkp->zone_blocks = zone_blocks; + sdkp->zone_shift = ilog2(zone_blocks); return 0; } @@ -591,10 +592,13 @@ static int sd_zbc_check_zone_size(struct scsi_disk *sdkp) static int sd_zbc_setup(struct scsi_disk *sdkp) { + /* READ16/WRITE16 is mandatory for ZBC disks */ + sdkp->device->use_16_for_rw = 1; + sdkp->device->use_10_for_rw = 0; + /* chunk_sectors indicates the zone size */ blk_queue_chunk_sectors(sdkp->disk->queue, logical_to_sectors(sdkp->device, sdkp->zone_blocks)); - sdkp->zone_shift = ilog2(sdkp->zone_blocks); sdkp->nr_zones = sdkp->capacity >> sdkp->zone_shift; if (sdkp->capacity & (sdkp->zone_blocks - 1)) sdkp->nr_zones++; @@ -657,10 +661,6 @@ int sd_zbc_read_zones(struct scsi_disk *sdkp, unsigned char *buf) if (ret) goto err; - /* READ16/WRITE16 is mandatory for ZBC disks */ - sdkp->device->use_16_for_rw = 1; - sdkp->device->use_10_for_rw = 0; - return 0; err: -- 2.13.6
[PATCH V6 06/14] block: Add zoned block device information to request queue
Components relying only on the request_queue structure for accessing block devices (e.g. I/O schedulers) have limited knowledge of the device characteristics. In particular, the device capacity cannot be easily discovered, which for a zoned block device also results in the inability to easily know the number of zones of the device (the zone size is indicated by the chunk_sectors field of the queue limits). Introduce the nr_zones field to the request_queue structure to simplify access to this information. Also, add the bitmap seq_zone_bitmap which indicates which zones of the device are sequential zones (write preferred or write required). These two fields are initialized by the low level block device driver (sd.c for ZBC/ZAC disks). They are not initialized by stacking drivers (device mappers) handling zoned block devices (e.g. dm-linear). Signed-off-by: Damien Le Moal Reviewed-by: Christoph Hellwig Reviewed-by: Bart Van Assche --- include/linux/blkdev.h | 53 ++ 1 file changed, 53 insertions(+) diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 02fa42d24b52..a44807c502f0 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -544,6 +544,18 @@ struct request_queue { struct queue_limits limits; /* +* Zoned block device information for mq I/O schedulers. +* nr_zones is the total number of zones of the device. This is always +* 0 for regular block devices. seq_zone_bitmap is a bitmap of nr_zones +* bits which indicates if a zone is conventional (bit clear) or +* sequential (bit set). Both nr_zones and seq_zone_bitmap are set +* by the low level device driver. Stacking drivers (device mappers) +* may or may not initialize these fields. +*/ + unsigned int nr_zones; + unsigned long *seq_zone_bitmap; + + /* * sg stuff */ unsigned int sg_timeout; @@ -786,6 +798,27 @@ static inline unsigned int blk_queue_zone_sectors(struct request_queue *q) return blk_queue_is_zoned(q) ? 
q->limits.chunk_sectors : 0; } +static inline unsigned int blk_queue_nr_zones(struct request_queue *q) +{ + return q->nr_zones; +} + +static inline unsigned int blk_queue_zone_no(struct request_queue *q, +sector_t sector) +{ + if (!blk_queue_is_zoned(q)) + return 0; + return sector >> ilog2(q->limits.chunk_sectors); +} + +static inline bool blk_queue_zone_is_seq(struct request_queue *q, +sector_t sector) +{ + if (!blk_queue_is_zoned(q) || !q->seq_zone_bitmap) + return false; + return test_bit(blk_queue_zone_no(q, sector), q->seq_zone_bitmap); +} + static inline bool rq_is_sync(struct request *rq) { return op_is_sync(rq->cmd_flags); @@ -1032,6 +1065,16 @@ static inline unsigned int blk_rq_cur_sectors(const struct request *rq) return blk_rq_cur_bytes(rq) >> 9; } +static inline unsigned int blk_rq_zone_no(struct request *rq) +{ + return blk_queue_zone_no(rq->q, blk_rq_pos(rq)); +} + +static inline unsigned int blk_rq_zone_is_seq(struct request *rq) +{ + return blk_queue_zone_is_seq(rq->q, blk_rq_pos(rq)); +} + /* * Some commands like WRITE SAME have a payload or data transfer size which * is different from the size of the request. Any driver that supports such @@ -1583,6 +1626,16 @@ static inline unsigned int bdev_zone_sectors(struct block_device *bdev) return 0; } +static inline unsigned int bdev_nr_zones(struct block_device *bdev) +{ + struct request_queue *q = bdev_get_queue(bdev); + + if (q) + return blk_queue_nr_zones(q); + + return 0; +} + static inline int queue_dma_alignment(struct request_queue *q) { return q ? q->dma_alignment : 511; -- 2.13.6
[PATCH V6 08/14] scsi: sd_zbc: Limit zone write locking to sequential zones
Conventional zones of zoned block devices have no write constraints. Write locking of conventional zones is thus not necessary and can even hurt performance by unnecessarily operating the disk under low queue depth. To avoid this, use the disk request queue seq_zone_bitmap to allow any write to be issued to conventional zones, locking only sequential zones. While at it, remove the helper sd_zbc_zone_no() and use blk_rq_zone_no() instead. Signed-off-by: Damien Le Moal --- drivers/scsi/sd_zbc.c | 32 1 file changed, 12 insertions(+), 20 deletions(-) diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c index 4ccb20074cb3..4cb271619fc6 100644 --- a/drivers/scsi/sd_zbc.c +++ b/drivers/scsi/sd_zbc.c @@ -230,17 +230,6 @@ static inline sector_t sd_zbc_zone_sectors(struct scsi_disk *sdkp) } /** - * sd_zbc_zone_no - Get the number of the zone conataining a sector. - * @sdkp: The target disk - * @sector: 512B sector address contained in the zone - */ -static inline unsigned int sd_zbc_zone_no(struct scsi_disk *sdkp, - sector_t sector) -{ - return sectors_to_logical(sdkp->device, sector) >> sdkp->zone_shift; -} - -/** * sd_zbc_setup_reset_cmnd - Prepare a RESET WRITE POINTER scsi command. * @cmd: the command to setup * @@ -301,7 +290,6 @@ int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd) struct scsi_disk *sdkp = scsi_disk(rq->rq_disk); sector_t sector = blk_rq_pos(rq); sector_t zone_sectors = sd_zbc_zone_sectors(sdkp); - unsigned int zno = sd_zbc_zone_no(sdkp, sector); /* * Note: Checks of the alignment of the write command on @@ -309,18 +297,21 @@ int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd) */ /* Do not allow zone boundaries crossing on host-managed drives */ - if (blk_queue_zoned_model(sdkp->disk->queue) == BLK_ZONED_HM && + if (blk_queue_zoned_model(rq->q) == BLK_ZONED_HM && (sector & (zone_sectors - 1)) + blk_rq_sectors(rq) > zone_sectors) return BLKPREP_KILL; /* -* Do not issue more than one write at a time per -* zone. 
This solves write ordering problems due to -* the unlocking of the request queue in the dispatch -* path in the non scsi-mq case. +* There is no write constraints on conventional zones. So any write +* command can be sent. But do not issue more than one write command +* at a time per sequential zone. This avoids write ordering problems +* due to the unlocking of the request queue in the dispatch path of +* legacy scsi path, as well as at the HBA level (e.g. AHCI). */ + if (!blk_rq_zone_is_seq(rq)) + return BLKPREP_OK; if (sdkp->zones_wlock && - test_and_set_bit(zno, sdkp->zones_wlock)) + test_and_set_bit(blk_rq_zone_no(rq), sdkp->zones_wlock)) return BLKPREP_DEFER; WARN_ON_ONCE(cmd->flags & SCMD_ZONE_WRITE_LOCK); @@ -341,8 +332,9 @@ void sd_zbc_write_unlock_zone(struct scsi_cmnd *cmd) struct request *rq = cmd->request; struct scsi_disk *sdkp = scsi_disk(rq->rq_disk); - if (sdkp->zones_wlock && cmd->flags & SCMD_ZONE_WRITE_LOCK) { - unsigned int zno = sd_zbc_zone_no(sdkp, blk_rq_pos(rq)); + if (cmd->flags & SCMD_ZONE_WRITE_LOCK) { + unsigned int zno = blk_rq_zone_no(rq); + WARN_ON_ONCE(!test_bit(zno, sdkp->zones_wlock)); cmd->flags &= ~SCMD_ZONE_WRITE_LOCK; clear_bit_unlock(zno, sdkp->zones_wlock); -- 2.13.6
[PATCH V6 13/14] block: mq-deadline: Limit write request dispatch for zoned block devices
When dispatching write requests to a zoned block device, only allow requests targeting an unlocked zone. Requests targeting a locked zone are left in the scheduler queue to preserve the initial write order. If no write request can be dispatched, allow reads to be dispatched even if the write batch is not done. To ensure that the search for an appropriate write request is atomic in deadline_fifo_request() and deadline_next_request() with regard to the zone lock state of write requests, introduce the spinlock zone_lock. Holding this lock while doing the search in these functions as well as when unlocking the target zone of a completed write request in dd_completed_request() ensures that the search does not pick up a write request in the middle of a zone queued write sequence. Signed-off-by: Damien Le Moal --- block/mq-deadline.c | 71 ++--- 1 file changed, 68 insertions(+), 3 deletions(-) diff --git a/block/mq-deadline.c b/block/mq-deadline.c index 93a1aede5dd0..cbca24e1f96e 100644 --- a/block/mq-deadline.c +++ b/block/mq-deadline.c @@ -61,6 +61,7 @@ struct deadline_data { spinlock_t lock; struct list_head dispatch; + spinlock_t zone_lock; unsigned long *zones_wlock; }; @@ -244,12 +245,24 @@ static void deadline_wlock_zone(struct deadline_data *dd, /* * Write unlock the target zone of a write request. + * Clearing the target zone write lock bit is done with the scheduler zone_lock + * spinlock held so that deadline_next_request() and deadline_fifo_request() + * cannot see the lock state of a zone change due to a request completion during + * their eventual search for an appropriate write request. Otherwise, for a zone + * with multiple write requests queued, a non sequential write request + * can be chosen. 
*/ static void deadline_wunlock_zone(struct deadline_data *dd, struct request *rq) { + unsigned long flags; + + spin_lock_irqsave(&dd->zone_lock, flags); + WARN_ON_ONCE(!test_and_clear_bit(blk_rq_zone_no(rq), dd->zones_wlock)); deadline_clear_request_zone_wlock(rq); + + spin_unlock_irqrestore(&dd->zone_lock, flags); } /* @@ -279,19 +292,47 @@ static inline int deadline_check_fifo(struct deadline_data *dd, int ddir) } /* + * Test if a request can be dispatched. + */ +static inline bool deadline_can_dispatch_request(struct deadline_data *dd, +struct request *rq) +{ + if (!deadline_request_needs_zone_wlock(dd, rq)) + return true; + return !deadline_zone_is_wlocked(dd, rq); +} + +/* * For the specified data direction, return the next request to * dispatch using arrival ordered lists. */ static struct request * deadline_fifo_request(struct deadline_data *dd, int data_dir) { + struct request *rq; + unsigned long flags; + if (WARN_ON_ONCE(data_dir != READ && data_dir != WRITE)) return NULL; if (list_empty(&dd->fifo_list[data_dir])) return NULL; - return rq_entry_fifo(dd->fifo_list[data_dir].next); + if (!dd->zones_wlock || data_dir == READ) + return rq_entry_fifo(dd->fifo_list[data_dir].next); + + spin_lock_irqsave(&dd->zone_lock, flags); + + list_for_each_entry(rq, &dd->fifo_list[WRITE], queuelist) { + if (deadline_can_dispatch_request(dd, rq)) + goto out; + } + rq = NULL; + +out: + spin_unlock_irqrestore(&dd->zone_lock, flags); + + return rq; } /* @@ -301,10 +342,25 @@ deadline_fifo_request(struct deadline_data *dd, int data_dir) static struct request * deadline_next_request(struct deadline_data *dd, int data_dir) { + struct request *rq; + unsigned long flags; + if (WARN_ON_ONCE(data_dir != READ && data_dir != WRITE)) return NULL; - return dd->next_rq[data_dir]; + rq = dd->next_rq[data_dir]; + if (!dd->zones_wlock || data_dir == READ) + return rq; + + spin_lock_irqsave(&dd->zone_lock, flags); + while (rq) { + if (deadline_can_dispatch_request(dd, rq)) + break; + rq = 
deadline_latter_request(rq); + } + spin_unlock_irqrestore(&dd->zone_lock, flags); + + return rq; } /* @@ -346,7 +402,8 @@ static struct request *__dd_dispatch_request(struct blk_mq_hw_ctx *hctx) if (reads) { BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ])); - if (writes && (dd->starved++ >= dd->writes_starved)) + if (deadline_fifo_request(dd, WRITE) && + (dd->starved++ >= dd->writes_starved)) goto dispatch_writes; data_dir = READ; @@ -391,6 +448,13 @@ static struct request *__dd_dispatch_request(struct blk_mq_hw_ctx *hctx) rq = next_rq; } + /*
[PATCH V6 14/14] block: do not set mq default scheduler
For blk-mq disks with a single hardware queue, setting by default the disk scheduler to mq-deadline early during the queue initialization prevents properly setting zone write locking for host-managed zoned block devices as the disk type is not yet known. Fix this by simply not setting the default scheduler to mq-deadline for single hardware queue disks. A udev rule can be used to easily do the same later in the system initialization sequence, when the device characteristics are known. Signed-off-by: Damien Le Moal --- block/elevator.c | 17 ++--- 1 file changed, 6 insertions(+), 11 deletions(-) diff --git a/block/elevator.c b/block/elevator.c index 153926a90901..8b65a757f726 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -222,19 +222,14 @@ int elevator_init(struct request_queue *q, char *name) if (!e) { /* -* For blk-mq devices, we default to using mq-deadline, -* if available, for single queue devices. If deadline -* isn't available OR we have multiple queues, default -* to "none". +* For blk-mq devices, default to "none". udev can later set +* an appropriate default scheduler based on the disk +* characteristics which we do not yet have here. */ - if (q->mq_ops) { - if (q->nr_hw_queues == 1) - e = elevator_get("mq-deadline", false); - if (!e) - return 0; - } else - e = elevator_get(CONFIG_DEFAULT_IOSCHED, false); + if (q->mq_ops) + return 0; + e = elevator_get(CONFIG_DEFAULT_IOSCHED, false); if (!e) { printk(KERN_ERR "Default I/O scheduler not found. " \ -- 2.13.6
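As a concrete illustration of the udev approach mentioned in this patch, a rule along these lines could select mq-deadline for host-managed disks once sysfs exposes the zoned model. This is a sketch only; the match keys and device naming may need adjusting for a given system:

```
# Hypothetical udev rule: once queue/zoned is populated, pick mq-deadline
# for host-managed zoned disks.
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/zoned}=="host-managed", \
    ATTR{queue/scheduler}="mq-deadline"
```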
[PATCH V6 10/14] block: mq-deadline: Add zoned block device data
Introduce a zone bitmap field in mq-deadline private data to support zoned block devices. The zone bitmap is used to implement per zone write locking. Modify mq-deadline init_queue() and exit_queue() elevator methods to handle initialization and cleanup of the zone write lock bitmap. Signed-off-by: Damien Le Moal --- block/mq-deadline.c | 48 1 file changed, 44 insertions(+), 4 deletions(-) diff --git a/block/mq-deadline.c b/block/mq-deadline.c index a1cad4331edd..f0c01ef934b4 100644 --- a/block/mq-deadline.c +++ b/block/mq-deadline.c @@ -60,6 +60,8 @@ struct deadline_data { spinlock_t lock; struct list_head dispatch; + + unsigned long *zones_wlock; }; static inline struct rb_root * @@ -300,6 +302,34 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx) return rq; } +static int deadline_init_zones_wlock(struct request_queue *q, +struct deadline_data *dd) +{ + /* +* For regular drives or non-conforming zoned block device, +* do not use zone write locking. +*/ + if (!blk_queue_nr_zones(q)) + return 0; + + /* +* Treat host aware drives as regular disks. 
+*/ + if (blk_queue_zoned_model(q) != BLK_ZONED_HM) + return 0; + + dd->zones_wlock = kzalloc_node(BITS_TO_LONGS(blk_queue_nr_zones(q)) + * sizeof(unsigned long), + GFP_KERNEL, q->node); + if (!dd->zones_wlock) + return -ENOMEM; + + pr_info("mq-deadline: %s: zones write locking enabled\n", + dev_name(q->backing_dev_info->dev)); + + return 0; +} + static void dd_exit_queue(struct elevator_queue *e) { struct deadline_data *dd = e->elevator_data; @@ -307,6 +337,7 @@ static void dd_exit_queue(struct elevator_queue *e) BUG_ON(!list_empty(&dd->fifo_list[READ])); BUG_ON(!list_empty(&dd->fifo_list[WRITE])); + kfree(dd->zones_wlock); kfree(dd); } @@ -317,16 +348,15 @@ static int dd_init_queue(struct request_queue *q, struct elevator_type *e) { struct deadline_data *dd; struct elevator_queue *eq; + int ret = -ENOMEM; eq = elevator_alloc(q, e); if (!eq) return -ENOMEM; dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node); - if (!dd) { - kobject_put(&eq->kobj); - return -ENOMEM; - } + if (!dd) + goto out_put_elv; eq->elevator_data = dd; INIT_LIST_HEAD(&dd->fifo_list[READ]); @@ -341,8 +371,18 @@ static int dd_init_queue(struct request_queue *q, struct elevator_type *e) spin_lock_init(&dd->lock); INIT_LIST_HEAD(&dd->dispatch); + ret = deadline_init_zones_wlock(q, dd); + if (ret) + goto out_free_dd; + q->elevator = eq; return 0; + +out_free_dd: + kfree(dd); +out_put_elv: + kobject_put(&eq->kobj); + return ret; } static int dd_request_merge(struct request_queue *q, struct request **rq, -- 2.13.6
[PATCH V6 02/14] scsi: sd_zbc: Fix comments and indentation
Fix comments style (use kernel-doc style) and content to clarify some functions. Also fix some functions signature indentation and remove a useless blank line in sd_zbc_read_zones(). No functional change is introduced by this patch. Signed-off-by: Damien Le Moal Reviewed-by: Christoph Hellwig --- drivers/scsi/scsi_lib.c | 5 ++- drivers/scsi/sd_zbc.c | 117 +--- 2 files changed, 104 insertions(+), 18 deletions(-) diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 9cf6a80fe297..c72b97a74906 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -1752,7 +1752,10 @@ static void scsi_done(struct scsi_cmnd *cmd) * * Returns: Nothing * - * Lock status: IO request lock assumed to be held when called. + * Lock status: request queue lock assumed to be held when called. + * + * Note: See sd_zbc.c sd_zbc_write_lock_zone() for write order + * protection for ZBC disks. */ static void scsi_request_fn(struct request_queue *q) __releases(q->queue_lock) diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c index 692c8cbc7ed8..023f705ae235 100644 --- a/drivers/scsi/sd_zbc.c +++ b/drivers/scsi/sd_zbc.c @@ -32,10 +32,14 @@ #include "sd.h" /** - * Convert a zone descriptor to a zone struct. + * sd_zbc_parse_report - Convert a zone descriptor to a struct blk_zone, + * @sdkp: The disk the report originated from + * @buf: Address of the report zone descriptor + * @zone: the destination zone structure + * + * All LBA sized values are converted to 512B sectors unit. */ -static void sd_zbc_parse_report(struct scsi_disk *sdkp, - u8 *buf, +static void sd_zbc_parse_report(struct scsi_disk *sdkp, u8 *buf, struct blk_zone *zone) { struct scsi_device *sdp = sdkp->device; @@ -58,7 +62,13 @@ static void sd_zbc_parse_report(struct scsi_disk *sdkp, } /** - * Issue a REPORT ZONES scsi command. + * sd_zbc_report_zones - Issue a REPORT ZONES scsi command. 
+ * @sdkp: The target disk + * @buf: Buffer to use for the reply + * @buflen: the buffer size + * @lba: Start LBA of the report + * + * For internal use during device validation. */ static int sd_zbc_report_zones(struct scsi_disk *sdkp, unsigned char *buf, unsigned int buflen, sector_t lba) @@ -99,6 +109,12 @@ static int sd_zbc_report_zones(struct scsi_disk *sdkp, unsigned char *buf, return 0; } +/** + * sd_zbc_setup_report_cmnd - Prepare a REPORT ZONES scsi command + * @cmd: The command to setup + * + * Call in sd_init_command() for a REQ_OP_ZONE_REPORT request. + */ int sd_zbc_setup_report_cmnd(struct scsi_cmnd *cmd) { struct request *rq = cmd->request; @@ -141,6 +157,14 @@ int sd_zbc_setup_report_cmnd(struct scsi_cmnd *cmd) return BLKPREP_OK; } +/** + * sd_zbc_report_zones_complete - Process a REPORT ZONES scsi command reply. + * @scmd: The completed report zones command + * @good_bytes: reply size in bytes + * + * Convert all reported zone descriptors to struct blk_zone. The conversion + * is done in-place, directly in the request specified sg buffer. + */ static void sd_zbc_report_zones_complete(struct scsi_cmnd *scmd, unsigned int good_bytes) { @@ -196,17 +220,32 @@ static void sd_zbc_report_zones_complete(struct scsi_cmnd *scmd, local_irq_restore(flags); } +/** + * sd_zbc_zone_sectors - Get the device zone size in number of 512B sectors. + * @sdkp: The target disk + */ static inline sector_t sd_zbc_zone_sectors(struct scsi_disk *sdkp) { return logical_to_sectors(sdkp->device, sdkp->zone_blocks); } +/** + * sd_zbc_zone_no - Get the number of the zone conataining a sector. + * @sdkp: The target disk + * @sector: 512B sector address contained in the zone + */ static inline unsigned int sd_zbc_zone_no(struct scsi_disk *sdkp, sector_t sector) { return sectors_to_logical(sdkp->device, sector) >> sdkp->zone_shift; } +/** + * sd_zbc_setup_reset_cmnd - Prepare a RESET WRITE POINTER scsi command. 
+ * @cmd: the command to setup + * + * Called from sd_init_command() for a REQ_OP_ZONE_RESET request. + */ int sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd) { struct request *rq = cmd->request; @@ -239,6 +278,23 @@ int sd_zbc_setup_reset_cmnd(struct scsi_cmnd *cmd) return BLKPREP_OK; } +/** + * sd_zbc_write_lock_zone - Write lock a sequential zone. + * @cmd: write command + * + * Called from sd_init_cmd() for write requests (standard write, write same or + * write zeroes operations). If the request target zone is not already locked, + * the zone is locked and BLKPREP_OK returned, allowing the request to proceed + * through dispatch in scsi_request_fn(). Otherwise, BLKPREP_DEFER is returned, + * forcing the request to wait for the zone to be unlocked, that is, for the + * previously issued write request targ
[PATCH V6 11/14] block: mq-deadline: Introduce dispatch helpers
Avoid directly referencing the next_rq and fifo_list arrays using the helper functions deadline_next_request() and deadline_fifo_request() to facilitate changes in the dispatch request selection in __dd_dispatch_request(). Signed-off-by: Damien Le Moal Reviewed-by: Bart Van Assche --- block/mq-deadline.c | 45 + 1 file changed, 37 insertions(+), 8 deletions(-) diff --git a/block/mq-deadline.c b/block/mq-deadline.c index f0c01ef934b4..6b7b84ee8f82 100644 --- a/block/mq-deadline.c +++ b/block/mq-deadline.c @@ -194,13 +194,42 @@ static inline int deadline_check_fifo(struct deadline_data *dd, int ddir) } /* + * For the specified data direction, return the next request to + * dispatch using arrival ordered lists. + */ +static struct request * +deadline_fifo_request(struct deadline_data *dd, int data_dir) +{ + if (WARN_ON_ONCE(data_dir != READ && data_dir != WRITE)) + return NULL; + + if (list_empty(&dd->fifo_list[data_dir])) + return NULL; + + return rq_entry_fifo(dd->fifo_list[data_dir].next); +} + +/* + * For the specified data direction, return the next request to + * dispatch using sector position sorted lists. 
+ */ +static struct request * +deadline_next_request(struct deadline_data *dd, int data_dir) +{ + if (WARN_ON_ONCE(data_dir != READ && data_dir != WRITE)) + return NULL; + + return dd->next_rq[data_dir]; +} + +/* * deadline_dispatch_requests selects the best request according to * read/write expire, fifo_batch, etc */ static struct request *__dd_dispatch_request(struct blk_mq_hw_ctx *hctx) { struct deadline_data *dd = hctx->queue->elevator->elevator_data; - struct request *rq; + struct request *rq, *next_rq; bool reads, writes; int data_dir; @@ -216,10 +245,9 @@ static struct request *__dd_dispatch_request(struct blk_mq_hw_ctx *hctx) /* * batches are currently reads XOR writes */ - if (dd->next_rq[WRITE]) - rq = dd->next_rq[WRITE]; - else - rq = dd->next_rq[READ]; + rq = deadline_next_request(dd, WRITE); + if (!rq) + rq = deadline_next_request(dd, READ); if (rq && dd->batching < dd->fifo_batch) /* we have a next request are still entitled to batch */ @@ -262,19 +290,20 @@ static struct request *__dd_dispatch_request(struct blk_mq_hw_ctx *hctx) /* * we are not running a batch, find best request for selected data_dir */ - if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) { + next_rq = deadline_next_request(dd, data_dir); + if (deadline_check_fifo(dd, data_dir) || !next_rq) { /* * A deadline has expired, the last request was in the other * direction, or we have run out of higher-sectored requests. * Start again from the request with the earliest expiry time. */ - rq = rq_entry_fifo(dd->fifo_list[data_dir].next); + rq = deadline_fifo_request(dd, data_dir); } else { /* * The last req was the same dir and we have a next request in * sort order. No expired requests so continue on from here. */ - rq = dd->next_rq[data_dir]; + rq = next_rq; } dd->batching = 0; -- 2.13.6
[PATCH V6 01/14] scsi: sd_zbc: Move ZBC declarations to scsi_proto.h
Move standard macro definitions for the zone types and zone conditions to scsi_proto.h together with the definitions related to the REPORT ZONES command. While at it, define all values in the enums to be clear. Also remove unnecessary includes in sd_zbc.c. No functional change is introduced by this patch. Signed-off-by: Damien Le Moal Reviewed-by: Bart Van Assche Reviewed-by: Johannes Thumshirn Reviewed-by: Christoph Hellwig --- drivers/scsi/sd_zbc.c | 24 include/scsi/scsi_proto.h | 45 ++--- 2 files changed, 34 insertions(+), 35 deletions(-) diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c index 8aa54779aac1..692c8cbc7ed8 100644 --- a/drivers/scsi/sd_zbc.c +++ b/drivers/scsi/sd_zbc.c @@ -28,32 +28,8 @@ #include #include -#include -#include -#include -#include -#include #include "sd.h" -#include "scsi_priv.h" - -enum zbc_zone_type { - ZBC_ZONE_TYPE_CONV = 0x1, - ZBC_ZONE_TYPE_SEQWRITE_REQ, - ZBC_ZONE_TYPE_SEQWRITE_PREF, - ZBC_ZONE_TYPE_RESERVED, -}; - -enum zbc_zone_cond { - ZBC_ZONE_COND_NO_WP, - ZBC_ZONE_COND_EMPTY, - ZBC_ZONE_COND_IMP_OPEN, - ZBC_ZONE_COND_EXP_OPEN, - ZBC_ZONE_COND_CLOSED, - ZBC_ZONE_COND_READONLY = 0xd, - ZBC_ZONE_COND_FULL, - ZBC_ZONE_COND_OFFLINE, -}; /** * Convert a zone descriptor to a zone struct. 
diff --git a/include/scsi/scsi_proto.h b/include/scsi/scsi_proto.h index 8c285d9a06d8..39130a9c05bf 100644 --- a/include/scsi/scsi_proto.h +++ b/include/scsi/scsi_proto.h @@ -301,19 +301,42 @@ struct scsi_lun { /* Reporting options for REPORT ZONES */ enum zbc_zone_reporting_options { - ZBC_ZONE_REPORTING_OPTION_ALL = 0, - ZBC_ZONE_REPORTING_OPTION_EMPTY, - ZBC_ZONE_REPORTING_OPTION_IMPLICIT_OPEN, - ZBC_ZONE_REPORTING_OPTION_EXPLICIT_OPEN, - ZBC_ZONE_REPORTING_OPTION_CLOSED, - ZBC_ZONE_REPORTING_OPTION_FULL, - ZBC_ZONE_REPORTING_OPTION_READONLY, - ZBC_ZONE_REPORTING_OPTION_OFFLINE, - ZBC_ZONE_REPORTING_OPTION_NEED_RESET_WP = 0x10, - ZBC_ZONE_REPORTING_OPTION_NON_SEQWRITE, - ZBC_ZONE_REPORTING_OPTION_NON_WP = 0x3f, + ZBC_ZONE_REPORTING_OPTION_ALL = 0x00, + ZBC_ZONE_REPORTING_OPTION_EMPTY = 0x01, + ZBC_ZONE_REPORTING_OPTION_IMPLICIT_OPEN = 0x02, + ZBC_ZONE_REPORTING_OPTION_EXPLICIT_OPEN = 0x03, + ZBC_ZONE_REPORTING_OPTION_CLOSED= 0x04, + ZBC_ZONE_REPORTING_OPTION_FULL = 0x05, + ZBC_ZONE_REPORTING_OPTION_READONLY = 0x06, + ZBC_ZONE_REPORTING_OPTION_OFFLINE = 0x07, + /* 0x08 to 0x0f are reserved */ + ZBC_ZONE_REPORTING_OPTION_NEED_RESET_WP = 0x10, + ZBC_ZONE_REPORTING_OPTION_NON_SEQWRITE = 0x11, + /* 0x12 to 0x3e are reserved */ + ZBC_ZONE_REPORTING_OPTION_NON_WP= 0x3f, }; #define ZBC_REPORT_ZONE_PARTIAL 0x80 +/* Zone types of REPORT ZONES zone descriptors */ +enum zbc_zone_type { + ZBC_ZONE_TYPE_CONV = 0x1, + ZBC_ZONE_TYPE_SEQWRITE_REQ = 0x2, + ZBC_ZONE_TYPE_SEQWRITE_PREF = 0x3, + /* 0x4 to 0xf are reserved */ +}; + +/* Zone conditions of REPORT ZONES zone descriptors */ +enum zbc_zone_cond { + ZBC_ZONE_COND_NO_WP = 0x0, + ZBC_ZONE_COND_EMPTY = 0x1, + ZBC_ZONE_COND_IMP_OPEN = 0x2, + ZBC_ZONE_COND_EXP_OPEN = 0x3, + ZBC_ZONE_COND_CLOSED= 0x4, + /* 0x5 to 0xc are reserved */ + ZBC_ZONE_COND_READONLY = 0xd, + ZBC_ZONE_COND_FULL = 0xe, + ZBC_ZONE_COND_OFFLINE = 0xf, +}; + #endif /* _SCSI_PROTO_H_ */ -- 2.13.6
[PATCH V6 00/14] scsi-mq support for ZBC disks
This series implements support for ZBC disks used through the scsi-mq I/O path. The current scsi level support of ZBC disks guarantees write request ordering using a per-zone write lock which prevents issuing multiple write commands to a zone simultaneously; doing so avoids reordering of sequential writes to sequential zones. This method is however ineffective when scsi-mq is used with zoned block devices. This is due to the different execution model of blk-mq which passes a request to the scsi layer for dispatching after the request has been removed from the I/O scheduler queue. That is, when the scsi layer tries to lock the target zone of the request, the request may already be out of order and zone write locking fails to prevent that. Various approaches have been tried to solve this problem. All of them had the serious disadvantage of cluttering blk-mq code with zoned block device specific conditions and processing. As such extensive changes can only turn into a maintenance nightmare, a radically different solution is proposed here. This series proposes implementing scsi-mq support for zoned block devices at the I/O scheduler level with simple modifications of the mq-deadline scheduler. The modifications are the addition of a per zone write locking mechanism similar to that implemented in sd_zbc.c for the legacy scsi path. The zone write locking mechanism is used for the exact same purpose, that is, to limit writes per zone to at most one request to avoid reordering. The locking context however changes from that of scsi-sq and is moved to the dispatch_request method of the scheduler. Within this context, under a spin lock guaranteeing atomicity against other dispatch contexts, target zones of write requests can be locked before write request removal from the scheduler. In effect, this results in the same behavior as the legacy scsi path. Sequential write ordering is preserved. 
The changes to mq-deadline do not affect regular disks: the same scheduling behavior is maintained for these. The modifications are also optimized to not lock conventional zones. To do so, additional data is introduced in the request queue structure so that the low level scsi code can pass upward information such as the total number of zones and zone types of the device. The availability of this new data avoids difficulties in accessing this information from the I/O scheduler initialization method (init_queue() method) context. Of note is that the last patch of this series removes setting mq-deadline as the default scheduler for block devices with a single hardware queue. The reason for this is that setting the default scheduler is done very early in the device initialization sequence, when the disk characteristics are not yet known. This results in mq-deadline not correctly setting the default zone write locking behavior based on the device zoning model. Setting of a default I/O scheduler can be done easily with udev rules later in the system initialization process, leading to correct default settings for zoned block devices. Comments are as always very much appreciated. Changes from v5: * Refactor patches to introduce the zone_lock spinlock only when needed * Addressed Bart's comments (in particular explanations of the zone_lock spinlock use) Changes from v4: * Various fixes and improvements (From Christoph's comments) * Dropped zones_wlock scheduler tunable attribute Changes from v3: * Integrated support directly into mq-deadline instead of creating a new I/O scheduler. 
* Disable setting of default mq scheduler for single queue devices Changes from v2: * Introduced blk_zoned structure * Moved I/O scheduler from drivers/scsi to block Changes from v1: * Addressed Bart's comments for the blk-mq patches (declarations files) * Split (former) patch 4 into multiple patches to facilitate review * Fixed scsi disk lookup from io scheduler by introducing scsi_disk_from_queue() Damien Le Moal (14): scsi: sd_zbc: Move ZBC declarations to scsi_proto.h scsi: sd_zbc: Fix comments and indentation scsi: sd_zbc: Rearrange code scsi: sd_zbc: Use well defined macros scsi: sd_zbc: Fix sd_zbc_read_zoned_characteristics() block: Add zoned block device information to request queue scsi: sd_zbc: Initialize device request queue zoned data scsi: sd_zbc: Limit zone write locking to sequential zones scsi: sd_zbc: Disable zone write locking with scsi-mq block: mq-deadline: Add zoned block device data block: mq-deadline: Introduce dispatch helpers block: mq-deadline: Introduce zone locking support block: mq-deadline: Limit write request dispatch for zoned block devices block: do not set mq default scheduler block/elevator.c | 17 +-- block/mq-deadline.c | 271 +++-- drivers/scsi/scsi_lib.c | 5 +- drivers/scsi/sd_zbc.c | 334 +++--- include/linux/blkdev.h| 53 include/scsi/scsi_proto.h | 45 +-
[PATCH V6 07/14] scsi: sd_zbc: Initialize device request queue zoned data
Initialize the seq_zone_bitmap and nr_zones fields of the disk request queue on disk revalidate. As the seq_zone_bitmap allocation is identical to the allocation of the zone write lock bitmap, introduce the helper sd_zbc_alloc_zone_bitmap(). Using this helper, wait for the disk capacity and number of zones to stabilize on the second revalidation pass to allocate and initialize the bitmaps. Signed-off-by: Damien Le Moal --- drivers/scsi/sd_zbc.c | 138 -- 1 file changed, 133 insertions(+), 5 deletions(-) diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c index 27793b9f54c0..4ccb20074cb3 100644 --- a/drivers/scsi/sd_zbc.c +++ b/drivers/scsi/sd_zbc.c @@ -586,8 +586,121 @@ static int sd_zbc_check_zone_size(struct scsi_disk *sdkp) return 0; } +/** + * sd_zbc_alloc_zone_bitmap - Allocate a zone bitmap (one bit per zone). + * @sdkp: The disk of the bitmap + */ +static inline unsigned long *sd_zbc_alloc_zone_bitmap(struct scsi_disk *sdkp) +{ + struct request_queue *q = sdkp->disk->queue; + + return kzalloc_node(BITS_TO_LONGS(sdkp->nr_zones) + * sizeof(unsigned long), + GFP_KERNEL, q->node); +} + +/** + * sd_zbc_get_seq_zones - Parse report zones reply to identify sequential zones + * @sdkp: disk used + * @buf: report reply buffer + * @seq_zone_bitamp: bitmap of sequential zones to set + * + * Parse reported zone descriptors in @buf to identify sequential zones and + * set the reported zone bit in @seq_zone_bitmap accordingly. + * Since read-only and offline zones cannot be written, do not + * mark them as sequential in the bitmap. + * Return the LBA after the last zone reported. 
+ */ +static sector_t sd_zbc_get_seq_zones(struct scsi_disk *sdkp, unsigned char *buf, +unsigned int buflen, +unsigned long *seq_zone_bitmap) +{ + sector_t lba, next_lba = sdkp->capacity; + unsigned int buf_len, list_length; + unsigned char *rec; + u8 type, cond; + + list_length = get_unaligned_be32(&buf[0]) + 64; + buf_len = min(list_length, buflen); + rec = buf + 64; + + while (rec < buf + buf_len) { + type = rec[0] & 0x0f; + cond = (rec[1] >> 4) & 0xf; + lba = get_unaligned_be64(&rec[16]); + if (type != ZBC_ZONE_TYPE_CONV && + cond != ZBC_ZONE_COND_READONLY && + cond != ZBC_ZONE_COND_OFFLINE) + set_bit(lba >> sdkp->zone_shift, seq_zone_bitmap); + next_lba = lba + get_unaligned_be64(&rec[8]); + rec += 64; + } + + return next_lba; +} + +/** + * sd_zbc_setup_seq_zone_bitmap - Initialize the disk seq zone bitmap. + * @sdkp: target disk + * + * Allocate a zone bitmap and initialize it by identifying sequential zones. + */ +static int sd_zbc_setup_seq_zone_bitmap(struct scsi_disk *sdkp) +{ + struct request_queue *q = sdkp->disk->queue; + unsigned long *seq_zone_bitmap; + sector_t lba = 0; + unsigned char *buf; + int ret = -ENOMEM; + + seq_zone_bitmap = sd_zbc_alloc_zone_bitmap(sdkp); + if (!seq_zone_bitmap) + return -ENOMEM; + + buf = kmalloc(SD_ZBC_BUF_SIZE, GFP_KERNEL); + if (!buf) + goto out; + + while (lba < sdkp->capacity) { + ret = sd_zbc_report_zones(sdkp, buf, SD_ZBC_BUF_SIZE, lba); + if (ret) + goto out; + lba = sd_zbc_get_seq_zones(sdkp, buf, SD_ZBC_BUF_SIZE, + seq_zone_bitmap); + } + + if (lba != sdkp->capacity) { + /* Something went wrong */ + ret = -EIO; + } + +out: + kfree(buf); + if (ret) { + kfree(seq_zone_bitmap); + return ret; + } + + q->seq_zone_bitmap = seq_zone_bitmap; + + return 0; +} + +static void sd_zbc_cleanup(struct scsi_disk *sdkp) +{ + struct request_queue *q = sdkp->disk->queue; + + kfree(q->seq_zone_bitmap); + q->seq_zone_bitmap = NULL; + + kfree(sdkp->zones_wlock); + sdkp->zones_wlock = NULL; +} + static int sd_zbc_setup(struct 
scsi_disk *sdkp) { + struct request_queue *q = sdkp->disk->queue; + int ret; /* READ16/WRITE16 is mandatory for ZBC disks */ sdkp->device->use_16_for_rw = 1; @@ -599,14 +712,29 @@ static int sd_zbc_setup(struct scsi_disk *sdkp) sdkp->nr_zones = round_up(sdkp->capacity, sdkp->zone_blocks) >> sdkp->zone_shift; + /* +* Wait for the disk capacity to stabilize before +* initializing zone related information. +*/ + if (sdkp->first_scan) + return 0; + if (!sdkp->zones_wlock) { - sdkp->zones_wlock = kcalloc(BITS_TO_LONGS(sdkp->nr_zones), -
[PATCH V6 05/14] scsi: sd_zbc: Fix sd_zbc_read_zoned_characteristics()
The three values starting at byte 8 of the Zoned Block Device Characteristics VPD page B6h are 32 bits values, not 64bits. So use get_unaligned_be32() to retrieve the values and not get_unaligned_be64() Fixes: 89d947561077 ("sd: Implement support for ZBC devices") Cc: Signed-off-by: Damien Le Moal Reviewed-by: Bart Van Assche Reviewed-by: Johannes Thumshirn Reviewed-by: Christoph Hellwig --- drivers/scsi/sd_zbc.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c index bbad851c1789..27793b9f54c0 100644 --- a/drivers/scsi/sd_zbc.c +++ b/drivers/scsi/sd_zbc.c @@ -423,15 +423,15 @@ static int sd_zbc_read_zoned_characteristics(struct scsi_disk *sdkp, if (sdkp->device->type != TYPE_ZBC) { /* Host-aware */ sdkp->urswrz = 1; - sdkp->zones_optimal_open = get_unaligned_be64(&buf[8]); - sdkp->zones_optimal_nonseq = get_unaligned_be64(&buf[12]); + sdkp->zones_optimal_open = get_unaligned_be32(&buf[8]); + sdkp->zones_optimal_nonseq = get_unaligned_be32(&buf[12]); sdkp->zones_max_open = 0; } else { /* Host-managed */ sdkp->urswrz = buf[4] & 1; sdkp->zones_optimal_open = 0; sdkp->zones_optimal_nonseq = 0; - sdkp->zones_max_open = get_unaligned_be64(&buf[16]); + sdkp->zones_max_open = get_unaligned_be32(&buf[16]); } return 0; -- 2.13.6