RE: [RFC PATCH V4 2/2] scsi: core: don't limit per-LUN queue depth for SSD
> On Fri, Oct 18, 2019 at 12:00:07AM +0530, Kashyap Desai wrote:
> > > On 10/9/19 2:32 AM, Ming Lei wrote:
> > > > @@ -354,7 +354,8 @@ void scsi_device_unbusy(struct scsi_device *sdev,
> > > >  		struct scsi_cmnd *cmd)
> > > >  	if (starget->can_queue > 0)
> > > >  		atomic_dec(&starget->target_busy);
> > > >
> > > > -	atomic_dec(&sdev->device_busy);
> > > > +	if (!blk_queue_nonrot(sdev->request_queue))
> > > > +		atomic_dec(&sdev->device_busy);
> > > >  }
> > >
> > > Hi Ming,
> > >
> > > Does this patch impact the meaning of the queue_depth sysfs attribute
> > > (see also sdev_store_queue_depth()) and also the queue depth ramp
> > > up/down mechanism (see also scsi_handle_queue_ramp_up())? Have you
> > > considered enabling/disabling busy tracking per LUN depending on
> > > whether or not sdev->queue_depth < shost->can_queue?
> > >
> > > The megaraid and mpt3sas drivers read sdev->device_busy directly. Is
> > > the current version of this patch compatible with these drivers?
> >
> > We need to know the per-scsi-device outstanding count in the mpt3sas
> > and megaraid_sas drivers.
>
> Is the READ done in the fast path or the slow path? If it is in the slow
> path, it should be easy to do via blk_mq_in_flight_rw().

The READ is done in the fast path.

> > Can we get a supporting API from the block layer (through the SML)?
> > Something similar to "atomic_read(&hctx->nr_active)", which can be
> > derived from sdev->request_queue->hctx?
> > At least for those drivers with nr_hw_queue = 1 it will be useful, and
> > we can avoid the sdev->device_busy dependency.
>
> If you mean adding a new atomic counter, that just moves .device_busy
> into blk-mq, which can become the new bottleneck.

How about the below? We define and use the API below instead of
"atomic_read(&scp->device->device_busy)", and it is giving the expected
value. I have not captured the performance impact on the max-IOPS profile.
static inline unsigned long sdev_nr_inflight_request(struct request_queue *q)
{
	struct blk_mq_hw_ctx *hctx;
	unsigned long nr_requests = 0;
	int i;

	queue_for_each_hw_ctx(q, hctx, i)
		nr_requests += atomic_read(&hctx->nr_active);

	return nr_requests;
}

Kashyap

>
> thanks,
> Ming
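[Editor's note: the idea behind the proposed helper, summing per-hw-queue nr_active counters instead of maintaining one shared per-LUN counter, can be sketched in userspace. The `mock_*` structs below are stand-ins for the kernel's struct blk_mq_hw_ctx and struct request_queue, not the real API:]

```c
#include <stdatomic.h>

/* Userspace stand-ins for struct blk_mq_hw_ctx / struct request_queue. */
struct mock_hctx {
	atomic_int nr_active;		/* requests in flight on this hw queue */
};

struct mock_queue {
	struct mock_hctx *hctxs;
	int nr_hw_queues;
};

/* Same shape as the proposed sdev_nr_inflight_request(): walk every hw
 * queue and sum the per-queue counters on demand, so the submission fast
 * path never touches a shared per-LUN atomic. */
static unsigned long mock_nr_inflight(struct mock_queue *q)
{
	unsigned long nr_requests = 0;

	for (int i = 0; i < q->nr_hw_queues; i++)
		nr_requests += atomic_load(&q->hctxs[i].nr_active);
	return nr_requests;
}

/* Small demo: three hw queues with 2, 0 and 5 requests in flight. */
static struct mock_hctx sample_hctxs[3];
static struct mock_queue sample_queue = { sample_hctxs, 3 };

static unsigned long sample_inflight(void)
{
	atomic_store(&sample_hctxs[0].nr_active, 2);
	atomic_store(&sample_hctxs[1].nr_active, 0);
	atomic_store(&sample_hctxs[2].nr_active, 5);
	return mock_nr_inflight(&sample_queue);
}
```

The read is O(nr_hw_queues), which is why Ming asks above whether the driver performs it on the fast path or the slow path.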
Re: [RFC PATCH V4 2/2] scsi: core: don't limit per-LUN queue depth for SSD
On Fri, Oct 18, 2019 at 12:00:07AM +0530, Kashyap Desai wrote:
> > On 10/9/19 2:32 AM, Ming Lei wrote:
> > > @@ -354,7 +354,8 @@ void scsi_device_unbusy(struct scsi_device *sdev,
> > >  		struct scsi_cmnd *cmd)
> > >  	if (starget->can_queue > 0)
> > >  		atomic_dec(&starget->target_busy);
> > >
> > > -	atomic_dec(&sdev->device_busy);
> > > +	if (!blk_queue_nonrot(sdev->request_queue))
> > > +		atomic_dec(&sdev->device_busy);
> > >  }
> >
> > Hi Ming,
> >
> > Does this patch impact the meaning of the queue_depth sysfs attribute
> > (see also sdev_store_queue_depth()) and also the queue depth ramp
> > up/down mechanism (see also scsi_handle_queue_ramp_up())? Have you
> > considered enabling/disabling busy tracking per LUN depending on
> > whether or not sdev->queue_depth < shost->can_queue?
> >
> > The megaraid and mpt3sas drivers read sdev->device_busy directly. Is
> > the current version of this patch compatible with these drivers?
>
> We need to know the per-scsi-device outstanding count in the mpt3sas and
> megaraid_sas drivers.

Is the READ done in the fast path or the slow path? If it is in the slow
path, it should be easy to do via blk_mq_in_flight_rw().

> Can we get a supporting API from the block layer (through the SML)?
> Something similar to "atomic_read(&hctx->nr_active)", which can be
> derived from sdev->request_queue->hctx?
> At least for those drivers with nr_hw_queue = 1 it will be useful, and
> we can avoid the sdev->device_busy dependency.

If you mean adding a new atomic counter, that just moves .device_busy into
blk-mq, which can become the new bottleneck.

thanks,
Ming
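[Editor's note: the bottleneck trade-off Ming describes can be illustrated in userspace. A single shared atomic forces every submitting CPU to bounce the same cache line, while sharded counters keep writers apart and pay the cost only on a slow-path read, which is the same trade-off the patch makes by dropping sdev->device_busy and summing per-queue state on demand. The shard count and padding below are made-up illustrative values:]

```c
#include <stdatomic.h>

#define NR_SHARDS 8
#define CACHE_LINE 64

/* One counter per shard, padded onto its own cache line so writers on
 * different CPUs do not contend -- the reason one shared atomic counter
 * hurts on a big NUMA machine with many cores. */
struct shard {
	atomic_long count;
	char pad[CACHE_LINE - sizeof(atomic_long)];
};

static struct shard inflight[NR_SHARDS];

/* Fast path: touch only this CPU's shard. */
static void inflight_inc(int cpu)
{
	atomic_fetch_add(&inflight[cpu % NR_SHARDS].count, 1);
}

static void inflight_dec(int cpu)
{
	atomic_fetch_sub(&inflight[cpu % NR_SHARDS].count, 1);
}

/* Slow path: sum all shards on demand instead of keeping one global
 * counter continuously up to date. */
static long inflight_read(void)
{
	long sum = 0;

	for (int i = 0; i < NR_SHARDS; i++)
		sum += atomic_load(&inflight[i].count);
	return sum;
}
```

This is why the fast-path/slow-path question matters: the sharded layout makes writes cheap but turns each read into a walk over all shards.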
RE: [RFC PATCH V4 2/2] scsi: core: don't limit per-LUN queue depth for SSD
> On 10/9/19 2:32 AM, Ming Lei wrote:
> > @@ -354,7 +354,8 @@ void scsi_device_unbusy(struct scsi_device *sdev,
> >  		struct scsi_cmnd *cmd)
> >  	if (starget->can_queue > 0)
> >  		atomic_dec(&starget->target_busy);
> >
> > -	atomic_dec(&sdev->device_busy);
> > +	if (!blk_queue_nonrot(sdev->request_queue))
> > +		atomic_dec(&sdev->device_busy);
> >  }
>
> Hi Ming,
>
> Does this patch impact the meaning of the queue_depth sysfs attribute
> (see also sdev_store_queue_depth()) and also the queue depth ramp
> up/down mechanism (see also scsi_handle_queue_ramp_up())? Have you
> considered enabling/disabling busy tracking per LUN depending on
> whether or not sdev->queue_depth < shost->can_queue?
>
> The megaraid and mpt3sas drivers read sdev->device_busy directly. Is the
> current version of this patch compatible with these drivers?

We need to know the per-scsi-device outstanding count in the mpt3sas and
megaraid_sas drivers.

Can we get a supporting API from the block layer (through the SML)?
Something similar to "atomic_read(&hctx->nr_active)", which can be derived
from sdev->request_queue->hctx?

At least for those drivers with nr_hw_queue = 1 it will be useful, and we
can avoid the sdev->device_busy dependency.

Kashyap

>
> Thanks,
>
> Bart.
Re: [RFC PATCH V4 2/2] scsi: core: don't limit per-LUN queue depth for SSD
On Wed, Oct 09, 2019 at 09:05:26AM -0700, Bart Van Assche wrote:
> On 10/9/19 2:32 AM, Ming Lei wrote:
> > @@ -354,7 +354,8 @@ void scsi_device_unbusy(struct scsi_device *sdev,
> >  		struct scsi_cmnd *cmd)
> >  	if (starget->can_queue > 0)
> >  		atomic_dec(&starget->target_busy);
> > -	atomic_dec(&sdev->device_busy);
> > +	if (!blk_queue_nonrot(sdev->request_queue))
> > +		atomic_dec(&sdev->device_busy);
> >  }
>
> Hi Ming,
>
> Does this patch impact the meaning of the queue_depth sysfs attribute
> (see also sdev_store_queue_depth()) and also the queue depth ramp
> up/down mechanism (see also scsi_handle_queue_ramp_up())? Have you
> considered to

This patch only stops tracking in-flight SCSI commands per LUN, so it
doesn't affect the meaning of sdev->queue_depth, which is simply bypassed
for SSDs.

> enable/disable busy tracking per LUN depending on whether or not
> sdev->queue_depth < shost->can_queue?

Yeah, we could do that, but it isn't contradictory with this patch, :-)
And we can apply this patch even when sdev->queue_depth < shost->can_queue.

Usually sdev->queue_depth is used for the following reasons:

1) the limit isn't a hard limit; it may improve IO merging at the cost of
IO latency;

2) fair dispatch among LUNs.

This patch just tries to not apply the per-LUN queue depth for SSDs,
because:

1) fair dispatch is already done by blk-mq when allocating driver tags;

2) IO merging doesn't play a big role for SSDs, and IO latency can be
increased by applying a per-LUN queue depth;

3) NVMe doesn't apply a per-namespace queue depth.

> The megaraid and mpt3sas drivers read sdev->device_busy directly. Is the
> current version of this patch compatible with these drivers?

For megaraid, sdev->device_busy is checked when choosing the reply queue;
that approach shouldn't be better than using the default managed-IRQ
mapping. Similar usage is done in _base_get_high_iops_msix_index().
The only other effect may be in scsih_dev_reset(), where the 'SUCCESS'
message is dumped even though there are in-flight commands on the LUN, but
that can be fixed easily.

Thanks,
Ming
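[Editor's note: the fair-dispatch point above refers to hctx_may_queue() in the block layer, which divides the shared driver-tag space among the queues that are actively submitting. The sketch below is a simplified userspace model of that check; the constants and function shapes are illustrative, not the exact kernel code:]

```c
/* Simplified model of blk-mq's fairness when driver tags are shared:
 * each active queue is allowed roughly total_tags / active_queues tags,
 * with a small floor, so one busy LUN cannot starve the others. */
static unsigned int fair_depth(unsigned int total_tags,
			       unsigned int active_queues)
{
	unsigned int depth;

	if (active_queues == 0)
		return total_tags;
	/* Round up, then apply an illustrative floor of 4 tags. */
	depth = (total_tags + active_queues - 1) / active_queues;
	return depth < 4 ? 4 : depth;
}

/* May this queue take one more driver tag right now? */
static int may_queue(unsigned int queue_busy, unsigned int total_tags,
		     unsigned int active_queues)
{
	return queue_busy < fair_depth(total_tags, active_queues);
}
```

Because this limit already exists at tag allocation time, a second per-LUN counter in the SCSI core duplicates the fairness work for devices that don't need the merge benefit.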
Re: [RFC PATCH V4 2/2] scsi: core: don't limit per-LUN queue depth for SSD
On 10/9/19 2:32 AM, Ming Lei wrote:
> @@ -354,7 +354,8 @@ void scsi_device_unbusy(struct scsi_device *sdev,
>  		struct scsi_cmnd *cmd)
>  	if (starget->can_queue > 0)
>  		atomic_dec(&starget->target_busy);
>
> -	atomic_dec(&sdev->device_busy);
> +	if (!blk_queue_nonrot(sdev->request_queue))
> +		atomic_dec(&sdev->device_busy);
>  }

Hi Ming,

Does this patch impact the meaning of the queue_depth sysfs attribute (see
also sdev_store_queue_depth()) and also the queue depth ramp up/down
mechanism (see also scsi_handle_queue_ramp_up())? Have you considered to
enable/disable busy tracking per LUN depending on whether or not
sdev->queue_depth < shost->can_queue?

The megaraid and mpt3sas drivers read sdev->device_busy directly. Is the
current version of this patch compatible with these drivers?

Thanks,

Bart.
[RFC PATCH V4 2/2] scsi: core: don't limit per-LUN queue depth for SSD
SCSI core uses the atomic variable sdev->device_busy to track in-flight IO
requests dispatched to this scsi device. IO requests may be submitted from
any CPU, so the cost of maintaining this shared atomic counter can be very
high on a big NUMA machine with lots of CPU cores.

sdev->queue_depth is usually used for two purposes: 1) improving IO
merging; 2) spreading IO requests fairly among all LUNs.

blk-mq already provides fair request allocation among all active shared
request queues (LUNs), see hctx_may_queue().

NVMe doesn't have such a per-request-queue (namespace) queue depth, so it
is reasonable to ignore the limit for SCSI SSDs too. Also, IO merging
doesn't play a big role in reaching top SSD performance.

With this patch, the big cost of tracking in-flight per-LUN requests via
an atomic variable can be saved.

Cc: Jens Axboe
Cc: Ewan D. Milne
Cc: Omar Sandoval
Cc: "Martin K. Petersen"
Cc: James Bottomley
Cc: Christoph Hellwig
Cc: Kashyap Desai
Cc: Hannes Reinecke
Cc: Laurence Oberman
Cc: Bart Van Assche
Signed-off-by: Ming Lei
---
 drivers/scsi/scsi_lib.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index b6f66dcb15a5..b8f0898a15e4 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -354,7 +354,8 @@ void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd)
 	if (starget->can_queue > 0)
 		atomic_dec(&starget->target_busy);
 
-	atomic_dec(&sdev->device_busy);
+	if (!blk_queue_nonrot(sdev->request_queue))
+		atomic_dec(&sdev->device_busy);
 }
 
 static void scsi_kick_queue(struct request_queue *q)
@@ -410,7 +411,8 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
 
 static inline bool scsi_device_is_busy(struct scsi_device *sdev)
 {
-	if (atomic_read(&sdev->device_busy) >= sdev->queue_depth)
+	if (!blk_queue_nonrot(sdev->request_queue) &&
+	    atomic_read(&sdev->device_busy) >= sdev->queue_depth)
 		return true;
 	if (atomic_read(&sdev->device_blocked) > 0)
 		return true;
@@ -1283,8 +1285,12 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
 				  struct scsi_device *sdev)
 {
 	unsigned int busy;
+	bool bypass = blk_queue_nonrot(sdev->request_queue);
 
-	busy = atomic_inc_return(&sdev->device_busy) - 1;
+	if (!bypass)
+		busy = atomic_inc_return(&sdev->device_busy) - 1;
+	else
+		busy = 0;
 	if (atomic_read(&sdev->device_blocked)) {
 		if (busy)
 			goto out_dec;
@@ -1298,12 +1304,16 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
 				   "unblocking device at zero depth\n"));
 	}
 
+	if (bypass)
+		return 1;
+
 	if (busy >= sdev->queue_depth)
 		goto out_dec;
 
 	return 1;
 out_dec:
-	atomic_dec(&sdev->device_busy);
+	if (!bypass)
+		atomic_dec(&sdev->device_busy);
 	return 0;
 }
 
@@ -1624,7 +1634,8 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
 	struct request_queue *q = hctx->queue;
 	struct scsi_device *sdev = q->queuedata;
 
-	atomic_dec(&sdev->device_busy);
+	if (!blk_queue_nonrot(sdev->request_queue))
+		atomic_dec(&sdev->device_busy);
 }
 
 static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
@@ -1706,7 +1717,8 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
-		if (atomic_read(&sdev->device_busy) ||
+		if ((!blk_queue_nonrot(sdev->request_queue) &&
+		     atomic_read(&sdev->device_busy)) ||
 		    scsi_device_blocked(sdev))
 			ret = BLK_STS_DEV_RESOURCE;
 		break;
-- 
2.20.1
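[Editor's note: the behavior change in scsi_dev_queue_ready() can be condensed into a userspace model. Plain C11 atomics stand in for the kernel's atomic_t, the `mock_sdev` struct is a stand-in for the relevant fields of struct scsi_device, `nonrot` models blk_queue_nonrot(), and the device_blocked handling is omitted:]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for the struct scsi_device fields used here. */
struct mock_sdev {
	atomic_int device_busy;
	int queue_depth;
	bool nonrot;	/* models blk_queue_nonrot(sdev->request_queue) */
};

/* Mirrors the patched scsi_dev_queue_ready() logic: SSDs (nonrot) skip
 * the per-LUN counter entirely; rotational devices still increment it
 * and back off once queue_depth is reached. Returns 1 if the request
 * may be dispatched, 0 otherwise. */
static int mock_dev_queue_ready(struct mock_sdev *sdev)
{
	bool bypass = sdev->nonrot;
	int busy;

	if (bypass)
		return 1;			/* no per-LUN depth limit */

	busy = atomic_fetch_add(&sdev->device_busy, 1);
	if (busy >= sdev->queue_depth) {
		atomic_fetch_sub(&sdev->device_busy, 1);
		return 0;
	}
	return 1;
}

/* Demo: a rotational LUN with queue_depth 2 accepts two dispatches and
 * refuses the third; an SSD accepts all three. */
static int demo_hdd_accepts(void)
{
	struct mock_sdev hdd = { .queue_depth = 2, .nonrot = false };
	int ok = 0;

	for (int i = 0; i < 3; i++)
		ok += mock_dev_queue_ready(&hdd);
	return ok;
}

static int demo_ssd_accepts(void)
{
	struct mock_sdev ssd = { .queue_depth = 2, .nonrot = true };
	int ok = 0;

	for (int i = 0; i < 3; i++)
		ok += mock_dev_queue_ready(&ssd);
	return ok;
}
```

The model makes the cost argument concrete: on the bypass path there is no shared read-modify-write at all, so submission from many CPUs scales without bouncing the device_busy cache line.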