RE: [PATCH v1] scsi: storvsc: Cap cmd_per_lun at can_queue
From: John Garry Sent: Tuesday, March 9, 2021 8:36 AM > > On 09/03/2021 15:57, Michael Kelley wrote: > > From: John Garry Sent: Tuesday, March 9, 2021 2:10 > > AM > >> > >> On 08/03/2021 17:56, Melanie Plageman wrote: > >>> On Mon, Mar 08, 2021 at 02:37:40PM +, Michael Kelley wrote: > From: Melanie Plageman (Microsoft) Sent: > Friday, > >> March 5, 2021 3:22 PM > > > > The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during > > allocation. > > > > Cap cmd_per_lun at can_queue to avoid dispatch errors. > > > > Signed-off-by: Melanie Plageman (Microsoft) > > --- > >drivers/scsi/storvsc_drv.c | 2 ++ > >1 file changed, 2 insertions(+) > > > > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c > > index 6bc5453cea8a..d7953a6e00e6 100644 > > --- a/drivers/scsi/storvsc_drv.c > > +++ b/drivers/scsi/storvsc_drv.c > > @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device, > > (max_sub_channels + 1) * > > (100 - ring_avail_percent_lowater) / > > 100; > > > > + scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, > >> scsi_driver.can_queue); > > + > > I'm not sure what you mean by "avoid dispatch errors". Can you > elaborate? > >>> > >>> The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set > >>> Scsi_Host->cmd_per_lun in storvsc_probe(). > >>> > >>> In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is > >>> called and sets the scsi_device->queue_depth to the Scsi_Host's > >>> cmd_per_lun with this code: > >>> > >>> scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ? > >>> sdev->host->cmd_per_lun : 1); > >>> > >>> During dispatch, the scsi_device->queue_depth is used in > >>> scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine > >>> whether or not the device can queue another command. > >>> > >>> On some machines, with the 2048 value of cmd_per_lun that was used to > >>> set the initial scsi_device->queue_depth, commands can be queued that > >>> are later not able to be dispatched after running out of space in the > >>> ringbuffer. > >>> > >>> On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD > >>> (running an fio workload that I can provide if needed), storvsc_do_io() > >>> ends up often returning SCSI_MLQUEUE_DEVICE_BUSY. > >>> > >>> This is the call stack: > >>> > >>> hv_get_bytes_to_write > >>> hv_ringbuffer_write > >>> vmbus_send_packet > >>> storvsc_dio_io > >>> storvsc_queuecommand > >>> scsi_dispatch_cmd > >>> scsi_queue_rq > >>> dispatch_rq_list > >>> > Be aware that the calculation of "can_queue" in this driver is somewhat > flawed -- it should not be based on the size of the ring buffer, but > instead on > the maximum number of requests Hyper-V will queue. And even then, > can_queue doesn't provide the cap you might expect because the blk-mq > layer > allocates can_queue tags for each HW queue, not as a total. > >>> > >>> > >>> The docs for scsi_mid_low_api document Scsi_Host can_queue this way: > >>> > >>> can_queue > >>> - must be greater than 0; do not send more than can_queue > >>> commands to the adapter. > >>> > >>> I did notice that in scsi_host.h, the comment for can_queue does say > >>> can_queue is the "maximum number of simultaneous commands a single hw > >>> queue in HBA will accept." However, I don't see it being used this way > >>> in the code. > >>> > >> > >> JFYI, the block layer ensures that no more than can_queue requests are > >> sent to the host. See scsi_mq_setup_tags(), and how the tagset queue > >> depth is set to shost->can_queue. > >> > >> Thanks, > >> John > > > > Agree on what's in scsi_mq_setup_tags(). But scsi_mq_setup_tags() calls > > blk_mq_alloc_tag_set(), which in turn calls blk_mq_alloc_map_and_requests(), > > which calls __blk_mq_alloc_rq_maps() repeatedly, reducing the tag > > set queue_depth as needed until it succeeds. > > > > The key thing is that __blk_mq_alloc_rq_maps() iterates over the > > number of HW queues calling __blk_mq_alloc_map_and_request(). > > The latter function allocates the map and the requests with a count > > of the tag set's queue_depth. There's no logic to apportion the > > can_queue value across multiple HW queues. So each HW queue gets > > can_queue tags allocated, and the SCSI host driver may see up to > > (can_queue * # HW queues) simultaneous requests. > > > > I'm certainly not an expert in this area, but that's what I see in the > > code. We've run live experiments, and can see the number > > simultaneous requests sent to the storvsc driver be greater than > > can_queue when the # of HW queues is greater than 1, which seems > > to be consistent with the code. > > ah, ok. I assumed that # of HW queues = 1 here. So you'r
Re: [PATCH v1] scsi: storvsc: Cap cmd_per_lun at can_queue
On 09/03/2021 15:57, Michael Kelley wrote: From: John Garry Sent: Tuesday, March 9, 2021 2:10 AM On 08/03/2021 17:56, Melanie Plageman wrote: On Mon, Mar 08, 2021 at 02:37:40PM +, Michael Kelley wrote: From: Melanie Plageman (Microsoft) Sent: Friday, March 5, 2021 3:22 PM The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during allocation. Cap cmd_per_lun at can_queue to avoid dispatch errors. Signed-off-by: Melanie Plageman (Microsoft) --- drivers/scsi/storvsc_drv.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c index 6bc5453cea8a..d7953a6e00e6 100644 --- a/drivers/scsi/storvsc_drv.c +++ b/drivers/scsi/storvsc_drv.c @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device, (max_sub_channels + 1) * (100 - ring_avail_percent_lowater) / 100; + scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, scsi_driver.can_queue); + I'm not sure what you mean by "avoid dispatch errors". Can you elaborate? The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set Scsi_Host->cmd_per_lun in storvsc_probe(). In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is called and sets the scsi_device->queue_depth to the Scsi_Host's cmd_per_lun with this code: scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ? sdev->host->cmd_per_lun : 1); During dispatch, the scsi_device->queue_depth is used in scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine whether or not the device can queue another command. On some machines, with the 2048 value of cmd_per_lun that was used to set the initial scsi_device->queue_depth, commands can be queued that are later not able to be dispatched after running out of space in the ringbuffer. On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD (running an fio workload that I can provide if needed), storvsc_do_io() ends up often returning SCSI_MLQUEUE_DEVICE_BUSY. This is the call stack: hv_get_bytes_to_write hv_ringbuffer_write vmbus_send_packet storvsc_dio_io storvsc_queuecommand scsi_dispatch_cmd scsi_queue_rq dispatch_rq_list Be aware that the calculation of "can_queue" in this driver is somewhat flawed -- it should not be based on the size of the ring buffer, but instead on the maximum number of requests Hyper-V will queue. And even then, can_queue doesn't provide the cap you might expect because the blk-mq layer allocates can_queue tags for each HW queue, not as a total. The docs for scsi_mid_low_api document Scsi_Host can_queue this way: can_queue - must be greater than 0; do not send more than can_queue commands to the adapter. I did notice that in scsi_host.h, the comment for can_queue does say can_queue is the "maximum number of simultaneous commands a single hw queue in HBA will accept." However, I don't see it being used this way in the code. JFYI, the block layer ensures that no more than can_queue requests are sent to the host. See scsi_mq_setup_tags(), and how the tagset queue depth is set to shost->can_queue. Thanks, John Agree on what's in scsi_mq_setup_tags(). But scsi_mq_setup_tags() calls blk_mq_alloc_tag_set(), which in turn calls blk_mq_alloc_map_and_requests(), which calls __blk_mq_alloc_rq_maps() repeatedly, reducing the tag set queue_depth as needed until it succeeds. The key thing is that __blk_mq_alloc_rq_maps() iterates over the number of HW queues calling __blk_mq_alloc_map_and_request(). The latter function allocates the map and the requests with a count of the tag set's queue_depth. There's no logic to apportion the can_queue value across multiple HW queues. So each HW queue gets can_queue tags allocated, and the SCSI host driver may see up to (can_queue * # HW queues) simultaneous requests. I'm certainly not an expert in this area, but that's what I see in the code. We've run live experiments, and can see the number simultaneous requests sent to the storvsc driver be greater than can_queue when the # of HW queues is greater than 1, which seems to be consistent with the code. ah, ok. I assumed that # of HW queues = 1 here. So you're describing a problem similar to https://lore.kernel.org/linux-scsi/b3e4e597-779b-7c1e-0d3c-07bc3dab1...@huawei.com/ So if you check nr_hw_queues comment in include/scsi/scsi_host.h, it reads: the total queue depth per host is nr_hw_queues * can_queue. However, for when host_tagset is set, the total queue depth is can_queue. Setting .host_tagset will ensure at most can_queue requests will be sent over all HW queues at any given time. A few SCSI MQ drivers set this now. Thanks, John Michael During dispatch, In scsi_target_queue_ready(), there is this code: if (busy >= starget->can_queue) goto starved; And the scsi_target->can_queue value should be coming from Scsi_
RE: [PATCH v1] scsi: storvsc: Cap cmd_per_lun at can_queue
From: John Garry Sent: Tuesday, March 9, 2021 2:10 AM > > On 08/03/2021 17:56, Melanie Plageman wrote: > > On Mon, Mar 08, 2021 at 02:37:40PM +, Michael Kelley wrote: > >> From: Melanie Plageman (Microsoft) Sent: > >> Friday, > March 5, 2021 3:22 PM > >>> > >>> The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during > >>> allocation. > >>> > >>> Cap cmd_per_lun at can_queue to avoid dispatch errors. > >>> > >>> Signed-off-by: Melanie Plageman (Microsoft) > >>> --- > >>> drivers/scsi/storvsc_drv.c | 2 ++ > >>> 1 file changed, 2 insertions(+) > >>> > >>> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c > >>> index 6bc5453cea8a..d7953a6e00e6 100644 > >>> --- a/drivers/scsi/storvsc_drv.c > >>> +++ b/drivers/scsi/storvsc_drv.c > >>> @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device, > >>> (max_sub_channels + 1) * > >>> (100 - ring_avail_percent_lowater) / > >>> 100; > >>> > >>> + scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, > scsi_driver.can_queue); > >>> + > >> > >> I'm not sure what you mean by "avoid dispatch errors". Can you elaborate? > > > > The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set > > Scsi_Host->cmd_per_lun in storvsc_probe(). > > > > In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is > > called and sets the scsi_device->queue_depth to the Scsi_Host's > > cmd_per_lun with this code: > > > > scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ? > > sdev->host->cmd_per_lun : 1); > > > > During dispatch, the scsi_device->queue_depth is used in > > scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine > > whether or not the device can queue another command. > > > > On some machines, with the 2048 value of cmd_per_lun that was used to > > set the initial scsi_device->queue_depth, commands can be queued that > > are later not able to be dispatched after running out of space in the > > ringbuffer. > > > > On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD > > (running an fio workload that I can provide if needed), storvsc_do_io() > > ends up often returning SCSI_MLQUEUE_DEVICE_BUSY. > > > > This is the call stack: > > > > hv_get_bytes_to_write > > hv_ringbuffer_write > > vmbus_send_packet > > storvsc_dio_io > > storvsc_queuecommand > > scsi_dispatch_cmd > > scsi_queue_rq > > dispatch_rq_list > > > >> Be aware that the calculation of "can_queue" in this driver is somewhat > >> flawed -- it should not be based on the size of the ring buffer, but > >> instead on > >> the maximum number of requests Hyper-V will queue. And even then, > >> can_queue doesn't provide the cap you might expect because the blk-mq layer > >> allocates can_queue tags for each HW queue, not as a total. > > > > > > The docs for scsi_mid_low_api document Scsi_Host can_queue this way: > > > >can_queue > >- must be greater than 0; do not send more than can_queue > > commands to the adapter. > > > > I did notice that in scsi_host.h, the comment for can_queue does say > > can_queue is the "maximum number of simultaneous commands a single hw > > queue in HBA will accept." However, I don't see it being used this way > > in the code. > > > > JFYI, the block layer ensures that no more than can_queue requests are > sent to the host. See scsi_mq_setup_tags(), and how the tagset queue > depth is set to shost->can_queue. > > Thanks, > John Agree on what's in scsi_mq_setup_tags(). But scsi_mq_setup_tags() calls blk_mq_alloc_tag_set(), which in turn calls blk_mq_alloc_map_and_requests(), which calls __blk_mq_alloc_rq_maps() repeatedly, reducing the tag set queue_depth as needed until it succeeds. The key thing is that __blk_mq_alloc_rq_maps() iterates over the number of HW queues calling __blk_mq_alloc_map_and_request(). The latter function allocates the map and the requests with a count of the tag set's queue_depth. There's no logic to apportion the can_queue value across multiple HW queues. So each HW queue gets can_queue tags allocated, and the SCSI host driver may see up to (can_queue * # HW queues) simultaneous requests. I'm certainly not an expert in this area, but that's what I see in the code. We've run live experiments, and can see the number simultaneous requests sent to the storvsc driver be greater than can_queue when the # of HW queues is greater than 1, which seems to be consistent with the code. Michael > > > > During dispatch, In scsi_target_queue_ready(), there is this code: > > > > if (busy >= starget->can_queue) > > goto starved; > > > > And the scsi_target->can_queue value should be coming from Scsi_host as > > mentioned in the scsi_target definition in scsi_device.h > > /* > >* LLDs should set this in the slave_alloc host template callout. > >* If set to zero then there is not limit. > >
RE: [PATCH v1] scsi: storvsc: Cap cmd_per_lun at can_queue
From: Melanie Plageman Sent: Monday, March 8, 2021 9:56 AM > > On Mon, Mar 08, 2021 at 02:37:40PM +, Michael Kelley wrote: > > From: Melanie Plageman (Microsoft) Sent: > > Friday, March > 5, 2021 3:22 PM > > > > > > The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during > > > allocation. > > > > > > Cap cmd_per_lun at can_queue to avoid dispatch errors. > > > > > > Signed-off-by: Melanie Plageman (Microsoft) > > > --- > > > drivers/scsi/storvsc_drv.c | 2 ++ > > > 1 file changed, 2 insertions(+) > > > > > > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c > > > index 6bc5453cea8a..d7953a6e00e6 100644 > > > --- a/drivers/scsi/storvsc_drv.c > > > +++ b/drivers/scsi/storvsc_drv.c > > > @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device, > > > (max_sub_channels + 1) * > > > (100 - ring_avail_percent_lowater) / 100; > > > > > > + scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, > scsi_driver.can_queue); > > > + > > > > I'm not sure what you mean by "avoid dispatch errors". Can you elaborate? > > The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set > Scsi_Host->cmd_per_lun in storvsc_probe(). > > In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is > called and sets the scsi_device->queue_depth to the Scsi_Host's > cmd_per_lun with this code: > > scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ? > sdev->host->cmd_per_lun : 1); > > During dispatch, the scsi_device->queue_depth is used in > scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine > whether or not the device can queue another command. > > On some machines, with the 2048 value of cmd_per_lun that was used to > set the initial scsi_device->queue_depth, commands can be queued that > are later not able to be dispatched after running out of space in the > ringbuffer. > > On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD > (running an fio workload that I can provide if needed), storvsc_do_io() > ends up often returning SCSI_MLQUEUE_DEVICE_BUSY. > > This is the call stack: > > hv_get_bytes_to_write > hv_ringbuffer_write > vmbus_send_packet > storvsc_dio_io > storvsc_queuecommand > scsi_dispatch_cmd > scsi_queue_rq > dispatch_rq_list > > > Be aware that the calculation of "can_queue" in this driver is somewhat > > flawed -- it should not be based on the size of the ring buffer, but > > instead on > > the maximum number of requests Hyper-V will queue. And even then, > > can_queue doesn't provide the cap you might expect because the blk-mq layer > > allocates can_queue tags for each HW queue, not as a total. > > > The docs for scsi_mid_low_api document Scsi_Host can_queue this way: > > can_queue > - must be greater than 0; do not send more than can_queue > commands to the adapter. > > I did notice that in scsi_host.h, the comment for can_queue does say > can_queue is the "maximum number of simultaneous commands a single hw > queue in HBA will accept." Yes, this comment is correct. The can_queue value is per HW queue. > However, I don't see it being used this way > in the code. > > During dispatch, In scsi_target_queue_ready(), there is this code: > > if (busy >= starget->can_queue) > goto starved; > > And the scsi_target->can_queue value should be coming from Scsi_host as > mentioned in the scsi_target definition in scsi_device.h > /* > * LLDs should set this in the slave_alloc host template callout. > * If set to zero then there is not limit. > */ > unsigned intcan_queue; > > So, I don't really see how this would be per hardware queue. For the storvsc driver, the can_queue value in the scsi_target is initialized to zero in scsi_alloc_target() and it remains unchanged. Maybe I'm missing something, but the only place I see that sets starget->can_queue to a non-zero value is iscsi_target_alloc(). The storvsc slave_alloc() function does not set it. So the test in scsi_target_queue_ready() for exceeding can_queue never executes. We've run live tests, and can see that the number of requests sent to the storvsc driver exceeds the can_queue value when the # of HW queues is greater than 1. That result is consistent with what I see in the code. Michael > > > > > I agree that the cmd_per_lun setting is also too big, but we should fix > > that in > > the context of getting all of these different settings working together > > correctly, > > and not piecemeal. > > > > Capping Scsi_Host->cmd_per_lun to scsi_driver.can_queue during probe > will also prevent the LUN queue_depth from being set to a value that is > higher than it can ever be set to again by the user when > storvsc_change_queue_depth() is invoked. > > Also in scsi_sysfs sdev_store_queue_depth() there is this check: > > if (depth < 1 || depth > sdev->host->can_q
Re: [PATCH v1] scsi: storvsc: Cap cmd_per_lun at can_queue
On 08/03/2021 17:56, Melanie Plageman wrote: On Mon, Mar 08, 2021 at 02:37:40PM +, Michael Kelley wrote: From: Melanie Plageman (Microsoft) Sent: Friday, March 5, 2021 3:22 PM The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during allocation. Cap cmd_per_lun at can_queue to avoid dispatch errors. Signed-off-by: Melanie Plageman (Microsoft) --- drivers/scsi/storvsc_drv.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c index 6bc5453cea8a..d7953a6e00e6 100644 --- a/drivers/scsi/storvsc_drv.c +++ b/drivers/scsi/storvsc_drv.c @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device, (max_sub_channels + 1) * (100 - ring_avail_percent_lowater) / 100; + scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, scsi_driver.can_queue); + I'm not sure what you mean by "avoid dispatch errors". Can you elaborate? The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set Scsi_Host->cmd_per_lun in storvsc_probe(). In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is called and sets the scsi_device->queue_depth to the Scsi_Host's cmd_per_lun with this code: scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ? sdev->host->cmd_per_lun : 1); During dispatch, the scsi_device->queue_depth is used in scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine whether or not the device can queue another command. On some machines, with the 2048 value of cmd_per_lun that was used to set the initial scsi_device->queue_depth, commands can be queued that are later not able to be dispatched after running out of space in the ringbuffer. On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD (running an fio workload that I can provide if needed), storvsc_do_io() ends up often returning SCSI_MLQUEUE_DEVICE_BUSY. This is the call stack: hv_get_bytes_to_write hv_ringbuffer_write vmbus_send_packet storvsc_dio_io storvsc_queuecommand scsi_dispatch_cmd scsi_queue_rq dispatch_rq_list Be aware that the calculation of "can_queue" in this driver is somewhat flawed -- it should not be based on the size of the ring buffer, but instead on the maximum number of requests Hyper-V will queue. And even then, can_queue doesn't provide the cap you might expect because the blk-mq layer allocates can_queue tags for each HW queue, not as a total. The docs for scsi_mid_low_api document Scsi_Host can_queue this way: can_queue - must be greater than 0; do not send more than can_queue commands to the adapter. I did notice that in scsi_host.h, the comment for can_queue does say can_queue is the "maximum number of simultaneous commands a single hw queue in HBA will accept." However, I don't see it being used this way in the code. JFYI, the block layer ensures that no more than can_queue requests are sent to the host. See scsi_mq_setup_tags(), and how the tagset queue depth is set to shost->can_queue. Thanks, John During dispatch, In scsi_target_queue_ready(), there is this code: if (busy >= starget->can_queue) goto starved; And the scsi_target->can_queue value should be coming from Scsi_host as mentioned in the scsi_target definition in scsi_device.h /* * LLDs should set this in the slave_alloc host template callout. * If set to zero then there is not limit. */ unsigned intcan_queue; So, I don't really see how this would be per hardware queue. I agree that the cmd_per_lun setting is also too big, but we should fix that in the context of getting all of these different settings working together correctly, and not piecemeal. Capping Scsi_Host->cmd_per_lun to scsi_driver.can_queue during probe will also prevent the LUN queue_depth from being set to a value that is higher than it can ever be set to again by the user when storvsc_change_queue_depth() is invoked. Also in scsi_sysfs sdev_store_queue_depth() there is this check: if (depth < 1 || depth > sdev->host->can_queue) return -EINVAL; I would also note that VirtIO SCSI in virtscsi_probe(), Scsi_Host->cmd_per_lun is set to the min of the configured cmd_per_lun and Scsi_Host->can_queue: shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue); Best, Melanie .
Re: [PATCH v1] scsi: storvsc: Cap cmd_per_lun at can_queue
On Mon, Mar 08, 2021 at 02:37:40PM +, Michael Kelley wrote: > From: Melanie Plageman (Microsoft) Sent: Friday, > March 5, 2021 3:22 PM > > > > The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during > > allocation. > > > > Cap cmd_per_lun at can_queue to avoid dispatch errors. > > > > Signed-off-by: Melanie Plageman (Microsoft) > > --- > > drivers/scsi/storvsc_drv.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c > > index 6bc5453cea8a..d7953a6e00e6 100644 > > --- a/drivers/scsi/storvsc_drv.c > > +++ b/drivers/scsi/storvsc_drv.c > > @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device, > > (max_sub_channels + 1) * > > (100 - ring_avail_percent_lowater) / 100; > > > > + scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, > > scsi_driver.can_queue); > > + > > I'm not sure what you mean by "avoid dispatch errors". Can you elaborate? The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set Scsi_Host->cmd_per_lun in storvsc_probe(). In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is called and sets the scsi_device->queue_depth to the Scsi_Host's cmd_per_lun with this code: scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ? sdev->host->cmd_per_lun : 1); During dispatch, the scsi_device->queue_depth is used in scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine whether or not the device can queue another command. On some machines, with the 2048 value of cmd_per_lun that was used to set the initial scsi_device->queue_depth, commands can be queued that are later not able to be dispatched after running out of space in the ringbuffer. On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD (running an fio workload that I can provide if needed), storvsc_do_io() ends up often returning SCSI_MLQUEUE_DEVICE_BUSY. This is the call stack: hv_get_bytes_to_write hv_ringbuffer_write vmbus_send_packet storvsc_dio_io storvsc_queuecommand scsi_dispatch_cmd scsi_queue_rq dispatch_rq_list > Be aware that the calculation of "can_queue" in this driver is somewhat > flawed -- it should not be based on the size of the ring buffer, but instead > on > the maximum number of requests Hyper-V will queue. And even then, > can_queue doesn't provide the cap you might expect because the blk-mq layer > allocates can_queue tags for each HW queue, not as a total. The docs for scsi_mid_low_api document Scsi_Host can_queue this way: can_queue - must be greater than 0; do not send more than can_queue commands to the adapter. I did notice that in scsi_host.h, the comment for can_queue does say can_queue is the "maximum number of simultaneous commands a single hw queue in HBA will accept." However, I don't see it being used this way in the code. During dispatch, In scsi_target_queue_ready(), there is this code: if (busy >= starget->can_queue) goto starved; And the scsi_target->can_queue value should be coming from Scsi_host as mentioned in the scsi_target definition in scsi_device.h /* * LLDs should set this in the slave_alloc host template callout. * If set to zero then there is not limit. */ unsigned intcan_queue; So, I don't really see how this would be per hardware queue. > > I agree that the cmd_per_lun setting is also too big, but we should fix that > in > the context of getting all of these different settings working together > correctly, > and not piecemeal. > Capping Scsi_Host->cmd_per_lun to scsi_driver.can_queue during probe will also prevent the LUN queue_depth from being set to a value that is higher than it can ever be set to again by the user when storvsc_change_queue_depth() is invoked. Also in scsi_sysfs sdev_store_queue_depth() there is this check: if (depth < 1 || depth > sdev->host->can_queue) return -EINVAL; I would also note that VirtIO SCSI in virtscsi_probe(), Scsi_Host->cmd_per_lun is set to the min of the configured cmd_per_lun and Scsi_Host->can_queue: shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue); Best, Melanie
RE: [PATCH v1] scsi: storvsc: Cap cmd_per_lun at can_queue
From: Melanie Plageman (Microsoft) Sent: Friday, March 5, 2021 3:22 PM > > The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during > allocation. > > Cap cmd_per_lun at can_queue to avoid dispatch errors. > > Signed-off-by: Melanie Plageman (Microsoft) > --- > drivers/scsi/storvsc_drv.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c > index 6bc5453cea8a..d7953a6e00e6 100644 > --- a/drivers/scsi/storvsc_drv.c > +++ b/drivers/scsi/storvsc_drv.c > @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device, > (max_sub_channels + 1) * > (100 - ring_avail_percent_lowater) / 100; > > + scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, > scsi_driver.can_queue); > + I'm not sure what you mean by "avoid dispatch errors". Can you elaborate? Be aware that the calculation of "can_queue" in this driver is somewhat flawed -- it should not be based on the size of the ring buffer, but instead on the maximum number of requests Hyper-V will queue. And even then, can_queue doesn't provide the cap you might expect because the blk-mq layer allocates can_queue tags for each HW queue, not as a total. I agree that the cmd_per_lun setting is also too big, but we should fix that in the context of getting all of these different settings working together correctly, and not piecemeal. Michael > host = scsi_host_alloc(&scsi_driver, > sizeof(struct hv_host_device)); > if (!host) > -- > 2.20.1
[PATCH v1] scsi: storvsc: Cap cmd_per_lun at can_queue
The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during allocation. Cap cmd_per_lun at can_queue to avoid dispatch errors. Signed-off-by: Melanie Plageman (Microsoft) --- drivers/scsi/storvsc_drv.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c index 6bc5453cea8a..d7953a6e00e6 100644 --- a/drivers/scsi/storvsc_drv.c +++ b/drivers/scsi/storvsc_drv.c @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device, (max_sub_channels + 1) * (100 - ring_avail_percent_lowater) / 100; + scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, scsi_driver.can_queue); + host = scsi_host_alloc(&scsi_driver, sizeof(struct hv_host_device)); if (!host) -- 2.20.1