RE: [PATCH v1] scsi: storvsc: Cap cmd_per_lun at can_queue

2021-03-09 Thread Michael Kelley
From: John Garry  Sent: Tuesday, March 9, 2021 8:36 AM
> 
> On 09/03/2021 15:57, Michael Kelley wrote:
> > From: John Garry  Sent: Tuesday, March 9, 2021 2:10 
> > AM
> >>
> >> On 08/03/2021 17:56, Melanie Plageman wrote:
> >>> On Mon, Mar 08, 2021 at 02:37:40PM +, Michael Kelley wrote:
>  From: Melanie Plageman (Microsoft)  Sent: 
>  Friday,
> >> March 5, 2021 3:22 PM
> >
> > The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during
> > allocation.
> >
> > Cap cmd_per_lun at can_queue to avoid dispatch errors.
> >
> > Signed-off-by: Melanie Plageman (Microsoft) 
> > ---
> >drivers/scsi/storvsc_drv.c | 2 ++
> >1 file changed, 2 insertions(+)
> >
> > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> > index 6bc5453cea8a..d7953a6e00e6 100644
> > --- a/drivers/scsi/storvsc_drv.c
> > +++ b/drivers/scsi/storvsc_drv.c
> > @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device,
> > (max_sub_channels + 1) *
> > (100 - ring_avail_percent_lowater) / 
> > 100;
> >
> > +   scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun,
> >> scsi_driver.can_queue);
> > +
> 
>  I'm not sure what you mean by "avoid dispatch errors".  Can you 
>  elaborate?
> >>>
> >>> The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set
> >>> Scsi_Host->cmd_per_lun in storvsc_probe().
> >>>
> >>> In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is
> >>> called and sets the scsi_device->queue_depth to the Scsi_Host's
> >>> cmd_per_lun with this code:
> >>>
> >>> scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ?
> >>>   sdev->host->cmd_per_lun : 1);
> >>>
> >>> During dispatch, the scsi_device->queue_depth is used in
> >>> scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine
> >>> whether or not the device can queue another command.
> >>>
> >>> On some machines, with the 2048 value of cmd_per_lun that was used to
> >>> set the initial scsi_device->queue_depth, commands can be queued that
> >>> are later not able to be dispatched after running out of space in the
> >>> ringbuffer.
> >>>
> >>> On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD
> >>> (running an fio workload that I can provide if needed), storvsc_do_io()
> >>> ends up often returning SCSI_MLQUEUE_DEVICE_BUSY.
> >>>
> >>> This is the call stack:
> >>>
> >>> hv_get_bytes_to_write
> >>> hv_ringbuffer_write
> >>> vmbus_send_packet
> >>> storvsc_dio_io
> >>> storvsc_queuecommand
> >>> scsi_dispatch_cmd
> >>> scsi_queue_rq
> >>> dispatch_rq_list
> >>>
>  Be aware that the calculation of "can_queue" in this driver is somewhat
>  flawed -- it should not be based on the size of the ring buffer, but 
>  instead on
>  the maximum number of requests Hyper-V will queue.  And even then,
>  can_queue doesn't provide the cap you might expect because the blk-mq 
>  layer
>  allocates can_queue tags for each HW queue, not as a total.
> >>>
> >>>
> >>> The docs for scsi_mid_low_api document Scsi_Host can_queue this way:
> >>>
> >>> can_queue
> >>> - must be greater than 0; do not send more than can_queue
> >>>   commands to the adapter.
> >>>
> >>> I did notice that in scsi_host.h, the comment for can_queue does say
> >>> can_queue is the "maximum number of simultaneous commands a single hw
> >>> queue in HBA will accept." However, I don't see it being used this way
> >>> in the code.
> >>>
> >>
> >> JFYI, the block layer ensures that no more than can_queue requests are
> >> sent to the host. See scsi_mq_setup_tags(), and how the tagset queue
> >> depth is set to shost->can_queue.
> >>
> >> Thanks,
> >> John
> >
> > Agree on what's in scsi_mq_setup_tags().  But scsi_mq_setup_tags() calls
> > blk_mq_alloc_tag_set(), which in turn calls blk_mq_alloc_map_and_requests(),
> > which calls __blk_mq_alloc_rq_maps() repeatedly, reducing the tag
> > set queue_depth as needed until it succeeds.
> >
> > The key thing is that __blk_mq_alloc_rq_maps() iterates over the
> > number of HW queues calling __blk_mq_alloc_map_and_request().
> > The latter function allocates the map and the requests with a count
> > of the tag set's queue_depth.   There's no logic to apportion the
> > can_queue value across multiple HW queues. So each HW queue gets
> > can_queue tags allocated, and the SCSI host driver may see up to
> > (can_queue * # HW queues) simultaneous requests.
> >
> > I'm certainly not an expert in this area, but that's what I see in the
> > code.  We've run live experiments, and can see the number
> > simultaneous requests sent to the storvsc driver be greater than
> > can_queue when the # of HW queues is greater than 1, which seems
> > to be consistent with the code.
> 
> ah, ok. I assumed that # of HW queues = 1 here. So you'r

Re: [PATCH v1] scsi: storvsc: Cap cmd_per_lun at can_queue

2021-03-09 Thread John Garry

On 09/03/2021 15:57, Michael Kelley wrote:

From: John Garry  Sent: Tuesday, March 9, 2021 2:10 AM


On 08/03/2021 17:56, Melanie Plageman wrote:

On Mon, Mar 08, 2021 at 02:37:40PM +, Michael Kelley wrote:

From: Melanie Plageman (Microsoft)  Sent: Friday,

March 5, 2021 3:22 PM


The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during
allocation.

Cap cmd_per_lun at can_queue to avoid dispatch errors.

Signed-off-by: Melanie Plageman (Microsoft) 
---
   drivers/scsi/storvsc_drv.c | 2 ++
   1 file changed, 2 insertions(+)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 6bc5453cea8a..d7953a6e00e6 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device,
(max_sub_channels + 1) *
(100 - ring_avail_percent_lowater) / 100;

+   scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun,

scsi_driver.can_queue);

+


I'm not sure what you mean by "avoid dispatch errors".  Can you elaborate?


The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set
Scsi_Host->cmd_per_lun in storvsc_probe().

In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is
called and sets the scsi_device->queue_depth to the Scsi_Host's
cmd_per_lun with this code:

scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ?
  sdev->host->cmd_per_lun : 1);

During dispatch, the scsi_device->queue_depth is used in
scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine
whether or not the device can queue another command.

On some machines, with the 2048 value of cmd_per_lun that was used to
set the initial scsi_device->queue_depth, commands can be queued that
are later not able to be dispatched after running out of space in the
ringbuffer.

On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD
(running an fio workload that I can provide if needed), storvsc_do_io()
ends up often returning SCSI_MLQUEUE_DEVICE_BUSY.

This is the call stack:

hv_get_bytes_to_write
hv_ringbuffer_write
vmbus_send_packet
storvsc_dio_io
storvsc_queuecommand
scsi_dispatch_cmd
scsi_queue_rq
dispatch_rq_list


Be aware that the calculation of "can_queue" in this driver is somewhat
flawed -- it should not be based on the size of the ring buffer, but instead on
the maximum number of requests Hyper-V will queue.  And even then,
can_queue doesn't provide the cap you might expect because the blk-mq layer
allocates can_queue tags for each HW queue, not as a total.



The docs for scsi_mid_low_api document Scsi_Host can_queue this way:

can_queue
- must be greater than 0; do not send more than can_queue
  commands to the adapter.

I did notice that in scsi_host.h, the comment for can_queue does say
can_queue is the "maximum number of simultaneous commands a single hw
queue in HBA will accept." However, I don't see it being used this way
in the code.



JFYI, the block layer ensures that no more than can_queue requests are
sent to the host. See scsi_mq_setup_tags(), and how the tagset queue
depth is set to shost->can_queue.

Thanks,
John


Agree on what's in scsi_mq_setup_tags().  But scsi_mq_setup_tags() calls
blk_mq_alloc_tag_set(), which in turn calls blk_mq_alloc_map_and_requests(),
which calls __blk_mq_alloc_rq_maps() repeatedly, reducing the tag
set queue_depth as needed until it succeeds.

The key thing is that __blk_mq_alloc_rq_maps() iterates over the
number of HW queues calling __blk_mq_alloc_map_and_request().
The latter function allocates the map and the requests with a count
of the tag set's queue_depth.   There's no logic to apportion the
can_queue value across multiple HW queues. So each HW queue gets
can_queue tags allocated, and the SCSI host driver may see up to
(can_queue * # HW queues) simultaneous requests.

I'm certainly not an expert in this area, but that's what I see in the
code.  We've run live experiments, and can see the number
simultaneous requests sent to the storvsc driver be greater than
can_queue when the # of HW queues is greater than 1, which seems
to be consistent with the code.


ah, ok. I assumed that # of HW queues = 1 here. So you're describing a 
problem similar to 
https://lore.kernel.org/linux-scsi/b3e4e597-779b-7c1e-0d3c-07bc3dab1...@huawei.com/


So if you check nr_hw_queues comment in include/scsi/scsi_host.h, it reads:
the total queue depth per host is nr_hw_queues * can_queue. However, for 
when host_tagset is set, the total queue depth is can_queue.


Setting .host_tagset will ensure at most can_queue requests will be sent 
over all HW queues at any given time. A few SCSI MQ drivers set this now.


Thanks,
John



Michael





During dispatch, In scsi_target_queue_ready(), there is this code:

  if (busy >= starget->can_queue)
  goto starved;

And the scsi_target->can_queue value should be coming from Scsi_

RE: [PATCH v1] scsi: storvsc: Cap cmd_per_lun at can_queue

2021-03-09 Thread Michael Kelley
From: John Garry  Sent: Tuesday, March 9, 2021 2:10 AM
> 
> On 08/03/2021 17:56, Melanie Plageman wrote:
> > On Mon, Mar 08, 2021 at 02:37:40PM +, Michael Kelley wrote:
> >> From: Melanie Plageman (Microsoft)  Sent: 
> >> Friday,
> March 5, 2021 3:22 PM
> >>>
> >>> The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during
> >>> allocation.
> >>>
> >>> Cap cmd_per_lun at can_queue to avoid dispatch errors.
> >>>
> >>> Signed-off-by: Melanie Plageman (Microsoft) 
> >>> ---
> >>>   drivers/scsi/storvsc_drv.c | 2 ++
> >>>   1 file changed, 2 insertions(+)
> >>>
> >>> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> >>> index 6bc5453cea8a..d7953a6e00e6 100644
> >>> --- a/drivers/scsi/storvsc_drv.c
> >>> +++ b/drivers/scsi/storvsc_drv.c
> >>> @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device,
> >>>   (max_sub_channels + 1) *
> >>>   (100 - ring_avail_percent_lowater) / 
> >>> 100;
> >>>
> >>> + scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun,
> scsi_driver.can_queue);
> >>> +
> >>
> >> I'm not sure what you mean by "avoid dispatch errors".  Can you elaborate?
> >
> > The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set
> > Scsi_Host->cmd_per_lun in storvsc_probe().
> >
> > In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is
> > called and sets the scsi_device->queue_depth to the Scsi_Host's
> > cmd_per_lun with this code:
> >
> > scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ?
> >  sdev->host->cmd_per_lun : 1);
> >
> > During dispatch, the scsi_device->queue_depth is used in
> > scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine
> > whether or not the device can queue another command.
> >
> > On some machines, with the 2048 value of cmd_per_lun that was used to
> > set the initial scsi_device->queue_depth, commands can be queued that
> > are later not able to be dispatched after running out of space in the
> > ringbuffer.
> >
> > On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD
> > (running an fio workload that I can provide if needed), storvsc_do_io()
> > ends up often returning SCSI_MLQUEUE_DEVICE_BUSY.
> >
> > This is the call stack:
> >
> > hv_get_bytes_to_write
> > hv_ringbuffer_write
> > vmbus_send_packet
> > storvsc_dio_io
> > storvsc_queuecommand
> > scsi_dispatch_cmd
> > scsi_queue_rq
> > dispatch_rq_list
> >
> >> Be aware that the calculation of "can_queue" in this driver is somewhat
> >> flawed -- it should not be based on the size of the ring buffer, but 
> >> instead on
> >> the maximum number of requests Hyper-V will queue.  And even then,
> >> can_queue doesn't provide the cap you might expect because the blk-mq layer
> >> allocates can_queue tags for each HW queue, not as a total.
> >
> >
> > The docs for scsi_mid_low_api document Scsi_Host can_queue this way:
> >
> >can_queue
> >- must be greater than 0; do not send more than can_queue
> >  commands to the adapter.
> >
> > I did notice that in scsi_host.h, the comment for can_queue does say
> > can_queue is the "maximum number of simultaneous commands a single hw
> > queue in HBA will accept." However, I don't see it being used this way
> > in the code.
> >
> 
> JFYI, the block layer ensures that no more than can_queue requests are
> sent to the host. See scsi_mq_setup_tags(), and how the tagset queue
> depth is set to shost->can_queue.
> 
> Thanks,
> John

Agree on what's in scsi_mq_setup_tags().  But scsi_mq_setup_tags() calls
blk_mq_alloc_tag_set(), which in turn calls blk_mq_alloc_map_and_requests(),
which calls __blk_mq_alloc_rq_maps() repeatedly, reducing the tag
set queue_depth as needed until it succeeds.

The key thing is that __blk_mq_alloc_rq_maps() iterates over the
number of HW queues calling __blk_mq_alloc_map_and_request().
The latter function allocates the map and the requests with a count
of the tag set's queue_depth.   There's no logic to apportion the
can_queue value across multiple HW queues. So each HW queue gets
can_queue tags allocated, and the SCSI host driver may see up to
(can_queue * # HW queues) simultaneous requests.

I'm certainly not an expert in this area, but that's what I see in the
code.  We've run live experiments, and can see the number
simultaneous requests sent to the storvsc driver be greater than
can_queue when the # of HW queues is greater than 1, which seems
to be consistent with the code.

Michael

> 
> 
> > During dispatch, In scsi_target_queue_ready(), there is this code:
> >
> >  if (busy >= starget->can_queue)
> >  goto starved;
> >
> > And the scsi_target->can_queue value should be coming from Scsi_host as
> > mentioned in the scsi_target definition in scsi_device.h
> >  /*
> >* LLDs should set this in the slave_alloc host template callout.
> >* If set to zero then there is not limit.
> >  

RE: [PATCH v1] scsi: storvsc: Cap cmd_per_lun at can_queue

2021-03-09 Thread Michael Kelley
From: Melanie Plageman  Sent: Monday, March 8, 2021 
9:56 AM
> 
> On Mon, Mar 08, 2021 at 02:37:40PM +, Michael Kelley wrote:
> > From: Melanie Plageman (Microsoft)  Sent: 
> > Friday, March
> 5, 2021 3:22 PM
> > >
> > > The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during
> > > allocation.
> > >
> > > Cap cmd_per_lun at can_queue to avoid dispatch errors.
> > >
> > > Signed-off-by: Melanie Plageman (Microsoft) 
> > > ---
> > >  drivers/scsi/storvsc_drv.c | 2 ++
> > >  1 file changed, 2 insertions(+)
> > >
> > > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> > > index 6bc5453cea8a..d7953a6e00e6 100644
> > > --- a/drivers/scsi/storvsc_drv.c
> > > +++ b/drivers/scsi/storvsc_drv.c
> > > @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device,
> > >   (max_sub_channels + 1) *
> > >   (100 - ring_avail_percent_lowater) / 100;
> > >
> > > + scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun,
> scsi_driver.can_queue);
> > > +
> >
> > I'm not sure what you mean by "avoid dispatch errors".  Can you elaborate?
> 
> The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set
> Scsi_Host->cmd_per_lun in storvsc_probe().
> 
> In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is
> called and sets the scsi_device->queue_depth to the Scsi_Host's
> cmd_per_lun with this code:
> 
> scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ?
> sdev->host->cmd_per_lun : 1);
> 
> During dispatch, the scsi_device->queue_depth is used in
> scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine
> whether or not the device can queue another command.
> 
> On some machines, with the 2048 value of cmd_per_lun that was used to
> set the initial scsi_device->queue_depth, commands can be queued that
> are later not able to be dispatched after running out of space in the
> ringbuffer.
> 
> On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD
> (running an fio workload that I can provide if needed), storvsc_do_io()
> ends up often returning SCSI_MLQUEUE_DEVICE_BUSY.
> 
> This is the call stack:
> 
> hv_get_bytes_to_write
> hv_ringbuffer_write
> vmbus_send_packet
> storvsc_dio_io
> storvsc_queuecommand
> scsi_dispatch_cmd
> scsi_queue_rq
> dispatch_rq_list
> 
> > Be aware that the calculation of "can_queue" in this driver is somewhat
> > flawed -- it should not be based on the size of the ring buffer, but 
> > instead on
> > the maximum number of requests Hyper-V will queue.  And even then,
> > can_queue doesn't provide the cap you might expect because the blk-mq layer
> > allocates can_queue tags for each HW queue, not as a total.
> 
> 
> The docs for scsi_mid_low_api document Scsi_Host can_queue this way:
> 
>   can_queue
>   - must be greater than 0; do not send more than can_queue
> commands to the adapter.
> 
> I did notice that in scsi_host.h, the comment for can_queue does say
> can_queue is the "maximum number of simultaneous commands a single hw
> queue in HBA will accept." 

Yes, this comment is correct.  The can_queue value is per HW queue.

> However, I don't see it being used this way
> in the code.
> 
> During dispatch, In scsi_target_queue_ready(), there is this code:
> 
> if (busy >= starget->can_queue)
> goto starved;
> 
> And the scsi_target->can_queue value should be coming from Scsi_host as
> mentioned in the scsi_target definition in scsi_device.h
> /*
>   * LLDs should set this in the slave_alloc host template callout.
>   * If set to zero then there is not limit.
>   */
> unsigned intcan_queue;
> 
> So, I don't really see how this would be per hardware queue.

For the storvsc driver, the can_queue value in the scsi_target is initialized
to zero in scsi_alloc_target() and it remains unchanged.  Maybe I'm missing
something, but the only place I see that sets starget->can_queue to a
non-zero value is iscsi_target_alloc().  The storvsc slave_alloc() function does
not set it.  So the test in scsi_target_queue_ready() for exceeding can_queue
never executes.

We've run live tests, and can see that the number of requests sent to the
storvsc driver exceeds the can_queue value when the # of HW queues is
greater than 1.  That result is consistent with what I see in the code.

Michael

> 
> >
> > I agree that the cmd_per_lun setting is also too big, but we should fix 
> > that in
> > the context of getting all of these different settings working together 
> > correctly,
> > and not piecemeal.
> >
> 
> Capping Scsi_Host->cmd_per_lun to scsi_driver.can_queue during probe
> will also prevent the LUN queue_depth from being set to a value that is
> higher than it can ever be set to again by the user when
> storvsc_change_queue_depth() is invoked.
> 
> Also in scsi_sysfs sdev_store_queue_depth() there is this check:
> 
>   if (depth < 1 || depth > sdev->host->can_q

Re: [PATCH v1] scsi: storvsc: Cap cmd_per_lun at can_queue

2021-03-09 Thread John Garry

On 08/03/2021 17:56, Melanie Plageman wrote:

On Mon, Mar 08, 2021 at 02:37:40PM +, Michael Kelley wrote:

From: Melanie Plageman (Microsoft)  Sent: Friday, 
March 5, 2021 3:22 PM


The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during
allocation.

Cap cmd_per_lun at can_queue to avoid dispatch errors.

Signed-off-by: Melanie Plageman (Microsoft) 
---
  drivers/scsi/storvsc_drv.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 6bc5453cea8a..d7953a6e00e6 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device,
(max_sub_channels + 1) *
(100 - ring_avail_percent_lowater) / 100;

+   scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, 
scsi_driver.can_queue);
+


I'm not sure what you mean by "avoid dispatch errors".  Can you elaborate?


The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set
Scsi_Host->cmd_per_lun in storvsc_probe().

In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is
called and sets the scsi_device->queue_depth to the Scsi_Host's
cmd_per_lun with this code:
 
scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ?

 sdev->host->cmd_per_lun : 1);

During dispatch, the scsi_device->queue_depth is used in
scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine
whether or not the device can queue another command.

On some machines, with the 2048 value of cmd_per_lun that was used to
set the initial scsi_device->queue_depth, commands can be queued that
are later not able to be dispatched after running out of space in the
ringbuffer.

On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD
(running an fio workload that I can provide if needed), storvsc_do_io()
ends up often returning SCSI_MLQUEUE_DEVICE_BUSY.
   
This is the call stack:
 
hv_get_bytes_to_write

hv_ringbuffer_write
vmbus_send_packet
storvsc_dio_io
storvsc_queuecommand
scsi_dispatch_cmd
scsi_queue_rq
dispatch_rq_list


Be aware that the calculation of "can_queue" in this driver is somewhat
flawed -- it should not be based on the size of the ring buffer, but instead on
the maximum number of requests Hyper-V will queue.  And even then,
can_queue doesn't provide the cap you might expect because the blk-mq layer
allocates can_queue tags for each HW queue, not as a total.



The docs for scsi_mid_low_api document Scsi_Host can_queue this way:

   can_queue
   - must be greater than 0; do not send more than can_queue
 commands to the adapter.
 
I did notice that in scsi_host.h, the comment for can_queue does say

can_queue is the "maximum number of simultaneous commands a single hw
queue in HBA will accept." However, I don't see it being used this way
in the code.
   


JFYI, the block layer ensures that no more than can_queue requests are 
sent to the host. See scsi_mq_setup_tags(), and how the tagset queue 
depth is set to shost->can_queue.


Thanks,
John



During dispatch, In scsi_target_queue_ready(), there is this code:

 if (busy >= starget->can_queue)
 goto starved;

And the scsi_target->can_queue value should be coming from Scsi_host as
mentioned in the scsi_target definition in scsi_device.h
 /*
   * LLDs should set this in the slave_alloc host template callout.
   * If set to zero then there is not limit.
   */
 unsigned intcan_queue;

So, I don't really see how this would be per hardware queue.



I agree that the cmd_per_lun setting is also too big, but we should fix that in
the context of getting all of these different settings working together 
correctly,
and not piecemeal.



Capping Scsi_Host->cmd_per_lun to scsi_driver.can_queue during probe
will also prevent the LUN queue_depth from being set to a value that is
higher than it can ever be set to again by the user when
storvsc_change_queue_depth() is invoked.

Also in scsi_sysfs sdev_store_queue_depth() there is this check:

   if (depth < 1 || depth > sdev->host->can_queue)
 return -EINVAL;

I would also note that VirtIO SCSI in virtscsi_probe(), Scsi_Host->cmd_per_lun
is set to the min of the configured cmd_per_lun and
Scsi_Host->can_queue:

 shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);

Best,
Melanie
.





Re: [PATCH v1] scsi: storvsc: Cap cmd_per_lun at can_queue

2021-03-08 Thread Melanie Plageman
On Mon, Mar 08, 2021 at 02:37:40PM +, Michael Kelley wrote:
> From: Melanie Plageman (Microsoft)  Sent: Friday, 
> March 5, 2021 3:22 PM
> > 
> > The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during
> > allocation.
> > 
> > Cap cmd_per_lun at can_queue to avoid dispatch errors.
> > 
> > Signed-off-by: Melanie Plageman (Microsoft) 
> > ---
> >  drivers/scsi/storvsc_drv.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> > index 6bc5453cea8a..d7953a6e00e6 100644
> > --- a/drivers/scsi/storvsc_drv.c
> > +++ b/drivers/scsi/storvsc_drv.c
> > @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device,
> > (max_sub_channels + 1) *
> > (100 - ring_avail_percent_lowater) / 100;
> > 
> > +   scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, 
> > scsi_driver.can_queue);
> > +
> 
> I'm not sure what you mean by "avoid dispatch errors".  Can you elaborate?

The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set
Scsi_Host->cmd_per_lun in storvsc_probe().

In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is
called and sets the scsi_device->queue_depth to the Scsi_Host's
cmd_per_lun with this code:

scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ?
sdev->host->cmd_per_lun : 1);

During dispatch, the scsi_device->queue_depth is used in
scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine
whether or not the device can queue another command.

On some machines, with the 2048 value of cmd_per_lun that was used to
set the initial scsi_device->queue_depth, commands can be queued that
are later not able to be dispatched after running out of space in the
ringbuffer.

On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD
(running an fio workload that I can provide if needed), storvsc_do_io()
ends up often returning SCSI_MLQUEUE_DEVICE_BUSY.
  
This is the call stack: 

hv_get_bytes_to_write
hv_ringbuffer_write
vmbus_send_packet  
storvsc_dio_io  
storvsc_queuecommand
scsi_dispatch_cmd
scsi_queue_rq   
dispatch_rq_list

> Be aware that the calculation of "can_queue" in this driver is somewhat
> flawed -- it should not be based on the size of the ring buffer, but instead 
> on
> the maximum number of requests Hyper-V will queue.  And even then,
> can_queue doesn't provide the cap you might expect because the blk-mq layer
> allocates can_queue tags for each HW queue, not as a total.


The docs for scsi_mid_low_api document Scsi_Host can_queue this way:

  can_queue
  - must be greater than 0; do not send more than can_queue
commands to the adapter. 

I did notice that in scsi_host.h, the comment for can_queue does say
can_queue is the "maximum number of simultaneous commands a single hw
queue in HBA will accept." However, I don't see it being used this way
in the code.

During dispatch, In scsi_target_queue_ready(), there is this code:

if (busy >= starget->can_queue)
goto starved;

And the scsi_target->can_queue value should be coming from Scsi_host as
mentioned in the scsi_target definition in scsi_device.h
/*
  * LLDs should set this in the slave_alloc host template callout.
  * If set to zero then there is not limit.
  */
unsigned intcan_queue;

So, I don't really see how this would be per hardware queue.

> 
> I agree that the cmd_per_lun setting is also too big, but we should fix that 
> in
> the context of getting all of these different settings working together 
> correctly,
> and not piecemeal.
> 

Capping Scsi_Host->cmd_per_lun to scsi_driver.can_queue during probe
will also prevent the LUN queue_depth from being set to a value that is
higher than it can ever be set to again by the user when
storvsc_change_queue_depth() is invoked. 

Also in scsi_sysfs sdev_store_queue_depth() there is this check:

  if (depth < 1 || depth > sdev->host->can_queue)
return -EINVAL;

I would also note that VirtIO SCSI in virtscsi_probe(), Scsi_Host->cmd_per_lun
is set to the min of the configured cmd_per_lun and
Scsi_Host->can_queue:

shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);

Best,
Melanie


RE: [PATCH v1] scsi: storvsc: Cap cmd_per_lun at can_queue

2021-03-08 Thread Michael Kelley
From: Melanie Plageman (Microsoft)  Sent: Friday, 
March 5, 2021 3:22 PM
> 
> The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during
> allocation.
> 
> Cap cmd_per_lun at can_queue to avoid dispatch errors.
> 
> Signed-off-by: Melanie Plageman (Microsoft) 
> ---
>  drivers/scsi/storvsc_drv.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index 6bc5453cea8a..d7953a6e00e6 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device,
>   (max_sub_channels + 1) *
>   (100 - ring_avail_percent_lowater) / 100;
> 
> + scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, 
> scsi_driver.can_queue);
> +

I'm not sure what you mean by "avoid dispatch errors".  Can you elaborate?
Be aware that the calculation of "can_queue" in this driver is somewhat
flawed -- it should not be based on the size of the ring buffer, but instead on
the maximum number of requests Hyper-V will queue.  And even then,
can_queue doesn't provide the cap you might expect because the blk-mq layer
allocates can_queue tags for each HW queue, not as a total.

I agree that the cmd_per_lun setting is also too big, but we should fix that in
the context of getting all of these different settings working together 
correctly,
and not piecemeal.

Michael

>   host = scsi_host_alloc(&scsi_driver,
>  sizeof(struct hv_host_device));
>   if (!host)
> --
> 2.20.1



[PATCH v1] scsi: storvsc: Cap cmd_per_lun at can_queue

2021-03-05 Thread Melanie Plageman (Microsoft)
The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during
allocation.

Cap cmd_per_lun at can_queue to avoid dispatch errors.

Signed-off-by: Melanie Plageman (Microsoft) 
---
 drivers/scsi/storvsc_drv.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 6bc5453cea8a..d7953a6e00e6 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device,
(max_sub_channels + 1) *
(100 - ring_avail_percent_lowater) / 100;
 
+   scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, 
scsi_driver.can_queue);
+
host = scsi_host_alloc(&scsi_driver,
   sizeof(struct hv_host_device));
if (!host)
-- 
2.20.1