Re: [PATCH V3] io_uring: fix IO hang in io_wq_put_and_exit from do_exit()

2023-09-25 Thread Ming Lei
On Mon, Sep 25, 2023 at 05:17:10PM -0400, Stefan Hajnoczi wrote:
> On Fri, Sep 15, 2023 at 03:04:05PM +0800, Jason Wang wrote:
> > On Fri, Sep 8, 2023 at 11:25 PM Ming Lei  wrote:
> > >
> > > On Fri, Sep 08, 2023 at 08:44:45AM -0600, Jens Axboe wrote:
> > > > On 9/8/23 8:34 AM, Ming Lei wrote:
> > > > > On Fri, Sep 08, 2023 at 07:49:53AM -0600, Jens Axboe wrote:
> > > > >> On 9/8/23 3:30 AM, Ming Lei wrote:
> > > > >>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> > > > >>> index ad636954abae..95a3d31a1ef1 100644
> > > > >>> --- a/io_uring/io_uring.c
> > > > >>> +++ b/io_uring/io_uring.c
> > > > >>> @@ -1930,6 +1930,10 @@ void io_wq_submit_work(struct io_wq_work *work)
> > > > >>>   }
> > > > >>>   }
> > > > >>>
> > > > >>> + /* It is fragile to block POLLED IO, so switch to NON_BLOCK */
> > > > >>> + if ((req->ctx->flags & IORING_SETUP_IOPOLL) && def->iopoll_queue)
> > > > >>> + issue_flags |= IO_URING_F_NONBLOCK;
> > > > >>> +
> > > > >>
> > > > >> I think this comment deserves to be more descriptive. Normally we
> > > > >> absolutely cannot block for polled IO, it's only OK here because io-wq
> > > > >
> > > > > Yeah, we didn't do that until commit 2bc057692599 ("block: don't make
> > > > > REQ_POLLED imply REQ_NOWAIT"), which actually pushed the
> > > > > responsibility/risk up to io_uring.
> > > > >
> > > > >> is the issuer and not necessarily the poller of it. That generally 
> > > > >> falls
> > > > >> upon the original issuer to poll these requests.
> > > > >>
> > > > >> I think this should be a separate commit, coming before the main fix
> > > > >> which is below.
> > > > >
> > > > > Looks fine; actually the IO_URING_F_NONBLOCK change isn't a must, and
> > > > > the approach in V2 doesn't need it.
> > > > >
> > > > >>
> > > > >>> @@ -3363,6 +3367,12 @@ __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd)
> > > > >>>   finish_wait(&tctx->wait, &wait);
> > > > >>>   } while (1);
> > > > >>>
> > > > >>> + /*
> > > > >>> +  * Reap events from each ctx, otherwise these requests may take
> > > > >>> +  * resources and prevent other contexts from being moved on.
> > > > >>> +  */
> > > > >>> + xa_for_each(&tctx->xa, index, node)
> > > > >>> + io_iopoll_try_reap_events(node->ctx);
> > > > >>
> > > > >> The main issue here is that if someone isn't polling for them, then we
> > > > >
> > > > > That is actually what this patch is addressing, :-)
> > > >
> > > > Right, that part is obvious :)
> > > >
> > > > >> get to wait for a timeout before they complete. This can delay exit,
> > > > >> for example, as we're now just waiting 30 seconds (or whatever the
> > > > >> timeout is on the underlying device) for them to get timed out before
> > > > >> exit can finish.
> > > > >
> > > > > For the issue on null_blk, the device timeout handler provides
> > > > > forward progress: requests are released, so new IO can be handled.
> > > > >
> > > > > However, not all devices support timeouts; virtio devices, for
> > > > > example, do not.
> > > >
> > > > That's a bug in the driver, you cannot sanely support polled IO and not
> > > > be able to deal with timeouts. Someone HAS to reap the requests and
> > > > there are only two things that can do that - the application doing the
> > > > polled IO, or if that doesn't happen, a timeout.
> > >
> > > OK, then the device driver's timeout handler has the new responsibility
> > > of covering userspace accidents, :-)
> 
> Sorry, I don't have enough context so this is probably a silly question:
> 
> When an application doesn't reap a polled request, why doesn't the block
> layer take care of this in a generic way? I don't see anything
> driver-specific about this.

The block layer doesn't have the knowledge to handle that; io_uring knows
the application is exiting and can help reap the events.

But the big question is whether IO timeouts can really happen on
virtio-blk. If they can, the reaping done in io_uring may never return
and cause other issues, so if it is done in io_uring, it can only be
thought of as a sort of improvement.

The real bug fix is still in the device driver; usually only the driver's
timeout handler can provide a forward-progress guarantee.
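
For illustration, a driver-side ->timeout() that guarantees forward
progress might look roughly like this (a minimal sketch with hypothetical
names, not code from any driver discussed here):

static enum blk_eh_timer_return mydrv_timeout(struct request *rq)
{
	/*
	 * Reclaim whatever hardware resources are still tied to rq
	 * (abort the command, free the HBA slot, ...), then complete
	 * the request so upper layers can make progress.
	 */
	mydrv_abort_cmd(blk_mq_rq_to_pdu(rq));	/* hypothetical helper */
	blk_mq_complete_request(rq);
	return BLK_EH_DONE;	/* handled; don't re-arm the timer */
}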

> 
> Driver-specific behavior would be sending an abort/cancel upon timeout.
> virtio-blk cannot do that because there is no such command in the device
> specification at the moment. So simply waiting for the polled request to
> complete is the only thing that can be done (aside from resetting the
> device), and it's generic behavior.

Then it looks unsafe to support IO polling for virtio-blk; maybe disable
it by default for now, until the virtio-blk spec starts to support IO
abort?
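
FWIW, virtio-blk's poll queues are already opt-in via the poll_queues
module parameter, so "off by default" is roughly the current shape; a
sketch of it:

/* Poll vqs stay disabled unless the admin opts in; the implicit
 * default of 0 means no poll virtqueues are created. */
static unsigned int poll_queues;
module_param(poll_queues, uint, 0644);
MODULE_PARM_DESC(poll_queues, "The number of dedicated virtqueues for polling I/O");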

Thanks,
Ming


Re: [PATCH V3] io_uring: fix IO hang in io_wq_put_and_exit from do_exit()

2023-09-08 Thread Ming Lei
On Fri, Sep 08, 2023 at 08:44:45AM -0600, Jens Axboe wrote:
> On 9/8/23 8:34 AM, Ming Lei wrote:
> > On Fri, Sep 08, 2023 at 07:49:53AM -0600, Jens Axboe wrote:
> >> On 9/8/23 3:30 AM, Ming Lei wrote:
> >>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> >>> index ad636954abae..95a3d31a1ef1 100644
> >>> --- a/io_uring/io_uring.c
> >>> +++ b/io_uring/io_uring.c
> >>> @@ -1930,6 +1930,10 @@ void io_wq_submit_work(struct io_wq_work *work)
> >>>   }
> >>>   }
> >>>  
> >>> + /* It is fragile to block POLLED IO, so switch to NON_BLOCK */
> >>> + if ((req->ctx->flags & IORING_SETUP_IOPOLL) && def->iopoll_queue)
> >>> + issue_flags |= IO_URING_F_NONBLOCK;
> >>> +
> >>
> >> I think this comment deserves to be more descriptive. Normally we
> >> absolutely cannot block for polled IO, it's only OK here because io-wq
> > 
> > Yeah, we didn't do that until commit 2bc057692599 ("block: don't make
> > REQ_POLLED imply REQ_NOWAIT"), which actually pushed the
> > responsibility/risk up to io_uring.
> > 
> >> is the issuer and not necessarily the poller of it. That generally falls
> >> upon the original issuer to poll these requests.
> >>
> >> I think this should be a separate commit, coming before the main fix
> >> which is below.
> > 
> > Looks fine; actually the IO_URING_F_NONBLOCK change isn't a must, and
> > the approach in V2 doesn't need it.
> > 
> >>
> >>> @@ -3363,6 +3367,12 @@ __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd)
> >>>   finish_wait(&tctx->wait, &wait);
> >>>   } while (1);
> >>>  
> >>> + /*
> >>> +  * Reap events from each ctx, otherwise these requests may take
> >>> +  * resources and prevent other contexts from being moved on.
> >>> +  */
> >>> + xa_for_each(&tctx->xa, index, node)
> >>> + io_iopoll_try_reap_events(node->ctx);
> >>
> >> The main issue here is that if someone isn't polling for them, then we
> > 
> > That is actually what this patch is addressing, :-)
> 
> Right, that part is obvious :)
> 
> >> get to wait for a timeout before they complete. This can delay exit, for
> >> example, as we're now just waiting 30 seconds (or whatever the timeout
> >> is on the underlying device) for them to get timed out before exit can
> >> finish.
> > 
> > For the issue on null_blk, the device timeout handler provides
> > forward progress: requests are released, so new IO can be handled.
> > 
> > However, not all devices support timeouts; virtio devices, for
> > example, do not.
> 
> That's a bug in the driver, you cannot sanely support polled IO and not
> be able to deal with timeouts. Someone HAS to reap the requests and
> there are only two things that can do that - the application doing the
> polled IO, or if that doesn't happen, a timeout.

OK, then the device driver's timeout handler has the new responsibility
of covering userspace accidents, :-)

We may document this requirement for drivers.

So far the only one should be virtio-blk, and neither of the two virtio
storage drivers implements a timeout handler.

> 
> > Here we just call io_iopoll_try_reap_events() to poll submitted IOs and
> > release their resources, so there is no need to rely on the device
> > timeout handler any more, and the extra exit delay can be avoided.
> > 
> > But io_iopoll_try_reap_events() may not be enough, because the io_wq
> > associated with the current context can get the released resources
> > immediately and submit new IOs successfully; then who can poll these
> > newly submitted IOs? All device resources could end up held by this
> > (freed) io_wq for nothing.
> > 
> > I guess we may have to take the approach in patch V2 of only canceling
> > polled IO to avoid the thread_exit regression, or are there other ideas?
> 
> Ideally the behavior seems like it should be that if a task goes away,
> any pending polled IO it has should be reaped. With the above notion
> that a driver supporting poll absolutely must be able to deal with
> timeouts, it's not a strict requirement as we know that requests will be
> reaped.

Then it looks like the io_uring fix is less important, and I will see if
an easy fix can be figured out; one way is to reap events when exiting
both the current task and the associated io_wq.
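
Something along these lines, perhaps (a rough sketch of the idea only;
the helper name is made up):

/* Factor the reap loop out so it can be called both from
 * io_uring_cancel_generic() on task exit and from io-wq teardown. */
static void io_uring_try_reap_iopoll(struct io_uring_task *tctx)
{
	struct io_tctx_node *node;
	unsigned long index;

	xa_for_each(&tctx->xa, index, node)
		io_iopoll_try_reap_events(node->ctx);
}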

Thanks,
Ming



[PATCH V3 04/14] virtio-blk: limit max allowed submit queues

2023-08-08 Thread Ming Lei
Take blk-mq's knowledge into account when calculating the number of IO
queues.

This fixes a wrong queue mapping in the kdump kernel case.

On arm and ppc64, 'maxcpus=1' is passed on the kdump kernel command line,
see `Documentation/admin-guide/kdump/kdump.rst`, so num_possible_cpus()
still returns all CPUs, because 'maxcpus=1' only brings up a single CPU
core during boot.

blk-mq sees a single queue in the kdump kernel, while from the driver's
viewpoint there are still multiple queues; this inconsistency causes the
driver to apply a wrong queue mapping when handling IO, and IO timeouts
are triggered.

Meanwhile, a single queue uses far fewer resources and reduces the risk
of kernel failure.

Cc: Michael S. Tsirkin 
Cc: Jason Wang 
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Ming Lei 
---
 drivers/block/virtio_blk.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 1fe011676d07..4ba79fe2a1b4 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -1047,7 +1047,8 @@ static int init_vq(struct virtio_blk *vblk)
 
num_poll_vqs = min_t(unsigned int, poll_queues, num_vqs - 1);
 
-   vblk->io_queues[HCTX_TYPE_DEFAULT] = num_vqs - num_poll_vqs;
+   vblk->io_queues[HCTX_TYPE_DEFAULT] = min_t(unsigned,
+   num_vqs - num_poll_vqs, blk_mq_max_nr_hw_queues());
vblk->io_queues[HCTX_TYPE_READ] = 0;
vblk->io_queues[HCTX_TYPE_POLL] = num_poll_vqs;
 
-- 
2.40.1
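
For readers without the rest of this series: blk_mq_max_nr_hw_queues() is
presumably added by an earlier patch in the set. A minimal sketch of the
intended semantics, illustrative only and not the actual implementation:

unsigned int blk_mq_max_nr_hw_queues(void)
{
	/* In a kdump kernel blk-mq uses a single hw queue, so don't let
	 * drivers map vqs onto CPUs that never come up. */
	if (is_kdump_kernel())
		return 1;
	return nr_cpu_ids;
}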



Re: [PATCH V3 1/1] blk-mq: avoid double ->queue_rq() because of early timeout

2022-10-30 Thread Ming Lei
On Wed, Oct 26, 2022 at 01:19:57PM +0800, Ming Lei wrote:
> From: David Jeffery 
> 
> David Jeffery found a double ->queue_rq() issue. So far it can be
> triggered in VM use cases because of long vmexit latency, preempt
> latency of the vCPU pthread, or a long page fault in the vCPU pthread:
> a block IO request can be timed out before being queued to hardware,
> but after blk_mq_start_request() has been called during ->queue_rq().
> The timeout handler may then handle it by requeueing, so a double
> ->queue_rq() is caused, and a kernel panic.
> 
> So far it has been the driver's responsibility to cover the race
> between timeout and completion, so in theory it is supposed to be
> solved in the driver, given that the driver has enough knowledge.
> 
> But it is really a common problem: lots of drivers could have a similar
> issue, it would be hard to fix all affected drivers, and it isn't even
> easy for a driver to handle the race. So David suggests this patch,
> which drains in-progress ->queue_rq() calls to solve the issue.
> 
> Cc: Stefan Hajnoczi 
> Cc: Keith Busch 
> Cc: virtualization@lists.linux-foundation.org
> Cc: Bart Van Assche 
> Signed-off-by: David Jeffery 
> Signed-off-by: Ming Lei 
> ---
> V3:
>   - add callback for handle expired only, suggested by Keith Busch

Hi Jens,

Any chance to merge this fix? Either 6.1 or 6.2 is fine for me.


Thanks,
Ming



[PATCH V3 1/1] blk-mq: avoid double ->queue_rq() because of early timeout

2022-10-25 Thread Ming Lei
From: David Jeffery 

David Jeffery found a double ->queue_rq() issue. So far it can be
triggered in VM use cases because of long vmexit latency, preempt
latency of the vCPU pthread, or a long page fault in the vCPU pthread:
a block IO request can be timed out before being queued to hardware,
but after blk_mq_start_request() has been called during ->queue_rq().
The timeout handler may then handle it by requeueing, so a double
->queue_rq() is caused, and a kernel panic.

So far it has been the driver's responsibility to cover the race
between timeout and completion, so in theory it is supposed to be
solved in the driver, given that the driver has enough knowledge.

But it is really a common problem: lots of drivers could have a similar
issue, it would be hard to fix all affected drivers, and it isn't even
easy for a driver to handle the race. So David suggests this patch,
which drains in-progress ->queue_rq() calls to solve the issue.

Cc: Stefan Hajnoczi 
Cc: Keith Busch 
Cc: virtualization@lists.linux-foundation.org
Cc: Bart Van Assche 
Signed-off-by: David Jeffery 
Signed-off-by: Ming Lei 
---
V3:
- add callback for handle expired only, suggested by Keith Busch
V2:
- follow Jens's suggestion to run sync rcu only if there is timeout
- rename 'now' as 'start_timeout'
 block/blk-mq.c | 56 +++---
 1 file changed, 44 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 33292c01875d..030bbb8deca6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1523,7 +1523,13 @@ static void blk_mq_rq_timed_out(struct request *req)
blk_add_timer(req);
 }
 
-static bool blk_mq_req_expired(struct request *rq, unsigned long *next)
+struct blk_expired_data {
+   bool has_timedout_rq;
+   unsigned long next;
+   unsigned long timeout_start;
+};
+
+static bool blk_mq_req_expired(struct request *rq, struct blk_expired_data *expired)
 {
unsigned long deadline;
 
@@ -1533,13 +1539,13 @@ static bool blk_mq_req_expired(struct request *rq, unsigned long *next)
return false;
 
deadline = READ_ONCE(rq->deadline);
-   if (time_after_eq(jiffies, deadline))
+   if (time_after_eq(expired->timeout_start, deadline))
return true;
 
-   if (*next == 0)
-   *next = deadline;
-   else if (time_after(*next, deadline))
-   *next = deadline;
+   if (expired->next == 0)
+   expired->next = deadline;
+   else if (time_after(expired->next, deadline))
+   expired->next = deadline;
return false;
 }
 
@@ -1555,7 +1561,7 @@ void blk_mq_put_rq_ref(struct request *rq)
 
 static bool blk_mq_check_expired(struct request *rq, void *priv)
 {
-   unsigned long *next = priv;
+   struct blk_expired_data *expired = priv;
 
/*
 * blk_mq_queue_tag_busy_iter() has locked the request, so it cannot
@@ -1564,7 +1570,18 @@ static bool blk_mq_check_expired(struct request *rq, void *priv)
 * it was completed and reallocated as a new request after returning
 * from blk_mq_check_expired().
 */
-   if (blk_mq_req_expired(rq, next))
+   if (blk_mq_req_expired(rq, expired)) {
+   expired->has_timedout_rq = true;
+   return false;
+   }
+   return true;
+}
+
+static bool blk_mq_handle_expired(struct request *rq, void *priv)
+{
+   struct blk_expired_data *expired = priv;
+
+   if (blk_mq_req_expired(rq, expired))
blk_mq_rq_timed_out(rq);
return true;
 }
@@ -1573,7 +1590,9 @@ static void blk_mq_timeout_work(struct work_struct *work)
 {
struct request_queue *q =
container_of(work, struct request_queue, timeout_work);
-   unsigned long next = 0;
+   struct blk_expired_data expired = {
+   .timeout_start = jiffies,
+   };
struct blk_mq_hw_ctx *hctx;
unsigned long i;
 
@@ -1593,10 +1612,23 @@ static void blk_mq_timeout_work(struct work_struct *work)
if (!percpu_ref_tryget(&q->q_usage_counter))
return;
 
-   blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
+   /* check if there is any timed-out request */
+   blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &expired);
+   if (expired.has_timedout_rq) {
+   /*
+* Before walking tags, we must ensure any submit started
+* before the current time has finished. Since the submit
+* uses srcu or rcu, wait for a synchronization point to
+* ensure all running submits have finished
+*/
+   blk_mq_wait_quiesce_done(q);
+
+   expired.next = 0;
+   blk_mq_queue_tag_busy_iter(q, blk_mq_handle_expired, &expired);
+   }
 
-   if (next != 0) {
-   mod_timer(&q->timeout, next);
+   if (expired.next != 0) {
+   mod_timer(&q->timeout, expired.next);

Re: [PATCH V2 1/1] blk-mq: avoid double ->queue_rq() because of early timeout

2022-10-25 Thread Ming Lei
On Tue, Oct 25, 2022 at 09:11:37PM -0600, Keith Busch wrote:
> On Wed, Oct 26, 2022 at 09:55:21AM +0800, Ming Lei wrote:
> > @@ -1564,8 +1571,13 @@ static bool blk_mq_check_expired(struct request *rq, void *priv)
> >  * it was completed and reallocated as a new request after returning
> >  * from blk_mq_check_expired().
> >  */
> > -   if (blk_mq_req_expired(rq, next))
> > +   if (blk_mq_req_expired(rq, expired)) {
> > +   if (expired->check_only) {
> > +   expired->has_timedout_rq = true;
> > +   return false;
> > +   }
> > blk_mq_rq_timed_out(rq);
> > +   }
> > return true;
> >  }
> >  
> > @@ -1573,7 +1585,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
> >  {
> > struct request_queue *q =
> > container_of(work, struct request_queue, timeout_work);
> > -   unsigned long next = 0;
> > +   struct blk_expired_data expired = {
> > +   .check_only = true,
> > +   .timeout_start = jiffies,
> > +   };
> > struct blk_mq_hw_ctx *hctx;
> > unsigned long i;
> >  
> > @@ -1593,10 +1608,24 @@ static void blk_mq_timeout_work(struct work_struct *work)
> > if (!percpu_ref_tryget(&q->q_usage_counter))
> > return;
> >  
> > -   blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
> > +   /* check if there is any timed-out request */
> > +   blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &expired);
> > +   if (expired.has_timedout_rq) {
> > +   /*
> > +* Before walking tags, we must ensure any submit started
> > +* before the current time has finished. Since the submit
> > +* uses srcu or rcu, wait for a synchronization point to
> > +* ensure all running submits have finished
> > +*/
> > +   blk_mq_wait_quiesce_done(q);
> > +
> > +   expired.check_only = false;
> > +   expired.next = 0;
> > +   blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &expired);
> 
> I think it would be easier to follow with separate callbacks instead of
> special casing for 'check_only'. One callback for checking timeouts, and
> a different one for handling them?

The two are basically the same; with two callbacks, just the .check_only
field is saved, nothing else, while one extra, very similar callback is
added.

If you or anyone else thinks it is a big deal, I can switch to the
two-callback version.


Thanks,
Ming



[PATCH V2 1/1] blk-mq: avoid double ->queue_rq() because of early timeout

2022-10-25 Thread Ming Lei
From: David Jeffery 

David Jeffery found a double ->queue_rq() issue. So far it can be
triggered in VM use cases because of long vmexit latency, preempt
latency of the vCPU pthread, or a long page fault in the vCPU pthread:
a block IO request can be timed out before being queued to hardware,
but after blk_mq_start_request() has been called during ->queue_rq().
The timeout handler may then handle it by requeueing, so a double
->queue_rq() is caused, and a kernel panic.

So far it has been the driver's responsibility to cover the race
between timeout and completion, so in theory it is supposed to be
solved in the driver, given that the driver has enough knowledge.

But it is really a common problem: lots of drivers could have a similar
issue, it would be hard to fix all affected drivers, and it isn't even
easy for a driver to handle the race. So David suggests this patch,
which drains in-progress ->queue_rq() calls to solve the issue.

Cc: Stefan Hajnoczi 
Cc: Keith Busch 
Cc: virtualization@lists.linux-foundation.org
Cc: Bart Van Assche 
Signed-off-by: David Jeffery 
Signed-off-by: Ming Lei 
---
V2:
- follow Jens's suggestion to run sync rcu only if there is timeout
- rename 'now' as 'start_timeout'

 block/blk-mq.c | 53 ++
 1 file changed, 41 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 33292c01875d..431e71af0e06 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1523,7 +1523,14 @@ static void blk_mq_rq_timed_out(struct request *req)
blk_add_timer(req);
 }
 
-static bool blk_mq_req_expired(struct request *rq, unsigned long *next)
+struct blk_expired_data {
+   bool check_only;
+   bool has_timedout_rq;
+   unsigned long next;
+   unsigned long timeout_start;
+};
+
+static bool blk_mq_req_expired(struct request *rq, struct blk_expired_data 
*expired)
 {
unsigned long deadline;
 
@@ -1533,13 +1540,13 @@ static bool blk_mq_req_expired(struct request *rq, unsigned long *next)
return false;
 
deadline = READ_ONCE(rq->deadline);
-   if (time_after_eq(jiffies, deadline))
+   if (time_after_eq(expired->timeout_start, deadline))
return true;
 
-   if (*next == 0)
-   *next = deadline;
-   else if (time_after(*next, deadline))
-   *next = deadline;
+   if (expired->next == 0)
+   expired->next = deadline;
+   else if (time_after(expired->next, deadline))
+   expired->next = deadline;
return false;
 }
 
@@ -1555,7 +1562,7 @@ void blk_mq_put_rq_ref(struct request *rq)
 
 static bool blk_mq_check_expired(struct request *rq, void *priv)
 {
-   unsigned long *next = priv;
+   struct blk_expired_data *expired = priv;
 
/*
 * blk_mq_queue_tag_busy_iter() has locked the request, so it cannot
@@ -1564,8 +1571,13 @@ static bool blk_mq_check_expired(struct request *rq, void *priv)
 * it was completed and reallocated as a new request after returning
 * from blk_mq_check_expired().
 */
-   if (blk_mq_req_expired(rq, next))
+   if (blk_mq_req_expired(rq, expired)) {
+   if (expired->check_only) {
+   expired->has_timedout_rq = true;
+   return false;
+   }
blk_mq_rq_timed_out(rq);
+   }
return true;
 }
 
@@ -1573,7 +1585,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
 {
struct request_queue *q =
container_of(work, struct request_queue, timeout_work);
-   unsigned long next = 0;
+   struct blk_expired_data expired = {
+   .check_only = true,
+   .timeout_start = jiffies,
+   };
struct blk_mq_hw_ctx *hctx;
unsigned long i;
 
@@ -1593,10 +1608,24 @@ static void blk_mq_timeout_work(struct work_struct *work)
if (!percpu_ref_tryget(&q->q_usage_counter))
return;
 
-   blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
+   /* check if there is any timed-out request */
+   blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &expired);
+   if (expired.has_timedout_rq) {
+   /*
+* Before walking tags, we must ensure any submit started
+* before the current time has finished. Since the submit
+* uses srcu or rcu, wait for a synchronization point to
+* ensure all running submits have finished
+*/
+   blk_mq_wait_quiesce_done(q);
+
+   expired.check_only = false;
+   expired.next = 0;
+   blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &expired);
+   }
 
-   if (next != 0) {
-   mod_timer(&q->timeout, next);
+   if (expired.next != 0) {
+   mod_timer(&q->timeout, expired.next);
} else {
/*
 * Request timeouts are handled as a forward rolling timer. If

Re: [PATCH] blk-mq: avoid double ->queue_rq() because of early timeout

2022-10-25 Thread Ming Lei
On Tue, Oct 25, 2022 at 12:11:39PM -0600, Jens Axboe wrote:
> On 10/24/22 6:55 PM, Ming Lei wrote:
> > @@ -1593,10 +1598,18 @@ static void blk_mq_timeout_work(struct work_struct *work)
> > if (!percpu_ref_tryget(&q->q_usage_counter))
> > return;
> >  
> > -   blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
> > +   /*
> > +* Before walking tags, we must ensure any submit started before
> > +* the current time has finished. Since the submit uses srcu or rcu,
> > +* wait for a synchronization point to ensure all running submits
> > +* have finished
> > +*/
> > +   blk_mq_wait_quiesce_done(q);
> 
> I'm a little worried about this bit - so we'll basically do a sync RCU
> every time the timeout timer runs... Depending on machine load, that
> can take a long time.

Yeah, the per-queue timeout timer is never canceled after a request is
completed, so most of the time the timeout work does nothing.

Can we run the sync RCU only if a timed-out request is actually found?
Then the wait is only needed when timeout handling is required. Also,
sync RCU is already done in some drivers' ->timeout().


Thanks,
Ming



[PATCH] blk-mq: avoid double ->queue_rq() because of early timeout

2022-10-24 Thread Ming Lei
From: David Jeffery 

David Jeffery found a double ->queue_rq() issue. So far it can be
triggered in VM use cases because of long vmexit latency, preempt
latency of the vCPU pthread, or a long page fault in the vCPU pthread:
a block IO request can be timed out before being queued to hardware,
but after blk_mq_start_request() has been called during ->queue_rq().
The timeout handler may then handle it by requeueing, so a double
->queue_rq() is caused, and a kernel panic.

So far it has been the driver's responsibility to cover the race
between timeout and completion, so in theory it is supposed to be
solved in the driver, given that the driver has enough knowledge.

But it is really a common problem: lots of drivers could have a similar
issue, it would be hard to fix all affected drivers, and it isn't even
easy for a driver to handle the race. So David suggests this patch,
which drains in-progress ->queue_rq() calls to solve the issue.

Cc: Bart Van Assche 
Cc: David Jeffery 
Cc: Stefan Hajnoczi 
Cc: Keith Busch 
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 37 +
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 33292c01875d..e0aafbc7390f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1523,7 +1523,12 @@ static void blk_mq_rq_timed_out(struct request *req)
blk_add_timer(req);
 }
 
-static bool blk_mq_req_expired(struct request *rq, unsigned long *next)
+struct blk_expired_data {
+   unsigned long next;
+   unsigned long now;
+};
+
+static bool blk_mq_req_expired(struct request *rq, struct blk_expired_data *expired)
 {
unsigned long deadline;
 
@@ -1533,13 +1538,13 @@ static bool blk_mq_req_expired(struct request *rq, unsigned long *next)
return false;
 
deadline = READ_ONCE(rq->deadline);
-   if (time_after_eq(jiffies, deadline))
+   if (time_after_eq(expired->now, deadline))
return true;
 
-   if (*next == 0)
-   *next = deadline;
-   else if (time_after(*next, deadline))
-   *next = deadline;
+   if (expired->next == 0)
+   expired->next = deadline;
+   else if (time_after(expired->next, deadline))
+   expired->next = deadline;
return false;
 }
 
@@ -1555,7 +1560,7 @@ void blk_mq_put_rq_ref(struct request *rq)
 
 static bool blk_mq_check_expired(struct request *rq, void *priv)
 {
-   unsigned long *next = priv;
+   struct blk_expired_data *expired = priv;
 
/*
 * blk_mq_queue_tag_busy_iter() has locked the request, so it cannot
@@ -1564,7 +1569,7 @@ static bool blk_mq_check_expired(struct request *rq, void *priv)
 * it was completed and reallocated as a new request after returning
 * from blk_mq_check_expired().
 */
-   if (blk_mq_req_expired(rq, next))
+   if (blk_mq_req_expired(rq, expired))
blk_mq_rq_timed_out(rq);
return true;
 }
@@ -1573,7 +1578,7 @@ static void blk_mq_timeout_work(struct work_struct *work)
 {
struct request_queue *q =
container_of(work, struct request_queue, timeout_work);
-   unsigned long next = 0;
+   struct blk_expired_data expired = {.next = 0, .now = jiffies};
struct blk_mq_hw_ctx *hctx;
unsigned long i;
 
@@ -1593,10 +1598,18 @@ static void blk_mq_timeout_work(struct work_struct *work)
if (!percpu_ref_tryget(&q->q_usage_counter))
return;
 
-   blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
+   /*
+* Before walking tags, we must ensure any submit started before
+* the current time has finished. Since the submit uses srcu or rcu,
+* wait for a synchronization point to ensure all running submits
+* have finished
+*/
+   blk_mq_wait_quiesce_done(q);
+
+   blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &expired);
 
-   if (next != 0) {
-   mod_timer(&q->timeout, next);
+   if (expired.next != 0) {
+   mod_timer(&q->timeout, expired.next);
} else {
/*
 * Request timeouts are handled as a forward rolling timer. If
-- 
2.31.1



Re: [Bug] double ->queue_rq() because of timeout in ->queue_rq()

2022-10-24 Thread Ming Lei
On Mon, Oct 24, 2022 at 11:30:39AM -0400, Stefan Hajnoczi wrote:
> On Fri, Oct 21, 2022 at 10:23:57AM +0800, Ming Lei wrote:
> > On Thu, Oct 20, 2022 at 04:01:11PM -0400, Stefan Hajnoczi wrote:
> > > On Thu, Oct 20, 2022 at 05:10:13PM +0800, Ming Lei wrote:
> > > > Hi,
> > > > 
> > > > David Jeffery found one double ->queue_rq() issue, so far it can
> > > > be triggered in the following two cases:
> > > > 
> > > > 1) scsi driver in guest kernel
> > > > 
> > > > - the story could be long vmexit latency or long preempt latency of
> > > > the vCPU pthread; the IO request is then timed out before being
> > > > queued to hardware but after blk_mq_start_request() has been called
> > > > during ->queue_rq(), the timeout handler handles it by requeueing,
> > > > and a double ->queue_rq() is caused, plus a kernel panic
> > > > 
> > > > 2) burst of kernel messages from irq handler 
> > > > 
> > > > For 1), I think it is a reasonable case, given that latency from the
> > > > host side can in theory come at any time, because a vCPU is emulated
> > > > by a normal host pthread which can be preempted anywhere. For 2), I
> > > > guess kernel messages are supposed to be rate limited.
> > > > 
> > > > First, is this kind of long (30 sec) random latency while running
> > > > kernel code something normal, or do we need to take care of it? IMO
> > > > it looks plausible in the VM case, but our VM experts may have a
> > > > better idea about this situation. Also, the default 30 sec timeout
> > > > can be reduced via sysfs or by drivers.
> > > 
> > > 30 seconds is a long latency that does not occur during normal
> > > operation, but unfortunately does happen on occasion.
> > 
> > Thanks for the confirmation!
> > 
> > > 
> > > I think there's an interest in understanding the root cause and solving
> > > long latencies (if possible) in the QEMU/KVM communities. We can
> > > investigate specific cases on k...@vger.kernel.org and/or
> > > qemu-de...@nongnu.org.
> > 
> > The issue was originally reported on a VMware VM, but maybe David can
> > figure out how to trigger it on QEMU/KVM.
> 
> A very basic question:
> 
> The virtio_blk driver has no q->mq_ops->timeout() callback. Why does the
> block layer still enable the timeout mechanism when the driver doesn't
> implement ->timeout()?

Whether or not ->timeout() is implemented, a request may still time out,
and it is better for the block layer to detect that and simply reset the
timer when there is no ->timeout().

> 
> I saw there was some "idle" hctx logic and I guess the requests are

The timeout timer is reused for idle hctx detection.

> resubmitted (although it wasn't obvious to me how that happens in the
> code)? Maybe that's why the timer is still used if the driver doesn't
> care about timeouts...

Timeout handling is entirely decided by the driver's ->timeout()
callback. If the driver doesn't implement ->timeout(), the request's
timer is simply reset.
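
The relevant blk-mq logic looks roughly like this (paraphrasing
blk_mq_rq_timed_out() in block/blk-mq.c):

static void blk_mq_rq_timed_out(struct request *req)
{
	req->rq_flags |= RQF_TIMED_OUT;
	if (req->q->mq_ops->timeout) {
		enum blk_eh_timer_return ret;

		ret = req->q->mq_ops->timeout(req);
		if (ret == BLK_EH_DONE)
			return;		/* driver handled it */
		WARN_ON_ONCE(ret != BLK_EH_RESET_TIMER);
	}
	/* no ->timeout(), or the driver asked to re-arm: reset the timer */
	blk_add_timer(req);
}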



Thanks
Ming



Re: [Bug] double ->queue_rq() because of timeout in ->queue_rq()

2022-10-21 Thread Ming Lei
On Fri, Oct 21, 2022 at 02:33:21PM -0400, David Jeffery wrote:
> On Fri, Oct 21, 2022 at 11:22 AM Ming Lei  wrote:
> >
> > On Fri, Oct 21, 2022 at 08:32:31AM -0600, Keith Busch wrote:
> > >
> > > I agree with your idea that this is a lower level driver responsibility:
> > > it should reclaim all started requests before allowing new queuing.
> > > Perhaps the block layer should also raise a clear warning if it's
> > > queueing a request that's already started.
> >
> > The thing is that this is a generic issue: lots of VM drivers could be
> > affected, and it may not be easy for drivers to handle the race either.
> >
> 
> While virtual systems are a common source of the problem, fully
> preempt kernels (with or without real-time patches) can also trigger
> this condition rather simply with a poorly behaved real-time task. The
> involuntary preemption means the queue_rq call can be stopped to let
> another task run. Poorly behaving tasks claiming the CPU for longer
> than the request timeout when preempting a task in a queue_rq function
> could cause the condition on real or virtual hardware. So it's not
> just VM related drivers that are affected by the race.

In theory, yes. But ->queue_rq() runs in an RCU read-side critical
section, and usually CONFIG_RCU_BOOST is enabled to cover this problem,
since otherwise OOM can easily be triggered too.

I guess it is hard to trigger on real hardware with a preempt kernel.


Thanks,
Ming



Re: [Bug] double ->queue_rq() because of timeout in ->queue_rq()

2022-10-21 Thread Ming Lei
On Fri, Oct 21, 2022 at 08:32:31AM -0600, Keith Busch wrote:
> On Thu, Oct 20, 2022 at 05:10:13PM +0800, Ming Lei wrote:
> > @@ -1593,10 +1598,17 @@ static void blk_mq_timeout_work(struct work_struct *work)
> > if (!percpu_ref_tryget(&q->q_usage_counter))
> > return;
> >  
> > -   blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
> > +   /* Before walking tags, we must ensure any submit started before the
> > +* current time has finished. Since the submit uses srcu or rcu, wait
> > +* for a synchronization point to ensure all running submits have
> > +* finished
> > +*/
> > +   blk_mq_wait_quiesce_done(q);
> > +
> > +   blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &expired);
> 
> The blk_mq_wait_quiesce_done() will only wait for tasks that entered
> just before calling that function. It will not wait for tasks that
> entered immediately after.

Yeah, but the patch records jiffies before calling
blk_mq_wait_quiesce_done(), and only times out requests whose deadline
passed before the recorded time, so it is fine to use
blk_mq_wait_quiesce_done() in this way.
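
In other words, sketching the ordering (not the literal patch):

	expired.timeout_start = jiffies;  /* (1) snapshot the clock */

	/* (2) every ->queue_rq() that started before this call has
	 * finished by the time it returns */
	blk_mq_wait_quiesce_done(q);

	/* (3) expire only requests whose deadline is before (1); any
	 * request submitted after (2) has a deadline after (1), so no
	 * request with an in-flight ->queue_rq() can be timed out here */
	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &expired);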

> 
> If I correctly understand the problem you're describing, the hypervisor
> may prevent any guest process from running. If so, the timeout work may
> be stalled after the quiesce, and if a queue_rq() process also stalled
> after starting quiesce_done(), then we're in the same situation you're
> trying to prevent, right?

No, the stall just happens on one vCPU, and other vCPUs may run smoothly.

1) vmexit, which only stalls one vCPU, some vmexit could come anytime,
such as external interrupt

2) a vCPU is usually emulated by a pthread, which is just a normal host
userspace thread that can be preempted at any time, and the preempt
latency can be long when the system load is heavy.

So it is like a random stall injected while running any instruction of
the VM kernel code.

> 
> I agree with your idea that this is a lower level driver responsibility:
> it should reclaim all started requests before allowing new queuing.
> Perhaps the block layer should also raise a clear warning if it's
> queueing a request that's already started.

The thing is that this is a generic issue: lots of VM drivers could be
affected, and it may not be easy for drivers to handle the race either.



Thanks,
Ming



Re: [Bug] double ->queue_rq() because of timeout in ->queue_rq()

2022-10-20 Thread Ming Lei
On Thu, Oct 20, 2022 at 04:01:11PM -0400, Stefan Hajnoczi wrote:
> On Thu, Oct 20, 2022 at 05:10:13PM +0800, Ming Lei wrote:
> > Hi,
> > 
> > David Jeffery found one double ->queue_rq() issue, so far it can
> > be triggered in the following two cases:
> > 
> > 1) scsi driver in guest kernel
> > 
> > - the story could be long vmexit latency or long preempt latency of
> > the vCPU pthread; the IO request is then timed out before being queued
> > to hardware but after blk_mq_start_request() has been called during
> > ->queue_rq(), the timeout handler handles it by requeueing, and a
> > double ->queue_rq() is caused, plus a kernel panic
> > 
> > 2) burst of kernel messages from irq handler 
> > 
> > For 1), I think it is a reasonable case, given that latency from the
> > host side can in theory come at any time, because a vCPU is emulated by
> > a normal host pthread which can be preempted anywhere. For 2), I guess
> > kernel messages are supposed to be rate limited.
> > 
> > First, is this kind of long (30 sec) random latency while running
> > kernel code something normal, or do we need to take care of it? IMO it
> > looks plausible in the VM case, but our VM experts may have a better
> > idea about this situation. Also, the default 30 sec timeout can be
> > reduced via sysfs or by drivers.
> 
> 30 seconds is a long latency that does not occur during normal
> operation, but unfortunately does happen on occasion.

Thanks for the confirmation!

> 
> I think there's an interest in understanding the root cause and solving
> long latencies (if possible) in the QEMU/KVM communities. We can
> investigate specific cases on k...@vger.kernel.org and/or
> qemu-de...@nongnu.org.

The issue was originally reported on a VMware VM, but maybe David can
figure out how to trigger it on QEMU/KVM.



Thanks, 
Ming



Re: [Bug] double ->queue_rq() because of timeout in ->queue_rq()

2022-10-20 Thread Ming Lei
On Thu, Oct 20, 2022 at 01:26:48PM -0700, Bart Van Assche wrote:
> On 10/20/22 02:10, Ming Lei wrote:
> > [ ... ]
> 
> Hi Ming,
> 
> Fixing this in the block layer seems fine to me. A few comments:
> 
> > +   /* Before walking tags, we must ensure any submit started before the
> > +* current time has finished. Since the submit uses srcu or rcu, wait
> > +* for a synchronization point to ensure all running submits have
> > +* finished
> > +*/
> 
> Should the above comment follow the style of other comments in the block
> layer?

OK.

> 
> > +   blk_mq_wait_quiesce_done(q);
> > +
> > +   blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &expired);
> 
> The above doesn't look sufficient to me since .queue_rq() may be called
> while blk_mq_queue_tag_busy_iter() is in progress. How about moving the
> blk_mq_wait_quiesce_done() call into blk_mq_check_expired() and preventing
> new .queue_rq() calls before the timeout handler is called?

blk_mq_timeout_work() records the time before calling
blk_mq_wait_quiesce_done(), and only handles requests which timed out
before the recorded jiffies, so newly queued requests won't be covered
this time around.

Thanks,
Ming



[Bug] double ->queue_rq() because of timeout in ->queue_rq()

2022-10-20 Thread Ming Lei
Hi,

David Jeffery found one double ->queue_rq() issue, so far it can
be triggered in the following two cases:

1) scsi driver in guest kernel

- the story could be long vmexit latency or long preempt latency of the
vCPU pthread; the IO request is then timed out before being queued to
hardware but after blk_mq_start_request() has been called during
->queue_rq(), the timeout handler handles it by requeueing, and a double
->queue_rq() is caused, plus a kernel panic

2) burst of kernel messages from irq handler 

For 1), I think it is a reasonable case, given that latency from the
host side can in theory come at any time, because a vCPU is emulated by
a normal host pthread which can be preempted anywhere. For 2), I guess
kernel messages are supposed to be rate limited.

First, is this kind of long (30 sec) random latency while running kernel
code something normal, or do we need to take care of it? IMO it looks
plausible in the VM case, but our VM experts may have a better idea
about this situation. Also, the default 30 sec timeout can be reduced
via sysfs or by drivers.

Supposing it is a reasonable report to fix, what is the preferred
solution?

So far it has been the driver's responsibility to cover the race between
timeout and completion, so in theory it is supposed to be solved in the
driver, given that the driver has enough knowledge.

But it is really a common problem: lots of drivers could have a similar
issue, and it would be hard to fix all affected drivers. So David
suggests the following patch, which drains in-progress ->queue_rq()
calls for this issue, and the patch looks reasonable too.

Any comments for this issue and the solution?


diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8070b6c10e8d..ca57c060bb65 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1523,7 +1523,12 @@ static void blk_mq_rq_timed_out(struct request *req)
blk_add_timer(req);
 }
 
-static bool blk_mq_req_expired(struct request *rq, unsigned long *next)
+struct blk_expired_data {
+   unsigned long next;
+   unsigned long now;
+};
+
+static bool blk_mq_req_expired(struct request *rq, struct blk_expired_data *expired)
 {
unsigned long deadline;
 
@@ -1533,13 +1538,13 @@ static bool blk_mq_req_expired(struct request *rq, unsigned long *next)
return false;
 
deadline = READ_ONCE(rq->deadline);
-   if (time_after_eq(jiffies, deadline))
+   if (time_after_eq(expired->now, deadline))
return true;
 
-   if (*next == 0)
-   *next = deadline;
-   else if (time_after(*next, deadline))
-   *next = deadline;
+   if (expired->next == 0)
+   expired->next = deadline;
+   else if (time_after(expired->next, deadline))
+   expired->next = deadline;
return false;
 }
 
@@ -1555,7 +1560,7 @@ void blk_mq_put_rq_ref(struct request *rq)
 
 static bool blk_mq_check_expired(struct request *rq, void *priv)
 {
-   unsigned long *next = priv;
+   struct blk_expired_data *expired = priv;
 
/*
 * blk_mq_queue_tag_busy_iter() has locked the request, so it cannot
@@ -1564,7 +1569,7 @@ static bool blk_mq_check_expired(struct request *rq, void *priv)
 * it was completed and reallocated as a new request after returning
 * from blk_mq_check_expired().
 */
-   if (blk_mq_req_expired(rq, next))
+   if (blk_mq_req_expired(rq, expired))
blk_mq_rq_timed_out(rq);
return true;
 }
@@ -1573,7 +1578,7 @@ static void blk_mq_timeout_work(struct work_struct *work)
 {
struct request_queue *q =
container_of(work, struct request_queue, timeout_work);
-   unsigned long next = 0;
+   struct blk_expired_data expired = {.next = 0, .now = jiffies};
struct blk_mq_hw_ctx *hctx;
unsigned long i;
 
@@ -1593,10 +1598,17 @@ static void blk_mq_timeout_work(struct work_struct *work)
if (!percpu_ref_tryget(&q->q_usage_counter))
return;
 
-   blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
+   /* Before walking tags, we must ensure any submit started before the
+* current time has finished. Since the submit uses srcu or rcu, wait
+* for a synchronization point to ensure all running submits have
+* finished
+*/
+   blk_mq_wait_quiesce_done(q);
+
+   blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &expired);
 
-   if (next != 0) {
-   mod_timer(&q->timeout, next);
+   if (expired.next != 0) {
+   mod_timer(&q->timeout, expired.next);
} else {
/*
 * Request timeouts are handled as a forward rolling timer. If



Thanks, 
Ming



Re: [PATCH v1] virtio_blk: should not use IRQD_AFFINITY_MANAGED in init_rq

2022-09-28 Thread Ming Lei
On Sat, Sep 24, 2022 at 11:48:54AM +0800, Angus Chen wrote:
> The background is that we use a DPU in cloud computing; the arch is x86
> with 80 cores. We will have lots of virtio devices, like 512 or more.
> When we probe about 200 virtio_blk devices, probing fails and the stack
> is printed as follows:
> 
> [25338.485128] virtio-pci :b3:00.0: virtio_pci: leaving for legacy driver
> [25338.496174] genirq: Flags mismatch irq 0. 0080 (virtio418) vs. 
> 00015a00 (timer)
> [25338.503822] CPU: 20 PID: 5431 Comm: kworker/20:0 Kdump: loaded Tainted: G  
>  OE- -  - 4.18.0-305.30.1.el8.x86_64
> [25338.516403] Hardware name: Inspur NF5280M5/YZMB-00882-10E, BIOS 4.1.21 
> 08/25/2021
> [25338.523881] Workqueue: events work_for_cpu_fn
> [25338.528235] Call Trace:
> [25338.530687]  dump_stack+0x5c/0x80
> [25338.534000]  __setup_irq.cold.53+0x7c/0xd3
> [25338.538098]  request_threaded_irq+0xf5/0x160
> [25338.542371]  vp_find_vqs+0xc7/0x190
> [25338.545866]  init_vq+0x17c/0x2e0 [virtio_blk]
> [25338.550223]  ? ncpus_cmp_func+0x10/0x10
> [25338.554061]  virtblk_probe+0xe6/0x8a0 [virtio_blk]
> [25338.558846]  virtio_dev_probe+0x158/0x1f0
> [25338.562861]  really_probe+0x255/0x4a0
> [25338.566524]  ? __driver_attach_async_helper+0x90/0x90
> [25338.571567]  driver_probe_device+0x49/0xc0
> [25338.575660]  bus_for_each_drv+0x79/0xc0
> [25338.579499]  __device_attach+0xdc/0x160
> [25338.583337]  bus_probe_device+0x9d/0xb0
> [25338.587167]  device_add+0x418/0x780
> [25338.590654]  register_virtio_device+0x9e/0xe0
> [25338.595011]  virtio_pci_probe+0xb3/0x140
> [25338.598941]  local_pci_probe+0x41/0x90
> [25338.602689]  work_for_cpu_fn+0x16/0x20
> [25338.606443]  process_one_work+0x1a7/0x360
> [25338.610456]  ? create_worker+0x1a0/0x1a0
> [25338.614381]  worker_thread+0x1cf/0x390
> [25338.618132]  ? create_worker+0x1a0/0x1a0
> [25338.622051]  kthread+0x116/0x130
> [25338.625283]  ? kthread_flush_work_fn+0x10/0x10
> [25338.629731]  ret_from_fork+0x1f/0x40
> [25338.633395] virtio_blk: probe of virtio418 failed with error -16
> 
> After I did some work on this stack, using stap and crash to get more
> information, I found that the automatic irq affinity affects this.
> When vp_find_vqs() calls vp_find_vqs_msix() and it fails, it goes back
> and calls vp_find_vqs_msix() again with ctx false, and when that fails
> again, we call vp_find_vqs_intx(); if vp_dev->pci_dev->irq is zero,
> we get a backtrace like the one above.
> 
> The log :
> "genirq: Flags mismatch irq 0. 0080 (virtio418) vs. 00015a00 (timer)"
> was printed because irq 0 is used exclusively by the timer; when
> vp_find_vqs() has called vp_find_vqs_msix() and it has failed twice,
> it calls vp_find_vqs_intx() as a last try.
> Because vp_dev->pci_dev->irq is zero, it requests irq 0 with the
> IRQF_SHARED flag.
> 
> Without CONFIG_GENERIC_IRQ_DEBUGFS configured,
> I found that the two vp_find_vqs_msix() calls failed because
> the irq resources were exhausted.
> 
> crash> irq_domain.name,parent 0x9bff87d4dec0
>   name = 0x9bff87c1fd60 "INTEL-IR-MSI-1-2"
>   parent = 0x9bff8740
> crash> irq_domain.name,parent 0x9bff8740
>   name = 0x9bff87c24300 "INTEL-IR-1"
>   parent = 0x9bff87c6c900
> crash> irq_domain.name,parent 0x9bff87c6c900
>   name = 0x9bff87c3ecd0 "VECTOR"
>   parent = 0x0--the highest level
> 
> and stap irq_matrix_alloc_managed get return value -ENOSPC.
> 
> Before any virtio_blk device is probed, the vector_matrix is:
> crash>  p *vector_matrix
> $1 = {
>   matrix_bits = 256,
>   alloc_start = 32,
>   alloc_end = 236,
>   alloc_size = 204,
>   global_available = 15593,
>   global_reserved = 149,
>   systembits_inalloc = 3,
>   total_allocated = 409,
>   online_maps = 80,
>   maps = 0x2ff20,
>   scratch_map = {1161063342014463, 0, 1, 18446726481523507200,
>  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>   system_map = {1125904739729407, 0, 1, 18446726481523507200,
>  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
> }
> 
> When the dump stack occurs, the system's vector_matrix is exhausted.
> crash> p *vector_matrix
> $82 = {
>   matrix_bits = 256,
>   alloc_start = 32,
>   alloc_end = 236,
>   alloc_size = 204,
>   global_available = 0,//caq:irq left
>   global_reserved = 151,
>   systembits_inalloc = 3,
>   total_allocated = 1922,//caq:irq that allocated
>   online_maps = 80,
>   maps = 0x2ff20,
>   scratch_map = {18446744069952503807, 18446744073709551615,
>  18446744073709551615, 18446735277616529407, 0, 0, 0, 0, 0,
>  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
>   system_map = {1125904739729407, 0, 1, 18446726481523507200,
>  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
>  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
> }
> 
> And we tested a virtio_blk device whose irq request succeeded;
> we found that in a system with 80 cores and two NUMA nodes, one
> virtio_blk device with just two data queues 

Re: [PATCH 2/5] memstick/ms_block: simplify refcounting

2022-02-10 Thread Ming Lei
On Wed, Feb 09, 2022 at 09:21:17AM +0100, Christoph Hellwig wrote:
> Implement the ->free_disk method to free the msb_data structure only once
> the last gendisk reference goes away instead of keeping a local refcount.
> 

The approach looks good; just the error handling needs care. For
example, once the driver data is bound to disk->private_data, the
earlier error-handling code must not touch or free the driver data any
more. That is, assigning disk->private_data implies a transfer of
driver-data ownership after this conversion.

For example, in msb_init_disk(), once blk_cleanup_disk() is done, the
out_release_id branch shouldn't be run; msb_probe() has a similar issue.

Thanks,
Ming



Re: [PATCH V2 5/6] virtio: add one field into virtio_device for recording if device uses managed irq

2021-07-08 Thread Ming Lei
On Wed, Jul 07, 2021 at 04:05:29PM +0200, Christoph Hellwig wrote:
> On Wed, Jul 07, 2021 at 05:42:54PM +0800, Ming Lei wrote:
> > The problem is how blk-mq can look at that flag, since the device
> > representing the controller which allocates the irq vectors isn't
> > visible to blk-mq.
> 
> In blk_mq_pci_map_queues and similar helpers.

Firstly, it depends on whether drivers call into these helpers, so this
way is fragile.

Secondly, I don't think it is good to expose specific physical devices
to blk-mq, which shouldn't deal with physical devices directly; also,
all three helpers just duplicate the same logic except for retrieving
each vector's affinity from the specific physical device.

I will think about how to clean them up.


Thanks,
Ming



Re: [PATCH V2 5/6] virtio: add one field into virtio_device for recording if device uses managed irq

2021-07-07 Thread Ming Lei
On Wed, Jul 07, 2021 at 11:06:25AM +0200, Thomas Gleixner wrote:
> On Tue, Jul 06 2021 at 07:42, Christoph Hellwig wrote:
> > On Fri, Jul 02, 2021 at 11:05:54PM +0800, Ming Lei wrote:
> >> blk-mq needs to know if the device uses managed irq, so add one field
> >> to virtio_device for recording if device uses managed irq.
> >> 
> >> If the driver use managed irq, this flag has to be set so it can be
> >> passed to blk-mq.
> >
> > I don't think all this boilerplate code make a whole lot of sense.
> > I think we need to record this information deep down in the irq code by
> > setting a flag in struct device only if pci_alloc_irq_vectors_affinity
> > atually managed to allocate multiple vectors and the PCI_IRQ_AFFINITY
> > flag was set.  Then blk-mq can look at that flag, and also check that
> > more than one queue is in used and work based on that.
> 
> Ack.

The problem is how blk-mq can look at that flag, since the device
representing the controller which allocates the irq vectors isn't
visible to blk-mq.

Usually blk-mq (the block layer) provides the queue-limits abstraction
for drivers to communicate any physical property (DMA segments, max
sectors, ...) to the block layer; that is why this patch takes a very
similar approach to record whether the HBA uses managed irqs or not.


Thanks, 
Ming



Re: [PATCH V2 5/6] virtio: add one field into virtio_device for recording if device uses managed irq

2021-07-06 Thread Ming Lei
On Tue, Jul 06, 2021 at 07:42:03AM +0200, Christoph Hellwig wrote:
> On Fri, Jul 02, 2021 at 11:05:54PM +0800, Ming Lei wrote:
> > blk-mq needs to know if the device uses managed irq, so add one field
> > to virtio_device for recording if device uses managed irq.
> > 
> > If the driver use managed irq, this flag has to be set so it can be
> > passed to blk-mq.
> 
> I don't think all this boilerplate code make a whole lot of sense.
> I think we need to record this information deep down in the irq code by
> setting a flag in struct device only if pci_alloc_irq_vectors_affinity
> atually managed to allocate multiple vectors and the PCI_IRQ_AFFINITY
> flag was set.  Then blk-mq can look at that flag, and also check that
> more than one queue is in used and work based on that.

How can blk-mq look at that flag? Usually blk-mq doesn't deal with the
physical device (HBA) directly.

So far almost all physical properties (segments, max_hw_sectors, ...)
are not provided to blk-mq directly, but via the queue-limits
abstraction.
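
For comparison, the usual pattern is something like this (illustrative
driver code; mydrv_hba and its fields are made up):

static void mydrv_set_queue_limits(struct request_queue *q,
				   struct mydrv_hba *hba)
{
	/* describe the hardware via queue limits rather than exposing
	 * the physical device to blk-mq */
	blk_queue_max_segments(q, hba->max_segments);
	blk_queue_max_hw_sectors(q, hba->max_hw_sectors);
	blk_queue_virt_boundary(q, hba->page_size - 1);
}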

Thanks,
Ming



Re: [PATCH V2 5/6] virtio: add one field into virtio_device for recording if device uses managed irq

2021-07-04 Thread Ming Lei
On Fri, Jul 02, 2021 at 11:55:40AM -0400, Michael S. Tsirkin wrote:
> On Fri, Jul 02, 2021 at 11:05:54PM +0800, Ming Lei wrote:
> > blk-mq needs to know if the device uses managed irq, so add one field
> > to virtio_device for recording if device uses managed irq.
> > 
> > If the driver use managed irq, this flag has to be set so it can be
> > passed to blk-mq.
> > 
> > Cc: "Michael S. Tsirkin" 
> > Cc: Jason Wang 
> > Cc: virtualization@lists.linux-foundation.org
> > Signed-off-by: Ming Lei 
> 
> 
> The API seems somewhat confusing. virtio does not request
> a managed irq as such, does it? I think it's a decision taken
> by the irq core.

vp_request_msix_vectors():

if (desc) {
flags |= PCI_IRQ_AFFINITY;
desc->pre_vectors++; /* virtio config vector */
vdev->use_managed_irq = true;
}

err = pci_alloc_irq_vectors_affinity(vp_dev->pci_dev, nvectors,
 nvectors, flags, desc);

Managed irqs are used once PCI_IRQ_AFFINITY is passed to
pci_alloc_irq_vectors_affinity().

> 
> Any way to query the irq to find out if it's managed?

So far the managed info isn't exported via /proc/irq, but if an irq is
managed, its smp_affinity and smp_affinity_list attributes can't be
changed successfully.

If irq debugfs is enabled, this info can be retrieved there.


Thanks,
Ming



[PATCH V2 5/6] virtio: add one field into virtio_device for recording if device uses managed irq

2021-07-02 Thread Ming Lei
blk-mq needs to know if the device uses managed irqs, so add one field
to virtio_device to record whether it does.

If the driver uses managed irqs, this flag has to be set so it can be
passed on to blk-mq.

Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Ming Lei 
---
 drivers/block/virtio_blk.c | 2 ++
 drivers/scsi/virtio_scsi.c | 1 +
 drivers/virtio/virtio_pci_common.c | 1 +
 include/linux/virtio.h | 1 +
 4 files changed, 5 insertions(+)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index e4bd3b1fc3c2..33b9c80ac475 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -764,6 +764,8 @@ static int virtblk_probe(struct virtio_device *vdev)
vblk->tag_set.queue_depth = queue_depth;
vblk->tag_set.numa_node = NUMA_NO_NODE;
vblk->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+   if (vdev->use_managed_irq)
+   vblk->tag_set.flags |= BLK_MQ_F_MANAGED_IRQ;
vblk->tag_set.cmd_size =
sizeof(struct virtblk_req) +
sizeof(struct scatterlist) * sg_elems;
diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index b9c86a7e3b97..f301917abc84 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -891,6 +891,7 @@ static int virtscsi_probe(struct virtio_device *vdev)
shost->max_channel = 0;
shost->max_cmd_len = VIRTIO_SCSI_CDB_SIZE;
shost->nr_hw_queues = num_queues;
+   shost->use_managed_irq = vdev->use_managed_irq;
 
 #ifdef CONFIG_BLK_DEV_INTEGRITY
if (virtio_has_feature(vdev, VIRTIO_SCSI_F_T10_PI)) {
diff --git a/drivers/virtio/virtio_pci_common.c 
b/drivers/virtio/virtio_pci_common.c
index 222d630c41fc..f2ac48fb477b 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -128,6 +128,7 @@ static int vp_request_msix_vectors(struct virtio_device 
*vdev, int nvectors,
if (desc) {
flags |= PCI_IRQ_AFFINITY;
desc->pre_vectors++; /* virtio config vector */
+   vdev->use_managed_irq = true;
}
 
err = pci_alloc_irq_vectors_affinity(vp_dev->pci_dev, nvectors,
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index b1894e0323fa..85cc773b2dc7 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -109,6 +109,7 @@ struct virtio_device {
bool failed;
bool config_enabled;
bool config_change_pending;
+   bool use_managed_irq;
spinlock_t config_lock;
struct device dev;
struct virtio_device_id id;
-- 
2.31.1



Re: [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()

2021-05-26 Thread Ming Lei
On Thu, May 20, 2021 at 03:13:05PM +0100, Stefan Hajnoczi wrote:
> Request completion latency can be reduced by using polling instead of
> irqs. Even Posted Interrupts or similar hardware support doesn't beat
> polling. The reason is that disabling virtqueue notifications saves
> critical-path CPU cycles on the host by skipping irq injection and in
> the guest by skipping the irq handler. So let's add blk_mq_ops->poll()
> support to virtio_blk.
> 
> The approach taken by this patch differs from the NVMe driver's
> approach. NVMe dedicates hardware queues to polling and submits
> REQ_HIPRI requests only on those queues. This patch does not require
> exclusive polling queues for virtio_blk. Instead, it switches between
> irqs and polling when one or more REQ_HIPRI requests are in flight on a
> virtqueue.
> 
> This is possible because toggling virtqueue notifications is cheap even
> while the virtqueue is running. NVMe cqs can't do this because irqs are
> only enabled/disabled at queue creation time.
> 
> This toggling approach requires no configuration. There is no need to
> dedicate queues ahead of time or to teach users and orchestration tools
> how to set up polling queues.

This approach looks good, and it is very neat thanks to the per-vq lock.

BTW, is there any virt-exit saved by disabling the vq interrupt? I understand
there isn't, since a virt-exit may only be involved in remote completion
via sending an IPI.

> 
> Possible drawbacks of this approach:
> 
> - Hardware virtio_blk implementations may find virtqueue_disable_cb()
>   expensive since it requires DMA. If such devices become popular then

You mean the hardware needs to consider ordering between DMA completion and
the interrupt notify? But this is about disabling notify: the guest just calls
virtqueue_get_buf() to see if a buffer is available, and if not, the
virtqueue is simply polled again.

>   the virtio_blk driver could use a similar approach to NVMe when
>   VIRTIO_F_ACCESS_PLATFORM is detected in the future.
> 
> - If a blk_poll() thread is descheduled it not only hurts polling
>   performance but also delays completion of non-REQ_HIPRI requests on
>   that virtqueue since vq notifications are disabled.
> 
> Performance:
> 
> - Benchmark: fio ioengine=pvsync2 numjobs=4 direct=1
> - Guest: 4 vCPUs with one virtio-blk device (4 virtqueues)

4 jobs can use up all 4 vCPUs. I just ran a quick fio test with
'ioengine=io_uring --numjobs=1' on a single vq, and IOPS improved
by ~20% (hipri=1 vs hipri=0) with the 3 patches; the virtio-blk device
is still backed by an NVMe SSD.
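
For reference, a minimal sketch of the notify-toggling poll loop discussed
above (virtqueue_disable_cb()/virtqueue_get_buf()/virtqueue_enable_cb() are
the real virtio APIs; complete_request() is a hypothetical stand-in for the
driver's completion handling):

	for (;;) {
		unsigned int len;
		void *buf;

		virtqueue_disable_cb(vq);	/* suppress irq injection */
		while ((buf = virtqueue_get_buf(vq, &len)) != NULL)
			complete_request(buf, len);
		if (virtqueue_enable_cb(vq))
			break;	/* nothing raced in; callbacks re-armed */
		/* a completion slipped in while re-enabling: poll again */
	}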


Thanks, 
Ming



Re: [PATCH 3/3] virtio_blk: implement blk_mq_ops->poll()

2021-05-25 Thread Ming Lei
On Tue, May 25, 2021 at 09:22:48AM +0200, Paolo Bonzini wrote:
> On 24/05/21 16:59, Christoph Hellwig wrote:
> > On Thu, May 20, 2021 at 03:13:05PM +0100, Stefan Hajnoczi wrote:
> > > Possible drawbacks of this approach:
> > > 
> > > - Hardware virtio_blk implementations may find virtqueue_disable_cb()
> > >expensive since it requires DMA. If such devices become popular then
> > >the virtio_blk driver could use a similar approach to NVMe when
> > >VIRTIO_F_ACCESS_PLATFORM is detected in the future.
> > > 
> > > - If a blk_poll() thread is descheduled it not only hurts polling
> > >performance but also delays completion of non-REQ_HIPRI requests on
> > >that virtqueue since vq notifications are disabled.
> > 
> > Yes, I think this is a dangerous configuration.  What argument exists
> > against just using dedicated poll queues?
> 
> There isn't an equivalent of the admin queue in virtio-blk, which would
> allow the guest to configure the desired number of poll queues.  The number
> of queues is fixed.

Dedicated vqs can be used for polling only, and I understand the VM needn't
know whether a vq is polled or IRQ-driven.

I tried that on v5.4, but didn't see an obvious IOPS boost, so I gave up.

https://github.com/ming1/linux/commits/my_v5.4-virtio-irq-poll


Thanks,
Ming



[PATCH V8 04/13] sbitmap: move allocation hint into sbitmap

2021-02-07 Thread Ming Lei
The allocation hint should have belonged to sbitmap in the first place;
also, when the sbitmap's depth is high and there is no need to use
multiple wakeup queues, users can benefit from the percpu allocation
hint too.

So move the allocation hint into sbitmap; then the scsi device queue can
benefit from the allocation hint when converting to a plain sbitmap in
the following patches.

Meanwhile, convert vhost/scsi.c to sbitmap allocation with a percpu
alloc hint, which is more efficient than the previous approach.
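
To make the interface change concrete, a small usage sketch of the API after
this patch (the depth and error handling are illustrative; the two trailing
booleans are the round_robin and alloc_hint parameters added by this series):

	struct sbitmap tags;
	int tag;

	if (sbitmap_init_node(&tags, 1024 /* depth */, -1 /* default shift */,
			      GFP_KERNEL, NUMA_NO_NODE,
			      false /* round_robin */, true /* alloc_hint */))
		return -ENOMEM;

	tag = sbitmap_get(&tags);	/* percpu hint is now handled internally */
	if (tag < 0)
		return -EBUSY;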

Cc: Omar Sandoval 
Cc: Kashyap Desai 
Cc: Sumanesh Samanta 
Cc: Ewan D. Milne 
Cc: Mike Christie 
Cc: virtualization@lists.linux-foundation.org
Reviewed-by: Hannes Reinecke 
Tested-by: Sumanesh Samanta 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c  |   2 +-
 block/kyber-iosched.c   |   2 +-
 drivers/vhost/scsi.c|   4 +-
 include/linux/sbitmap.h |  41 +--
 lib/sbitmap.c   | 112 +++-
 5 files changed, 96 insertions(+), 65 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a2bac31afaff..bcca3e9c04e8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2729,7 +2729,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct 
blk_mq_tag_set *set,
goto free_cpumask;
 
if (sbitmap_init_node(>ctx_map, nr_cpu_ids, ilog2(8),
-   gfp, node, false))
+   gfp, node, false, false))
goto free_ctxs;
hctx->nr_ctx = 0;
 
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index d017c43c0041..d95a2b3393cc 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -480,7 +480,7 @@ static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, 
unsigned int hctx_idx)
for (i = 0; i < KYBER_NUM_DOMAINS; i++) {
if (sbitmap_init_node(>kcq_map[i], hctx->nr_ctx,
  ilog2(8), GFP_KERNEL, hctx->numa_node,
- false)) {
+ false, false)) {
while (--i >= 0)
sbitmap_free(>kcq_map[i]);
goto err_kcqs;
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index ab230f6f79e8..8531751b28dd 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -614,7 +614,7 @@ vhost_scsi_get_cmd(struct vhost_virtqueue *vq, struct 
vhost_scsi_tpg *tpg,
return ERR_PTR(-EIO);
}
 
-   tag = sbitmap_get(>scsi_tags, 0);
+   tag = sbitmap_get(>scsi_tags);
if (tag < 0) {
pr_err("Unable to obtain tag for vhost_scsi_cmd\n");
return ERR_PTR(-ENOMEM);
@@ -1512,7 +1512,7 @@ static int vhost_scsi_setup_vq_cmds(struct 
vhost_virtqueue *vq, int max_cmds)
return 0;
 
if (sbitmap_init_node(>scsi_tags, max_cmds, -1, GFP_KERNEL,
- NUMA_NO_NODE, false))
+ NUMA_NO_NODE, false, true))
return -ENOMEM;
svq->max_cmds = max_cmds;
 
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 734ee6214cd6..247776fcc02c 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -65,6 +65,14 @@ struct sbitmap {
 * @map: Allocated bitmap.
 */
struct sbitmap_word *map;
+
+   /*
+* @alloc_hint: Cache of last successfully allocated or freed bit.
+*
+* This is per-cpu, which allows multiple users to stick to different
+* cachelines until the map is exhausted.
+*/
+   unsigned int __percpu *alloc_hint;
 };
 
 #define SBQ_WAIT_QUEUES 8
@@ -100,14 +108,6 @@ struct sbitmap_queue {
 */
struct sbitmap sb;
 
-   /*
-* @alloc_hint: Cache of last successfully allocated or freed bit.
-*
-* This is per-cpu, which allows multiple users to stick to different
-* cachelines until the map is exhausted.
-*/
-   unsigned int __percpu *alloc_hint;
-
/**
 * @wake_batch: Number of bits which must be freed before we wake up any
 * waiters.
@@ -147,11 +147,13 @@ struct sbitmap_queue {
  * @round_robin: If true, be stricter about allocation order; always allocate
  *   starting from the last allocated bit. This is less efficient
  *   than the default behavior (false).
+ * @alloc_hint: If true, apply percpu hint for where to start searching for
+ *  a free bit.
  *
  * Return: Zero on success or negative errno on failure.
  */
 int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
- gfp_t flags, int node, bool round_robin);
+ gfp_t flags, int node, bool round_robin, bool alloc_hint);
 
 /**
  * sbitmap_free() - Free memory used by a  sbitmap.
@@ -159,6 +161,7 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int 
depth, int shift,
  */
 static inline void sbitmap_f

[PATCH V8 02/13] sbitmap: maintain allocation round_robin in sbitmap

2021-02-07 Thread Ming Lei
Currently the allocation round_robin info is maintained by sbitmap_queue.

Actually, bit allocation belongs to sbitmap, and the following
patch will move alloc_hint into sbitmap for users with high depth.

So move round_robin into sbitmap.

Cc: Omar Sandoval 
Cc: Kashyap Desai 
Cc: Sumanesh Samanta 
Cc: Ewan D. Milne 
Cc: Hannes Reinecke 
Cc: virtualization@lists.linux-foundation.org
Reviewed-by: Hannes Reinecke 
Tested-by: Sumanesh Samanta 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c  |  2 +-
 block/kyber-iosched.c   |  3 ++-
 drivers/vhost/scsi.c|  4 ++--
 include/linux/sbitmap.h | 20 ++--
 lib/sbitmap.c   | 28 ++--
 5 files changed, 29 insertions(+), 28 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f21d922ecfaf..a2bac31afaff 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2729,7 +2729,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct 
blk_mq_tag_set *set,
goto free_cpumask;
 
if (sbitmap_init_node(>ctx_map, nr_cpu_ids, ilog2(8),
-   gfp, node))
+   gfp, node, false))
goto free_ctxs;
hctx->nr_ctx = 0;
 
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index c25c41d0d061..d017c43c0041 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -479,7 +479,8 @@ static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, 
unsigned int hctx_idx)
 
for (i = 0; i < KYBER_NUM_DOMAINS; i++) {
if (sbitmap_init_node(>kcq_map[i], hctx->nr_ctx,
- ilog2(8), GFP_KERNEL, hctx->numa_node)) {
+ ilog2(8), GFP_KERNEL, hctx->numa_node,
+ false)) {
while (--i >= 0)
sbitmap_free(>kcq_map[i]);
goto err_kcqs;
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 4ce9f00ae10e..ab230f6f79e8 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -614,7 +614,7 @@ vhost_scsi_get_cmd(struct vhost_virtqueue *vq, struct 
vhost_scsi_tpg *tpg,
return ERR_PTR(-EIO);
}
 
-   tag = sbitmap_get(>scsi_tags, 0, false);
+   tag = sbitmap_get(>scsi_tags, 0);
if (tag < 0) {
pr_err("Unable to obtain tag for vhost_scsi_cmd\n");
return ERR_PTR(-ENOMEM);
@@ -1512,7 +1512,7 @@ static int vhost_scsi_setup_vq_cmds(struct 
vhost_virtqueue *vq, int max_cmds)
return 0;
 
if (sbitmap_init_node(>scsi_tags, max_cmds, -1, GFP_KERNEL,
- NUMA_NO_NODE))
+ NUMA_NO_NODE, false))
return -ENOMEM;
svq->max_cmds = max_cmds;
 
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 16353fbee765..734ee6214cd6 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -56,6 +56,11 @@ struct sbitmap {
 */
unsigned int map_nr;
 
+   /**
+* @round_robin: Allocate bits in strict round-robin order.
+*/
+   bool round_robin;
+
/**
 * @map: Allocated bitmap.
 */
@@ -124,11 +129,6 @@ struct sbitmap_queue {
 */
atomic_t ws_active;
 
-   /**
-* @round_robin: Allocate bits in strict round-robin order.
-*/
-   bool round_robin;
-
/**
 * @min_shallow_depth: The minimum shallow depth which may be passed to
 * sbitmap_queue_get_shallow() or __sbitmap_queue_get_shallow().
@@ -144,11 +144,14 @@ struct sbitmap_queue {
  * given, a good default is chosen.
  * @flags: Allocation flags.
  * @node: Memory node to allocate on.
+ * @round_robin: If true, be stricter about allocation order; always allocate
+ *   starting from the last allocated bit. This is less efficient
+ *   than the default behavior (false).
  *
  * Return: Zero on success or negative errno on failure.
  */
 int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
- gfp_t flags, int node);
+ gfp_t flags, int node, bool round_robin);
 
 /**
  * sbitmap_free() - Free memory used by a  sbitmap.
@@ -174,15 +177,12 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int 
depth);
  * sbitmap_get() - Try to allocate a free bit from a  sbitmap.
  * @sb: Bitmap to allocate from.
  * @alloc_hint: Hint for where to start searching for a free bit.
- * @round_robin: If true, be stricter about allocation order; always allocate
- *   starting from the last allocated bit. This is less efficient
- *   than the default behavior (false).
  *
  * This operation provides acquire barrier semantics if it succeeds.
  *
  * Return: Non-negative allocated bit number if successful, -1 otherwise.
  */
-int sbitmap_get(struct sbitmap *sb, unsigned int al

[PATCH V7 04/13] sbitmap: move allocation hint into sbitmap

2021-01-21 Thread Ming Lei
The allocation hint should have belonged to sbitmap in the first place;
also, when the sbitmap's depth is high and there is no need to use
multiple wakeup queues, users can benefit from the percpu allocation
hint too.

So move the allocation hint into sbitmap; then the scsi device queue can
benefit from the allocation hint when converting to a plain sbitmap in
the following patches.

Meanwhile, convert vhost/scsi.c to sbitmap allocation with a percpu
alloc hint, which is more efficient than the previous approach.

Cc: Omar Sandoval 
Cc: Kashyap Desai 
Cc: Sumanesh Samanta 
Cc: Ewan D. Milne 
Cc: Mike Christie 
Cc: virtualization@lists.linux-foundation.org
Reviewed-by: Hannes Reinecke 
Tested-by: Sumanesh Samanta 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c  |   2 +-
 block/kyber-iosched.c   |   2 +-
 drivers/vhost/scsi.c|   4 +-
 include/linux/sbitmap.h |  41 +--
 lib/sbitmap.c   | 112 +++-
 5 files changed, 96 insertions(+), 65 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index c5eead257568..ef1a9f2003a0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2676,7 +2676,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct 
blk_mq_tag_set *set,
goto free_cpumask;
 
if (sbitmap_init_node(>ctx_map, nr_cpu_ids, ilog2(8),
-   gfp, node, false))
+   gfp, node, false, false))
goto free_ctxs;
hctx->nr_ctx = 0;
 
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index cc8bcfe1d587..3949d68ac4c1 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -480,7 +480,7 @@ static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, 
unsigned int hctx_idx)
for (i = 0; i < KYBER_NUM_DOMAINS; i++) {
if (sbitmap_init_node(>kcq_map[i], hctx->nr_ctx,
  ilog2(8), GFP_KERNEL, hctx->numa_node,
- false)) {
+ false, false)) {
while (--i >= 0)
sbitmap_free(>kcq_map[i]);
goto err_kcqs;
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index ab230f6f79e8..8531751b28dd 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -614,7 +614,7 @@ vhost_scsi_get_cmd(struct vhost_virtqueue *vq, struct 
vhost_scsi_tpg *tpg,
return ERR_PTR(-EIO);
}
 
-   tag = sbitmap_get(>scsi_tags, 0);
+   tag = sbitmap_get(>scsi_tags);
if (tag < 0) {
pr_err("Unable to obtain tag for vhost_scsi_cmd\n");
return ERR_PTR(-ENOMEM);
@@ -1512,7 +1512,7 @@ static int vhost_scsi_setup_vq_cmds(struct 
vhost_virtqueue *vq, int max_cmds)
return 0;
 
if (sbitmap_init_node(>scsi_tags, max_cmds, -1, GFP_KERNEL,
- NUMA_NO_NODE, false))
+ NUMA_NO_NODE, false, true))
return -ENOMEM;
svq->max_cmds = max_cmds;
 
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 734ee6214cd6..247776fcc02c 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -65,6 +65,14 @@ struct sbitmap {
 * @map: Allocated bitmap.
 */
struct sbitmap_word *map;
+
+   /*
+* @alloc_hint: Cache of last successfully allocated or freed bit.
+*
+* This is per-cpu, which allows multiple users to stick to different
+* cachelines until the map is exhausted.
+*/
+   unsigned int __percpu *alloc_hint;
 };
 
 #define SBQ_WAIT_QUEUES 8
@@ -100,14 +108,6 @@ struct sbitmap_queue {
 */
struct sbitmap sb;
 
-   /*
-* @alloc_hint: Cache of last successfully allocated or freed bit.
-*
-* This is per-cpu, which allows multiple users to stick to different
-* cachelines until the map is exhausted.
-*/
-   unsigned int __percpu *alloc_hint;
-
/**
 * @wake_batch: Number of bits which must be freed before we wake up any
 * waiters.
@@ -147,11 +147,13 @@ struct sbitmap_queue {
  * @round_robin: If true, be stricter about allocation order; always allocate
  *   starting from the last allocated bit. This is less efficient
  *   than the default behavior (false).
+ * @alloc_hint: If true, apply percpu hint for where to start searching for
+ *  a free bit.
  *
  * Return: Zero on success or negative errno on failure.
  */
 int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
- gfp_t flags, int node, bool round_robin);
+ gfp_t flags, int node, bool round_robin, bool alloc_hint);
 
 /**
  * sbitmap_free() - Free memory used by a  sbitmap.
@@ -159,6 +161,7 @@ int sbitmap_init_node(struct sbitmap *sb, unsigned int 
depth, int shift,
  */
 static inline void sbitmap_f

[PATCH V7 02/13] sbitmap: maintain allocation round_robin in sbitmap

2021-01-21 Thread Ming Lei
Currently the allocation round_robin info is maintained by sbitmap_queue.

Actually, bit allocation belongs to sbitmap, and the following
patch will move alloc_hint into sbitmap for users with high depth.

So move round_robin into sbitmap.

Cc: Omar Sandoval 
Cc: Kashyap Desai 
Cc: Sumanesh Samanta 
Cc: Ewan D. Milne 
Cc: Hannes Reinecke 
Cc: virtualization@lists.linux-foundation.org
Reviewed-by: Hannes Reinecke 
Tested-by: Sumanesh Samanta 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c  |  2 +-
 block/kyber-iosched.c   |  3 ++-
 drivers/vhost/scsi.c|  4 ++--
 include/linux/sbitmap.h | 20 ++--
 lib/sbitmap.c   | 28 ++--
 5 files changed, 29 insertions(+), 28 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f285a9123a8b..c5eead257568 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2676,7 +2676,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct 
blk_mq_tag_set *set,
goto free_cpumask;
 
if (sbitmap_init_node(>ctx_map, nr_cpu_ids, ilog2(8),
-   gfp, node))
+   gfp, node, false))
goto free_ctxs;
hctx->nr_ctx = 0;
 
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index dc89199bc8c6..cc8bcfe1d587 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -479,7 +479,8 @@ static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, 
unsigned int hctx_idx)
 
for (i = 0; i < KYBER_NUM_DOMAINS; i++) {
if (sbitmap_init_node(>kcq_map[i], hctx->nr_ctx,
- ilog2(8), GFP_KERNEL, hctx->numa_node)) {
+ ilog2(8), GFP_KERNEL, hctx->numa_node,
+ false)) {
while (--i >= 0)
sbitmap_free(>kcq_map[i]);
goto err_kcqs;
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 4ce9f00ae10e..ab230f6f79e8 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -614,7 +614,7 @@ vhost_scsi_get_cmd(struct vhost_virtqueue *vq, struct 
vhost_scsi_tpg *tpg,
return ERR_PTR(-EIO);
}
 
-   tag = sbitmap_get(>scsi_tags, 0, false);
+   tag = sbitmap_get(>scsi_tags, 0);
if (tag < 0) {
pr_err("Unable to obtain tag for vhost_scsi_cmd\n");
return ERR_PTR(-ENOMEM);
@@ -1512,7 +1512,7 @@ static int vhost_scsi_setup_vq_cmds(struct 
vhost_virtqueue *vq, int max_cmds)
return 0;
 
if (sbitmap_init_node(>scsi_tags, max_cmds, -1, GFP_KERNEL,
- NUMA_NO_NODE))
+ NUMA_NO_NODE, false))
return -ENOMEM;
svq->max_cmds = max_cmds;
 
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index 16353fbee765..734ee6214cd6 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -56,6 +56,11 @@ struct sbitmap {
 */
unsigned int map_nr;
 
+   /**
+* @round_robin: Allocate bits in strict round-robin order.
+*/
+   bool round_robin;
+
/**
 * @map: Allocated bitmap.
 */
@@ -124,11 +129,6 @@ struct sbitmap_queue {
 */
atomic_t ws_active;
 
-   /**
-* @round_robin: Allocate bits in strict round-robin order.
-*/
-   bool round_robin;
-
/**
 * @min_shallow_depth: The minimum shallow depth which may be passed to
 * sbitmap_queue_get_shallow() or __sbitmap_queue_get_shallow().
@@ -144,11 +144,14 @@ struct sbitmap_queue {
  * given, a good default is chosen.
  * @flags: Allocation flags.
  * @node: Memory node to allocate on.
+ * @round_robin: If true, be stricter about allocation order; always allocate
+ *   starting from the last allocated bit. This is less efficient
+ *   than the default behavior (false).
  *
  * Return: Zero on success or negative errno on failure.
  */
 int sbitmap_init_node(struct sbitmap *sb, unsigned int depth, int shift,
- gfp_t flags, int node);
+ gfp_t flags, int node, bool round_robin);
 
 /**
  * sbitmap_free() - Free memory used by a  sbitmap.
@@ -174,15 +177,12 @@ void sbitmap_resize(struct sbitmap *sb, unsigned int 
depth);
  * sbitmap_get() - Try to allocate a free bit from a  sbitmap.
  * @sb: Bitmap to allocate from.
  * @alloc_hint: Hint for where to start searching for a free bit.
- * @round_robin: If true, be stricter about allocation order; always allocate
- *   starting from the last allocated bit. This is less efficient
- *   than the default behavior (false).
  *
  * This operation provides acquire barrier semantics if it succeeds.
  *
  * Return: Non-negative allocated bit number if successful, -1 otherwise.
  */
-int sbitmap_get(struct sbitmap *sb, unsigned int al

Re: [PATCH] drivers\block: Use kobj_to_dev() API

2020-08-04 Thread Ming Lei
On Fri, Jun 12, 2020 at 03:10:56PM +0800, Wang Qing wrote:
> Use kobj_to_dev() API instead of container_of().
> 
> Signed-off-by: Wang Qing 
> ---
>  drivers/block/virtio_blk.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>  mode change 100644 => 100755 drivers/block/virtio_blk.c
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 9d21bf0..c808405
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -630,7 +630,7 @@ static struct attribute *virtblk_attrs[] = {
>  static umode_t virtblk_attrs_are_visible(struct kobject *kobj,
>   struct attribute *a, int n)
>  {
> - struct device *dev = container_of(kobj, struct device, kobj);
> + struct device *dev = kobj_to_dev(kobj);
>   struct gendisk *disk = dev_to_disk(dev);
>   struct virtio_blk *vblk = disk->private_data;
>   struct virtio_device *vdev = vblk->vdev;
> -- 
> 2.7.4
> 

Reviewed-by: Ming Lei 

-- 
Ming



Re: [PATCH] virtio-blk: check host supplied logical block size

2020-07-15 Thread Ming Lei
On Wed, Jul 15, 2020 at 5:55 PM Maxim Levitsky  wrote:
>
> The Linux kernel only supports logical block sizes which are powers of two,
> at least 512 bytes, and no more than PAGE_SIZE.
>
> Check this instead of crashing later on.
>
> Note that there is no need to check the physical block size, since it is
> only a hint, and virtio-blk already only supports power-of-two values.
>
> Bugzilla link: https://bugzilla.redhat.com/show_bug.cgi?id=1664619
>
> Signed-off-by: Maxim Levitsky 
> ---
>  drivers/block/virtio_blk.c | 20 ++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 980df853ee497..36dda31cc4e96 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -681,6 +681,12 @@ static const struct blk_mq_ops virtio_mq_ops = {
>  static unsigned int virtblk_queue_depth;
>  module_param_named(queue_depth, virtblk_queue_depth, uint, 0444);
>
> +
> +static bool virtblk_valid_block_size(unsigned int blksize)
> +{
> +   return blksize >= 512 && blksize <= PAGE_SIZE && is_power_of_2(blksize);
> +}
> +
>  static int virtblk_probe(struct virtio_device *vdev)
>  {
> struct virtio_blk *vblk;
> @@ -809,9 +815,16 @@ static int virtblk_probe(struct virtio_device *vdev)
> err = virtio_cread_feature(vdev, VIRTIO_BLK_F_BLK_SIZE,
>struct virtio_blk_config, blk_size,
>_size);
> -   if (!err)
> +   if (!err) {
> +   if (!virtblk_valid_block_size(blk_size)) {
> +   dev_err(&vdev->dev,
> +   "%s failure: unsupported logical block size %d\n",
> +   __func__, blk_size);
> +   err = -EINVAL;
> +   goto out_cleanup_queue;
> +   }
> blk_queue_logical_block_size(q, blk_size);
> -   else
> +   } else
> blk_size = queue_logical_block_size(q);
>
> /* Use topology information if available */
> @@ -872,6 +885,9 @@ static int virtblk_probe(struct virtio_device *vdev)
> device_add_disk(>dev, vblk->disk, virtblk_attr_groups);
> return 0;
>
> +out_cleanup_queue:
> +   blk_cleanup_queue(vblk->disk->queue);
> +   vblk->disk->queue = NULL;
>  out_free_tags:
> blk_mq_free_tag_set(>tag_set);
>  out_put_disk:
> --
> 2.26.2
>

Looks fine,

Reviewed-by: Ming Lei 

-- 
Ming Lei


Re: [PATCH] virtio-blk: free vblk-vqs in error path of virtblk_probe()

2020-06-30 Thread Ming Lei
On Mon, Jun 15, 2020 at 12:14:59PM +0800, Hou Tao wrote:
> Otherwise there will be a memory leak if alloc_disk() fails.
> 
> Fixes: 6a27b656fc02 ("block: virtio-blk: support multi virt queues per 
> virtio-blk device")
> Signed-off-by: Hou Tao 
> ---
>  drivers/block/virtio_blk.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 9d21bf0f155e..980df853ee49 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -878,6 +878,7 @@ static int virtblk_probe(struct virtio_device *vdev)
>   put_disk(vblk->disk);
>  out_free_vq:
>   vdev->config->del_vqs(vdev);
> + kfree(vblk->vqs);
>  out_free_vblk:
>   kfree(vblk);
>  out_free_index:
> -- 
> 2.25.0.4.g0ad7144999
> 

Reviewed-by: Ming Lei 

-- 
Ming



Re: [PATCH RFC v2 12/24] hpsa: use reserved commands

2020-03-11 Thread Ming Lei
On Wed, Mar 11, 2020 at 12:25:38AM +0800, John Garry wrote:
> From: Hannes Reinecke 
> 
> Enable the use of reserved commands, and drop the hand-crafted
> command allocation.
> 
> Signed-off-by: Hannes Reinecke 
> ---
>  drivers/scsi/hpsa.c | 147 ++--
>  drivers/scsi/hpsa.h |   1 -
>  2 files changed, 45 insertions(+), 103 deletions(-)
> 
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 703f824584fe..c14dd4b6e598 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -244,10 +244,6 @@ static struct hpsa_scsi_dev_t
>   *hpsa_find_device_by_sas_rphy(struct ctlr_info *h,
>   struct sas_rphy *rphy);
>  
> -#define SCSI_CMD_BUSY ((struct scsi_cmnd *)&hpsa_cmd_busy)
> -static const struct scsi_cmnd hpsa_cmd_busy;
> -#define SCSI_CMD_IDLE ((struct scsi_cmnd *)&hpsa_cmd_idle)
> -static const struct scsi_cmnd hpsa_cmd_idle;
>  static int number_of_controllers;
>  
>  static irqreturn_t do_hpsa_intr_intx(int irq, void *dev_id);
> @@ -342,7 +338,7 @@ static inline struct ctlr_info *shost_to_hba(struct 
> Scsi_Host *sh)
>  
>  static inline bool hpsa_is_cmd_idle(struct CommandList *c)
>  {
> - return c->scsi_cmd == SCSI_CMD_IDLE;
> + return c->scsi_cmd == NULL;
>  }
>  
>  /* extract sense key, asc, and ascq from sense data.  -1 means invalid. */
> @@ -2445,7 +2441,12 @@ static void hpsa_cmd_resolve_events(struct ctlr_info 
> *h,
>* this command has completed.  Then, check to see if the handler is
>* waiting for this command, and, if so, wake it.
>*/
> - c->scsi_cmd = SCSI_CMD_IDLE;
> + if (c->scsi_cmd && c->cmd_type == CMD_IOCTL_PEND) {
> + struct scsi_cmnd *scmd = c->scsi_cmd;
> +
> + scsi_put_reserved_cmd(scmd);
> + }
> + c->scsi_cmd = NULL;
>   mb();   /* Declare command idle before checking for pending events. */
>   if (dev) {
>   atomic_dec(>commands_outstanding);
> @@ -5502,7 +5503,6 @@ static void hpsa_cmd_init(struct ctlr_info *h, int 
> index,
>   c->ErrDesc.Addr = cpu_to_le64((u64) err_dma_handle);
>   c->ErrDesc.Len = cpu_to_le32((u32) sizeof(*c->err_info));
>   c->h = h;
> - c->scsi_cmd = SCSI_CMD_IDLE;
>  }
>  
>  static void hpsa_preinitialize_commands(struct ctlr_info *h)
> @@ -5803,6 +5803,7 @@ static int hpsa_scsi_host_alloc(struct ctlr_info *h)
>   sh->max_lun = HPSA_MAX_LUN;
>   sh->max_id = HPSA_MAX_LUN;
>   sh->can_queue = h->nr_cmds - HPSA_NRESERVED_CMDS;
> + sh->nr_reserved_cmds = HPSA_NRESERVED_CMDS;

Now that .nr_reserved_cmds is passed to blk-mq, you need to increase
sh->can_queue to h->nr_cmds, because .can_queue is the whole queue depth
(including the reserved tags); otherwise the number of IO tags is
decreased.

I haven't looked into other drivers, but I guess they need such a change too.
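
To spell out the relationship (field names follow the patch under review;
HPSA_NRESERVED_CMDS is hpsa's existing constant):

	sh->can_queue = h->nr_cmds;		/* whole depth: IO + reserved */
	sh->nr_reserved_cmds = HPSA_NRESERVED_CMDS;

	/* blk-mq then sees:
	 *	tag_set.queue_depth   = shost->can_queue
	 *	tag_set.reserved_tags = shost->nr_reserved_cmds
	 * leaving can_queue - nr_reserved_cmds tags for normal IO, the
	 * same number the driver had before the conversion.
	 */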

Thanks,
Ming



Re: [PATCH RFC v2 01/24] scsi: add 'nr_reserved_cmds' field to the SCSI host template

2020-03-11 Thread Ming Lei
On Wed, Mar 11, 2020 at 07:55:46AM +0100, Hannes Reinecke wrote:
> On 3/11/20 12:08 AM, Ming Lei wrote:
> > On Wed, Mar 11, 2020 at 12:25:27AM +0800, John Garry wrote:
> >> From: Hannes Reinecke 
> >>
> >> Add a new field 'nr_reserved_cmds' to the SCSI host template to
> >> instruct the block layer to set aside a tag space for reserved
> >> commands.
> >>
> >> Signed-off-by: Hannes Reinecke 
> >> ---
> >>  drivers/scsi/scsi_lib.c  | 1 +
> >>  include/scsi/scsi_host.h | 6 ++
> >>  2 files changed, 7 insertions(+)
> >>
> >> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> >> index 610ee41fa54c..2967325df7a0 100644
> >> --- a/drivers/scsi/scsi_lib.c
> >> +++ b/drivers/scsi/scsi_lib.c
> >> @@ -1896,6 +1896,7 @@ int scsi_mq_setup_tags(struct Scsi_Host *shost)
> >>  shost->tag_set.ops = &scsi_mq_ops_no_commit;
> >>shost->tag_set.nr_hw_queues = shost->nr_hw_queues ? : 1;
> >>shost->tag_set.queue_depth = shost->can_queue;
> >> +  shost->tag_set.reserved_tags = shost->nr_reserved_cmds;
> > 
> > You reserve tags for special usage while the whole queue depth
> > isn't increased; that means the tags available for IO requests are
> > decreased, given that reserved tags can't be used for IO, so IO
> > performance may be affected.
> > 
> > If that is not the case, please explain a bit in the commit log.
> > 
> The overall idea of this patchset is to fold _existing_ management
> command handling into using the blk-mq bitmap.
> Currently, quite a lot of drivers are using management commands
> internally, which typically use the same hardware tag pool (ie they are
> being allocated from the same hardware resources) as the 'normal' I/O
> commands.
> But as they are using the same tag pool, these drivers already decrement
> the available number of commands; check e.g. hpsa:
> 
> static int hpsa_scsi_host_alloc(struct ctlr_info *h)
> {
>   struct Scsi_Host *sh;
> 
>   sh = scsi_host_alloc(&hpsa_driver_template, sizeof(h));
>   if (sh == NULL) {
>   dev_err(&h->pdev->dev, "scsi_host_alloc failed\n");
>   return -ENOMEM;
>   }
> 
>   sh->io_port = 0;
>   sh->n_io_port = 0;
>   sh->this_id = -1;
>   sh->max_channel = 3;
>   sh->max_cmd_len = MAX_COMMAND_SIZE;
>   sh->max_lun = HPSA_MAX_LUN;
>   sh->max_id = HPSA_MAX_LUN;
>   sh->can_queue = h->nr_cmds - HPSA_NRESERVED_CMDS;
>   sh->cmd_per_lun = sh->can_queue;
> 
> So the idea of this patchset is to convert existing use-cases; seeing
> that they already reserve commands internally, performance will not be
> affected.
> 

OK, got it.

I'd suggest adding a comment about this idea in the commit log, since
it is one fundamental aspect of this change.

Thanks,
Ming



Re: [PATCH RFC v2 01/24] scsi: add 'nr_reserved_cmds' field to the SCSI host template

2020-03-10 Thread Ming Lei
On Wed, Mar 11, 2020 at 12:25:27AM +0800, John Garry wrote:
> From: Hannes Reinecke 
> 
> Add a new field 'nr_reserved_cmds' to the SCSI host template to
> instruct the block layer to set aside a tag space for reserved
> commands.
> 
> Signed-off-by: Hannes Reinecke 
> ---
>  drivers/scsi/scsi_lib.c  | 1 +
>  include/scsi/scsi_host.h | 6 ++
>  2 files changed, 7 insertions(+)
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 610ee41fa54c..2967325df7a0 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1896,6 +1896,7 @@ int scsi_mq_setup_tags(struct Scsi_Host *shost)
>   shost->tag_set.ops = &scsi_mq_ops_no_commit;
>   shost->tag_set.nr_hw_queues = shost->nr_hw_queues ? : 1;
>   shost->tag_set.queue_depth = shost->can_queue;
> + shost->tag_set.reserved_tags = shost->nr_reserved_cmds;

You reserve tags for special usage while the whole queue depth
isn't increased; that means the tags available for IO requests are
decreased, given that reserved tags can't be used for IO, so IO
performance may be affected.

If that is not the case, please explain a bit in the commit log.

Thanks,
Ming



Re: [PATCH 1/2] virtio-blk: fix hw_queue stopped on arbitrary error

2020-02-18 Thread Ming Lei
On Tue, Feb 18, 2020 at 8:35 PM Halil Pasic  wrote:
>
> On Tue, 18 Feb 2020 10:21:18 +0800
> Ming Lei  wrote:
>
> > On Thu, Feb 13, 2020 at 8:38 PM Halil Pasic  wrote:
> > >
> > > Since nobody else is going to restart our hw_queue for us, the
> > > blk_mq_start_stopped_hw_queues() in virtblk_done() is not
> > > necessarily sufficient to ensure that the queue will get started again.
> > > In case of a global resource outage (-ENOMEM because of a mapping
> > > failure, because swiotlb is full) our virtqueue may be empty and we
> > > can get stuck with a stopped hw_queue.
> > >
> > > Let us not stop the queue on arbitrary errors, but only on -ENOSPC,
> > > which indicates a full virtqueue, where the hw_queue is guaranteed to
> > > get started by virtblk_done() before it makes sense to carry on
> > > submitting requests. Let us also remove a stale comment.
> >
> > The generic solution may be to stop the queue only when there is an
> > in-flight request that hasn't completed.
> >
>
> I think this is pretty close to that. The queue is stopped only on
> ENOSPC, which means the virtqueue is full.
>
> > Checking -ENOMEM may not be enough, given that -EIO can be returned
> > from virtqueue_add() too, in case of a dma map failure.
>
> I'm not checking on -ENOMEM, so the queue would not be stopped on -EIO.
> Maybe I'm misunderstanding something. In any case, please have another
> look at the diff, and if your concerns persist please help me understand.

Looks like I misread the patch; this patch is fine:

Reviewed-by: Ming Lei 


Thanks,
Ming Lei


Re: [PATCH 1/2] virtio-blk: fix hw_queue stopped on arbitrary error

2020-02-17 Thread Ming Lei
On Thu, Feb 13, 2020 at 8:38 PM Halil Pasic  wrote:
>
> Since nobody else is going to restart our hw_queue for us, the
> blk_mq_start_stopped_hw_queues() in virtblk_done() is not necessarily
> sufficient to ensure that the queue will get started again.
> In case of a global resource outage (-ENOMEM because of a mapping
> failure, because swiotlb is full) our virtqueue may be empty and we can
> get stuck with a stopped hw_queue.
>
> Let us not stop the queue on arbitrary errors, but only on -ENOSPC, which
> indicates a full virtqueue, where the hw_queue is guaranteed to get
> started by virtblk_done() before it makes sense to carry on
> submitting requests. Let us also remove a stale comment.

The generic solution may be to stop the queue only when there is an
in-flight request that hasn't completed.

Checking -ENOMEM may not be enough, given that -EIO can be returned from
virtqueue_add() too, in case of a dma map failure.
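
To illustrate the distinction, a hedged sketch of the queue_rq error
handling being discussed (not the exact patch; virtblk_add_req()'s real
signature differs, and the BLK_STS_* mapping is just one plausible choice):

	err = virtblk_add_req(vq, vbr);		/* simplified signature */
	if (unlikely(err)) {
		switch (err) {
		case -ENOSPC:
			/* vq full: a later completion restarts the queue */
			blk_mq_stop_hw_queue(hctx);
			return BLK_STS_DEV_RESOURCE;
		case -ENOMEM:
			/* transient global shortage: retry, keep the queue running */
			return BLK_STS_RESOURCE;
		default:
			/* e.g. -EIO from a dma mapping failure */
			return BLK_STS_IOERR;
		}
	}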

Thanks,


Re: [RFC PATCH 2/2] block: virtio-blk: support multi virt queues per virtio-blk device

2015-12-14 Thread Ming Lei
Hi Paolo,

On Mon, Dec 14, 2015 at 6:31 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>
>
> On 18/06/2014 06:04, Ming Lei wrote:
>> For virtio-blk, I don't think it is always better to take more queues; we
>> need to take the following things on the host side into account:
>>
>> - the host storage's top performance: generally it is reached with more
>> than 1 job with libaio (suppose it is N jobs, so basically we can use N
>> iothreads per device in qemu to try to get top performance)
>>
>> - the iothreads' load (if the iothreads are at full load, increasing
>> queues doesn't help at all)
>>
>> In my test, I only used the current per-device iothread (x-dataplane)
>> in qemu to handle the two vqs' notifications and process all I/O from
>> the two vqs, and it looks like it can improve IOPS by ~30%.
>>
>> For virtio-scsi, the current usage doesn't make full use of blk-mq's
>> advantage either, because only one vq is active at a time, so I
>> guess the benefit of multiple vqs won't be very big. I'd like to post
>> patches to support that first, then provide test data with
>> more queues (8, 16).
>
> Hi Ming Lei,
>
> would you like to repost these patches now that MQ support is in the kernel?
>
> Also, I changed my mind about moving linux-aio to AioContext.  I now
> think it's a good idea, because it limits the number of io_getevents
> syscalls. O:-)  So I would be happy to review your patches for that as well.

OK, I'll try to figure out a new version; it might take a while since it is
close to the festival season. :-)

Thanks,


Re: kernel BUG at drivers/block/virtio_blk.c:172!

2014-11-11 Thread Ming Lei
On Tue, Nov 11, 2014 at 11:42 PM, Dongsu Park
dongsu.p...@profitbricks.com wrote:
 Hi Ming,

 On 11.11.2014 08:56, Ming Lei wrote:
 On Tue, Nov 11, 2014 at 7:31 AM, Jens Axboe ax...@kernel.dk wrote:
  Known, I'm afraid, Ming is looking into it.

 Actually I had also tried to reproduce this bug, without success.
 But today I happened to find out how to trigger the bug, by coincidence,
 while testing other things.

 Try to run xfstests/generic/034. You'll see the crash immediately.
 Tested on a QEMU VM with kernel 3.18-rc4, virtio-blk, dm-flakey and xfs.

 There is one obvious bug which should have been fixed by the patch
 below (0001-block-blk-merge-fix-blk_recount_segments.patch):

 http://marc.info/?l=linux-virtualization&m=141562191719405&q=p3

 This patch didn't change anything for me, as Lukas said.

 And there might be another one. I'd appreciate it if someone could post
 the log printed by the patch (blk-seg.patch) in the link below, if the
 bug can still be triggered even with the above fix:

 http://marc.info/?l=linux-virtualization&m=141473040618467&q=p3

 blk_recount_segments: 1-1-1 vcnt-128 segs-128

 As far as I understood so far, the reason is that bi_phys_segments sometimes
 gets bigger than queue_max_segments() after blk_recount_segments().
 That happens no matter whether segments are recalculated or not.

I know the problem now: bi_vcnt can't be used for a cloned bio, so the
patch I sent last time is wrong too.

 I'm not completely sure about what to do, but how about the attached patch?
 It seems to work, according to several xfstests runs.

 Cheers,
 Dongsu

 

 From 1db98323931eb9ab430116c4d909d2c16e22 Mon Sep 17 00:00:00 2001
 From: Dongsu Park dongsu.p...@profitbricks.com
 Date: Tue, 11 Nov 2014 13:10:59 +0100
 Subject: [RFC PATCH] blk-merge: make bi_phys_segments consider also
  queue_max_segments()

 When recounting the number of physical segments, the maximum number of
 segments of the request_queue must also be taken into account.
 Otherwise bio->bi_phys_segments could get bigger than
 queue_max_segments(). This then results in virtio_queue_rq() seeing a
 req->nr_phys_segments that is greater than expected. Although the
 initial queue_max_segments was set to (vblk->sg_elems - 2), a request
 comes in with a larger value of nr_phys_segments, which triggers the
 BUG_ON() condition.

 This commit should fix a kernel crash in virtio_blk, which occurs
 especially frequently when it runs with blk-mq, device mapper, and xfs.
 The simplest way to reproduce this bug is to run xfstests/generic/034.
 Note, test 034 requires dm-flakey to be turned on in the kernel config.

 See the kernel trace below:
 [ cut here ]
 kernel BUG at drivers/block/virtio_blk.c:172!
 invalid opcode:  [#1] SMP
 CPU: 1 PID: 3343 Comm: mount Not tainted 3.18.0-rc4+ #55
 RIP: 0010:[81561027]
  [81561027] virtio_queue_rq+0x277/0x280
 Call Trace:
  [8142e908] __blk_mq_run_hw_queue+0x1a8/0x300
  [8142f00d] blk_mq_run_hw_queue+0x6d/0x90
  [8143003e] blk_sq_make_request+0x23e/0x360
  [81422e20] generic_make_request+0xc0/0x110
  [81422ed9] submit_bio+0x69/0x130
  [812f013d] _xfs_buf_ioapply+0x2bd/0x410
  [81315f38] ? xlog_bread_noalign+0xa8/0xe0
  [812f1bd1] xfs_buf_submit_wait+0x61/0x1d0
  [81315f38] xlog_bread_noalign+0xa8/0xe0
  [81316917] xlog_bread+0x27/0x60
  [8131ad11] xlog_find_verify_cycle+0xe1/0x190
  [8131b291] xlog_find_head+0x2d1/0x3c0
  [8131b3ad] xlog_find_tail+0x2d/0x3f0
  [8131b78e] xlog_recover+0x1e/0xf0
  [8130fbac] xfs_log_mount+0x24c/0x2c0
  [813075db] xfs_mountfs+0x44b/0x7a0
  [8130a98a] xfs_fs_fill_super+0x2ba/0x330
  [811cea64] mount_bdev+0x194/0x1d0
  [8130a6d0] ? xfs_parseargs+0xbe0/0xbe0
  [813089a5] xfs_fs_mount+0x15/0x20
  [811cf389] mount_fs+0x39/0x1b0
  [8117bf75] ? __alloc_percpu+0x15/0x20
  [811e9887] vfs_kern_mount+0x67/0x110
  [811ec584] do_mount+0x204/0xad0
  [811ed18b] SyS_mount+0x8b/0xe0
  [81788e12] system_call_fastpath+0x12/0x17
 RIP [81561027] virtio_queue_rq+0x277/0x280
 ---[ end trace ae3ec6426f011b5d ]---

 Signed-off-by: Dongsu Park dongsu.p...@profitbricks.com
 Tested-by: Dongsu Park dongsu.p...@profitbricks.com
 Cc: Ming Lei tom.leim...@gmail.com
 Cc: Jens Axboe ax...@kernel.dk
 Cc: Rusty Russell ru...@rustcorp.com.au
 Cc: Jeff Layton jlay...@poochiereds.net
 Cc: Dave Chinner da...@fromorbit.com
 Cc: Michael S. Tsirkin m...@redhat.com
 Cc: Lukas Czerner lczer...@redhat.com
 Cc: Christoph Hellwig h...@lst.de
 Cc: virtualization@lists.linux-foundation.org
 ---
  block/blk-merge.c | 9 ++---
  1 file changed, 6 insertions(+), 3 deletions(-)

 diff --git a/block/blk-merge.c b/block/blk-merge.c
 index b3ac40a..d808601 100644
 --- a/block/blk-merge.c
 +++ b/block/blk-merge.c
 @@ -103,13 +103,16 @@ void blk_recount_segments(struct request_queue *q, struct

Re: Re: kernel BUG at drivers/block/virtio_blk.c:172

2014-11-10 Thread Ming Lei
On Mon, Nov 10, 2014 at 7:59 PM, Lukáš Czerner lczer...@redhat.com wrote:

 Hi,

 so I encountered it again on 3.17.0-rc4. This output is from the
 run with your patch.

 I am using libvirt (virt-manager) to configure and run the virtual
 machine, but looking at the xml, I do not think it's passing
 'scsi=off' at all.

 Btw, that xfs file system is a root file system.


 [3.667553] blk_recount_segments: 1-0-1 vcnt-0 segs-128
 [3.668692] blk_recount_segments: 1-0-1 vcnt-0 segs-128
 [3.669897] blk_recount_segments: 1-0-1 vcnt-0 segs-128
 [3.671083] blk_recount_segments: 1-0-1 vcnt-0 segs-128

Hmm, I should have used bi_phys_segments to decide whether a merge is
needed; the attached patch should fix the problem.


Thanks,
Ming Lei
From bcb4f411c23e114f7c77c0858d7f78f50fda68e9 Mon Sep 17 00:00:00 2001
From: Ming Lei tom.leim...@gmail.com
Date: Mon, 10 Nov 2014 20:10:24 +0800
Subject: [PATCH] block/blk-merge: fix blk_recount_segments

bio->bi_phys_segments should be used for deciding if a merge is needed,
instead of bio->bi_vcnt, which isn't valid for a cloned bio.

Signed-off-by: Ming Lei tom.leim...@gmail.com
---
 block/blk-merge.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index b3ac40a..3387fd6 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -99,7 +99,7 @@ void blk_recount_segments(struct request_queue *q, struct bio *bio)
 {
 	bool no_sg_merge = !!test_bit(QUEUE_FLAG_NO_SG_MERGE,
 			&q->queue_flags);
-	bool merge_not_need = bio->bi_vcnt < queue_max_segments(q);
+	bool merge_not_need = bio->bi_phys_segments < queue_max_segments(q);
 
 	if (no_sg_merge && !bio_flagged(bio, BIO_CLONED) &&
 			merge_not_need)
-- 
1.7.9.5


Re: Re: kernel BUG at drivers/block/virtio_blk.c:172

2014-11-10 Thread Ming Lei
On Mon, Nov 10, 2014 at 9:05 PM, Lukáš Czerner lczer...@redhat.com wrote:
 On Mon, 10 Nov 2014, Ming Lei wrote:

 Date: Mon, 10 Nov 2014 20:18:32 +0800
 From: Ming Lei tom.leim...@gmail.com
 To: Lukáš Czerner lczer...@redhat.com
 Cc: Jens Axboe ax...@kernel.dk,
 Linux Virtualization virtualization@lists.linux-foundation.org,
 Christoph Hellwig h...@lst.de
 Subject: Re: Re: kernel BUG at drivers/block/virtio_blk.c:172

 On Mon, Nov 10, 2014 at 7:59 PM, Lukáš Czerner lczer...@redhat.com wrote:
 
  Hi,
 
  so I encountered it again on 3.17.0-rc4. This output is from the
  run with your patch.
 
  I am using libvirt (virt-manager) to configure and run the virtual
  machine, but looking at the xml, I do not think it's passing
  'scsi=off' at all.
 
  Btw, that xfs file system is a root file system.
 
 
  [3.667553] blk_recount_segments: 1-0-1 vcnt-0 segs-128
  [3.668692] blk_recount_segments: 1-0-1 vcnt-0 segs-128
  [3.669897] blk_recount_segments: 1-0-1 vcnt-0 segs-128
  [3.671083] blk_recount_segments: 1-0-1 vcnt-0 segs-128

  Hmm, I should have used bi_phys_segments to decide whether a merge is
  needed; the attached patch should fix the problem.

 Thanks for the patch, unfortunately it does not fix the issue for
 me. I am willing to try something else though :)

Care to post the log (like "blk_recount_segments: 1-0-1 vcnt-0 segs-128")
after applying the patch in my last reply?

Thanks,
-- 
Ming Lei

Re: kernel BUG at drivers/block/virtio_blk.c:172!

2014-11-10 Thread Ming Lei
On Tue, Nov 11, 2014 at 7:31 AM, Jens Axboe ax...@kernel.dk wrote:
 On 2014-11-10 02:59, Rusty Russell wrote:

 Jeff Layton jlay...@poochiereds.net writes:

 In the latest Fedora rawhide kernel in the repos, I'm seeing the
 following oops when mounting xfs. rc2-ish kernels seem to be fine:

 [   64.669633] [ cut here ]
 [   64.670008] kernel BUG at drivers/block/virtio_blk.c:172!


 Hmm, that's:

 BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);

 But during our probe routine we said:

 /* We can handle whatever the host told us to handle. */
 blk_queue_max_segments(q, vblk->sg_elems-2);

 Jens?


 Known, I'm afraid, Ming is looking into it.

There is one obvious bug which should have been fixed by the patch
below (0001-block-blk-merge-fix-blk_recount_segments.patch):

http://marc.info/?l=linux-virtualization&m=141562191719405&q=p3

And there might be another one. I'd appreciate it if someone could post
the log printed by the patch (blk-seg.patch) in the link below, if the
bug can still be triggered even with the above fix:

http://marc.info/?l=linux-virtualization&m=141473040618467&q=p3


Thanks,
-- 
Ming Lei


Re: Re: kernel BUG at drivers/block/virtio_blk.c:172

2014-10-30 Thread Ming Lei
On Thu, Oct 30, 2014 at 10:38 PM, Jens Axboe ax...@kernel.dk wrote:
 Forgot to CC you...


  Forwarded Message 
 Subject: Re: kernel BUG at drivers/block/virtio_blk.c:172
 Date: Thu, 30 Oct 2014 08:38:08 -0600
 From: Jens Axboe ax...@kernel.dk
 To: Lukáš Czerner lczer...@redhat.com,
 virtualization@lists.linux-foundation.org
 CC: h...@lst.de

 On 2014-10-30 08:33, Lukáš Czerner wrote:

 Hi,

 I've just hit this BUG at drivers/block/virtio_blk.c when updated to
 the kernel from the top of the Linus git tree.

 commit a7ca10f263d7e673c74d8e0946d6b9993405cc9c

 This is my virtual machine running on RHEL7 guest

 qemu-kvm-1.5.3-60.el7.x86_64

 The last upstream kernel (3.17.0-rc4) worked well. I'll try to
 bisect, but meanwhile this is a backtrace I got very early in the boot.
 The root fs on that guest is xfs and I am using raw disk image and virtio
 driver.

 Let me know if you need more information.


 Ming, looks like this still isn't really fixed. The above upstream
 commit has the latest fixup as well for the segments being wrong, so
 nothing else should be pending.

That looks weird, and I can't reproduce it with mkfs.xfs && mount
in my environment.

Lukáš, could you reproduce the issue with the attached debug patch
and post the result? BTW, do you pass 'scsi=off' on the qemu
command line for the virtio-blk device?



Thanks,
-- 
Ming Lei
diff --git a/block/blk-merge.c b/block/blk-merge.c
index b3ac40a..91f3275 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -113,6 +113,15 @@ void blk_recount_segments(struct request_queue *q, struct bio *bio)
 		bio->bi_next = nxt;
 	}
 
+	if (no_sg_merge && (bio->bi_phys_segments >= queue_max_segments(q)))
+		printk("%s: %d-%d-%d vcnt-%d segs-%d\n",
+				__func__,
+				no_sg_merge,
+				!bio_flagged(bio, BIO_CLONED),
+				merge_not_need,
+				bio->bi_vcnt,
+				bio->bi_phys_segments);
+
 	bio->bi_flags |= (1 << BIO_SEG_VALID);
 }
 EXPORT_SYMBOL(blk_recount_segments);

Re: [PATCH] virtio_blk: fix race at module removal

2014-10-23 Thread Ming Lei
On Fri, Oct 24, 2014 at 12:12 AM, Michael S. Tsirkin m...@redhat.com wrote:
 If a device appears while the module is being removed,
 the driver will get a callback after we've given up
 on the major number.

 In theory this means this major number can get reused
 by something else, resulting in a conflict.

Yes, there is a tiny race window.


 To fix, clean up in reverse order of initialization.

Reviewed-by: Ming Lei ming@canonical.com

 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
  drivers/block/virtio_blk.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
 index 56aadbc..adfba9f 100644
 --- a/drivers/block/virtio_blk.c
 +++ b/drivers/block/virtio_blk.c
 @@ -883,8 +883,8 @@ out_destroy_workqueue:

  static void __exit fini(void)
  {
  -   unregister_blkdev(major, "virtblk");
  unregister_virtio_driver(&virtio_blk);
  +   unregister_blkdev(major, "virtblk");
 destroy_workqueue(virtblk_wq);
  }
  module_init(init);

Thanks,
-- 
Ming Lei


Re: blk-mq crash under KVM in multiqueue block code (with virtio-blk and ext4)

2014-09-17 Thread Ming Lei
On Wed, Sep 17, 2014 at 3:59 PM, Christian Borntraeger
borntrae...@de.ibm.com wrote:
 On 09/12/2014 10:09 PM, Christian Borntraeger wrote:
 On 09/12/2014 01:54 PM, Ming Lei wrote:
 On Thu, Sep 11, 2014 at 6:26 PM, Christian Borntraeger
 borntrae...@de.ibm.com wrote:
 Folks,

  we have seen the following bug with 3.16 as a KVM guest. I suspect the
  blk-mq rework that happened between 3.15 and 3.16, but it can be something
  completely different.


 Care to share how you reproduce the issue?

 Host with 16GB RAM and 32GB swap. 15 guests, all with 2 GB RAM (and a varying
 amount of CPUs). All do heavy file I/O.
 It did not happen with 3.15/3.15 in guest/host and does happen with
 3.16/3.16. So our next step is to check
 3.15/3.16 and 3.16/3.15 to identify whether it's host memory mgmt or the
 guest block layer.

 The crashes happen pretty randomly, but when they happen it seems that it's
 the same trace as below. This makes memory corruption by the host vm less
 likely, and something wrong in blk-mq more likely, I guess.


Maybe you can try these patches, because atomic ops can be reordered
on s390:

http://marc.info/?l=linux-kernel&m=141094730828533&w=2
http://marc.info/?l=linux-kernel&m=141094730828534&w=2

Thanks
-- 
Ming Lei
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: blk-mq crash under KVM in multiqueue block code (with virtio-blk and ext4)

2014-09-17 Thread Ming Lei
On Wed, 17 Sep 2014 14:00:34 +0200
David Hildenbrand d...@linux.vnet.ibm.com wrote:

   Does anyone have an idea?
   The request itself is completely filled with cc
  
    That is very weird: the 'rq' is fetched from hctx->tags, and rq should be
    valid, and rq->q shouldn't have been changed even if it was
    double freed or double allocated.
  
    I am currently asking myself if blk_mq_map_request should protect
    against softirq here, but I can't say for sure, as I have never looked
    into that code before.
  
    No, it doesn't need that protection.
  
   Thanks,
  
   
   
  
 
 Digging through the code, I think I found a possible cause:
 
 tags->rqs[..] is not initialized with zeroes (via alloc_pages_node in
 blk-mq.c:blk_mq_init_rq_map()).

Yes, it may cause a problem when the request is allocated for the first
time: the timeout handler may come just after the allocation and before its
initialization, and then an oops is triggered because of the garbage data
in the request.

--
From ffd0824b7b686074c2d5d70bc4e6bba3ba56a30c Mon Sep 17 00:00:00 2001
From: Ming Lei ming@canonical.com
Date: Wed, 17 Sep 2014 21:00:34 +0800
Subject: [PATCH] blk-mq: initialize request before the 1st allocation

Otherwise the request can be accessed from the timeout handler
just after its first allocation from the tag pool and before
initialization in blk_mq_rq_ctx_init(), causing an oops since
the request is filled with garbage data.

Signed-off-by: Ming Lei ming@canonical.com
---
 block/blk-mq.c |   10 ++
 1 file changed, 10 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4aac826..d24673f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -514,6 +514,10 @@ struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
 {
 	struct request *rq = tags->rqs[tag];
 
+	/* uninitialized request */
+	if (!rq->q || rq->tag == -1)
+		return rq;
+
 	if (!is_flush_request(rq, tag))
 		return rq;
 
@@ -1401,6 +1405,12 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 			left -= to_do * rq_size;
 			for (j = 0; j < to_do; j++) {
 				tags->rqs[i] = p;
+
+				/* Avoiding early access from timeout handler */
+				tags->rqs[i]->tag = -1;
+				tags->rqs[i]->q = NULL;
+				tags->rqs[i]->cmd_flags = 0;
+
 				if (set->ops->init_request) {
 					if (set->ops->init_request(set->driver_data,
 							tags->rqs[i], hctx_idx, i,
-- 
1.7.9.5





-- 
Ming Lei


Re: blk-mq crash under KVM in multiqueue block code (with virtio-blk and ext4)

2014-09-17 Thread Ming Lei
On Wed, Sep 17, 2014 at 10:22 PM, Jens Axboe ax...@kernel.dk wrote:

 Another way would be to ensure that the timeout handler doesn't touch hw_ctx
 or tag_sets that aren't fully initialized yet. But I think this is
 safer/cleaner.

It may not be easy, or enough, to check whether hw_ctx/tag_sets are fully
initialized, if you mean all requests having been used at least once.

On Wed, Sep 17, 2014 at 10:11 PM, David Hildenbrand
 I was playing with a simple patch that just sets cmd_flags and action_flags to

What is action_flags?

 0. That should already be sufficient to hinder blk_mq_tag_to_rq and the
 calling method from doing the wrong thing.

Yes, clearing rq->cmd_flags should be enough.

And it looks better to move the rq initialization to __blk_mq_free_request()
too; otherwise the timeout handler may still see the old cmd_flags and rq->q
before the rq's re-initialization.
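For illustration, the reset at free time could look like the sketch below
(the exact field set and its placement inside __blk_mq_free_request() are
assumptions here, not a posted patch):

	/*
	 * Sketch: reset the fields the timeout handler inspects before
	 * the tag goes back to the pool, so a recycled rq looks
	 * "uninitialized" rather than half-valid.
	 */
	rq->cmd_flags = 0;
	rq->tag = -1;
	rq->q = NULL;
	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);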


Thanks,
-- 
Ming Lei


Re: blk-mq crash under KVM in multiqueue block code (with virtio-blk and ext4)

2014-09-17 Thread Ming Lei
On Thu, Sep 18, 2014 at 3:09 AM, David Hildenbrand
d...@linux.vnet.ibm.com wrote:
 On Wed, Sep 17, 2014 at 10:22 PM, Jens Axboe ax...@kernel.dk wrote:
 
  Another way would be to ensure that the timeout handler doesn't touch
  hw_ctx or tag_sets that aren't fully initialized yet. But I think this
  is safer/cleaner.

 It may not be easy, or enough, to check whether hw_ctx/tag_sets are fully
 initialized, if you mean all requests having been used at least once.

 On Wed, Sep 17, 2014 at 10:11 PM, David Hildenbrand
  I was playing with a simple patch that just sets cmd_flags and 
  action_flags to

 What is action_flags?

 atomic_flags, sorry :)

 Otherwise e.g. REQ_ATOM_STARTED could already be set due to the randomness.
 I am not sure if this is really necessary, or if it is completely shielded
 by the tag-handling code, but it seemed cleaner to me to do it (and I
 remember it not being set within blk_mq_rq_ctx_init).

You are right, both cmd_flags and atomic_flags should be cleared
in blk_mq_init_rq_map().
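As a sketch, the pre-initialization could then cover both fields; clearing
atomic_flags as below is an assumption based on this discussion, not a
posted patch:

	/* sketch: make a never-used rq look safely "uninitialized" */
	tags->rqs[i]->tag = -1;
	tags->rqs[i]->q = NULL;
	tags->rqs[i]->cmd_flags = 0;
	tags->rqs[i]->atomic_flags = 0;	/* also clears REQ_ATOM_STARTED */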



  0. That should already be sufficient to hinder blk_mq_tag_to_rq and the
  calling method from doing the wrong thing.

 Yes, clearing rq->cmd_flags should be enough.

 And it looks better to move the rq initialization to __blk_mq_free_request()
 too; otherwise the timeout handler may still see the old cmd_flags and rq->q
 before the rq's re-initialization.

 Yes, __blk_mq_free_request() should also reset at least rq->cmd_flags, and I
 think we can remove the initialization from __blk_mq_alloc_request().

 David



 Thanks,




-- 
Ming Lei


Re: [PATCH] virtio_blk: merge S/G list entries by default

2014-09-10 Thread Ming Lei
On Wed, Sep 10, 2014 at 11:18 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 07/09/2014 12:32, Ming Lei wrote:
 It is a good idea to disable SG merge for a vq incapable of indirect
 descriptors, because direct descriptors are very limited there.

 I think you mean _enabling_ SG merge if indirect descriptors are not there.

You are right, sorry for the typo.

 For a vq capable of indirect, it is better not to do SG merge by default,
 because:

 - from the hypervisor's view, no matter how many segments one request has,
 all are submitted to the host kernel by one syscall, such as readv/io_submit

 - the host kernel still needs to do the same merge again

 Here we agree.

 Paolo

 From my test (virtio-blk over null_blk), it looks like enabling SG merge
 may cause a small throughput drop (~3%).



Thanks,
-- 
Ming Lei


Re: [PATCH] virtio_blk: merge S/G list entries by default

2014-09-09 Thread Ming Lei
On Sun, Sep 7, 2014 at 7:09 AM, Christoph Hellwig h...@lst.de wrote:
 Most virtio setups have a fairly limited number of ring entries available.
 Enable S/G entry merging by default to fit into fewer of them.  This
 restores the behavior at the time of the virtio-blk blk-mq conversion,
 which was changed by commit "block: add queue flag for disabling SG
 merging", which made the behavior optional but didn't update the existing
 drivers to keep their previous behavior.

It is a good idea to disable SG merge for a vq incapable of indirect
descriptors, because direct descriptors are very limited there.

For a vq capable of indirect, it is better not to do SG merge by default,
because:

- from the hypervisor's view, no matter how many segments one request has,
all are submitted to the host kernel by one syscall, such as readv/io_submit

- the host kernel still needs to do the same merge again

From my test (virtio-blk over null_blk), it looks like enabling SG merge
may cause a small throughput drop (~3%).
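Following that reasoning, a minimal sketch of such a conditional default
(the feature check and flag placement are assumptions, not a posted patch):

	/*
	 * Sketch: only enable SG merging when the ring lacks indirect
	 * descriptors, i.e. when direct descriptors are scarce.
	 */
	vblk->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
	if (!virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC))
		vblk->tag_set.flags |= BLK_MQ_F_SG_MERGE;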


 Signed-off-by: Christoph Hellwig h...@lst.de
 ---
  drivers/block/virtio_blk.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
 index 0a58140..311b857 100644
 --- a/drivers/block/virtio_blk.c
 +++ b/drivers/block/virtio_blk.c
 @@ -634,7 +634,7 @@ static int virtblk_probe(struct virtio_device *vdev)
 	vblk->tag_set.ops = &virtio_mq_ops;
 	vblk->tag_set.queue_depth = virtblk_queue_depth;
 	vblk->tag_set.numa_node = NUMA_NO_NODE;
 -	vblk->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
 +	vblk->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
 	vblk->tag_set.cmd_size =
 		sizeof(struct virtblk_req) +
 		sizeof(struct scatterlist) * sg_elems;
 --
 1.9.1



Thanks,
-- 
Ming Lei


Re: [PATCH v3 0/2] block: virtio-blk: support multi vq per virtio-blk

2014-06-30 Thread Ming Lei
Hi Jens and Rusty,

On Thu, Jun 26, 2014 at 8:04 PM, Ming Lei ming@canonical.com wrote:
 On Thu, Jun 26, 2014 at 5:41 PM, Ming Lei ming@canonical.com wrote:
 Hi,

 These patches try to support multiple virtual queues (multi-vq) in one
 virtio-blk device, and map each virtual queue (vq) to a blk-mq
 hardware queue.

 With this approach, both scalability and performance on virtio-blk
 device can get improved.

 For verifying the improvement, I implemented virtio-blk multi-vq over
 qemu's dataplane feature, and both handling host notification
 from each vq and processing host I/O are still kept in the per-device
 iothread context, the change is based on qemu v2.0.0 release, and
 can be accessed from below tree:

 git://kernel.ubuntu.com/ming/qemu.git #v2.0.0-virtblk-mq.1

 For enabling the multi-vq feature, 'num_queues=N' needs to be added to
 '-device virtio-blk-pci ...' in the qemu command line, and it is suggested
 to pass 'vectors=N+1' to keep one MSI irq vector for each vq; the feature
 depends on x-data-plane.

 Fio(libaio, randread, iodepth=64, bs=4K, jobs=N) is run inside VM to
 verify the improvement.

 I just created a small quad-core VM and ran fio inside the VM with
 num_queues of the virtio-blk device set to 2, and the improvement
 still looks obvious. The host is a 2-socket, 8-core (16-thread) server.

 1), about scalability
 - jobs = 2, throughput: +33%
 - jobs = 4, throughput: +100%

 2), about top throughput: +39%

 So in my test, even for a quad-core VM, if the virtqueue number
 is increased from 1 to 2, both scalability and performance can
 get improved a lot.

 In the above qemu implementation of the virtio-blk-mq device, only one
 iothread handles requests from all vqs, and the above throughput data is
 already very close to the same fio test run on the host side with a single
 job. So more improvement should be observed once more iothreads are used
 for handling requests from the multiple vqs.

 TODO:
 - adjust vq's irq smp_affinity according to blk-mq hw queue's cpumask

 V3:
 - fix use-after-free on vq->name reported by Michael

 V2: (suggestions from Michael and Dave Chinner)
 - allocate virtqueues' pointers dynamically
 - make sure the per-queue spinlock isn't kept in same cache line
 - make each queue's name different

 V1:
 - remove RFC since no one objects
 - add '__u8 unused' for pending as suggested by Rusty
 - use virtio_cread_feature() directly, suggested by Rusty

 Sorry, please add Jens' reviewed-by.

 Reviewed-by: Jens Axboe ax...@kernel.dk

I would appreciate it very much if one of you could queue these two
patches into your tree so that the userspace work can be kicked off,
since Michael has acked both patches and all comments have been
addressed already.


Thanks,
--
Ming Lei


Re: [PATCH v2 2/2] block: virtio-blk: support multi virt queues per virtio-blk device

2014-06-26 Thread Ming Lei
On Thu, Jun 26, 2014 at 3:45 PM, Michael S. Tsirkin m...@redhat.com wrote:
 On Thu, Jun 26, 2014 at 10:08:46AM +0800, Ming Lei wrote:
 Firstly this patch supports more than one virtual queue for a virtio-blk
 device.

 Secondly this patch maps the virtual queue to blk-mq's hardware queue.

 With this approach, both scalability and performance can be improved.

 Signed-off-by: Ming Lei ming@canonical.com
 ---
  drivers/block/virtio_blk.c |  109 
 
  1 file changed, 89 insertions(+), 20 deletions(-)

 diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
 index f63d358..b0a49a0 100644
 --- a/drivers/block/virtio_blk.c
 +++ b/drivers/block/virtio_blk.c
 @@ -21,11 +21,14 @@ static DEFINE_IDA(vd_index_ida);

  static struct workqueue_struct *virtblk_wq;

 +struct virtio_blk_vq {
 + struct virtqueue *vq;
 + spinlock_t lock;
 +} cacheline_aligned_in_smp;
 +

 Padding wastes a hot cacheline here.
 What about this patch I sent:

 virtio-blk: move spinlock to vq itself

 Signed-off-by: Michael S. Tsirkin m...@redhat.com

 Rusty didn't respond, try including it as 1/3 in your patchset
 and we'll see if anyone objects?

I think your patch is fine, but I'd like to follow the current virtio
vq lock rule.

Your patch should not be part of this patchset, because it introduces
a spinlock inside the vq, and you would need to switch the other
virtio devices' per-vq locks over to the built-in lock too in your
own patchset.



  struct virtio_blk
  {
   struct virtio_device *vdev;
 - struct virtqueue *vq;
 - spinlock_t vq_lock;

   /* The disk structure for the kernel. */
   struct gendisk *disk;
 @@ -47,6 +50,10 @@ struct virtio_blk

   /* Ida index - used to track minor number allocations. */
   int index;
 +
 + /* num of vqs */
 + int num_vqs;
 + struct virtio_blk_vq *vqs;
  };

  struct virtblk_req
 @@ -133,14 +140,15 @@ static void virtblk_done(struct virtqueue *vq)
  {
 	struct virtio_blk *vblk = vq->vdev->priv;
 	bool req_done = false;
 +	int qid = vq->index;
 	struct virtblk_req *vbr;
 	unsigned long flags;
 	unsigned int len;
 
 -	spin_lock_irqsave(&vblk->vq_lock, flags);
 +	spin_lock_irqsave(&vblk->vqs[qid].lock, flags);
 	do {
 		virtqueue_disable_cb(vq);
 -		while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
 +		while ((vbr = virtqueue_get_buf(vblk->vqs[qid].vq, &len)) != NULL) {
 			blk_mq_complete_request(vbr->req);
 			req_done = true;
 		}
 @@ -151,7 +159,7 @@ static void virtblk_done(struct virtqueue *vq)
 	/* In case queue is stopped waiting for more buffers. */
 	if (req_done)
 		blk_mq_start_stopped_hw_queues(vblk->disk->queue, true);
 -	spin_unlock_irqrestore(&vblk->vq_lock, flags);
 +	spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);
  }

  static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 @@ -160,6 +168,7 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 	struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
 	unsigned long flags;
 	unsigned int num;
 +	int qid = hctx->queue_num;
 	const bool last = (req->cmd_flags & REQ_END) != 0;
 	int err;
 	bool notify = false;
 @@ -202,12 +211,12 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 		vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
 	}
 
 -	spin_lock_irqsave(&vblk->vq_lock, flags);
 -	err = __virtblk_add_req(vblk->vq, vbr, vbr->sg, num);
 +	spin_lock_irqsave(&vblk->vqs[qid].lock, flags);
 +	err = __virtblk_add_req(vblk->vqs[qid].vq, vbr, vbr->sg, num);
 	if (err) {
 -		virtqueue_kick(vblk->vq);
 +		virtqueue_kick(vblk->vqs[qid].vq);
 		blk_mq_stop_hw_queue(hctx);
 -		spin_unlock_irqrestore(&vblk->vq_lock, flags);
 +		spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);
 		/* Out of mem doesn't actually happen, since we fall back
 		 * to direct descriptors */
 		if (err == -ENOMEM || err == -ENOSPC)
 @@ -215,12 +224,12 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 		return BLK_MQ_RQ_QUEUE_ERROR;
 	}
 
 -	if (last && virtqueue_kick_prepare(vblk->vq))
 +	if (last && virtqueue_kick_prepare(vblk->vqs[qid].vq))
 		notify = true;
 -	spin_unlock_irqrestore(&vblk->vq_lock, flags);
 +	spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);
 
 	if (notify)
 -		virtqueue_notify(vblk->vq);
 +		virtqueue_notify(vblk->vqs[qid].vq);
 	return BLK_MQ_RQ_QUEUE_OK;
  }

 @@ -377,12 +386,71 @@ static void virtblk_config_changed(struct virtio_device *vdev)
  static int init_vq(struct virtio_blk *vblk)
  {
 	int err = 0;
 +	int i;
 +	vq_callback_t **callbacks;
 +	const char **names;
 +	char

[PATCH v3 2/2] block: virtio-blk: support multi virt queues per virtio-blk device

2014-06-26 Thread Ming Lei
Firstly this patch supports more than one virtual queue for a virtio-blk
device.

Secondly this patch maps the virtual queue to blk-mq's hardware queue.

With this approach, both scalability and performance can be improved.

Signed-off-by: Ming Lei ming@canonical.com
---
 drivers/block/virtio_blk.c |  104 +++-
 1 file changed, 84 insertions(+), 20 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index f63d358..0a58140 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -15,17 +15,22 @@
 #include linux/numa.h
 
 #define PART_BITS 4
+#define VQ_NAME_LEN 16
 
 static int major;
 static DEFINE_IDA(vd_index_ida);
 
 static struct workqueue_struct *virtblk_wq;
 
+struct virtio_blk_vq {
+   struct virtqueue *vq;
+   spinlock_t lock;
+   char name[VQ_NAME_LEN];
+} cacheline_aligned_in_smp;
+
 struct virtio_blk
 {
struct virtio_device *vdev;
-   struct virtqueue *vq;
-   spinlock_t vq_lock;
 
/* The disk structure for the kernel. */
struct gendisk *disk;
@@ -47,6 +52,10 @@ struct virtio_blk
 
/* Ida index - used to track minor number allocations. */
int index;
+
+   /* num of vqs */
+   int num_vqs;
+   struct virtio_blk_vq *vqs;
 };
 
 struct virtblk_req
@@ -133,14 +142,15 @@ static void virtblk_done(struct virtqueue *vq)
 {
 	struct virtio_blk *vblk = vq->vdev->priv;
 	bool req_done = false;
+	int qid = vq->index;
 	struct virtblk_req *vbr;
 	unsigned long flags;
 	unsigned int len;
 
-	spin_lock_irqsave(&vblk->vq_lock, flags);
+	spin_lock_irqsave(&vblk->vqs[qid].lock, flags);
 	do {
 		virtqueue_disable_cb(vq);
-		while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
+		while ((vbr = virtqueue_get_buf(vblk->vqs[qid].vq, &len)) != NULL) {
 			blk_mq_complete_request(vbr->req);
 			req_done = true;
 		}
@@ -151,7 +161,7 @@ static void virtblk_done(struct virtqueue *vq)
 	/* In case queue is stopped waiting for more buffers. */
 	if (req_done)
 		blk_mq_start_stopped_hw_queues(vblk->disk->queue, true);
-	spin_unlock_irqrestore(&vblk->vq_lock, flags);
+	spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);
 }
 
 static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
@@ -160,6 +170,7 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 	struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
 	unsigned long flags;
 	unsigned int num;
+	int qid = hctx->queue_num;
 	const bool last = (req->cmd_flags & REQ_END) != 0;
 	int err;
 	bool notify = false;
@@ -202,12 +213,12 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 		vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
 	}
 
-	spin_lock_irqsave(&vblk->vq_lock, flags);
-	err = __virtblk_add_req(vblk->vq, vbr, vbr->sg, num);
+	spin_lock_irqsave(&vblk->vqs[qid].lock, flags);
+	err = __virtblk_add_req(vblk->vqs[qid].vq, vbr, vbr->sg, num);
 	if (err) {
-		virtqueue_kick(vblk->vq);
+		virtqueue_kick(vblk->vqs[qid].vq);
 		blk_mq_stop_hw_queue(hctx);
-		spin_unlock_irqrestore(&vblk->vq_lock, flags);
+		spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);
 		/* Out of mem doesn't actually happen, since we fall back
 		 * to direct descriptors */
 		if (err == -ENOMEM || err == -ENOSPC)
@@ -215,12 +226,12 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 		return BLK_MQ_RQ_QUEUE_ERROR;
 	}
 
-	if (last && virtqueue_kick_prepare(vblk->vq))
+	if (last && virtqueue_kick_prepare(vblk->vqs[qid].vq))
 		notify = true;
-	spin_unlock_irqrestore(&vblk->vq_lock, flags);
+	spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);
 
 	if (notify)
-		virtqueue_notify(vblk->vq);
+		virtqueue_notify(vblk->vqs[qid].vq);
 	return BLK_MQ_RQ_QUEUE_OK;
 }
 
@@ -377,12 +388,64 @@ static void virtblk_config_changed(struct virtio_device *vdev)
 static int init_vq(struct virtio_blk *vblk)
 {
 	int err = 0;
+	int i;
+	vq_callback_t **callbacks;
+	const char **names;
+	struct virtqueue **vqs;
+	unsigned short num_vqs;
+	struct virtio_device *vdev = vblk->vdev;
+
+	err = virtio_cread_feature(vdev, VIRTIO_BLK_F_MQ,
+				   struct virtio_blk_config, num_queues,
+				   &num_vqs);
+	if (err)
+		num_vqs = 1;
+
+	vblk->vqs = kmalloc(sizeof(*vblk->vqs) * num_vqs, GFP_KERNEL);
+	if (!vblk->vqs) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	names = kmalloc(sizeof

Re: [PATCH v3 0/2] block: virtio-blk: support multi vq per virtio-blk

2014-06-26 Thread Ming Lei
On Thu, Jun 26, 2014 at 5:41 PM, Ming Lei ming@canonical.com wrote:
 Hi,

 These patches try to support multiple virtual queues (multi-vq) in one
 virtio-blk device, and map each virtual queue (vq) to a blk-mq
 hardware queue.

 With this approach, both scalability and performance on virtio-blk
 device can get improved.

 For verifying the improvement, I implemented virtio-blk multi-vq over
 qemu's dataplane feature, and both handling host notification
 from each vq and processing host I/O are still kept in the per-device
 iothread context, the change is based on qemu v2.0.0 release, and
 can be accessed from below tree:

 git://kernel.ubuntu.com/ming/qemu.git #v2.0.0-virtblk-mq.1

 For enabling the multi-vq feature, 'num_queues=N' needs to be added to
 '-device virtio-blk-pci ...' in the qemu command line, and it is suggested
 to pass 'vectors=N+1' to keep one MSI irq vector for each vq; the feature
 depends on x-data-plane.

 Fio(libaio, randread, iodepth=64, bs=4K, jobs=N) is run inside VM to
 verify the improvement.

 I just created a small quad-core VM and ran fio inside the VM with
 num_queues of the virtio-blk device set to 2, and the improvement
 still looks obvious. The host is a 2-socket, 8-core (16-thread) server.

 1), about scalability
 - jobs = 2, throughput: +33%
 - jobs = 4, throughput: +100%

 2), about top throughput: +39%

 So in my test, even for a quad-core VM, if the virtqueue number
 is increased from 1 to 2, both scalability and performance can
 get improved a lot.

 In the above qemu implementation of the virtio-blk-mq device, only one
 iothread handles requests from all vqs, and the above throughput data is
 already very close to the same fio test run on the host side with a single
 job. So more improvement should be observed once more iothreads are used
 for handling requests from the multiple vqs.

 TODO:
 - adjust vq's irq smp_affinity according to blk-mq hw queue's cpumask

 V3:
 - fix use-after-free on vq->name reported by Michael

 V2: (suggestions from Michael and Dave Chinner)
 - allocate virtqueues' pointers dynamically
 - make sure the per-queue spinlock isn't kept in same cache line
 - make each queue's name different

 V1:
 - remove RFC since no one objects
 - add '__u8 unused' for pending as suggested by Rusty
 - use virtio_cread_feature() directly, suggested by Rusty

Sorry, please add Jens' reviewed-by.

Reviewed-by: Jens Axboe ax...@kernel.dk

Thanks,
--
Ming Lei


[PATCH v2 0/2] block: virtio-blk: support multi vq per virtio-blk

2014-06-25 Thread Ming Lei
Hi,

These patches try to support multiple virtual queues (multi-vq) in one
virtio-blk device, and map each virtual queue (vq) to a blk-mq
hardware queue.

With this approach, both scalability and performance on virtio-blk
device can get improved.

For verifying the improvement, I implemented virtio-blk multi-vq over
qemu's dataplane feature, and both handling host notification
from each vq and processing host I/O are still kept in the per-device
iothread context, the change is based on qemu v2.0.0 release, and
can be accessed from below tree:

git://kernel.ubuntu.com/ming/qemu.git #v2.0.0-virtblk-mq.1

For enabling the multi-vq feature, 'num_queues=N' needs to be added to
'-device virtio-blk-pci ...' in the qemu command line, and it is suggested
to pass 'vectors=N+1' to keep one MSI irq vector for each vq; the feature
depends on x-data-plane.
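As a concrete (hypothetical) example for a 2-vq device on the qemu tree
above, the relevant part of the command line would look roughly like:

  qemu-system-x86_64 ... \
      -drive if=none,id=drive0,file=/path/to/disk.img,cache=none,aio=native \
      -device virtio-blk-pci,drive=drive0,x-data-plane=on,num_queues=2,vectors=3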

Fio(libaio, randread, iodepth=64, bs=4K, jobs=N) is run inside VM to
verify the improvement.

I just created a small quad-core VM and ran fio inside the VM with
num_queues of the virtio-blk device set to 2, and the improvement
still looks obvious.

1), about scalability
- without multi-vq feature
-- jobs=2, throughput: 145K iops
-- jobs=4, throughput: 100K iops
- with multi-vq feature
-- jobs=2, throughput: 193K iops
-- jobs=4, throughput: 202K iops

2), about throughput
- without multi-vq feature
-- throughput: 145K iops
- with multi-vq feature
-- throughput: 202K iops

So in my test, even for a quad-core VM, if the virtqueue number
is increased from 1 to 2, both scalability and performance can
get improved a lot.

TODO:
- adjust vq's irq smp_affinity according to blk-mq hw queue's cpumask

V2: (suggestions from Michael and Dave Chinner)
- allocate virtqueues' pointers dynamically
- make sure the per-queue spinlock isn't kept in same cache line
- make each queue's name different

V1:
- remove RFC since no one objects
- add '__u8 unused' for pending as suggested by Rusty
- use virtio_cread_feature() directly, suggested by Rusty

Thanks,
--
Ming Lei




[PATCH v2 1/2] include/uapi/linux/virtio_blk.h: introduce feature of VIRTIO_BLK_F_MQ

2014-06-25 Thread Ming Lei
The current virtio-blk spec only supports one virtual queue for transferring
data between VM and host, and inside the VM all kinds of operations on
the virtual queue need to hold one lock, which causes the problems below:

- bad scalability
- bad throughput

This patch proposes introducing the VIRTIO_BLK_F_MQ feature so that
more than one virtual queue can be used by a virtio-blk device; the
above problems can then be solved or eased.

Signed-off-by: Ming Lei ming@canonical.com
---
 include/uapi/linux/virtio_blk.h |5 +
 1 file changed, 5 insertions(+)

diff --git a/include/uapi/linux/virtio_blk.h b/include/uapi/linux/virtio_blk.h
index 6d8e61c..9ad67b2 100644
--- a/include/uapi/linux/virtio_blk.h
+++ b/include/uapi/linux/virtio_blk.h
@@ -40,6 +40,7 @@
 #define VIRTIO_BLK_F_WCE   9   /* Writeback mode enabled after reset */
 #define VIRTIO_BLK_F_TOPOLOGY  10  /* Topology information is available */
 #define VIRTIO_BLK_F_CONFIG_WCE	11	/* Writeback mode available in config */
+#define VIRTIO_BLK_F_MQ		12	/* support more than one vq */
 
 #ifndef __KERNEL__
 /* Old (deprecated) name for VIRTIO_BLK_F_WCE. */
@@ -77,6 +78,10 @@ struct virtio_blk_config {
 
/* writeback mode (if VIRTIO_BLK_F_CONFIG_WCE) */
__u8 wce;
+   __u8 unused;
+
+   /* number of vqs, only available when VIRTIO_BLK_F_MQ is set */
+   __u16 num_queues;
 } __attribute__((packed));
 
 /*
-- 
1.7.9.5



[PATCH v2 2/2] block: virtio-blk: support multi virt queues per virtio-blk device

2014-06-25 Thread Ming Lei
Firstly this patch supports more than one virtual queue for a virtio-blk
device.

Secondly this patch maps the virtual queue to blk-mq's hardware queue.

With this approach, both scalability and performance can be improved.

Signed-off-by: Ming Lei ming@canonical.com
---
 drivers/block/virtio_blk.c |  109 
 1 file changed, 89 insertions(+), 20 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index f63d358..b0a49a0 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -21,11 +21,14 @@ static DEFINE_IDA(vd_index_ida);
 
 static struct workqueue_struct *virtblk_wq;
 
+struct virtio_blk_vq {
+   struct virtqueue *vq;
+   spinlock_t lock;
+} cacheline_aligned_in_smp;
+
 struct virtio_blk
 {
struct virtio_device *vdev;
-   struct virtqueue *vq;
-   spinlock_t vq_lock;
 
/* The disk structure for the kernel. */
struct gendisk *disk;
@@ -47,6 +50,10 @@ struct virtio_blk
 
/* Ida index - used to track minor number allocations. */
int index;
+
+   /* num of vqs */
+   int num_vqs;
+   struct virtio_blk_vq *vqs;
 };
 
 struct virtblk_req
@@ -133,14 +140,15 @@ static void virtblk_done(struct virtqueue *vq)
 {
 	struct virtio_blk *vblk = vq->vdev->priv;
 	bool req_done = false;
+	int qid = vq->index;
 	struct virtblk_req *vbr;
 	unsigned long flags;
 	unsigned int len;
 
-	spin_lock_irqsave(&vblk->vq_lock, flags);
+	spin_lock_irqsave(&vblk->vqs[qid].lock, flags);
 	do {
 		virtqueue_disable_cb(vq);
-		while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
+		while ((vbr = virtqueue_get_buf(vblk->vqs[qid].vq, &len)) != NULL) {
 			blk_mq_complete_request(vbr->req);
 			req_done = true;
 		}
@@ -151,7 +159,7 @@ static void virtblk_done(struct virtqueue *vq)
 	/* In case queue is stopped waiting for more buffers. */
 	if (req_done)
 		blk_mq_start_stopped_hw_queues(vblk->disk->queue, true);
-	spin_unlock_irqrestore(&vblk->vq_lock, flags);
+	spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);
 }
 
 static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
@@ -160,6 +168,7 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 	struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
 	unsigned long flags;
 	unsigned int num;
+	int qid = hctx->queue_num;
 	const bool last = (req->cmd_flags & REQ_END) != 0;
 	int err;
 	bool notify = false;
@@ -202,12 +211,12 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 		vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
 	}
 
-	spin_lock_irqsave(&vblk->vq_lock, flags);
-	err = __virtblk_add_req(vblk->vq, vbr, vbr->sg, num);
+	spin_lock_irqsave(&vblk->vqs[qid].lock, flags);
+	err = __virtblk_add_req(vblk->vqs[qid].vq, vbr, vbr->sg, num);
 	if (err) {
-		virtqueue_kick(vblk->vq);
+		virtqueue_kick(vblk->vqs[qid].vq);
 		blk_mq_stop_hw_queue(hctx);
-		spin_unlock_irqrestore(&vblk->vq_lock, flags);
+		spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);
 		/* Out of mem doesn't actually happen, since we fall back
 		 * to direct descriptors */
 		if (err == -ENOMEM || err == -ENOSPC)
@@ -215,12 +224,12 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 		return BLK_MQ_RQ_QUEUE_ERROR;
 	}
 
-	if (last && virtqueue_kick_prepare(vblk->vq))
+	if (last && virtqueue_kick_prepare(vblk->vqs[qid].vq))
 		notify = true;
-	spin_unlock_irqrestore(&vblk->vq_lock, flags);
+	spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);
 
 	if (notify)
-		virtqueue_notify(vblk->vq);
+		virtqueue_notify(vblk->vqs[qid].vq);
 	return BLK_MQ_RQ_QUEUE_OK;
 }
 
@@ -377,12 +386,71 @@ static void virtblk_config_changed(struct virtio_device *vdev)
 static int init_vq(struct virtio_blk *vblk)
 {
 	int err = 0;
+	int i;
+	vq_callback_t **callbacks;
+	const char **names;
+	char *name_array;
+	struct virtqueue **vqs;
+	unsigned short num_vqs;
+	struct virtio_device *vdev = vblk->vdev;
 
-	/* We expect one virtqueue, for output. */
-	vblk->vq = virtio_find_single_vq(vblk->vdev, virtblk_done, "requests");
-	if (IS_ERR(vblk->vq))
-		err = PTR_ERR(vblk->vq);
+	err = virtio_cread_feature(vdev, VIRTIO_BLK_F_MQ,
+				   struct virtio_blk_config, num_queues,
+				   &num_vqs);
+	if (err)
+		num_vqs = 1;
+
+	vblk->vqs = kmalloc(sizeof(*vblk->vqs) * num_vqs, GFP_KERNEL);
+	if (!vblk->vqs

Re: [PATCH v2 0/2] block: virtio-blk: support multi vq per virtio-blk

2014-06-25 Thread Ming Lei
On Thu, Jun 26, 2014 at 1:05 PM, Jens Axboe ax...@kernel.dk wrote:
 On 2014-06-25 20:08, Ming Lei wrote:

 Hi,

 These patches try to support multiple virtual queues (multi-vq) in one
 virtio-blk device, and map each virtual queue (vq) to a blk-mq
 hardware queue.

 With this approach, both scalability and performance on virtio-blk
 device can get improved.

 For verifying the improvement, I implemented virtio-blk multi-vq over
 qemu's dataplane feature, and both handling host notification
 from each vq and processing host I/O are still kept in the per-device
 iothread context, the change is based on qemu v2.0.0 release, and
 can be accessed from below tree:

 git://kernel.ubuntu.com/ming/qemu.git #v2.0.0-virtblk-mq.1

 For enabling the multi-vq feature, 'num_queues=N' needs to be added to
 '-device virtio-blk-pci ...' in the qemu command line, and it is suggested
 to pass 'vectors=N+1' to keep one MSI irq vector for each vq; the feature
 depends on x-data-plane.

 Fio(libaio, randread, iodepth=64, bs=4K, jobs=N) is run inside VM to
 verify the improvement.

 I just created a small quad-core VM and ran fio inside the VM with
 num_queues of the virtio-blk device set to 2, and the improvement
 still looks obvious.

 1), about scalability
 - without multi-vq feature
 -- jobs=2, throughput: 145K iops
 -- jobs=4, throughput: 100K iops
 - with multi-vq feature
 -- jobs=2, throughput: 193K iops
 -- jobs=4, throughput: 202K iops

 2), about throughput
 - without multi-vq feature
 -- throughput: 145K iops
 - with multi-vq feature
 -- throughput: 202K iops


 Of these numbers, I think it's important to highlight that the 2 thread case
 is 33% faster and the 2 -> 4 thread case scales linearly (100%) while the
 pre-patch case sees negative scaling going from 2 -> 4 threads (-39%).

This is because my qemu implementation of multi vq only uses a
single iothread to handle requests from all vqs, and that iothread
is already at full load now; that said, on the host side the same
fio test (single job) gets ~200K iops too.


 I haven't run your patches yet, but from looking at the code, it looks good.
 It's pretty straightforward. So feel free to add my reviewed-by.

Thanks a lot.


 Rusty, do you want to ack this (and I'll slurp it up for 3.17) or take this
 yourself? Or something else?

That is great if this can be merged to 3.17.

Thanks,
--
Ming Lei


[PATCH v1 0/2] block: virtio-blk: support multi vq per virtio-blk

2014-06-20 Thread Ming Lei
Hi,

These patches try to support multiple virtual queues (multi-vq) in one
virtio-blk device, and map each virtual queue (vq) to a blk-mq
hardware queue.

With this approach, both scalability and performance on virtio-blk
device can get improved.

For verifying the improvement, I implemented virtio-blk multi-vq over
qemu's dataplane feature, and both handling host notification
from each vq and processing host I/O are still kept in the per-device
iothread context, the change is based on qemu v2.0.0 release, and
can be accessed from below tree:

git://kernel.ubuntu.com/ming/qemu.git #v2.0.0-virtblk-mq.1

For enabling the multi-vq feature, 'num_queues=N' needs to be added to
'-device virtio-blk-pci ...' in the qemu command line, and it is suggested
to pass 'vectors=N+1' to keep one MSI irq vector for each vq; the feature
depends on x-data-plane.

Fio(libaio, randread, iodepth=64, bs=4K, jobs=N) is run inside VM to
verify the improvement.

I just created a small quad-core VM and ran fio inside the VM with
num_queues of the virtio-blk device set to 2, and the improvement
still looks obvious.

1), about scalability
- without multi-vq feature
-- jobs=2, throughput: 145K iops
-- jobs=4, throughput: 100K iops
- with multi-vq feature
-- jobs=2, throughput: 186K iops
-- jobs=4, throughput: 199K iops

2), about throughput
- without multi-vq feature
-- top throughput: 145K iops
- with multi-vq feature
-- top throughput: 199K iops

So in my test, even for a quad-core VM, if the virtqueue number
is increased from 1 to 2, both scalability and performance can
get improved a lot.

V1:
- remove RFC since no one objects
- add '__u8 unused' for pending as suggested by Rusty
- use virtio_cread_feature() directly, suggested by Rusty


Thanks,
--
Ming Lei




[PATCH v1 2/2] block: virtio-blk: support multi virt queues per virtio-blk device

2014-06-20 Thread Ming Lei
Firstly this patch supports more than one virtual queue for a virtio-blk
device.

Secondly this patch maps the virtual queue to blk-mq's hardware queue.

With this approach, both scalability and performance can be improved.

Signed-off-by: Ming Lei ming@canonical.com
---
 drivers/block/virtio_blk.c |   70 +++-
 1 file changed, 50 insertions(+), 20 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index f63d358..7c3d686 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -16,6 +16,8 @@
 
 #define PART_BITS 4
 
+#define MAX_NUM_VQ 16
+
 static int major;
 static DEFINE_IDA(vd_index_ida);
 
@@ -24,8 +26,8 @@ static struct workqueue_struct *virtblk_wq;
 struct virtio_blk
 {
struct virtio_device *vdev;
-   struct virtqueue *vq;
-   spinlock_t vq_lock;
+   struct virtqueue *vq[MAX_NUM_VQ];
+   spinlock_t vq_lock[MAX_NUM_VQ];
 
/* The disk structure for the kernel. */
struct gendisk *disk;
@@ -47,6 +49,9 @@ struct virtio_blk
 
/* Ida index - used to track minor number allocations. */
int index;
+
+   /* num of vqs */
+   int num_vqs;
 };
 
 struct virtblk_req
@@ -133,14 +138,15 @@ static void virtblk_done(struct virtqueue *vq)
 {
 	struct virtio_blk *vblk = vq->vdev->priv;
 	bool req_done = false;
+	int qid = vq->index;
 	struct virtblk_req *vbr;
 	unsigned long flags;
 	unsigned int len;
 
-	spin_lock_irqsave(&vblk->vq_lock, flags);
+	spin_lock_irqsave(&vblk->vq_lock[qid], flags);
 	do {
 		virtqueue_disable_cb(vq);
-		while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
+		while ((vbr = virtqueue_get_buf(vblk->vq[qid], &len)) != NULL) {
 			blk_mq_complete_request(vbr->req);
 			req_done = true;
 		}
@@ -151,7 +157,7 @@ static void virtblk_done(struct virtqueue *vq)
 	/* In case queue is stopped waiting for more buffers. */
 	if (req_done)
 		blk_mq_start_stopped_hw_queues(vblk->disk->queue, true);
-	spin_unlock_irqrestore(&vblk->vq_lock, flags);
+	spin_unlock_irqrestore(&vblk->vq_lock[qid], flags);
 }
 
 static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
@@ -160,6 +166,7 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 	struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
 	unsigned long flags;
 	unsigned int num;
+	int qid = hctx->queue_num;
 	const bool last = (req->cmd_flags & REQ_END) != 0;
 	int err;
 	bool notify = false;
@@ -202,12 +209,12 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 		vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
 	}
 
-	spin_lock_irqsave(&vblk->vq_lock, flags);
-	err = __virtblk_add_req(vblk->vq, vbr, vbr->sg, num);
+	spin_lock_irqsave(&vblk->vq_lock[qid], flags);
+	err = __virtblk_add_req(vblk->vq[qid], vbr, vbr->sg, num);
 	if (err) {
-		virtqueue_kick(vblk->vq);
+		virtqueue_kick(vblk->vq[qid]);
 		blk_mq_stop_hw_queue(hctx);
-		spin_unlock_irqrestore(&vblk->vq_lock, flags);
+		spin_unlock_irqrestore(&vblk->vq_lock[qid], flags);
 		/* Out of mem doesn't actually happen, since we fall back
 		 * to direct descriptors */
 		if (err == -ENOMEM || err == -ENOSPC)
@@ -215,12 +222,12 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 		return BLK_MQ_RQ_QUEUE_ERROR;
 	}
 
-	if (last && virtqueue_kick_prepare(vblk->vq))
+	if (last && virtqueue_kick_prepare(vblk->vq[qid]))
 		notify = true;
-	spin_unlock_irqrestore(&vblk->vq_lock, flags);
+	spin_unlock_irqrestore(&vblk->vq_lock[qid], flags);
 
 	if (notify)
-		virtqueue_notify(vblk->vq);
+		virtqueue_notify(vblk->vq[qid]);
 	return BLK_MQ_RQ_QUEUE_OK;
 }
 
@@ -377,12 +384,35 @@ static void virtblk_config_changed(struct virtio_device *vdev)
 static int init_vq(struct virtio_blk *vblk)
 {
 	int err = 0;
+	int i;
+	vq_callback_t *callbacks[MAX_NUM_VQ];
+	const char *names[MAX_NUM_VQ];
+	unsigned short num_vqs;
+	struct virtio_device *vdev = vblk->vdev;
 
-	/* We expect one virtqueue, for output. */
-	vblk->vq = virtio_find_single_vq(vblk->vdev, virtblk_done, "requests");
-	if (IS_ERR(vblk->vq))
-		err = PTR_ERR(vblk->vq);
+	err = virtio_cread_feature(vdev, VIRTIO_BLK_F_MQ,
+				   struct virtio_blk_config, num_queues,
+				   &num_vqs);
+	if (err)
+		num_vqs = 1;
+	if (num_vqs > MAX_NUM_VQ)
+		num_vqs = MAX_NUM_VQ;
 
+	for (i = 0; i < num_vqs; i++) {
+		callbacks[i

Re: [RFC PATCH 2/2] block: virtio-blk: support multi virt queues per virtio-blk device

2014-06-17 Thread Ming Lei
On Tue, Jun 17, 2014 at 10:40 AM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Sat, Jun 14, 2014 at 1:29 AM, Ming Lei ming@canonical.com wrote:
 Firstly this patch supports more than one virtual queue for a virtio-blk
 device.

 Secondly this patch maps the virtual queue to blk-mq's hardware queue.

 With this approach, both the scalability and performance problems can be improved.

 Signed-off-by: Ming Lei ming@canonical.com
 ---
  drivers/block/virtio_blk.c |   75 
 
  1 file changed, 55 insertions(+), 20 deletions(-)

 diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
 index f63d358..e0d077d 100644
 --- a/drivers/block/virtio_blk.c
 +++ b/drivers/block/virtio_blk.c
 @@ -16,6 +16,8 @@

  #define PART_BITS 4

 +#define MAX_NUM_VQ 16

 It would be nice to allocate virtqueues dynamically instead of
 hardcoding the limit.  virtio-scsi also allocates virtqueues
 dynamically.

virtio-scsi may have lots of LUNs, but virtio-blk only has one disk,
which doesn't need lots of hardware queues.

Also, it doesn't matter, since it isn't part of the ABI.

If the change to virtio_blk_config is agreed on, both the host side
and the guest side are free to choose either a dynamic or a
pre-defined length.
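For illustration, here is a sketch of the dynamic-allocation direction that
the later v2/v3 patches take (kmalloc_array, the simplified error path, and
the per-vq struct fields from the v3 patch are assumptions here):

/* Sketch only: allocate the per-vq arrays dynamically instead of
 * using a fixed MAX_NUM_VQ; error handling is simplified.
 */
static int init_vq_dynamic(struct virtio_blk *vblk, unsigned short num_vqs)
{
	struct virtio_device *vdev = vblk->vdev;
	vq_callback_t **callbacks;
	const char **names;
	struct virtqueue **vqs;
	int i, err = -ENOMEM;

	callbacks = kmalloc_array(num_vqs, sizeof(*callbacks), GFP_KERNEL);
	names = kmalloc_array(num_vqs, sizeof(*names), GFP_KERNEL);
	vqs = kmalloc_array(num_vqs, sizeof(*vqs), GFP_KERNEL);
	if (!callbacks || !names || !vqs)
		goto out;

	for (i = 0; i < num_vqs; i++) {
		callbacks[i] = virtblk_done;
		snprintf(vblk->vqs[i].name, VQ_NAME_LEN, "req.%d", i);
		names[i] = vblk->vqs[i].name;
	}

	/* one call discovers all vqs; signature as of the 3.16 era */
	err = vdev->config->find_vqs(vdev, num_vqs, vqs, callbacks, names);
	if (!err)
		for (i = 0; i < num_vqs; i++)
			vblk->vqs[i].vq = vqs[i];
out:
	kfree(vqs);
	kfree(names);
	kfree(callbacks);
	return err;
}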

Thanks,
--
Ming Lei


Re: [RFC PATCH 2/2] block: virtio-blk: support multi virt queues per virtio-blk device

2014-06-17 Thread Ming Lei
On Tue, Jun 17, 2014 at 11:53 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 17/06/2014 17:50, Ming Lei wrote:

  It would be nice to allocate virtqueues dynamically instead of
  hardcoding the limit.  virtio-scsi also allocates virtqueues
  dynamically.

 virtio-scsi may have lots of LUN, but virtio-blk only has one disk
 which needn't lots of hardware queues.


 If you want to do queue steering based on the guest VCPU number, the number
 of queues must be >= the number of VCPUs, shouldn't it?

 I tried using a divisor of the number of VCPUs, but couldn't get the block
 layer to deliver interrupts to the right VCPU.

For blk-mq's hardware queues, the number doesn't have to equal the
VCPU count, and the irq affinity per hw queue can simply be set from
blk_mq_hw_ctx->cpumask.
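A sketch of that TODO item; vq_irq() below is a hypothetical helper
(virtio offered no generic way to look up a vq's MSI-X vector at the
time), so this is illustration only:

/* Sketch: align each vq's irq with its hw queue's cpumask. */
static void virtblk_set_irq_affinity(struct virtio_blk *vblk)
{
	struct request_queue *q = vblk->disk->queue;
	int i;

	for (i = 0; i < vblk->num_vqs; i++)
		irq_set_affinity_hint(vq_irq(vblk->vqs[i].vq),
				      q->queue_hw_ctx[i]->cpumask);
}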

Thanks,
--
Ming Lei


Re: [RFC PATCH 2/2] block: virtio-blk: support multi virt queues per virtio-blk device

2014-06-17 Thread Ming Lei
On Wed, Jun 18, 2014 at 12:34 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 17/06/2014 18:00, Ming Lei wrote:

  If you want to do queue steering based on the guest VCPU number, the
  number of queues must be >= the number of VCPUs, shouldn't it?
 
  I tried using a divisor of the number of VCPUs, but couldn't get the
  block layer to deliver interrupts to the right VCPU.

 For blk-mq's hardware queues, the number doesn't have to equal the
 VCPU count, and the irq affinity per hw queue can simply be set from
 blk_mq_hw_ctx->cpumask.


 Yes, but on top of that you want to have each request processed exactly by
 the CPU that sent it.  Unless the cpumasks are singletons, most of the
 benefit went away in my virtio-scsi tests.  Perhaps blk-mq is smarter.

 Can you try benchmarking a 16 VCPU guest with 8 and 16 queues?

From the VM side, it might be better to use one hardware queue per vCPU,
since in theory that removes vq lock contention.

But from the host side, there is still a disadvantage with more queues,
since more queues mean more notifications; in my virtio-blk test, even with
ioeventfd, one notification takes ~3us on average on qemu-system-x86_64.

For virtio-blk, I don't think it is always better to use more queues, and
we need to weigh the following things on the host side:

- the host storage's top performance: generally it is reached with more
than 1 job with libaio (suppose that is N, so basically we can use N
iothreads per device in qemu to try to reach the top performance)

- the iothreads' load (if the iothreads are already at full load,
increasing queues doesn't help at all)

In my test, I only use the current per-dev iothread (x-data-plane)
in qemu to handle the 2 vqs' notifications and process all I/O from
the 2 vqs, and it looks like that can improve IOPS by ~30%.

For virtio-scsi, the current usage doesn't make full use of blk-mq's
advantage either, because only one vq is active at a time, so I guess
the benefit of multiple vqs won't be very large; I'd like to post
patches to support that first, then provide test data with more
queues (8, 16).


Thanks,
--
Ming Lei


[RFC PATCH 1/2] include/uapi/linux/virtio_blk.h: introduce feature of VIRTIO_BLK_F_MQ

2014-06-13 Thread Ming Lei
The current virtio-blk spec only supports one virtual queue for transferring
data between VM and host, and inside the VM all kinds of operations on
the virtual queue need to hold one lock, which causes the problems below:

- no scalability
- bad throughput

So this patch proposes introducing the VIRTIO_BLK_F_MQ feature so that
more than one virtual queue can be used by a virtio-blk device; the
above problems can then be solved or eased.

Signed-off-by: Ming Lei ming@canonical.com
---
 include/uapi/linux/virtio_blk.h |4 
 1 file changed, 4 insertions(+)

diff --git a/include/uapi/linux/virtio_blk.h b/include/uapi/linux/virtio_blk.h
index 6d8e61c..c5a2751 100644
--- a/include/uapi/linux/virtio_blk.h
+++ b/include/uapi/linux/virtio_blk.h
@@ -40,6 +40,7 @@
 #define VIRTIO_BLK_F_WCE   9   /* Writeback mode enabled after reset */
 #define VIRTIO_BLK_F_TOPOLOGY  10  /* Topology information is available */
 #define VIRTIO_BLK_F_CONFIG_WCE	11	/* Writeback mode available in config */
+#define VIRTIO_BLK_F_MQ		12	/* support more than one vq */
 
 #ifndef __KERNEL__
 /* Old (deprecated) name for VIRTIO_BLK_F_WCE. */
@@ -77,6 +78,9 @@ struct virtio_blk_config {
 
/* writeback mode (if VIRTIO_BLK_F_CONFIG_WCE) */
__u8 wce;
+
+   /* number of vqs, only available when VIRTIO_BLK_F_MQ is set */
+   __u16 num_queues;
 } __attribute__((packed));
 
 /*
-- 
1.7.9.5



[RFC PATCH 2/2] block: virtio-blk: support multi virt queues per virtio-blk device

2014-06-13 Thread Ming Lei
Firstly this patch supports more than one virtual queue for a virtio-blk
device.

Secondly this patch maps the virtual queue to blk-mq's hardware queue.

With this approach, both the scalability and performance problems can be improved.

Signed-off-by: Ming Lei ming@canonical.com
---
 drivers/block/virtio_blk.c |   75 
 1 file changed, 55 insertions(+), 20 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index f63d358..e0d077d 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -16,6 +16,8 @@
 
 #define PART_BITS 4
 
+#define MAX_NUM_VQ 16
+
 static int major;
 static DEFINE_IDA(vd_index_ida);
 
@@ -24,8 +26,8 @@ static struct workqueue_struct *virtblk_wq;
 struct virtio_blk
 {
struct virtio_device *vdev;
-   struct virtqueue *vq;
-   spinlock_t vq_lock;
+   struct virtqueue *vq[MAX_NUM_VQ];
+   spinlock_t vq_lock[MAX_NUM_VQ];
 
/* The disk structure for the kernel. */
struct gendisk *disk;
@@ -47,6 +49,9 @@ struct virtio_blk
 
/* Ida index - used to track minor number allocations. */
int index;
+
+   /* num of vqs */
+   int num_vqs;
 };
 
 struct virtblk_req
@@ -133,14 +138,15 @@ static void virtblk_done(struct virtqueue *vq)
 {
 	struct virtio_blk *vblk = vq->vdev->priv;
 	bool req_done = false;
+	int qid = vq->index;
 	struct virtblk_req *vbr;
 	unsigned long flags;
 	unsigned int len;
 
-	spin_lock_irqsave(&vblk->vq_lock, flags);
+	spin_lock_irqsave(&vblk->vq_lock[qid], flags);
 	do {
 		virtqueue_disable_cb(vq);
-		while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
+		while ((vbr = virtqueue_get_buf(vblk->vq[qid], &len)) != NULL) {
 			blk_mq_complete_request(vbr->req);
 			req_done = true;
 		}
@@ -151,7 +157,7 @@ static void virtblk_done(struct virtqueue *vq)
 	/* In case queue is stopped waiting for more buffers. */
 	if (req_done)
 		blk_mq_start_stopped_hw_queues(vblk->disk->queue, true);
-	spin_unlock_irqrestore(&vblk->vq_lock, flags);
+	spin_unlock_irqrestore(&vblk->vq_lock[qid], flags);
 }
 
 static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
@@ -160,6 +166,7 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 	struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
 	unsigned long flags;
 	unsigned int num;
+	int qid = hctx->queue_num;
 	const bool last = (req->cmd_flags & REQ_END) != 0;
 	int err;
 	bool notify = false;
@@ -202,12 +209,12 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 		vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
 	}
 
-	spin_lock_irqsave(&vblk->vq_lock, flags);
-	err = __virtblk_add_req(vblk->vq, vbr, vbr->sg, num);
+	spin_lock_irqsave(&vblk->vq_lock[qid], flags);
+	err = __virtblk_add_req(vblk->vq[qid], vbr, vbr->sg, num);
 	if (err) {
-		virtqueue_kick(vblk->vq);
+		virtqueue_kick(vblk->vq[qid]);
 		blk_mq_stop_hw_queue(hctx);
-		spin_unlock_irqrestore(&vblk->vq_lock, flags);
+		spin_unlock_irqrestore(&vblk->vq_lock[qid], flags);
 		/* Out of mem doesn't actually happen, since we fall back
 		 * to direct descriptors */
 		if (err == -ENOMEM || err == -ENOSPC)
@@ -215,12 +222,12 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 		return BLK_MQ_RQ_QUEUE_ERROR;
 	}
 
-	if (last && virtqueue_kick_prepare(vblk->vq))
+	if (last && virtqueue_kick_prepare(vblk->vq[qid]))
 		notify = true;
-	spin_unlock_irqrestore(&vblk->vq_lock, flags);
+	spin_unlock_irqrestore(&vblk->vq_lock[qid], flags);
 
 	if (notify)
-		virtqueue_notify(vblk->vq);
+		virtqueue_notify(vblk->vq[qid]);
 	return BLK_MQ_RQ_QUEUE_OK;
 }
 
@@ -377,12 +384,40 @@ static void virtblk_config_changed(struct virtio_device *vdev)
 static int init_vq(struct virtio_blk *vblk)
 {
 	int err = 0;
+	int i;
+	vq_callback_t *callbacks[MAX_NUM_VQ];
+	const char *names[MAX_NUM_VQ];
+	unsigned short num_vqs;
+	struct virtio_device *vdev = vblk->vdev;
 
-	/* We expect one virtqueue, for output. */
-	vblk->vq = virtio_find_single_vq(vblk->vdev, virtblk_done, "requests");
-	if (IS_ERR(vblk->vq))
-		err = PTR_ERR(vblk->vq);
+	if (virtio_has_feature(vdev, VIRTIO_BLK_F_MQ))
+		err = virtio_cread_feature(vdev, VIRTIO_BLK_F_MQ,
+					   struct virtio_blk_config, num_queues,
+					   &num_vqs);
+	else
+		num_vqs = 1;
+
+	if (err)
+		goto out;
 
+	if (num_vqs > MAX_NUM_VQ

Re: [PATCH] block: virtio_blk: don't hold spin lock during world switch

2014-06-02 Thread Ming Lei
On Mon, Jun 2, 2014 at 9:23 AM, Rusty Russell ru...@rustcorp.com.au wrote:
 Jens Axboe ax...@kernel.dk writes:
 On 2014-05-30 00:10, Rusty Russell wrote:
 Jens Axboe ax...@kernel.dk writes:
 If Rusty agrees, I'd like to add it for 3.16 with a stable marker.

 Really stable?  It improves performance, which is nice.  But every patch
 which goes into the kernel fixes a bug, improves clarity, improves
 performance or adds a feature.  I've now seen all four cases get CC'd
 into stable.

 Including some of mine explicitly not marked stable which get swept up
 by enthusiastic stable maintainers :(

 Is now there *any* patch short of a major rewrite which shouldn't get
 cc: stable?

 I agree that there's sometimes an unfortunate trend there. I didn't
 check, but my assumption was that this is a regression after the blk-mq
 conversion, in which case I do think it belongs in stable.

 No, it's always been that way.  In the original driver the entire issue
 requests function was under the lock.

 It was your mq conversion which made this optimization possible, and
 also made it an optimization: now the other queues can continue while
 this one is going.

 But in any case, I think the patch is obviously correct and the wins are
 sufficiently large to warrant a stable inclusion even if it isn't a
 regression.

 If you're running SMP under an emulator where exits are expensive, then
 this wins.  Under KVM it's marginal at best.

Both my tests on arm64 and x86 are under KVM, and it looks like the
patch can improve performance a lot. IMO, even under KVM, virtio-blk
performance still depends on how well the hypervisor (qemu, ...)
emulates the device, and basically speaking, it is expensive to
switch from guest to host and let the host handle the notification.


Thanks,
--
Ming Lei


[PATCH] block: virtio_blk: don't hold spin lock during world switch

2014-05-30 Thread Ming Lei
Firstly, it isn't necessary to hold vblk->vq_lock when notifying
the hypervisor about queued I/O.

Secondly, virtqueue_notify() causes a world switch and it may take a
long time on some hypervisors (such as qemu-arm), so it isn't good to
hold the lock and block other vCPUs.

On arm64 quad core VM(qemu-kvm), the patch can increase I/O
performance a lot with VIRTIO_RING_F_EVENT_IDX enabled:
- without the patch: 14K IOPS
- with the patch: 34K IOPS

fio script:
[global]
direct=1
bsrange=4k-4k
timeout=10
numjobs=4
ioengine=libaio
iodepth=64

filename=/dev/vdc
group_reporting=1

[f1]
rw=randread

Cc: Rusty Russell ru...@rustcorp.com.au
Cc: Michael S. Tsirkin m...@redhat.com
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Ming Lei ming@canonical.com
---
 drivers/block/virtio_blk.c |9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 9f340fa..a6f5424 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -162,6 +162,7 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 	unsigned int num;
 	const bool last = (req->cmd_flags & REQ_END) != 0;
 	int err;
+	bool notify = false;
 
 	BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
 
@@ -214,10 +215,12 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 		return BLK_MQ_RQ_QUEUE_ERROR;
 	}
 
-	if (last)
-		virtqueue_kick(vblk->vq);
-
+	if (last && virtqueue_kick_prepare(vblk->vq))
+		notify = true;
 	spin_unlock_irqrestore(&vblk->vq_lock, flags);
+
+	if (notify)
+		virtqueue_notify(vblk->vq);
 	return BLK_MQ_RQ_QUEUE_OK;
 }
 
-- 
1.7.9.5



Re: [PATCH] block: virtio_blk: don't hold spin lock during world switch

2014-05-30 Thread Ming Lei
On Fri, May 30, 2014 at 11:19 AM, Jens Axboe ax...@kernel.dk wrote:
 On 2014-05-29 20:49, Ming Lei wrote:

 Firstly, it isn't necessary to hold lock of vblk-vq_lock
 when notifying hypervisor about queued I/O.

 Secondly, virtqueue_notify() will cause world switch and
 it may take long time on some hypervisors(such as, qemu-arm),
 so it isn't good to hold the lock and block other vCPUs.

 On arm64 quad core VM(qemu-kvm), the patch can increase I/O
 performance a lot with VIRTIO_RING_F_EVENT_IDX enabled:
 - without the patch: 14K IOPS
 - with the patch: 34K IOPS


 Patch looks good to me. I don't see a hit on my qemu-kvm testing, but it
 definitely makes sense and I can see it hurting in other places.

It isn't easy to observe the improvement on an x86 VM, especially
with few vCPUs, because qemu-system-x86_64 only takes several
microseconds to handle the notification, but on arm64 it may take
hundreds of microseconds, so the improvement is obvious on an arm VM.

I hope this patch can be merged; at least arm VMs can benefit
from it.


Thanks,
--
Ming Lei


Re: [PATCH] block: virtio_blk: don't hold spin lock during world switch

2014-05-30 Thread Ming Lei
On Fri, May 30, 2014 at 11:35 AM, Jens Axboe ax...@kernel.dk wrote:
 On 2014-05-29 21:34, Ming Lei wrote:

 On Fri, May 30, 2014 at 11:19 AM, Jens Axboe ax...@kernel.dk wrote:

 On 2014-05-29 20:49, Ming Lei wrote:


 Firstly, it isn't necessary to hold lock of vblk-vq_lock
 when notifying hypervisor about queued I/O.

 Secondly, virtqueue_notify() will cause world switch and
 it may take long time on some hypervisors(such as, qemu-arm),
 so it isn't good to hold the lock and block other vCPUs.

 On arm64 quad core VM(qemu-kvm), the patch can increase I/O
 performance a lot with VIRTIO_RING_F_EVENT_IDX enabled:
  - without the patch: 14K IOPS
  - with the patch: 34K IOPS



 Patch looks good to me. I don't see a hit on my qemu-kvm testing, but it
 definitely makes sense and I can see it hurting in other places.


 It isn't easy to observe the improvement on an x86 VM, especially
 with few vCPUs, because qemu-system-x86_64 only takes several
 microseconds to handle the notification, but on arm64 it may take
 hundreds of microseconds, so the improvement is obvious on an arm VM.

 I hope this patch can be merged; at least arm VMs can benefit
 from it.


 If Rusty agrees, I'd like to add it for 3.16 with a stable marker.

Interesting: even on x86, I can still observe the improvement when
numjobs is set to 2 in the fio script (see the commit log), but when
numjobs is set to 4, 8, or 12, the difference between the patched and
non-patched kernel isn't obvious.

1, environment
- host: 2 sockets, each CPU (4 cores, 2 threads), 16 logical cores total
- guest: 16 cores, 8GB ram
- guest kernel: 3.15-rc7-next with patch[1]
- fio: the script in the commit log with numjobs set to 2

2, result
- without the patch: ~104K IOPS
- with the patch: ~140K IOPS


Rusty, considering the same trick has been applied in virtio-scsi,
do you agree to take the same approach in virtio-blk too?


[1], http://marc.info/?l=linux-kernel&m=140135041423441&w=2

Thanks,
--
Ming Lei