Re: [PATCH v2 2/2] virtio_balloon: replace oom notifier with shrinker

2018-08-02 Thread Wei Wang

On 08/02/2018 07:47 PM, Michal Hocko wrote:

On Thu 02-08-18 18:32:44, Wei Wang wrote:

On 08/01/2018 07:34 PM, Michal Hocko wrote:

On Wed 01-08-18 19:12:25, Wei Wang wrote:

On 07/30/2018 05:00 PM, Michal Hocko wrote:

On Fri 27-07-18 17:24:55, Wei Wang wrote:

The OOM notifier is being deprecated, for the reasons mentioned
here by Michal Hocko: https://lkml.org/lkml/2018/7/12/314

This patch replaces the virtio-balloon oom notifier with a shrinker
to release balloon pages on memory pressure.

It would be great to document the replacement. This is not a small
change...

OK. I plan to document the following to the commit log:

The OOM notifier is being deprecated for the following reasons:
  - As a callout from the oom context, it is too subtle and easy to
    generate bugs and corner cases which are hard to track;
  - It is called too late (after the reclaiming has been performed).
    Drivers with a large amount of reclaimable memory are expected to
    release it at an early stage of memory pressure;
  - The notifier callback isn't aware of the oom constraints;
  Link: https://lkml.org/lkml/2018/7/12/314

  This patch replaces the virtio-balloon oom notifier with a shrinker
  to release balloon pages on memory pressure. Users can set the number
  of pages to release each time shrinker_scan is called via the module
  parameter balloon_pages_to_shrink; the default is 256 pages.
  Historically, the feature VIRTIO_BALLOON_F_DEFLATE_ON_OOM has been
  used to release balloon pages on OOM. We continue to use this feature
  bit for the shrinker, so the shrinker is only registered when this
  feature bit has been negotiated with the host.
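
For reference, here is a minimal sketch of how such a conditional shrinker
registration could look. This is only an illustration: the helper name
shrink_free_pages() and the vb->shrinker member are assumptions based on
the existing driver structure, not the actual patch.

#include <linux/shrinker.h>

/* Sketch: deflate a fixed batch of pages per scan. shrink_free_pages()
 * stands in for whatever helper the real patch uses to leak
 * 'balloon_pages_to_shrink' (the module parameter) pages back to mm. */
static unsigned long virtio_balloon_shrinker_scan(struct shrinker *s,
						  struct shrink_control *sc)
{
	struct virtio_balloon *vb = container_of(s, struct virtio_balloon,
						 shrinker);

	return shrink_free_pages(vb, balloon_pages_to_shrink);
}

static unsigned long virtio_balloon_shrinker_count(struct shrinker *s,
						   struct shrink_control *sc)
{
	struct virtio_balloon *vb = container_of(s, struct virtio_balloon,
						 shrinker);

	return vb->num_pages;	/* pages currently ballooned out */
}

static int virtio_balloon_register_shrinker(struct virtio_balloon *vb)
{
	/* Register only when the host negotiated DEFLATE_ON_OOM. */
	if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
		return 0;

	vb->shrinker.scan_objects = virtio_balloon_shrinker_scan;
	vb->shrinker.count_objects = virtio_balloon_shrinker_count;
	vb->shrinker.seeks = DEFAULT_SEEKS;

	return register_shrinker(&vb->shrinker);
}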

Do you have any numbers for how this works in practice?

It works this way: for example, we can set the parameter
balloon_pages_to_shrink to shrink 1GB of memory each time shrink scan is
called. Now, say we have an 8GB guest and we have ballooned out 7GB. When
shrink scan is called, the balloon driver takes back 1GB and gives it back
to mm, so the ballooned memory becomes 6GB.

When the shrinker scan is called a second time, another 1GB is given back
to mm. So the ballooned pages are given back to mm gradually.


Let's say
you have a medium page cache workload which triggers kswapd to do a
light reclaim? Hardcoded shrinking sounds quite dubious to me but I have
no idea how people expect this to work. Shouldn't this be more
adaptive? How precious are those pages anyway?

Those pages are usually given to the host because the guest has enough
free memory, and the host doesn't want to waste memory that this guest
isn't using. When the guest needs the pages again, it is reasonable for
the guest to have higher priority in taking them back.
But I'm not sure whether there is a more adaptive approach than "gradually
giving back as the guest wants more".

I am not sure I follow. Let me be more specific. Say you have a trivial
stream IO triggering reclaim to recycle clean page cache. This will
invoke slab shrinkers as well. Do you really want to drop your batch of
pages on each invocation? Doesn't that remove them very quickly? Just
try to dd if=large_file of=/dev/null and see how your pages are
disappearing. Shrinkers usually scale the number of objects they are
going to reclaim based on the memory pressure (aka the target to be
reclaimed).
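
Put differently, an adaptive scan callback would honour sc->nr_to_scan
instead of a hard-coded batch, roughly along these lines (again only a
sketch, reusing the assumed shrink_free_pages() helper, not the posted
patch):

/* Sketch: deflate at most sc->nr_to_scan pages, i.e. scale the work with
 * the reclaim target computed by the shrinker core instead of using a
 * fixed module-parameter batch. */
static unsigned long virtio_balloon_shrinker_scan(struct shrinker *s,
						  struct shrink_control *sc)
{
	struct virtio_balloon *vb = container_of(s, struct virtio_balloon,
						 shrinker);
	unsigned long pages_to_free, pages_freed;

	pages_to_free = min_t(unsigned long, sc->nr_to_scan, vb->num_pages);
	pages_freed = shrink_free_pages(vb, pages_to_free);

	return pages_freed ? pages_freed : SHRINK_STOP;
}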


Thanks, I think it looks better to make it more adaptive. I'll send out 
a new version for review.


Best,
Wei




Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()

2018-08-02 Thread Toshiaki Makita
On 2018/08/03 14:07, Jason Wang wrote:
> On 2018年08月03日 12:04, Tonghao Zhang wrote:
>> On Fri, Aug 3, 2018 at 11:43 AM Jason Wang  wrote:
>>>
>>> On 2018年08月03日 11:24, Tonghao Zhang wrote:
 On Fri, Aug 3, 2018 at 11:07 AM Jason Wang  wrote:
> On 2018年08月03日 10:51, Tonghao Zhang wrote:
>> On Thu, Aug 2, 2018 at 5:23 PM Jason Wang 
>> wrote:
>>> On 2018年08月02日 16:41, Toshiaki Makita wrote:
 On 2018/08/02 17:18, Jason Wang wrote:
> On 2018年08月01日 17:52, Tonghao Zhang wrote:
>>> +static void vhost_net_busy_poll_check(struct vhost_net *net,
>>> +   struct vhost_virtqueue *rvq,
>>> +   struct vhost_virtqueue *tvq,
>>> +   bool rx)
>>> +{
>>> + struct socket *sock = rvq->private_data;
>>> +
>>> + if (rx)
>>> + vhost_net_busy_poll_try_queue(net, tvq);
>>> + else if (sock && sk_has_rx_data(sock->sk))
>>> + vhost_net_busy_poll_try_queue(net, rvq);
>>> + else {
>>> + /* On tx here, sock has no rx data, so we
>>> +  * will wait for sock wakeup for rx, and
>>> +  * vhost_enable_notify() is not needed. */
>> A possible case is we do have rx data but guest does not
>> refill the rx
>> queue. In this case we may lose notifications from guest.
> Yes, should consider this case. thanks.
 I'm a bit confused. Isn't this covered by the previous
 "else if (sock && sk_has_rx_data(...))" block?
>>> The problem is it does nothing if vhost_vq_avail_empty() is true and
>>> vhost_enble_notify() is false.
>>>
 +
 + cpu_relax();
 + }
 +
 + preempt_enable();
 +
 + if (!rx)
 + vhost_net_enable_vq(net, vq);
>>> No need to enable rx virtqueue, if we are sure handle_rx()
>>> will be
>>> called soon.
>> If we disable rx virtqueue in handle_tx and don't send packets
>> from
>> guest anymore(handle_tx is not called), so we can wake up for
>> sock rx.
>> so the network is broken.
> Not sure I understand here. I mean is we schedule work for
> handle_rx(),
> there's no need to enable it since handle_rx() will do this for
> us.
 Looks like in the last "else" block in
 vhost_net_busy_poll_check() we
 need to enable vq since in that case we have no rx data and
 handle_rx()
 is not scheduled.

>>> Yes.
>> So we will use the vhost_has_work() to check whether or not the
>> handle_rx is scheduled ?
>> If we use the vhost_has_work(), the work in the dev work_list may be
>> rx work, or tx work, right ?
> Yes. We can add a boolean to record whether or not we've called
> vhost_poll_queue() for rvq. And avoid calling vhost_net_enable_vq() if
> it was true.
 so, the commit be294a51a "vhost_net: Avoid rx queue wake-ups during
 busypoll"
 may not consider the case: work is tx work in the dev work list.
>>> So two kinds of work, tx kick or tx wakeup.
>>>
>>> For tx kick, we check vhost_vq_avail_empty() and avoid unnecessary kicks
>>> by not enabling kick if we found something is pending on txq. For tx
>>> wakeup, yes, the commit does not consider it. And that's why we want to
>>> disable tx wakeups during busy polling.
>> And in the handle_rx but not busy polling, the tx can wakeup anytime
>> and the tx work will be added to dev work list. In that case, if we
>> add
>> the rx poll to the queue, it is necessary ? the commit be294a51a may
>> check whether the rx work is in the dev work list.
> 
> I think the point this we don't poll rx during tx at that time. So if rx
> poll is interrupted, we should reschedule handle_rx(). After we poll rx
> on handle_tx(), we can try to optimize this on top.

That's true. We may be able to skip poll_queue in handle_rx/tx after
rx/tx busypolling is unified by this patch set.

-- 
Toshiaki Makita


Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()

2018-08-02 Thread Jason Wang



On 2018年08月03日 12:04, Tonghao Zhang wrote:

On Fri, Aug 3, 2018 at 11:43 AM Jason Wang  wrote:


On 2018年08月03日 11:24, Tonghao Zhang wrote:

On Fri, Aug 3, 2018 at 11:07 AM Jason Wang  wrote:

On 2018年08月03日 10:51, Tonghao Zhang wrote:

On Thu, Aug 2, 2018 at 5:23 PM Jason Wang  wrote:

On 2018年08月02日 16:41, Toshiaki Makita wrote:

On 2018/08/02 17:18, Jason Wang wrote:

On 2018年08月01日 17:52, Tonghao Zhang wrote:

+static void vhost_net_busy_poll_check(struct vhost_net *net,
+   struct vhost_virtqueue *rvq,
+   struct vhost_virtqueue *tvq,
+   bool rx)
+{
+ struct socket *sock = rvq->private_data;
+
+ if (rx)
+ vhost_net_busy_poll_try_queue(net, tvq);
+ else if (sock && sk_has_rx_data(sock->sk))
+ vhost_net_busy_poll_try_queue(net, rvq);
+ else {
+ /* On tx here, sock has no rx data, so we
+  * will wait for sock wakeup for rx, and
+  * vhost_enable_notify() is not needed. */

A possible case is we do have rx data but guest does not refill the rx
queue. In this case we may lose notifications from guest.

Yes, should consider this case. thanks.

I'm a bit confused. Isn't this covered by the previous
"else if (sock && sk_has_rx_data(...))" block?

The problem is it does nothing if vhost_vq_avail_empty() is true and
vhost_enable_notify() is false.


+
+ cpu_relax();
+ }
+
+ preempt_enable();
+
+ if (!rx)
+ vhost_net_enable_vq(net, vq);

No need to enable rx virtqueue, if we are sure handle_rx() will be
called soon.

If we disable rx virtqueue in handle_tx and don't send packets from
guest anymore(handle_tx is not called), so we can wake up for sock rx.
so the network is broken.

Not sure I understand here. I mean if we schedule work for handle_rx(),
there's no need to enable it since handle_rx() will do this for us.

Looks like in the last "else" block in vhost_net_busy_poll_check() we
need to enable vq since in that case we have no rx data and handle_rx()
is not scheduled.


Yes.

So we will use the vhost_has_work() to check whether or not the
handle_rx is scheduled ?
If we use the vhost_has_work(), the work in the dev work_list may be
rx work, or tx work, right ?

Yes. We can add a boolean to record whether or not we've called
vhost_poll_queue() for rvq. And avoid calling vhost_net_enable_vq() if
it was true.

so, the commit be294a51a "vhost_net: Avoid rx queue wake-ups during busypoll"
may not consider the case: work is tx work in the dev work list.

So two kinds of work, tx kick or tx wakeup.

For tx kick, we check vhost_vq_avail_empty() and avoid unnecessary kicks
by not enabling kick if we found something is pending on txq. For tx
wakeup, yes, the commit does not consider it. And that's why we want to
disable tx wakeups during busy polling.

And in the handle_rx but not busy polling, the tx can wakeup anytime
and the tx work will be added to dev work list. In that case, if we
add
the rx poll to the queue, it is necessary ? the commit be294a51a may
check whether the rx work is in the dev work list.


I think the point is we don't poll rx during tx at that time. So if rx
poll is interrupted, we should reschedule handle_rx(). After we poll rx
in handle_tx(), we can try to optimize this on top.


Thanks


Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()

2018-08-02 Thread Toshiaki Makita
On 2018/08/03 13:14, Tonghao Zhang wrote:
> On Fri, Aug 3, 2018 at 11:40 AM Toshiaki Makita
>  wrote:
>>
>> On 2018/08/03 12:24, Tonghao Zhang wrote:
>>> On Fri, Aug 3, 2018 at 11:07 AM Jason Wang  wrote:
 On 2018年08月03日 10:51, Tonghao Zhang wrote:
> On Thu, Aug 2, 2018 at 5:23 PM Jason Wang  wrote:
>> On 2018年08月02日 16:41, Toshiaki Makita wrote:
>>> On 2018/08/02 17:18, Jason Wang wrote:
 On 2018年08月01日 17:52, Tonghao Zhang wrote:
>> +static void vhost_net_busy_poll_check(struct vhost_net *net,
>> +   struct vhost_virtqueue *rvq,
>> +   struct vhost_virtqueue *tvq,
>> +   bool rx)
>> +{
>> + struct socket *sock = rvq->private_data;
>> +
>> + if (rx)
>> + vhost_net_busy_poll_try_queue(net, tvq);
>> + else if (sock && sk_has_rx_data(sock->sk))
>> + vhost_net_busy_poll_try_queue(net, rvq);
>> + else {
>> + /* On tx here, sock has no rx data, so we
>> +  * will wait for sock wakeup for rx, and
>> +  * vhost_enable_notify() is not needed. */
> A possible case is we do have rx data but guest does not refill the rx
> queue. In this case we may lose notifications from guest.
 Yes, should consider this case. thanks.
>>> I'm a bit confused. Isn't this covered by the previous
>>> "else if (sock && sk_has_rx_data(...))" block?
>> The problem is it does nothing if vhost_vq_avail_empty() is true and
>> vhost_enble_notify() is false.
>>
>>> +
>>> + cpu_relax();
>>> + }
>>> +
>>> + preempt_enable();
>>> +
>>> + if (!rx)
>>> + vhost_net_enable_vq(net, vq);
>> No need to enable rx virtqueue, if we are sure handle_rx() will be
>> called soon.
> If we disable rx virtqueue in handle_tx and don't send packets from
> guest anymore(handle_tx is not called), so we can wake up for sock rx.
> so the network is broken.
 Not sure I understand here. I mean if we schedule work for handle_rx(),
 there's no need to enable it since handle_rx() will do this for us.
>>> Looks like in the last "else" block in vhost_net_busy_poll_check() we
>>> need to enable vq since in that case we have no rx data and handle_rx()
>>> is not scheduled.
>>>
>> Yes.
> So we will use the vhost_has_work() to check whether or not the
> handle_rx is scheduled ?
> If we use the vhost_has_work(), the work in the dev work_list may be
> rx work, or tx work, right ?

 Yes. We can add a boolean to record whether or not we've called
 vhost_poll_queue() for rvq. And avoid calling vhost_net_enable_vq() if
 it was true.
>>> so, the commit be294a51a "vhost_net: Avoid rx queue wake-ups during 
>>> busypoll"
>>> may not consider the case: work is tx work in the dev work list.
>>
>> Not sure what you are concerned but what I can say is that we need to
>> poll rx work if vhost_has_work() detects tx work in
>> vhost_net_rx_peek_head_len() since rx busypoll exits prematurely in that
>> case.
> In the handle_rx, when we busy poll, the vhost_has_work() return true,
> because the tx but not rx work is in the dev work list.
> and it is the most case, because tx work may be added to dev work list
> anytime(not during busy poll) when guest kick the vhost-net.
> so it is not necessary to add it., right ?

I'm lost.
What is the part you think is not needed?

1. When there is a tx work we exit rx busypoll.
2. When we exit rx busypoll by tx work, we poll rx work (so that we can
   continue rx busypoll later).
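
In code terms, a simplified illustration of those two points (pseudo-flow
only, not a verbatim copy of the code after be294a51a):

	unsigned long endtime = busy_clock() + rvq->busyloop_timeout;
	bool busyloop_intr = false;

	/* (1) Exit rx busypoll as soon as there is other vhost work
	 *     (e.g. a tx kick) pending on the device. */
	while (busy_clock() < endtime) {
		if (vhost_has_work(&net->dev)) {
			busyloop_intr = true;
			break;
		}
		if (sk_has_rx_data(sk))
			break;
		cpu_relax();
	}

	/* (2) If the busypoll was interrupted by that work, queue the rx
	 *     work so rx busypolling can resume once that work is done. */
	if (busyloop_intr)
		vhost_poll_queue(&rvq->poll);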

-- 
Toshiaki Makita


Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()

2018-08-02 Thread Tonghao Zhang
On Fri, Aug 3, 2018 at 11:40 AM Toshiaki Makita
 wrote:
>
> On 2018/08/03 12:24, Tonghao Zhang wrote:
> > On Fri, Aug 3, 2018 at 11:07 AM Jason Wang  wrote:
> >> On 2018年08月03日 10:51, Tonghao Zhang wrote:
> >>> On Thu, Aug 2, 2018 at 5:23 PM Jason Wang  wrote:
>  On 2018年08月02日 16:41, Toshiaki Makita wrote:
> > On 2018/08/02 17:18, Jason Wang wrote:
> >> On 2018年08月01日 17:52, Tonghao Zhang wrote:
>  +static void vhost_net_busy_poll_check(struct vhost_net *net,
>  +   struct vhost_virtqueue *rvq,
>  +   struct vhost_virtqueue *tvq,
>  +   bool rx)
>  +{
>  + struct socket *sock = rvq->private_data;
>  +
>  + if (rx)
>  + vhost_net_busy_poll_try_queue(net, tvq);
>  + else if (sock && sk_has_rx_data(sock->sk))
>  + vhost_net_busy_poll_try_queue(net, rvq);
>  + else {
>  + /* On tx here, sock has no rx data, so we
>  +  * will wait for sock wakeup for rx, and
>  +  * vhost_enable_notify() is not needed. */
> >>> A possible case is we do have rx data but guest does not refill the rx
> >>> queue. In this case we may lose notifications from guest.
> >> Yes, should consider this case. thanks.
> > I'm a bit confused. Isn't this covered by the previous
> > "else if (sock && sk_has_rx_data(...))" block?
>  The problem is it does nothing if vhost_vq_avail_empty() is true and
>  vhost_enble_notify() is false.
> 
> > +
> > + cpu_relax();
> > + }
> > +
> > + preempt_enable();
> > +
> > + if (!rx)
> > + vhost_net_enable_vq(net, vq);
>  No need to enable rx virtqueue, if we are sure handle_rx() will be
>  called soon.
> >>> If we disable rx virtqueue in handle_tx and don't send packets from
> >>> guest anymore(handle_tx is not called), so we can wake up for sock rx.
> >>> so the network is broken.
> >> Not sure I understand here. I mean is we schedule work for handle_rx(),
> >> there's no need to enable it since handle_rx() will do this for us.
> > Looks like in the last "else" block in vhost_net_busy_poll_check() we
> > need to enable vq since in that case we have no rx data and handle_rx()
> > is not scheduled.
> >
>  Yes.
> >>> So we will use the vhost_has_work() to check whether or not the
> >>> handle_rx is scheduled ?
> >>> If we use the vhost_has_work(), the work in the dev work_list may be
> >>> rx work, or tx work, right ?
> >>
> >> Yes. We can add a boolean to record whether or not we've called
> >> vhost_poll_queue() for rvq. And avoid calling vhost_net_enable_vq() if
> >> it was true.
> > so, the commit be294a51a "vhost_net: Avoid rx queue wake-ups during 
> > busypoll"
> > may not consider the case: work is tx work in the dev work list.
>
> Not sure what you are concerned but what I can say is that we need to
> poll rx work if vhost_has_work() detects tx work in
> vhost_net_rx_peek_head_len() since rx busypoll exits prematurely in that
> case.
In handle_rx, when we busy poll, vhost_has_work() returns true because the
tx work (not rx work) is in the dev work list. And that is the most common
case, because tx work may be added to the dev work list at any time (not
during busy poll) when the guest kicks vhost-net.
So it is not necessary to add it, right?

> >> So here's the needed changes:
> >>
> >> 1) Split the wakeup disabling to another patch
> >> 2) Squash the vhost_net_busy_poll_try_queue() and
> >> vhost_net_busy_poll_check() into vhost_net_busy_poll() and reduce
> >> duplicated checks.
> >> 3) If possible, rename the boolean rx in vhost_net_busy_poll() to
> >> poll_rx, this makes code more readable.
> > OK
> >> Thanks
> >>
>  Thanks
> >>
> >
> >
>
> --
> Toshiaki Makita
>

Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()

2018-08-02 Thread Tonghao Zhang
On Fri, Aug 3, 2018 at 11:43 AM Jason Wang  wrote:
>
>
>
> On 2018年08月03日 11:24, Tonghao Zhang wrote:
> > On Fri, Aug 3, 2018 at 11:07 AM Jason Wang  wrote:
> >>
> >>
> >> On 2018年08月03日 10:51, Tonghao Zhang wrote:
> >>> On Thu, Aug 2, 2018 at 5:23 PM Jason Wang  wrote:
> 
>  On 2018年08月02日 16:41, Toshiaki Makita wrote:
> > On 2018/08/02 17:18, Jason Wang wrote:
> >> On 2018年08月01日 17:52, Tonghao Zhang wrote:
>  +static void vhost_net_busy_poll_check(struct vhost_net *net,
>  +   struct vhost_virtqueue *rvq,
>  +   struct vhost_virtqueue *tvq,
>  +   bool rx)
>  +{
>  + struct socket *sock = rvq->private_data;
>  +
>  + if (rx)
>  + vhost_net_busy_poll_try_queue(net, tvq);
>  + else if (sock && sk_has_rx_data(sock->sk))
>  + vhost_net_busy_poll_try_queue(net, rvq);
>  + else {
>  + /* On tx here, sock has no rx data, so we
>  +  * will wait for sock wakeup for rx, and
>  +  * vhost_enable_notify() is not needed. */
> >>> A possible case is we do have rx data but guest does not refill the rx
> >>> queue. In this case we may lose notifications from guest.
> >> Yes, should consider this case. thanks.
> > I'm a bit confused. Isn't this covered by the previous
> > "else if (sock && sk_has_rx_data(...))" block?
>  The problem is it does nothing if vhost_vq_avail_empty() is true and
>  vhost_enble_notify() is false.
> 
> > +
> > + cpu_relax();
> > + }
> > +
> > + preempt_enable();
> > +
> > + if (!rx)
> > + vhost_net_enable_vq(net, vq);
>  No need to enable rx virtqueue, if we are sure handle_rx() will be
>  called soon.
> >>> If we disable rx virtqueue in handle_tx and don't send packets from
> >>> guest anymore(handle_tx is not called), so we can wake up for sock rx.
> >>> so the network is broken.
> >> Not sure I understand here. I mean is we schedule work for handle_rx(),
> >> there's no need to enable it since handle_rx() will do this for us.
> > Looks like in the last "else" block in vhost_net_busy_poll_check() we
> > need to enable vq since in that case we have no rx data and handle_rx()
> > is not scheduled.
> >
>  Yes.
> >>> So we will use the vhost_has_work() to check whether or not the
> >>> handle_rx is scheduled ?
> >>> If we use the vhost_has_work(), the work in the dev work_list may be
> >>> rx work, or tx work, right ?
> >> Yes. We can add a boolean to record whether or not we've called
> >> vhost_poll_queue() for rvq. And avoid calling vhost_net_enable_vq() if
> >> it was true.
> > so, the commit be294a51a "vhost_net: Avoid rx queue wake-ups during 
> > busypoll"
> > may not consider the case: work is tx work in the dev work list.
>
> So two kinds of work, tx kick or tx wakeup.
>
> For tx kick, we check vhost_vq_avail_empty() and avoid unnecessary kicks
> by not enabling kick if we found something is pending on txq. For tx
> wakeup, yes, the commit does not consider it. And that's why we want to
> disable tx wakeups during busy polling.
And in handle_rx, when not busy polling, tx can wake up at any time and
the tx work will be added to the dev work list. In that case, is it
necessary to add the rx poll to the queue? The commit be294a51a may
check whether the rx work is in the dev work list.
> Thanks
>
> >
> >> So here's the needed changes:
> >>
> >> 1) Split the wakeup disabling to another patch
> >> 2) Squash the vhost_net_busy_poll_try_queue() and
> >> vhost_net_busy_poll_check() into vhost_net_busy_poll() and reduce
> >> duplicated checks.
> >> 3) If possible, rename the boolean rx in vhost_net_busy_poll() to
> >> poll_rx, this makes code more readable.
> > OK
> >> Thanks
> >>
>  Thanks
>

Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()

2018-08-02 Thread Jason Wang



On 2018年08月03日 11:32, Toshiaki Makita wrote:

On 2018/08/03 12:07, Jason Wang wrote:

On 2018年08月02日 17:23, Jason Wang wrote:

No need to enable rx virtqueue, if we are sure handle_rx() will be
called soon.

If we disable rx virtqueue in handle_tx and don't send packets from
guest anymore(handle_tx is not called), so we can wake up for sock rx.
so the network is broken.

Not sure I understand here. I mean if we schedule work for handle_rx(),
there's no need to enable it since handle_rx() will do this for us.

Looks like in the last "else" block in vhost_net_busy_poll_check() we
need to enable vq since in that case we have no rx data and handle_rx()
is not scheduled.


Rethink about this, looks not. We enable rx wakeups in this case, so if
there's pending data, handle_rx() will be schedule after
vhost_net_enable_vq().

You are right, but what I wanted to say is vhost_net_enable_vq() should
be needed (I was talking about what would happen if
vhost_net_enable_vq() were removed). Also, I think we should move
vhost_net_enable_vq() from vhost_net_busy_poll() to this last "else"
block because this is the case where rx wakeups is required.
Anyway this part will be refactored so let's see what this code will
look like in next version.



I get your point.

Thanks


Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()

2018-08-02 Thread Toshiaki Makita
On 2018/08/03 12:24, Tonghao Zhang wrote:
> On Fri, Aug 3, 2018 at 11:07 AM Jason Wang  wrote:
>> On 2018年08月03日 10:51, Tonghao Zhang wrote:
>>> On Thu, Aug 2, 2018 at 5:23 PM Jason Wang  wrote:
 On 2018年08月02日 16:41, Toshiaki Makita wrote:
> On 2018/08/02 17:18, Jason Wang wrote:
>> On 2018年08月01日 17:52, Tonghao Zhang wrote:
 +static void vhost_net_busy_poll_check(struct vhost_net *net,
 +   struct vhost_virtqueue *rvq,
 +   struct vhost_virtqueue *tvq,
 +   bool rx)
 +{
 + struct socket *sock = rvq->private_data;
 +
 + if (rx)
 + vhost_net_busy_poll_try_queue(net, tvq);
 + else if (sock && sk_has_rx_data(sock->sk))
 + vhost_net_busy_poll_try_queue(net, rvq);
 + else {
 + /* On tx here, sock has no rx data, so we
 +  * will wait for sock wakeup for rx, and
 +  * vhost_enable_notify() is not needed. */
>>> A possible case is we do have rx data but guest does not refill the rx
>>> queue. In this case we may lose notifications from guest.
>> Yes, should consider this case. thanks.
> I'm a bit confused. Isn't this covered by the previous
> "else if (sock && sk_has_rx_data(...))" block?
 The problem is it does nothing if vhost_vq_avail_empty() is true and
 vhost_enable_notify() is false.

> +
> + cpu_relax();
> + }
> +
> + preempt_enable();
> +
> + if (!rx)
> + vhost_net_enable_vq(net, vq);
 No need to enable rx virtqueue, if we are sure handle_rx() will be
 called soon.
>>> If we disable rx virtqueue in handle_tx and don't send packets from
>>> guest anymore(handle_tx is not called), so we can wake up for sock rx.
>>> so the network is broken.
>> Not sure I understand here. I mean is we schedule work for handle_rx(),
>> there's no need to enable it since handle_rx() will do this for us.
> Looks like in the last "else" block in vhost_net_busy_poll_check() we
> need to enable vq since in that case we have no rx data and handle_rx()
> is not scheduled.
>
 Yes.
>>> So we will use the vhost_has_work() to check whether or not the
>>> handle_rx is scheduled ?
>>> If we use the vhost_has_work(), the work in the dev work_list may be
>>> rx work, or tx work, right ?
>>
>> Yes. We can add a boolean to record whether or not we've called
>> vhost_poll_queue() for rvq. And avoid calling vhost_net_enable_vq() if
>> it was true.
> so, the commit be294a51a "vhost_net: Avoid rx queue wake-ups during busypoll"
> may not consider the case: work is tx work in the dev work list.

Not sure what you are concerned but what I can say is that we need to
poll rx work if vhost_has_work() detects tx work in
vhost_net_rx_peek_head_len() since rx busypoll exits prematurely in that
case.

>> So here's the needed changes:
>>
>> 1) Split the wakeup disabling to another patch
>> 2) Squash the vhost_net_busy_poll_try_queue() and
>> vhost_net_busy_poll_check() into vhost_net_busy_poll() and reduce
>> duplicated checks.
>> 3) If possible, rename the boolean rx in vhost_net_busy_poll() to
>> poll_rx, this makes code more readable.
> OK
>> Thanks
>>
 Thanks
>>
> 
> 

-- 
Toshiaki Makita


Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()

2018-08-02 Thread Toshiaki Makita
On 2018/08/03 12:07, Jason Wang wrote:
> On 2018年08月02日 17:23, Jason Wang wrote:
>>>
>> No need to enable rx virtqueue, if we are sure handle_rx() will be
>> called soon.
> If we disable rx virtqueue in handle_tx and don't send packets from
> guest anymore(handle_tx is not called), so we can wake up for sock rx.
> so the network is broken.
 Not sure I understand here. I mean if we schedule work for handle_rx(),
 there's no need to enable it since handle_rx() will do this for us.
>>> Looks like in the last "else" block in vhost_net_busy_poll_check() we
>>> need to enable vq since in that case we have no rx data and handle_rx()
>>> is not scheduled.
>>>
> Rethink about this, looks not. We enable rx wakeups in this case, so if
> there's pending data, handle_rx() will be schedule after
> vhost_net_enable_vq().

You are right, but what I wanted to say is vhost_net_enable_vq() should
be needed (I was talking about what would happen if
vhost_net_enable_vq() were removed). Also, I think we should move
vhost_net_enable_vq() from vhost_net_busy_poll() to this last "else"
block because this is the case where rx wakeups are required.
Anyway this part will be refactored, so let's see what this code will
look like in the next version.
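
Concretely, the shape being suggested is something like the following
(only a sketch of the direction, using the names from this patch; the
actual code will come with the next version of the series):

static void vhost_net_busy_poll_check(struct vhost_net *net,
				      struct vhost_virtqueue *rvq,
				      struct vhost_virtqueue *tvq,
				      bool rx)
{
	struct socket *sock = rvq->private_data;

	if (rx) {
		vhost_net_busy_poll_try_queue(net, tvq);
	} else if (sock && sk_has_rx_data(sock->sk)) {
		vhost_net_busy_poll_try_queue(net, rvq);
	} else {
		/* tx path with no rx data and no rx work queued: this is
		 * the only case that needs rx wakeups re-enabled, so the
		 * vhost_net_enable_vq() call moves here from the caller. */
		vhost_net_enable_vq(net, rvq);
	}
}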

-- 
Toshiaki Makita


Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()

2018-08-02 Thread Tonghao Zhang
On Fri, Aug 3, 2018 at 11:07 AM Jason Wang  wrote:
>
>
>
> On 2018年08月03日 10:51, Tonghao Zhang wrote:
> > On Thu, Aug 2, 2018 at 5:23 PM Jason Wang  wrote:
> >>
> >>
> >> On 2018年08月02日 16:41, Toshiaki Makita wrote:
> >>> On 2018/08/02 17:18, Jason Wang wrote:
>  On 2018年08月01日 17:52, Tonghao Zhang wrote:
> >> +static void vhost_net_busy_poll_check(struct vhost_net *net,
> >> +   struct vhost_virtqueue *rvq,
> >> +   struct vhost_virtqueue *tvq,
> >> +   bool rx)
> >> +{
> >> + struct socket *sock = rvq->private_data;
> >> +
> >> + if (rx)
> >> + vhost_net_busy_poll_try_queue(net, tvq);
> >> + else if (sock && sk_has_rx_data(sock->sk))
> >> + vhost_net_busy_poll_try_queue(net, rvq);
> >> + else {
> >> + /* On tx here, sock has no rx data, so we
> >> +  * will wait for sock wakeup for rx, and
> >> +  * vhost_enable_notify() is not needed. */
> > A possible case is we do have rx data but guest does not refill the rx
> > queue. In this case we may lose notifications from guest.
>  Yes, should consider this case. thanks.
> >>> I'm a bit confused. Isn't this covered by the previous
> >>> "else if (sock && sk_has_rx_data(...))" block?
> >> The problem is it does nothing if vhost_vq_avail_empty() is true and
> >> vhost_enble_notify() is false.
> >>
> >>> +
> >>> + cpu_relax();
> >>> + }
> >>> +
> >>> + preempt_enable();
> >>> +
> >>> + if (!rx)
> >>> + vhost_net_enable_vq(net, vq);
> >> No need to enable rx virtqueue, if we are sure handle_rx() will be
> >> called soon.
> > If we disable rx virtqueue in handle_tx and don't send packets from
> > guest anymore(handle_tx is not called), so we can wake up for sock rx.
> > so the network is broken.
 Not sure I understand here. I mean if we schedule work for handle_rx(),
>  there's no need to enable it since handle_rx() will do this for us.
> >>> Looks like in the last "else" block in vhost_net_busy_poll_check() we
> >>> need to enable vq since in that case we have no rx data and handle_rx()
> >>> is not scheduled.
> >>>
> >> Yes.
> > So we will use the vhost_has_work() to check whether or not the
> > handle_rx is scheduled ?
> > If we use the vhost_has_work(), the work in the dev work_list may be
> > rx work, or tx work, right ?
>
> Yes. We can add a boolean to record whether or not we've called
> vhost_poll_queue() for rvq. And avoid calling vhost_net_enable_vq() if
> it was true.
So, the commit be294a51a ("vhost_net: Avoid rx queue wake-ups during busypoll")
may not consider the case where the work in the dev work list is tx work.

> So here's the needed changes:
>
> 1) Split the wakeup disabling to another patch
> 2) Squash the vhost_net_busy_poll_try_queue() and
> vhost_net_busy_poll_check() into vhost_net_busy_poll() and reduce
> duplicated checks.
> 3) If possible, rename the boolean rx in vhost_net_busy_poll() to
> poll_rx, this makes code more readable.
OK
> Thanks
>
> >> Thanks
>

Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()

2018-08-02 Thread Jason Wang



On 2018年08月03日 10:51, Tonghao Zhang wrote:

On Thu, Aug 2, 2018 at 5:23 PM Jason Wang  wrote:



On 2018年08月02日 16:41, Toshiaki Makita wrote:

On 2018/08/02 17:18, Jason Wang wrote:

On 2018年08月01日 17:52, Tonghao Zhang wrote:

+static void vhost_net_busy_poll_check(struct vhost_net *net,
+   struct vhost_virtqueue *rvq,
+   struct vhost_virtqueue *tvq,
+   bool rx)
+{
+ struct socket *sock = rvq->private_data;
+
+ if (rx)
+ vhost_net_busy_poll_try_queue(net, tvq);
+ else if (sock && sk_has_rx_data(sock->sk))
+ vhost_net_busy_poll_try_queue(net, rvq);
+ else {
+ /* On tx here, sock has no rx data, so we
+  * will wait for sock wakeup for rx, and
+  * vhost_enable_notify() is not needed. */

A possible case is we do have rx data but guest does not refill the rx
queue. In this case we may lose notifications from guest.

Yes, should consider this case. thanks.

I'm a bit confused. Isn't this covered by the previous
"else if (sock && sk_has_rx_data(...))" block?

The problem is it does nothing if vhost_vq_avail_empty() is true and
vhost_enable_notify() is false.


+
+ cpu_relax();
+ }
+
+ preempt_enable();
+
+ if (!rx)
+ vhost_net_enable_vq(net, vq);

No need to enable rx virtqueue, if we are sure handle_rx() will be
called soon.

If we disable rx virtqueue in handle_tx and don't send packets from
guest anymore(handle_tx is not called), so we can wake up for sock rx.
so the network is broken.

Not sure I understand here. I mean if we schedule work for handle_rx(),
there's no need to enable it since handle_rx() will do this for us.

Looks like in the last "else" block in vhost_net_busy_poll_check() we
need to enable vq since in that case we have no rx data and handle_rx()
is not scheduled.


Yes.

So we will use the vhost_has_work() to check whether or not the
handle_rx is scheduled ?
If we use the vhost_has_work(), the work in the dev work_list may be
rx work, or tx work, right ?


Yes. We can add a boolean to record whether or not we've called 
vhost_poll_queue() for rvq. And avoid calling vhost_net_enable_vq() if 
it was true.
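
A rough sketch of that idea (illustrative only; it assumes
vhost_net_busy_poll_try_queue() is changed to return whether it actually
queued rx work, which is not in the posted patch):

	bool rx_work_queued = false;

	if (rx)
		vhost_net_busy_poll_try_queue(net, tvq);
	else if (sock && sk_has_rx_data(sock->sk))
		rx_work_queued = vhost_net_busy_poll_try_queue(net, rvq);
	else
		/* tx path with no rx data: wait for the sock wakeup. */
		cpu_relax();

	preempt_enable();

	/* Re-enable rx wakeups only if handle_rx() was not scheduled. */
	if (!rx && !rx_work_queued)
		vhost_net_enable_vq(net, rvq);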


So here are the needed changes:

1) Split the wakeup disabling to another patch
2) Squash the vhost_net_busy_poll_try_queue() and 
vhost_net_busy_poll_check() into vhost_net_busy_poll() and reduce 
duplicated checks.
3) If possible, rename the boolean rx in vhost_net_busy_poll() to 
poll_rx, this makes code more readable.
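
For (2) and (3), the squashed function would end up looking roughly like
this (sketch only, with the busy-poll loop elided):

static void vhost_net_busy_poll(struct vhost_net *net,
				struct vhost_virtqueue *rvq,
				struct vhost_virtqueue *tvq,
				bool poll_rx)
{
	struct socket *sock = rvq->private_data;

	/* ... busy-poll loop ... */

	/* Former vhost_net_busy_poll_check() logic folded in, so the
	 * rx/tx decision is made in one place: */
	if (poll_rx)
		vhost_net_busy_poll_try_queue(net, tvq);
	else if (sock && sk_has_rx_data(sock->sk))
		vhost_net_busy_poll_try_queue(net, rvq);
	else
		vhost_net_enable_vq(net, rvq);
}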


Thanks


Thanks



Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()

2018-08-02 Thread Jason Wang



On 2018年08月02日 17:23, Jason Wang wrote:



No need to enable rx virtqueue, if we are sure handle_rx() will be
called soon.

If we disable rx virtqueue in handle_tx and don't send packets from
guest anymore(handle_tx is not called), so we can wake up for sock rx.
so the network is broken.

Not sure I understand here. I mean if we schedule work for handle_rx(),
there's no need to enable it since handle_rx() will do this for us.

Looks like in the last "else" block in vhost_net_busy_poll_check() we
need to enable vq since in that case we have no rx data and handle_rx()
is not scheduled.



Yes.

Thanks 


Rethinking about this, it looks like it's not needed. We enable rx wakeups
in this case, so if there's pending data, handle_rx() will be scheduled
after vhost_net_enable_vq().


Thanks

Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()

2018-08-02 Thread Tonghao Zhang
On Thu, Aug 2, 2018 at 5:23 PM Jason Wang  wrote:
>
>
>
> On 2018年08月02日 16:41, Toshiaki Makita wrote:
> > On 2018/08/02 17:18, Jason Wang wrote:
> >> On 2018年08月01日 17:52, Tonghao Zhang wrote:
>  +static void vhost_net_busy_poll_check(struct vhost_net *net,
>  +   struct vhost_virtqueue *rvq,
>  +   struct vhost_virtqueue *tvq,
>  +   bool rx)
>  +{
>  + struct socket *sock = rvq->private_data;
>  +
>  + if (rx)
>  + vhost_net_busy_poll_try_queue(net, tvq);
>  + else if (sock && sk_has_rx_data(sock->sk))
>  + vhost_net_busy_poll_try_queue(net, rvq);
>  + else {
>  + /* On tx here, sock has no rx data, so we
>  +  * will wait for sock wakeup for rx, and
>  +  * vhost_enable_notify() is not needed. */
> >>> A possible case is we do have rx data but guest does not refill the rx
> >>> queue. In this case we may lose notifications from guest.
> >> Yes, should consider this case. thanks.
> > I'm a bit confused. Isn't this covered by the previous
> > "else if (sock && sk_has_rx_data(...))" block?
>
> The problem is it does nothing if vhost_vq_avail_empty() is true and
> vhost_enble_notify() is false.
>
> >
> > +
> > + cpu_relax();
> > + }
> > +
> > + preempt_enable();
> > +
> > + if (!rx)
> > + vhost_net_enable_vq(net, vq);
>  No need to enable rx virtqueue, if we are sure handle_rx() will be
>  called soon.
> >>> If we disable rx virtqueue in handle_tx and don't send packets from
> >>> guest anymore(handle_tx is not called), so we can wake up for sock rx.
> >>> so the network is broken.
> >> Not sure I understand here. I mean is we schedule work for handle_rx(),
> >> there's no need to enable it since handle_rx() will do this for us.
> > Looks like in the last "else" block in vhost_net_busy_poll_check() we
> > need to enable vq since in that case we have no rx data and handle_rx()
> > is not scheduled.
> >
>
> Yes.
So will we use vhost_has_work() to check whether or not handle_rx is
scheduled?
If we use vhost_has_work(), the work in the dev work_list may be rx work
or tx work, right?

> Thanks

Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-02 Thread Jason Wang



On 2018年08月03日 04:55, Michael S. Tsirkin wrote:

On Fri, Jul 20, 2018 at 09:29:37AM +0530, Anshuman Khandual wrote:

This patch series is the follow up on the discussions we had before about
the RFC titled [RFC,V2] virtio: Add platform specific DMA API translation
for virito devices (https://patchwork.kernel.org/patch/10417371/). There
were suggestions about doing away with two different paths of transactions
with the host/QEMU, first being the direct GPA and the other being the DMA
API based translations.

First patch attempts to create a direct GPA mapping based DMA operations
structure called 'virtio_direct_dma_ops' with exact same implementation
of the direct GPA path which virtio core currently has but just wrapped in
a DMA API format. Virtio core must use 'virtio_direct_dma_ops' instead of
the arch default in absence of VIRTIO_F_IOMMU_PLATFORM flag to preserve the
existing semantics. The second patch does exactly that inside the function
virtio_finalize_features(). The third patch removes the default direct GPA
path from virtio core forcing it to use DMA API callbacks for all devices.
Now with that change, every device must have a DMA operations structure
associated with it. The fourth patch adds an additional hook which gives
the platform an opportunity to do yet another override if required. This
platform hook can be used on POWER Ultravisor based protected guests to
load up SWIOTLB DMA callbacks to do the required (as discussed previously
in the above mentioned thread how host is allowed to access only parts of
the guest GPA range) bounce buffering into the shared memory for all I/O
scatter gather buffers to be consumed on the host side.

Please go through these patches and review whether this approach broadly
makes sense. I will appreciate suggestions, inputs, comments regarding
the patches or the approach in general. Thank you.
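
Conceptually, the first patch boils down to wrapping the existing
direct-GPA path in dma_map_ops form, something like the simplified sketch
below (an illustration of the idea, not the patch itself):

/* Sketch: an identity "mapping" that just hands the guest physical
 * address to the device, i.e. the legacy virtio behaviour. */
static dma_addr_t virtio_direct_map_page(struct device *dev,
					 struct page *page,
					 unsigned long offset, size_t size,
					 enum dma_data_direction dir,
					 unsigned long attrs)
{
	return page_to_phys(page) + offset;
}

static void virtio_direct_unmap_page(struct device *dev, dma_addr_t addr,
				     size_t size, enum dma_data_direction dir,
				     unsigned long attrs)
{
	/* Nothing to undo for a direct GPA "mapping". */
}

static int virtio_direct_dma_supported(struct device *dev, u64 mask)
{
	return 1;
}

static const struct dma_map_ops virtio_direct_dma_ops = {
	.map_page	= virtio_direct_map_page,
	.unmap_page	= virtio_direct_unmap_page,
	.dma_supported	= virtio_direct_dma_supported,
	/* .alloc/.free/.map_sg etc. follow the same direct-GPA idea. */
};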

Jason did some work on profiling this. Unfortunately he reports
about 4% extra overhead from this switch on x86 with no vIOMMU.


The test is rather simple: just run pktgen (pktgen_sample01_simple.sh)
in the guest and measure PPS on the tap device on the host.


Thanks



I expect he's writing up the data in more detail, but
just wanted to let you know this would be one more
thing to debug before we can just switch to DMA APIs.



Anshuman Khandual (4):
   virtio: Define virtio_direct_dma_ops structure
   virtio: Override device's DMA OPS with virtio_direct_dma_ops selectively
   virtio: Force virtio core to use DMA API callbacks for all virtio devices
   virtio: Add platform specific DMA API translation for virito devices

  arch/powerpc/include/asm/dma-mapping.h |  6 +++
  arch/powerpc/platforms/pseries/iommu.c |  6 +++
  drivers/virtio/virtio.c                | 72 ++
  drivers/virtio/virtio_pci_common.h     |  3 ++
  drivers/virtio/virtio_ring.c           | 65 +-
  5 files changed, 89 insertions(+), 63 deletions(-)

--
2.9.3



Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()

2018-08-02 Thread Jason Wang



On 2018年08月02日 17:57, Toshiaki Makita wrote:

On 2018/08/02 18:23, Jason Wang wrote:

On 2018年08月02日 16:41, Toshiaki Makita wrote:

On 2018/08/02 17:18, Jason Wang wrote:

On 2018年08月01日 17:52, Tonghao Zhang wrote:

+static void vhost_net_busy_poll_check(struct vhost_net *net,
+   struct vhost_virtqueue *rvq,
+   struct vhost_virtqueue *tvq,
+   bool rx)
+{
+ struct socket *sock = rvq->private_data;
+
+ if (rx)
+ vhost_net_busy_poll_try_queue(net, tvq);
+ else if (sock && sk_has_rx_data(sock->sk))
+ vhost_net_busy_poll_try_queue(net, rvq);
+ else {
+ /* On tx here, sock has no rx data, so we
+  * will wait for sock wakeup for rx, and
+  * vhost_enable_notify() is not needed. */

A possible case is we do have rx data but guest does not refill the rx
queue. In this case we may lose notifications from guest.

Yes, should consider this case. thanks.

I'm a bit confused. Isn't this covered by the previous
"else if (sock && sk_has_rx_data(...))" block?

The problem is it does nothing if vhost_vq_avail_empty() is true and
vhost_enable_notify() is false.

If vhost_enable_notify() is false, the guest will eventually kick the vq, no?



Yes, you are right.

Thanks


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-02 Thread Michael S. Tsirkin
On Thu, Aug 02, 2018 at 04:13:09PM -0500, Benjamin Herrenschmidt wrote:
> On Thu, 2018-08-02 at 23:52 +0300, Michael S. Tsirkin wrote:
> > > Yes, this is the purpose of Anshuman original patch (I haven't looked
> > > at the details of the patch in a while but that's what I told him to
> > > implement ;-) :
> > > 
> > >   - Make virtio always use DMA ops to simplify the code path (with a set
> > > of "transparent" ops for legacy)
> > > 
> > >   and
> > > 
> > >   -  Provide an arch hook allowing us to "override" those "transparent"
> > > DMA ops with some custom ones that do the appropriate swiotlb gunk.
> > > 
> > > Cheers,
> > > Ben.
> > > 
> > 
> > Right but as I tried to say doing that brings us to a bunch of issues
> > with using DMA APIs in virtio. Put simply DMA APIs weren't designed for
> > guest to hypervisor communication.
> 
> I'm not sure I see the problem, see below
> 
> > When we do (as is the case with PLATFORM_IOMMU right now) this adds a
> > bunch of overhead which we need to get rid of if we are to switch to
> > PLATFORM_IOMMU by default.  We need to fix that.
> 
> So let's differenciate the two problems of having an IOMMU (real or
> emulated) which indeeds adds overhead etc... and using the DMA API.

Well actually it's the other way around. An iommu in theory doesn't need
to bring overhead if you set it in bypass mode.  Which does imply the
iommu supports bypass mode. Is that universally the case? The DMA API does,
though: see Christoph's list of things it does, some of which add overhead.

> At the moment, virtio does this all over the place:
> 
>   if (use_dma_api)
>   dma_map/alloc_something(...)
>   else
>   use_pa
> 
> The idea of the patch set is to do two, somewhat orthogonal, changes
> that together achieve what we want. Let me know where you think there
> is "a bunch of issues" because I'm missing it:
> 
>  1- Replace the above if/else constructs with just calling the DMA API,
> and have virtio, at initialization, hookup its own dma_ops that just
> "return pa" (roughly) when the IOMMU stuff isn't used.
> 
> This adds an indirect function call to the path that previously didn't
> have one (the else case above). Is that a significant/measurable
> overhead ?

Seems to be :( Jason reports about 4%. I wonder whether we can support
map_sg and friends being NULL, then use that when mapping is an
identity. A conditional branch there is likely very cheap.
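
As a purely hypothetical illustration of that idea (not the actual
dma-mapping code), the inline wrapper would grow a branch like:

static inline dma_addr_t dma_map_page_attrs(struct device *dev,
					    struct page *page, size_t offset,
					    size_t size,
					    enum dma_data_direction dir,
					    unsigned long attrs)
{
	const struct dma_map_ops *ops = get_dma_ops(dev);

	/* Hypothetical: a NULL callback means "identity mapping", so the
	 * indirect call is skipped entirely on the hot path. */
	if (!ops->map_page)
		return page_to_phys(page) + offset;

	return ops->map_page(dev, page, offset, size, dir, attrs);
}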

Would this cover all platforms with kvm (which is where we care
most about performance)?

> This change stands alone, and imho "cleans" up virtio by avoiding all
> that if/else "2 path" and unless it adds a measurable overhead, should
> probably be done.
> 
>  2- Make virtio use the DMA API with our custom platform-provided
> swiotlb callbacks when needed, that is when not using IOMMU *and*
> running on a secure VM in our case.
> 
> This benefits from -1- by making us just plumb in a different set of
> DMA ops we would have cooked up specifically for virtio in our arch
> code (or in virtio itself but build arch-conditionally in a separate
> file). But it doesn't strictly need it -1-:
> 
> Now, -2- doesn't strictly needs -1-. We could have just done another
> xen-like hack that forces the DMA API "ON" for virtio when running in a
> secure VM.
> 
> The problem if we do that however is that we also then need the arch
> PCI code to make sure it hooks up the virtio PCI devices with the
> special "magic" DMA ops that avoid the iommu but still do swiotlb, ie,
> not the same as other PCI devices. So it will have to play games such
> as checking vendor/device IDs for virtio, checking the IOMMU flag,
> etc... from the arch code which really bloody sucks when assigning PCI
> DMA ops.
> 
> However, if we do it the way we plan here, on top of -1-, with a hook
> called from virtio into the arch to "override" the virtio DMA ops, then
> we avoid the problem completely: The arch hook would only be called by
> virtio if the IOMMU flag is *not* set. IE only when using that special
> "hypervisor" iommu bypass. If the IOMMU flag is set, virtio uses normal
> PCI dma ops as usual.
> 
> That way, we have a very clear semantic: This hook is purely about
> replacing those "null" DMA ops that just return PA introduced in -1-
> with some arch provided specially cooked up DMA ops for non-IOMMU
> virtio that know about the arch special requirements. For us bounce
> buffering.
> 
> Is there something I'm missing ?
> 
> Cheers,
> Ben.

Right so I was trying to write it up in a systematic way, but just to
give you one example, if there is a system where DMA API handles
coherency issues, or flushing of some buffers, then our PLATFORM_IOMMU
flag causes that to happen.

And we kinda worked around this without the IOMMU by basically saying
"ok we do not really need DMA API so let's just bypass it" and
it was kind of ok except now everyone is switching to vIOMMU
just in case. So now people do want some parts of what DMA API does,
such as the bounce 

Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-02 Thread Benjamin Herrenschmidt
On Thu, 2018-08-02 at 23:52 +0300, Michael S. Tsirkin wrote:
> > Yes, this is the purpose of Anshuman original patch (I haven't looked
> > at the details of the patch in a while but that's what I told him to
> > implement ;-) :
> > 
> >   - Make virtio always use DMA ops to simplify the code path (with a set
> > of "transparent" ops for legacy)
> > 
> >   and
> > 
> >   -  Provide an arch hook allowing us to "override" those "transparent"
> > DMA ops with some custom ones that do the appropriate swiotlb gunk.
> > 
> > Cheers,
> > Ben.
> > 
> 
> Right but as I tried to say doing that brings us to a bunch of issues
> with using DMA APIs in virtio. Put simply DMA APIs weren't designed for
> guest to hypervisor communication.

I'm not sure I see the problem, see below

> When we do (as is the case with PLATFORM_IOMMU right now) this adds a
> bunch of overhead which we need to get rid of if we are to switch to
> PLATFORM_IOMMU by default.  We need to fix that.

So let's differentiate the two problems: having an IOMMU (real or
emulated), which indeed adds overhead etc..., and using the DMA API.

At the moment, virtio does this all over the place:

if (use_dma_api)
dma_map/alloc_something(...)
else
use_pa

The idea of the patch set is to do two, somewhat orthogonal, changes
that together achieve what we want. Let me know where you think there
is "a bunch of issues" because I'm missing it:

 1- Replace the above if/else constructs with just calling the DMA API,
and have virtio, at initialization, hookup its own dma_ops that just
"return pa" (roughly) when the IOMMU stuff isn't used.

This adds an indirect function call to the path that previously didn't
have one (the else case above). Is that a significant/measurable
overhead ?

This change stands alone, and imho "cleans" up virtio by avoiding all
that if/else "2 path" and unless it adds a measurable overhead, should
probably be done.
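
In code terms, the hookup side of (1) would be roughly (a sketch, assuming
a set_dma_ops()-style helper and the virtio_direct_dma_ops structure from
the series):

	/* At virtio device initialization, e.g. from
	 * virtio_finalize_features(): */
	if (!virtio_has_feature(dev, VIRTIO_F_IOMMU_PLATFORM))
		set_dma_ops(dev->dev.parent, &virtio_direct_dma_ops);

	/* ... after which virtio_ring can call dma_map_page()/dma_map_single()
	 * unconditionally instead of the if (use_dma_api) / else use_pa
	 * construct above. */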

 2- Make virtio use the DMA API with our custom platform-provided
swiotlb callbacks when needed, that is when not using IOMMU *and*
running on a secure VM in our case.

This benefits from -1- by making us just plumb in a different set of
DMA ops we would have cooked up specifically for virtio in our arch
code (or in virtio itself but build arch-conditionally in a separate
file). But it doesn't strictly need it -1-:

Now, -2- doesn't strictly need -1-. We could have just done another
xen-like hack that forces the DMA API "ON" for virtio when running in a
secure VM.

The problem if we do that however is that we also then need the arch
PCI code to make sure it hooks up the virtio PCI devices with the
special "magic" DMA ops that avoid the iommu but still do swiotlb, ie,
not the same as other PCI devices. So it will have to play games such
as checking vendor/device IDs for virtio, checking the IOMMU flag,
etc... from the arch code which really bloody sucks when assigning PCI
DMA ops.

However, if we do it the way we plan here, on top of -1-, with a hook
called from virtio into the arch to "override" the virtio DMA ops, then
we avoid the problem completely: The arch hook would only be called by
virtio if the IOMMU flag is *not* set. IE only when using that special
"hypervisor" iommu bypass. If the IOMMU flag is set, virtio uses normal
PCI dma ops as usual.

That way, we have a very clear semantic: This hook is purely about
replacing those "null" DMA ops that just return PA introduced in -1-
with some arch provided specially cooked up DMA ops for non-IOMMU
virtio that know about the arch special requirements. For us bounce
buffering.
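
A minimal sketch of such a hook, just to make the proposed semantic
concrete (all names here are hypothetical):

/* Default: keep the direct "return pa" ops. Architectures (e.g. the POWER
 * ultravisor case) override this to install swiotlb-backed ops instead. */
const struct dma_map_ops * __weak
arch_virtio_override_dma_ops(struct virtio_device *vdev,
			     const struct dma_map_ops *direct_ops)
{
	return direct_ops;
}

/* Called by virtio core only when VIRTIO_F_IOMMU_PLATFORM is not
 * negotiated; with the flag set, normal PCI dma ops are used as usual. */
static void virtio_setup_direct_dma_ops(struct virtio_device *vdev)
{
	const struct dma_map_ops *ops;

	ops = arch_virtio_override_dma_ops(vdev, &virtio_direct_dma_ops);
	set_dma_ops(vdev->dev.parent, ops);
}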

Is there something I'm missing ?

Cheers,
Ben.






Re: [net-next, v6, 6/7] net-sysfs: Add interface for Rx queue(s) map per Tx queue

2018-08-02 Thread Michael S. Tsirkin
On Thu, Aug 02, 2018 at 02:04:12PM -0700, Nambiar, Amritha wrote:
> On 8/1/2018 5:11 PM, Andrei Vagin wrote:
> > On Tue, Jul 10, 2018 at 07:28:49PM -0700, Nambiar, Amritha wrote:
> >> With this patch series, I introduced static_key for XPS maps
> >> (xps_needed), so static_key_slow_inc() is used to switch branches. The
> >> definition of static_key_slow_inc() has cpus_read_lock in place. In the
> >> virtio_net driver, XPS queues are initialized after setting the
> >> queue:cpu affinity in virtnet_set_affinity() which is already protected
> >> within cpus_read_lock. Hence the warning here about trying to acquire
> >> cpus_read_lock when it is already held.
> >>
> >> A quick fix for this would be to just extract netif_set_xps_queue() out
> >> of the lock by simply wrapping it with another put/get_online_cpus
> >> (unlock right before and hold lock right after).
> > 
> > virtnet_set_affinity() is called from virtnet_cpu_online(), which is
> > called under cpus_read_lock too.
> > 
> > It looks like now we can't call netif_set_xps_queue() from cpu hotplug
> > callbacks.
> > 
> > I can suggest a very straightforward fix for this problem. The patch is
> > attached.
> > 
> 
> Thanks for looking into this. I was thinking of fixing this in the
> virtio_net driver by moving the XPS initialization (and adding a new
> get_affinity utility) into ndo_open (so it is together with other tx
> preparation) instead of probe. Your patch solves this in general for
> setting up cpu hotplug callbacks which is under cpus_read_lock.
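
As a sketch of that alternative (illustrative only; virtnet_set_xps() and
virtnet_get_affinity() are assumed helpers, not existing driver code):

/* Called from virtnet_open() (ndo_open), i.e. outside of the CPU hotplug
 * callbacks, so netif_set_xps_queue() can take cpus_read_lock() itself
 * (via static_key_slow_inc()) without nesting. */
static void virtnet_set_xps(struct virtnet_info *vi)
{
	cpumask_var_t mask;
	int i;

	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
		return;

	for (i = 0; i < vi->curr_queue_pairs; i++) {
		virtnet_get_affinity(vi, i, mask);	/* assumed helper */
		netif_set_xps_queue(vi->dev, mask, i);
	}

	free_cpumask_var(mask);
}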


I like this too. Could you repost in a standard way
(inline, with your signoff etc) so we can ack this for
net-next?

> >> But this may not be a
> >> clean solution. It'd help if I can get suggestions on what would be a
> >> clean option to fix this without extensively changing the code in
> >> virtio_net. Is it mandatory to protect the affinitization with
> >> read_lock? I don't see similar lock in other drivers while setting the
> >> affinity. I understand this warning should go away, but isn't it safe to
> >> have multiple readers.
> >>
> >>> On Fri, Jun 29, 2018 at 09:27:07PM -0700, Amritha Nambiar wrote:


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-02 Thread Michael S. Tsirkin
On Fri, Jul 20, 2018 at 09:29:37AM +0530, Anshuman Khandual wrote:
> This patch series is the follow up on the discussions we had before about
> the RFC titled [RFC,V2] virtio: Add platform specific DMA API translation
> for virito devices (https://patchwork.kernel.org/patch/10417371/). There
> were suggestions about doing away with two different paths of transactions
> with the host/QEMU, first being the direct GPA and the other being the DMA
> API based translations.
> 
> First patch attempts to create a direct GPA mapping based DMA operations
> structure called 'virtio_direct_dma_ops' with exact same implementation
> of the direct GPA path which virtio core currently has but just wrapped in
> a DMA API format. Virtio core must use 'virtio_direct_dma_ops' instead of
> the arch default in absence of VIRTIO_F_IOMMU_PLATFORM flag to preserve the
> existing semantics. The second patch does exactly that inside the function
> virtio_finalize_features(). The third patch removes the default direct GPA
> path from virtio core forcing it to use DMA API callbacks for all devices.
> Now with that change, every device must have a DMA operations structure
> associated with it. The fourth patch adds an additional hook which gives
> the platform an opportunity to do yet another override if required. This
> platform hook can be used on POWER Ultravisor based protected guests to
> load up SWIOTLB DMA callbacks to do the required (as discussed previously
> in the above mentioned thread how host is allowed to access only parts of
> the guest GPA range) bounce buffering into the shared memory for all I/O
> scatter gather buffers to be consumed on the host side.
> 
> Please go through these patches and review whether this approach broadly
> makes sense. I will appreciate suggestions, inputs, comments regarding
> the patches or the approach in general. Thank you.

Jason did some work on profiling this. Unfortunately he reports
about 4% extra overhead from this switch on x86 with no vIOMMU.

I expect he's writing up the data in more detail, but
just wanted to let you know this would be one more
thing to debug before we can just switch to DMA APIs.


> Anshuman Khandual (4):
>   virtio: Define virtio_direct_dma_ops structure
>   virtio: Override device's DMA OPS with virtio_direct_dma_ops selectively
>   virtio: Force virtio core to use DMA API callbacks for all virtio devices
>   virtio: Add platform specific DMA API translation for virito devices
> 
>  arch/powerpc/include/asm/dma-mapping.h |  6 +++
>  arch/powerpc/platforms/pseries/iommu.c |  6 +++
>  drivers/virtio/virtio.c                | 72 ++
>  drivers/virtio/virtio_pci_common.h     |  3 ++
>  drivers/virtio/virtio_ring.c           | 65 +-
>  5 files changed, 89 insertions(+), 63 deletions(-)
> 
> -- 
> 2.9.3


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-02 Thread Michael S. Tsirkin
On Thu, Aug 02, 2018 at 10:33:05AM -0500, Benjamin Herrenschmidt wrote:
> On Thu, 2018-08-02 at 00:56 +0300, Michael S. Tsirkin wrote:
> > > but it's not, VMs are
> > > created in "legacy" mode all the times and we don't know at VM creation
> > > time whether it will become a secure VM or not. The way our secure VMs
> > > work is that they start as a normal VM, load a secure "payload" and
> > > call the Ultravisor to "become" secure.
> > > 
> > > So we're in a bit of a bind here. We need that one-liner optional arch
> > > hook to make virtio use swiotlb in that "IOMMU bypass" case.
> > > 
> > > Ben.
> > 
> > And just to make sure I understand, on your platform DMA APIs do include
> > some of the cache flushing tricks and this is why you don't want to
> > declare iommu support in the hypervisor?
> 
> I'm not sure I parse what you mean.
> 
> We don't need cache flushing tricks.

You don't, but do real devices on the same platform need them?

> The problem we have with our
> "secure" VMs is that:
> 
>  - At VM creation time we have no idea it's going to become a secure
> VM, qemu doesn't know anything about it, and thus qemu (or other
> management tools, libvirt etc...) are going to create "legacy" (ie
> iommu bypass) virtio devices.
> 
>  - Once the VM goes secure (early during boot but too late for qemu),
> it will need to make virtio do bounce-buffering via swiotlb because
> qemu cannot physically access most VM pages (blocked by HW security
> features), we need to bounce buffer using some unsecure pages that are
> accessible to qemu.
> 
> That said, I wouldn't object for us to more generally switch long run
> to changing qemu so that virtio on powerpc starts using the IOMMU as a
> default provided we fix our guest firmware to understand it (it
> currently doesn't), and provided we verify that the performance impact
> on things like vhost is negligible.
> 
> Cheers,
> Ben.
> 


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-02 Thread Michael S. Tsirkin
On Thu, Aug 02, 2018 at 12:53:41PM -0500, Benjamin Herrenschmidt wrote:
> On Thu, 2018-08-02 at 20:19 +0300, Michael S. Tsirkin wrote:
> > 
> > I see. So yes, given that device does not know or care, using
> > virtio features is an awkward fit.
> > 
> > So let's say as a quick fix for you maybe we could generalize the
> > xen_domain hack, instead of just checking xen_domain check some static
> > branch.  Then teach xen and others to enable that.
> 
> > OK but problem then becomes this: if you do this and virtio device appears
> > behind a vIOMMU and it does not advertise the IOMMU flag, the
> > code will try to use the vIOMMU mappings and fail.
> >
> > It does look like even with trick above, you need a special version of
> > DMA ops that does just swiotlb but not any of the other things DMA API
> > might do.
> > 
> > Thoughts?
> 
> Yes, this is the purpose of Anshuman's original patch (I haven't looked
> at the details of the patch in a while but that's what I told him to
> implement ;-) :
> 
>  - Make virtio always use DMA ops to simplify the code path (with a set
> of "transparent" ops for legacy)
> 
>  and
> 
>  -  Provide an arch hook allowing us to "override" those "transparent"
> DMA ops with some custom ones that do the appropriate swiotlb gunk.
> 
> Cheers,
> Ben.
> 

Right, but as I tried to say, doing that brings us to a bunch of issues
with using DMA APIs in virtio. Put simply, DMA APIs weren't designed for
guest-to-hypervisor communication.

When we do (as is the case with PLATFORM_IOMMU right now) this adds a
bunch of overhead which we need to get rid of if we are to switch to
PLATFORM_IOMMU by default.  We need to fix that.

-- 
MST


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-02 Thread Benjamin Herrenschmidt
On Thu, 2018-08-02 at 20:19 +0300, Michael S. Tsirkin wrote:
> 
> I see. So yes, given that device does not know or care, using
> virtio features is an awkward fit.
> 
> So let's say as a quick fix for you maybe we could generalize the
> xen_domain hack, instead of just checking xen_domain check some static
> branch.  Then teach xen and others to enable that.

> OK but problem then becomes this: if you do this and virtio device appears
> behind a vIOMMU and it does not advertise the IOMMU flag, the
> code will try to use the vIOMMU mappings and fail.
>
> It does look like even with trick above, you need a special version of
> DMA ops that does just swiotlb but not any of the other things DMA API
> might do.
> 
> Thoughts?

Yes, this is the purpose of Anshuman's original patch (I haven't looked
at the details of the patch in a while but that's what I told him to
implement ;-) :

 - Make virtio always use DMA ops to simplify the code path (with a set
of "transparent" ops for legacy)

 and

 -  Provide an arch hook allowing us to "override" those "transparent"
DMA ops with some custom ones that do the appropriate swiotlb gunk.

Cheers,
Ben.




Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-02 Thread Michael S. Tsirkin
On Thu, Aug 02, 2018 at 11:01:26AM -0500, Benjamin Herrenschmidt wrote:
> On Thu, 2018-08-02 at 18:41 +0300, Michael S. Tsirkin wrote:
> > 
> > > I don't completely agree:
> > > 
> > > 1 - VIRTIO_F_IOMMU_PLATFORM is a property of the "other side", ie qemu
> > > for example. It indicates that the peer bypasses the normal platform
> > > iommu. The platform code in the guest has no real way to know that this
> > > is the case, this is a specific "feature" of the qemu implementation.
> > > 
> > > 2 - VIRTIO_F_PLATFORM_DMA (or whatever you want to call it), is a
> > > property of the guest platform itself (not qemu), there's no way the
> > > "peer" can advertise it via the virtio negotiated flags. At least for
> > > us. I don't know for sure whether that would be workable for the ARM
> > > case. In our case, qemu has no idea at VM creation time that the VM
> > > will turn itself into a secure VM and thus will require bounce
> > > buffering for IOs (including virtio).
> > > 
> > > So unless we have another hook for the arch code to set
> > > VIRTIO_F_PLATFORM_DMA on selected (or all) virtio devices from the
> > > guest itself, I don't see that as a way to deal with it.
> > > 
> > > >  The other issue is VIRTIO_F_IO_BARRIER
> > > > which is very vaguely defined, and which needs a better definition.
> > > > And last but not least we'll need some text explaining the challenges
> > > > of hardware devices - I think VIRTIO_F_PLATFORM_DMA + 
> > > > VIRTIO_F_IO_BARRIER
> > > > is what would basically cover them, but a good description including
> > > > an explanation of why these matter.
> > > 
> > > Ben.
> > > 
> > 
> > So is it true that from qemu's point of view there is nothing special
> > going on?  You pass in a PA, host writes there.
> 
> Yes, qemu doesn't see a difference. It's the guest that will bounce the
> pages via a pool of "insecure" pages that qemu can access. Normal pages
> in a secure VM come from PAs that qemu cannot physically access.
> 
> Cheers,
> Ben.
> 

I see. So yes, given that device does not know or care, using
virtio features is an awkward fit.

So let's say, as a quick fix for you, maybe we could generalize the
xen_domain hack: instead of just checking xen_domain(), check some static
branch.  Then teach xen and others to enable that.

OK but problem then becomes this: if you do this and virtio device appears
behind a vIOMMU and it does not advertise the IOMMU flag, the
code will try to use the vIOMMU mappings and fail.

It does look like even with trick above, you need a special version of
DMA ops that does just swiotlb but not any of the other things DMA API
might do.

Thoughts?
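
To make that concrete, a sketch of what generalizing the xen_domain() special
case in vring_use_dma_api() into a static branch might look like; the key name
virtio_force_dma_api is made up for illustration, and the surrounding logic
follows the current virtio_ring.c check:

#include <linux/jump_label.h>
#include <linux/virtio_config.h>
#include <xen/xen.h>

DEFINE_STATIC_KEY_FALSE(virtio_force_dma_api);

static bool vring_use_dma_api(struct virtio_device *vdev)
{
	/* VIRTIO_F_IOMMU_PLATFORM negotiated: always go through the DMA API */
	if (!virtio_has_iommu_quirk(vdev))
		return true;

	/* Legacy/bypass device, but the platform still needs the DMA API.
	 * Today this is only Xen; a platform such as a secure guest could
	 * flip the same key early in boot instead of adding yet another
	 * hardcoded check here. */
	if (xen_domain() || static_branch_unlikely(&virtio_force_dma_api))
		return true;

	return false;
}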

-- 
MST


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-02 Thread Benjamin Herrenschmidt
On Thu, 2018-08-02 at 18:41 +0300, Michael S. Tsirkin wrote:
> 
> > I don't completely agree:
> > 
> > 1 - VIRTIO_F_IOMMU_PLATFORM is a property of the "other side", ie qemu
> > for example. It indicates that the peer bypasses the normal platform
> > iommu. The platform code in the guest has no real way to know that this
> > is the case, this is a specific "feature" of the qemu implementation.
> > 
> > 2 - VIRTIO_F_PLATFORM_DMA (or whatever you want to call it), is a
> > property of the guest platform itself (not qemu), there's no way the
> > "peer" can advertise it via the virtio negotiated flags. At least for
> > us. I don't know for sure whether that would be workable for the ARM
> > case. In our case, qemu has no idea at VM creation time that the VM
> > will turn itself into a secure VM and thus will require bounce
> > buffering for IOs (including virtio).
> > 
> > So unless we have another hook for the arch code to set
> > VIRTIO_F_PLATFORM_DMA on selected (or all) virtio devices from the
> > guest itself, I don't see that as a way to deal with it.
> > 
> > >  The other issue is VIRTIO_F_IO_BARRIER
> > > which is very vaguely defined, and which needs a better definition.
> > > And last but not least we'll need some text explaining the challenges
> > > of hardware devices - I think VIRTIO_F_PLATFORM_DMA + VIRTIO_F_IO_BARRIER
> > > is what would basically cover them, but a good description including
> > > an explanation of why these matter.
> > 
> > Ben.
> > 
> 
> So is it true that from qemu's point of view there is nothing special
> going on?  You pass in a PA, host writes there.

Yes, qemu doesn't see a difference. It's the guest that will bounce the
pages via a pool of "insecure" pages that qemu can access. Normal pages
in a secure VM come from PAs that qemu cannot physically access.

Cheers,
Ben.




Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-02 Thread Michael S. Tsirkin
On Thu, Aug 02, 2018 at 10:24:57AM -0500, Benjamin Herrenschmidt wrote:
> On Wed, 2018-08-01 at 01:36 -0700, Christoph Hellwig wrote:
> > We just need to figure out how to deal with devices that deviate
> > from the default.  One thing is that VIRTIO_F_IOMMU_PLATFORM really
> > should become VIRTIO_F_PLATFORM_DMA to cover the cases of non-iommu
> > dma tweaks (offsets, cache flushing), which seems well in spirit of
> > the original design. 
> 
> I don't completely agree:
> 
> 1 - VIRTIO_F_IOMMU_PLATFORM is a property of the "other side", ie qemu
> for example. It indicates that the peer bypasses the normal platform
> iommu. The platform code in the guest has no real way to know that this
> is the case, this is a specific "feature" of the qemu implementation.
> 
> 2 - VIRTIO_F_PLATFORM_DMA (or whatever you want to call it), is a
> property of the guest platform itself (not qemu), there's no way the
> "peer" can advertise it via the virtio negotiated flags. At least for
> us. I don't know for sure whether that would be workable for the ARM
> case. In our case, qemu has no idea at VM creation time that the VM
> will turn itself into a secure VM and thus will require bounce
> buffering for IOs (including virtio).
> 
> So unless we have another hook for the arch code to set
> VIRTIO_F_PLATFORM_DMA on selected (or all) virtio devices from the
> guest itself, I don't see that as a way to deal with it.
> 
> >  The other issue is VIRTIO_F_IO_BARRIER
> > which is very vaguely defined, and which needs a better definition.
> > And last but not least we'll need some text explaining the challenges
> > of hardware devices - I think VIRTIO_F_PLATFORM_DMA + VIRTIO_F_IO_BARRIER
> > is what would basically cover them, but a good description including
> > an explanation of why these matter.
> 
> Ben.
> 

So is it true that from qemu's point of view there is nothing special
going on?  You pass in a PA, host writes there.


-- 
MST


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-02 Thread Benjamin Herrenschmidt
On Thu, 2018-08-02 at 00:56 +0300, Michael S. Tsirkin wrote:
> > but it's not, VMs are
> > created in "legacy" mode all the time and we don't know at VM creation
> > time whether it will become a secure VM or not. The way our secure VMs
> > work is that they start as a normal VM, load a secure "payload" and
> > call the Ultravisor to "become" secure.
> > 
> > So we're in a bit of a bind here. We need that one-liner optional arch
> > hook to make virtio use swiotlb in that "IOMMU bypass" case.
> > 
> > Ben.
> 
> And just to make sure I understand, on your platform DMA APIs do include
> some of the cache flushing tricks and this is why you don't want to
> declare iommu support in the hypervisor?

I'm not sure I parse what you mean.

We don't need cache flushing tricks. The problem we have with our
"secure" VMs is that:

 - At VM creation time we have no idea it's going to become a secure
VM, qemu doesn't know anything about it, and thus qemu (or other
management tools, libvirt etc...) are going to create "legacy" (ie
iommu bypass) virtio devices.

 - Once the VM goes secure (early during boot but too late for qemu),
it will need to make virtio do bounce-buffering via swiotlb because
qemu cannot physically access most VM pages (blocked by HW security
features), we need to bounce buffer using some unsecure pages that are
accessible to qemu.
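
A sketch of the kind of optional arch hook being asked for; the names
virtio_arch_dma_override() and is_secure_guest() are illustrative assumptions,
not an existing interface:

#include <linux/dma-mapping.h>
#include <linux/swiotlb.h>
#include <linux/virtio.h>

/* virtio core: default is a no-op, keeping whatever DMA ops are in place */
void __weak virtio_arch_dma_override(struct virtio_device *vdev)
{
}

/* arch/powerpc (sketch): once the guest has turned secure, bounce all
 * virtio I/O through swiotlb so the host only ever touches shared,
 * qemu-accessible pages. */
void virtio_arch_dma_override(struct virtio_device *vdev)
{
	/* the vring code does DMA on the parent device of the virtio device */
	struct device *dma_dev = vdev->dev.parent;

	if (is_secure_guest())
		set_dma_ops(dma_dev, &swiotlb_dma_ops);
}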

That said, I wouldn't object to us more generally switching, in the long
run, to having qemu make virtio on powerpc use the IOMMU by default,
provided we fix our guest firmware to understand it (it
currently doesn't), and provided we verify that the performance impact
on things like vhost is negligible.

Cheers,
Ben.




Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-02 Thread Benjamin Herrenschmidt
On Wed, 2018-08-01 at 01:36 -0700, Christoph Hellwig wrote:
> We just need to figure out how to deal with devices that deviate
> from the default.  One things is that VIRTIO_F_IOMMU_PLATFORM really
> should become VIRTIO_F_PLATFORM_DMA to cover the cases of non-iommu
> dma tweaks (offsets, cache flushing), which seems well in spirit of
> the original design. 

I don't completely agree:

1 - VIRTIO_F_IOMMU_PLATFORM is a property of the "other side", ie qemu
for example. It indicates that the peer bypasses the normal platform
iommu. The platform code in the guest has no real way to know that this
is the case, this is a specific "feature" of the qemu implementation.

2 - VIRTIO_F_PLATFORM_DMA (or whatever you want to call it), is a
property of the guest platform itself (not qemu), there's no way the
"peer" can advertise it via the virtio negotiated flags. At least for
us. I don't know for sure whether that would be workable for the ARM
case. In our case, qemu has no idea at VM creation time that the VM
will turn itself into a secure VM and thus will require bounce
buffering for IOs (including virtio).

So unless we have another hook for the arch code to set
VIRTIO_F_PLATFORM_DMA on selected (or all) virtio devices from the
guest itself, I don't see that as a way to deal with it.

>  The other issue is VIRTIO_F_IO_BARRIER
> which is very vaguely defined, and which needs a better definition.
> And last but not least we'll need some text explaining the challenges
> of hardware devices - I think VIRTIO_F_PLATFORM_DMA + VIRTIO_F_IO_BARRIER
> is what would basically cover them, but a good description including
> an explanation of why these matter.

Ben.




Re: [PATCH v2 2/2] virtio_balloon: replace oom notifier with shrinker

2018-08-02 Thread Michael S. Tsirkin
On Thu, Aug 02, 2018 at 06:32:44PM +0800, Wei Wang wrote:
> On 08/01/2018 07:34 PM, Michal Hocko wrote:
> > On Wed 01-08-18 19:12:25, Wei Wang wrote:
> > > On 07/30/2018 05:00 PM, Michal Hocko wrote:
> > > > On Fri 27-07-18 17:24:55, Wei Wang wrote:
> > > > > The OOM notifier is getting deprecated to use for the reasons 
> > > > > mentioned
> > > > > here by Michal Hocko: https://lkml.org/lkml/2018/7/12/314
> > > > > 
> > > > > This patch replaces the virtio-balloon oom notifier with a shrinker
> > > > > to release balloon pages on memory pressure.
> > > > It would be great to document the replacement. This is not a small
> > > > change...
> > > OK. I plan to document the following to the commit log:
> > > 
> > >The OOM notifier is getting deprecated to use for the reasons:
> > >  - As a callout from the oom context, it is too subtle and easy to
> > >generate bugs and corner cases which are hard to track;
> > >  - It is called too late (after the reclaiming has been performed).
> > >Drivers with a large amount of reclaimable memory are expected to
> > >release it at an early stage of memory pressure;
> > >  - The notifier callback isn't aware of the oom constraints;
> > >  Link: https://lkml.org/lkml/2018/7/12/314
> > > 
> > >  This patch replaces the virtio-balloon oom notifier with a shrinker
> > >  to release balloon pages on memory pressure. Users can set the 
> > > amount of
> > >  memory pages to release each time a shrinker_scan is called via the
> > >  module parameter balloon_pages_to_shrink, and the default amount is 
> > > 256
> > >  pages. Historically, the feature VIRTIO_BALLOON_F_DEFLATE_ON_OOM has
> > >  been used to release balloon pages on OOM. We continue to use this
> > >  feature bit for the shrinker, so the shrinker is only registered when
> > >  this feature bit has been negotiated with host.
> > Do you have any numbers for how does this work in practice?
> 
> It works in this way: for example, we can set the parameter,
> balloon_pages_to_shrink, to shrink 1GB memory once shrink scan is called.
> Now, we have an 8GB guest, and we balloon out 7GB. When shrink scan is
> called, the balloon driver will get back 1GB memory and give them back to
> mm, then the ballooned memory becomes 6GB.
> 
> When the shrinker scan is called the second time, another 1GB will be given
> back to mm. So the ballooned pages are given back to mm gradually.

I think what's being asked here is a description of tests that
were run. Which workloads see improved behaviour?

Our behaviour under memory pressure isn't great; in particular, it is not
clear when it's safe to re-inflate the balloon. If the host attempts to
re-inflate it too soon, we still get OOM. It would be better
if VIRTIO_BALLOON_F_DEFLATE_ON_OOM somehow meant
"it's ok to ask for almost all of memory; if the guest needs memory from
the balloon for apps to function, it can take it from the balloon".


-- 
MST


Re: [PATCH v2 2/2] virtio_balloon: replace oom notifier with shrinker

2018-08-02 Thread Michal Hocko
On Thu 02-08-18 18:32:44, Wei Wang wrote:
> On 08/01/2018 07:34 PM, Michal Hocko wrote:
> > On Wed 01-08-18 19:12:25, Wei Wang wrote:
> > > On 07/30/2018 05:00 PM, Michal Hocko wrote:
> > > > On Fri 27-07-18 17:24:55, Wei Wang wrote:
> > > > > The OOM notifier is getting deprecated to use for the reasons 
> > > > > mentioned
> > > > > here by Michal Hocko: https://lkml.org/lkml/2018/7/12/314
> > > > > 
> > > > > This patch replaces the virtio-balloon oom notifier with a shrinker
> > > > > to release balloon pages on memory pressure.
> > > > It would be great to document the replacement. This is not a small
> > > > change...
> > > OK. I plan to document the following to the commit log:
> > > 
> > >The OOM notifier is getting deprecated to use for the reasons:
> > >  - As a callout from the oom context, it is too subtle and easy to
> > >generate bugs and corner cases which are hard to track;
> > >  - It is called too late (after the reclaiming has been performed).
> > >Drivers with a large amount of reclaimable memory are expected to
> > >release it at an early stage of memory pressure;
> > >  - The notifier callback isn't aware of the oom constraints;
> > >  Link: https://lkml.org/lkml/2018/7/12/314
> > > 
> > >  This patch replaces the virtio-balloon oom notifier with a shrinker
> > >  to release balloon pages on memory pressure. Users can set the 
> > > amount of
> > >  memory pages to release each time a shrinker_scan is called via the
> > >  module parameter balloon_pages_to_shrink, and the default amount is 
> > > 256
> > >  pages. Historically, the feature VIRTIO_BALLOON_F_DEFLATE_ON_OOM has
> > >  been used to release balloon pages on OOM. We continue to use this
> > >  feature bit for the shrinker, so the shrinker is only registered when
> > >  this feature bit has been negotiated with host.
> > Do you have any numbers for how does this work in practice?
> 
> It works in this way: for example, we can set the parameter,
> balloon_pages_to_shrink, to shrink 1GB memory once shrink scan is called.
> Now, we have an 8GB guest, and we balloon out 7GB. When shrink scan is
> called, the balloon driver will get back 1GB memory and give them back to
> mm, then the ballooned memory becomes 6GB.
> 
> When the shrinker scan is called the second time, another 1GB will be given
> back to mm. So the ballooned pages are given back to mm gradually.
> 
> > Let's say
> > you have a medium page cache workload which triggers kswapd to do a
> > light reclaim? Hardcoded shrinking sounds quite dubious to me but I have
> > no idea how people expect this to work. Shouldn't this be more
> > adaptive? How precious are those pages anyway?
> 
> Those pages are given to host to use usually because the guest has enough
> free memory, and host doesn't want to waste those pieces of memory as they
> are not used by this guest. When the guest needs them, it is reasonable that
> the guest has higher priority to take them back.
> But I'm not sure if there would be a more adaptive approach than "gradually
> giving back as the guest wants more".

I am not sure I follow. Let me be more specific. Say you have a trivial
stream IO triggering reclaim to recycle clean page cache. This will
invoke slab shrinkers as well. Do you really want to drop your batch of
pages on each invocation? Doesn't that remove them very quickly? Just
try to dd if=large_file of=/dev/null and see how your pages are
disappearing. Shrinkers usually scale the number of objects they are
going to reclaim based on the memory pressure (aka target to be
reclaimed).
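
For reference, a sketch of what letting reclaim do that scaling could look
like for virtio-balloon, reusing the driver's existing leak_balloon() and
update_balloon_size() helpers; the shrinker member on struct virtio_balloon
and the callback names are assumptions about the patch, not merged code:

static unsigned long virtio_balloon_shrink_count(struct shrinker *shrinker,
						 struct shrink_control *sc)
{
	struct virtio_balloon *vb = container_of(shrinker,
						 struct virtio_balloon,
						 shrinker);

	/* report how much we could give back; reclaim derives nr_to_scan
	 * from this and the current memory pressure */
	return vb->num_pages / VIRTIO_BALLOON_PAGES_PER_PAGE;
}

static unsigned long virtio_balloon_shrink_scan(struct shrinker *shrinker,
						struct shrink_control *sc)
{
	struct virtio_balloon *vb = container_of(shrinker,
						 struct virtio_balloon,
						 shrinker);
	unsigned long pages_freed;

	/* deflate only as much as reclaim asked for, not a fixed batch */
	pages_freed = leak_balloon(vb, sc->nr_to_scan *
				       VIRTIO_BALLOON_PAGES_PER_PAGE);
	update_balloon_size(vb);

	return pages_freed ? pages_freed / VIRTIO_BALLOON_PAGES_PER_PAGE
			   : SHRINK_STOP;
}
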
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 2/2] virtio_balloon: replace oom notifier with shrinker

2018-08-02 Thread Tetsuo Handa
On 2018/08/02 19:32, Wei Wang wrote:
> On 08/01/2018 07:34 PM, Michal Hocko wrote:
>> Do you have any numbers for how does this work in practice?
> 
> It works in this way: for example, we can set the parameter,
> balloon_pages_to_shrink, to shrink 1GB of memory once shrink scan is
> called. Now, we have an 8GB guest, and we balloon out 7GB. When shrink
> scan is called, the balloon driver will get back 1GB of memory and give
> it back to mm, then the ballooned memory becomes 6GB.

Since the shrinker might be called concurrently (am I correct?), the balloon
might deflate far more than needed if it releases that much memory on each
invocation. If a shrinker is used, releasing 256 pages at a time might be
sufficient.


CFP ICEIS 2019 - 21st Int.l Conf. on Enterprise Information Systems (Heraklion, Crete/Greece)

2018-08-02 Thread ic...@insticc.info
SUBMISSION DEADLINE 

21st International Conference on Enterprise Information Systems

Submission Deadline: December 10, 2018

http://www.iceis.org/

May 3 - 5, 2019
Heraklion, Crete, Greece.

 ICEIS is organized in 6 major tracks:

 - Databases and Information Systems Integration
 - Artificial Intelligence and Decision Support Systems
 - Information Systems Analysis and Specification
 - Software Agents and Internet Computing
 - Human-Computer Interaction
 - Enterprise Architecture


Proceedings will be submitted for indexation by: DBLP, Thomson Reuters, EI, 
SCOPUS and Semantic Scholar. 
 
A short list of presented papers will be selected so that revised and extended 
versions of these papers will be published by Springer.
 
All papers presented at the congress venue will also be available at the 
SCITEPRESS Digital Library (http://www.scitepress.org/DigitalLibrary/).
  
Should you have any questions, please don’t hesitate to contact me.
 

Kind regards,
ICEIS Secretariat

Address: Av. D. Manuel I, 27A, 2º esq.
2910-595 Setubal, Portugal
Tel: +351 265 520 184
Fax: +351 265 520 186
Web: http://www.iceis.org/
e-mail: iceis.secretar...@insticc.org


CFP CLOSER 2019 - 8th Int.l Conf. on Cloud Computing and Services Science (Heraklion, Crete/Greece)

2018-08-02 Thread clo...@insticc.info
SUBMISSION DEADLINE 

8th International Conference on Cloud Computing and Services Science

Submission Deadline: December 10, 2018

http://closer.scitevents.org

May 2 - 4, 2019
Heraklion, Crete, Greece.

 CLOSER is organized in 9 major tracks:

 - Services Science
 - Data as a Service
 - Cloud Operations
 - Edge Cloud and Fog Computing
 - Service Modelling and Analytics
 - Mobile  Cloud  Computing  
 - Cloud Computing Fundamentals
 - Cloud Computing Platforms and Applications
 - Cloud Computing Enabling Technology


Proceedings will be submitted for indexation by: DBLP, Thomson Reuters, EI, 
SCOPUS and Semantic Scholar. 
 
A short list of presented papers will be selected so that revised and extended 
versions of these papers will be published by Springer.
 
All papers presented at the congress venue will also be available at the 
SCITEPRESS Digital Library (http://www.scitepress.org/DigitalLibrary/).
  
Should you have any questions, please don’t hesitate to contact me.
 

Kind regards,
CLOSER Secretariat

Address: Av. D. Manuel I, 27A, 2º esq.
2910-595 Setubal, Portugal
Tel: +351 265 520 184
Fax: +351 265 520 186
Web: http://closer.scitevents.org
e-mail: closer.secretar...@insticc.org


CFP VEHITS 2019 - 5th Int.l Conf. on Vehicle Technology and Intelligent Transport Systems (Heraklion, Crete/Greece)

2018-08-02 Thread veh...@insticc.info
SUBMISSION DEADLINE 

5th International Conference on Vehicle Technology and Intelligent Transport 
Systems

Submission Deadline: December 10, 2018

http://www.vehits.org/

May 3 - 5, 2019
Heraklion, Crete, Greece.

 VEHITS is organized in 5 major tracks:

 - Intelligent Vehicle Technologies
 - Intelligent Transport Systems and Infrastructure
 - Connected Vehicles
 - Sustainable Transport 
 - Data Analytics


Proceedings will be submitted for indexation by: DBLP, Thomson Reuters, EI, 
SCOPUS and Semantic Scholar. 
 
With the presence of internationally distinguished keynote speakers:
Javier Sánchez-Medina, University of Las Palmas de Gran Canaria, Spain
Jeroen Ploeg, 2getthere B.V., Utrecht, The Netherlands Lead Cooperative Driving 
Eindhoven University of Technology, Eindhoven, The Netherlands Associate 
professor (part time), faculty of Mechanical Engineering, Dynamics and Control 
group, Netherlands


A short list of presented papers will be selected so that revised and extended 
versions of these papers will be published by Springer.
 
All papers presented at the congress venue will also be available at the 
SCITEPRESS Digital Library (http://www.scitepress.org/DigitalLibrary/).
  
Should you have any questions, please don’t hesitate to contact me.
 

Kind regards,
VEHITS Secretariat

Address: Av. D. Manuel I, 27A, 2º esq.
2910-595 Setubal, Portugal
Tel: +351 265 520 185
Fax: +351 265 520 186
Web: http://www.vehits.org/
e-mail: vehits.secretar...@insticc.org


CFP SMARTGREENS 2019 - 8th Int.l Conf. on Smart Cities and Green ICT Systems (Heraklion, Crete/Greece)

2018-08-02 Thread smartgre...@insticc.info
SUBMISSION DEADLINE 

8th International Conference on Smart Cities and Green ICT Systems

Submission Deadline: December 10, 2018

http://www.smartgreens.org/

May 3 - 5, 2019
Heraklion, Crete, Greece.

 SMARTGREENS is organized in 5 major tracks:

 - Energy-Aware Systems and Technologies
 - Sustainable Computing and Systems
 - Smart Cities and Smart Buildings
 - Demos and Use-Cases
 - Smart and Digital Services


Proceedings will be submitted for indexation by: DBLP, Thomson Reuters, EI, 
SCOPUS and Semantic Scholar. 
 
With the presence of internationally distinguished keynote speakers:
Norbert Streitz, Founder and Scientific Director, Smart Future Initiative, 
Germany
Rudolf Giffinger, Vienna University of Technology, Austria


A short list of presented papers will be selected so that revised and extended 
versions of these papers will be published by Springer.
 
All papers presented at the congress venue will also be available at the 
SCITEPRESS Digital Library (http://www.scitepress.org/DigitalLibrary/).
  
Should you have any questions, please don’t hesitate to contact me.
 

Kind regards,
SMARTGREENS Secretariat

Address: Av. D. Manuel I, 27A, 2º esq.
2910-595 Setubal, Portugal
Tel: +351 265 520 185
Fax: +351 265 520 186
Web: http://www.smartgreens.org/
e-mail: smartgreens.secretar...@insticc.org


CFP IoTBDS 2019 - 3rd Int.l Conf. on Internet of Things, Big Data and Security (Heraklion, Crete/Greece)

2018-08-02 Thread iot...@insticc.info
SUBMISSION DEADLINE 

3rd International Conference on Internet of Things, Big Data and Security

Submission Deadline: December 10, 2018

http://iotbds.org/

May 2 - 4, 2019
Heraklion, Crete, Greece.

 IoTBDS is organized in 7 major tracks:

 - Big Data Research 
 - Emerging Services and Analytics 
 - Internet of Things (IoT) Fundamentals
 - Internet of Things (IoT) Applications
 - Big Data for Multi-discipline Services
 - Security, Privacy and Trust
 - IoT Technologies


Proceedings will be submitted for indexation by: DBLP, Thomson Reuters, EI, 
SCOPUS and Semantic Scholar. 
 
With the presence of internationally distinguished keynote speakers:
Francisco Herrera, University of Granada, Spain

A short list of presented papers will be selected so that revised and extended 
versions of these papers will be published by Springer.
 
All papers presented at the congress venue will also be available at the 
SCITEPRESS Digital Library (http://www.scitepress.org/DigitalLibrary/).
  
Should you have any questions, please don’t hesitate to contact me.
 

Kind regards,
IoTBDS Secretariat

Address: Av. D. Manuel I, 27A, 2º esq. 
2910-595 Setubal, Portugal
Tel: +351 265 520 184 
Fax: +351 265 520 186
Web: http://iotbds.org/
e-mail: iotbds.secretar...@insticc.org


CFP SENSORNETS 2019 - 8th Int.l Conf. on Sensor Networks (Prague/Czech Republic)

2018-08-02 Thread sensorn...@insticc.info
SUBMISSION DEADLINE 

8th International Conference on Sensor Networks

Submission Deadline: October 1, 2018

http://www.sensornets.org/

February 26 - 27, 2019
Prague, Czech Republic.

 SENSORNETS is organized in 5 major tracks:

 - Sensor Networks Software, Architectures and Applications
 - Wireless Sensor Networks
 - Energy and Environment
 - Intelligent Data Analysis and Processing
 - Security and Privacy in Sensor Networks


Proceedings will be submitted for indexation by: DBLP, Thomson Reuters, EI, 
SCOPUS and Semantic Scholar. 
 
A short list of presented papers will be selected so that revised and extended 
versions of these papers will be published by Springer.
 
All papers presented at the congress venue will also be available at the 
SCITEPRESS Digital Library (http://www.scitepress.org/DigitalLibrary/).
  
Should you have any questions, please don’t hesitate to contact me.
 

Kind regards,
SENSORNETS Secretariat

Address: Av. D. Manuel I, 27A, 2º esq.
2910-595 Setubal, Portugal
Tel: +351 265 100 033
Fax: +351 265 520 186
Web: http://www.sensornets.org/
e-mail: sensornets.secretar...@insticc.org


Re: [PATCH v2 2/2] virtio_balloon: replace oom notifier with shrinker

2018-08-02 Thread Michal Hocko
On Thu 02-08-18 19:27:40, Wei Wang wrote:
> On 08/02/2018 07:00 PM, Tetsuo Handa wrote:
> > On 2018/08/02 19:32, Wei Wang wrote:
> > > On 08/01/2018 07:34 PM, Michal Hocko wrote:
> > > > Do you have any numbers for how does this work in practice?
> > > It works in this way: for example, we can set the parameter, 
> > > balloon_pages_to_shrink,
> > > to shrink 1GB memory once shrink scan is called. Now, we have an 8GB
> > > guest, and we balloon
> > > out 7GB. When shrink scan is called, the balloon driver will get back 1GB 
> > > memory and give
> > > them back to mm, then the ballooned memory becomes 6GB.
> > Since shrinker might be called concurrently (am I correct?),
> 
> Not sure about it being called concurrently, but I think it would be called
> repeatedly as long as should_continue_reclaim() returns true.

Multiple direct reclaimers might indeed invoke it concurrently.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 2/2] virtio_balloon: replace oom notifier with shrinker

2018-08-02 Thread Wei Wang

On 08/02/2018 07:00 PM, Tetsuo Handa wrote:

On 2018/08/02 19:32, Wei Wang wrote:

On 08/01/2018 07:34 PM, Michal Hocko wrote:

Do you have any numbers for how does this work in practice?

It works in this way: for example, we can set the parameter,
balloon_pages_to_shrink, to shrink 1GB of memory once shrink scan is
called. Now, we have an 8GB guest, and we balloon out 7GB. When shrink
scan is called, the balloon driver will get back 1GB of memory and give
it back to mm, then the ballooned memory becomes 6GB.

Since shrinker might be called concurrently (am I correct?),


Not sure about it being called concurrently, but I think it would be called
repeatedly as long as should_continue_reclaim() returns true.



Best,
Wei


Re: [PATCH v2 2/2] virtio_balloon: replace oom notifier with shrinker

2018-08-02 Thread Wei Wang

On 08/01/2018 07:34 PM, Michal Hocko wrote:

On Wed 01-08-18 19:12:25, Wei Wang wrote:

On 07/30/2018 05:00 PM, Michal Hocko wrote:

On Fri 27-07-18 17:24:55, Wei Wang wrote:

The OOM notifier is getting deprecated to use for the reasons mentioned
here by Michal Hocko: https://lkml.org/lkml/2018/7/12/314

This patch replaces the virtio-balloon oom notifier with a shrinker
to release balloon pages on memory pressure.

It would be great to document the replacement. This is not a small
change...

OK. I plan to document the following to the commit log:

   The OOM notifier is getting deprecated to use for the reasons:
 - As a callout from the oom context, it is too subtle and easy to
   generate bugs and corner cases which are hard to track;
 - It is called too late (after the reclaiming has been performed).
   Drivers with a large amount of reclaimable memory are expected to
   release it at an early stage of memory pressure;
 - The notifier callback isn't aware of the oom constraints;
 Link: https://lkml.org/lkml/2018/7/12/314

 This patch replaces the virtio-balloon oom notifier with a shrinker
 to release balloon pages on memory pressure. Users can set the amount of
 memory pages to release each time a shrinker_scan is called via the
 module parameter balloon_pages_to_shrink, and the default amount is 256
 pages. Historically, the feature VIRTIO_BALLOON_F_DEFLATE_ON_OOM has
 been used to release balloon pages on OOM. We continue to use this
 feature bit for the shrinker, so the shrinker is only registered when
 this feature bit has been negotiated with host.
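
As a concrete illustration of the parameter and the feature-gated registration
described above (sketch only: the vb->shrinker member and the count/scan
callbacks are assumed to come from the patch):

static unsigned long balloon_pages_to_shrink = 256;
module_param(balloon_pages_to_shrink, ulong, 0644);
MODULE_PARM_DESC(balloon_pages_to_shrink,
		 "Balloon pages to release on each shrinker scan");

/* called from virtballoon_probe() after feature negotiation */
static int virtio_balloon_register_shrinker(struct virtio_balloon *vb)
{
	/* only register when the feature bit was negotiated with the host,
	 * matching the historical DEFLATE_ON_OOM behaviour */
	if (!virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
		return 0;

	vb->shrinker.count_objects = virtio_balloon_shrink_count;
	vb->shrinker.scan_objects  = virtio_balloon_shrink_scan;
	vb->shrinker.seeks = DEFAULT_SEEKS;

	return register_shrinker(&vb->shrinker);
}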

Do you have any numbers for how does this work in practice?


It works in this way: for example, we can set the parameter, 
balloon_pages_to_shrink, to shrink 1GB memory once shrink scan is 
called. Now, we have an 8GB guest, and we balloon out 7GB. When shrink
scan is called, the balloon driver will get back 1GB of memory and give
it back to mm, then the ballooned memory becomes 6GB.


When the shrinker scan is called the second time, another 1GB will be 
given back to mm. So the ballooned pages are given back to mm gradually.



Let's say
you have a medium page cache workload which triggers kswapd to do a
light reclaim? Hardcoded shrinking sounds quite dubious to me but I have
no idea how people expect this to work. Shouldn't this be more
adaptive? How precious are those pages anyway?


Those pages are usually given to the host because the guest has
enough free memory, and the host doesn't want to waste memory that
this guest isn't using. When the guest needs them, it
is reasonable that the guest has higher priority to take them back.
But I'm not sure whether there is a more adaptive approach than
"gradually giving back as the guest wants more".


Best,
Wei


Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()

2018-08-02 Thread Toshiaki Makita
On 2018/08/02 18:23, Jason Wang wrote:
> On 2018年08月02日 16:41, Toshiaki Makita wrote:
>> On 2018/08/02 17:18, Jason Wang wrote:
>>> On 2018年08月01日 17:52, Tonghao Zhang wrote:
> +static void vhost_net_busy_poll_check(struct vhost_net *net,
> +   struct vhost_virtqueue *rvq,
> +   struct vhost_virtqueue *tvq,
> +   bool rx)
> +{
> + struct socket *sock = rvq->private_data;
> +
> + if (rx)
> + vhost_net_busy_poll_try_queue(net, tvq);
> + else if (sock && sk_has_rx_data(sock->sk))
> + vhost_net_busy_poll_try_queue(net, rvq);
> + else {
> + /* On tx here, sock has no rx data, so we
> +  * will wait for sock wakeup for rx, and
> +  * vhost_enable_notify() is not needed. */
 A possible case is we do have rx data but guest does not refill the rx
 queue. In this case we may lose notifications from guest.
>>> Yes, should consider this case. thanks.
>> I'm a bit confused. Isn't this covered by the previous
>> "else if (sock && sk_has_rx_data(...))" block?
> 
> The problem is it does nothing if vhost_vq_avail_empty() is true and
> vhost_enable_notify() is false.

If vhost_enable_notify() is false, the guest will eventually kick the vq, no?

-- 
Toshiaki Makita


Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()

2018-08-02 Thread Jason Wang



On 2018年08月02日 16:41, Toshiaki Makita wrote:

On 2018/08/02 17:18, Jason Wang wrote:

On 2018年08月01日 17:52, Tonghao Zhang wrote:

+static void vhost_net_busy_poll_check(struct vhost_net *net,
+   struct vhost_virtqueue *rvq,
+   struct vhost_virtqueue *tvq,
+   bool rx)
+{
+ struct socket *sock = rvq->private_data;
+
+ if (rx)
+ vhost_net_busy_poll_try_queue(net, tvq);
+ else if (sock && sk_has_rx_data(sock->sk))
+ vhost_net_busy_poll_try_queue(net, rvq);
+ else {
+ /* On tx here, sock has no rx data, so we
+  * will wait for sock wakeup for rx, and
+  * vhost_enable_notify() is not needed. */

A possible case is we do have rx data but guest does not refill the rx
queue. In this case we may lose notifications from guest.

Yes, should consider this case. thanks.

I'm a bit confused. Isn't this covered by the previous
"else if (sock && sk_has_rx_data(...))" block?


The problem is it does nothing if vhost_vq_avail_empty() is true and 
vhost_enable_notify() is false.





+
+ cpu_relax();
+ }
+
+ preempt_enable();
+
+ if (!rx)
+ vhost_net_enable_vq(net, vq);

No need to enable rx virtqueue, if we are sure handle_rx() will be
called soon.

If we disable rx virtqueue in handle_tx and don't send packets from
guest anymore(handle_tx is not called), so we can wake up for sock rx.
so the network is broken.

> Not sure I understand here. I mean if we schedule work for handle_rx(),
there's no need to enable it since handle_rx() will do this for us.

Looks like in the last "else" block in vhost_net_busy_poll_check() we
need to enable vq since in that case we have no rx data and handle_rx()
is not scheduled.



Yes.

Thanks

Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()

2018-08-02 Thread Toshiaki Makita
On 2018/08/02 17:18, Jason Wang wrote:
> On 2018年08月01日 17:52, Tonghao Zhang wrote:
>>> +static void vhost_net_busy_poll_check(struct vhost_net *net,
>>> +   struct vhost_virtqueue *rvq,
>>> +   struct vhost_virtqueue *tvq,
>>> +   bool rx)
>>> +{
>>> + struct socket *sock = rvq->private_data;
>>> +
>>> + if (rx)
>>> + vhost_net_busy_poll_try_queue(net, tvq);
>>> + else if (sock && sk_has_rx_data(sock->sk))
>>> + vhost_net_busy_poll_try_queue(net, rvq);
>>> + else {
>>> + /* On tx here, sock has no rx data, so we
>>> +  * will wait for sock wakeup for rx, and
>>> +  * vhost_enable_notify() is not needed. */
>>
>> A possible case is we do have rx data but guest does not refill the rx
>> queue. In this case we may lose notifications from guest.
> Yes, should consider this case. thanks.

I'm a bit confused. Isn't this covered by the previous
"else if (sock && sk_has_rx_data(...))" block?

 +
 + cpu_relax();
 + }
 +
 + preempt_enable();
 +
 + if (!rx)
 + vhost_net_enable_vq(net, vq);
>>> No need to enable rx virtqueue, if we are sure handle_rx() will be
>>> called soon.
>> If we disable rx virtqueue in handle_tx and don't send packets from
>> guest anymore(handle_tx is not called), so we can wake up for sock rx.
>> so the network is broken.
> 
> Not sure I understand here. I mean if we schedule work for handle_rx(),
> there's no need to enable it since handle_rx() will do this for us.

Looks like in the last "else" block in vhost_net_busy_poll_check() we
need to enable vq since in that case we have no rx data and handle_rx()
is not scheduled.
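
A sketch of how the check could look with that fixed, reusing the helper names
from the posted patch (vhost_net_busy_poll_try_queue(), sk_has_rx_data(),
vhost_net_enable_vq()); illustration only, not the final code:

static void vhost_net_busy_poll_check(struct vhost_net *net,
				      struct vhost_virtqueue *rvq,
				      struct vhost_virtqueue *tvq,
				      bool rx)
{
	struct socket *sock = rvq->private_data;

	if (rx) {
		/* busy polling on rx: kick tx work if the guest queued buffers */
		vhost_net_busy_poll_try_queue(net, tvq);
	} else if (sock && sk_has_rx_data(sock->sk)) {
		/* busy polling on tx and the socket has rx data: kick rx work */
		vhost_net_busy_poll_try_queue(net, rvq);
	} else {
		/* tx with no pending rx data: handle_rx() will not be
		 * scheduled, so re-enable the rx virtqueue here and rely on
		 * the socket wakeup for future rx */
		vhost_net_enable_vq(net, rvq);
	}
}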

-- 
Toshiaki Makita


Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()

2018-08-02 Thread Jason Wang



On 2018年08月01日 17:52, Tonghao Zhang wrote:

+
+ cpu_relax();
+ }
+
+ preempt_enable();
+
+ if (!rx)
+ vhost_net_enable_vq(net, vq);

No need to enable rx virtqueue, if we are sure handle_rx() will be
called soon.

If we disable rx virtqueue in handle_tx and don't send packets from
guest anymore(handle_tx is not called), so we can wake up for sock rx.
so the network is broken.


Not sure I understand here. I mean if we schedule work for handle_rx(),
there's no need to enable it since handle_rx() will do this for us.


Thanks