Re: [PATCH v2 2/2] virtio_balloon: replace oom notifier with shrinker
On 08/02/2018 07:47 PM, Michal Hocko wrote: On Thu 02-08-18 18:32:44, Wei Wang wrote: On 08/01/2018 07:34 PM, Michal Hocko wrote: On Wed 01-08-18 19:12:25, Wei Wang wrote: On 07/30/2018 05:00 PM, Michal Hocko wrote: On Fri 27-07-18 17:24:55, Wei Wang wrote:

The OOM notifier is being deprecated for the reasons mentioned here by Michal Hocko: https://lkml.org/lkml/2018/7/12/314 This patch replaces the virtio-balloon oom notifier with a shrinker to release balloon pages on memory pressure.

It would be great to document the replacement. This is not a small change...

OK. I plan to add the following to the commit log:

The OOM notifier is being deprecated for these reasons:
- As a callout from the oom context, it is too subtle and easy to generate bugs and corner cases which are hard to track;
- It is called too late (after reclaim has already been performed). Drivers with a large amount of reclaimable memory are expected to release it at an early stage of memory pressure;
- The notifier callback isn't aware of the OOM constraints.

Link: https://lkml.org/lkml/2018/7/12/314

This patch replaces the virtio-balloon oom notifier with a shrinker to release balloon pages on memory pressure. Users can set the number of memory pages to release each time shrinker_scan is called via the module parameter balloon_pages_to_shrink; the default is 256 pages. Historically, the feature VIRTIO_BALLOON_F_DEFLATE_ON_OOM has been used to release balloon pages on OOM. We continue to use this feature bit for the shrinker, so the shrinker is only registered when this feature bit has been negotiated with the host.

Do you have any numbers for how this works in practice?

It works in this way: for example, we can set the parameter balloon_pages_to_shrink to shrink 1GB of memory each time the shrink scan is called. Say we have an 8GB guest, and we balloon out 7GB.
When the shrink scan is called, the balloon driver will get back 1GB of memory and give it back to mm, so the ballooned memory becomes 6GB. When the shrinker scan is called a second time, another 1GB is given back to mm. So the ballooned pages are returned to mm gradually.

Let's say you have a medium page cache workload which triggers kswapd to do a light reclaim? Hardcoded shrinking sounds quite dubious to me, but I have no idea how people expect this to work. Shouldn't this be more adaptive? How precious are those pages anyway?

Those pages are given to the host to use, usually because the guest has enough free memory, and the host doesn't want to waste those pieces of memory as they are not used by this guest. When the guest needs them, it is reasonable that the guest has higher priority to take them back. But I'm not sure there is a more adaptive approach than "gradually giving back as the guest wants more".

I am not sure I follow. Let me be more specific. Say you have a trivial streaming IO workload triggering reclaim to recycle clean page cache. This will invoke slab shrinkers as well. Do you really want to drop your batch of pages on each invocation? Doesn't that remove them very quickly? Just try dd if=large_file of=/dev/null and see how your pages disappear. Shrinkers usually scale the number of objects they are going to reclaim based on the memory pressure (aka the target to be reclaimed).

Thanks, I think it looks better to make it more adaptive. I'll send out a new version for review.

Best, Wei

___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
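The adaptive behavior Michal describes, letting the reclaim core scale the target instead of freeing a hardcoded batch, can be sketched in plain C. This is a hypothetical userspace model, not the driver code; in the real shrinker API the target arrives in shrink_control->nr_to_scan:

```c
#include <assert.h>

/* Hypothetical model of an adaptive shrinker scan: free at most the
 * number of pages the reclaim core asks for (nr_to_scan), rather than
 * a fixed balloon_pages_to_shrink batch, so a light kswapd reclaim
 * only deflates the balloon a little. */
static unsigned long balloon_shrink_scan(unsigned long ballooned_pages,
					 unsigned long nr_to_scan)
{
	return ballooned_pages < nr_to_scan ? ballooned_pages : nr_to_scan;
}
```

Under this policy a streaming dd workload that triggers many light shrinker invocations would trim only a few pages per call, instead of dropping a full 1GB batch each time.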
Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()
On 2018/08/03 14:07, Jason Wang wrote: > On 2018年08月03日 12:04, Tonghao Zhang wrote: >> On Fri, Aug 3, 2018 at 11:43 AM Jason Wang wrote: >>> >>> On 2018年08月03日 11:24, Tonghao Zhang wrote: On Fri, Aug 3, 2018 at 11:07 AM Jason Wang wrote: > On 2018年08月03日 10:51, Tonghao Zhang wrote: >> On Thu, Aug 2, 2018 at 5:23 PM Jason Wang >> wrote: >>> On 2018年08月02日 16:41, Toshiaki Makita wrote: On 2018/08/02 17:18, Jason Wang wrote: > On 2018年08月01日 17:52, Tonghao Zhang wrote: >>> +static void vhost_net_busy_poll_check(struct vhost_net *net, >>> + struct vhost_virtqueue *rvq, >>> + struct vhost_virtqueue *tvq, >>> + bool rx) >>> +{ >>> + struct socket *sock = rvq->private_data; >>> + >>> + if (rx) >>> + vhost_net_busy_poll_try_queue(net, tvq); >>> + else if (sock && sk_has_rx_data(sock->sk)) >>> + vhost_net_busy_poll_try_queue(net, rvq); >>> + else { >>> + /* On tx here, sock has no rx data, so we >>> + * will wait for sock wakeup for rx, and >>> + * vhost_enable_notify() is not needed. */ >> A possible case is we do have rx data but guest does not >> refill the rx >> queue. In this case we may lose notifications from guest. > Yes, should consider this case. thanks. I'm a bit confused. Isn't this covered by the previous "else if (sock && sk_has_rx_data(...))" block? >>> The problem is it does nothing if vhost_vq_avail_empty() is true and >>> vhost_enble_notify() is false. >>> + + cpu_relax(); + } + + preempt_enable(); + + if (!rx) + vhost_net_enable_vq(net, vq); >>> No need to enable rx virtqueue, if we are sure handle_rx() >>> will be >>> called soon. >> If we disable rx virtqueue in handle_tx and don't send packets >> from >> guest anymore(handle_tx is not called), so we can wake up for >> sock rx. >> so the network is broken. > Not sure I understand here. I mean is we schedule work for > handle_rx(), > there's no need to enable it since handle_rx() will do this for > us. 
Looks like in the last "else" block in vhost_net_busy_poll_check() we need to enable vq since in that case we have no rx data and handle_rx() is not scheduled. >>> Yes. >> So we will use the vhost_has_work() to check whether or not the >> handle_rx is scheduled ? >> If we use the vhost_has_work(), the work in the dev work_list may be >> rx work, or tx work, right ? > Yes. We can add a boolean to record whether or not we've called > vhost_poll_queue() for rvq. And avoid calling vhost_net_enable_vq() if > it was true. so, the commit be294a51a "vhost_net: Avoid rx queue wake-ups during busypoll" may not consider the case: work is tx work in the dev work list. >>> So two kinds of work, tx kick or tx wakeup. >>> >>> For tx kick, we check vhost_vq_avail_empty() and avoid unnecessary kicks >>> by not enabling kick if we found something is pending on txq. For tx >>> wakeup, yes, the commit does not consider it. And that's why we want to >>> disable tx wakeups during busy polling. >> And in the handle_rx but not busy polling, the tx can wakeup anytime >> and the tx work will be added to dev work list. In that case, if we >> add >> the rx poll to the queue, it is necessary ? the commit be294a51a may >> check whether the rx work is in the dev work list. > > I think the point this we don't poll rx during tx at that time. So if rx > poll is interrupted, we should reschedule handle_rx(). After we poll rx > on handle_tx(), we can try to optimize this on top. That's true. We may be able to skip poll_queue in handle_rx/tx after rx/tx busypolling is unified by this patch set. -- Toshiaki Makita ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()
On 2018年08月03日 12:04, Tonghao Zhang wrote: On Fri, Aug 3, 2018 at 11:43 AM Jason Wang wrote: On 2018年08月03日 11:24, Tonghao Zhang wrote: On Fri, Aug 3, 2018 at 11:07 AM Jason Wang wrote: On 2018年08月03日 10:51, Tonghao Zhang wrote: On Thu, Aug 2, 2018 at 5:23 PM Jason Wang wrote: On 2018年08月02日 16:41, Toshiaki Makita wrote: On 2018/08/02 17:18, Jason Wang wrote: On 2018年08月01日 17:52, Tonghao Zhang wrote: +static void vhost_net_busy_poll_check(struct vhost_net *net, + struct vhost_virtqueue *rvq, + struct vhost_virtqueue *tvq, + bool rx) +{ + struct socket *sock = rvq->private_data; + + if (rx) + vhost_net_busy_poll_try_queue(net, tvq); + else if (sock && sk_has_rx_data(sock->sk)) + vhost_net_busy_poll_try_queue(net, rvq); + else { + /* On tx here, sock has no rx data, so we + * will wait for sock wakeup for rx, and + * vhost_enable_notify() is not needed. */ A possible case is we do have rx data but guest does not refill the rx queue. In this case we may lose notifications from guest. Yes, should consider this case. thanks. I'm a bit confused. Isn't this covered by the previous "else if (sock && sk_has_rx_data(...))" block? The problem is it does nothing if vhost_vq_avail_empty() is true and vhost_enble_notify() is false. + + cpu_relax(); + } + + preempt_enable(); + + if (!rx) + vhost_net_enable_vq(net, vq); No need to enable rx virtqueue, if we are sure handle_rx() will be called soon. If we disable rx virtqueue in handle_tx and don't send packets from guest anymore(handle_tx is not called), so we can wake up for sock rx. so the network is broken. Not sure I understand here. I mean is we schedule work for handle_rx(), there's no need to enable it since handle_rx() will do this for us. Looks like in the last "else" block in vhost_net_busy_poll_check() we need to enable vq since in that case we have no rx data and handle_rx() is not scheduled. Yes. So we will use the vhost_has_work() to check whether or not the handle_rx is scheduled ? 
If we use vhost_has_work(), the work in the dev work_list may be rx work or tx work, right?

Yes. We can add a boolean to record whether or not we've called vhost_poll_queue() for rvq, and avoid calling vhost_net_enable_vq() if it was true.

So the commit be294a51a "vhost_net: Avoid rx queue wake-ups during busypoll" may not consider the case where the work in the dev work list is tx work.

So two kinds of work, tx kick or tx wakeup. For tx kick, we check vhost_vq_avail_empty() and avoid unnecessary kicks by not enabling the kick if we found something pending on the txq. For tx wakeup, yes, the commit does not consider it. And that's why we want to disable tx wakeups during busy polling.

And in handle_rx, when not busy polling, tx can wake up anytime and the tx work will be added to the dev work list. In that case, is it necessary to add the rx poll to the queue? The commit be294a51a could check whether the rx work is in the dev work list.

I think the point is we don't poll rx during tx at that time. So if rx poll is interrupted, we should reschedule handle_rx(). After we poll rx in handle_tx(), we can try to optimize this on top.

Thanks
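The decision being debated in vhost_net_busy_poll_check() can be modeled as a small userspace sketch. The names and the boolean flag are hypothetical, following Jason's suggestion of recording whether vhost_poll_queue() ran for rvq so the caller can skip vhost_net_enable_vq():

```c
#include <assert.h>
#include <stdbool.h>

enum poll_action { QUEUE_TX_WORK, QUEUE_RX_WORK, ENABLE_RX_VQ };

/* Hypothetical model of the control flow under discussion: on the tx
 * path, queue rx work when the socket has pending rx data; otherwise
 * there is no rx data, handle_rx() will not be scheduled, and rx
 * virtqueue wakeups must be re-enabled so the sock wakeup is not lost.
 * rx_work_queued records that rx work was queued, letting the caller
 * skip the enable. */
static enum poll_action busy_poll_check(bool rx, bool sock_has_rx_data,
					bool *rx_work_queued)
{
	if (rx)
		return QUEUE_TX_WORK;

	if (sock_has_rx_data) {
		/* handle_rx() will run soon; no need to enable the vq. */
		*rx_work_queued = true;
		return QUEUE_RX_WORK;
	}

	/* tx path, no rx data: we must wait for the sock wakeup, so the
	 * rx virtqueue has to be enabled. */
	return ENABLE_RX_VQ;
}
```

The last branch is the "else" block Makita points at: it is the only case where neither rx work nor tx-side data guarantees handle_rx() will be scheduled.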
Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()
On 2018/08/03 13:14, Tonghao Zhang wrote: > On Fri, Aug 3, 2018 at 11:40 AM Toshiaki Makita > wrote: >> >> On 2018/08/03 12:24, Tonghao Zhang wrote: >>> On Fri, Aug 3, 2018 at 11:07 AM Jason Wang wrote: On 2018年08月03日 10:51, Tonghao Zhang wrote: > On Thu, Aug 2, 2018 at 5:23 PM Jason Wang wrote: >> On 2018年08月02日 16:41, Toshiaki Makita wrote: >>> On 2018/08/02 17:18, Jason Wang wrote: On 2018年08月01日 17:52, Tonghao Zhang wrote: >> +static void vhost_net_busy_poll_check(struct vhost_net *net, >> + struct vhost_virtqueue *rvq, >> + struct vhost_virtqueue *tvq, >> + bool rx) >> +{ >> + struct socket *sock = rvq->private_data; >> + >> + if (rx) >> + vhost_net_busy_poll_try_queue(net, tvq); >> + else if (sock && sk_has_rx_data(sock->sk)) >> + vhost_net_busy_poll_try_queue(net, rvq); >> + else { >> + /* On tx here, sock has no rx data, so we >> + * will wait for sock wakeup for rx, and >> + * vhost_enable_notify() is not needed. */ > A possible case is we do have rx data but guest does not refill the rx > queue. In this case we may lose notifications from guest. Yes, should consider this case. thanks. >>> I'm a bit confused. Isn't this covered by the previous >>> "else if (sock && sk_has_rx_data(...))" block? >> The problem is it does nothing if vhost_vq_avail_empty() is true and >> vhost_enble_notify() is false. >> >>> + >>> + cpu_relax(); >>> + } >>> + >>> + preempt_enable(); >>> + >>> + if (!rx) >>> + vhost_net_enable_vq(net, vq); >> No need to enable rx virtqueue, if we are sure handle_rx() will be >> called soon. > If we disable rx virtqueue in handle_tx and don't send packets from > guest anymore(handle_tx is not called), so we can wake up for sock rx. > so the network is broken. Not sure I understand here. I mean is we schedule work for handle_rx(), there's no need to enable it since handle_rx() will do this for us. 
>>> Looks like in the last "else" block in vhost_net_busy_poll_check() we >>> need to enable vq since in that case we have no rx data and handle_rx() >>> is not scheduled. >>> >> Yes. > So we will use the vhost_has_work() to check whether or not the > handle_rx is scheduled ? > If we use the vhost_has_work(), the work in the dev work_list may be > rx work, or tx work, right ? Yes. We can add a boolean to record whether or not we've called vhost_poll_queue() for rvq. And avoid calling vhost_net_enable_vq() if it was true. >>> so, the commit be294a51a "vhost_net: Avoid rx queue wake-ups during >>> busypoll" >>> may not consider the case: work is tx work in the dev work list. >> >> Not sure what you are concerned but what I can say is that we need to >> poll rx work if vhost_has_work() detects tx work in >> vhost_net_rx_peek_head_len() since rx busypoll exits prematurely in that >> case. > In the handle_rx, when we busy poll, the vhost_has_work() return true, > because the tx but not rx work is in the dev work list. > and it is the most case, because tx work may be added to dev work list > anytime(not during busy poll) when guest kick the vhost-net. > so it is not necessary to add it., right ? I'm lost. What is the part you think is not needed? 1. When there is a tx work we exit rx busypoll. 2. When we exit rx busypoll by tx work, we poll rx work (so that we can continue rx busypoll later). -- Toshiaki Makita ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()
On Fri, Aug 3, 2018 at 11:40 AM Toshiaki Makita wrote: > > On 2018/08/03 12:24, Tonghao Zhang wrote: > > On Fri, Aug 3, 2018 at 11:07 AM Jason Wang wrote: > >> On 2018年08月03日 10:51, Tonghao Zhang wrote: > >>> On Thu, Aug 2, 2018 at 5:23 PM Jason Wang wrote: > On 2018年08月02日 16:41, Toshiaki Makita wrote: > > On 2018/08/02 17:18, Jason Wang wrote: > >> On 2018年08月01日 17:52, Tonghao Zhang wrote: > +static void vhost_net_busy_poll_check(struct vhost_net *net, > + struct vhost_virtqueue *rvq, > + struct vhost_virtqueue *tvq, > + bool rx) > +{ > + struct socket *sock = rvq->private_data; > + > + if (rx) > + vhost_net_busy_poll_try_queue(net, tvq); > + else if (sock && sk_has_rx_data(sock->sk)) > + vhost_net_busy_poll_try_queue(net, rvq); > + else { > + /* On tx here, sock has no rx data, so we > + * will wait for sock wakeup for rx, and > + * vhost_enable_notify() is not needed. */ > >>> A possible case is we do have rx data but guest does not refill the rx > >>> queue. In this case we may lose notifications from guest. > >> Yes, should consider this case. thanks. > > I'm a bit confused. Isn't this covered by the previous > > "else if (sock && sk_has_rx_data(...))" block? > The problem is it does nothing if vhost_vq_avail_empty() is true and > vhost_enble_notify() is false. > > > + > > + cpu_relax(); > > + } > > + > > + preempt_enable(); > > + > > + if (!rx) > > + vhost_net_enable_vq(net, vq); > No need to enable rx virtqueue, if we are sure handle_rx() will be > called soon. > >>> If we disable rx virtqueue in handle_tx and don't send packets from > >>> guest anymore(handle_tx is not called), so we can wake up for sock rx. > >>> so the network is broken. > >> Not sure I understand here. I mean is we schedule work for handle_rx(), > >> there's no need to enable it since handle_rx() will do this for us. > > Looks like in the last "else" block in vhost_net_busy_poll_check() we > > need to enable vq since in that case we have no rx data and handle_rx() > > is not scheduled. 
> > > Yes. > >>> So we will use the vhost_has_work() to check whether or not the > >>> handle_rx is scheduled ? > >>> If we use the vhost_has_work(), the work in the dev work_list may be > >>> rx work, or tx work, right ? > >> > >> Yes. We can add a boolean to record whether or not we've called > >> vhost_poll_queue() for rvq. And avoid calling vhost_net_enable_vq() if > >> it was true. > > so, the commit be294a51a "vhost_net: Avoid rx queue wake-ups during > > busypoll" > > may not consider the case: work is tx work in the dev work list. > > Not sure what you are concerned but what I can say is that we need to > poll rx work if vhost_has_work() detects tx work in > vhost_net_rx_peek_head_len() since rx busypoll exits prematurely in that > case. In the handle_rx, when we busy poll, the vhost_has_work() return true, because the tx but not rx work is in the dev work list. and it is the most case, because tx work may be added to dev work list anytime(not during busy poll) when guest kick the vhost-net. so it is not necessary to add it., right ? > >> So here's the needed changes: > >> > >> 1) Split the wakeup disabling to another patch > >> 2) Squash the vhost_net_busy_poll_try_queue() and > >> vhost_net_busy_poll_check() into vhost_net_busy_poll() and reduce > >> duplicated checks. > >> 3) If possible, rename the boolean rx in vhost_net_busy_poll() to > >> poll_rx, this makes code more readable. > > OK > >> Thanks > >> > Thanks > >> > > > > > > -- > Toshiaki Makita > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()
On Fri, Aug 3, 2018 at 11:43 AM Jason Wang wrote: > > > > On 2018年08月03日 11:24, Tonghao Zhang wrote: > > On Fri, Aug 3, 2018 at 11:07 AM Jason Wang wrote: > >> > >> > >> On 2018年08月03日 10:51, Tonghao Zhang wrote: > >>> On Thu, Aug 2, 2018 at 5:23 PM Jason Wang wrote: > > On 2018年08月02日 16:41, Toshiaki Makita wrote: > > On 2018/08/02 17:18, Jason Wang wrote: > >> On 2018年08月01日 17:52, Tonghao Zhang wrote: > +static void vhost_net_busy_poll_check(struct vhost_net *net, > + struct vhost_virtqueue *rvq, > + struct vhost_virtqueue *tvq, > + bool rx) > +{ > + struct socket *sock = rvq->private_data; > + > + if (rx) > + vhost_net_busy_poll_try_queue(net, tvq); > + else if (sock && sk_has_rx_data(sock->sk)) > + vhost_net_busy_poll_try_queue(net, rvq); > + else { > + /* On tx here, sock has no rx data, so we > + * will wait for sock wakeup for rx, and > + * vhost_enable_notify() is not needed. */ > >>> A possible case is we do have rx data but guest does not refill the rx > >>> queue. In this case we may lose notifications from guest. > >> Yes, should consider this case. thanks. > > I'm a bit confused. Isn't this covered by the previous > > "else if (sock && sk_has_rx_data(...))" block? > The problem is it does nothing if vhost_vq_avail_empty() is true and > vhost_enble_notify() is false. > > > + > > + cpu_relax(); > > + } > > + > > + preempt_enable(); > > + > > + if (!rx) > > + vhost_net_enable_vq(net, vq); > No need to enable rx virtqueue, if we are sure handle_rx() will be > called soon. > >>> If we disable rx virtqueue in handle_tx and don't send packets from > >>> guest anymore(handle_tx is not called), so we can wake up for sock rx. > >>> so the network is broken. > >> Not sure I understand here. I mean is we schedule work for handle_rx(), > >> there's no need to enable it since handle_rx() will do this for us. 
> > Looks like in the last "else" block in vhost_net_busy_poll_check() we > > need to enable vq since in that case we have no rx data and handle_rx() > > is not scheduled. > > > Yes. > >>> So we will use the vhost_has_work() to check whether or not the > >>> handle_rx is scheduled ? > >>> If we use the vhost_has_work(), the work in the dev work_list may be > >>> rx work, or tx work, right ? > >> Yes. We can add a boolean to record whether or not we've called > >> vhost_poll_queue() for rvq. And avoid calling vhost_net_enable_vq() if > >> it was true. > > so, the commit be294a51a "vhost_net: Avoid rx queue wake-ups during > > busypoll" > > may not consider the case: work is tx work in the dev work list. > > So two kinds of work, tx kick or tx wakeup. > > For tx kick, we check vhost_vq_avail_empty() and avoid unnecessary kicks > by not enabling kick if we found something is pending on txq. For tx > wakeup, yes, the commit does not consider it. And that's why we want to > disable tx wakeups during busy polling. And in the handle_rx but not busy polling, the tx can wakeup anytime and the tx work will be added to dev work list. In that case, if we add the rx poll to the queue, it is necessary ? the commit be294a51a may check whether the rx work is in the dev work list. > Thanks > > > > >> So here's the needed changes: > >> > >> 1) Split the wakeup disabling to another patch > >> 2) Squash the vhost_net_busy_poll_try_queue() and > >> vhost_net_busy_poll_check() into vhost_net_busy_poll() and reduce > >> duplicated checks. > >> 3) If possible, rename the boolean rx in vhost_net_busy_poll() to > >> poll_rx, this makes code more readable. > > OK > >> Thanks > >> > Thanks > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()
On 2018年08月03日 11:32, Toshiaki Makita wrote: On 2018/08/03 12:07, Jason Wang wrote: On 2018年08月02日 17:23, Jason Wang wrote:

No need to enable the rx virtqueue if we are sure handle_rx() will be called soon.

If we disable the rx virtqueue in handle_tx and the guest doesn't send packets anymore (handle_tx is not called), then we can only wake up for sock rx, and the network is broken.

Not sure I understand here. I mean if we schedule work for handle_rx(), there's no need to enable it since handle_rx() will do this for us.

Looks like in the last "else" block in vhost_net_busy_poll_check() we need to enable the vq, since in that case we have no rx data and handle_rx() is not scheduled.

Rethinking about this, it looks like it's not needed. We enable rx wakeups in this case, so if there's pending data, handle_rx() will be scheduled after vhost_net_enable_vq().

You are right, but what I wanted to say is that vhost_net_enable_vq() should be needed (I was talking about what would happen if vhost_net_enable_vq() were removed). Also, I think we should move vhost_net_enable_vq() from vhost_net_busy_poll() to this last "else" block, because this is the case where rx wakeups are required. Anyway, this part will be refactored, so let's see what this code looks like in the next version.

I get your point. Thanks
Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()
On 2018/08/03 12:24, Tonghao Zhang wrote: > On Fri, Aug 3, 2018 at 11:07 AM Jason Wang wrote: >> On 2018年08月03日 10:51, Tonghao Zhang wrote: >>> On Thu, Aug 2, 2018 at 5:23 PM Jason Wang wrote: On 2018年08月02日 16:41, Toshiaki Makita wrote: > On 2018/08/02 17:18, Jason Wang wrote: >> On 2018年08月01日 17:52, Tonghao Zhang wrote: +static void vhost_net_busy_poll_check(struct vhost_net *net, + struct vhost_virtqueue *rvq, + struct vhost_virtqueue *tvq, + bool rx) +{ + struct socket *sock = rvq->private_data; + + if (rx) + vhost_net_busy_poll_try_queue(net, tvq); + else if (sock && sk_has_rx_data(sock->sk)) + vhost_net_busy_poll_try_queue(net, rvq); + else { + /* On tx here, sock has no rx data, so we + * will wait for sock wakeup for rx, and + * vhost_enable_notify() is not needed. */ >>> A possible case is we do have rx data but guest does not refill the rx >>> queue. In this case we may lose notifications from guest. >> Yes, should consider this case. thanks. > I'm a bit confused. Isn't this covered by the previous > "else if (sock && sk_has_rx_data(...))" block? The problem is it does nothing if vhost_vq_avail_empty() is true and vhost_enble_notify() is false. > + > + cpu_relax(); > + } > + > + preempt_enable(); > + > + if (!rx) > + vhost_net_enable_vq(net, vq); No need to enable rx virtqueue, if we are sure handle_rx() will be called soon. >>> If we disable rx virtqueue in handle_tx and don't send packets from >>> guest anymore(handle_tx is not called), so we can wake up for sock rx. >>> so the network is broken. >> Not sure I understand here. I mean is we schedule work for handle_rx(), >> there's no need to enable it since handle_rx() will do this for us. > Looks like in the last "else" block in vhost_net_busy_poll_check() we > need to enable vq since in that case we have no rx data and handle_rx() > is not scheduled. > Yes. >>> So we will use the vhost_has_work() to check whether or not the >>> handle_rx is scheduled ? 
>>> If we use the vhost_has_work(), the work in the dev work_list may be >>> rx work, or tx work, right ? >> >> Yes. We can add a boolean to record whether or not we've called >> vhost_poll_queue() for rvq. And avoid calling vhost_net_enable_vq() if >> it was true. > so, the commit be294a51a "vhost_net: Avoid rx queue wake-ups during busypoll" > may not consider the case: work is tx work in the dev work list. Not sure what you are concerned but what I can say is that we need to poll rx work if vhost_has_work() detects tx work in vhost_net_rx_peek_head_len() since rx busypoll exits prematurely in that case. >> So here's the needed changes: >> >> 1) Split the wakeup disabling to another patch >> 2) Squash the vhost_net_busy_poll_try_queue() and >> vhost_net_busy_poll_check() into vhost_net_busy_poll() and reduce >> duplicated checks. >> 3) If possible, rename the boolean rx in vhost_net_busy_poll() to >> poll_rx, this makes code more readable. > OK >> Thanks >> Thanks >> > > -- Toshiaki Makita ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
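Makita's point about commit be294a51a can be modeled the same way: rx busypoll exits as soon as vhost_has_work() sees any pending work, which may be tx work, so rx work must be re-queued to resume polling later. A hypothetical userspace sketch of that exit decision:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the premature-exit case: rx busypoll stops when
 * vhost_has_work() reports *any* pending work. If what interrupted the
 * poll was not rx work itself, rx work must be queued so handle_rx()
 * can continue busypolling; returns true when such a re-queue is
 * needed. */
static bool exit_and_repoll_rx(bool has_pending_work, bool pending_is_rx)
{
	if (!has_pending_work)
		return false;	/* keep busypolling */
	/* Exit the poll loop; re-queue rx work unless rx work is what
	 * is already pending. */
	return !pending_is_rx;
}
```

This is only the decision logic; in the driver the re-queue itself would be a vhost_poll_queue() call on the rx poll.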
Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()
On 2018/08/03 12:07, Jason Wang wrote: > On 2018年08月02日 17:23, Jason Wang wrote: >>> >> No need to enable rx virtqueue, if we are sure handle_rx() will be >> called soon. > If we disable rx virtqueue in handle_tx and don't send packets from > guest anymore(handle_tx is not called), so we can wake up for sock rx. > so the network is broken. Not sure I understand here. I mean is we schedule work for handle_rx(), there's no need to enable it since handle_rx() will do this for us. >>> Looks like in the last "else" block in vhost_net_busy_poll_check() we >>> need to enable vq since in that case we have no rx data and handle_rx() >>> is not scheduled. >>> > Rethink about this, looks not. We enable rx wakeups in this case, so if > there's pending data, handle_rx() will be schedule after > vhost_net_enable_vq(). You are right, but what I wanted to say is vhost_net_enable_vq() should be needed (I was talking about what would happen if vhost_net_enable_vq() were removed). Also, I think we should move vhost_net_enable_vq() from vhost_net_busy_poll() to this last "else" block because this is the case where rx wakeups is required. Anyway this part will be refactored so let's see what this code will look like in next version. -- Toshiaki Makita ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()
On Fri, Aug 3, 2018 at 11:07 AM Jason Wang wrote: > > > > On 2018年08月03日 10:51, Tonghao Zhang wrote: > > On Thu, Aug 2, 2018 at 5:23 PM Jason Wang wrote: > >> > >> > >> On 2018年08月02日 16:41, Toshiaki Makita wrote: > >>> On 2018/08/02 17:18, Jason Wang wrote: > On 2018年08月01日 17:52, Tonghao Zhang wrote: > >> +static void vhost_net_busy_poll_check(struct vhost_net *net, > >> + struct vhost_virtqueue *rvq, > >> + struct vhost_virtqueue *tvq, > >> + bool rx) > >> +{ > >> + struct socket *sock = rvq->private_data; > >> + > >> + if (rx) > >> + vhost_net_busy_poll_try_queue(net, tvq); > >> + else if (sock && sk_has_rx_data(sock->sk)) > >> + vhost_net_busy_poll_try_queue(net, rvq); > >> + else { > >> + /* On tx here, sock has no rx data, so we > >> + * will wait for sock wakeup for rx, and > >> + * vhost_enable_notify() is not needed. */ > > A possible case is we do have rx data but guest does not refill the rx > > queue. In this case we may lose notifications from guest. > Yes, should consider this case. thanks. > >>> I'm a bit confused. Isn't this covered by the previous > >>> "else if (sock && sk_has_rx_data(...))" block? > >> The problem is it does nothing if vhost_vq_avail_empty() is true and > >> vhost_enble_notify() is false. > >> > >>> + > >>> + cpu_relax(); > >>> + } > >>> + > >>> + preempt_enable(); > >>> + > >>> + if (!rx) > >>> + vhost_net_enable_vq(net, vq); > >> No need to enable rx virtqueue, if we are sure handle_rx() will be > >> called soon. > > If we disable rx virtqueue in handle_tx and don't send packets from > > guest anymore(handle_tx is not called), so we can wake up for sock rx. > > so the network is broken. > Not sure I understand here. I mean is we schedule work for handle_rx(), > there's no need to enable it since handle_rx() will do this for us. > >>> Looks like in the last "else" block in vhost_net_busy_poll_check() we > >>> need to enable vq since in that case we have no rx data and handle_rx() > >>> is not scheduled. > >>> > >> Yes. 
> > So we will use the vhost_has_work() to check whether or not the > > handle_rx is scheduled ? > > If we use the vhost_has_work(), the work in the dev work_list may be > > rx work, or tx work, right ? > > Yes. We can add a boolean to record whether or not we've called > vhost_poll_queue() for rvq. And avoid calling vhost_net_enable_vq() if > it was true. so, the commit be294a51a "vhost_net: Avoid rx queue wake-ups during busypoll" may not consider the case: work is tx work in the dev work list. > So here's the needed changes: > > 1) Split the wakeup disabling to another patch > 2) Squash the vhost_net_busy_poll_try_queue() and > vhost_net_busy_poll_check() into vhost_net_busy_poll() and reduce > duplicated checks. > 3) If possible, rename the boolean rx in vhost_net_busy_poll() to > poll_rx, this makes code more readable. OK > Thanks > > >> Thanks > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()
On 2018年08月03日 10:51, Tonghao Zhang wrote: On Thu, Aug 2, 2018 at 5:23 PM Jason Wang wrote: On 2018年08月02日 16:41, Toshiaki Makita wrote: On 2018/08/02 17:18, Jason Wang wrote: On 2018年08月01日 17:52, Tonghao Zhang wrote: +static void vhost_net_busy_poll_check(struct vhost_net *net, + struct vhost_virtqueue *rvq, + struct vhost_virtqueue *tvq, + bool rx) +{ + struct socket *sock = rvq->private_data; + + if (rx) + vhost_net_busy_poll_try_queue(net, tvq); + else if (sock && sk_has_rx_data(sock->sk)) + vhost_net_busy_poll_try_queue(net, rvq); + else { + /* On tx here, sock has no rx data, so we + * will wait for sock wakeup for rx, and + * vhost_enable_notify() is not needed. */ A possible case is we do have rx data but guest does not refill the rx queue. In this case we may lose notifications from guest. Yes, should consider this case. thanks. I'm a bit confused. Isn't this covered by the previous "else if (sock && sk_has_rx_data(...))" block? The problem is it does nothing if vhost_vq_avail_empty() is true and vhost_enble_notify() is false. + + cpu_relax(); + } + + preempt_enable(); + + if (!rx) + vhost_net_enable_vq(net, vq); No need to enable rx virtqueue, if we are sure handle_rx() will be called soon. If we disable rx virtqueue in handle_tx and don't send packets from guest anymore(handle_tx is not called), so we can wake up for sock rx. so the network is broken. Not sure I understand here. I mean is we schedule work for handle_rx(), there's no need to enable it since handle_rx() will do this for us. Looks like in the last "else" block in vhost_net_busy_poll_check() we need to enable vq since in that case we have no rx data and handle_rx() is not scheduled. Yes. So we will use the vhost_has_work() to check whether or not the handle_rx is scheduled ? If we use the vhost_has_work(), the work in the dev work_list may be rx work, or tx work, right ? Yes. We can add a boolean to record whether or not we've called vhost_poll_queue() for rvq. 
And avoid calling vhost_net_enable_vq() if it was true. So here are the needed changes: 1) Split the wakeup disabling into another patch. 2) Squash vhost_net_busy_poll_try_queue() and vhost_net_busy_poll_check() into vhost_net_busy_poll() and reduce the duplicated checks. 3) If possible, rename the boolean rx in vhost_net_busy_poll() to poll_rx; this makes the code more readable. Thanks Thanks ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
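The end-of-busy-poll decision being discussed (queue rx work when the sock has data, otherwise re-enable the rx virtqueue so a later sock wakeup can schedule handle_rx()) can be sketched as follows. This is only a toy model of the control flow: the struct and field names are stand-ins invented for illustration, not the real vhost-net types.

```c
#include <stdbool.h>

/*
 * Toy stand-in for the real vhost state: only the booleans that drive
 * the decision are kept. The real code operates on struct vhost_net
 * and struct vhost_virtqueue.
 */
struct vq_state {
	bool has_rx_data;	/* sk_has_rx_data(sock->sk) */
	bool rx_work_queued;	/* did we call vhost_poll_queue() for rvq? */
};

/*
 * End-of-busy-poll decision. On the tx path (poll_rx == false): if the
 * sock had rx data we queue rx work; otherwise no handle_rx() is
 * scheduled and the rx virtqueue must be re-enabled so that a later
 * sock wakeup can schedule it. Returns true when vhost_net_enable_vq()
 * is needed on the rx queue.
 */
static bool busy_poll_needs_rx_enable(struct vq_state *rvq, bool poll_rx)
{
	rvq->rx_work_queued = false;

	if (poll_rx) {
		/* rx path: we only kick the tx queue here */
	} else if (rvq->has_rx_data) {
		/* vhost_poll_queue(&rvq->poll): handle_rx() will run soon */
		rvq->rx_work_queued = true;
	}

	return !poll_rx && !rvq->rx_work_queued;
}
```

This records the boolean Jason suggests, so the caller can skip vhost_net_enable_vq() exactly when rx work was already queued.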
Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()
On Aug 2, 2018 at 17:23, Jason Wang wrote: No need to enable the rx virtqueue if we are sure handle_rx() will be called soon. If we disable the rx virtqueue in handle_tx and the guest doesn't send packets anymore (so handle_tx is not called), then we can't wake up for sock rx, so the network is broken. Not sure I understand here. I mean if we schedule work for handle_rx(), there's no need to enable it since handle_rx() will do this for us. Looks like in the last "else" block in vhost_net_busy_poll_check() we need to enable the vq, since in that case we have no rx data and handle_rx() is not scheduled. Yes. Thanks Rethinking about this, it looks like we don't: we enable rx wakeups in this case, so if there's pending data, handle_rx() will be scheduled after vhost_net_enable_vq(). Thanks
Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()
On Thu, Aug 2, 2018 at 5:23 PM Jason Wang wrote: > > > > On 2018年08月02日 16:41, Toshiaki Makita wrote: > > On 2018/08/02 17:18, Jason Wang wrote: > >> On 2018年08月01日 17:52, Tonghao Zhang wrote: > +static void vhost_net_busy_poll_check(struct vhost_net *net, > + struct vhost_virtqueue *rvq, > + struct vhost_virtqueue *tvq, > + bool rx) > +{ > + struct socket *sock = rvq->private_data; > + > + if (rx) > + vhost_net_busy_poll_try_queue(net, tvq); > + else if (sock && sk_has_rx_data(sock->sk)) > + vhost_net_busy_poll_try_queue(net, rvq); > + else { > + /* On tx here, sock has no rx data, so we > + * will wait for sock wakeup for rx, and > + * vhost_enable_notify() is not needed. */ > >>> A possible case is we do have rx data but guest does not refill the rx > >>> queue. In this case we may lose notifications from guest. > >> Yes, should consider this case. thanks. > > I'm a bit confused. Isn't this covered by the previous > > "else if (sock && sk_has_rx_data(...))" block? > > The problem is it does nothing if vhost_vq_avail_empty() is true and > vhost_enble_notify() is false. > > > > > + > > + cpu_relax(); > > + } > > + > > + preempt_enable(); > > + > > + if (!rx) > > + vhost_net_enable_vq(net, vq); > No need to enable rx virtqueue, if we are sure handle_rx() will be > called soon. > >>> If we disable rx virtqueue in handle_tx and don't send packets from > >>> guest anymore(handle_tx is not called), so we can wake up for sock rx. > >>> so the network is broken. > >> Not sure I understand here. I mean is we schedule work for handle_rx(), > >> there's no need to enable it since handle_rx() will do this for us. > > Looks like in the last "else" block in vhost_net_busy_poll_check() we > > need to enable vq since in that case we have no rx data and handle_rx() > > is not scheduled. > > > > Yes. So we will use the vhost_has_work() to check whether or not the handle_rx is scheduled ? 
If we use vhost_has_work(), the work in the dev work_list may be rx work or tx work, right? > Thanks
Re: [RFC 0/4] Virtio uses DMA API for all devices
On Aug 3, 2018 at 04:55, Michael S. Tsirkin wrote: On Fri, Jul 20, 2018 at 09:29:37AM +0530, Anshuman Khandual wrote: This patch series is the follow-up on the discussions we had before about the RFC titled [RFC,V2] virtio: Add platform specific DMA API translation for virtio devices (https://patchwork.kernel.org/patch/10417371/). There were suggestions about doing away with two different paths of transactions with the host/QEMU, the first being the direct GPA and the other being the DMA API based translations. The first patch attempts to create a direct GPA mapping based DMA operations structure called 'virtio_direct_dma_ops' with exactly the same implementation as the direct GPA path which virtio core currently has, but just wrapped in a DMA API format. Virtio core must use 'virtio_direct_dma_ops' instead of the arch default in the absence of the VIRTIO_F_IOMMU_PLATFORM flag to preserve the existing semantics. The second patch does exactly that inside the function virtio_finalize_features(). The third patch removes the default direct GPA path from virtio core, forcing it to use DMA API callbacks for all devices. Now with that change, every device must have a DMA operations structure associated with it. The fourth patch adds an additional hook which gives the platform an opportunity to do yet another override if required. This platform hook can be used on POWER Ultravisor based protected guests to load up SWIOTLB DMA callbacks to do the required bounce buffering (as discussed previously in the above mentioned thread, the host is allowed to access only parts of the guest GPA range) into the shared memory for all I/O scatter-gather buffers to be consumed on the host side. Please go through these patches and review whether this approach broadly makes sense. I will appreciate suggestions, inputs, comments regarding the patches or the approach in general. Thank you. Jason did some work on profiling this. Unfortunately he reports about 4% extra overhead from this switch on x86 with no vIOMMU.
The test is rather simple: just run pktgen (pktgen_sample01_simple.sh) in the guest and measure PPS on the tap on the host. Thanks I expect he's writing up the data in more detail, but just wanted to let you know this would be one more thing to debug before we can just switch to DMA APIs. Anshuman Khandual (4): virtio: Define virtio_direct_dma_ops structure virtio: Override device's DMA OPS with virtio_direct_dma_ops selectively virtio: Force virtio core to use DMA API callbacks for all virtio devices virtio: Add platform specific DMA API translation for virtio devices arch/powerpc/include/asm/dma-mapping.h | 6 +++ arch/powerpc/platforms/pseries/iommu.c | 6 +++ drivers/virtio/virtio.c | 72 ++ drivers/virtio/virtio_pci_common.h | 3 ++ drivers/virtio/virtio_ring.c | 65 +- 5 files changed, 89 insertions(+), 63 deletions(-) -- 2.9.3
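The core idea of the first patch, wrapping the existing direct-GPA path in DMA-API form, can be illustrated with a stripped-down sketch. The struct and names below are simplified stand-ins invented for illustration (the real virtio_direct_dma_ops fills in the kernel's struct dma_map_ops and works on struct page / dma_addr_t); here the "guest physical address" is simply the pointer value.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t dma_addr_t;

/* Simplified stand-in for the kernel's dma_map_ops vtable. */
struct dma_map_ops_sketch {
	dma_addr_t (*map_page)(void *addr, size_t size);
	void (*unmap_page)(dma_addr_t handle, size_t size);
};

/* Identity mapping: the DMA address is just the GPA. In the kernel this
 * would use page_to_phys()/virt_to_phys(); here the pointer stands in. */
static dma_addr_t direct_map_page(void *addr, size_t size)
{
	(void)size;
	return (dma_addr_t)(uintptr_t)addr;
}

static void direct_unmap_page(dma_addr_t handle, size_t size)
{
	(void)handle;
	(void)size;	/* nothing to undo for a direct mapping */
}

/* The "transparent" ops: same semantics as the old non-DMA-API path,
 * but expressed through the DMA-API-shaped vtable. */
static const struct dma_map_ops_sketch virtio_direct_dma_ops_sketch = {
	.map_page   = direct_map_page,
	.unmap_page = direct_unmap_page,
};
```

With ops of this shape installed, virtio core can unconditionally call through the DMA API and still get the legacy direct-GPA behavior when VIRTIO_F_IOMMU_PLATFORM is not negotiated.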
Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()
On Aug 2, 2018 at 17:57, Toshiaki Makita wrote: On 2018/08/02 18:23, Jason Wang wrote: On Aug 2, 2018 at 16:41, Toshiaki Makita wrote: On 2018/08/02 17:18, Jason Wang wrote: On Aug 1, 2018 at 17:52, Tonghao Zhang wrote: +static void vhost_net_busy_poll_check(struct vhost_net *net, + struct vhost_virtqueue *rvq, + struct vhost_virtqueue *tvq, + bool rx) +{ + struct socket *sock = rvq->private_data; + + if (rx) + vhost_net_busy_poll_try_queue(net, tvq); + else if (sock && sk_has_rx_data(sock->sk)) + vhost_net_busy_poll_try_queue(net, rvq); + else { + /* On tx here, sock has no rx data, so we + * will wait for sock wakeup for rx, and + * vhost_enable_notify() is not needed. */ A possible case is that we do have rx data but the guest does not refill the rx queue. In this case we may lose notifications from the guest. Yes, we should consider this case. Thanks. I'm a bit confused. Isn't this covered by the previous "else if (sock && sk_has_rx_data(...))" block? The problem is that it does nothing if vhost_vq_avail_empty() is true and vhost_enable_notify() is false. If vhost_enable_notify() is false, the guest will eventually kick the vq, no? Yes, you are right. Thanks
Re: [RFC 0/4] Virtio uses DMA API for all devices
On Thu, Aug 02, 2018 at 04:13:09PM -0500, Benjamin Herrenschmidt wrote: > On Thu, 2018-08-02 at 23:52 +0300, Michael S. Tsirkin wrote: > > > Yes, this is the purpose of Anshuman's original patch (I haven't looked > > > at the details of the patch in a while but that's what I told him to > > > implement ;-) : > > > > > > - Make virtio always use DMA ops to simplify the code path (with a set > > > of "transparent" ops for legacy) > > > > > > and > > > > > > - Provide an arch hook allowing us to "override" those "transparent" > > > DMA ops with some custom ones that do the appropriate swiotlb gunk. > > > > > > Cheers, > > > Ben. > > > > > > > Right but as I tried to say doing that brings us to a bunch of issues > > with using DMA APIs in virtio. Put simply DMA APIs weren't designed for > > guest to hypervisor communication. > > I'm not sure I see the problem, see below > > > When we do (as is the case with PLATFORM_IOMMU right now) this adds a > > bunch of overhead which we need to get rid of if we are to switch to > > PLATFORM_IOMMU by default. We need to fix that. > > So let's differentiate the two problems of having an IOMMU (real or > emulated) which indeed adds overhead etc... and using the DMA API. Well actually it's the other way around. An IOMMU in theory doesn't need to bring overhead if you set it in bypass mode, which does imply the IOMMU supports bypass mode. Is that universally the case? As for the DMA API, see Christoph's list of the things it does, some of which add overhead. > At the moment, virtio does this all over the place: > > if (use_dma_api) > dma_map/alloc_something(...) > else > use_pa > > The idea of the patch set is to do two, somewhat orthogonal, changes > that together achieve what we want.
Let me know where you think there > is "a bunch of issues" because I'm missing it: > > 1- Replace the above if/else constructs with just calling the DMA API, > and have virtio, at initialization, hookup its own dma_ops that just > "return pa" (roughly) when the IOMMU stuff isn't used. > > This adds an indirect function call to the path that previously didn't > have one (the else case above). Is that a significant/measurable > overhead ? Seems to be :( Jason reports about 4%. I wonder whether we can support map_sg and friends being NULL, then use that when mapping is an identity. A conditional branch there is likely very cheap. Would this cover all platforms with kvm (which is where we care most about performance)? > This change stands alone, and imho "cleans" up virtio by avoiding all > that if/else "2 path" and unless it adds a measurable overhead, should > probably be done. > > 2- Make virtio use the DMA API with our custom platform-provided > swiotlb callbacks when needed, that is when not using IOMMU *and* > running on a secure VM in our case. > > This benefits from -1- by making us just plumb in a different set of > DMA ops we would have cooked up specifically for virtio in our arch > code (or in virtio itself but build arch-conditionally in a separate > file). But it doesn't strictly need it -1-: > > Now, -2- doesn't strictly needs -1-. We could have just done another > xen-like hack that forces the DMA API "ON" for virtio when running in a > secure VM. > > The problem if we do that however is that we also then need the arch > PCI code to make sure it hooks up the virtio PCI devices with the > special "magic" DMA ops that avoid the iommu but still do swiotlb, ie, > not the same as other PCI devices. So it will have to play games such > as checking vendor/device IDs for virtio, checking the IOMMU flag, > etc... from the arch code which really bloody sucks when assigning PCI > DMA ops. 
> > However, if we do it the way we plan here, on top of -1-, with a hook > called from virtio into the arch to "override" the virtio DMA ops, then > we avoid the problem completely: The arch hook would only be called by > virtio if the IOMMU flag is *not* set. IE only when using that special > "hypervisor" iommu bypass. If the IOMMU flag is set, virtio uses normal > PCI dma ops as usual. > > That way, we have a very clear semantic: This hook is purely about > replacing those "null" DMA ops that just return PA introduced in -1- > with some arch provided specially cooked up DMA ops for non-IOMMU > virtio that know about the arch special requirements. For us bounce > buffering. > > Is there something I'm missing ? > > Cheers, > Ben. Right so I was trying to write it up in a systematic way, but just to give you one example, if there is a system where DMA API handles coherency issues, or flushing of some buffers, then our PLATFORM_IOMMU flag causes that to happen. And we kinda worked around this without the IOMMU by basically saying "ok we do not really need DMA API so let's just bypass it" and it was kind of ok except now everyone is switching to vIOMMU just in case. So now people do want some parts of what DMA API does, such as the bounce
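The "support map_sg and friends being NULL" idea Michael floats above amounts to a cheap, predictable conditional in the mapping helpers instead of an unconditional indirect call. A minimal sketch, with made-up stand-ins for dma_map_ops and virt_to_phys() (the real change would live in the kernel's dma-mapping wrappers):

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t dma_addr_t;

struct dma_ops_sketch {
	dma_addr_t (*map_page)(void *addr, size_t size);
};

/* Stand-in for virt_to_phys(): identity in this sketch. */
static dma_addr_t fake_virt_to_phys(void *addr)
{
	return (dma_addr_t)(uintptr_t)addr;
}

/*
 * If the ops (or the callback) are NULL, treat the mapping as the
 * identity: one easily predicted branch instead of an indirect call,
 * which is the proposed way to avoid the ~4% overhead on the
 * no-vIOMMU fast path.
 */
static dma_addr_t dma_map_single_sketch(const struct dma_ops_sketch *ops,
					void *addr, size_t size)
{
	if (!ops || !ops->map_page)
		return fake_virt_to_phys(addr);	/* identity fast path */
	return ops->map_page(addr, size);	/* slow path: real ops */
}
```

Whether this covers all KVM platforms (as the mail asks) is exactly the open question; the sketch only shows the shape of the fast path.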
Re: [RFC 0/4] Virtio uses DMA API for all devices
On Thu, 2018-08-02 at 23:52 +0300, Michael S. Tsirkin wrote: > > Yes, this is the purpose of Anshuman's original patch (I haven't looked > > at the details of the patch in a while but that's what I told him to > > implement ;-) : > > > > - Make virtio always use DMA ops to simplify the code path (with a set > > of "transparent" ops for legacy) > > > > and > > > > - Provide an arch hook allowing us to "override" those "transparent" > > DMA ops with some custom ones that do the appropriate swiotlb gunk. > > > > Cheers, > > Ben. > > > > Right but as I tried to say doing that brings us to a bunch of issues > with using DMA APIs in virtio. Put simply DMA APIs weren't designed for > guest to hypervisor communication. I'm not sure I see the problem, see below > When we do (as is the case with PLATFORM_IOMMU right now) this adds a > bunch of overhead which we need to get rid of if we are to switch to > PLATFORM_IOMMU by default. We need to fix that. So let's differentiate the two problems of having an IOMMU (real or emulated), which indeed adds overhead etc..., and using the DMA API. At the moment, virtio does this all over the place: if (use_dma_api) dma_map/alloc_something(...) else use_pa The idea of the patch set is to do two, somewhat orthogonal, changes that together achieve what we want. Let me know where you think there is "a bunch of issues" because I'm missing it: 1- Replace the above if/else constructs with just calling the DMA API, and have virtio, at initialization, hook up its own dma_ops that just "return pa" (roughly) when the IOMMU stuff isn't used. This adds an indirect function call to the path that previously didn't have one (the else case above). Is that a significant/measurable overhead? This change stands alone, and imho "cleans" up virtio by avoiding all that if/else "2 path", and unless it adds a measurable overhead, should probably be done.
2- Make virtio use the DMA API with our custom platform-provided swiotlb callbacks when needed, that is when not using IOMMU *and* running on a secure VM in our case. This benefits from -1- by making us just plumb in a different set of DMA ops we would have cooked up specifically for virtio in our arch code (or in virtio itself but built arch-conditionally in a separate file). But -2- doesn't strictly need -1-: we could have just done another xen-like hack that forces the DMA API "ON" for virtio when running in a secure VM. The problem if we do that, however, is that we also then need the arch PCI code to make sure it hooks up the virtio PCI devices with the special "magic" DMA ops that avoid the iommu but still do swiotlb, ie, not the same as other PCI devices. So it will have to play games such as checking vendor/device IDs for virtio, checking the IOMMU flag, etc... from the arch code, which really bloody sucks when assigning PCI DMA ops. However, if we do it the way we plan here, on top of -1-, with a hook called from virtio into the arch to "override" the virtio DMA ops, then we avoid the problem completely: the arch hook would only be called by virtio if the IOMMU flag is *not* set, ie only when using that special "hypervisor" iommu bypass. If the IOMMU flag is set, virtio uses normal PCI dma ops as usual. That way, we have a very clear semantic: this hook is purely about replacing those "null" DMA ops that just return PA introduced in -1- with some arch-provided, specially cooked up DMA ops for non-IOMMU virtio that know about the arch's special requirements. For us, bounce buffering. Is there something I'm missing? Cheers, Ben.
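The arch-override hook Ben describes in -2- can be sketched as a selection step at feature-negotiation time. Everything below is an illustrative stand-in (the function and type names are made up; the real series wires the hook through virtio_finalize_features() and uses the kernel's dma_map_ops):

```c
#include <stddef.h>
#include <stdbool.h>

struct dma_ops_sketch { int id; };

/* The transparent "just return the GPA" ops from -1-. */
static struct dma_ops_sketch direct_ops = { .id = 1 };
/* What a secure pSeries guest would install: swiotlb-backed bounce ops. */
static struct dma_ops_sketch arch_swiotlb_ops = { .id = 2 };

/* Example arch hook: a secure VM replaces the transparent ops. */
static struct dma_ops_sketch *secure_vm_override(void)
{
	return &arch_swiotlb_ops;
}

/*
 * Decide which DMA ops a virtio device gets. Returning NULL means
 * "use the platform's normal PCI DMA ops". The hook is only consulted
 * when VIRTIO_F_IOMMU_PLATFORM was NOT negotiated, exactly as
 * described in the mail.
 */
static struct dma_ops_sketch *
virtio_pick_dma_ops(bool iommu_platform,
		    struct dma_ops_sketch *(*arch_override)(void))
{
	if (iommu_platform)
		return NULL;		/* IOMMU flag set: normal PCI dma ops */
	if (arch_override) {
		struct dma_ops_sketch *ops = arch_override();
		if (ops)
			return ops;	/* arch replaced the transparent ops */
	}
	return &direct_ops;		/* default: direct GPA "null" ops */
}
```

This keeps the "clear semantic": the hook can only ever replace the null ops, never interfere with the IOMMU path.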
Re: [net-next, v6, 6/7] net-sysfs: Add interface for Rx queue(s) map per Tx queue
On Thu, Aug 02, 2018 at 02:04:12PM -0700, Nambiar, Amritha wrote: > On 8/1/2018 5:11 PM, Andrei Vagin wrote: > > On Tue, Jul 10, 2018 at 07:28:49PM -0700, Nambiar, Amritha wrote: > >> With this patch series, I introduced static_key for XPS maps > >> (xps_needed), so static_key_slow_inc() is used to switch branches. The > >> definition of static_key_slow_inc() has cpus_read_lock in place. In the > >> virtio_net driver, XPS queues are initialized after setting the > >> queue:cpu affinity in virtnet_set_affinity() which is already protected > >> within cpus_read_lock. Hence, the warning here trying to acquire > >> cpus_read_lock when it is already held. > >> > >> A quick fix for this would be to just extract netif_set_xps_queue() out > >> of the lock by simply wrapping it with another put/get_online_cpus > >> (unlock right before and hold lock right after). > > > > virtnet_set_affinity() is called from virtnet_cpu_online(), which is > > called under cpus_read_lock too. > > > > It looks like now we can't call netif_set_xps_queue() from cpu hotplug > > callbacks. > > > > I can suggest a very straightforward fix for this problem. The patch is > > attached. > > > > Thanks for looking into this. I was thinking of fixing this in the > virtio_net driver by moving the XPS initialization (and have a new > get_affinity utility) in the ndo_open (so it is together with other tx > preparation) instead of probe. Your patch solves this in general for > setting up cpu hotplug callbacks which is under cpus_read_lock. I like this too. Could you repost in a standard way (inline, with your signoff etc) so we can ack this for net-next? > >> But this may not a > >> clean solution. It'd help if I can get suggestions on what would be a > >> clean option to fix this without extensively changing the code in > >> virtio_net. Is it mandatory to protect the affinitization with > >> read_lock? I don't see similar lock in other drivers while setting the > >> affinity. 
I understand this warning should go away, but isn't it safe to > >> have multiple readers. > >> > >>> On Fri, Jun 29, 2018 at 09:27:07PM -0700, Amritha Nambiar wrote:
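The deadlock pattern being debugged here is a non-recursive "cpus lock" re-acquired from a region that already holds it: virtnet_set_affinity() runs under cpus_read_lock(), and netif_set_xps_queue() ends up in static_key_slow_inc(), which takes cpus_read_lock() again. A toy model (names are sketches; the real primitives are cpus_read_lock()/put_online_cpus() and lockdep emits the warning rather than a counter):

```c
#include <stdbool.h>

static int cpus_lock_depth;	/* nesting counter standing in for lockdep */

static void cpus_read_lock_sketch(void)   { cpus_lock_depth++; }
static void cpus_read_unlock_sketch(void) { cpus_lock_depth--; }

/* static_key_slow_inc() takes the cpus lock internally; return whether
 * it was entered with the lock already held (the warning condition). */
static bool static_key_slow_inc_sketch(void)
{
	bool nested = cpus_lock_depth > 0;
	cpus_read_lock_sketch();
	cpus_read_unlock_sketch();
	return nested;
}

/* Current situation: called with the cpus lock held -> nested acquire. */
static bool set_xps_without_fix(void)
{
	bool nested;
	cpus_read_lock_sketch();	/* virtnet_set_affinity() context */
	nested = static_key_slow_inc_sketch();
	cpus_read_unlock_sketch();
	return nested;
}

/* Quick fix from the mail: drop the lock around netif_set_xps_queue(). */
static bool set_xps_with_quick_fix(void)
{
	bool nested;
	cpus_read_lock_sketch();	/* virtnet_set_affinity() context */
	cpus_read_unlock_sketch();	/* put_online_cpus() */
	nested = static_key_slow_inc_sketch();
	cpus_read_lock_sketch();	/* get_online_cpus() */
	cpus_read_unlock_sketch();
	return nested;
}
```

As the thread notes, the cleaner fixes are Andrei's general patch or moving the XPS setup out of the hotplug-locked path entirely (e.g. into ndo_open).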
Re: [RFC 0/4] Virtio uses DMA API for all devices
On Fri, Jul 20, 2018 at 09:29:37AM +0530, Anshuman Khandual wrote: > This patch series is the follow up on the discussions we had before about > the RFC titled [RFC,V2] virtio: Add platform specific DMA API translation > for virito devices (https://patchwork.kernel.org/patch/10417371/). There > were suggestions about doing away with two different paths of transactions > with the host/QEMU, first being the direct GPA and the other being the DMA > API based translations. > > First patch attempts to create a direct GPA mapping based DMA operations > structure called 'virtio_direct_dma_ops' with exact same implementation > of the direct GPA path which virtio core currently has but just wrapped in > a DMA API format. Virtio core must use 'virtio_direct_dma_ops' instead of > the arch default in absence of VIRTIO_F_IOMMU_PLATFORM flag to preserve the > existing semantics. The second patch does exactly that inside the function > virtio_finalize_features(). The third patch removes the default direct GPA > path from virtio core forcing it to use DMA API callbacks for all devices. > Now with that change, every device must have a DMA operations structure > associated with it. The fourth patch adds an additional hook which gives > the platform an opportunity to do yet another override if required. This > platform hook can be used on POWER Ultravisor based protected guests to > load up SWIOTLB DMA callbacks to do the required (as discussed previously > in the above mentioned thread how host is allowed to access only parts of > the guest GPA range) bounce buffering into the shared memory for all I/O > scatter gather buffers to be consumed on the host side. > > Please go through these patches and review whether this approach broadly > makes sense. I will appreciate suggestions, inputs, comments regarding > the patches or the approach in general. Thank you. Jason did some work on profiling this. 
Unfortunately he reports about 4% extra overhead from this switch on x86 with no vIOMMU. I expect he's writing up the data in more detail, but just wanted to let you know this would be one more thing to debug before we can just switch to DMA APIs. > Anshuman Khandual (4): > virtio: Define virtio_direct_dma_ops structure > virtio: Override device's DMA OPS with virtio_direct_dma_ops selectively > virtio: Force virtio core to use DMA API callbacks for all virtio devices > virtio: Add platform specific DMA API translation for virtio devices > > arch/powerpc/include/asm/dma-mapping.h | 6 +++ > arch/powerpc/platforms/pseries/iommu.c | 6 +++ > drivers/virtio/virtio.c | 72 ++ > drivers/virtio/virtio_pci_common.h | 3 ++ > drivers/virtio/virtio_ring.c | 65 +- > 5 files changed, 89 insertions(+), 63 deletions(-) > > -- > 2.9.3
Re: [RFC 0/4] Virtio uses DMA API for all devices
On Thu, Aug 02, 2018 at 10:33:05AM -0500, Benjamin Herrenschmidt wrote: > On Thu, 2018-08-02 at 00:56 +0300, Michael S. Tsirkin wrote: > > > but it's not, VMs are > > > created in "legacy" mode all the times and we don't know at VM creation > > > time whether it will become a secure VM or not. The way our secure VMs > > > work is that they start as a normal VM, load a secure "payload" and > > > call the Ultravisor to "become" secure. > > > > > > So we're in a bit of a bind here. We need that one-liner optional arch > > > hook to make virtio use swiotlb in that "IOMMU bypass" case. > > > > > > Ben. > > > > And just to make sure I understand, on your platform DMA APIs do include > > some of the cache flushing tricks and this is why you don't want to > > declare iommu support in the hypervisor? > > I'm not sure I parse what you mean. > > We don't need cache flushing tricks. You don't but do real devices on same platform need them? > The problem we have with our > "secure" VMs is that: > > - At VM creation time we have no idea it's going to become a secure > VM, qemu doesn't know anything about it, and thus qemu (or other > management tools, libvirt etc...) are going to create "legacy" (ie > iommu bypass) virtio devices. > > - Once the VM goes secure (early during boot but too late for qemu), > it will need to make virtio do bounce-buffering via swiotlb because > qemu cannot physically access most VM pages (blocked by HW security > features), we need to bounce buffer using some unsecure pages that are > accessible to qemu. > > That said, I wouldn't object for us to more generally switch long run > to changing qemu so that virtio on powerpc starts using the IOMMU as a > default provided we fix our guest firmware to understand it (it > currently doesn't), and provided we verify that the performance impact > on things like vhost is negligible. > > Cheers, > Ben. 
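The bounce buffering a secure VM needs, staging DMA data through pages the hypervisor may access, is in essence what swiotlb does. A minimal sketch with a trivial allocator (the static buffer stands in for the pool of shared "insecure" pages; the real swiotlb hands out dma_addr_t values inside its aperture and has a proper allocator):

```c
#include <string.h>
#include <stddef.h>

/* Stand-in for the pool of hypervisor-accessible bounce pages. */
static unsigned char bounce_pool[4096];
static size_t bounce_used;

/*
 * Map for guest-to-device data: copy the secure buffer into the shared
 * pool and hand out an offset into it. qemu can only touch the pool,
 * never the original secure page.
 */
static size_t bounce_map(const void *secure_buf, size_t len)
{
	size_t off = bounce_used;

	memcpy(bounce_pool + off, secure_buf, len);
	bounce_used += len;
	return off;
}

/*
 * Unmap for device-to-guest data: copy the (possibly modified) bounce
 * slot back into the secure buffer and release it.
 */
static void bounce_unmap(void *secure_buf, size_t off, size_t len)
{
	memcpy(secure_buf, bounce_pool + off, len);
	bounce_used = 0;	/* trivial allocator: free everything at once */
}
```

This is why, from qemu's point of view, "nothing special is going on": the addresses it sees always land in the shared pool.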
Re: [RFC 0/4] Virtio uses DMA API for all devices
On Thu, Aug 02, 2018 at 12:53:41PM -0500, Benjamin Herrenschmidt wrote: > On Thu, 2018-08-02 at 20:19 +0300, Michael S. Tsirkin wrote: > > > > I see. So yes, given that device does not know or care, using > > virtio features is an awkward fit. > > > > So let's say as a quick fix for you maybe we could generalize the > > xen_domain hack, instead of just checking xen_domain check some static > > branch. Then teach xen and others to enable that.> > > > OK but problem then becomes this: if you do this and virtio device appears > > behind a vIOMMU and it does not advertize the IOMMU flag, the > > code will try to use the vIOMMU mappings and fail. > > > > It does look like even with trick above, you need a special version of > > DMA ops that does just swiotlb but not any of the other things DMA API > > might do. > > > > Thoughts? > > Yes, this is the purpose of Anshuman original patch (I haven't looked > at the details of the patch in a while but that's what I told him to > implement ;-) : > > - Make virtio always use DMA ops to simplify the code path (with a set > of "transparent" ops for legacy) > > and > > - Provide an arch hook allowing us to "override" those "transparent" > DMA ops with some custom ones that do the appropriate swiotlb gunk. > > Cheers, > Ben. > Right but as I tried to say doing that brings us to a bunch of issues with using DMA APIs in virtio. Put simply DMA APIs weren't designed for guest to hypervisor communication. When we do (as is the case with PLATFORM_IOMMU right now) this adds a bunch of overhead which we need to get rid of if we are to switch to PLATFORM_IOMMU by default. We need to fix that. -- MST
Re: [RFC 0/4] Virtio uses DMA API for all devices
On Thu, 2018-08-02 at 20:19 +0300, Michael S. Tsirkin wrote: > > I see. So yes, given that device does not know or care, using > virtio features is an awkward fit. > > So let's say as a quick fix for you maybe we could generalize the > xen_domain hack, instead of just checking xen_domain check some static > branch. Then teach xen and others to enable that.> > OK but problem then becomes this: if you do this and virtio device appears > behind a vIOMMU and it does not advertise the IOMMU flag, the > code will try to use the vIOMMU mappings and fail. > > It does look like even with trick above, you need a special version of > DMA ops that does just swiotlb but not any of the other things DMA API > might do. > > Thoughts? Yes, this is the purpose of Anshuman's original patch (I haven't looked at the details of the patch in a while but that's what I told him to implement ;-) : - Make virtio always use DMA ops to simplify the code path (with a set of "transparent" ops for legacy) and - Provide an arch hook allowing us to "override" those "transparent" DMA ops with some custom ones that do the appropriate swiotlb gunk. Cheers, Ben.
Re: [RFC 0/4] Virtio uses DMA API for all devices
On Thu, Aug 02, 2018 at 11:01:26AM -0500, Benjamin Herrenschmidt wrote: > On Thu, 2018-08-02 at 18:41 +0300, Michael S. Tsirkin wrote: > > > > > I don't completely agree: > > > > > > 1 - VIRTIO_F_IOMMU_PLATFORM is a property of the "other side", ie qemu > > > for example. It indicates that the peer bypasses the normal platform > > > iommu. The platform code in the guest has no real way to know that this > > > is the case, this is a specific "feature" of the qemu implementation. > > > > > > 2 - VIRTIO_F_PLATFORM_DMA (or whatever you want to call it), is a > > > property of the guest platform itself (not qemu), there's no way the > > > "peer" can advertize it via the virtio negociated flags. At least for > > > us. I don't know for sure whether that would be workable for the ARM > > > case. In our case, qemu has no idea at VM creation time that the VM > > > will turn itself into a secure VM and thus will require bounce > > > buffering for IOs (including virtio). > > > > > > So unless we have another hook for the arch code to set > > > VIRTIO_F_PLATFORM_DMA on selected (or all) virtio devices from the > > > guest itself, I don't see that as a way to deal with it. > > > > > > > The other issue is VIRTIO_F_IO_BARRIER > > > > which is very vaguely defined, and which needs a better definition. > > > > And last but not least we'll need some text explaining the challenges > > > > of hardware devices - I think VIRTIO_F_PLATFORM_DMA + > > > > VIRTIO_F_IO_BARRIER > > > > is what would basically cover them, but a good description including > > > > an explanation of why these matter. > > > > > > Ben. > > > > > > > So is it true that from qemu point of view there is nothing special > > going on? You pass in a PA, host writes there. > > Yes, qemu doesn't see a different. It's the guest that will bounce the > pages via a pool of "insecure" pages that qemu can access. Normal pages > in a secure VM come from PAs that qemu cannot physically access. > > Cheers, > Ben. > I see. 
So yes, given that the device does not know or care, using virtio features is an awkward fit. So let's say as a quick fix for you, maybe we could generalize the xen_domain hack: instead of just checking xen_domain(), check some static branch. Then teach xen and others to enable that. OK, but the problem then becomes this: if you do this and a virtio device appears behind a vIOMMU and it does not advertise the IOMMU flag, the code will try to use the vIOMMU mappings and fail. It does look like even with the trick above, you need a special version of DMA ops that does just swiotlb but not any of the other things the DMA API might do. Thoughts? -- MST
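Generalizing the xen_domain() check into "some static branch" that platforms can flip would look roughly like this in the decision that today lives in vring_use_dma_api(). The flag is a plain bool in this sketch, and the names are made up for illustration; the suggestion in the mail is a static_key, so the check costs nothing when no platform has enabled it:

```c
#include <stdbool.h>

/* Stand-in for xen_domain(); always false in this sketch. */
static bool xen_domain_sketch(void)
{
	return false;
}

/*
 * Sketch of the generalized decision: any platform (Xen, a secure
 * pSeries guest, ...) can force virtio through the DMA API, instead
 * of Xen being special-cased.
 */
static bool vring_use_dma_api_sketch(bool iommu_platform_negotiated,
				     bool platform_forces_dma_api)
{
	if (iommu_platform_negotiated)
		return true;	/* device honors the platform IOMMU */
	if (xen_domain_sketch())
		return true;	/* the existing Xen special case */
	if (platform_forces_dma_api)
		return true;	/* generalized platform override */
	return false;		/* legacy direct-GPA path */
}
```

Note this only decides *whether* to use the DMA API; as the mail points out, the secure-VM case additionally needs special ops that do just swiotlb, which is what the arch-override hook provides.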
Re: [RFC 0/4] Virtio uses DMA API for all devices
On Thu, 2018-08-02 at 18:41 +0300, Michael S. Tsirkin wrote: > > > I don't completely agree: > > > > 1 - VIRTIO_F_IOMMU_PLATFORM is a property of the "other side", ie qemu > > for example. It indicates that the peer bypasses the normal platform > > iommu. The platform code in the guest has no real way to know that this > > is the case, this is a specific "feature" of the qemu implementation. > > > > 2 - VIRTIO_F_PLATFORM_DMA (or whatever you want to call it), is a > > property of the guest platform itself (not qemu), there's no way the > > "peer" can advertise it via the virtio negotiated flags. At least for > > us. I don't know for sure whether that would be workable for the ARM > > case. In our case, qemu has no idea at VM creation time that the VM > > will turn itself into a secure VM and thus will require bounce > > buffering for IOs (including virtio). > > > > So unless we have another hook for the arch code to set > > VIRTIO_F_PLATFORM_DMA on selected (or all) virtio devices from the > > guest itself, I don't see that as a way to deal with it. > > > > > The other issue is VIRTIO_F_IO_BARRIER > > > which is very vaguely defined, and which needs a better definition. > > > And last but not least we'll need some text explaining the challenges > > > of hardware devices - I think VIRTIO_F_PLATFORM_DMA + VIRTIO_F_IO_BARRIER > > > is what would basically cover them, but a good description including > > > an explanation of why these matter. > > > > Ben. > > > > So is it true that from qemu point of view there is nothing special > going on? You pass in a PA, host writes there. Yes, qemu doesn't see a difference. It's the guest that will bounce the pages via a pool of "insecure" pages that qemu can access. Normal pages in a secure VM come from PAs that qemu cannot physically access. Cheers, Ben.
Re: [RFC 0/4] Virtio uses DMA API for all devices
On Thu, Aug 02, 2018 at 10:24:57AM -0500, Benjamin Herrenschmidt wrote: > On Wed, 2018-08-01 at 01:36 -0700, Christoph Hellwig wrote: > > We just need to figure out how to deal with devices that deviate > > from the default. One thing is that VIRTIO_F_IOMMU_PLATFORM really > > should become VIRTIO_F_PLATFORM_DMA to cover the cases of non-iommu > > dma tweaks (offsets, cache flushing), which seems well in spirit of > > the original design. > > I don't completely agree: > > 1 - VIRTIO_F_IOMMU_PLATFORM is a property of the "other side", ie qemu > for example. It indicates that the peer bypasses the normal platform > iommu. The platform code in the guest has no real way to know that this > is the case, this is a specific "feature" of the qemu implementation. > > 2 - VIRTIO_F_PLATFORM_DMA (or whatever you want to call it), is a > property of the guest platform itself (not qemu), there's no way the > "peer" can advertise it via the virtio negotiated flags. At least for > us. I don't know for sure whether that would be workable for the ARM > case. In our case, qemu has no idea at VM creation time that the VM > will turn itself into a secure VM and thus will require bounce > buffering for IOs (including virtio). > > So unless we have another hook for the arch code to set > VIRTIO_F_PLATFORM_DMA on selected (or all) virtio devices from the > guest itself, I don't see that as a way to deal with it. > > > The other issue is VIRTIO_F_IO_BARRIER > > which is very vaguely defined, and which needs a better definition. > > And last but not least we'll need some text explaining the challenges > > of hardware devices - I think VIRTIO_F_PLATFORM_DMA + VIRTIO_F_IO_BARRIER > > is what would basically cover them, but a good description including > > an explanation of why these matter. > > Ben. > So is it true that from qemu's point of view there is nothing special going on? You pass in a PA, host writes there.
-- MST
Re: [RFC 0/4] Virtio uses DMA API for all devices
On Thu, 2018-08-02 at 00:56 +0300, Michael S. Tsirkin wrote: > > but it's not, VMs are > > created in "legacy" mode all the times and we don't know at VM creation > > time whether it will become a secure VM or not. The way our secure VMs > > work is that they start as a normal VM, load a secure "payload" and > > call the Ultravisor to "become" secure. > > > > So we're in a bit of a bind here. We need that one-liner optional arch > > hook to make virtio use swiotlb in that "IOMMU bypass" case. > > > > Ben. > > And just to make sure I understand, on your platform DMA APIs do include > some of the cache flushing tricks and this is why you don't want to > declare iommu support in the hypervisor? I'm not sure I parse what you mean. We don't need cache flushing tricks. The problem we have with our "secure" VMs is that: - At VM creation time we have no idea it's going to become a secure VM, qemu doesn't know anything about it, and thus qemu (or other management tools, libvirt etc...) are going to create "legacy" (ie iommu bypass) virtio devices. - Once the VM goes secure (early during boot but too late for qemu), it will need to make virtio do bounce-buffering via swiotlb because qemu cannot physically access most VM pages (blocked by HW security features), so we need to bounce buffer using some insecure pages that are accessible to qemu. That said, I wouldn't object to us more generally switching, in the long run, to changing qemu so that virtio on powerpc starts using the IOMMU as a default, provided we fix our guest firmware to understand it (it currently doesn't), and provided we verify that the performance impact on things like vhost is negligible. Cheers, Ben.
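The bounce-buffering flow Ben describes can be sketched as a toy model. All class and function names below are illustrative (the real mechanism is the kernel's swiotlb, and this is plain Python rather than kernel C): the host can only read a shared "insecure" pool, so the guest stages each DMA buffer through that pool instead of handing out real physical addresses.

```python
# Toy model of bounce buffering for a secure VM. All names are
# illustrative; the real kernel mechanism is swiotlb.

class SecureGuest:
    def __init__(self):
        self.secure_mem = {}   # guest-private pages: host cannot read these
        self.bounce_pool = {}  # shared "insecure" pages: host can read these

    def write_page(self, pa, data):
        """A normal guest memory write lands in secure (host-invisible) memory."""
        self.secure_mem[pa] = data

    def dma_map(self, pa):
        """Before I/O, copy (bounce) the secure page into the shared pool
        and hand the host the bounce address instead of the real PA."""
        bounce_addr = len(self.bounce_pool)  # pick a free slot
        self.bounce_pool[bounce_addr] = self.secure_mem[pa]
        return bounce_addr

class Host:
    """qemu's view: it can only dereference addresses inside the bounce pool."""
    def read(self, guest, addr):
        if addr in guest.bounce_pool:
            return guest.bounce_pool[addr]
        raise PermissionError("host cannot access secure guest memory")

guest = SecureGuest()
host = Host()

guest.write_page(pa=0x1000, data=b"virtio buffer")
bounce = guest.dma_map(0x1000)

assert host.read(guest, bounce) == b"virtio buffer"  # I/O works via the pool
try:
    host.read(guest, 0x1000)  # direct access to the secure PA is blocked
    assert False
except PermissionError:
    pass
```

This also shows why qemu "doesn't see a difference": from the host side the transaction is still "you pass in an address, host reads/writes there" — the guest just makes sure that address is always inside the accessible pool.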
Re: [RFC 0/4] Virtio uses DMA API for all devices
On Wed, 2018-08-01 at 01:36 -0700, Christoph Hellwig wrote: > We just need to figure out how to deal with devices that deviate > from the default. One thing is that VIRTIO_F_IOMMU_PLATFORM really > should become VIRTIO_F_PLATFORM_DMA to cover the cases of non-iommu > dma tweaks (offsets, cache flushing), which seems well in spirit of > the original design. I don't completely agree: 1 - VIRTIO_F_IOMMU_PLATFORM is a property of the "other side", ie qemu for example. It indicates that the peer bypasses the normal platform iommu. The platform code in the guest has no real way to know that this is the case, this is a specific "feature" of the qemu implementation. 2 - VIRTIO_F_PLATFORM_DMA (or whatever you want to call it), is a property of the guest platform itself (not qemu), there's no way the "peer" can advertise it via the virtio negotiated flags. At least for us. I don't know for sure whether that would be workable for the ARM case. In our case, qemu has no idea at VM creation time that the VM will turn itself into a secure VM and thus will require bounce buffering for IOs (including virtio). So unless we have another hook for the arch code to set VIRTIO_F_PLATFORM_DMA on selected (or all) virtio devices from the guest itself, I don't see that as a way to deal with it. > The other issue is VIRTIO_F_IO_BARRIER > which is very vaguely defined, and which needs a better definition. > And last but not least we'll need some text explaining the challenges > of hardware devices - I think VIRTIO_F_PLATFORM_DMA + VIRTIO_F_IO_BARRIER > is what would basically cover them, but a good description including > an explanation of why these matter. Ben.
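Ben's distinction — negotiated feature bits come from the device side, while the need for bounce buffering is a guest-platform property — can be sketched as the decision the guest's virtio core would have to make. This is a hedged toy model in Python with a hypothetical `platform_forces_dma` hook (the real decision point is `vring_use_dma_api()` in the kernel's virtio ring code); only the `VIRTIO_F_IOMMU_PLATFORM` bit number (33) is taken from the virtio spec.

```python
# Toy sketch of the DMA-path decision. The helper name and the
# platform_forces_dma hook are hypothetical; the real code lives in
# the kernel's virtio core (vring_use_dma_api()).

VIRTIO_F_IOMMU_PLATFORM = 33  # feature bit number from the virtio spec

def use_dma_api(negotiated_features, platform_forces_dma):
    """Return True if the guest must translate/bounce buffers via the DMA API.

    - negotiated_features comes from the device side (qemu), which can say
      "I honor the platform IOMMU" via VIRTIO_F_IOMMU_PLATFORM.
    - platform_forces_dma models the guest-side arch hook Ben asks for:
      a VM that turned itself secure after boot needs bounce buffering
      even though qemu negotiated a legacy (IOMMU-bypass) device.
    """
    if negotiated_features & (1 << VIRTIO_F_IOMMU_PLATFORM):
        return True
    return platform_forces_dma  # guest-platform override, not negotiable

# Legacy device, ordinary guest: direct physical addresses are fine.
assert use_dma_api(0, platform_forces_dma=False) is False
# Legacy device, but the guest became a secure VM: must bounce anyway.
assert use_dma_api(0, platform_forces_dma=True) is True
# Device honors the platform IOMMU: always go through the DMA API.
assert use_dma_api(1 << VIRTIO_F_IOMMU_PLATFORM, platform_forces_dma=False) is True
```

The middle case is exactly the one feature negotiation cannot express: qemu set no bit, yet the guest still needs the DMA path, which is why a guest-side arch hook is being requested.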
Re: [PATCH v2 2/2] virtio_balloon: replace oom notifier with shrinker
On Thu, Aug 02, 2018 at 06:32:44PM +0800, Wei Wang wrote: > On 08/01/2018 07:34 PM, Michal Hocko wrote: > > On Wed 01-08-18 19:12:25, Wei Wang wrote: > > > On 07/30/2018 05:00 PM, Michal Hocko wrote: > > > > On Fri 27-07-18 17:24:55, Wei Wang wrote: > > > > > The OOM notifier is getting deprecated for the reasons mentioned > > > > > here by Michal Hocko: https://lkml.org/lkml/2018/7/12/314 > > > > > > > > > > This patch replaces the virtio-balloon oom notifier with a shrinker > > > > > to release balloon pages on memory pressure. > > > > It would be great to document the replacement. This is not a small > > > > change... > > > OK. I plan to document the following in the commit log: > > > > > > The OOM notifier is getting deprecated for the following reasons: > > > - As a callout from the oom context, it is too subtle and easy to > > > generate bugs and corner cases which are hard to track; > > > - It is called too late (after the reclaim has been performed). > > > Drivers with a large amount of reclaimable memory are expected to > > > release it at an early stage of memory pressure; > > > - The notifier callback isn't aware of the oom constraints; > > > Link: https://lkml.org/lkml/2018/7/12/314 > > > > > > This patch replaces the virtio-balloon oom notifier with a shrinker > > > to release balloon pages on memory pressure. Users can set the > > > amount of memory pages to release each time a shrinker_scan is > > > called via the module parameter balloon_pages_to_shrink, and the > > > default amount is 256 pages. Historically, the feature > > > VIRTIO_BALLOON_F_DEFLATE_ON_OOM has been used to release balloon > > > pages on OOM. We continue to use this feature bit for the shrinker, > > > so the shrinker is only registered when this feature bit has been > > > negotiated with the host. > > Do you have any numbers for how this works in practice?
> > It works in this way: for example, we can set the parameter, > balloon_pages_to_shrink, to shrink 1GB of memory once shrink scan is called. > Now, we have an 8GB guest, and we balloon out 7GB. When shrink scan is > called, the balloon driver will get back 1GB of memory and give it back to > mm, then the ballooned memory becomes 6GB. > > When the shrinker scan is called the second time, another 1GB will be given > back to mm. So the ballooned pages are given back to mm gradually. I think what's being asked here is a description of tests that were run. Which workloads see improved behaviour? Our behaviour under memory pressure isn't great; in particular, it is not clear when it's safe to re-inflate the balloon: if the host attempts to re-inflate it too soon, we still get OOM. It would be better if VIRTIO_BALLOON_F_DEFLATE_ON_OOM would somehow mean "it's ok to ask for almost all of memory; if the guest needs memory from the balloon for apps to function, it can take it from the balloon". -- MST
Re: [PATCH v2 2/2] virtio_balloon: replace oom notifier with shrinker
On Thu 02-08-18 18:32:44, Wei Wang wrote: > On 08/01/2018 07:34 PM, Michal Hocko wrote: > > On Wed 01-08-18 19:12:25, Wei Wang wrote: > > > On 07/30/2018 05:00 PM, Michal Hocko wrote: > > > > On Fri 27-07-18 17:24:55, Wei Wang wrote: > > > > > The OOM notifier is getting deprecated for the reasons mentioned > > > > > here by Michal Hocko: https://lkml.org/lkml/2018/7/12/314 > > > > > > > > > > This patch replaces the virtio-balloon oom notifier with a shrinker > > > > > to release balloon pages on memory pressure. > > > > It would be great to document the replacement. This is not a small > > > > change... > > > OK. I plan to document the following in the commit log: > > > > > > The OOM notifier is getting deprecated for the following reasons: > > > - As a callout from the oom context, it is too subtle and easy to > > > generate bugs and corner cases which are hard to track; > > > - It is called too late (after the reclaim has been performed). > > > Drivers with a large amount of reclaimable memory are expected to > > > release it at an early stage of memory pressure; > > > - The notifier callback isn't aware of the oom constraints; > > > Link: https://lkml.org/lkml/2018/7/12/314 > > > > > > This patch replaces the virtio-balloon oom notifier with a shrinker > > > to release balloon pages on memory pressure. Users can set the > > > amount of memory pages to release each time a shrinker_scan is > > > called via the module parameter balloon_pages_to_shrink, and the > > > default amount is 256 pages. Historically, the feature > > > VIRTIO_BALLOON_F_DEFLATE_ON_OOM has been used to release balloon > > > pages on OOM. We continue to use this feature bit for the shrinker, > > > so the shrinker is only registered when this feature bit has been > > > negotiated with the host. > > Do you have any numbers for how this works in practice?
> > It works in this way: for example, we can set the parameter, > balloon_pages_to_shrink, to shrink 1GB of memory once shrink scan is called. > Now, we have an 8GB guest, and we balloon out 7GB. When shrink scan is > called, the balloon driver will get back 1GB of memory and give it back to > mm, then the ballooned memory becomes 6GB. > > When the shrinker scan is called the second time, another 1GB will be given > back to mm. So the ballooned pages are given back to mm gradually. > > > Let's say > > you have a medium page cache workload which triggers kswapd to do a > > light reclaim? Hardcoded shrinking sounds quite dubious to me but I have > > no idea how people expect this to work. Shouldn't this be more > > adaptive? How precious are those pages anyway? > > Those pages are given to the host to use usually because the guest has enough > free memory, and the host doesn't want to waste those pieces of memory as they > are not used by this guest. When the guest needs them, it is reasonable that > the guest has higher priority to take them back. > But I'm not sure if there would be a more adaptive approach than "gradually > giving back as the guest wants more". I am not sure I follow. Let me be more specific. Say you have a trivial stream IO triggering reclaim to recycle clean page cache. This will invoke slab shrinkers as well. Do you really want to drop your batch of pages on each invocation? Doesn't that remove them very quickly? Just try to dd if=large_file of=/dev/null and see how your pages are disappearing. Shrinkers usually scale the number of objects they are going to reclaim based on the memory pressure (aka the target to be reclaimed). -- Michal Hocko SUSE Labs
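Michal's point — that shrinkers should honor the scan target computed from memory pressure rather than release a hardcoded batch — can be illustrated with a toy simulation. This is a sketch in plain Python with illustrative names (the real kernel interface is `struct shrinker`'s `count_objects()`/`scan_objects()` callbacks, driven by `do_shrink_slab()`; `SHRINK_BATCH` is 128 in the kernel):

```python
# Toy simulation contrasting a hardcoded shrink batch with a shrinker
# that honors the requested scan target. Names are illustrative.

GB = 256 * 1024  # pages per GB, assuming 4 KiB pages

class Balloon:
    def __init__(self, pages):
        self.pages = pages  # currently ballooned (inflated) pages

    def scan_fixed(self, nr_to_scan):
        """Proposed behaviour: always release a full balloon_pages_to_shrink
        batch (here 1 GB), ignoring how much reclaim actually asked for."""
        freed = min(self.pages, GB)
        self.pages -= freed
        return freed

    def scan_adaptive(self, nr_to_scan):
        """Adaptive behaviour: free only what this invocation asked for."""
        freed = min(self.pages, nr_to_scan)
        self.pages -= freed
        return freed

# Light page-cache pressure (e.g. dd if=large_file of=/dev/null): vmscan
# calls every registered shrinker many times with small scan targets.
fixed, adaptive = Balloon(7 * GB), Balloon(7 * GB)
for _ in range(10):
    fixed.scan_fixed(128)      # asked for 128, frees 262144 anyway
    adaptive.scan_adaptive(128)

assert fixed.pages == 0                       # balloon fully drained
assert adaptive.pages == 7 * GB - 10 * 128    # only ~5 MB released
```

Ten light-reclaim invocations are enough to empty a 7 GB balloon under the fixed-batch scheme, while the adaptive version gives back memory roughly in proportion to the pressure — which is the behaviour Wei agrees to adopt in the next version.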
Re: [PATCH v2 2/2] virtio_balloon: replace oom notifier with shrinker
On 2018/08/02 19:32, Wei Wang wrote: > On 08/01/2018 07:34 PM, Michal Hocko wrote: >> Do you have any numbers for how this works in practice? > > It works in this way: for example, we can set the parameter, > balloon_pages_to_shrink, > to shrink 1GB of memory once shrink scan is called. Now, we have an 8GB guest, > and we balloon > out 7GB. When shrink scan is called, the balloon driver will get back 1GB > of memory and give > it back to mm, then the ballooned memory becomes 6GB. Since the shrinker might be called concurrently (am I correct?), the balloon might deflate far more than needed if it releases that much memory. If a shrinker is used, releasing 256 pages might be sufficient.
CFP ICEIS 2019 - 21st Int.l Conf. on Enterprise Information Systems (Heraklion, Crete/Greece)
SUBMISSION DEADLINE 21st International Conference on Enterprise Information Systems Submission Deadline: December 10, 2018 http://www.iceis.org/ May 3 - 5, 2019 Heraklion, Crete, Greece. ICEIS is organized in 6 major tracks: - Databases and Information Systems Integration - Artificial Intelligence and Decision Support Systems - Information Systems Analysis and Specification - Software Agents and Internet Computing - Human-Computer Interaction - Enterprise Architecture Proceedings will be submitted for indexation by: DBLP, Thomson Reuters, EI, SCOPUS and Semantic Scholar. A short list of presented papers will be selected so that revised and extended versions of these papers will be published by Springer. All papers presented at the congress venue will also be available at the SCITEPRESS Digital Library (http://www.scitepress.org/DigitalLibrary/). Should you have any question please don’t hesitate contacting me. Kind regards, ICEIS Secretariat Address: Av. D. Manuel I, 27A, 2º esq. 2910-595 Setubal, Portugal Tel: +351 265 520 184 Fax: +351 265 520 186 Web: http://www.iceis.org/ e-mail: iceis.secretar...@insticc.org ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
CFP CLOSER 2019 - 8th Int.l Conf. on Cloud Computing and Services Science (Heraklion, Crete/Greece)
SUBMISSION DEADLINE 8th International Conference on Cloud Computing and Services Science Submission Deadline: December 10, 2018 http://closer.scitevents.org May 2 - 4, 2019 Heraklion, Crete, Greece. CLOSER is organized in 9 major tracks: - Services Science - Data as a Service - Cloud Operations - Edge Cloud and Fog Computing - Service Modelling and Analytics - Mobile Cloud Computing - Cloud Computing Fundamentals - Cloud Computing Platforms and Applications - Cloud Computing Enabling Technology Proceedings will be submitted for indexation by: DBLP, Thomson Reuters, EI, SCOPUS and Semantic Scholar. A short list of presented papers will be selected so that revised and extended versions of these papers will be published by Springer. All papers presented at the congress venue will also be available at the SCITEPRESS Digital Library (http://www.scitepress.org/DigitalLibrary/). Should you have any question please don’t hesitate contacting me. Kind regards, CLOSER Secretariat Address: Av. D. Manuel I, 27A, 2º esq. 2910-595 Setubal, Portugal Tel: +351 265 520 184 Fax: +351 265 520 186 Web: http://closer.scitevents.org e-mail: closer.secretar...@insticc.org ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
CFP VEHITS 2019 - 5th Int.l Conf. on Vehicle Technology and Intelligent Transport Systems (Heraklion, Crete/Greece)
SUBMISSION DEADLINE 5th International Conference on Vehicle Technology and Intelligent Transport Systems Submission Deadline: December 10, 2018 http://www.vehits.org/ May 3 - 5, 2019 Heraklion, Crete, Greece. VEHITS is organized in 5 major tracks: - Intelligent Vehicle Technologies - Intelligent Transport Systems and Infrastructure - Connected Vehicles - Sustainable Transport - Data Analytics Proceedings will be submitted for indexation by: DBLP, Thomson Reuters, EI, SCOPUS and Semantic Scholar. With the presence of internationally distinguished keynote speakers: Javier Sánchez-Medina, University of Las Palmas de Gran Canaria, Spain Jeroen Ploeg, 2getthere B.V., Utrecht, The Netherlands Lead Cooperative Driving Eindhoven University of Technology, Eindhoven, The Netherlands Associate professor (part time), faculty of Mechanical Engineering, Dynamics and Control group, Netherlands A short list of presented papers will be selected so that revised and extended versions of these papers will be published by Springer. All papers presented at the congress venue will also be available at the SCITEPRESS Digital Library (http://www.scitepress.org/DigitalLibrary/). Should you have any question please don’t hesitate contacting me. Kind regards, VEHITS Secretariat Address: Av. D. Manuel I, 27A, 2º esq. 2910-595 Setubal, Portugal Tel: +351 265 520 185 Fax: +351 265 520 186 Web: http://www.vehits.org/ e-mail: vehits.secretar...@insticc.org ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
CFP SMARTGREENS 2019 - 8th Int.l Conf. on Smart Cities and Green ICT Systems (Heraklion, Crete/Greece)
SUBMISSION DEADLINE 8th International Conference on Smart Cities and Green ICT Systems Submission Deadline: December 10, 2018 http://www.smartgreens.org/ May 3 - 5, 2019 Heraklion, Crete, Greece. SMARTGREENS is organized in 5 major tracks: - Energy-Aware Systems and Technologies - Sustainable Computing and Systems - Smart Cities and Smart Buildings - Demos and Use-Cases - Smart and Digital Services Proceedings will be submitted for indexation by: DBLP, Thomson Reuters, EI, SCOPUS and Semantic Scholar. With the presence of internationally distinguished keynote speakers: Norbert Streitz, Founder and Scientific Director, Smart Future Initiative, Germany Rudolf Giffinger, Vienna University of Technology, Austria A short list of presented papers will be selected so that revised and extended versions of these papers will be published by Springer. All papers presented at the congress venue will also be available at the SCITEPRESS Digital Library (http://www.scitepress.org/DigitalLibrary/). Should you have any question please don’t hesitate contacting me. Kind regards, SMARTGREENS Secretariat Address: Av. D. Manuel I, 27A, 2º esq. 2910-595 Setubal, Portugal Tel: +351 265 520 185 Fax: +351 265 520 186 Web: http://www.smartgreens.org/ e-mail: smartgreens.secretar...@insticc.org ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
CFP IoTBDS 2019 - 3rd Int.l Conf. on Internet of Things, Big Data and Security (Heraklion, Crete/Greece)
SUBMISSION DEADLINE 3rd International Conference on Internet of Things, Big Data and Security Submission Deadline: December 10, 2018 http://iotbds.org/ May 2 - 4, 2019 Heraklion, Crete, Greece. IoTBDS is organized in 7 major tracks: - Big Data Research - Emerging Services and Analytics - Internet of Things (IoT) Fundamentals - Internet of Things (IoT) Applications - Big Data for Multi-discipline Services - Security, Privacy and Trust - IoT Technologies Proceedings will be submitted for indexation by: DBLP, Thomson Reuters, EI, SCOPUS and Semantic Scholar. With the presence of internationally distinguished keynote speakers: Francisco Herrera, University of Granada, Spain A short list of presented papers will be selected so that revised and extended versions of these papers will be published by Springer. All papers presented at the congress venue will also be available at the SCITEPRESS Digital Library (http://www.scitepress.org/DigitalLibrary/). Should you have any question please don’t hesitate contacting me. Kind regards, IoTBDS Secretariat Address: Av. D. Manuel I, 27A, 2º esq. 2910-595 Setubal, Portugal Tel: +351 265 520 184 Fax: +351 265 520 186 Web: http://iotbds.org/ e-mail: iotbds.secretar...@insticc.org ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
CFP SENSORNETS 2019 - 8th Int.l Conf. on Sensor Networks (Prague/Czech Republic)
SUBMISSION DEADLINE 8th International Conference on Sensor Networks Submission Deadline: October 1, 2018 http://www.sensornets.org/ February 26 - 27, 2019 Prague, Czech Republic. SENSORNETS is organized in 5 major tracks: - Sensor Networks Software, Architectures and Applications - Wireless Sensor Networks - Energy and Environment - Intelligent Data Analysis and Processing - Security and Privacy in Sensor Networks Proceedings will be submitted for indexation by: DBLP, Thomson Reuters, EI, SCOPUS and Semantic Scholar. A short list of presented papers will be selected so that revised and extended versions of these papers will be published by Springer. All papers presented at the congress venue will also be available at the SCITEPRESS Digital Library (http://www.scitepress.org/DigitalLibrary/). Should you have any question please don’t hesitate contacting me. Kind regards, SENSORNETS Secretariat Address: Av. D. Manuel I, 27A, 2º esq. 2910-595 Setubal, Portugal Tel: +351 265 100 033 Fax: +351 265 520 186 Web: http://www.sensornets.org/ e-mail: sensornets.secretar...@insticc.org ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 2/2] virtio_balloon: replace oom notifier with shrinker
On Thu 02-08-18 19:27:40, Wei Wang wrote: > On 08/02/2018 07:00 PM, Tetsuo Handa wrote: > > On 2018/08/02 19:32, Wei Wang wrote: > > > On 08/01/2018 07:34 PM, Michal Hocko wrote: > > > > Do you have any numbers for how this works in practice? > > > It works in this way: for example, we can set the parameter, > > > balloon_pages_to_shrink, > > > to shrink 1GB of memory once shrink scan is called. Now, we have an 8GB > > > guest, and we balloon > > > out 7GB. When shrink scan is called, the balloon driver will get back 1GB > > > of memory and give > > > it back to mm, then the ballooned memory becomes 6GB. > > Since the shrinker might be called concurrently (am I correct?), > > Not sure about it being concurrent, but I think it would be called > repeatedly as long as should_continue_reclaim() returns true. Multiple direct reclaimers might indeed invoke it concurrently. -- Michal Hocko SUSE Labs
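Tetsuo's concern about concurrent invocations can be modeled directly: if several reclaimers run the scan callback at the same time and each releases a full batch, the balloon deflates by a multiple of the intended amount. A shared remaining-target counter, updated atomically, caps the total. This is a toy Python sketch with illustrative names, not the driver's actual code:

```python
# Toy model of concurrent shrinker scans over-deflating the balloon.
# The lock models the atomic update the real driver would need; all
# names and numbers are illustrative.

import threading

def run_concurrent_scans(nr_threads, batch, target=None):
    """Run nr_threads concurrent scan calls; return total pages freed.
    With target=None each call frees a full batch (the over-deflate
    case); with a target, each call only grabs what is still left."""
    freed_total = 0
    remaining = target
    lock = threading.Lock()

    def scan():
        nonlocal freed_total, remaining
        with lock:  # models an atomic claim on part of the target
            take = batch if remaining is None else min(batch, remaining)
            if remaining is not None:
                remaining -= take
            freed_total += take

    threads = [threading.Thread(target=scan) for _ in range(nr_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return freed_total

ONE_GB = 262144  # pages, assuming 4 KiB pages

# Uncapped: 4 concurrent scans each free a full 1 GB batch -> 4 GB gone.
assert run_concurrent_scans(4, batch=ONE_GB) == 4 * ONE_GB
# Capped by a shared target: the same 4 scans free at most 1 GB in total.
assert run_concurrent_scans(4, batch=ONE_GB, target=ONE_GB) == ONE_GB
```

This is the same reason Tetsuo suggests a small per-call amount (e.g. 256 pages): even if many reclaimers pile in, each invocation only removes a bounded slice of the balloon.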
Re: [PATCH v2 2/2] virtio_balloon: replace oom notifier with shrinker
On 08/02/2018 07:00 PM, Tetsuo Handa wrote: On 2018/08/02 19:32, Wei Wang wrote: On 08/01/2018 07:34 PM, Michal Hocko wrote: Do you have any numbers for how this works in practice? It works in this way: for example, we can set the parameter, balloon_pages_to_shrink, to shrink 1GB of memory once shrink scan is called. Now, we have an 8GB guest, and we balloon out 7GB. When shrink scan is called, the balloon driver will get back 1GB of memory and give it back to mm, then the ballooned memory becomes 6GB. Since the shrinker might be called concurrently (am I correct?), Not sure about it being concurrent, but I think it would be called repeatedly as long as should_continue_reclaim() returns true. Best, Wei
Re: [PATCH v2 2/2] virtio_balloon: replace oom notifier with shrinker
On 08/01/2018 07:34 PM, Michal Hocko wrote: On Wed 01-08-18 19:12:25, Wei Wang wrote: On 07/30/2018 05:00 PM, Michal Hocko wrote: On Fri 27-07-18 17:24:55, Wei Wang wrote: The OOM notifier is getting deprecated for the reasons mentioned here by Michal Hocko: https://lkml.org/lkml/2018/7/12/314 This patch replaces the virtio-balloon oom notifier with a shrinker to release balloon pages on memory pressure. It would be great to document the replacement. This is not a small change... OK. I plan to document the following in the commit log: The OOM notifier is getting deprecated for the following reasons: - As a callout from the oom context, it is too subtle and easy to generate bugs and corner cases which are hard to track; - It is called too late (after the reclaim has been performed). Drivers with a large amount of reclaimable memory are expected to release it at an early stage of memory pressure; - The notifier callback isn't aware of the oom constraints; Link: https://lkml.org/lkml/2018/7/12/314 This patch replaces the virtio-balloon oom notifier with a shrinker to release balloon pages on memory pressure. Users can set the amount of memory pages to release each time a shrinker_scan is called via the module parameter balloon_pages_to_shrink, and the default amount is 256 pages. Historically, the feature VIRTIO_BALLOON_F_DEFLATE_ON_OOM has been used to release balloon pages on OOM. We continue to use this feature bit for the shrinker, so the shrinker is only registered when this feature bit has been negotiated with the host. Do you have any numbers for how this works in practice? It works in this way: for example, we can set the parameter, balloon_pages_to_shrink, to shrink 1GB of memory once shrink scan is called. Now, we have an 8GB guest, and we balloon out 7GB.
When shrink scan is called, the balloon driver will get back 1GB of memory and give it back to mm, then the ballooned memory becomes 6GB. When the shrinker scan is called the second time, another 1GB will be given back to mm. So the ballooned pages are given back to mm gradually. Let's say you have a medium page cache workload which triggers kswapd to do a light reclaim? Hardcoded shrinking sounds quite dubious to me but I have no idea how people expect this to work. Shouldn't this be more adaptive? How precious are those pages anyway? Those pages are given to the host to use usually because the guest has enough free memory, and the host doesn't want to waste those pieces of memory as they are not used by this guest. When the guest needs them, it is reasonable that the guest has higher priority to take them back. But I'm not sure if there would be a more adaptive approach than "gradually giving back as the guest wants more". Best, Wei
Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()
On 2018/08/02 18:23, Jason Wang wrote: > On 2018年08月02日 16:41, Toshiaki Makita wrote: >> On 2018/08/02 17:18, Jason Wang wrote: >>> On 2018年08月01日 17:52, Tonghao Zhang wrote: > +static void vhost_net_busy_poll_check(struct vhost_net *net, > + struct vhost_virtqueue *rvq, > + struct vhost_virtqueue *tvq, > + bool rx) > +{ > + struct socket *sock = rvq->private_data; > + > + if (rx) > + vhost_net_busy_poll_try_queue(net, tvq); > + else if (sock && sk_has_rx_data(sock->sk)) > + vhost_net_busy_poll_try_queue(net, rvq); > + else { > + /* On tx here, sock has no rx data, so we > + * will wait for sock wakeup for rx, and > + * vhost_enable_notify() is not needed. */ A possible case is we do have rx data but the guest does not refill the rx queue. In this case we may lose notifications from the guest. >>> Yes, should consider this case. thanks. >> I'm a bit confused. Isn't this covered by the previous >> "else if (sock && sk_has_rx_data(...))" block? > > The problem is it does nothing if vhost_vq_avail_empty() is true and > vhost_enable_notify() is false. If vhost_enable_notify() is false, the guest will eventually kick the vq, no? -- Toshiaki Makita
Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()
On 2018年08月02日 16:41, Toshiaki Makita wrote: On 2018/08/02 17:18, Jason Wang wrote: On 2018年08月01日 17:52, Tonghao Zhang wrote: +static void vhost_net_busy_poll_check(struct vhost_net *net, + struct vhost_virtqueue *rvq, + struct vhost_virtqueue *tvq, + bool rx) +{ + struct socket *sock = rvq->private_data; + + if (rx) + vhost_net_busy_poll_try_queue(net, tvq); + else if (sock && sk_has_rx_data(sock->sk)) + vhost_net_busy_poll_try_queue(net, rvq); + else { + /* On tx here, sock has no rx data, so we + * will wait for sock wakeup for rx, and + * vhost_enable_notify() is not needed. */ A possible case is we do have rx data but the guest does not refill the rx queue. In this case we may lose notifications from the guest. Yes, should consider this case. thanks. I'm a bit confused. Isn't this covered by the previous "else if (sock && sk_has_rx_data(...))" block? The problem is it does nothing if vhost_vq_avail_empty() is true and vhost_enable_notify() is false. + + cpu_relax(); + } + + preempt_enable(); + + if (!rx) + vhost_net_enable_vq(net, vq); No need to enable the rx virtqueue, if we are sure handle_rx() will be called soon. If we disable the rx virtqueue in handle_tx and don't send packets from the guest anymore (handle_tx is not called), we cannot wake up for sock rx, so the network is broken. Not sure I understand here. I mean if we schedule work for handle_rx(), there's no need to enable it since handle_rx() will do this for us. Looks like in the last "else" block in vhost_net_busy_poll_check() we need to enable the vq since in that case we have no rx data and handle_rx() is not scheduled. Yes. Thanks
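The branch logic Toshiaki and Jason converge on can be summarized as a small decision table. This is a plain-Python model, not the patch itself: the names mirror `vhost_net_busy_poll_check()` from the thread, and the key point is the fix both agree on — in the final "else" branch (tx side, no socket rx data) the rx virtqueue must be re-enabled, because no `handle_rx()` work is scheduled there to do it.

```python
# Plain-Python model of the vhost_net_busy_poll_check() branches
# discussed above, including the agreed fix for the last branch.
# This is a sketch of the logic only, not kernel code.

def busy_poll_check(rx, sock_has_rx_data):
    """Return (queue to keep busy-polling, whether to re-enable the rx vq)."""
    if rx:
        return ("tvq", False)   # rx side: busy-poll the tx queue
    if sock_has_rx_data:
        return ("rvq", False)   # tx side, rx data pending: poll the rx queue
    # tx side, no rx data: just wait for the socket wakeup, and re-enable
    # rx vq notifications, since no handle_rx() was scheduled to do it.
    return (None, True)

assert busy_poll_check(rx=True,  sock_has_rx_data=False) == ("tvq", False)
assert busy_poll_check(rx=False, sock_has_rx_data=True)  == ("rvq", False)
assert busy_poll_check(rx=False, sock_has_rx_data=False) == (None, True)
```

The first two branches hand off to further polling, so notification enabling can be deferred to the scheduled handler; only the third branch ends with nothing scheduled, which is why it is the one that must re-enable the vq itself.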
Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()
On 2018/08/02 17:18, Jason Wang wrote: > On 2018年08月01日 17:52, Tonghao Zhang wrote: >>> +static void vhost_net_busy_poll_check(struct vhost_net *net, >>> + struct vhost_virtqueue *rvq, >>> + struct vhost_virtqueue *tvq, >>> + bool rx) >>> +{ >>> + struct socket *sock = rvq->private_data; >>> + >>> + if (rx) >>> + vhost_net_busy_poll_try_queue(net, tvq); >>> + else if (sock && sk_has_rx_data(sock->sk)) >>> + vhost_net_busy_poll_try_queue(net, rvq); >>> + else { >>> + /* On tx here, sock has no rx data, so we >>> + * will wait for sock wakeup for rx, and >>> + * vhost_enable_notify() is not needed. */ >> >> A possible case is we do have rx data but the guest does not refill the rx >> queue. In this case we may lose notifications from the guest. > Yes, should consider this case. thanks. I'm a bit confused. Isn't this covered by the previous "else if (sock && sk_has_rx_data(...))" block? + + cpu_relax(); + } + + preempt_enable(); + + if (!rx) + vhost_net_enable_vq(net, vq); >>> No need to enable the rx virtqueue, if we are sure handle_rx() will be >>> called soon. >> If we disable the rx virtqueue in handle_tx and don't send packets from >> the guest anymore (handle_tx is not called), we cannot wake up for sock rx, >> so the network is broken. > > Not sure I understand here. I mean if we schedule work for handle_rx(), > there's no need to enable it since handle_rx() will do this for us. Looks like in the last "else" block in vhost_net_busy_poll_check() we need to enable the vq since in that case we have no rx data and handle_rx() is not scheduled. -- Toshiaki Makita
Re: [PATCH net-next v7 3/4] net: vhost: factor out busy polling logic to vhost_net_busy_poll()
On 2018年08月01日 17:52, Tonghao Zhang wrote: + + cpu_relax(); + } + + preempt_enable(); + + if (!rx) + vhost_net_enable_vq(net, vq); No need to enable the rx virtqueue, if we are sure handle_rx() will be called soon. If we disable the rx virtqueue in handle_tx and don't send packets from the guest anymore (handle_tx is not called), we cannot wake up for sock rx, so the network is broken. Not sure I understand here. I mean if we schedule work for handle_rx(), there's no need to enable it since handle_rx() will do this for us. Thanks