Re: [libvirt-users] RLIMIT_MEMLOCK in container environment

2019-08-24 Thread Laine Stump

On 8/24/19 3:08 AM, Dan Kenigsberg wrote:



On Fri, 23 Aug 2019, 0:27 Laine Stump, > wrote:


(Adding Alex Williamson to Cc so he can correct any mistakes)

On 8/22/19 4:39 PM, Ihar Hrachyshka wrote:
 > On Thu, Aug 22, 2019 at 12:01 PM Laine Stump mailto:la...@redhat.com>> wrote:
 >>
 >> On 8/22/19 10:56 AM, Ihar Hrachyshka wrote:
 >>> On Thu, Aug 22, 2019 at 2:24 AM Daniel P. Berrangé
mailto:berra...@redhat.com>> wrote:
 
  On Wed, Aug 21, 2019 at 01:37:21PM -0700, Ihar Hrachyshka wrote:
 > Hi all,
 >
 > KubeVirt uses libvirtd to manage qemu VMs represented as
Kubernetes
 > API resources. In this case, libvirtd is running inside an
 > unprivileged pod, with some host mounts / capabilities added
to the
 > pod, needed by libvirtd and other services.
 >
 > One of the capabilities libvirtd requires for successful startup
 > inside a pod is SYS_RESOURCE. This capability is used to adjust
 > RLIMIT_MEMLOCK ulimit value depending on devices attached to the
 > managed guest, both on startup and during hotplug. AFAIU the
need to
 > lock the memory is to avoid pages being pushed out from RAM
into swap.
 >>
 >>
 >> I recall successfully testing GPU assignment from an unprivileged
 >> libvirtd several years ago by setting a high enough ulimit for
the uid
 >> used to run libvirtd in advance (. I think we check if the current
 >> setting is high enough, and don't try to set it unless we think
we need to.
 >>
 >
 > The PR I linked to in the original email does just that: it starts
 > libvirtd; then, if domain is going to use VFIO, sets ulimit of
 > libvirtd process to VM memory size + 1Gb (mimicking libvirt code) +
 > 256Mb (to stay conservative) using prlimit() syscall; then
defines the
 > domain.

So you're making an educated guess, which is essentially what
libvirt is
doing (based on advice from other people with better information than
us, but still a guess).

 >
 >> If I understand you correctly, you're saying that in your case
it's okay
 >> for the memlock limit to be lower than we try to set it to,
because swap
 >> is disabled anyway, is that correct?
 >>
 >
 > I'm honestly not exactly sure about the reason why we need to set the
 > limit, but I assume it's because of swap. I can be totally
confused on
 > that part though.


What I understand from an IRC conversation with Alex just now is that
increasing RLIMIT_MEMLOCK isn't done just to prevent any of the pages
being swapped out. It's done because "all GPAs (Guest Physical
Addresses) that could potentially be DMA targets need to have fixed
mappings through the iommu, therefore all need to be allocated and
mappings fixed [...] setting rlimit allows us to perform all the
necessary pins within the user's locked memory limit".

So even if swap is disabled, it still needs to be done (either by
libvirt, or by someone else who has the necessary privileges and
control
over the libvirtd process).


 
  Libvirt shouldn't set RLIMIT_MEMLOCK by default, unless there's
  something in the XML that requires it - one of
 >>>
 >>> You are right, sorry. We add SYS_RESOURCE only for particular
domains.
 >>>
 
     - hard limit memory value is present
     - host PCI device passthrough is requested
 >>>
 >>> We are using passthrough
 >>
 >> (If you want to make Alex happy, use the term "VFIO device
assignment"
 >> rather than passthrough :-).)
 >>
 >
 > Not sure who Alex is but I'll try to make everyone happy! :)

The Alex I'm referring to is the Alex I just Cc'ed. He is the VFIO
maintainer.


 >>> to pass SR-IOV NIC VFs into guests. We also
 >>> plan to do the same for GPUs in the near future.
 >>
 >>   >>> I believe we would benefit from one of the following
features on
 >>   >>> libvirt side (or both):
 >>   >>>
 >>   >>> a) expose the memory lock value calculated by libvirtd through
 >>   >>> libvirt ABI so that we can use it when calling prlimit()
on libvirtd
 >>   >>> process;
 >>   >>> b) allow to disable setrlimit() calls via libvirtd config
file knob
 >>   >>> or domain definition.
 >>
 >> (b) sounds much more reasonable, as long as qemu doesn't complain (I
 >> don't know whether or not it checks)
 >>
 >> Slightly related to this - I'm currently working on patches to avoid
 >> making any ioctl calls that would fail in an unprivileged
libvirtd when
 >> using tap/macvtap devices. 



This is music to my ears, great to hear.

ATM, I'm doing this by adding an attribute
 >> "unmanaged='yes'" to the interface  element. The idea 

Re: [libvirt-users] RLIMIT_MEMLOCK in container environment

2019-08-24 Thread Dan Kenigsberg
On Fri, 23 Aug 2019, 0:27 Laine Stump,  wrote:

> (Adding Alex Williamson to Cc so he can correct any mistakes)
>
> On 8/22/19 4:39 PM, Ihar Hrachyshka wrote:
> > On Thu, Aug 22, 2019 at 12:01 PM Laine Stump  wrote:
> >>
> >> On 8/22/19 10:56 AM, Ihar Hrachyshka wrote:
> >>> On Thu, Aug 22, 2019 at 2:24 AM Daniel P. Berrangé <
> berra...@redhat.com> wrote:
> 
>  On Wed, Aug 21, 2019 at 01:37:21PM -0700, Ihar Hrachyshka wrote:
> > Hi all,
> >
> > KubeVirt uses libvirtd to manage qemu VMs represented as Kubernetes
> > API resources. In this case, libvirtd is running inside an
> > unprivileged pod, with some host mounts / capabilities added to the
> > pod, needed by libvirtd and other services.
> >
> > One of the capabilities libvirtd requires for successful startup
> > inside a pod is SYS_RESOURCE. This capability is used to adjust
> > RLIMIT_MEMLOCK ulimit value depending on devices attached to the
> > managed guest, both on startup and during hotplug. AFAIU the need to
> > lock the memory is to avoid pages being pushed out from RAM into
> swap.
> >>
> >>
> >> I recall successfully testing GPU assignment from an unprivileged
> >> libvirtd several years ago by setting a high enough ulimit for the uid
> >> used to run libvirtd in advance (. I think we check if the current
> >> setting is high enough, and don't try to set it unless we think we need
> to.
> >>
> >
> > The PR I linked to in the original email does just that: it starts
> > libvirtd; then, if domain is going to use VFIO, sets ulimit of
> > libvirtd process to VM memory size + 1Gb (mimicking libvirt code) +
> > 256Mb (to stay conservative) using prlimit() syscall; then defines the
> > domain.
>
> So you're making an educated guess, which is essentially what libvirt is
> doing (based on advice from other people with better information than
> us, but still a guess).
>
> >
> >> If I understand you correctly, you're saying that in your case it's okay
> >> for the memlock limit to be lower than we try to set it to, because swap
> >> is disabled anyway, is that correct?
> >>
> >
> > I'm honestly not exactly sure about the reason why we need to set the
> > limit, but I assume it's because of swap. I can be totally confused on
> > that part though.
>
>
> What I understand from an IRC conversation with Alex just now is that
> increasing RLIMIT_MEMLOCK isn't done just to prevent any of the pages
> being swapped out. It's done because "all GPAs (Guest Physical
> Addresses) that could potentially be DMA targets need to have fixed
> mappings through the iommu, therefore all need to be allocated and
> mappings fixed [...] setting rlimit allows us to perform all the
> necessary pins within the user's locked memory limit".
>
> So even if swap is disabled, it still needs to be done (either by
> libvirt, or by someone else who has the necessary privileges and control
> over the libvirtd process).
>
>
> 
>  Libvirt shouldn't set RLIMIT_MEMLOCK by default, unless there's
>  something in the XML that requires it - one of
> >>>
> >>> You are right, sorry. We add SYS_RESOURCE only for particular domains.
> >>>
> 
> - hard limit memory value is present
> - host PCI device passthrough is requested
> >>>
> >>> We are using passthrough
> >>
> >> (If you want to make Alex happy, use the term "VFIO device assignment"
> >> rather than passthrough :-).)
> >>
> >
> > Not sure who Alex is but I'll try to make everyone happy! :)
>
> The Alex I'm referring to is the Alex I just Cc'ed. He is the VFIO
> maintainer.
>
>
> >>> to pass SR-IOV NIC VFs into guests. We also
> >>> plan to do the same for GPUs in the near future.
> >>
> >>   >>> I believe we would benefit from one of the following features on
> >>   >>> libvirt side (or both):
> >>   >>>
> >>   >>> a) expose the memory lock value calculated by libvirtd through
> >>   >>> libvirt ABI so that we can use it when calling prlimit() on
> libvirtd
> >>   >>> process;
> >>   >>> b) allow to disable setrlimit() calls via libvirtd config file
> knob
> >>   >>> or domain definition.
> >>
> >> (b) sounds much more reasonable, as long as qemu doesn't complain (I
> >> don't know whether or not it checks)
> >>
> >> Slightly related to this - I'm currently working on patches to avoid
> >> making any ioctl calls that would fail in an unprivileged libvirtd when
> >> using tap/macvtap devices.


This is music to my ears, great to hear.

ATM, I'm doing this by adding an attribute
> >> "unmanaged='yes'" to the interface  element. The idea is that if
> >> someone sets unmanaged='yes', they're stating that the caller (i.e.
> >> kubevirt) is responsible for all device setup, and that libvirt should
> >> just use it without further setup. A similar approach could be applied
> >> to hostdev devices - if unmanaged is set, we assume that the caller has
> >> done everything to make the associated device usable.
> >>
> >> (Of course this all makes me realize the