Re: [openstack-dev] realtime kvm cpu affinities

2017-07-06 Thread Henning Schild
Stephen,

thanks for summing it all up! I am guessing that a blueprint or updates
to an existing blueprint will be next. We currently have a patch that
introduces a second pin_set to nova.conf and solves problems 1 and 2 in
Ocata. But that might be overlooking a couple of cases we do not care
about/did not come across yet.
Together with your text, that could serve as a discussion basis for what
will eventually be implemented.

I am happy because the two problems were acknowledged, the placement
strategy of the threads was discussed/reviewed with some input from KVM,
and we already talked about possible solutions.
So things are moving ;)

regards,
Henning

Am Thu, 29 Jun 2017 17:59:41 +0100
schrieb :

> On Tue, 2017-06-20 at 09:48 +0200, Henning Schild wrote:
> > Hi,
> > 
> > We are using OpenStack for managing realtime guests. We modified
> > it and contributed to discussions on how to model the realtime
> > feature. More recent versions of OpenStack have support for
> > realtime, and there are a few proposals on how to improve that
> > further.
> > 
> > ...  
> 
> I'd put off working my way through this thread until I'd time to sit
> down and read it in full. Here's what I'm seeing by way of summaries
> _so far_.
> 
> # Current situation
> 
> I think this tree (sans 'hw' prefixes for brevity) represents the
> current situation around flavor extra specs and image meta. Pretty
> much everything hangs off cpu_policy=dedicated. Correct me if I'm
> wrong.
> 
>   cpu_policy
>   ╞═> shared
>   ╘═> dedicated
>       ├─> cpu_thread_policy
>       │   ╞═> prefer
>       │   ╞═> isolate
>       │   ╘═> require
>       ├─> emulator_threads_policy (*)
>       │   ╞═> share
>       │   ╘═> isolate
>       └─> cpu_realtime
>           ╞═> no
>           ╘═> yes
>               └─> cpu_realtime_mask
>                   ╘═> (a mask of guest cores)
> 
> (*) this one isn't configurable via images. I never really got why
> but meh.
> 
> There are also some host-level configuration options
> 
>   vcpu_pin_set
>   ╘═> (a list of host cores that nova can use)
> 
> Finally, there's some configuration you can do with your choice of
> kernel and kernel options (e.g. 'isolcpus').
> 
> For real time workloads, the expectation would be that you would set:
> 
>   cpu_policy
>   ╘═> dedicated
>       ├─> cpu_thread_policy
>       │   ╘═> isolate
>       ├─> emulator_threads_policy
>       │   ╘═> isolate
>       └─> cpu_realtime
>           ╘═> yes
>               └─> cpu_realtime_mask
>                   ╘═> (a mask of guest cores)
> 
> That would result in an instance consuming N+1 pCPUs, where N
> corresponds to the number of guest cores. Of the N guest cores, the set
> masked by 'cpu_realtime_mask' will be non-realtime. The remainder
> will be realtime.
> 
> # The Problem(s)
> 
> I'm going to thread this to capture the arguments and counter
> arguments:
> 
> ## Problem 1
> 
> henning.schild suggested that the current implementation of
> 'emulator_thread_policy' is too resource intensive, as the one extra
> core dedicated per guest generally carries only a minimal emulator
> workload. This can significantly limit the number of guests that can
> be booted per host, particularly for guests with smaller numbers of
> cores. Instead, he has implemented an 'emulator_pin_set' host-level
> option, which complements 'vcpu_pin_set'. This allows us to "pool"
> emulator threads, similar to how vCPU threads behave with
> 'cpu_policy=shared'. He suggests this be adopted by nova.
> 
>   sahid seconded this, but suggested 'emulator_pin_set' be renamed
>   'cpu_emulator_threads_mask' and work as a mask of 'vcpu_pin_set'. He
>   also suggested adding a similarly-named flavor property that would
>   allow the user to use one of their cores for non-realtime work.
> 
> henning.schild suggested a set would still be better, but that
> 'vcpu_pin_set' be renamed to 'pin_set', as it would no longer be
> for only vCPUs
> 
>   cfriesen seconded henning.schild's position but was not entirely
>   convinced that sharing emulator threads on a single pCPU is
> guaranteed to be safe, for example if one instance starts seriously
> hammering on I/O or does live migration or something. He suggested
> that an additional option, 'rt_emulator_overcommit_ratio' be added to
> make overcommitting explicit. In addition, he suggested making the
> flavor property a bitmask
> 
> sahid questioned the need for an overcommit ratio, given that
> there is no overcommit of the hosts. An operator could synthesize a
> suitable value for 'emulator_pin_set'/'cpu_emulator_threads_mask'. He
> also disagreed with the suggestion that the flavor property be a
> bitmask as the only set is that of the vCPUs.
> 
>   cfriesen clarified, pointing out that a few instances with many vCPUs
>   will have greater overhead requirements than many instances with few
>   vCPUs. We need to be able to fail scheduling if the emulator thread
>   cores are oversubscribed.
> 
> ## Problem 2
> 
> henning.schild suggests that hosts should be able to handle both RT and
> non-RT instances.

Re: [openstack-dev] realtime kvm cpu affinities

2017-06-30 Thread Chris Friesen

On 06/30/2017 07:06 AM, sfinu...@redhat.com wrote:

On Thu, 2017-06-29 at 12:20 -0600, Chris Friesen wrote:

On 06/29/2017 10:59 AM, sfinu...@redhat.com wrote:



  From the above, there are 3-4 work items:

- Add an 'emulator_pin_set' or 'cpu_emulator_threads_mask' configuration
  option

  - If using a mask, rename 'vcpu_pin_set' to 'pin_set' (or, better,
    'usable_cpus')

- Add an 'emulator_overcommit_ratio', which will do for emulator threads
  what the other ratios do for vCPUs and memory


If we were going to support "emulator_overcommit_ratio", then we wouldn't
necessarily need an explicit mask/set as a config option. If someone wants
to run with 'hw:emulator_thread_policy=isolate' and we're below the
overcommit ratio then we run it, otherwise nova could try to allocate a new
pCPU to add to the emulator_pin_set internally tracked by nova.  This would
allow for the number of pCPUs in emulator_pin_set to vary depending on the
number of instances with 'hw:emulator_thread_policy=isolate'on the compute
node, which should allow for optimal packing.


So we'd now mark pCPUs not only as used, but also as used for a specific
purpose? That would probably be more flexible than using a static pool of CPUs,
particularly if instances are heterogeneous. I'd imagine it would, however, be
much tougher to do right. I need to think on this.


I think you could do it with a new "emulator_cpus" field in NUMACell, and a new 
"emulator_pcpu" field in InstanceNUMACell.



As an aside, what would we do about billing? Currently we include CPUs used for
emulator threads as overhead. Would this change?


We currently have local changes to allow instances with "shared" and "dedicated" 
CPUs to coexist on the same compute node.  For CPU usage, "dedicated" CPUs count 
as "1", and "shared" CPUs count as 1/cpu_overcommit_ratio.  That way the total 
CPU usage can never exceed the number of available CPUs.


You could follow this model and bill for an extra 1/emulator_overcommit_ratio 
worth of a CPU for instances with 'hw:emulator_thread_policy=isolate'.
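
Roughly, that accounting could look like the following (the ratio values and the helper are only an illustration of the arithmetic, not actual nova behaviour):

  # Back-of-envelope version of the CPU accounting described above.
  def billed_cpus(dedicated_vcpus, shared_vcpus, isolate_emulator,
                  cpu_overcommit_ratio=16.0, emulator_overcommit_ratio=4.0):
      usage = dedicated_vcpus * 1.0                  # dedicated count as 1
      usage += shared_vcpus / cpu_overcommit_ratio   # shared count as 1/ratio
      if isolate_emulator:
          usage += 1.0 / emulator_overcommit_ratio   # share of an emulator pCPU
      return usage

  # e.g. a 4-vCPU dedicated RT guest with isolated emulator threads:
  print(billed_cpus(dedicated_vcpus=4, shared_vcpus=0, isolate_emulator=True))
  # -> 4.25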



- Deprecate 'hw:emulator_thread_policy'???


I'm not sure we need to deprecate it, it would instead signify whether the
emulator threads should be isolated from the vCPU threads.  If set to
"isolate" then they would run on the emulator_pin_set identified above
(potentially sharing them with emulator threads from other instances) rather
than each instance getting a whole pCPU for its emulator threads.


I'm confused, I thought we weren't going to need 'emulator_pin_set'?


I meant whatever field we use internally to track which pCPUs are currently
being used to run emulator threads as opposed to vCPU threads (i.e. the
"emulator_cpus" field in NUMACell suggested above).


In any case, it's probably less about deprecating the extra spec and instead
changing how things work under the hood. We'd actually still want something to
signify "I want my emulator overhead accounted for separately".


Agreed.

Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] realtime kvm cpu affinities

2017-06-30 Thread sfinucan
On Thu, 2017-06-29 at 12:20 -0600, Chris Friesen wrote:
> On 06/29/2017 10:59 AM, sfinu...@redhat.com wrote:
> 
> > Thus far, we've no clear conclusions on directions to go, so I've taken a
> > stab
> > below. Henning, Sahid, Chris: does the above/below make sense, and is there
> > anything we need to further clarify?
> 
> The above is close enough. :)

Excellent :tentsfingers:

> > # Problem 1
> > 
> >  From the above, there are 3-4 work items:
> > 
> > - Add an 'emulator_pin_set' or 'cpu_emulator_threads_mask' configuration
> >   option
> > 
> >   - If using a mask, rename 'vcpu_pin_set' to 'pin_set' (or, better,
> >     'usable_cpus')
> > 
> > - Add an 'emulator_overcommit_ratio', which will do for emulator threads
> >   what the other ratios do for vCPUs and memory
> 
> If we were going to support "emulator_overcommit_ratio", then we wouldn't 
> necessarily need an explicit mask/set as a config option. If someone wants
> to run with 'hw:emulator_thread_policy=isolate' and we're below the
> overcommit ratio then we run it, otherwise nova could try to allocate a new
> pCPU to add to the emulator_pin_set internally tracked by nova.  This would
> allow for the number of pCPUs in emulator_pin_set to vary depending on the
> number of instances with 'hw:emulator_thread_policy=isolate'on the compute
> node, which should allow for optimal packing.

So we'd now mark pCPUs not only as used, but also as used for a specific
purpose? That would probably be more flexible than using a static pool of CPUs,
particularly if instances are heterogeneous. I'd imagine it would, however, be
much tougher to do right. I need to think on this.

As an aside, what would we do about billing? Currently we include CPUs used for
emulator threads as overhead. Would this change?

> > - Deprecate 'hw:emulator_thread_policy'???
> 
> I'm not sure we need to deprecate it, it would instead signify whether the 
> emulator threads should be isolated from the vCPU threads.  If set to
> "isolate" then they would run on the emulator_pin_set identified above
> (potentially sharing them with emulator threads from other instances) rather
> than each instance getting a whole pCPU for its emulator threads.

I'm confused, I thought we weren't going to need 'emulator_pin_set'? In any
case, it's probably less about deprecating the extra spec and instead changing
how things work under the hood. We'd actually still want something to signify
"I want my emulator overhead accounted for separately".

> > # Problem 2
> > 
> > No clear conclusions yet?
> 
> I don't see any particular difficulty in supporting both RT and non-RT
> instances on the same host with one nova-compute process.  It might even be
> valid for a high-performance VM to make use 
> of 'hw:emulator_thread_policy=isolate' without enabling RT.  (Which is why
> I've been careful not to imply RT in the description above.)

Yeah, I might focus on the above problem for now, as I've no clear ideas or
suggestions on how to proceed here. Happy to work on specs if necessary,
though.

Stephen

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] realtime kvm cpu affinities

2017-06-29 Thread Chris Friesen

On 06/29/2017 10:59 AM, sfinu...@redhat.com wrote:


Thus far, we've no clear conclusions on directions to go, so I've taken a stab
below. Henning, Sahid, Chris: does the above/below make sense, and is there
anything we need to further clarify?


The above is close enough. :)


# Problem 1

 From the above, there are 3-4 work items:

- Add an 'emulator_pin_set' or 'cpu_emulator_threads_mask' configuration option

   - If using a mask, rename 'vcpu_pin_set' to 'pin_set' (or, better,
     'usable_cpus')

- Add an 'emulator_overcommit_ratio', which will do for emulator threads what
   the other ratios do for vCPUs and memory


If we were going to support "emulator_overcommit_ratio", then we wouldn't 
necessarily need an explicit mask/set as a config option. If someone wants to 
run with 'hw:emulator_thread_policy=isolate' and we're below the overcommit 
ratio then we run it, otherwise nova could try to allocate a new pCPU to add to 
the emulator_pin_set internally tracked by nova.  This would allow for the 
number of pCPUs in emulator_pin_set to vary depending on the number of instances 
with 'hw:emulator_thread_policy=isolate'on the compute node, which should allow 
for optimal packing.
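
A minimal sketch of that packing behaviour, using a hypothetical tracker object (the class, its names and the "instances per emulator pCPU" overcommit semantics are assumptions for illustration, not nova code):

  # Hypothetical sketch of a dynamically grown emulator-pCPU pool.
  class EmulatorPool:
      def __init__(self, free_pcpus, overcommit_ratio):
          self.free_pcpus = set(free_pcpus)   # pCPUs nova may still claim
          self.emulator_pcpus = set()         # pool used for emulator threads
          self.instances = 0                  # instances with policy=isolate
          self.overcommit_ratio = overcommit_ratio

      def claim(self):
          """Place one more 'isolate' instance, growing the pool if needed."""
          capacity = len(self.emulator_pcpus) * self.overcommit_ratio
          if self.instances >= capacity:
              if not self.free_pcpus:
                  return None                 # would fail scheduling
              # Grow the internally tracked emulator pool by one pCPU.
              self.emulator_pcpus.add(self.free_pcpus.pop())
          self.instances += 1
          return set(self.emulator_pcpus)     # emulator threads float here

  pool = EmulatorPool(free_pcpus=[2, 3], overcommit_ratio=4)
  for _ in range(6):
      print(pool.claim())   # the pool grows on the 1st and 5th claims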



- Deprecate 'hw:emulator_thread_policy'???


I'm not sure we need to deprecate it, it would instead signify whether the 
emulator threads should be isolated from the vCPU threads.  If set to "isolate" 
then they would run on the emulator_pin_set identified above (potentially 
sharing them with emulator threads from other instances) rather than each 
instance getting a whole pCPU for its emulator threads.



# Problem 2

No clear conclusions yet?


I don't see any particular difficulty in supporting both RT and non-RT instances 
on the same host with one nova-compute process.  It might even be valid for a 
high-performance VM to make use of 'hw:emulator_thread_policy=isolate' without 
enabling RT.  (Which is why I've been careful not to imply RT in the description 
above.)


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] realtime kvm cpu affinities

2017-06-29 Thread sfinucan
On Tue, 2017-06-20 at 09:48 +0200, Henning Schild wrote:
> Hi,
> 
> We are using OpenStack for managing realtime guests. We modified
> it and contributed to discussions on how to model the realtime
> feature. More recent versions of OpenStack have support for realtime,
> and there are a few proposals on how to improve that further.
> 
> ...

I'd put off working my way through this thread until I'd time to sit down and
read it in full. Here's what I'm seeing by way of summaries _so far_.

# Current situation

I think this tree (sans 'hw' prefixes for brevity) represents the current
situation around flavor extra specs and image meta. Pretty much everything
hangs off cpu_policy=dedicated. Correct me if I'm wrong.

  cpu_policy
  ╞═> shared
  ╘═> dedicated
      ├─> cpu_thread_policy
      │   ╞═> prefer
      │   ╞═> isolate
      │   ╘═> require
      ├─> emulator_threads_policy (*)
      │   ╞═> share
      │   ╘═> isolate
      └─> cpu_realtime
          ╞═> no
          ╘═> yes
              └─> cpu_realtime_mask
                  ╘═> (a mask of guest cores)

(*) this one isn't configurable via images. I never really got why but meh.

There are also some host-level configuration options

  vcpu_pin_set
  ╘═> (a list of host cores that nova can use)

Finally, there's some configuration you can do with your choice of kernel and
kernel options (e.g. 'isolcpus').

For real time workloads, the expectation would be that you would set:

  cpu_policy
  ╘═> dedicated
      ├─> cpu_thread_policy
      │   ╘═> isolate
      ├─> emulator_threads_policy
      │   ╘═> isolate
      └─> cpu_realtime
          ╘═> yes
              └─> cpu_realtime_mask
                  ╘═> (a mask of guest cores)

That would result in an instance consuming N+1 pCPUs, where N corresponds to
the number of guest cores. Of the N guest cores, the set masked by
'cpu_realtime_mask' will be non-realtime. The remainder will be realtime.
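
As a concrete illustration of the above (assuming 'cpu_realtime_mask' follows nova's usual "^N" exclusion notation, where the vCPUs listed after '^' are the non-realtime ones; treat this as a sketch, not a definitive statement of the implementation):

  # Sketch only: split a guest's vCPUs into realtime and non-realtime sets.
  def split_rt_vcpus(num_vcpus, cpu_realtime_mask):
      non_rt = {int(tok.lstrip("^")) for tok in cpu_realtime_mask.split(",")
                if tok.startswith("^")}
      rt = set(range(num_vcpus)) - non_rt
      return rt, non_rt

  extra_specs = {
      "hw:cpu_policy": "dedicated",
      "hw:cpu_thread_policy": "isolate",
      "hw:emulator_threads_policy": "isolate",
      "hw:cpu_realtime": "yes",
      "hw:cpu_realtime_mask": "^0",
  }

  rt, non_rt = split_rt_vcpus(4, extra_specs["hw:cpu_realtime_mask"])
  print(rt, non_rt)   # {1, 2, 3} {0} -- vCPU0 is the non-realtime core
  # With emulator_threads_policy=isolate this instance consumes
  # 4 (vCPUs) + 1 (emulator threads) = 5 pCPUs: the "N + 1" described above.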

# The Problem(s)

I'm going to thread this to capture the arguments and counter arguments:

## Problem 1

henning.schild suggested that the current implementation of
'emulator_thread_policy' is too resource intensive, as the one extra core
dedicated per guest generally carries only a minimal emulator workload. This
can significantly limit the number of guests that can be booted per host,
particularly for guests with smaller numbers of cores. Instead, he has
implemented an 'emulator_pin_set' host-level option, which complements
'vcpu_pin_set'. This allows us to "pool" emulator threads, similar to how vCPU
threads behave with 'cpu_policy=shared'. He suggests this be adopted by nova.

  sahid seconded this, but suggested 'emulator_pin_set' be renamed
  'cpu_emulator_threads_mask' and work as a mask of 'vcpu_pin_set'. He also
  suggested adding a similarly-named flavor property that would allow the
  user to use one of their cores for non-realtime work.

henning.schild suggested a set would still be better, but that
'vcpu_pin_set' be renamed to 'pin_set', as it would no longer be for only
vCPUs

  cfriesen seconded henning.schild's position but was not entirely
  convinced that sharing emulator threads on a single pCPU is guaranteed
  to be safe, for example if one instance starts seriously hammering on
  I/O or does live migration or something. He suggested that an additional
  option, 'rt_emulator_overcommit_ratio' be added to make overcommitting
  explicit. In addition, he suggested making the flavor property a bitmask

sahid questioned the need for an overcommit ratio, given that there is
no overcommit of the hosts. An operator could synthesize a suitable
value for 'emulator_pin_set'/'cpu_emulator_threads_mask'. He also
disagreed with the suggestion that the flavor property be a bitmask as
the only set is that of the vCPUs.

  cfriesen clarified, pointing out that a few instances with many vCPUs
  will have greater overhead requirements than many instances with few
  vCPUs. We need to be able to fail scheduling if the emulator thread
  cores are oversubscribed.
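
In its simplest form, the oversubscription check cfriesen asks for might look like this (the option names mirror the proposals above but are assumptions, not existing nova configuration):

  # Sketch: refuse to place another RT instance once the shared emulator
  # pCPUs would be oversubscribed.
  def can_place_rt_instance(running_rt_instances, emulator_pin_set,
                            rt_emulator_overcommit_ratio):
      capacity = len(emulator_pin_set) * rt_emulator_overcommit_ratio
      return running_rt_instances + 1 <= capacity

  # One emulator pCPU, at most 8 instances' worth of emulator threads on it:
  print(can_place_rt_instance(7, {2}, 8))   # True  -> the 8th instance fits
  print(can_place_rt_instance(8, {2}, 8))   # False -> fail scheduling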

## Problem 2

henning.schild suggests that hosts should be able to handle both RT and non-RT
instances. This could be achieved through multiple instances of nova

  sahid points out that the recommendation is to use host aggregates to
  separate the two.

henning.schild states that hosts with RT kernels can manage non-RT guests
just fine. However, if using host aggregates is the recommendation then it
should be possible to run multiple nova instances on a host, because
dedicating an entire machine is not viable for smaller operations. cfriesen
seconds this perspective, though not this solution.

# Solutions

Thus far, we've no clear conclusions on directions to go, so I've taken a stab
below. Henning, Sahid, Chris: does the above/below make sense, and is there
anything we need to further clarify?

# Problem 1

From the above, there are 3-4 work items:

- Add an 'emulator_pin_set' or 'cpu_emulator_threads_mask' configuration option

Re: [openstack-dev] realtime kvm cpu affinities

2017-06-29 Thread Henning Schild
Am Wed, 28 Jun 2017 11:34:42 +0200
schrieb Sahid Orentino Ferdjaoui :

> On Tue, Jun 27, 2017 at 04:00:35PM +0200, Henning Schild wrote:
> > Am Tue, 27 Jun 2017 09:44:22 +0200
> > schrieb Sahid Orentino Ferdjaoui :
> >   
> > > On Mon, Jun 26, 2017 at 10:19:12AM +0200, Henning Schild wrote:  
> > > > Am Sun, 25 Jun 2017 10:09:10 +0200
> > > > schrieb Sahid Orentino Ferdjaoui :
> > > > 
> > > > > On Fri, Jun 23, 2017 at 10:34:26AM -0600, Chris Friesen
> > > > > wrote:
> > > > > > On 06/23/2017 09:35 AM, Henning Schild wrote:  
> > > > > > > Am Fri, 23 Jun 2017 11:11:10 +0200
> > > > > > > schrieb Sahid Orentino Ferdjaoui
> > > > > > > :  
> > > > > >   
> > > > > > > > In Linux RT context, and as you mentioned, the non-RT
> > > > > > > > vCPU can acquire some guest kernel lock, then be
> > > > > > > > pre-empted by emulator thread while holding this lock.
> > > > > > > > This situation blocks RT vCPUs from doing its work. So
> > > > > > > > that is why we have implemented [2]. For DPDK I don't
> > > > > > > > think we have such problems because it's running in
> > > > > > > > userland.
> > > > > > > > 
> > > > > > > > So for DPDK context I think we could have a mask like we
> > > > > > > > have for RT and basically considering vCPU0 to handle
> > > > > > > > best effort works (emulator threads, SSH...). I think
> > > > > > > > it's the current pattern used by DPDK users.  
> > > > > > > 
> > > > > > > DPDK is just a library and one can imagine an application
> > > > > > > that has cross-core communication/synchronisation needs
> > > > > > > where the emulator slowing down vpu0 will also slow down
> > > > > > > vcpu1. You DPDK application would have to know which of
> > > > > > > its cores did not get a full pcpu.
> > > > > > > 
> > > > > > > I am not sure what the DPDK-example is doing in this
> > > > > > > discussion, would that not just be cpu_policy=dedicated? I
> > > > > > > guess normal behaviour of dedicated is that emulators and
> > > > > > > io happily share pCPUs with vCPUs and you are looking for
> > > > > > > a way to restrict emulators/io to a subset of pCPUs
> > > > > > > because you can live with some of them beeing not
> > > > > > > 100%.  
> > > > > > 
> > > > > > Yes.  A typical DPDK-using VM might look something like
> > > > > > this:
> > > > > > 
> > > > > > vCPU0: non-realtime, housekeeping and I/O, handles all
> > > > > > virtual interrupts and "normal" linux stuff, emulator runs
> > > > > > on same pCPU vCPU1: realtime, runs in tight loop in
> > > > > > userspace processing packets vCPU2: realtime, runs in tight
> > > > > > loop in userspace processing packets vCPU3: realtime, runs
> > > > > > in tight loop in userspace processing packets
> > > > > > 
> > > > > > In this context, vCPUs 1-3 don't really ever enter the
> > > > > > kernel, and we've offloaded as much kernel work as possible
> > > > > > from them onto vCPU0.  This works pretty well with the
> > > > > > current system. 
> > > > > > > > For RT we have to isolate the emulator threads to an
> > > > > > > > additional pCPU per guests or as your are suggesting to
> > > > > > > > a set of pCPUs for all the guests running.
> > > > > > > > 
> > > > > > > > I think we should introduce a new option:
> > > > > > > > 
> > > > > > > >- hw:cpu_emulator_threads_mask=^1
> > > > > > > > 
> > > > > > > > If on 'nova.conf' - that mask will be applied to the
> > > > > > > > set of all host CPUs (vcpu_pin_set) to basically pack
> > > > > > > > the emulator threads of all VMs running here (useful
> > > > > > > > for RT context).  
> > > > > > > 
> > > > > > > That would allow modelling exactly what we need.
> > > > > > > In nova.conf we are talking absolute known values, no need
> > > > > > > for a mask and a set is much easier to read. Also using
> > > > > > > the same name does not sound like a good idea.
> > > > > > > And the name vcpu_pin_set clearly suggest what kind of
> > > > > > > load runs here, if using a mask it should be called
> > > > > > > pin_set.  
> > > > > > 
> > > > > > I agree with Henning.
> > > > > > 
> > > > > > In nova.conf we should just use a set, something like
> > > > > > "rt_emulator_vcpu_pin_set" which would be used for running
> > > > > > the emulator/io threads of *only* realtime instances.  
> > > > > 
> > > > > I'm not agree with you, we have a set of pCPUs and we want to
> > > > > substract some of them for the emulator threads. We need a
> > > > > mask. The only set we need is to selection which pCPUs Nova
> > > > > can use (vcpus_pin_set).
> > > > 
> > > > At that point it does not really matter whether it is a set or a
> > > > mask. They can both express the same where a set is easier to
> > > > read/configure. With the same argument you could say that
> > > > vcpu_pin_set should be a mask over the hosts pcpus.
> > > > 
> > > > As i said before: vcpu_pin_set should be renamed because all
> > > > sorts of threads are put here (pcpu_pin_set?). But that would
> > > > be a bigger change and should be discussed as a separate issue.

Re: [openstack-dev] realtime kvm cpu affinities

2017-06-28 Thread Chris Friesen

On 06/28/2017 03:34 AM, Sahid Orentino Ferdjaoui wrote:

On Tue, Jun 27, 2017 at 04:00:35PM +0200, Henning Schild wrote:



As far as I remember it was not straightforward to get two novas onto
one host in the older release, so I am not surprised that it caused trouble
with the update to Mitaka. If we agree on two novas and aggregates as the
recommended way, we should make sure that running two novas is a supported
feature, covered in test cases and documented.
Dedicating a whole machine to either RT or non-RT would IMHO not be a
viable option.


The realtime nodes should be isolated by aggregates as you seem to do.


Yes, with two novas on one machine. They share one libvirt using
different instance prefixes and have some other config options set, so
they do not collide on resources.


It's clearly not what I was suggesting; you should have two groups of
compute hosts: one aggregate with hosts for the non-RT VMs and
another one for hosts with RT VMs.


Not all clouds are large enough to have an entire physical machine dedicated to 
RT VMs.  So Henning divided up the resources of the physical machine between two 
nova-compute instances and put them in separate aggregates.
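
For illustration, the split Henning describes amounts to giving the two nova-compute services disjoint resources, along the lines of the following (the values, and to some extent the choice of options, are assumptions about such a setup rather than a tested recipe):

  # Two nova-compute services on one host, sketched as the nova.conf values
  # that have to differ; the values are invented for illustration.
  nova_rt = {
      "host": "compute0-rt",                         # distinct service name
      "vcpu_pin_set": "8-31",                        # pCPUs for RT guests
      "instance_name_template": "rt-instance-%08x",  # distinct libvirt prefix
      "state_path": "/var/lib/nova-rt",
  }
  nova_best_effort = {
      "host": "compute0-be",
      "vcpu_pin_set": "2-7",                         # pCPUs for non-RT guests
      "instance_name_template": "be-instance-%08x",
      "state_path": "/var/lib/nova-be",
  }
  # Both services talk to the same libvirt; the disjoint pin sets and the
  # differing instance name prefixes keep them from colliding on resources.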


It would be easier for operators if one single nova instance could manage both 
RT and non-RT instances on the same host (presumably running an RT OS).


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] realtime kvm cpu affinities

2017-06-28 Thread Sahid Orentino Ferdjaoui
On Tue, Jun 27, 2017 at 04:00:35PM +0200, Henning Schild wrote:
> Am Tue, 27 Jun 2017 09:44:22 +0200
> schrieb Sahid Orentino Ferdjaoui :
> 
> > On Mon, Jun 26, 2017 at 10:19:12AM +0200, Henning Schild wrote:
> > > Am Sun, 25 Jun 2017 10:09:10 +0200
> > > schrieb Sahid Orentino Ferdjaoui :
> > >   
> > > > On Fri, Jun 23, 2017 at 10:34:26AM -0600, Chris Friesen wrote:  
> > > > > On 06/23/2017 09:35 AM, Henning Schild wrote:
> > > > > > Am Fri, 23 Jun 2017 11:11:10 +0200
> > > > > > schrieb Sahid Orentino Ferdjaoui :
> > > > > 
> > > > > > > In Linux RT context, and as you mentioned, the non-RT vCPU
> > > > > > > can acquire some guest kernel lock, then be pre-empted by
> > > > > > > emulator thread while holding this lock. This situation
> > > > > > > blocks RT vCPUs from doing its work. So that is why we have
> > > > > > > implemented [2]. For DPDK I don't think we have such
> > > > > > > problems because it's running in userland.
> > > > > > > 
> > > > > > > So for DPDK context I think we could have a mask like we
> > > > > > > have for RT and basically considering vCPU0 to handle best
> > > > > > > effort works (emulator threads, SSH...). I think it's the
> > > > > > > current pattern used by DPDK users.
> > > > > > 
> > > > > > DPDK is just a library and one can imagine an application
> > > > > > that has cross-core communication/synchronisation needs where
> > > > > > the emulator slowing down vpu0 will also slow down vcpu1. You
> > > > > > DPDK application would have to know which of its cores did
> > > > > > not get a full pcpu.
> > > > > > 
> > > > > > I am not sure what the DPDK-example is doing in this
> > > > > > discussion, would that not just be cpu_policy=dedicated? I
> > > > > > guess normal behaviour of dedicated is that emulators and io
> > > > > > happily share pCPUs with vCPUs and you are looking for a way
> > > > > > to restrict emulators/io to a subset of pCPUs because you can
> > > > > > live with some of them beeing not 100%.
> > > > > 
> > > > > Yes.  A typical DPDK-using VM might look something like this:
> > > > > 
> > > > > vCPU0: non-realtime, housekeeping and I/O, handles all virtual
> > > > > interrupts and "normal" linux stuff, emulator runs on same pCPU
> > > > > vCPU1: realtime, runs in tight loop in userspace processing
> > > > > packets vCPU2: realtime, runs in tight loop in userspace
> > > > > processing packets vCPU3: realtime, runs in tight loop in
> > > > > userspace processing packets
> > > > > 
> > > > > In this context, vCPUs 1-3 don't really ever enter the kernel,
> > > > > and we've offloaded as much kernel work as possible from them
> > > > > onto vCPU0.  This works pretty well with the current system.
> > > > > 
> > > > > > > For RT we have to isolate the emulator threads to an
> > > > > > > additional pCPU per guests or as your are suggesting to a
> > > > > > > set of pCPUs for all the guests running.
> > > > > > > 
> > > > > > > I think we should introduce a new option:
> > > > > > > 
> > > > > > >- hw:cpu_emulator_threads_mask=^1
> > > > > > > 
> > > > > > > If on 'nova.conf' - that mask will be applied to the set of
> > > > > > > all host CPUs (vcpu_pin_set) to basically pack the emulator
> > > > > > > threads of all VMs running here (useful for RT context).
> > > > > > 
> > > > > > That would allow modelling exactly what we need.
> > > > > > In nova.conf we are talking absolute known values, no need
> > > > > > for a mask and a set is much easier to read. Also using the
> > > > > > same name does not sound like a good idea.
> > > > > > And the name vcpu_pin_set clearly suggest what kind of load
> > > > > > runs here, if using a mask it should be called pin_set.
> > > > > 
> > > > > I agree with Henning.
> > > > > 
> > > > > In nova.conf we should just use a set, something like
> > > > > "rt_emulator_vcpu_pin_set" which would be used for running the
> > > > > emulator/io threads of *only* realtime instances.
> > > > 
> > > > I'm not agree with you, we have a set of pCPUs and we want to
> > > > substract some of them for the emulator threads. We need a mask.
> > > > The only set we need is to selection which pCPUs Nova can use
> > > > (vcpus_pin_set).  
> > > 
> > > At that point it does not really matter whether it is a set or a
> > > mask. They can both express the same where a set is easier to
> > > read/configure. With the same argument you could say that
> > > vcpu_pin_set should be a mask over the hosts pcpus.
> > > 
> > > As i said before: vcpu_pin_set should be renamed because all sorts
> > > of threads are put here (pcpu_pin_set?). But that would be a bigger
> > > change and should be discussed as a seperate issue.
> > > 
> > > So far we talked about a compute-node for realtime only doing
> > > realtime. In that case vcpu_pin_set + emulator_io_mask would work.
> > > If you want to run regular VMs on the same host, you can run a
> > > second nova, like we do.
> > > 
> > > We could also use vcpu_pin_set + rt_vcpu_pin_set(/mask).

Re: [openstack-dev] realtime kvm cpu affinities

2017-06-27 Thread Chris Friesen

On 06/27/2017 09:36 AM, Henning Schild wrote:

Am Tue, 27 Jun 2017 09:28:34 -0600
schrieb Chris Friesen :



Once you use "isolcpus" on the host, the host scheduler won't "float"
threads between the CPUs based on load.  To get the float behaviour
you'd have to not isolate the pCPUs that will be used for emulator
threads, but then you run the risk of the host running other work on
those pCPUs (unless you use cpusets or something to isolate the host
work to a subset of non-isolcpus pCPUs).


With openstack you use libvirt and libvirt uses cgroups/cpusets to get
those threads onto these cores.


Right.  I misremembered.  We are currently using "isolcpus" on the compute node 
to isolate the pCPUs used for packet processing, but the pCPUs used for guests 
are not isolated.
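
A quick way to see which pCPUs the host kernel isolates, e.g. to sanity-check that the guest and emulator pin sets line up with the 'isolcpus' boot argument (only a sketch; it parses simple "isolcpus=1,2,5-7" style arguments and skips any non-numeric flags):

  def isolcpus_from_cmdline(path="/proc/cmdline"):
      """Return the set of CPUs listed in the kernel's isolcpus= argument."""
      with open(path) as f:
          args = f.read().split()
      cpus = set()
      for arg in args:
          if not arg.startswith("isolcpus="):
              continue
          for part in arg.split("=", 1)[1].split(","):
              if "-" in part:
                  lo, hi = part.split("-")
                  cpus.update(range(int(lo), int(hi) + 1))
              elif part.isdigit():
                  cpus.add(int(part))
      return cpus

  print(isolcpus_from_cmdline())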


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] realtime kvm cpu affinities

2017-06-27 Thread Henning Schild
Am Tue, 27 Jun 2017 09:25:14 -0600
schrieb Chris Friesen :

> On 06/27/2017 01:44 AM, Sahid Orentino Ferdjaoui wrote:
> > On Mon, Jun 26, 2017 at 10:19:12AM +0200, Henning Schild wrote:  
> >> Am Sun, 25 Jun 2017 10:09:10 +0200
> >> schrieb Sahid Orentino Ferdjaoui :
> >>  
> >>> On Fri, Jun 23, 2017 at 10:34:26AM -0600, Chris Friesen wrote:  
>  On 06/23/2017 09:35 AM, Henning Schild wrote:  
> > Am Fri, 23 Jun 2017 11:11:10 +0200
> > schrieb Sahid Orentino Ferdjaoui :  
>   
> >> In Linux RT context, and as you mentioned, the non-RT vCPU can
> >> acquire some guest kernel lock, then be pre-empted by emulator
> >> thread while holding this lock. This situation blocks RT vCPUs
> >> from doing its work. So that is why we have implemented [2].
> >> For DPDK I don't think we have such problems because it's
> >> running in userland.
> >>
> >> So for DPDK context I think we could have a mask like we have
> >> for RT and basically considering vCPU0 to handle best effort
> >> works (emulator threads, SSH...). I think it's the current
> >> pattern used by DPDK users.  
> >
> > DPDK is just a library and one can imagine an application that
> > has cross-core communication/synchronisation needs where the
> > emulator slowing down vpu0 will also slow down vcpu1. You DPDK
> > application would have to know which of its cores did not get a
> > full pcpu.
> >
> > I am not sure what the DPDK-example is doing in this discussion,
> > would that not just be cpu_policy=dedicated? I guess normal
> > behaviour of dedicated is that emulators and io happily share
> > pCPUs with vCPUs and you are looking for a way to restrict
> > emulators/io to a subset of pCPUs because you can live with some
> > of them beeing not 100%.  
> 
>  Yes.  A typical DPDK-using VM might look something like this:
> 
>  vCPU0: non-realtime, housekeeping and I/O, handles all virtual
>  interrupts and "normal" linux stuff, emulator runs on same pCPU
>  vCPU1: realtime, runs in tight loop in userspace processing
>  packets vCPU2: realtime, runs in tight loop in userspace
>  processing packets vCPU3: realtime, runs in tight loop in
>  userspace processing packets
> 
>  In this context, vCPUs 1-3 don't really ever enter the kernel,
>  and we've offloaded as much kernel work as possible from them
>  onto vCPU0.  This works pretty well with the current system.
>   
> >> For RT we have to isolate the emulator threads to an additional
> >> pCPU per guests or as your are suggesting to a set of pCPUs for
> >> all the guests running.
> >>
> >> I think we should introduce a new option:
> >>
> >> - hw:cpu_emulator_threads_mask=^1
> >>
> >> If on 'nova.conf' - that mask will be applied to the set of all
> >> host CPUs (vcpu_pin_set) to basically pack the emulator threads
> >> of all VMs running here (useful for RT context).  
> >
> > That would allow modelling exactly what we need.
> > In nova.conf we are talking absolute known values, no need for a
> > mask and a set is much easier to read. Also using the same name
> > does not sound like a good idea.
> > And the name vcpu_pin_set clearly suggest what kind of load runs
> > here, if using a mask it should be called pin_set.  
> 
>  I agree with Henning.
> 
>  In nova.conf we should just use a set, something like
>  "rt_emulator_vcpu_pin_set" which would be used for running the
>  emulator/io threads of *only* realtime instances.  
> >>>
> >>> I'm not agree with you, we have a set of pCPUs and we want to
> >>> substract some of them for the emulator threads. We need a mask.
> >>> The only set we need is to selection which pCPUs Nova can use
> >>> (vcpus_pin_set).  
> >>
> >> At that point it does not really matter whether it is a set or a
> >> mask. They can both express the same where a set is easier to
> >> read/configure. With the same argument you could say that
> >> vcpu_pin_set should be a mask over the hosts pcpus.
> >>
> >> As i said before: vcpu_pin_set should be renamed because all sorts
> >> of threads are put here (pcpu_pin_set?). But that would be a
> >> bigger change and should be discussed as a seperate issue.
> >>
> >> So far we talked about a compute-node for realtime only doing
> >> realtime. In that case vcpu_pin_set + emulator_io_mask would work.
> >> If you want to run regular VMs on the same host, you can run a
> >> second nova, like we do.
> >>
> >> We could also use vcpu_pin_set + rt_vcpu_pin_set(/mask). I think
> >> that would allow modelling all cases in just one nova. Having all
> >> in one nova, you could potentially repurpose rt cpus to
> >> best-effort and back. Some day in the future ...  
> >
> > That is not something we should allow or at least
> > advertise. compute-node can't run both RT and non-RT guests and that
> > because the nodes should have a kernel RT.

Re: [openstack-dev] realtime kvm cpu affinities

2017-06-27 Thread Henning Schild
Am Tue, 27 Jun 2017 09:28:34 -0600
schrieb Chris Friesen :

> On 06/27/2017 01:45 AM, Sahid Orentino Ferdjaoui wrote:
> > On Mon, Jun 26, 2017 at 12:12:49PM -0600, Chris Friesen wrote:  
> >> On 06/25/2017 02:09 AM, Sahid Orentino Ferdjaoui wrote:  
> >>> On Fri, Jun 23, 2017 at 10:34:26AM -0600, Chris Friesen wrote:  
>  On 06/23/2017 09:35 AM, Henning Schild wrote:  
> > Am Fri, 23 Jun 2017 11:11:10 +0200
> > schrieb Sahid Orentino Ferdjaoui :  
>   
> >> In Linux RT context, and as you mentioned, the non-RT vCPU can
> >> acquire some guest kernel lock, then be pre-empted by emulator
> >> thread while holding this lock. This situation blocks RT vCPUs
> >> from doing its work. So that is why we have implemented [2].
> >> For DPDK I don't think we have such problems because it's
> >> running in userland.
> >>
> >> So for DPDK context I think we could have a mask like we have
> >> for RT and basically considering vCPU0 to handle best effort
> >> works (emulator threads, SSH...). I think it's the current
> >> pattern used by DPDK users.  
> >
> > DPDK is just a library and one can imagine an application that
> > has cross-core communication/synchronisation needs where the
> > emulator slowing down vpu0 will also slow down vcpu1. You DPDK
> > application would have to know which of its cores did not get a
> > full pcpu.
> >
> > I am not sure what the DPDK-example is doing in this
> > discussion, would that not just be cpu_policy=dedicated? I
> > guess normal behaviour of dedicated is that emulators and io
> > happily share pCPUs with vCPUs and you are looking for a way to
> > restrict emulators/io to a subset of pCPUs because you can live
> > with some of them beeing not 100%.  
> 
>  Yes.  A typical DPDK-using VM might look something like this:
> 
>  vCPU0: non-realtime, housekeeping and I/O, handles all virtual
>  interrupts and "normal" linux stuff, emulator runs on same pCPU
>  vCPU1: realtime, runs in tight loop in userspace processing
>  packets vCPU2: realtime, runs in tight loop in userspace
>  processing packets vCPU3: realtime, runs in tight loop in
>  userspace processing packets
> 
>  In this context, vCPUs 1-3 don't really ever enter the kernel,
>  and we've offloaded as much kernel work as possible from them
>  onto vCPU0.  This works pretty well with the current system.
>   
> >> For RT we have to isolate the emulator threads to an
> >> additional pCPU per guests or as your are suggesting to a set
> >> of pCPUs for all the guests running.
> >>
> >> I think we should introduce a new option:
> >>
> >>  - hw:cpu_emulator_threads_mask=^1
> >>
> >> If on 'nova.conf' - that mask will be applied to the set of
> >> all host CPUs (vcpu_pin_set) to basically pack the emulator
> >> threads of all VMs running here (useful for RT context).  
> >
> > That would allow modelling exactly what we need.
> > In nova.conf we are talking absolute known values, no need for
> > a mask and a set is much easier to read. Also using the same
> > name does not sound like a good idea.
> > And the name vcpu_pin_set clearly suggest what kind of load
> > runs here, if using a mask it should be called pin_set.  
> 
>  I agree with Henning.
> 
>  In nova.conf we should just use a set, something like
>  "rt_emulator_vcpu_pin_set" which would be used for running the
>  emulator/io threads of *only* realtime instances.  
> >>>
> >>> I'm not agree with you, we have a set of pCPUs and we want to
> >>> substract some of them for the emulator threads. We need a mask.
> >>> The only set we need is to selection which pCPUs Nova can use
> >>> (vcpus_pin_set).
> >>>  
>  We may also want to have "rt_emulator_overcommit_ratio" to
>  control how many threads/instances we allow per pCPU.  
> >>>
> >>> Not really sure to have understand this point? If it is to
> >>> indicate that for a pCPU isolated we want X guest emulator
> >>> threads, the same behavior is achieved by the mask. A host for
> >>> realtime is dedicated for realtime, no overcommitment and the
> >>> operators know the number of host CPUs, they can easily deduct a
> >>> ratio and so the corresponding mask.  
> >>
> >> Suppose I have a host with 64 CPUs.  I reserve three for host
> >> overhead and networking, leaving 61 for instances.  If I have
> >> instances with one non-RT vCPU and one RT vCPU then I can run 30
> >> instances.  If instead my instances have one non-RT and 5 RT vCPUs
> >> then I can run 12 instances.  If I put all of my emulator threads
> >> on the same pCPU, it might make a difference whether I put 30 sets
> >> of emulator threads or 12 sets.  
> >
> > Oh I understand your point now, but not sure that is going to make
> > any difference. I would say the load in the isolated cores is
> > probably going to be the same.

Re: [openstack-dev] realtime kvm cpu affinities

2017-06-27 Thread Chris Friesen

On 06/27/2017 01:45 AM, Sahid Orentino Ferdjaoui wrote:

On Mon, Jun 26, 2017 at 12:12:49PM -0600, Chris Friesen wrote:

On 06/25/2017 02:09 AM, Sahid Orentino Ferdjaoui wrote:

On Fri, Jun 23, 2017 at 10:34:26AM -0600, Chris Friesen wrote:

On 06/23/2017 09:35 AM, Henning Schild wrote:

Am Fri, 23 Jun 2017 11:11:10 +0200
schrieb Sahid Orentino Ferdjaoui :



In the Linux RT context, and as you mentioned, the non-RT vCPU can acquire
some guest kernel lock, then be pre-empted by the emulator thread while
holding this lock. This situation blocks the RT vCPUs from doing their
work. So that is why we have implemented [2]. For DPDK I don't think
we have such problems because it's running in userland.

So for DPDK context I think we could have a mask like we have for RT
and basically considering vCPU0 to handle best effort works (emulator
threads, SSH...). I think it's the current pattern used by DPDK users.


DPDK is just a library and one can imagine an application that has
cross-core communication/synchronisation needs where the emulator
slowing down vcpu0 will also slow down vcpu1. Your DPDK application would
have to know which of its cores did not get a full pCPU.

I am not sure what the DPDK example is doing in this discussion, would
that not just be cpu_policy=dedicated? I guess normal behaviour of
dedicated is that emulators and io happily share pCPUs with vCPUs and
you are looking for a way to restrict emulators/io to a subset of pCPUs
because you can live with some of them being not 100%.


Yes.  A typical DPDK-using VM might look something like this:

vCPU0: non-realtime, housekeeping and I/O, handles all virtual interrupts
and "normal" linux stuff, emulator runs on same pCPU
vCPU1: realtime, runs in tight loop in userspace processing packets
vCPU2: realtime, runs in tight loop in userspace processing packets
vCPU3: realtime, runs in tight loop in userspace processing packets

In this context, vCPUs 1-3 don't really ever enter the kernel, and we've
offloaded as much kernel work as possible from them onto vCPU0.  This works
pretty well with the current system.


For RT we have to isolate the emulator threads to an additional pCPU
per guest or, as you are suggesting, to a set of pCPUs for all the
guests running.

I think we should introduce a new option:

 - hw:cpu_emulator_threads_mask=^1

If on 'nova.conf' - that mask will be applied to the set of all host
CPUs (vcpu_pin_set) to basically pack the emulator threads of all VMs
running here (useful for RT context).


That would allow modelling exactly what we need.
In nova.conf we are talking absolute known values, no need for a mask
and a set is much easier to read. Also using the same name does not
sound like a good idea.
And the name vcpu_pin_set clearly suggest what kind of load runs here,
if using a mask it should be called pin_set.


I agree with Henning.

In nova.conf we should just use a set, something like
"rt_emulator_vcpu_pin_set" which would be used for running the emulator/io
threads of *only* realtime instances.


I don't agree with you: we have a set of pCPUs and we want to
subtract some of them for the emulator threads. We need a mask. The
only set we need is the one that selects which pCPUs Nova can use
(vcpu_pin_set).


We may also want to have "rt_emulator_overcommit_ratio" to control how many
threads/instances we allow per pCPU.


Not really sure I understand this point. If it is to indicate
that for an isolated pCPU we want X guest emulator threads, the same
behavior is achieved by the mask. A host for realtime is dedicated to
realtime, with no overcommitment, and the operators know the number of host
CPUs, so they can easily deduce a ratio and thus the corresponding mask.


Suppose I have a host with 64 CPUs.  I reserve three for host overhead and
networking, leaving 61 for instances.  If I have instances with one non-RT
vCPU and one RT vCPU then I can run 30 instances.  If instead my instances
have one non-RT and 5 RT vCPUs then I can run 12 instances.  If I put all of
my emulator threads on the same pCPU, it might make a difference whether I
put 30 sets of emulator threads or 12 sets.


Oh, I understand your point now, but I am not sure that is going to make any
difference. I would say the load on the isolated cores is probably
going to be the same. Even then, the overhead will be the number of
threads handled, which will be slightly higher in your first scenario.


The proposed "rt_emulator_overcommit_ratio" would simply say "nova is
allowed to run X instances worth of emulator threads on each pCPU in
"rt_emulator_vcpu_pin_set".  If we've hit that threshold, then no more RT
instances are allowed to schedule on this compute node (but non-RT instances
would still be allowed).


Also, I don't think we want to schedule where the emulator threads of
the guests should be pinned within the isolated cores. We will let them
float on the set of isolated cores. If there is a requirement to have
them pinned, then the current implementation will probably be enough.


Once you use "isolcpus" on the host, the host scheduler won't "float"
threads between the CPUs based on load.

Re: [openstack-dev] realtime kvm cpu affinities

2017-06-27 Thread Chris Friesen

On 06/27/2017 01:44 AM, Sahid Orentino Ferdjaoui wrote:

On Mon, Jun 26, 2017 at 10:19:12AM +0200, Henning Schild wrote:

Am Sun, 25 Jun 2017 10:09:10 +0200
schrieb Sahid Orentino Ferdjaoui :


On Fri, Jun 23, 2017 at 10:34:26AM -0600, Chris Friesen wrote:

On 06/23/2017 09:35 AM, Henning Schild wrote:

Am Fri, 23 Jun 2017 11:11:10 +0200
schrieb Sahid Orentino Ferdjaoui :



In the Linux RT context, and as you mentioned, the non-RT vCPU can
acquire some guest kernel lock, then be pre-empted by the emulator
thread while holding this lock. This situation blocks the RT vCPUs
from doing their work. So that is why we have implemented [2].
For DPDK I don't think we have such problems because it's
running in userland.

So for DPDK context I think we could have a mask like we have
for RT and basically considering vCPU0 to handle best effort
works (emulator threads, SSH...). I think it's the current
pattern used by DPDK users.


DPDK is just a library and one can imagine an application that has
cross-core communication/synchronisation needs where the emulator
slowing down vcpu0 will also slow down vcpu1. Your DPDK application
would have to know which of its cores did not get a full pCPU.

I am not sure what the DPDK example is doing in this discussion,
would that not just be cpu_policy=dedicated? I guess normal
behaviour of dedicated is that emulators and io happily share
pCPUs with vCPUs and you are looking for a way to restrict
emulators/io to a subset of pCPUs because you can live with some
of them being not 100%.


Yes.  A typical DPDK-using VM might look something like this:

vCPU0: non-realtime, housekeeping and I/O, handles all virtual
interrupts and "normal" linux stuff, emulator runs on same pCPU
vCPU1: realtime, runs in tight loop in userspace processing packets
vCPU2: realtime, runs in tight loop in userspace processing packets
vCPU3: realtime, runs in tight loop in userspace processing packets

In this context, vCPUs 1-3 don't really ever enter the kernel, and
we've offloaded as much kernel work as possible from them onto
vCPU0.  This works pretty well with the current system.


For RT we have to isolate the emulator threads to an additional
pCPU per guest or, as you are suggesting, to a set of pCPUs for
all the guests running.

I think we should introduce a new option:

- hw:cpu_emulator_threads_mask=^1

If on 'nova.conf' - that mask will be applied to the set of all
host CPUs (vcpu_pin_set) to basically pack the emulator threads
of all VMs running here (useful for RT context).


That would allow modelling exactly what we need.
In nova.conf we are talking absolute known values, no need for a
mask and a set is much easier to read. Also using the same name
does not sound like a good idea.
And the name vcpu_pin_set clearly suggest what kind of load runs
here, if using a mask it should be called pin_set.


I agree with Henning.

In nova.conf we should just use a set, something like
"rt_emulator_vcpu_pin_set" which would be used for running the
emulator/io threads of *only* realtime instances.


I don't agree with you: we have a set of pCPUs and we want to
subtract some of them for the emulator threads. We need a mask. The
only set we need is the one that selects which pCPUs Nova can use
(vcpu_pin_set).


At that point it does not really matter whether it is a set or a mask.
They can both express the same thing, but a set is easier to read/configure.
With the same argument you could say that vcpu_pin_set should be a mask
over the host's pCPUs.

As I said before: vcpu_pin_set should be renamed because all sorts of
threads are put here (pcpu_pin_set?). But that would be a bigger change
and should be discussed as a separate issue.
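
As an illustration of that point, both spellings can select the same emulator pCPUs. The parsing below mimics nova's "2-7,^5" cpu-spec style, and the "mask" interpretation is only one possible reading of the proposal, so treat this as a sketch:

  def parse_cpu_spec(spec):
      include, exclude = set(), set()
      for tok in spec.split(","):
          target = exclude if tok.startswith("^") else include
          tok = tok.lstrip("^")
          if "-" in tok:
              lo, hi = map(int, tok.split("-"))
              target.update(range(lo, hi + 1))
          else:
              target.add(int(tok))
      return include - exclude

  vcpu_pin_set = parse_cpu_spec("2-7")        # pCPUs nova may use: {2..7}

  # Option A: an explicit set of emulator pCPUs.
  emulator_pin_set = parse_cpu_spec("2-3")                          # -> {2, 3}

  # Option B: a mask over vcpu_pin_set, keeping the masked-out pCPUs
  # for emulator threads.
  emulator_from_mask = vcpu_pin_set - parse_cpu_spec("2-7,^2,^3")   # -> {2, 3}

  assert emulator_pin_set == emulator_from_mask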

So far we talked about a compute-node for realtime only doing realtime.
In that case vcpu_pin_set + emulator_io_mask would work. If you want to
run regular VMs on the same host, you can run a second nova, like we do.

We could also use vcpu_pin_set + rt_vcpu_pin_set(/mask). I think that
would allow modelling all cases in just one nova. Having all in one
nova, you could potentially repurpose rt cpus to best-effort and back.
Some day in the future ...


That is not something we should allow or at least
advertise. A compute node can't run both RT and non-RT guests, because
the nodes should have an RT kernel. We can't guarantee RT if
both are on the same nodes.


A compute node with an RT OS could run RT and non-RT guests at the same time 
just fine.  In a small cloud (think hyperconverged with maybe two nodes total) 
it's not viable to dedicate an entire node to just RT loads.


I'd personally rather see nova able to handle a mix of RT and non-RT than need 
to run multiple nova instances on the same node and figure out an up-front split 
of resources between RT nova and non-RT nova.  Better to allow nova to 
dynamically allocate resources as needed.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: o

Re: [openstack-dev] realtime kvm cpu affinities

2017-06-27 Thread Henning Schild
Am Tue, 27 Jun 2017 09:44:22 +0200
schrieb Sahid Orentino Ferdjaoui :

> On Mon, Jun 26, 2017 at 10:19:12AM +0200, Henning Schild wrote:
> > Am Sun, 25 Jun 2017 10:09:10 +0200
> > schrieb Sahid Orentino Ferdjaoui :
> >   
> > > On Fri, Jun 23, 2017 at 10:34:26AM -0600, Chris Friesen wrote:  
> > > > On 06/23/2017 09:35 AM, Henning Schild wrote:
> > > > > Am Fri, 23 Jun 2017 11:11:10 +0200
> > > > > schrieb Sahid Orentino Ferdjaoui :
> > > > 
> > > > > > In Linux RT context, and as you mentioned, the non-RT vCPU
> > > > > > can acquire some guest kernel lock, then be pre-empted by
> > > > > > emulator thread while holding this lock. This situation
> > > > > > blocks RT vCPUs from doing its work. So that is why we have
> > > > > > implemented [2]. For DPDK I don't think we have such
> > > > > > problems because it's running in userland.
> > > > > > 
> > > > > > So for DPDK context I think we could have a mask like we
> > > > > > have for RT and basically considering vCPU0 to handle best
> > > > > > effort works (emulator threads, SSH...). I think it's the
> > > > > > current pattern used by DPDK users.
> > > > > 
> > > > > DPDK is just a library and one can imagine an application
> > > > > that has cross-core communication/synchronisation needs where
> > > > > the emulator slowing down vpu0 will also slow down vcpu1. You
> > > > > DPDK application would have to know which of its cores did
> > > > > not get a full pcpu.
> > > > > 
> > > > > I am not sure what the DPDK-example is doing in this
> > > > > discussion, would that not just be cpu_policy=dedicated? I
> > > > > guess normal behaviour of dedicated is that emulators and io
> > > > > happily share pCPUs with vCPUs and you are looking for a way
> > > > > to restrict emulators/io to a subset of pCPUs because you can
> > > > > live with some of them beeing not 100%.
> > > > 
> > > > Yes.  A typical DPDK-using VM might look something like this:
> > > > 
> > > > vCPU0: non-realtime, housekeeping and I/O, handles all virtual
> > > > interrupts and "normal" linux stuff, emulator runs on same pCPU
> > > > vCPU1: realtime, runs in tight loop in userspace processing
> > > > packets vCPU2: realtime, runs in tight loop in userspace
> > > > processing packets vCPU3: realtime, runs in tight loop in
> > > > userspace processing packets
> > > > 
> > > > In this context, vCPUs 1-3 don't really ever enter the kernel,
> > > > and we've offloaded as much kernel work as possible from them
> > > > onto vCPU0.  This works pretty well with the current system.
> > > > 
> > > > > > For RT we have to isolate the emulator threads to an
> > > > > > additional pCPU per guests or as your are suggesting to a
> > > > > > set of pCPUs for all the guests running.
> > > > > > 
> > > > > > I think we should introduce a new option:
> > > > > > 
> > > > > >- hw:cpu_emulator_threads_mask=^1
> > > > > > 
> > > > > > If on 'nova.conf' - that mask will be applied to the set of
> > > > > > all host CPUs (vcpu_pin_set) to basically pack the emulator
> > > > > > threads of all VMs running here (useful for RT context).
> > > > > 
> > > > > That would allow modelling exactly what we need.
> > > > > In nova.conf we are talking absolute known values, no need
> > > > > for a mask and a set is much easier to read. Also using the
> > > > > same name does not sound like a good idea.
> > > > > And the name vcpu_pin_set clearly suggest what kind of load
> > > > > runs here, if using a mask it should be called pin_set.
> > > > 
> > > > I agree with Henning.
> > > > 
> > > > In nova.conf we should just use a set, something like
> > > > "rt_emulator_vcpu_pin_set" which would be used for running the
> > > > emulator/io threads of *only* realtime instances.
> > > 
> > > I'm not agree with you, we have a set of pCPUs and we want to
> > > substract some of them for the emulator threads. We need a mask.
> > > The only set we need is to selection which pCPUs Nova can use
> > > (vcpus_pin_set).  
> > 
> > At that point it does not really matter whether it is a set or a
> > mask. They can both express the same where a set is easier to
> > read/configure. With the same argument you could say that
> > vcpu_pin_set should be a mask over the hosts pcpus.
> > 
> > As i said before: vcpu_pin_set should be renamed because all sorts
> > of threads are put here (pcpu_pin_set?). But that would be a bigger
> > change and should be discussed as a seperate issue.
> > 
> > So far we talked about a compute-node for realtime only doing
> > realtime. In that case vcpu_pin_set + emulator_io_mask would work.
> > If you want to run regular VMs on the same host, you can run a
> > second nova, like we do.
> > 
> > We could also use vcpu_pin_set + rt_vcpu_pin_set(/mask). I think
> > that would allow modelling all cases in just one nova. Having all
> > in one nova, you could potentially repurpose rt cpus to best-effort
> > and back. Some day in the future ...  
> 
> That is not something we should allow

Re: [openstack-dev] realtime kvm cpu affinities

2017-06-27 Thread Sahid Orentino Ferdjaoui
On Mon, Jun 26, 2017 at 12:12:49PM -0600, Chris Friesen wrote:
> On 06/25/2017 02:09 AM, Sahid Orentino Ferdjaoui wrote:
> > On Fri, Jun 23, 2017 at 10:34:26AM -0600, Chris Friesen wrote:
> > > On 06/23/2017 09:35 AM, Henning Schild wrote:
> > > > Am Fri, 23 Jun 2017 11:11:10 +0200
> > > > schrieb Sahid Orentino Ferdjaoui :
> > > 
> > > > > In Linux RT context, and as you mentioned, the non-RT vCPU can acquire
> > > > > some guest kernel lock, then be pre-empted by emulator thread while
> > > > > holding this lock. This situation blocks RT vCPUs from doing its
> > > > > work. So that is why we have implemented [2]. For DPDK I don't think
> > > > > we have such problems because it's running in userland.
> > > > > 
> > > > > So for DPDK context I think we could have a mask like we have for RT
> > > > > and basically considering vCPU0 to handle best effort works (emulator
> > > > > threads, SSH...). I think it's the current pattern used by DPDK users.
> > > > 
> > > > DPDK is just a library and one can imagine an application that has
> > > > cross-core communication/synchronisation needs where the emulator
> > > > slowing down vcpu0 will also slow down vcpu1. Your DPDK application would
> > > > have to know which of its cores did not get a full pcpu.
> > > > 
> > > > I am not sure what the DPDK-example is doing in this discussion, would
> > > > that not just be cpu_policy=dedicated? I guess normal behaviour of
> > > > dedicated is that emulators and io happily share pCPUs with vCPUs and
> > > > you are looking for a way to restrict emulators/io to a subset of pCPUs
> > > > because you can live with some of them being not 100%.
> > > 
> > > Yes.  A typical DPDK-using VM might look something like this:
> > > 
> > > vCPU0: non-realtime, housekeeping and I/O, handles all virtual interrupts
> > > and "normal" linux stuff, emulator runs on same pCPU
> > > vCPU1: realtime, runs in tight loop in userspace processing packets
> > > vCPU2: realtime, runs in tight loop in userspace processing packets
> > > vCPU3: realtime, runs in tight loop in userspace processing packets
> > > 
> > > In this context, vCPUs 1-3 don't really ever enter the kernel, and we've
> > > offloaded as much kernel work as possible from them onto vCPU0.  This 
> > > works
> > > pretty well with the current system.
> > > 
> > > > > For RT we have to isolate the emulator threads to an additional pCPU
> > > > > per guest or, as you are suggesting, to a set of pCPUs for all the
> > > > > guests running.
> > > > > 
> > > > > I think we should introduce a new option:
> > > > > 
> > > > > - hw:cpu_emulator_threads_mask=^1
> > > > > 
> > > > > If on 'nova.conf' - that mask will be applied to the set of all host
> > > > > CPUs (vcpu_pin_set) to basically pack the emulator threads of all VMs
> > > > > running here (useful for RT context).
> > > > 
> > > > That would allow modelling exactly what we need.
> > > > In nova.conf we are talking about absolute, known values; there is no
> > > > need for a mask, and a set is much easier to read. Also, using the
> > > > same name does not sound like a good idea.
> > > > And the name vcpu_pin_set clearly suggests what kind of load runs here;
> > > > if using a mask, it should be called pin_set.
> > > 
> > > I agree with Henning.
> > > 
> > > In nova.conf we should just use a set, something like
> > > "rt_emulator_vcpu_pin_set" which would be used for running the emulator/io
> > > threads of *only* realtime instances.
> > 
> > I don't agree with you: we have a set of pCPUs and we want to
> > subtract some of them for the emulator threads. We need a mask. The
> > only set we need is the one to select which pCPUs Nova can use
> > (vcpu_pin_set).
> > 
> > > We may also want to have "rt_emulator_overcommit_ratio" to control how 
> > > many
> > > threads/instances we allow per pCPU.
> > 
> > Not really sure I understand this point. If it is to indicate
> > that for an isolated pCPU we want X guest emulator threads, the same
> > behavior is achieved by the mask. A host for realtime is dedicated to
> > realtime, with no overcommitment, and the operators know the number of
> > host CPUs, so they can easily deduce a ratio and thus the corresponding mask.
> 
> Suppose I have a host with 64 CPUs.  I reserve three for host overhead and
> networking, leaving 61 for instances.  If I have instances with one non-RT
> vCPU and one RT vCPU then I can run 30 instances.  If instead my instances
> have one non-RT and 5 RT vCPUs then I can run 12 instances.  If I put all of
> my emulator threads on the same pCPU, it might make a difference whether I
> put 30 sets of emulator threads or 12 sets.

Oh, I understand your point now, but I am not sure that is going to make
any difference. I would say the load on the isolated cores is probably
going to be the same. There will be some overhead from the number of
threads handled, which will be slightly higher in your first scenario.

> The proposed "rt_emulator_overcommit_ratio" would simply say "nova is
> allowed to run X instances worth of emulator threads on each pCPU in
> "rt_emulator_vcpu_pin_set".

Re: [openstack-dev] realtime kvm cpu affinities

2017-06-27 Thread Sahid Orentino Ferdjaoui
On Mon, Jun 26, 2017 at 10:19:12AM +0200, Henning Schild wrote:
> Am Sun, 25 Jun 2017 10:09:10 +0200
> schrieb Sahid Orentino Ferdjaoui :
> 
> > On Fri, Jun 23, 2017 at 10:34:26AM -0600, Chris Friesen wrote:
> > > On 06/23/2017 09:35 AM, Henning Schild wrote:  
> > > > Am Fri, 23 Jun 2017 11:11:10 +0200
> > > > schrieb Sahid Orentino Ferdjaoui :  
> > >   
> > > > > In Linux RT context, and as you mentioned, the non-RT vCPU can
> > > > > acquire some guest kernel lock, then be pre-empted by emulator
> > > > > thread while holding this lock. This situation blocks RT vCPUs
> > > > > from doing its work. So that is why we have implemented [2].
> > > > > For DPDK I don't think we have such problems because it's
> > > > > running in userland.
> > > > > 
> > > > > So for DPDK context I think we could have a mask like we have
> > > > > for RT and basically considering vCPU0 to handle best effort
> > > > > works (emulator threads, SSH...). I think it's the current
> > > > > pattern used by DPDK users.  
> > > > 
> > > > DPDK is just a library and one can imagine an application that has
> > > > cross-core communication/synchronisation needs where the emulator
> > > > slowing down vcpu0 will also slow down vcpu1. Your DPDK application
> > > > would have to know which of its cores did not get a full pcpu.
> > > > 
> > > > I am not sure what the DPDK-example is doing in this discussion,
> > > > would that not just be cpu_policy=dedicated? I guess normal
> > > > behaviour of dedicated is that emulators and io happily share
> > > > pCPUs with vCPUs and you are looking for a way to restrict
> > > > emulators/io to a subset of pCPUs because you can live with some
> > > > of them being not 100%.
> > > 
> > > Yes.  A typical DPDK-using VM might look something like this:
> > > 
> > > vCPU0: non-realtime, housekeeping and I/O, handles all virtual
> > > interrupts and "normal" linux stuff, emulator runs on same pCPU
> > > vCPU1: realtime, runs in tight loop in userspace processing packets
> > > vCPU2: realtime, runs in tight loop in userspace processing packets
> > > vCPU3: realtime, runs in tight loop in userspace processing packets
> > > 
> > > In this context, vCPUs 1-3 don't really ever enter the kernel, and
> > > we've offloaded as much kernel work as possible from them onto
> > > vCPU0.  This works pretty well with the current system.
> > >   
> > > > > For RT we have to isolate the emulator threads to an additional
> > > > > pCPU per guest or, as you are suggesting, to a set of pCPUs for
> > > > > all the guests running.
> > > > > 
> > > > > I think we should introduce a new option:
> > > > > 
> > > > >- hw:cpu_emulator_threads_mask=^1
> > > > > 
> > > > > If on 'nova.conf' - that mask will be applied to the set of all
> > > > > host CPUs (vcpu_pin_set) to basically pack the emulator threads
> > > > > of all VMs running here (useful for RT context).  
> > > > 
> > > > That would allow modelling exactly what we need.
> > > > In nova.conf we are talking about absolute, known values; there is
> > > > no need for a mask, and a set is much easier to read. Also, using
> > > > the same name does not sound like a good idea.
> > > > And the name vcpu_pin_set clearly suggests what kind of load runs
> > > > here; if using a mask, it should be called pin_set.
> > > 
> > > I agree with Henning.
> > > 
> > > In nova.conf we should just use a set, something like
> > > "rt_emulator_vcpu_pin_set" which would be used for running the
> > > emulator/io threads of *only* realtime instances.  
> > 
> > I don't agree with you: we have a set of pCPUs and we want to
> > subtract some of them for the emulator threads. We need a mask. The
> > only set we need is the one to select which pCPUs Nova can use
> > (vcpu_pin_set).
> 
> At that point it does not really matter whether it is a set or a mask.
> They can both express the same thing, but a set is easier to read/configure.
> With the same argument you could say that vcpu_pin_set should be a mask
> over the host's pcpus.
> 
> As I said before: vcpu_pin_set should be renamed because all sorts of
> threads are put here (pcpu_pin_set?). But that would be a bigger change
> and should be discussed as a separate issue.
> 
> So far we talked about a compute-node for realtime only doing realtime.
> In that case vcpu_pin_set + emulator_io_mask would work. If you want to
> run regular VMs on the same host, you can run a second nova, like we do.
> 
> We could also use vcpu_pin_set + rt_vcpu_pin_set(/mask). I think that
> would allow modelling all cases in just one nova. Having all in one
> nova, you could potentially repurpose rt cpus to best-effort and back.
> Some day in the future ...

That is not something we should allow, or at least advertise. A
compute node can't run both RT and non-RT guests, because the nodes
should run an RT kernel. We can't guarantee RT if both are on the same
node.

The realtime nodes should be isolated by aggregates as you seem to do.

> > > We may also want to have "rt_emulator_overcommit_ratio" to control how
> > > many threads/instances we allow per pCPU.

Re: [openstack-dev] realtime kvm cpu affinities

2017-06-26 Thread Chris Friesen

On 06/25/2017 02:09 AM, Sahid Orentino Ferdjaoui wrote:

On Fri, Jun 23, 2017 at 10:34:26AM -0600, Chris Friesen wrote:

On 06/23/2017 09:35 AM, Henning Schild wrote:

Am Fri, 23 Jun 2017 11:11:10 +0200
schrieb Sahid Orentino Ferdjaoui :



In Linux RT context, and as you mentioned, the non-RT vCPU can acquire
some guest kernel lock, then be pre-empted by emulator thread while
holding this lock. This situation blocks RT vCPUs from doing its
work. So that is why we have implemented [2]. For DPDK I don't think
we have such problems because it's running in userland.

So for DPDK context I think we could have a mask like we have for RT
and basically considering vCPU0 to handle best effort works (emulator
threads, SSH...). I think it's the current pattern used by DPDK users.


DPDK is just a library and one can imagine an application that has
cross-core communication/synchronisation needs where the emulator
slowing down vcpu0 will also slow down vcpu1. Your DPDK application would
have to know which of its cores did not get a full pcpu.

I am not sure what the DPDK-example is doing in this discussion, would
that not just be cpu_policy=dedicated? I guess normal behaviour of
dedicated is that emulators and io happily share pCPUs with vCPUs and
you are looking for a way to restrict emulators/io to a subset of pCPUs
because you can live with some of them being not 100%.


Yes.  A typical DPDK-using VM might look something like this:

vCPU0: non-realtime, housekeeping and I/O, handles all virtual interrupts
and "normal" linux stuff, emulator runs on same pCPU
vCPU1: realtime, runs in tight loop in userspace processing packets
vCPU2: realtime, runs in tight loop in userspace processing packets
vCPU3: realtime, runs in tight loop in userspace processing packets

In this context, vCPUs 1-3 don't really ever enter the kernel, and we've
offloaded as much kernel work as possible from them onto vCPU0.  This works
pretty well with the current system.


For RT we have to isolate the emulator threads to an additional pCPU
per guest or, as you are suggesting, to a set of pCPUs for all the
guests running.

I think we should introduce a new option:

- hw:cpu_emulator_threads_mask=^1

If on 'nova.conf' - that mask will be applied to the set of all host
CPUs (vcpu_pin_set) to basically pack the emulator threads of all VMs
running here (useful for RT context).


That would allow modelling exactly what we need.
In nova.conf we are talking about absolute, known values; there is no
need for a mask, and a set is much easier to read. Also, using the same
name does not sound like a good idea.
And the name vcpu_pin_set clearly suggests what kind of load runs here;
if using a mask, it should be called pin_set.


I agree with Henning.

In nova.conf we should just use a set, something like
"rt_emulator_vcpu_pin_set" which would be used for running the emulator/io
threads of *only* realtime instances.


I don't agree with you: we have a set of pCPUs and we want to
subtract some of them for the emulator threads. We need a mask. The
only set we need is the one to select which pCPUs Nova can use
(vcpu_pin_set).
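
To make the mask idea concrete, here is a minimal sketch (illustrative
only, not nova code; the option name and the "^" syntax come from the
proposal above, and reading "^1" as "subtract pCPU 1 for emulator
threads" is one possible interpretation of it):

    # Illustrative sketch: split vcpu_pin_set with a "^N"-style mask so that
    # the masked pCPUs host the packed emulator/io threads and the rest host
    # the pinned guest vCPUs. Names and semantics are assumptions.

    def parse_cpu_set(spec):
        """Parse a "1-7,9" style CPU list into a set of ints."""
        cpus = set()
        for part in spec.split(","):
            if "-" in part:
                lo, hi = part.split("-")
                cpus.update(range(int(lo), int(hi) + 1))
            else:
                cpus.add(int(part))
        return cpus

    def split_by_emulator_mask(vcpu_pin_set, mask):
        host_cpus = parse_cpu_set(vcpu_pin_set)
        masked = {int(tok.lstrip("^")) for tok in mask.split(",")}
        emulator_cpus = host_cpus & masked    # packed emulator/io threads
        vcpu_cpus = host_cpus - masked        # left for pinned guest vCPUs
        return vcpu_cpus, emulator_cpus

    vcpus, emulators = split_by_emulator_mask("1-7", "^1")
    print(sorted(vcpus))      # [2, 3, 4, 5, 6, 7]
    print(sorted(emulators))  # [1]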


We may also want to have "rt_emulator_overcommit_ratio" to control how many
threads/instances we allow per pCPU.


Not really sure I understand this point. If it is to indicate
that for an isolated pCPU we want X guest emulator threads, the same
behavior is achieved by the mask. A host for realtime is dedicated to
realtime, with no overcommitment, and the operators know the number of
host CPUs, so they can easily deduce a ratio and thus the corresponding mask.


Suppose I have a host with 64 CPUs.  I reserve three for host overhead and 
networking, leaving 61 for instances.  If I have instances with one non-RT vCPU 
and one RT vCPU then I can run 30 instances.  If instead my instances have one 
non-RT and 5 RT vCPUs then I can run 12 instances.  If I put all of my emulator 
threads on the same pCPU, it might make a difference whether I put 30 sets of 
emulator threads or 12 sets.


The proposed "rt_emulator_overcommit_ratio" would simply say "nova is allowed to 
run X instances worth of emulator threads on each pCPU in 
"rt_emulator_vcpu_pin_set".  If we've hit that threshold, then no more RT 
instances are allowed to schedule on this compute node (but non-RT instances 
would still be allowed).
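
A rough sketch of that admission check, assuming the proposed (not
existing) options "rt_emulator_vcpu_pin_set" and
"rt_emulator_overcommit_ratio"; the logic itself is an illustration:

    # Illustrative only: refuse further RT instances once the shared emulator
    # pCPUs already carry "ratio" instances' worth of emulator threads.

    def allow_new_rt_instance(rt_emulator_pcpus, rt_instances_placed,
                              rt_emulator_overcommit_ratio):
        """True if one more RT instance's emulator threads still fit."""
        capacity = len(rt_emulator_pcpus) * rt_emulator_overcommit_ratio
        return rt_instances_placed + 1 <= capacity

    # One pCPU reserved for emulator threads, ratio 8 -> at most 8 RT
    # instances may pack their emulator threads onto it.
    print(allow_new_rt_instance({2}, 7, 8))   # True
    print(allow_new_rt_instance({2}, 8, 8))   # False, threshold reached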


Chris



Re: [openstack-dev] realtime kvm cpu affinities

2017-06-26 Thread Henning Schild
Am Sun, 25 Jun 2017 10:09:10 +0200
schrieb Sahid Orentino Ferdjaoui :

> On Fri, Jun 23, 2017 at 10:34:26AM -0600, Chris Friesen wrote:
> > On 06/23/2017 09:35 AM, Henning Schild wrote:  
> > > Am Fri, 23 Jun 2017 11:11:10 +0200
> > > schrieb Sahid Orentino Ferdjaoui :  
> >   
> > > > In Linux RT context, and as you mentioned, the non-RT vCPU can
> > > > acquire some guest kernel lock, then be pre-empted by emulator
> > > > thread while holding this lock. This situation blocks RT vCPUs
> > > > from doing its work. So that is why we have implemented [2].
> > > > For DPDK I don't think we have such problems because it's
> > > > running in userland.
> > > > 
> > > > So for DPDK context I think we could have a mask like we have
> > > > for RT and basically considering vCPU0 to handle best effort
> > > > works (emulator threads, SSH...). I think it's the current
> > > > pattern used by DPDK users.  
> > > 
> > > DPDK is just a library and one can imagine an application that has
> > > cross-core communication/synchronisation needs where the emulator
> > > slowing down vcpu0 will also slow down vcpu1. Your DPDK application
> > > would have to know which of its cores did not get a full pcpu.
> > > 
> > > I am not sure what the DPDK-example is doing in this discussion,
> > > would that not just be cpu_policy=dedicated? I guess normal
> > > behaviour of dedicated is that emulators and io happily share
> > > pCPUs with vCPUs and you are looking for a way to restrict
> > > emulators/io to a subset of pCPUs because you can live with some
> > > of them being not 100%.
> > 
> > Yes.  A typical DPDK-using VM might look something like this:
> > 
> > vCPU0: non-realtime, housekeeping and I/O, handles all virtual
> > interrupts and "normal" linux stuff, emulator runs on same pCPU
> > vCPU1: realtime, runs in tight loop in userspace processing packets
> > vCPU2: realtime, runs in tight loop in userspace processing packets
> > vCPU3: realtime, runs in tight loop in userspace processing packets
> > 
> > In this context, vCPUs 1-3 don't really ever enter the kernel, and
> > we've offloaded as much kernel work as possible from them onto
> > vCPU0.  This works pretty well with the current system.
> >   
> > > > For RT we have to isolate the emulator threads to an additional
> > > > pCPU per guest or, as you are suggesting, to a set of pCPUs for
> > > > all the guests running.
> > > > 
> > > > I think we should introduce a new option:
> > > > 
> > > >- hw:cpu_emulator_threads_mask=^1
> > > > 
> > > > If on 'nova.conf' - that mask will be applied to the set of all
> > > > host CPUs (vcpu_pin_set) to basically pack the emulator threads
> > > > of all VMs running here (useful for RT context).  
> > > 
> > > That would allow modelling exactly what we need.
> > > In nova.conf we are talking about absolute, known values; there is
> > > no need for a mask, and a set is much easier to read. Also, using
> > > the same name does not sound like a good idea.
> > > And the name vcpu_pin_set clearly suggests what kind of load runs
> > > here; if using a mask, it should be called pin_set.
> > 
> > I agree with Henning.
> > 
> > In nova.conf we should just use a set, something like
> > "rt_emulator_vcpu_pin_set" which would be used for running the
> > emulator/io threads of *only* realtime instances.  
> 
> I don't agree with you: we have a set of pCPUs and we want to
> subtract some of them for the emulator threads. We need a mask. The
> only set we need is the one to select which pCPUs Nova can use
> (vcpu_pin_set).

At that point it does not really matter whether it is a set or a mask.
They can both express the same thing, but a set is easier to read/configure.
With the same argument you could say that vcpu_pin_set should be a mask
over the host's pcpus.
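
Purely as an illustration of that equivalence (made-up values, not an
existing config format): the same split can be written as an explicit
set of emulator pCPUs or as a mask over vcpu_pin_set, and both carry
identical information:

    # vcpu_pin_set = 1-7; pCPU 1 is meant to host the emulator/io threads.
    vcpu_pin_set = set(range(1, 8))

    emulator_as_set  = {1}                                  # "set" notation
    emulator_as_mask = vcpu_pin_set - {2, 3, 4, 5, 6, 7}    # "mask"/complement notation

    assert emulator_as_set == emulator_as_mask              # same pCPU either way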

As I said before: vcpu_pin_set should be renamed because all sorts of
threads are put here (pcpu_pin_set?). But that would be a bigger change
and should be discussed as a separate issue.

So far we talked about a compute-node for realtime only doing realtime.
In that case vcpu_pin_set + emulator_io_mask would work. If you want to
run regular VMs on the same host, you can run a second nova, like we do.

We could also use vcpu_pin_set + rt_vcpu_pin_set(/mask). I think that
would allow modelling all cases in just one nova. Having all in one
nova, you could potentially repurpose rt cpus to best-effort and back.
Some day in the future ...

> > We may also want to have "rt_emulator_overcommit_ratio" to control
> > how many threads/instances we allow per pCPU.  
> 
> Not really sure I understand this point. If it is to indicate
> that for an isolated pCPU we want X guest emulator threads, the same
> behavior is achieved by the mask. A host for realtime is dedicated to
> realtime, with no overcommitment, and the operators know the number of
> host CPUs, so they can easily deduce a ratio and thus the corresponding mask.

Agreed.

> > > > If on flavor extra-specs it will be applied to the vCPUs
> > > > dedicated for the guest (useful for DPDK context).

Re: [openstack-dev] realtime kvm cpu affinities

2017-06-25 Thread Sahid Orentino Ferdjaoui
On Fri, Jun 23, 2017 at 10:34:26AM -0600, Chris Friesen wrote:
> On 06/23/2017 09:35 AM, Henning Schild wrote:
> > Am Fri, 23 Jun 2017 11:11:10 +0200
> > schrieb Sahid Orentino Ferdjaoui :
> 
> > > In Linux RT context, and as you mentioned, the non-RT vCPU can acquire
> > > some guest kernel lock, then be pre-empted by emulator thread while
> > > holding this lock. This situation blocks RT vCPUs from doing its
> > > work. So that is why we have implemented [2]. For DPDK I don't think
> > > we have such problems because it's running in userland.
> > > 
> > > So for DPDK context I think we could have a mask like we have for RT
> > > and basically considering vCPU0 to handle best effort works (emulator
> > > threads, SSH...). I think it's the current pattern used by DPDK users.
> > 
> > DPDK is just a library and one can imagine an application that has
> > cross-core communication/synchronisation needs where the emulator
> > slowing down vcpu0 will also slow down vcpu1. Your DPDK application would
> > have to know which of its cores did not get a full pcpu.
> > 
> > I am not sure what the DPDK-example is doing in this discussion, would
> > that not just be cpu_policy=dedicated? I guess normal behaviour of
> > dedicated is that emulators and io happily share pCPUs with vCPUs and
> > you are looking for a way to restrict emulators/io to a subset of pCPUs
> > because you can live with some of them being not 100%.
> 
> Yes.  A typical DPDK-using VM might look something like this:
> 
> vCPU0: non-realtime, housekeeping and I/O, handles all virtual interrupts
> and "normal" linux stuff, emulator runs on same pCPU
> vCPU1: realtime, runs in tight loop in userspace processing packets
> vCPU2: realtime, runs in tight loop in userspace processing packets
> vCPU3: realtime, runs in tight loop in userspace processing packets
> 
> In this context, vCPUs 1-3 don't really ever enter the kernel, and we've
> offloaded as much kernel work as possible from them onto vCPU0.  This works
> pretty well with the current system.
> 
> > > For RT we have to isolate the emulator threads to an additional pCPU
> > > per guest or, as you are suggesting, to a set of pCPUs for all the
> > > guests running.
> > > 
> > > I think we should introduce a new option:
> > > 
> > >- hw:cpu_emulator_threads_mask=^1
> > > 
> > > If on 'nova.conf' - that mask will be applied to the set of all host
> > > CPUs (vcpu_pin_set) to basically pack the emulator threads of all VMs
> > > running here (useful for RT context).
> > 
> > That would allow modelling exactly what we need.
> > In nova.conf we are talking about absolute, known values; there is no
> > need for a mask, and a set is much easier to read. Also, using the
> > same name does not sound like a good idea.
> > And the name vcpu_pin_set clearly suggests what kind of load runs here;
> > if using a mask, it should be called pin_set.
> 
> I agree with Henning.
> 
> In nova.conf we should just use a set, something like
> "rt_emulator_vcpu_pin_set" which would be used for running the emulator/io
> threads of *only* realtime instances.

I don't agree with you: we have a set of pCPUs and we want to
subtract some of them for the emulator threads. We need a mask. The
only set we need is the one to select which pCPUs Nova can use
(vcpu_pin_set).

> We may also want to have "rt_emulator_overcommit_ratio" to control how many
> threads/instances we allow per pCPU.

Not really sure I understand this point. If it is to indicate
that for an isolated pCPU we want X guest emulator threads, the same
behavior is achieved by the mask. A host for realtime is dedicated to
realtime, with no overcommitment, and the operators know the number of
host CPUs, so they can easily deduce a ratio and thus the corresponding mask.

> > > If on flavor extra-specs it will be applied to the vCPUs dedicated for
> > > the guest (useful for DPDK context).
> > 
> > And if both are present the flavor wins and nova.conf is ignored?
> 
> In the flavor I'd like to see it be a full bitmask, not an exclusion mask
> with an implicit full set.  Thus the end-user could specify
> "hw:cpu_emulator_threads_mask=0" and get the emulator threads to run
> alongside vCPU0.

Same here, I don't agree: the only set is the vCPUs of the guest. Then
we want a mask to subtract some of them.

> Henning, there is no conflict, the nova.conf setting and the flavor setting
> are used for two different things.
> 
> Chris
> 



Re: [openstack-dev] realtime kvm cpu affinities

2017-06-23 Thread Chris Friesen

On 06/23/2017 09:35 AM, Henning Schild wrote:

Am Fri, 23 Jun 2017 11:11:10 +0200
schrieb Sahid Orentino Ferdjaoui :



In Linux RT context, and as you mentioned, the non-RT vCPU can acquire
some guest kernel lock, then be pre-empted by emulator thread while
holding this lock. This situation blocks RT vCPUs from doing its
work. So that is why we have implemented [2]. For DPDK I don't think
we have such problems because it's running in userland.

So for DPDK context I think we could have a mask like we have for RT
and basically considering vCPU0 to handle best effort works (emulator
threads, SSH...). I think it's the current pattern used by DPDK users.


DPDK is just a library and one can imagine an application that has
cross-core communication/synchronisation needs where the emulator
slowing down vcpu0 will also slow down vcpu1. Your DPDK application would
have to know which of its cores did not get a full pcpu.

I am not sure what the DPDK-example is doing in this discussion, would
that not just be cpu_policy=dedicated? I guess normal behaviour of
dedicated is that emulators and io happily share pCPUs with vCPUs and
you are looking for a way to restrict emulators/io to a subset of pCPUs
because you can live with some of them being not 100%.


Yes.  A typical DPDK-using VM might look something like this:

vCPU0: non-realtime, housekeeping and I/O, handles all virtual interrupts and 
"normal" linux stuff, emulator runs on same pCPU

vCPU1: realtime, runs in tight loop in userspace processing packets
vCPU2: realtime, runs in tight loop in userspace processing packets
vCPU3: realtime, runs in tight loop in userspace processing packets

In this context, vCPUs 1-3 don't really ever enter the kernel, and we've 
offloaded as much kernel work as possible from them onto vCPU0.  This works 
pretty well with the current system.



For RT we have to isolate the emulator threads to an additional pCPU
per guest or, as you are suggesting, to a set of pCPUs for all the
guests running.

I think we should introduce a new option:

   - hw:cpu_emulator_threads_mask=^1

If on 'nova.conf' - that mask will be applied to the set of all host
CPUs (vcpu_pin_set) to basically pack the emulator threads of all VMs
running here (useful for RT context).


That would allow modelling exactly what we need.
In nova.conf we are talking about absolute, known values; there is no
need for a mask, and a set is much easier to read. Also, using the same
name does not sound like a good idea.
And the name vcpu_pin_set clearly suggests what kind of load runs here;
if using a mask, it should be called pin_set.


I agree with Henning.

In nova.conf we should just use a set, something like "rt_emulator_vcpu_pin_set" 
which would be used for running the emulator/io threads of *only* realtime 
instances.


We may also want to have "rt_emulator_overcommit_ratio" to control how many 
threads/instances we allow per pCPU.



If on flavor extra-specs it will be applied to the vCPUs dedicated for
the guest (useful for DPDK context).


And if both are present the flavor wins and nova.conf is ignored?


In the flavor I'd like to see it be a full bitmask, not an exclusion mask with 
an implicit full set.  Thus the end-user could specify 
"hw:cpu_emulator_threads_mask=0" and get the emulator threads to run alongside 
vCPU0.
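
The two flavor-level readings under discussion, sketched on a 4-vCPU
guest (the option name is a proposal and both parsers below are
assumptions, not an existing nova API):

    GUEST_VCPUS = {0, 1, 2, 3}

    def emulator_vcpus_full_bitmask(value):
        """Full-bitmask reading: the value names exactly the vCPUs that host
        the emulator threads, e.g. "0" -> alongside vCPU0."""
        return {int(v) for v in value.split(",")}

    def emulator_vcpus_exclusion_mask(value):
        """Exclusion reading: implicit full set minus the "^N" entries."""
        excluded = {int(v.lstrip("^")) for v in value.split(",") if v.startswith("^")}
        return GUEST_VCPUS - excluded

    print(emulator_vcpus_full_bitmask("0"))      # {0}
    print(emulator_vcpus_exclusion_mask("^1"))   # {0, 2, 3}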


Henning, there is no conflict, the nova.conf setting and the flavor setting are 
used for two different things.


Chris



Re: [openstack-dev] realtime kvm cpu affinities

2017-06-23 Thread Henning Schild
Am Fri, 23 Jun 2017 11:11:10 +0200
schrieb Sahid Orentino Ferdjaoui :

> On Wed, Jun 21, 2017 at 12:47:27PM +0200, Henning Schild wrote:
> > Am Tue, 20 Jun 2017 10:04:30 -0400
> > schrieb Luiz Capitulino :
> >   
> > > On Tue, 20 Jun 2017 09:48:23 +0200
> > > Henning Schild  wrote:
> > >   
> > > > Hi,
> > > > 
> > > > We are using OpenStack for managing realtime guests. We modified
> > > > it and contributed to discussions on how to model the realtime
> > > > feature. More recent versions of OpenStack have support for
> > > > realtime, and there are a few proposals on how to improve that
> > > > further.
> > > > 
> > > > But there is still no full answer on how to distribute threads
> > > > across host-cores. The vcpus are easy but for the emulation and
> > > > io-threads there are multiple options. I would like to collect
> > > > the constraints from a qemu/kvm perspective first, and then
> > > > possibly influence the OpenStack development
> > > > 
> > > > I will put the summary/questions first, the text below provides
> > > > more context to where the questions come from.
> > > > - How do you distribute your threads when reaching the really
> > > > low cyclictest results in the guests? In [3] Rik talked about
> > > > problems like lock holder preemption, starvation etc. but not
> > > > where/how to schedule emulators and io
> > > 
> > > We put emulator threads and io-threads in housekeeping cores in
> > > the host. I think housekeeping cores is what you're calling
> > > best-effort cores, those are non-isolated cores that will run host
> > > load.  
> > 
> > As expected, any best-effort/housekeeping core will do but overlap
> > with the vcpu-cores is a bad idea.
> >   
> > > > - Is it ok to put a vcpu and emulator thread on the same core as
> > > > long as the guest knows about it? Any funny behaving guest, not
> > > > just Linux.
> > > 
> > > We can't do this for KVM-RT because we run all vcpu threads with
> > > FIFO priority.  
> > 
> > Same point as above, meaning the "hw:cpu_realtime_mask" approach is
> > wrong for realtime.
> >   
> > > However, we have another project with DPDK whose goal is to
> > > achieve zero-loss networking. The configuration required by this
> > > project is very similar to the one required by KVM-RT. One
> > > difference though is that we don't use RT and hence don't use
> > > FIFO priority.
> > > 
> > > In this project we've been running with the emulator thread and a
> > > vcpu sharing the same core. As long as the guest housekeeping CPUs
> > > are idle, we don't get any packet drops (most of the time, what
> > > causes packet drops in this test-case would cause spikes in
> > > cyclictest). However, we're seeing some packet drops for certain
> > > guest workloads which we are still debugging.  
> > 
> > Ok but that seems to be a different scenario where hw:cpu_policy
> > dedicated should be sufficient. However if the placement of the io
> > and emulators has to be on a subset of the dedicated cpus something
> > like hw:cpu_realtime_mask would be required.
> >   
> > > > - Is it ok to make the emulators potentially slow by running
> > > > them on busy best-effort cores, or will they quickly be on the
> > > > critical path if you do more than just cyclictest? - our
> > > > experience says we don't need them reactive even with
> > > > rt-networking involved
> > > 
> > > I believe it is ok.  
> > 
> > Ok.
> >
> > > > Our goal is to reach a high packing density of realtime VMs. Our
> > > > pragmatic first choice was to run all non-vcpu-threads on a
> > > > shared set of pcpus where we also run best-effort VMs and host
> > > > load. Now the OpenStack guys are not too happy with that
> > > > because that is load outside the assigned resources, which
> > > > leads to quota and accounting problems.
> > > > 
> > > > So the current OpenStack model is to run those threads next to
> > > > one or more vcpu-threads. [1] You will need to remember that
> > > > the vcpus in question should not be your rt-cpus in the guest.
> > > > I.e. if vcpu0 shares its pcpu with the hypervisor noise your
> > > > preemptrt-guest would use isolcpus=1.
> > > > 
> > > > Is that kind of sharing a pcpu really a good idea? I could
> > > > imagine things like smp housekeeping (cache invalidation etc.)
> > > > to eventually cause vcpu1 having to wait for the emulator stuck
> > > > in IO.
> > > 
> > > Agreed. IIRC, in the beginning of KVM-RT we saw a problem where
> > > running vcpu0 on a non-isolated core and without FIFO priority
> > > caused spikes in vcpu1. I guess we debugged this down to vcpu1
> > > waiting a few dozen microseconds for vcpu0 for some reason.
> > > Running vcpu0 on an isolated core with FIFO priority fixed this
> > > (again, this was years ago, I won't remember all the details).
> > >   
> > > > Or maybe a busy polling vcpu0 starving its own emulator causing
> > > > high latency or even deadlocks.
> > > 
> > > This will probably happen if you run vcpu0 with FIFO priority.

Re: [openstack-dev] realtime kvm cpu affinities

2017-06-23 Thread Sahid Orentino Ferdjaoui
On Wed, Jun 21, 2017 at 12:47:27PM +0200, Henning Schild wrote:
> Am Tue, 20 Jun 2017 10:04:30 -0400
> schrieb Luiz Capitulino :
> 
> > On Tue, 20 Jun 2017 09:48:23 +0200
> > Henning Schild  wrote:
> > 
> > > Hi,
> > > 
> > > We are using OpenStack for managing realtime guests. We modified
> > > it and contributed to discussions on how to model the realtime
> > > feature. More recent versions of OpenStack have support for
> > > realtime, and there are a few proposals on how to improve that
> > > further.
> > > 
> > > But there is still no full answer on how to distribute threads
> > > across host-cores. The vcpus are easy but for the emulation and
> > > io-threads there are multiple options. I would like to collect the
> > > constraints from a qemu/kvm perspective first, and then possibly
> > > influence the OpenStack development
> > > 
> > > I will put the summary/questions first, the text below provides more
> > > context to where the questions come from.
> > > - How do you distribute your threads when reaching the really low
> > >   cyclictest results in the guests? In [3] Rik talked about problems
> > >   like lock holder preemption, starvation etc. but not where/how to
> > >   schedule emulators and io  
> > 
> > We put emulator threads and io-threads in housekeeping cores in
> > the host. I think housekeeping cores is what you're calling
> > best-effort cores, those are non-isolated cores that will run host
> > load.
> 
> As expected, any best-effort/housekeeping core will do but overlap with
> the vcpu-cores is a bad idea.
> 
> > > - Is it ok to put a vcpu and emulator thread on the same core as
> > > long as the guest knows about it? Any funny behaving guest, not
> > > just Linux.  
> > 
> > We can't do this for KVM-RT because we run all vcpu threads with
> > FIFO priority.
> 
> Same point as above, meaning the "hw:cpu_realtime_mask" approach is
> wrong for realtime.
> 
> > However, we have another project with DPDK whose goal is to achieve
> > zero-loss networking. The configuration required by this project is
> > very similar to the one required by KVM-RT. One difference though is
> > that we don't use RT and hence don't use FIFO priority.
> > 
> > In this project we've been running with the emulator thread and a
> > vcpu sharing the same core. As long as the guest housekeeping CPUs
> > are idle, we don't get any packet drops (most of the time, what
> > causes packet drops in this test-case would cause spikes in
> > cyclictest). However, we're seeing some packet drops for certain
> > guest workloads which we are still debugging.
> 
> Ok but that seems to be a different scenario where hw:cpu_policy
> dedicated should be sufficient. However if the placement of the io and
> emulators has to be on a subset of the dedicated cpus something like
> hw:cpu_realtime_mask would be required.
> 
> > > - Is it ok to make the emulators potentially slow by running them on
> > >   busy best-effort cores, or will they quickly be on the critical
> > > path if you do more than just cyclictest? - our experience says we
> > > don't need them reactive even with rt-networking involved  
> > 
> > I believe it is ok.
> 
> Ok.
>  
> > > Our goal is to reach a high packing density of realtime VMs. Our
> > > pragmatic first choice was to run all non-vcpu-threads on a shared
> > > set of pcpus where we also run best-effort VMs and host load.
> > > Now the OpenStack guys are not too happy with that because that is
> > > load outside the assigned resources, which leads to quota and
> > > accounting problems.
> > > 
> > > So the current OpenStack model is to run those threads next to one
> > > or more vcpu-threads. [1] You will need to remember that the vcpus
> > > in question should not be your rt-cpus in the guest. I.e. if vcpu0
> > > shares its pcpu with the hypervisor noise your preemptrt-guest
> > > would use isolcpus=1.
> > > 
> > > Is that kind of sharing a pcpu really a good idea? I could imagine
> > > things like smp housekeeping (cache invalidation etc.) to eventually
> > > cause vcpu1 having to wait for the emulator stuck in IO.  
> > 
> > Agreed. IIRC, in the beginning of KVM-RT we saw a problem where
> > running vcpu0 on a non-isolated core and without FIFO priority
> > caused spikes in vcpu1. I guess we debugged this down to vcpu1
> > waiting a few dozen microseconds for vcpu0 for some reason. Running
> > vcpu0 on an isolated core with FIFO priority fixed this (again, this
> > was years ago, I won't remember all the details).
> > 
> > > Or maybe a busy polling vcpu0 starving its own emulator causing high
> > > latency or even deadlocks.  
> > 
> > This will probably happen if you run vcpu0 with FIFO priority.
> 
> Two more points that indicate that hw:cpu_realtime_mask (putting
> emulators/io next to any vcpu) does not work for general rt.
> 
> > > Even if it happens to work for Linux guests it seems like a strong
> > > assumption that an rt-guest that has noise cores can deal with even
> > > more noise one scheduling level below.

Re: [openstack-dev] realtime kvm cpu affinities

2017-06-22 Thread Chris Friesen

On 06/22/2017 01:47 AM, Henning Schild wrote:

Am Wed, 21 Jun 2017 11:40:14 -0600
schrieb Chris Friesen :


On 06/21/2017 10:46 AM, Henning Schild wrote:



As we know from our setup, and as Luiz confirmed - it is _not_
"critical to separate emulator threads for different KVM instances".
They have to be separated from the vcpu-cores but not from each
other. At least not on the "cpuset" basis, maybe "blkio" and
cgroups like that.


I'm reluctant to say conclusively that we don't need to separate
emulator threads since I don't think we've considered all the cases.
For example, what happens if one or more of the instances are being
live-migrated?  The migration thread for those instances will be very
busy scanning for dirty pages, which could delay the emulator threads
for other instances and also cause significant cross-NUMA traffic
unless we ensure at least one core per NUMA-node.


Realtime instances can not be live-migrated. We are talking about
threads that can not even be moved between two cores on one numa-node
without missing a deadline. But your point is good because it could
mean that such an emulator_set - if defined - should not be used for all
VMs.


I'd suggest that realtime instances cannot be live-migrated *while meeting 
realtime commitments*.  There may be reasons to live-migrate realtime instances 
that aren't currently providing service.



Also, I don't think we've determined how much CPU time is needed for
the emulator threads.  If we have ~60 CPUs available for instances
split across two NUMA nodes, can we safely run the emulator threads
of 30 instances all together on a single CPU?  If not, how much
"emulator overcommit" is allowable?


That depends on how much IO your VMs are issuing and can not be
answered in general. All VMs can cause high load with IO/emulation,
rt-VMs are probably less likely to do so.


I think the result of this is that in addition to "rt_emulator_pin_set" you'd 
probably want a config option for "rt_emulator_overcommit_ratio" or something 
similar.


Chris




Re: [openstack-dev] realtime kvm cpu affinities

2017-06-22 Thread Henning Schild
Am Wed, 21 Jun 2017 11:40:14 -0600
schrieb Chris Friesen :

> On 06/21/2017 10:46 AM, Henning Schild wrote:
> > Am Wed, 21 Jun 2017 10:04:52 -0600
> > schrieb Chris Friesen :  
> 
> > I guess you are talking about that section from [1]:
> >  
>  We could use a host level tunable to just reserve a set of host
>  pCPUs for running emulator threads globally, instead of trying to
>  account for it per instance. This would work in the simple case,
>  but when NUMA is used, it is highly desirable to have more fine
>  grained config to control emulator thread placement. When
>  real-time or dedicated CPUs are used, it will be critical to
>  separate emulator threads for different KVM instances.  
> 
> Yes, that's the relevant section.
> 
> > I know it has been considered, but I would like to bring the topic
> > up again. Because doing it that way allows for many more rt-VMs on
> > a host, and I am not sure I fully understood why the idea was
> > discarded in the end.
> >
> > I do not really see the influence of NUMA here. Say the
> > emulator_pin_set is used only for realtime VMs, we know that the
> > emulators and IOs can be "slow" so crossing numa-nodes should not
> > be an issue. Or you could say the set needs to contain at least one
> > core per numa-node and schedule emulators next to their vcpus.
> >
> > As we know from our setup, and as Luiz confirmed - it is _not_
> > "critical to separate emulator threads for different KVM instances".
> > They have to be separated from the vcpu-cores but not from each
> > other. At least not on the "cpuset" basis, maybe "blkio" and
> > cgroups like that.  
> 
> I'm reluctant to say conclusively that we don't need to separate
> emulator threads since I don't think we've considered all the cases.
> For example, what happens if one or more of the instances are being
> live-migrated?  The migration thread for those instances will be very
> busy scanning for dirty pages, which could delay the emulator threads
> for other instances and also cause significant cross-NUMA traffic
> unless we ensure at least one core per NUMA-node.

Realtime instances can not be live-migrated. We are talking about
threads that can not even be moved between two cores on one numa-node
without missing a deadline. But your point is good because it could
mean that such an emulator_set - if defined - should not be used for all
VMs.
 
> Also, I don't think we've determined how much CPU time is needed for
> the emulator threads.  If we have ~60 CPUs available for instances
> split across two NUMA nodes, can we safely run the emulator threads
> of 30 instances all together on a single CPU?  If not, how much
> "emulator overcommit" is allowable?

That depends on how much IO your VMs are issuing and can not be
answered in general. All VMs can cause high load with IO/emulation,
rt-VMs are probably less likely to do so.
Say your 64-cpu compute node is used for both rt and regular VMs. To
mix them you would have two instances of nova running on that machine.
One gets node0 (32 cpus) for regular VMs. The emulator-pin-set would not
be defined here (so it would equal the vcpu_pin_set, full overlap).
The other nova would get node1 and disable hyperthreads for all rt
cores (17 cpus left). It would need at least one core for housekeeping
and io/emulation threads. So you are down to a maximum of 15 VMs putting
their IO on that one core and its hyperthread, i.e. 7.5 per cpu.

In the same setup with [2] we would get a max of 7 single-cpu VMs,
instead of 15! And 15 vs 31 if you dedicate the whole box to rt.
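
The same back-of-the-envelope numbers as a small sketch, for single-vCPU
rt guests (the core counts and the single housekeeping core are
assumptions taken from the scenario above):

    # Shared emulator pool vs. one extra emulator pCPU per guest (as in [2]).

    def max_rt_vms(rt_cores, housekeeping_cores, per_vm_emulator_core):
        usable = rt_cores - housekeeping_cores
        return usable // 2 if per_vm_emulator_core else usable

    # half the box: one NUMA node, hyperthreads of the rt cores disabled
    print(max_rt_vms(16, 1, False))   # 15 guests with a shared emulator pool
    print(max_rt_vms(16, 1, True))    # 7 guests with [2]

    # whole box dedicated to rt
    print(max_rt_vms(32, 1, False))   # 31
    print(max_rt_vms(32, 1, True))    # 15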

Henning 
 
> Chris




Re: [openstack-dev] realtime kvm cpu affinities

2017-06-21 Thread Chris Friesen

On 06/21/2017 10:46 AM, Henning Schild wrote:

Am Wed, 21 Jun 2017 10:04:52 -0600
schrieb Chris Friesen :



I guess you are talking about that section from [1]:


We could use a host level tunable to just reserve a set of host
pCPUs for running emulator threads globally, instead of trying to
account for it per instance. This would work in the simple case,
but when NUMA is used, it is highly desirable to have more fine
grained config to control emulator thread placement. When real-time
or dedicated CPUs are used, it will be critical to separate
emulator threads for different KVM instances.


Yes, that's the relevant section.


I know it has been considered, but I would like to bring the topic up
again. Because doing it that way allows for many more rt-VMs on a host,
and I am not sure I fully understood why the idea was discarded in the
end.

I do not really see the influence of NUMA here. Say the
emulator_pin_set is used only for realtime VMs, we know that the
emulators and IOs can be "slow" so crossing numa-nodes should not be an
issue. Or you could say the set needs to contain at least one core per
numa-node and schedule emulators next to their vcpus.

As we know from our setup, and as Luiz confirmed - it is _not_ "critical
to separate emulator threads for different KVM instances".
They have to be separated from the vcpu-cores but not from each other.
At least not on the "cpuset" basis, maybe "blkio" and cgroups like that.


I'm reluctant to say conclusively that we don't need to separate emulator 
threads since I don't think we've considered all the cases.  For example, what 
happens if one or more of the instances are being live-migrated?  The migration 
thread for those instances will be very busy scanning for dirty pages, which 
could delay the emulator threads for other instances and also cause significant 
cross-NUMA traffic unless we ensure at least one core per NUMA-node.


Also, I don't think we've determined how much CPU time is needed for the 
emulator threads.  If we have ~60 CPUs available for instances split across two 
NUMA nodes, can we safely run the emulator threads of 30 instances all together 
on a single CPU?  If not, how much "emulator overcommit" is allowable?


Chris



Re: [openstack-dev] realtime kvm cpu affinities

2017-06-21 Thread Henning Schild
Am Wed, 21 Jun 2017 10:04:52 -0600
schrieb Chris Friesen :

> On 06/21/2017 09:45 AM, Chris Friesen wrote:
> > On 06/21/2017 02:42 AM, Henning Schild wrote:  
> >> Am Tue, 20 Jun 2017 10:41:44 -0600
> >> schrieb Chris Friesen :  
> >  
>  Our goal is to reach a high packing density of realtime VMs. Our
>  pragmatic first choice was to run all non-vcpu-threads on a
>  shared set of pcpus where we also run best-effort VMs and host
>  load. Now the OpenStack guys are not too happy with that because
>  that is load outside the assigned resources, which leads to
>  quota and accounting problems.  
> >>>
> >>> If you wanted to go this route, you could just edit the
> >>> "vcpu_pin_set" entry in nova.conf on the compute nodes so that
> >>> nova doesn't actually know about all of the host vCPUs.  Then you
> >>> could run host load and emulator threads on the pCPUs that nova
> >>> doesn't know about, and there will be no quota/accounting issues
> >>> in nova.  
> >>
> >> Exactly that is the idea but OpenStack currently does not allow
> >> that. No thread will ever end up on a core outside the
> >> vcpu_pin_set and emulator/io-threads are controlled by
> >> OpenStack/libvirt.  
> >
> > Ah, right.  This will isolate the host load from the guest load,
> > but it will leave the guest emulator work running on the same pCPUs
> > as one or more vCPU threads.
> >
> > Your emulator_pin_set idea is interesting...it might be worth
> > proposing in nova.  
> 
> Actually, based on [1] it appears they considered it and decided that
> it didn't provide enough isolation between realtime VMs.

Hey Chris,

I guess you are talking about that section from [1]:

>>> We could use a host level tunable to just reserve a set of host
>>> pCPUs for running emulator threads globally, instead of trying to
>>> account for it per instance. This would work in the simple case,
>>> but when NUMA is used, it is highly desirable to have more fine
>>> grained config to control emulator thread placement. When real-time
>>> or dedicated CPUs are used, it will be critical to separate
>>> emulator threads for different KVM instances.

I know it has been considered, but I would like to bring the topic up
again. Because doing it that way allows for many more rt-VMs on a host,
and I am not sure I fully understood why the idea was discarded in the
end.

I do not really see the influence of NUMA here. Say the
emulator_pin_set is used only for realtime VMs, we know that the
emulators and IOs can be "slow" so crossing numa-nodes should not be an
issue. Or you could say the set needs to contain at least one core per
numa-node and schedule emulators next to their vcpus.

As we know from our setup, and as Luiz confirmed - it is _not_ "critical
to separate emulator threads for different KVM instances".
They have to be separated from the vcpu-cores but not from each other.
At least not on the "cpuset" basis, maybe "blkio" and cgroups like that.
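
For illustration, pooling the emulator/io threads of several guests on
shared best-effort cores boils down to one cpuset for all of them; a
minimal sketch at the cgroup v1 level (the path, core numbers and doing
this by hand are assumptions, in a real deployment libvirt manages these
cgroups itself):

    import os

    POOL = "/sys/fs/cgroup/cpuset/emulator_pool"   # assumed cgroup v1 mount
    BEST_EFFORT_CPUS = "0-1"                       # shared housekeeping cores
    MEM_NODES = "0"

    def write(path, value):
        with open(path, "w") as f:
            f.write(value)

    def setup_pool():
        os.makedirs(POOL, exist_ok=True)
        write(os.path.join(POOL, "cpuset.cpus"), BEST_EFFORT_CPUS)
        write(os.path.join(POOL, "cpuset.mems"), MEM_NODES)

    def add_emulator_thread(tid):
        """Move one emulator/io thread (kernel TID) of any guest into the pool."""
        write(os.path.join(POOL, "tasks"), str(tid))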

Henning

> Chris
> 
> [1] 
> https://specs.openstack.org/openstack/nova-specs/specs/ocata/approved/libvirt-emulator-threads-policy.html




Re: [openstack-dev] realtime kvm cpu affinities

2017-06-21 Thread Chris Friesen

On 06/21/2017 02:42 AM, Henning Schild wrote:

Am Tue, 20 Jun 2017 10:41:44 -0600
schrieb Chris Friesen :



Our goal is to reach a high packing density of realtime VMs. Our
pragmatic first choice was to run all non-vcpu-threads on a shared
set of pcpus where we also run best-effort VMs and host load.
Now the OpenStack guys are not too happy with that because that is
load outside the assigned resources, which leads to quota and
accounting problems.


If you wanted to go this route, you could just edit the
"vcpu_pin_set" entry in nova.conf on the compute nodes so that nova
doesn't actually know about all of the host vCPUs.  Then you could
run host load and emulator threads on the pCPUs that nova doesn't
know about, and there will be no quota/accounting issues in nova.


Exactly that is the idea but OpenStack currently does not allow that.
No thread will ever end up on a core outside the vcpu_pin_set and
emulator/io-threads are controlled by OpenStack/libvirt.


Ah, right.  This will isolate the host load from the guest load, but it will 
leave the guest emulator work running on the same pCPUs as one or more vCPU threads.
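
A small illustration of that carve-out and its limitation (made-up CPU
numbers; vcpu_pin_set is the real option, the rest is just arithmetic):

    host_cpus = set(range(0, 8))
    vcpu_pin_set = set(range(2, 8))        # nova.conf: vcpu_pin_set = 2-7
    host_only = host_cpus - vcpu_pin_set   # {0, 1}: host daemons, irqs, ...

    # With plain cpu_policy=dedicated the emulator/io threads of a guest
    # still float over that guest's pinned pCPUs, i.e. inside vcpu_pin_set,
    # not on the carved-out host cores -- which is the limitation above.
    guest_pcpus = {2, 3}
    emulator_allowed = guest_pcpus

    print(sorted(host_only), sorted(emulator_allowed))   # [0, 1] [2, 3]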


Your emulator_pin_set idea is interesting...it might be worth proposing in nova.

Chris



Re: [openstack-dev] realtime kvm cpu affinities

2017-06-21 Thread Henning Schild
Am Wed, 21 Jun 2017 09:32:42 -0400
schrieb Luiz Capitulino :

> On Wed, 21 Jun 2017 12:47:27 +0200
> Henning Schild  wrote:
> 
> > > What is your solution?
> > 
> > We have a kilo-based prototype that introduced emulator_pin_set in
> > nova.conf. All vcpu threads will be scheduled on vcpu_pin_set and
> > emulators and IO of all VMs will share emulator_pin_set.
> > vcpu_pin_set contains isolcpus from the host and emulator_pin_set
> > contains best-effort cores from the host.  
> 
> You lost me here a bit as I'm not familiar with OpenStack
> configuration.

Does not matter, I guess you got the point, and some other people might
find that useful.
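
For anyone reproducing such a split, a quick (illustrative) way to check
where the threads of a running qemu process actually ended up is to read
their Cpus_allowed_list from /proc; treating the "CPU .../KVM" threads
as the vCPUs is an assumption about qemu's thread naming:

    import os

    def thread_affinities(qemu_pid):
        """Map (tid, thread name) -> allowed CPU list for one qemu process."""
        base = "/proc/%d/task" % qemu_pid
        result = {}
        for tid in os.listdir(base):
            with open(os.path.join(base, tid, "comm")) as f:
                name = f.read().strip()
            with open(os.path.join(base, tid, "status")) as f:
                for line in f:
                    if line.startswith("Cpus_allowed_list"):
                        result[(tid, name)] = line.split(":")[1].strip()
        return result

    # for (tid, name), cpus in thread_affinities(12345).items():
    #     print(tid, name, "->", cpus)   # vCPU threads vs. emulator/io threads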

> > That basically means you put all emulators and io of all VMs onto a
> > set of cores that the host potentially also uses for other stuff.
> > Sticking with the made up numbers from above, all the 0.05s can
> > share pcpus.  
> 
> So, this seems to be way we use KVM-RT without OpenStack: emulator
> threads and io threads run on the host housekeeping cores, where all
> other host processes will run. IOW, you only reserve pcpus for vcpus
> threads.

Thanks for the input. I think you confirmed that the current
implementation in openstack cannot work and that the new proposal and
our approach should work.
Now we will have to see how to proceed with that information in the
openstack community.

> I can't comment on OpenStack accounting trade-off/implications of
> doing this, but from KVM-RT perspective this is probably the best
> solution. I say "probably" because so far we have only tested with
> cyclictest and simple applications. I don't know if more complex
> applications would have different needs wrt I/O threads for example.

We have a networking ping/pong cyclictest kind of thing and much more
complex setups. Emulators and IO are not on the critical path in our
examples.

> PS: OpenStack devel list refuses emails from non-subscribers. I won't
> subscribe for a one-time discussion, so my emails are not
> reaching the list...

Yeah, I had the same problem, also with their gerrit. Let's just call it
Stack ... I kept all your text in my replies, and they end up on the
list.

Henning



Re: [openstack-dev] realtime kvm cpu affinities

2017-06-21 Thread Henning Schild
Am Tue, 20 Jun 2017 10:04:30 -0400
schrieb Luiz Capitulino :

> On Tue, 20 Jun 2017 09:48:23 +0200
> Henning Schild  wrote:
> 
> > Hi,
> > 
> > We are using OpenStack for managing realtime guests. We modified
> > it and contributed to discussions on how to model the realtime
> > feature. More recent versions of OpenStack have support for
> > realtime, and there are a few proposals on how to improve that
> > further.
> > 
> > But there is still no full answer on how to distribute threads
> > across host-cores. The vcpus are easy but for the emulation and
> > io-threads there are multiple options. I would like to collect the
> > constraints from a qemu/kvm perspective first, and then possibly
> > influence the OpenStack development
> > 
> > I will put the summary/questions first, the text below provides more
> > context to where the questions come from.
> > - How do you distribute your threads when reaching the really low
> >   cyclictest results in the guests? In [3] Rik talked about problems
> >   like lock holder preemption, starvation etc. but not where/how to
> >   schedule emulators and io  
> 
> We put emulator threads and io-threads in housekeeping cores in
> the host. I think housekeeping cores is what you're calling
> best-effort cores, those are non-isolated cores that will run host
> load.

As expected, any best-effort/housekeeping core will do but overlap with
the vcpu-cores is a bad idea.

> > - Is it ok to put a vcpu and emulator thread on the same core as
> > long as the guest knows about it? Any funny behaving guest, not
> > just Linux.  
> 
> We can't do this for KVM-RT because we run all vcpu threads with
> FIFO priority.

Same point as above, meaning the "hw:cpu_realtime_mask" approach is
wrong for realtime.

> However, we have another project with DPDK whose goal is to achieve
> zero-loss networking. The configuration required by this project is
> very similar to the one required by KVM-RT. One difference though is
> that we don't use RT and hence don't use FIFO priority.
> 
> In this project we've been running with the emulator thread and a
> vcpu sharing the same core. As long as the guest housekeeping CPUs
> are idle, we don't get any packet drops (most of the time, what
> causes packet drops in this test-case would cause spikes in
> cyclictest). However, we're seeing some packet drops for certain
> guest workloads which we are still debugging.

Ok but that seems to be a different scenario where hw:cpu_policy
dedicated should be sufficient. However if the placement of the io and
emulators has to be on a subset of the dedicated cpus something like
hw:cpu_realtime_mask would be required.

> > - Is it ok to make the emulators potentially slow by running them on
> >   busy best-effort cores, or will they quickly be on the critical
> > path if you do more than just cyclictest? - our experience says we
> > don't need them reactive even with rt-networking involved  
> 
> I believe it is ok.

Ok.
 
> > Our goal is to reach a high packing density of realtime VMs. Our
> > pragmatic first choice was to run all non-vcpu-threads on a shared
> > set of pcpus where we also run best-effort VMs and host load.
> > Now the OpenStack guys are not too happy with that because that is
> > load outside the assigned resources, which leads to quota and
> > accounting problems.
> > 
> > So the current OpenStack model is to run those threads next to one
> > or more vcpu-threads. [1] You will need to remember that the vcpus
> > in question should not be your rt-cpus in the guest. I.e. if vcpu0
> > shares its pcpu with the hypervisor noise your preemptrt-guest
> > would use isolcpus=1.
> > 
> > Is that kind of sharing a pcpu really a good idea? I could imagine
> > things like smp housekeeping (cache invalidation etc.) eventually
> > causing vcpu1 to have to wait for the emulator stuck in IO.  
> 
> Agreed. IIRC, in the beginning of KVM-RT we saw a problem where
> running vcpu0 on a non-isolated core and without FIFO priority
> caused spikes in vcpu1. I guess we debugged this down to vcpu1
> waiting a few dozen microseconds for vcpu0 for some reason. Running
> vcpu0 on an isolated core with FIFO priority fixed this (again, this
> was years ago, I won't remember all the details).
> 
> > Or maybe a busy polling vcpu0 starving its own emulator causing high
> > latency or even deadlocks.  
> 
> This will probably happen if you run vcpu0 with FIFO priority.

Two more points that indicate that hw:cpu_realtime_mask (putting
emulators/io next to any vcpu) does not work for general rt.

> > Even if it happens to work for Linux guests it seems like a strong
> > assumption that an rt-guest that has noise cores can deal with even
> > more noise one scheduling level below.
> > 
> > More recent proposals [2] suggest a scheme where the emulator and io
> > threads are on a separate core. That sounds more reasonable /
> > conservative but dramatically increases the per VM cost. And the
> > pcpus hosting the hypervisor threads will probably be idle most of
> > the time.

Re: [openstack-dev] realtime kvm cpu affinities

2017-06-21 Thread Henning Schild
On Tue, 20 Jun 2017 10:41:44 -0600, Chris Friesen wrote:

> On 06/20/2017 01:48 AM, Henning Schild wrote:
> > Hi,
> >
> > We are using OpenStack for managing realtime guests. We modified
> > it and contributed to discussions on how to model the realtime
> > feature. More recent versions of OpenStack have support for
> > realtime, and there are a few proposals on how to improve that
> > further.
> >
> > But there is still no full answer on how to distribute threads
> > across host-cores. The vcpus are easy but for the emulation and
> > io-threads there are multiple options. I would like to collect the
> > constraints from a qemu/kvm perspective first, and then possibly
> > influence the OpenStack development.
> >
> > I will put the summary/questions first, the text below provides more
> > context to where the questions come from.
> > - How do you distribute your threads when aiming for really low
> >    cyclictest results in the guests? In [3] Rik talked about
> > problems like lock holder preemption, starvation etc. but not
> > where/how to schedule emulators and io
> > - Is it ok to put a vcpu and emulator thread on the same core as
> > long as the guest knows about it? Any funny behaving guest, not
> > just Linux.
> > - Is it ok to make the emulators potentially slow by running them on
> >busy best-effort cores, or will they quickly be on the critical
> > path if you do more than just cyclictest? - our experience says we
> > don't need them reactive even with rt-networking involved
> >
> >
> > Our goal is to reach a high packing density of realtime VMs. Our
> > pragmatic first choice was to run all non-vcpu-threads on a shared
> > set of pcpus where we also run best-effort VMs and host load.
> > Now the OpenStack guys are not too happy with that because that is
> > load outside the assigned resources, which leads to quota and
> > accounting problems.  
> 
> If you wanted to go this route, you could just edit the
> "vcpu_pin_set" entry in nova.conf on the compute nodes so that nova
> doesn't actually know about all of the host vCPUs.  Then you could
> run host load and emulator threads on the pCPUs that nova doesn't
> know about, and there will be no quota/accounting issues in nova.

That is exactly the idea, but OpenStack currently does not allow it.
No thread will ever end up on a core outside the vcpu_pin_set, and the
emulator/io-threads are controlled by OpenStack/libvirt. And you would
need a way to specify exactly which cores outside vcpu_pin_set are
allowed for breaking out of that set.
On our compute nodes we also have cores for host-realtime tasks, i.e.
dpdk-based rt-networking.
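
A sketch of the host-side configuration this is aiming at;
emulator_pin_set is the proposed option (it does not exist in nova
today) and the core layout is only an example:

  # /etc/nova/nova.conf on the compute node (sketch)
  [DEFAULT]
  # cores nova may use for guest vcpus
  vcpu_pin_set = 4-15
  # proposed option: shared pool for emulator/io threads of all guests
  emulator_pin_set = 2-3
  # cores 0-1 stay outside nova for host load and host-realtime tasks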

Henning

> Chris
> 




Re: [openstack-dev] realtime kvm cpu affinities

2017-06-20 Thread Chris Friesen

On 06/20/2017 01:48 AM, Henning Schild wrote:

Hi,

We are using OpenStack for managing realtime guests. We modified
it and contributed to discussions on how to model the realtime
feature. More recent versions of OpenStack have support for realtime,
and there are a few proposals on how to improve that further.

But there is still no full answer on how to distribute threads across
host-cores. The vcpus are easy but for the emulation and io-threads
there are multiple options. I would like to collect the constraints
from a qemu/kvm perspective first, and then possibly influence the
OpenStack development.

I will put the summary/questions first, the text below provides more
context to where the questions come from.
- How do you distribute your threads when aiming for really low
   cyclictest results in the guests? In [3] Rik talked about problems
   like lock holder preemption, starvation etc. but not where/how to
   schedule emulators and io
- Is it ok to put a vcpu and emulator thread on the same core as long as
   the guest knows about it? Any funny behaving guest, not just Linux.
- Is it ok to make the emulators potentially slow by running them on
   busy best-effort cores, or will they quickly be on the critical path
   if you do more than just cyclictest? - our experience says we don't
   need them reactive even with rt-networking involved


Our goal is to reach a high packing density of realtime VMs. Our
pragmatic first choice was to run all non-vcpu-threads on a shared set
of pcpus where we also run best-effort VMs and host load.
Now the OpenStack guys are not too happy with that because that is load
outside the assigned resources, which leads to quota and accounting
problems.


If you wanted to go this route, you could just edit the "vcpu_pin_set" entry in 
nova.conf on the compute nodes so that nova doesn't actually know about all of 
the host vCPUs.  Then you could run host load and emulator threads on the pCPUs 
that nova doesn't know about, and there will be no quota/accounting issues in nova.
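
One way to check where the emulator threads of a guest actually end up
on such a host (the domain name is just an example):

  # query the emulator-thread pinning libvirt applied to a running guest
  virsh emulatorpin instance-00000001
  # cross-check the cpu affinity of the qemu process itself
  taskset -cp $(pgrep -f instance-00000001 | head -n 1)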


Chris



[openstack-dev] realtime kvm cpu affinities

2017-06-20 Thread Henning Schild
Hi,

We are using OpenStack for managing realtime guests. We modified
it and contributed to discussions on how to model the realtime
feature. More recent versions of OpenStack have support for realtime,
and there are a few proposals on how to improve that further.

But there is still no full answer on how to distribute threads across
host-cores. The vcpus are easy but for the emulation and io-threads
there are multiple options. I would like to collect the constraints
from a qemu/kvm perspective first, and then possibly influence the
OpenStack development.

I will put the summary/questions first, the text below provides more
context to where the questions come from.
- How do you distribute your threads when aiming for really low
  cyclictest results in the guests? In [3] Rik talked about problems
  like lock holder preemption, starvation etc. but not where/how to
  schedule emulators and io
- Is it ok to put a vcpu and emulator thread on the same core as long as
  the guest knows about it? Any funny behaving guest, not just Linux.
- Is it ok to make the emulators potentially slow by running them on
  busy best-effort cores, or will they quickly be on the critical path
  if you do more than just cyclictest? - our experience says we don't
  need them reactive even with rt-networking involved


Our goal is to reach a high packing density of realtime VMs. Our
pragmatic first choice was to run all non-vcpu-threads on a shared set
of pcpus where we also run best-effort VMs and host load.
Now the OpenStack guys are not too happy with that because that is load
outside the assigned resources, which leads to quota and accounting
problems.

So the current OpenStack model is to run those threads next to one
or more vcpu-threads. [1] You will need to remember that the vcpus in
question should not be your rt-cpus in the guest. I.e. if vcpu0 shares
its pcpu with the hypervisor noise your preemptrt-guest would use
isolcpus=1.
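
Inside such a guest that translates to a kernel command line along
these lines (a sketch for a 2-vcpu preempt-rt guest; nohz_full and
rcu_nocbs are commonly added but are not the point here):

  # keep vcpu1 clean for the RT task, let vcpu0 absorb the noise
  isolcpus=1 nohz_full=1 rcu_nocbs=1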

Is that kind of sharing a pcpu really a good idea? I could imagine
things like smp housekeeping (cache invalidation etc.) eventually
causing vcpu1 to have to wait for the emulator stuck in IO.
Or maybe a busy polling vcpu0 starving its own emulator causing high
latency or even deadlocks.
Even if it happens to work for Linux guests it seems like a strong
assumption that an rt-guest that has noise cores can deal with even more
noise one scheduling level below.

More recent proposals [2] suggest a scheme where the emulator and io
threads are on a separate core. That sounds more reasonable /
conservative but dramatically increases the per VM cost. And the pcpus
hosting the hypervisor threads will probably be idle most of the time.
I guess in this context the most important question is whether qemu is
ever involved in "regular operation" if you avoid the obvious IO
problems on your critical path.
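
For reference, [2] exposes that scheme as a flavor property, roughly
(flavor name made up):

  # one extra dedicated pcpu per guest for its emulator/io threads
  openstack flavor set rt.flavor --property hw:emulator_threads_policy=isolate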

My guess is that just [1] has serious hidden latency problems and [2]
is taking it a step too far by wasting whole cores for idle emulators.
We would like to suggest some other way in between that is a little
easier on the core count. Our current solution seems to work fine but
has the mentioned quota problems.
With this mail I am hoping to collect some constraints to derive a
suggestion from. Or maybe collect some information that could be added
to the current blueprints as reasoning/documentation.

Sorry if you receive this mail a second time, I was not subscribed to
openstack-dev the first time.

best regards,
Henning

[1] https://specs.openstack.org/openstack/nova-specs/specs/mitaka/implemented/libvirt-real-time.html
[2] https://specs.openstack.org/openstack/nova-specs/specs/ocata/approved/libvirt-emulator-threads-policy.html
[3] http://events.linuxfoundation.org/sites/events/files/slides/kvmforum2015-realtimekvm.pdf
