On 3/26/2014 11:55 PM, Liao, Chuan (Jason Liao, HPservers-Core-OE-PSC) wrote:
Hi Martin & Adam,

If MOM is using the VDSM HypervisorInterface:
1. use the Global.getCapabilities API to get the host NUMA topology data:
'autoNumaBalancing': true/false
'numaNodes': {'<nodeIndex>': {'memTotal': 'str'}, …}

2. use the Global.getStats API to get the host NUMA statistics data:
'numaNodeMemFree': {'<nodeIndex>': {'memFree': 'str'}, …}

Assume MOM already gets the per-CPU usage info, etc.


You can also use the libvirt APIs getCapabilities and getMemoryStats to merge
these data.
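For illustration only, here is a rough Python sketch of how a consumer could
merge those two responses into one per-node view, assuming the field layout
shown above (the helper name merge_numa_view and the sample values are made up,
not an existing MOM/VDSM API):

def merge_numa_view(capabilities, stats):
    """Return {nodeIndex: {'memTotal': ..., 'memFree': ...}} (units as reported by VDSM)."""
    nodes = {}
    for idx, info in capabilities.get('numaNodes', {}).items():
        nodes[idx] = {'memTotal': int(info['memTotal'])}
    for idx, info in stats.get('numaNodeMemFree', {}).items():
        nodes.setdefault(idx, {})['memFree'] = int(info['memFree'])
    return nodes

# Example with made-up values:
caps = {'autoNumaBalancing': True,
        'numaNodes': {'0': {'memTotal': '65536'}, '1': {'memTotal': '65536'}}}
stats = {'numaNodeMemFree': {'0': {'memFree': '12000'}, '1': {'memFree': '48000'}}}
print(merge_numa_view(caps, stats))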

I am not sure:
1. are these data enough for the MOM feature?
2. do you need the VM NUMA topology data?
If it's a matter of just defining/implementing an API on the VDSM side that returns a specified VM's virtual NUMA topology, let us consider doing so.... If not for MOM, perhaps someone else will eventually find a use for it later on :)

Vinod


Best Regards,
Jason Liao

-----Original Message-----
From: Vinod, Chegu
Sent: March 26, 2014 21:35
To: Adam Litke
Cc: Liao, Chuan (Jason Liao, HPservers-Core-OE-PSC); Martin Sivak; Gilad 
Chaplik; Liang, Shang-Chun (David Liang, HPservers-Core-OE-PSC); Shi, Xiao-Lei 
(Bruce, HP Servers-PSC-CQ); Doron Fediuck; vdsm-devel
Subject: Re: FW: Fwd: Question about MOM

On 3/26/2014 5:35 AM, Adam Litke wrote:
On 26/03/14 03:50 -0700, Chegu Vinod wrote:
<removing the email alias>
Restoring the email alias.  Please keep discussions as public as
possible to allow others to contribute to the design and planning.

Fine.

Jason,

Please see below...


On 3/26/2014 1:38 AM, Liao, Chuan (Jason Liao, HPservers-Core-OE-PSC)
wrote:
Hi All,

Following the discussion below, I got these points:
1. The way MOM gathers NUMA information (topology, statistics...) will
change in the future (on one side using the VDSM API, on the other side
using libvirt and system APIs).
I didn't follow your sentence..

Please work with Adam/Martin and provide the needed APIs on the VDSM
side... so that the MOM entity thread can use the API and extract what it
needs about NUMA topology and cpu/memory usage info. As I see it... this
is probably the only piece that would be relevant to make available at
the earliest (preferably in oVirt 3.5), and that would enable MOM to
pursue the next steps as they see fit.

Beyond that ...at this point (for oVirt 3.5) let us not spend more
time on MOM internals please. Let us leave that to Adam and Martin to
pursue this as/when they see fit.

2. Martin and Adam will take a look at the MOM policy in the oVirt scheduler
when the NUMA feature is turned on.
Yes please.
3. The oVirt engine will have a NUMA-aware placement algorithm to make the
VM run within NUMA nodes in the best way.
The "algorithm" here is decided by user-specified pinning requests
(and/or) by the oVirt scheduler. In the case of a user request (upon
approval from the oVirt scheduler) VDSM -> libvirt will be explicitly
told what to do via numatune/cputune etc.   In the absence of a
user-specified pinning request I don't know if the oVirt scheduler
intends to convey the numatune/cputune type of requests to
libvirt...

4. The oVirt engine will have some algorithm to automatically configure
virtual NUMA when a big VM is created (big memory or many vcpus).
This is a good suggestion but in my view should be taken up after
oVirt 3.5.
For now just accept and process the user specified requests...
5. Investigate whether KSM and memory ballooning have the right tuning
parameters when the NUMA feature is turned on.
That is for Adam/Martin et al. ...not for your specific project.

We just need to ensure that they have the basic NUMA info they need
(via the VDSM API I mentioned above)... so that they can
work on their part independently as/when they see fit.

6. Investigate whether Automatic NUMA balancing is keeping the process
reasonably balanced and notify the oVirt engine.
Not sure I follow what you are saying...

Here is what I have in my mind :

Check if the target host has Automatic NUMA balancing enabled (you
can use sysctl -a | grep numa_balancing or a similar underlying
mechanism to determine this). If it is present, then check whether it is
enabled or not (a value of 1 is enabled and 0 is disabled)... and
convey this information to the oVirt engine GUI for display (this is
a hint for a user (if they wish) to skip manual pinning).   This in
my view is the minimum... at this point (and it would be great if we
can make it happen for oVirt 3.5).
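A minimal sketch of that check, assuming the standard procfs location of the
knob (the function name is hypothetical; vdsm would expose the result through
its own API rather than a helper like this):

def autonuma_status(path='/proc/sys/kernel/numa_balancing'):
    """Report whether automatic NUMA balancing exists and is enabled (1) or disabled (0)."""
    try:
        with open(path) as f:
            return {'present': True, 'enabled': f.read().strip() == '1'}
    except IOError:
        # Kernel without the feature (or procfs not available).
        return {'present': False, 'enabled': False}

print(autonuma_status())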
I think since we have vdsm you can choose to enable autonuma always
(when it is present).
I don't speak for the various Linux distros out there... but I suspect most may 
choose to  have the default set to enabled (if the feature is present in the 
OS).

Again... there should be some indication on the oVirt engine side (and in my 
opinion it might be useful to display to the user too) whether a given host has 
the feature currently enabled or not (either because it was disabled or the 
feature is not present in the OS)

Are there any drawbacks to enabling it always?
Can't speak for every possible use case... but based on what I know at
this moment: with the exception of a few targeted benchmarking-type use
cases (where folks may consider turning it off), I haven't yet run into a
situation where there are negative side effects of leaving it enabled.

A customer can still choose to manually pin a workload or a guest if they wish 
to do so (even if it is enabled).


We can discuss (at some later point, i.e. post oVirt 3.5) whether
we should really provide a way for the user to disable Automatic NUMA
balancing.   Changing the other NUMA balancing tunables is just not
going to happen... as far as I can see at this point (so let us not
worry about that right now..)


7. Investigate whether libvirt has any NUMA tuning APIs.
No. There is nothing to investigate here..

IMO, libvirt should not be playing with the host-wide NUMA settings.




Please feel free to correct me if I am missing something.
See above
BTW, I think there is nothing to be done for the oVirt 3.5 release, am I right?
If you are referring to just the MOM stuff, then with the exception of
my comment about having an appropriate API on the VDSM side for enabling
MOM, there is nothing else.

Vinod

Best Regards,
Jason Liao

-----Original Message-----
From: Vinod, Chegu
Sent: March 21, 2014 21:32
To: Adam Litke
Cc: Liao, Chuan (Jason Liao, HPservers-Core-OE-PSC); vdsm-devel;
Martin Sivak; Gilad Chaplik; Liang, Shang-Chun (David Liang,
HPservers-Core-OE-PSC); Shi, Xiao-Lei (Bruce, HP Servers-PSC-CQ);
Doron Fediuck
Subject: Re: FW: Fwd: Question about MOM

On 3/21/2014 6:13 AM, Adam Litke wrote:
On 20/03/14 18:03 -0700, Chegu Vinod wrote:
On 3/19/2014 11:01 PM, Liao, Chuan (Jason Liao,
HPservers-Core-OE-PSC) wrote:
Add Vinod in this thread.

Best Regards, Jason Liao

-----Original Message-----
From: Adam Litke [mailto:ali...@redhat.com]
Sent: March 19, 2014 21:23
To: Doron Fediuck
Cc: vdsm-devel; Liao, Chuan (Jason Liao, HPservers-Core-OE-PSC); Martin Sivak; Gilad Chaplik; Liang, Shang-Chun (David Liang, HPservers-Core-OE-PSC); Shi, Xiao-Lei (Bruce, HP Servers-PSC-CQ)
Subject: Re: Fwd: Question about MOM

On 19/03/14 05:50 -0400, Doron Fediuck wrote:
Moving this to the vdsm list.

----- Forwarded Message -----
From: "Chuan Liao (Jason Liao, HPservers-Core-OE-PSC)" <chuan.l...@hp.com>
To: "Martin Sivak" <msi...@redhat.com>, ali...@redhat.com, "Doron Fediuck" <dfedi...@redhat.com>, "Gilad Chaplik" <gchap...@redhat.com>
Cc: "Shang-Chun Liang (David Liang, HPservers-Core-OE-PSC)" <shangchun.li...@hp.com>, "Xiao-Lei Shi (Bruce, HP Servers-PSC-CQ)" <xiao-lei....@hp.com>
Sent: Wednesday, March 19, 2014 11:28:01 AM
Subject: Question about MOM

Hi All,

I am new to the MOM feature.

In my understanding, MOM is the collector for both host and guest data,
and it sets the right policy for KSM and memory ballooning to get better
performance.
Yes this is correct.  In oVirt, MOM runs as another vdsm thread and
uses the vdsm API to collect host and guest statistics. Those
statistics are fed into a policy file which can create some outputs
(such as ksm tuning parameters and guest balloon sizes).  MOM then
uses the vdsm API to apply those outputs to the system.
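As a purely conceptual sketch of that collect -> evaluate -> apply cycle (not
MOM's actual code; all names below are hypothetical placeholders):

def run_one_cycle(hypervisor, policy):
    """One MOM-style tuning pass: gather stats, evaluate the policy, apply outputs."""
    host_stats = hypervisor.get_host_stats()      # in oVirt, backed by vdsm API calls
    guest_stats = hypervisor.get_guest_stats()    # per-VM statistics
    outputs = policy.evaluate(host_stats, guest_stats)
    hypervisor.set_ksm_params(outputs.get('ksm', {}))          # KSM tuning parameters
    for vm_id, target in outputs.get('balloon', {}).items():   # balloon size per VM
        hypervisor.set_balloon_target(vm_id, target)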
OK... understood about the statistics gathering part and then
initiating policy-driven inputs for KSM and ballooning on the host,
etc.

Perhaps this was already discussed earlier ? Does the MOM thread in
vdsm intend to gather the NUMA topology of the host from the VDSM
(using some new TBD or some enhanced existing API) or does it intend
to collect this directly from the host using libvirt/libnuma etc ?
When MOM is using the VDSM HypervisorInterface, it must get all of its
information from vdsm.  It is considered an API layering violation for
MOM to access the system or libvirt connection directly. When running
with the Libvirt HypervisorInterface, it should use libvirt and the
system directly as necessary.  Your new features should consider this
and make use of the HypervisorInterface abstraction to provide both
implementations.
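To illustrate the layering rule, a small sketch with hypothetical class and
method names (MOM's real hypervisor_interfaces modules differ in detail): the
collectors only see the abstract interface, and each backend decides where the
NUMA data actually comes from.

class HypervisorInterface(object):
    def get_host_numa_topology(self):
        """Return per-node topology data (memTotal, cpus, ...)."""
        raise NotImplementedError

class VdsmInterface(HypervisorInterface):
    """Used when MOM runs inside vdsm: all data must come from the vdsm API."""
    def __init__(self, vdsm_api):
        self.api = vdsm_api
    def get_host_numa_topology(self):
        # Touching libvirt or /sys directly here would be a layering violation.
        return self.api.getCapabilities().get('numaNodes', {})

class LibvirtInterface(HypervisorInterface):
    """Used standalone: free to query libvirt and the system directly."""
    def __init__(self, conn):
        self.conn = conn  # an open libvirt connection
    def get_host_numa_topology(self):
        # The host topology is embedded in the capabilities XML; parsing it
        # into the same dict shape is omitted here for brevity.
        return self.conn.getCapabilities()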

Thanks for clarifying.  (please include your comment about this in
Jason's design document that you may have seen)

I am not sure what relationship it has with NUMA; can anyone
explain it to me?
Jason, here is my understanding (and I believe I am just
paraphrasing/echoing Adam's comments).

MOM's NUMA related enhancements are independent of what the oVirt
UI/oVirt scheduler does.

It is likely that MOM's vdsm thread may choose to extract information
about NUMA topology (including dynamic stuff like CPU usage or free
memory) from VDSM (i.e., if they choose not to get it directly
from libvirt/libnuma or /proc, etc.).

How MOM interprets that NUMA information along with the other statistics
it gathers (alongside the user-requested SLA requirements for
each guest, etc.) should be left to MOM to decide, and to direct
KSM/ballooning-related actions accordingly. I don't believe we need to
intervene in the MOM-related internals.
Once we decide to have NUMA-aware MOM policies there will need to be
some infrastructure enhancements to enable it.  I think Martin and I
will take the lead on it since we have been thinking about these kinds
of issues for some time now.
Ok.

I guess we need to start by examining the currently planned use
cases.  Please feel free to correct me if I am missing something or
over-simplifying something:

   1) NUMA-aware placement - Try to schedule VMs to run on hosts where
   the guest will not have to span multiple NUMA nodes.
I guess you are referring to the case where the user (and/or the
oVirt scheduler) has not explicitly directed libvirt on the host to
schedule the VM in some specific way... In those cases the decision
is left to the smarts of the host OS scheduler to take care of it
(that includes the future/smarter Automatic NUMA balancing enabled
scheduler).
Yes.  For this one, we need a numa-aware placement algorithm on
engine, and the autonuma feature available and configured on all virt
hosts.  In the first phase I don't anticipate any changes to MOM
internals.  I would prefer to observe the performance characteristics
of this and tweak MOM in the future to address actual performance
problems we see.
Ok.
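Not the engine's actual scheduler logic (that lives on the engine side), but a
toy illustration of the placement idea above, with made-up numbers: prefer
hosts where the VM's memory fits entirely inside one NUMA node.

def fits_single_node(vm_mem_mb, host_nodes):
    """host_nodes: {node_index: free_mem_mb}; True if some node can hold the whole VM."""
    return any(free >= vm_mem_mb for free in host_nodes.values())

hosts = {
    'host-a': {0: 30000, 1: 4000},
    'host-b': {0: 8000, 1: 9000},
}
vm_mem = 16000
candidates = [h for h, nodes in hosts.items() if fits_single_node(vm_mem, nodes)]
print(candidates)   # -> ['host-a']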

   2) Virtual NUMA topology - Emulate a NUMA topology inside the VM.
Yes. Irrespective of any NUMA specified for the backing resources of
a guest... when the guest size increases it is a "required" practice
to have a virtual NUMA topology enabled. (This helps the OS running
inside the guest to scale/perform much better by making NUMA-aware
decisions, etc. It also helps the applications running in the OS to
scale/perform better.)
Agreed.  One point I might make then... Should the VM creation process
on engine automatically configure virtual NUMA (even if the user
doesn't select it) once a guest reaches a certain memory size?
Good point, and yes we have thought about it a little bit... (BTW,
it's not just the memory size but the # of vcpus too).
Perhaps mimic the host topology etc... but there could be some
issues... so we wanted to defer this to a future oVirt version.
(BTW, we are aware of at least one other competing hypervisor
management tool that does this automatically.)

These two use cases are intertwined because VMs with NUMA can be
scheduled with more flexibility (albeit with more sophistication)
since the scheduler can fit the VM onto hosts where the memory can
be split across multiple Host NUMA nodes.

   3) Manual NUMA pinning - Allow advanced admins to schedule a VM
   to run on a specific host with a manual pinning strategy.
Yes

Most of these use cases involve the engine scheduler and engine UI.
Correct.

There is not much for MOM to do to support their direct
implementation.  We should focus on managing interactions with other
SLA features that MOM does implement:
- How should KSM be adjusted when NUMA is in effect?  In a NUMA host,
  are there NUMA-aware KSM tunables that we should use?
- When ballooning VMs, should we take into account how much memory we
  need to reclaim from VMs on a node-by-node basis?
If MOM had the NUMA topology information of the host, I believe it
should be able to determine where the guest-related processes are
currently running on the host (irrespective of how those guests ended
up there, etc.). MOM can then use all the relevant information (NUMA
topology, statistics, SLAs, etc.) to decide and direct KSM and
ballooning in a NUMA-friendly way...
Yes, exactly.  For example, only run ksm on nodes where there is
memory pressure and only balloon guests whose memory resides on nodes
with a memory shortage.
That's correct..
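Purely illustrative of that kind of per-node decision (the function name,
inputs, and threshold below are all made up for the example):

def numa_aware_targets(node_free_mb, vm_home_node, pressure_threshold_mb=4096):
    """Pick nodes that need KSM and VMs worth ballooning, based on node-local pressure."""
    pressured = {n for n, free in node_free_mb.items()
                 if free < pressure_threshold_mb}
    balloon_candidates = [vm for vm, node in vm_home_node.items()
                          if node in pressured]
    return pressured, balloon_candidates

nodes = {0: 2048, 1: 16384}                 # node 0 is under memory pressure
vms = {'vm1': 0, 'vm2': 1, 'vm3': 0}        # which node each VM's memory mostly lives on
print(numa_aware_targets(nodes, vms))       # -> ({0}, ['vm1', 'vm3'])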

Lastly, let's see if MOM needs to manage the existing NUMA utilities
in place on the system.  I don't know much about AutoNUMA.  Does it
have tunables that should be adjusted or is it completely
autonomous?
For the most part it's automated (that's the whole point of being
Automatic... although the technology will mature in phases :))... but
if someone really, really needs it to be disabled they can do so.

There are certainly some NUMA-related tunables in the kernel today
(as shown below)... but at this point I am not very sure about the
specific scenarios where one would really need to change these
default settings.  (As we do more studies of various use cases on
different platforms and workload sizes etc. there may be a need... but
at this point I don't see MOM necessarily getting involved in these
settings. Does MOM change other kernel tunables today?)


# sysctl -a | grep numa
kernel.numa_balancing = 1
kernel.numa_balancing_scan_delay_ms = 1000
kernel.numa_balancing_scan_period_max_ms = 60000
kernel.numa_balancing_scan_period_min_ms = 1000
kernel.numa_balancing_scan_size_mb = 256
kernel.numa_balancing_settle_count = 4
vm.numa_zonelist_order = default
These remind me of the KSM tunables.  Maybe some day we will be clever
enough to tune them but you're right, it should not be our first
priority.  One idea I have for MOM is that it could check up on
autonuma by checking /proc/<pid>/numa_maps for each qemu process on
the host and seeing if autonuma is keeping the process reasonably
balanced.  If not, we could actually raise an alarm so that
vdsm/engine would try and migrate a VM away from this host if
possible.  Once that is done, autonuma might be able to make better
progress.  This is really just a research level idea at the moment.
Ok. I agree that this can be deferred to a later phase (based on
further
investigation)
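A sketch of that research idea: /proc/<pid>/numa_maps lines carry
"N<node>=<pages>" tokens, so a watcher could sum them per node for each qemu
process and raise an alarm when the spread stays badly skewed. The function
names and the imbalance metric below are made up for illustration.

import re

def node_page_counts(pid):
    """Sum the N<node>=<pages> tokens from /proc/<pid>/numa_maps per node."""
    counts = {}
    with open('/proc/%d/numa_maps' % pid) as f:
        for line in f:
            for node, pages in re.findall(r'\bN(\d+)=(\d+)', line):
                counts[int(node)] = counts.get(int(node), 0) + int(pages)
    return counts

def imbalance_ratio(counts):
    """Fraction of pages on the most loaded node; 1.0 means everything on one node."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return max(counts.values()) / float(total)

# e.g. counts = node_page_counts(qemu_pid); alarm if the ratio stays high over time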
Does libvirt have any NUMA tuning APIs that MOM may want to call to
enhance performance in certain situations?
I am no expert on libvirt's philosophy/goals etc. and have always
viewed libvirt as providing APIs for provisioning/controlling the
individual guests, either on the local host or in some cases remote
hosts... but not for changing the host-wide parameters/tunables themselves.
I shall let libvirt experts comment if that is not the case...

If we do identify valid use cases where NUMA-related tunables need to
be changed, then MOM can use mechanisms similar to sysctl etc. to
change them... but I have yet to envision such a scenario (beyond the
rare use cases where oVirt, upon user request, may choose to entirely
disable the automatic NUMA balancing feature on a given host).

Hope that makes some sense...  Thanks Vinod
Fair enough.  You're right that it doesn't want to handle policy, but
in some cases it provides APIs that allow a management system to tune
things.  For example: CPU pinning, IO/Net throttling, CPU shares,
balloon.
Yes... however the above examples still fall into the category of
managing guests and not the host itself :)  But I get your point...

Thanks
Vinod


One of the main questions I ask when trying to decide if MOM should
manage a particular setting is: "Is this something that is set once
and stays the same or is it something that must change dynamically
in accordance with current system conditions?"  In the former case,
it is probably best managed by engine or vdsm directly. In the
latter case, it fits the MOM model.

Hope this was helpful!  Please feel free to continue engaging this
list with any additional questions that you might have.

On the engine side, there is only one button related to this feature:
Sync MoM Policy, right?

On the vdsm side, I saw that momIF is handling this, right?

Best Regards, Jason Liao

-- Adam Litke

[Jason] + Martin's part:

Hi,

In my understanding, MOM is the collector for both host and guest data,
and it sets the right policy for KSM and memory ballooning to get better
performance.
Correct. MoM controls the guest memory allocations using KSM and
ballooning and allows overcommitment to work this way. It does not
really set the policy though; it contains the policy and uses it to
dynamically update the memory space available for VMs.

I am not sure what relationship it has with NUMA; can anyone
explain it to me?
In theory MoM might be able to play with ballooning on a per-node
basis.

Without NUMA information it would free memory somewhere on the host,
but that memory might be too slow to access because it won't be
localized on nearby nodes.

With NUMA information MoM will know which VMs can be ballooned so that
the newly released memory segments are a bit closer to each
other.

On the engine side, there is only one button related to this feature:
Sync MoM Policy, right?
There is also a Balloon device checkbox in the Edit VM dialog and an
Enable ballooning option in the Edit Cluster dialog.

On the vdsm side, I saw that momIF is handling this, right?
Yes, momIF is responsible for the MoM-specific communication and for
creating the policy file with parameters.

MoM also uses standard VDSM APIs to get other information and you
can see that in MoM's source code in hypervisor_interfaces/vdsm
(that interface is then used by collectors).

Regards

-- Martin Sivak msi...@redhat.com
