On 3/21/2014 6:13 AM, Adam Litke wrote:
On 20/03/14 18:03 -0700, Chegu Vinod wrote:
On 3/19/2014 11:01 PM, Liao, Chuan (Jason Liao,
HPservers-Core-OE-PSC) wrote:
Adding Vinod to this thread.

Best Regards, Jason Liao

-----Original Message-----
From: Adam Litke [mailto:ali...@redhat.com]
Sent: March 19, 2014 21:23
To: Doron Fediuck
Cc: vdsm-devel; Liao, Chuan (Jason Liao, HPservers-Core-OE-PSC); Martin Sivak; Gilad Chaplik; Liang, Shang-Chun (David Liang, HPservers-Core-OE-PSC); Shi, Xiao-Lei (Bruce, HP Servers-PSC-CQ)
Subject: Re: Fwd: Question about MOM

On 19/03/14 05:50 -0400, Doron Fediuck wrote:
Moving this to the vdsm list.

----- Forwarded Message -----
From: "Chuan Liao (Jason Liao, HPservers-Core-OE-PSC)" <chuan.l...@hp.com>
To: "Martin Sivak" <msi...@redhat.com>, ali...@redhat.com, "Doron Fediuck" <dfedi...@redhat.com>, "Gilad Chaplik" <gchap...@redhat.com>
Cc: "Shang-Chun Liang (David Liang, HPservers-Core-OE-PSC)" <shangchun.li...@hp.com>, "Xiao-Lei Shi (Bruce, HP Servers-PSC-CQ)" <xiao-lei....@hp.com>
Sent: Wednesday, March 19, 2014 11:28:01 AM
Subject: Question about MOM

Hi All,

I am new to the MOM feature.

In my understanding, MOM collects data from both the host and the
guests and sets the right policy for KSM and memory ballooning to get
better performance.
Yes, this is correct.  In oVirt, MOM runs as another vdsm thread and
uses the vdsm API to collect host and guest statistics.  Those
statistics are fed into a policy file which can produce outputs
(such as KSM tuning parameters and guest balloon sizes).  MOM then
uses the vdsm API to apply those outputs to the system.
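
For intuition, here is a minimal Python sketch of that collect ->
evaluate -> apply loop.  It is not MOM's actual code; the API names
(getHostStats, getVmStats, listVMs, setKsmTune, setBalloonTarget) and
the toy policy are purely illustrative assumptions.

    # Illustrative sketch only -- not MOM's real classes or vdsm's real API.
    import time

    class Collector(object):
        """Pulls host and guest statistics through a vdsm-like API object."""
        def __init__(self, api):
            self.api = api

        def collect(self):
            return {
                'host': self.api.getHostStats(),                # hypothetical call
                'guests': dict((vm, self.api.getVmStats(vm))    # hypothetical call
                               for vm in self.api.listVMs()),
            }

    def evaluate_policy(stats):
        """Toy policy: run KSM and shrink balloons when host memory is low."""
        pressure = stats['host']['mem_free'] < 2048        # MB; made-up threshold
        targets = {}
        for vm, s in stats['guests'].items():
            cur = s['balloon_cur']
            targets[vm] = int(cur * 0.95) if pressure else cur
        return {'ksm_run': pressure, 'balloon_targets': targets}

    def control_loop(api, interval=10):
        collector = Collector(api)
        while True:
            out = evaluate_policy(collector.collect())
            api.setKsmTune(run=out['ksm_run'])              # hypothetical call
            for vm, target in out['balloon_targets'].items():
                api.setBalloonTarget(vm, target)            # hypothetical call
            time.sleep(interval)

In real MOM the policy logic lives in a separate policy file that the
policy engine evaluates rather than in Python code like this, but the
data flow is the same.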


OK... understood the statistics gathering part and how it then drives
policy-driven inputs for KSM and ballooning on the host, etc.

Perhaps this was already discussed earlier?  Does the MOM thread in
vdsm intend to gather the NUMA topology of the host from VDSM (using
some new TBD or an enhanced existing API), or does it intend to
collect this directly from the host using libvirt/libnuma, etc.?

When MOM is using the VDSM HypervisorInterface, it must get all of its
information from vdsm.  It is considered an API layering violation for
MOM to access the system or libvirt connection directly.  When running
with the Libvirt HypervisorInterface, it should use libvirt and the
system directly as necessary.  Your new features should consider this
and make use of the HypervisorInterface abstraction to provide both
implementations.
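
A rough sketch of what that abstraction implies for a NUMA feature is
below.  The class and method names are illustrative assumptions, not
MOM's actual interface definitions (the real hypervisor interface
modules differ in detail).

    # Sketch: one abstract interface, two backends.  Names are hypothetical.
    class HypervisorInterface(object):
        """Abstract interface: each backend must implement these methods."""

        def getVmList(self):
            raise NotImplementedError

        def getHostNumaTopology(self):
            # a new NUMA collector would call this, so both backends need it
            raise NotImplementedError

    class VdsmBackend(HypervisorInterface):
        """Goes through vdsm only -- never touches libvirt or /sys directly."""

        def __init__(self, vdsm_api):
            self.api = vdsm_api

        def getVmList(self):
            return self.api.getVMList()               # hypothetical vdsm call

        def getHostNumaTopology(self):
            return self.api.getHostNumaTopology()     # would need a new/extended vdsm verb

    class LibvirtBackend(HypervisorInterface):
        """May query libvirt and the host directly."""

        def __init__(self, conn):
            self.conn = conn                          # a libvirt.virConnect

        def getVmList(self):
            return [dom.name() for dom in self.conn.listAllDomains()]

        def getHostNumaTopology(self):
            # libvirt's capabilities XML includes the host <topology><cells> data
            return self.conn.getCapabilities()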


Thanks for clarifying.  (Please include your comment about this in
Jason's design document that you may have seen.)

I am not sure how this relates to NUMA; can anyone explain it to me?

Jason, here is my understanding (and I believe I am just
paraphrasing/echoing Adam's comments).

MOM's NUMA-related enhancements are independent of what the oVirt
UI/oVirt scheduler does.

It is likely that MOM's vdsm thread may choose to extract information
about NUMA topology (including dynamic data like CPU usage or free
memory) from VDSM (i.e. if it chooses not to get it directly from
libvirt/libnuma or /proc, etc.).

How MOM interprets that NUMA information along with the other
statistics it gathers (alongside the user-requested SLA requirements
for each guest, etc.) should be left to MOM, which then decides and
directs the KSM/ballooning related actions.  I don't believe we need
to intervene in MOM's internals.

Once we decide to have NUMA-aware MOM policies there will need to be
some infrastructure enhancements to enable it.  I think Martin and I
will take the lead on it since we have been thinking about these kinds
of issues for some time now.

Ok.


I guess we need to start by examining the currently planned use
cases.  Please feel free to correct me if I am missing something or
over-simplifying something:

  1) NUMA-aware placement - Try to schedule VMs to run on hosts where
  the guest will not have to span multiple NUMA nodes.

I guess you are referring to the case where the user (and/or the
oVirt scheduler) has not explicitly directed libvirt on the host to
schedule the VM in some specific way... In those cases the decision
is left to the host OS scheduler to take care of it (that includes
the future/smarter automatic NUMA balancing enabled scheduler).

Yes.  For this one, we need a NUMA-aware placement algorithm on the
engine, and the autonuma feature available and configured on all virt
hosts.  In the first phase I don't anticipate any changes to MOM
internals.  I would prefer to observe the performance characteristics
of this and tweak MOM in the future to address actual performance
problems we see.

Ok.


  2) Virtual NUMA topology - Emulate a NUMA topology inside the VM.

Yes. Irrespective of any NUMA specified for the backing resources of
a guest... when the guest size increases it is a "required" practice
to have a virtual NUMA topology enabled. (This helps the OS running
inside the guest to scale/perform better by making NUMA-aware
decisions, and it also helps the applications running in that OS to
scale/perform better.)

Agreed.  One point I might make then... Should the VM creation process
on engine automatically configure virtual NUMA (even if the user
doesn't select it) once a guest reaches a certain memory size?


Good point, and yes, we have thought about it a little bit... (BTW,
it's not just the memory size but the number of vCPUs too.)  Perhaps
mimic the host topology... but there could be some issues, so we
wanted to defer this to a future oVirt version.  (BTW, we are aware
of at least one other competing hypervisor management tool that does
this automatically.)


These two use cases are intertwined because VMs with NUMA can be
scheduled with more flexibility (albeit with more sophistication)
since the scheduler can fit the VM onto hosts where the memory can
be split across multiple Host NUMA nodes.

  3) Manual NUMA pinning - Allow advanced admins to schedule a VM
  to run on a specific host with a manual pinning strategy.

Yes


Most of these use cases involve the engine scheduler and engine UI.

Correct.

There is not much for MOM to do to support their direct
implementation.  We should focus on managing interactions with
other SLA features that MOM does implement:
- How should KSM be adjusted when NUMA is in effect?  In a NUMA host,
  are there NUMA-aware KSM tunables that we should use?
- When ballooning VMs, should we take into account how much memory we
  need to reclaim from VMs on a node-by-node basis?

If MOM had the NUMA topology information of the host, I believe it
should be able to determine where the guest-related processes are
currently running on the host (irrespective of how those guests ended
up there).  MOM can then use all the relevant information (NUMA
topology, statistics, SLAs, etc.) to decide and direct KSM and
ballooning in a NUMA-friendly way...

Yes, exactly.  For example, only run KSM on nodes where there is
memory pressure and only balloon guests whose memory resides on nodes
with a memory shortage.

That's correct.



Lastly, let's see if MOM needs to manage the existing NUMA
utilities in place on the system.  I don't know much about
AutoNUMA.  Does it have tunables that should be adjusted or is it
completely autonomous?

For the most part it's automated (that's the whole point of being
automatic... although the technology will mature in phases :))... but
if someone really, really needs it to be disabled, they can do so.

There are certainly some NUMA-related tunables in the kernel today
(as shown below)... but at this point I am not very sure about the
specific scenarios where one would really need to change these
default settings.  (As we do more studies of various use cases on
different platforms and workload sizes there may be a need... but at
this point I don't see MOM necessarily getting involved in these
settings.  Does MOM change other kernel tunables today?)


# sysctl -a |grep numa
kernel.numa_balancing = 1
kernel.numa_balancing_scan_delay_ms = 1000
kernel.numa_balancing_scan_period_max_ms = 60000
kernel.numa_balancing_scan_period_min_ms = 1000
kernel.numa_balancing_scan_size_mb = 256
kernel.numa_balancing_settle_count = 4
vm.numa_zonelist_order = default

These remind me of the KSM tunables.  Maybe some day we will be clever
enough to tune them but you're right, it should not be our first
priority.  One idea I have for MOM is that it could check up on
autonuma by examining /proc/<pid>/numa_maps for each qemu process on
the host and seeing whether autonuma is keeping the process reasonably
balanced.  If not, we could raise an alarm so that vdsm/engine would
try to migrate a VM away from this host if possible.  Once that is
done, autonuma might be able to make better progress.  This is really
just a research-level idea at the moment.
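
As a rough illustration of that research idea, a check along the
following lines could estimate how a qemu process's memory is spread
across NUMA nodes.  This is a sketch only; the balance threshold and
the helper names are made up, and the alarm/migration plumbing is
omitted.

    # Sketch: per-node memory spread of a process, parsed from numa_maps.
    # Lines in /proc/<pid>/numa_maps contain "N<node>=<pages>" fields giving
    # the number of pages backed by each NUMA node.
    import re

    def numa_spread(pid):
        pages_per_node = {}
        with open('/proc/%d/numa_maps' % pid) as f:
            for line in f:
                for node, pages in re.findall(r'N(\d+)=(\d+)', line):
                    node = int(node)
                    pages_per_node[node] = pages_per_node.get(node, 0) + int(pages)
        return pages_per_node

    def looks_unbalanced(pid, threshold=0.75):
        """Flag the process if no single node holds at least `threshold` of its pages."""
        spread = numa_spread(pid)
        total = sum(spread.values())
        if total == 0 or len(spread) < 2:
            return False
        return max(spread.values()) / float(total) < threshold

Whatever ends up doing this check would presumably only raise the
alarm for qemu PIDs where looks_unbalanced() stays true over several
sampling intervals.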

Ok. I agree that this can be deferred to a later phase (based on
further investigation).

Does libvirt have any NUMA tuning APIs that MOM may want to call to
enhance performance in certain situations?

I am no expert on libvirt's philosophy/goals and have always viewed
libvirt as providing APIs for provisioning/controlling individual
guests, either on the local host or in some cases on remote hosts...
but not for changing host-wide parameters/tunables themselves. I
shall let the libvirt experts comment if that is not the case...

If we do identify valid use cases where NUMA-related tunables need
to be changed, then MOM can use mechanisms similar to sysctl to
change them... but I have yet to envision such a scenario (beyond the
rare use case where oVirt, upon user request, may choose to entirely
disable the automatic NUMA balancing feature on a given host).
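
If that rare case ever shows up, the mechanism itself is trivial; a
hedged sketch (equivalent to `sysctl kernel.numa_balancing=0|1`; it
requires root, and the /proc path only exists on kernels built with
automatic NUMA balancing support):

    # Sketch: toggle automatic NUMA balancing via /proc/sys.
    def set_numa_balancing(enabled):
        with open('/proc/sys/kernel/numa_balancing', 'w') as f:
            f.write('1' if enabled else '0')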

Hope that makes some sense...  Thanks, Vinod

Fair enough.  You're right that it doesn't want to handle policy, but
in some cases it provides APIs that allow a management system to tune
things.  For example: CPU pinning, IO/Net throttling, CPU shares,
balloon.

Yes... however, the above examples still fall into the category of
managing guests and not the host itself :)  But I get your point...

Thanks
Vinod



One of the main questions I ask when trying to decide if MOM should
manage a particular setting is: "Is this something that is set once
and stays the same or is it something that must change dynamically
in accordance with current system conditions?"  In the former case,
it is probably best managed by engine or vdsm directly.  In the
latter case, it fits the MOM model.

Hope this was helpful!  Please feel free to continue engaging this
list with any additional questions that you might have.

On the engine side, there is only one button for this feature: Sync
MoM Policy, right?

On the vdsm side, I saw that momIF is handling this, right?

Best Regards, Jason Liao

-- Adam Litke

[Jason] + Martin's part:

Hi,

In my understanding, MOM collects data from both the host and the
guests and sets the right policy for KSM and memory ballooning to get
better performance.
Correct. MoM controls the guest memory allocations using KSM and
ballooning and allows overcommitment to work this way. It does not
really set the policy though; it contains the policy and uses it
to dynamically update the memory space available for VMs.

I am not sure how this relates to NUMA; can anyone explain it to me?
In theory MoM might be able to play with ballooning on a per-node
basis.

Without NUMA information it would free memory somewhere on the
host, but that memory might be too slow to access because it won't
be localized on nearby nodes.

With NUMA information, MoM will know which VMs can be ballooned so
that the newly released memory segments are a bit closer to each
other.

On the engine side, there is only one button for this feature: Sync
MoM Policy, right?
There is also the Balloon device checkbox in the Edit VM dialog and
the Enable ballooning option in the Edit Cluster dialog.

On the vdsm side, I saw that momIF is handling this, right?
Yes, momIF is responsible for the MoM-specific communication and
for creating the policy file with the parameters.

MoM also uses standard VDSM APIs to get other information and you
can see that in MoM's source code in hypervisor_interfaces/vdsm
(that interface is then used by collectors).

Regards

-- Martin Sivak msi...@redhat.com



_______________________________________________
vdsm-devel mailing list
vdsm-devel@lists.fedorahosted.org
https://lists.fedorahosted.org/mailman/listinfo/vdsm-devel
