On 3/26/2014 5:35 AM, Adam Litke wrote:
On 26/03/14 03:50 -0700, Chegu Vinod wrote:
> <removing the email alias>

Restoring the email alias. Please keep discussions as public as possible to allow others to contribute to the design and planning.
Fine
Jason, please see below...

On 3/26/2014 1:38 AM, Liao, Chuan (Jason Liao, HPservers-Core-OE-PSC) wrote:
> Hi All,
>
> Following the discussion below, I took away these points:
>
> 1. How MOM gathers NUMA information (topology, statistics...) will change in the future (one side using the VDSM API, the other side using libvirt and system APIs).

I didn't follow your sentence. Please work with Adam/Martin and provide the needed APIs on the VDSM side, so that the MOM entity thread can use them to extract what it needs about NUMA topology and cpu/memory usage. As I see it, this is probably the only piece that would be relevant to make available at the earliest (preferably in oVirt 3.5), and it would enable MOM to pursue next steps as they see fit. Beyond that, at this point (for oVirt 3.5), let us not spend more time on MOM internals, please. Let us leave that to Adam and Martin to pursue as/when they see fit.

> 2. Martin and Adam will take a look at the MOM policy in the oVirt scheduler when the NUMA feature is turned on.

Yes, please.

> 3. The oVirt engine will have a NUMA-aware placement algorithm to make the VM run within NUMA nodes in the best way.

The "algorithm" here is driven by user-specified pinning requests and/or by the oVirt scheduler. In the case of a user request (upon approval from the oVirt scheduler), VDSM -> libvirt will be explicitly told what to do via numatune/cputune etc. In the absence of a user-specified pinning request, I don't know whether the oVirt scheduler intends to convey numatune/cputune-type requests to libvirt...

> 4. The oVirt engine will have some algorithm to automatically configure virtual NUMA when a big VM (large memory or many vcpus) is created.

This is a good suggestion, but in my view it should be taken up after oVirt 3.5. For now, just accept and process the user-specified requests...

> 5. Investigate whether KSM and memory ballooning have the right tuning parameters when the NUMA feature is turned on.

That is for Adam/Martin et al.
...not for your specific project. We just need to ensure that they have the basic NUMA info they need (via the VDSM API I mentioned above), so that they can work on their part independently as/when they see fit.

> 6. Investigate whether Automatic NUMA balancing is keeping the process reasonably balanced, and notify the oVirt engine.

Not sure I follow what you are saying... Here is what I have in mind: check whether the target host has Automatic NUMA balancing available (you can use sysctl -a | grep numa_balancing or a similar underlying mechanism to determine this). If it is present, then check whether it is enabled (a value of 1 is enabled, 0 is disabled)... and convey this information to the oVirt engine GUI for display (this is a hint for a user, if they wish, to skip manual pinning). This, in my view, is the minimum at this point (and it would be great if we can make it happen for oVirt 3.5).

> I think since we have vdsm you can choose to enable autonuma always (when it is present).
I don't speak for the various Linux distros out there... but I suspect most may choose to have the default set to enabled (if the feature is present in the OS).
Again... there should be some indication on the oVirt engine side (and in my opinion it might be useful to display it to the user too) of whether a given host currently has the feature enabled or not (either because it was disabled or because the feature is not present in the OS).
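As a concrete illustration of the check discussed above: the kernel.numa_balancing sysctl is backed by /proc/sys/kernel/numa_balancing, so a collector could probe it roughly like this. This is only a minimal sketch; the function name and tri-state return convention are mine, not an existing vdsm/MOM API.

```python
from pathlib import Path

def numa_balancing_state(procfile="/proc/sys/kernel/numa_balancing"):
    """Return True if automatic NUMA balancing is enabled, False if it
    is disabled, or None when the kernel does not expose the tunable
    (i.e. the feature is not present in this OS)."""
    path = Path(procfile)
    if not path.exists():
        return None
    # The file holds a single integer: 1 = enabled, 0 = disabled.
    return path.read_text().strip() == "1"
```

A tri-state result would let the engine GUI distinguish "disabled" from "not available", which is exactly the display hint described above.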
> Are there any drawbacks to enabling it always?
I can't speak for every possible use case, but based on what I know at this moment: with the exception of a few targeted benchmarking-type use cases (where folks may consider turning it off), I haven't yet run into a situation where there are negative side effects of leaving it enabled.
A customer can still choose to manually pin a workload or a guest if they wish to do so (even if it is enabled).
We can discuss (at some later point, i.e. post oVirt 3.5) whether we should really provide a way for the user to disable Automatic NUMA balancing. Changing the other NUMA balancing tunables is just not going to happen, as far as I can see at this point (so let us not worry about that right now).

> 7. Investigate whether libvirt has any NUMA tuning APIs.

No, there is nothing to investigate here, IMO. libvirt should not be playing with the host-wide NUMA settings. Please feel free to correct me if I am missing something.

See above.

> BTW, I think there is no point in the oVirt 3.5 release, am I right?

If you are referring to just the MOM stuff, then with the exception of my comment about having an appropriate API on the VDSM side for enabling MOM, there is nothing else.

Vinod

> Best Regards,
> Jason Liao
>
> -----Original Message-----
> From: Vinod, Chegu
> Sent: March 21, 2014 21:32
> To: Adam Litke
> Cc: Liao, Chuan (Jason Liao, HPservers-Core-OE-PSC); vdsm-devel; Martin Sivak; Gilad Chaplik; Liang, Shang-Chun (David Liang, HPservers-Core-OE-PSC); Shi, Xiao-Lei (Bruce, HP Servers-PSC-CQ); Doron Fediuck
> Subject: Re: FW: Fwd: Question about MOM

On 3/21/2014 6:13 AM, Adam Litke wrote:
> Thanks for clarifying. (Please include your comment about this in Jason's design document that you may have seen.)

On 20/03/14 18:03 -0700, Chegu Vinod wrote:

On 3/19/2014 11:01 PM, Liao, Chuan (Jason Liao, HPservers-Core-OE-PSC) wrote:
> Add Vinod in this thread.
>
> Best Regards,
> Jason Liao
>
> -----Original Message-----
> From: Adam Litke [mailto:ali...@redhat.com]
> Sent: March 19, 2014 21:23
> To: Doron Fediuck
> Cc: vdsm-devel; Liao, Chuan (Jason Liao, HPservers-Core-OE-PSC); Martin Sivak; Gilad Chaplik; Liang, Shang-Chun (David Liang, HPservers-Core-OE-PSC); Shi, Xiao-Lei (Bruce, HP Servers-PSC-CQ)
> Subject: Re: Fwd: Question about MOM

On 19/03/14 05:50 -0400, Doron Fediuck wrote:
> Moving this to the vdsm list.
----- Forwarded Message -----
From: "Chuan Liao (Jason Liao, HPservers-Core-OE-PSC)" <chuan.l...@hp.com>
To: "Martin Sivak" <msi...@redhat.com>, ali...@redhat.com, "Doron Fediuck" <dfedi...@redhat.com>, "Gilad Chaplik" <gchap...@redhat.com>
Cc: "Shang-Chun Liang (David Liang, HPservers-Core-OE-PSC)" <shangchun.li...@hp.com>, "Xiao-Lei Shi (Bruce, HP Servers-PSC-CQ)" <xiao-lei....@hp.com>
Sent: Wednesday, March 19, 2014 11:28:01 AM
Subject: Question about MOM

> Hi All,
>
> I am new to the MOM feature. In my understanding, MOM is a collector of data from both the host and the guests, and it sets the right policy for KSM and memory ballooning to get better performance.

Yes, this is correct. In oVirt, MOM runs as another vdsm thread and uses the vdsm API to collect host and guest statistics. Those statistics are fed into a policy file which can produce some outputs (such as ksm tuning parameters and guest balloon sizes). MOM then uses the vdsm API to apply those outputs to the system.

OK, understood about the statistics-gathering part and then initiating policy-driven inputs for KSM and ballooning on the host etc. Perhaps this was already discussed earlier? Does the MOM thread in vdsm intend to gather the NUMA topology of the host from VDSM (using some new TBD API or an enhanced existing API), or does it intend to collect this directly from the host using libvirt/libnuma etc.?

When MOM is using the VDSM HypervisorInterface, it must get all of its information from vdsm. It is considered an API layering violation for MOM to access the system or the libvirt connection directly. When running with the Libvirt HypervisorInterface, it should use libvirt and the system directly as necessary. Your new features should take this into account and make use of the HypervisorInterface abstraction to provide both implementations.

> I am not sure how it relates to NUMA; can anyone explain it to me?

Jason, here is my understanding (and I believe I am just paraphrasing/echoing Adam's comments).
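To illustrate the layering rule Adam describes: MOM code talks to a single hypervisor-interface abstraction, and each backend decides where the data comes from (vdsm API vs. libvirt/system directly). This is only an illustrative sketch, not MOM's actual interface: the class and method names are mine, and the vdsm verb shown is hypothetical. The libvirt backend parses the real <topology><cells> section of libvirt's capabilities XML.

```python
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod

class HypervisorInterface(ABC):
    """The one abstraction MOM policy code would talk to."""
    @abstractmethod
    def numa_nodes(self):
        """Return {node_id: {'mem_kib': int, 'cpus': [int, ...]}}."""

class VdsmInterface(HypervisorInterface):
    """Layering rule: every piece of data must come through vdsm."""
    def __init__(self, vdsm_api):
        self._api = vdsm_api
    def numa_nodes(self):
        return self._api.getHostNumaStats()   # hypothetical vdsm verb

class LibvirtInterface(HypervisorInterface):
    """May query libvirt/the system directly; here we parse the
    <topology><cells> section of libvirt's capabilities XML."""
    def __init__(self, conn):
        self._conn = conn
    def numa_nodes(self):
        root = ET.fromstring(self._conn.getCapabilities())
        nodes = {}
        for cell in root.iter("cell"):
            cpus = [int(c.get("id")) for c in cell.iter("cpu")]
            mem = cell.find("memory")
            nodes[int(cell.get("id"))] = {
                "mem_kib": int(mem.text) if mem is not None else 0,
                "cpus": cpus,
            }
        return nodes
```

Collector code then depends only on HypervisorInterface, so the same NUMA-aware policy works in both deployments.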
MOM's NUMA-related enhancements are independent of what the oVirt UI/oVirt scheduler does. It is likely that MOM's vdsm thread may choose to extract information about NUMA topology (including dynamic data like cpu usage or free memory) from VDSM (i.e. if they choose not to get it directly from libvirt/libnuma or /proc etc.). How MOM interprets that NUMA information, along with the other statistics it gathers (and the user-requested SLA requirements for each guest etc.), should be left to MOM, to decide on and direct KSM/ballooning-related actions. I don't believe we need to intervene in the MOM internals.

Once we decide to have NUMA-aware MOM policies, there will need to be some infrastructure enhancements to enable them. I think Martin and I will take the lead on that, since we have been thinking about these kinds of issues for some time now.

OK.

I guess we need to start by examining the currently planned use cases. Please feel free to correct me if I am missing or over-simplifying something:

1) NUMA-aware placement - Try to schedule VMs to run on hosts where the guest will not have to span multiple NUMA nodes.

I guess you are referring to the case where the user (and/or the oVirt scheduler) has not explicitly directed libvirt on the host to schedule the VM in some specific way... In those cases the decision is left to the smarts of the host OS scheduler (which includes the future, smarter Automatic-NUMA-balancing-enabled scheduler).

Yes. For this one, we need a numa-aware placement algorithm on the engine, and the autonuma feature available and configured on all virt hosts. In the first phase I don't anticipate any changes to MOM internals. I would prefer to observe the performance characteristics of this and tweak MOM in the future to address the actual performance problems we see.

OK.

2) Virtual NUMA topology - Emulate a NUMA topology inside the VM.

Yes.
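For context, a guest's virtual NUMA topology is what libvirt expresses in the domain XML; a minimal illustrative fragment (the values are made up) defining two guest NUMA cells of four vcpus and 4 GiB each might look like:

```xml
<cpu>
  <numa>
    <cell cpus='0-3' memory='4194304'/>   <!-- guest node 0: vcpus 0-3, 4 GiB -->
    <cell cpus='4-7' memory='4194304'/>   <!-- guest node 1: vcpus 4-7, 4 GiB -->
  </numa>
</cpu>
```

The memory attribute is in KiB. This is the guest-visible topology only; host-side placement (numatune/cputune) is a separate part of the domain XML.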
Irrespective of any NUMA pinning specified for the backing resources of a guest... as the guest size increases, it is a "required" practice to have a virtual NUMA topology enabled. (This helps the OS running inside the guest to scale/perform much better by making NUMA-aware decisions etc. It also helps the applications running in that OS to scale/perform better.)

Agreed. One point I might make, then... should the VM creation process on the engine automatically configure virtual NUMA (even if the user doesn't select it) once a guest reaches a certain memory size?

Good point, and yes, we have thought about it a little bit... (BTW, it's not just the memory size but the number of vcpus too.) Perhaps mimic the host topology etc., but there could be some issues... so we wanted to defer this to a future oVirt version. (BTW, we are aware of at least one other competing hypervisor management tool that does this automatically.)

These two use cases are intertwined, because VMs with NUMA can be scheduled with more flexibility (albeit with more sophistication), since the scheduler can fit the VM onto hosts where the memory can be split across multiple host NUMA nodes.

3) Manual NUMA pinning - Allow advanced admins to schedule a VM to run on a specific host with a manual pinning strategy.

Yes.

Most of these use cases involve the engine scheduler and engine UI.

Correct.

There is not much for MOM to do to support their direct implementation. We should focus on managing interactions with the other SLA features that MOM does implement:

- How should KSM be adjusted when NUMA is in effect? On a NUMA host, are there numa-aware KSM tunables that we should use?
- When ballooning VMs, should we take into account how much memory we need to reclaim from VMs on a node-by-node basis?

If MOM had the NUMA topology information of the host, I believe it should be able to determine where the guest-related processes are currently running on the host (irrespective of how those guests ended up there etc.).
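For what it's worth, "where a guest's memory currently lives" is visible on Linux via /proc/<pid>/numa_maps, where each mapping line carries N<node>=<pages> counters. A rough sketch of summarizing them per node and flagging imbalance (the helper names and the tolerance threshold are my own invention, not anything MOM ships):

```python
import re
from collections import Counter

def node_pages(numa_maps_text):
    """Sum resident pages per NUMA node from the text of
    /proc/<pid>/numa_maps; each line carries N<node>=<pages> tokens."""
    pages = Counter()
    for node, count in re.findall(r"\bN(\d+)=(\d+)", numa_maps_text):
        pages[int(node)] += int(count)
    return pages

def is_balanced(pages, tolerance=0.25):
    """Crude balance test: no node may hold more than its fair share
    plus `tolerance` of the total page count."""
    total = sum(pages.values())
    if total == 0 or len(pages) < 2:
        return True
    fair = total / len(pages)
    return max(pages.values()) <= fair + tolerance * total
```

Running node_pages over each qemu process would give MOM (or vdsm) the per-node picture needed for the kind of node-aware KSM/ballooning decisions discussed here.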
MOM can then use all the relevant information (NUMA topology, statistics, SLAs etc.) to decide on and direct KSM and ballooning in a NUMA-friendly way...

Yes, exactly. For example, only run ksm on nodes where there is memory pressure, and only balloon guests whose memory resides on nodes with a memory shortage.

That's correct.

OK, I agree that this can be deferred to a later phase (based on further investigation).

Lastly, let's see if MOM needs to manage the existing NUMA utilities in place on the system. I don't know much about AutoNUMA. Does it have tunables that should be adjusted, or is it completely autonomous?

For the most part it is automated (that's the whole point of being Automatic... although the technology will mature in phases :))... but if someone really, really needs it to be disabled, they can do so. There are certainly some NUMA-related tunables in the kernel today (as shown below)... but at this point I am not very sure about the specific scenarios where one would really need to change these default settings. (As we do more studies of various use cases on different platforms and workload sizes etc., a need may arise... but at this point I don't see MOM necessarily getting involved in these settings. Does MOM change other kernel tunables today?)

# sysctl -a | grep numa
kernel.numa_balancing = 1
kernel.numa_balancing_scan_delay_ms = 1000
kernel.numa_balancing_scan_period_max_ms = 60000
kernel.numa_balancing_scan_period_min_ms = 1000
kernel.numa_balancing_scan_size_mb = 256
kernel.numa_balancing_settle_count = 4
vm.numa_zonelist_order = default

These remind me of the KSM tunables. Maybe some day we will be clever enough to tune them, but you're right, it should not be our first priority. One idea I have for MOM is that it could check up on autonuma by looking at /proc/<pid>/numa_maps for each qemu process on the host and seeing if autonuma is keeping the process reasonably balanced. If not, we could actually raise an alarm so that vdsm/engine would try to migrate a VM away from this host if possible. Once that is done, autonuma might be able to make better progress. This is really just a research-level idea at the moment.

Does libvirt have any NUMA tuning APIs that MOM may want to call to enhance performance in certain situations?

I am no expert on libvirt's philosophy/goals, and have always viewed libvirt as providing APIs for provisioning/controlling individual guests, either on the local host or in some cases a remote host... but not for changing the host-wide parameters/tunables themselves. I shall let the libvirt experts comment if that is not the case... If we do identify valid use cases where NUMA-related tunables need to be changed, then MOM can use mechanisms similar to sysctl etc. to change them... but I have yet to envision such a scenario (beyond the rare use case where oVirt, upon user request, may choose to entirely disable the automatic NUMA balancing feature on a given host). Hope that makes some sense...

Thanks
Vinod

Fair enough. You're right that it doesn't want to handle policy, but in some cases it provides APIs that allow a management system to tune things. For example: CPU pinning, IO/Net throttling, CPU shares, balloon.

Yes... however, the above examples still fall into the category of managing guests and not the host itself :) But I get your point...

Thanks
Vinod

One of the main questions I ask when trying to decide if MOM should manage a particular setting is: "Is this something that is set once and stays the same, or is it something that must change dynamically in accordance with current system conditions?" In the former case, it is probably best managed by the engine or vdsm directly. In the latter case, it fits the MOM model. Hope this was helpful! Please feel free to continue engaging this list with any additional questions you might have.

On the engine side, there is only one button related to this feature: Sync MoM Policy, right?
On the vdsm side, I saw that momIF handles this, right?

Best Regards,
Jason Liao

--
Adam Litke

[Jason] + Martin's part:

Hi,

> In my understanding, MOM is a collector of data from both the host and the guests, and it sets the right policy for KSM and memory ballooning to get better performance.

Correct. MoM controls the guest memory allocations using KSM and ballooning, and allows overcommitment to work this way. It does not really set the policy though; it contains the policy and uses it to dynamically update the memory space available to VMs.

> I am not sure how it relates to NUMA; can anyone explain it to me?

In theory, MoM might be able to play with ballooning on a per-node basis. Without NUMA information it would free memory somewhere on the host, but that memory might be too slow to access because it won't be localized on nearby nodes. With NUMA information, MoM will know which VMs can be ballooned so that the newly released memory segments end up a bit closer to each other.

> On the engine side, there is only one button related to this feature: Sync MoM Policy, right?

There is also the Balloon device checkbox in the Edit VM dialog and Enable ballooning in the Edit Cluster dialog.

> On the vdsm side, I saw that momIF handles this, right?

Yes, momIF is responsible for the MoM-specific communication and for creating the policy file with parameters. MoM also uses the standard VDSM APIs to get other information; you can see that in MoM's source code in hypervisor_interfaces/vdsm (that interface is then used by the collectors).

Regards

--
Martin Sivak
msi...@redhat.com
_______________________________________________
vdsm-devel mailing list
vdsm-devel@lists.fedorahosted.org
https://lists.fedorahosted.org/mailman/listinfo/vdsm-devel