Hi All,

Following the discussion below, I have gathered these points:
1. The way MOM gathers NUMA information (topology, statistics, ...) will
change in the future (one path using the VDSM API, another using libvirt
and system APIs).
2. Martin and Adam will take a look at the MOM policy in the oVirt
scheduler when the NUMA feature is turned on.
3. The oVirt engine will have a NUMA-aware placement algorithm so that a
VM runs within NUMA node boundaries as far as possible.
4. The oVirt engine will have an algorithm to automatically configure
virtual NUMA when a big VM is created (large memory or many vCPUs).
5. Investigate whether KSM and memory ballooning have the right tuning
parameters when the NUMA feature is turned on.
6. Investigate whether automatic NUMA balancing is keeping each process
reasonably balanced, and notify the oVirt engine if not.
7. Investigate whether libvirt has any NUMA tuning APIs.
Please feel free to correct me if I am missing something.

BTW, I think there is nothing here for the oVirt 3.5 release, am I right?

Best Regards,
Jason Liao

-----Original Message-----
From: Vinod, Chegu 
Sent: March 21, 2014 21:32
To: Adam Litke
Cc: Liao, Chuan (Jason Liao, HPservers-Core-OE-PSC); vdsm-devel; Martin Sivak; 
Gilad Chaplik; Liang, Shang-Chun (David Liang, HPservers-Core-OE-PSC); Shi, 
Xiao-Lei (Bruce, HP Servers-PSC-CQ); Doron Fediuck
Subject: Re: FW: Fwd: Question about MOM

On 3/21/2014 6:13 AM, Adam Litke wrote:
> On 20/03/14 18:03 -0700, Chegu Vinod wrote:
>> On 3/19/2014 11:01 PM, Liao, Chuan (Jason Liao,
>> HPservers-Core-OE-PSC) wrote:
>>> Add Vinod in this thread.
>>>
>>> Best Regards, Jason Liao
>>>
>>> -----Original Message-----
>>> From: Adam Litke [mailto:ali...@redhat.com]
>>> Sent: March 19, 2014 21:23
>>> To: Doron Fediuck
>>> Cc: vdsm-devel; Liao, Chuan (Jason Liao, HPservers-Core-OE-PSC);
>>> Martin Sivak; Gilad Chaplik; Liang, Shang-Chun (David Liang,
>>> HPservers-Core-OE-PSC); Shi, Xiao-Lei (Bruce, HP Servers-PSC-CQ)
>>> Subject: Re: Fwd: Question about MOM
>>>
>>> On 19/03/14 05:50 -0400, Doron Fediuck wrote:
>>>> Moving this to the vdsm list.
>>>>
>>>> ----- Forwarded Message -----
>>>> From: "Chuan Liao (Jason Liao, HPservers-Core-OE-PSC)" <chuan.l...@hp.com>
>>>> To: "Martin Sivak" <msi...@redhat.com>, ali...@redhat.com,
>>>> "Doron Fediuck" <dfedi...@redhat.com>, "Gilad Chaplik" <gchap...@redhat.com>
>>>> Cc: "Shang-Chun Liang (David Liang, HPservers-Core-OE-PSC)"
>>>> <shangchun.li...@hp.com>, "Xiao-Lei Shi (Bruce, HP Servers-PSC-CQ)"
>>>> <xiao-lei....@hp.com>
>>>> Sent: Wednesday, March 19, 2014 11:28:01 AM
>>>> Subject: Question about MOM
>>>>
>>>> Hi All,
>>>>
>>>> I am new to the MOM feature.
>>>>
>>>> In my understanding, MOM collects data from both the host and the
>>>> guests and sets the right policy for KSM and memory ballooning to
>>>> get better performance.
>>> Yes this is correct.  In oVirt, MOM runs as another vdsm thread and 
>>> uses the vdsm API to collect host and guest statistics.  Those 
>>> statistics are fed into a policy file which can create some outputs 
>>> (such as ksm tuning parameters and guest balloon sizes).  MOM then 
>>> uses the vdsm API to apply those outputs to the system.
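
To make that flow concrete, here is a rough Python sketch of the
collect -> evaluate -> apply cycle; the class and method names are
invented for illustration and do not match MOM's or vdsm's actual code.

# Illustrative only: stand-in for a vdsm-backed hypervisor interface.
class FakeHypervisor:
    def host_stats(self):
        # e.g. free memory and KSM counters gathered via the vdsm API
        return {'mem_free_kb': 512 * 1024}

    def guest_stats(self, vm_id):
        # e.g. the guest's current balloon target
        return {'balloon_cur_kb': 2 * 1024 * 1024}

    def set_balloon(self, vm_id, target_kb):
        # in real MOM this would again go through the vdsm API
        print('balloon %s -> %d kB' % (vm_id, target_kb))


def evaluate_policy(host, guest):
    # Toy policy: shrink the balloon by 5% when host free memory is low.
    if host['mem_free_kb'] < 1024 * 1024:
        return int(guest['balloon_cur_kb'] * 0.95)
    return guest['balloon_cur_kb']


hyp = FakeHypervisor()
for vm in ('vm1', 'vm2'):
    hyp.set_balloon(vm, evaluate_policy(hyp.host_stats(), hyp.guest_stats(vm)))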
>>
>>
>> OK. Understood the statistics gathering part and then initiating
>> policy-driven inputs for KSM and ballooning on the host etc.
>>
>> Perhaps this was already discussed earlier ? Does the MOM thread in 
>> vdsm intend to gather the NUMA topology of the host from the VDSM 
>> (using some new TBD or some enhanced existing API) or does it intend 
>> to collect this directly from the host using libvirt/libnuma etc ?
>
> When MOM is using the VDSM HypervisorInterface, it must get all of its 
> information from vdsm.  It is considered an API layering violation for 
> MOM to access the system or libvirt connection directly.  When running 
> with the Libvirt HypervisorInterface, it should use libvirt and the 
> system directly as necessary.  Your new features should consider this 
> and make use of the HypervisorInterface abstraction to provide both 
> implementations.
>
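
For illustration, here is a minimal Python sketch of that
HypervisorInterface split. The getNumaTopology() vdsm verb below is
hypothetical; the libvirt calls (getInfo, getCellsFreeMemory) are real
but shown without error handling.

class HypervisorInterface(object):
    """Abstract source of host data (illustrative, not MOM's real class)."""
    def numa_topology(self):
        # return {node_id: {'mem_free_kb': int, ...}}
        raise NotImplementedError


class VdsmBackend(HypervisorInterface):
    def __init__(self, vdsm_api):
        self._api = vdsm_api                      # everything goes through vdsm

    def numa_topology(self):
        return self._api.getNumaTopology()        # hypothetical vdsm verb


class LibvirtBackend(HypervisorInterface):
    def __init__(self, conn):
        self._conn = conn                         # an open libvirt connection

    def numa_topology(self):
        nodes = self._conn.getInfo()[4]           # number of NUMA cells
        free = self._conn.getCellsFreeMemory(0, nodes)   # bytes per cell
        return dict((n, {'mem_free_kb': free[n] // 1024}) for n in range(nodes))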

Thanks for clarifying.  (Please also add a note about this to Jason's
design document, which you may have seen.)

>>>> I am not sure how it relates to NUMA; can anyone explain it to me?
>>
>> Jason, here is my understanding (I believe I am just
>> paraphrasing/echoing Adam's comments).
>>
>> MOM's NUMA related enhancements are independent of what the oVirt 
>> UI/oVirt scheduler does.
>>
>> It is likely that MOM's vdsm thread may choose to extract NUMA
>> topology information (including dynamic data such as CPU usage or
>> free memory) from VDSM, i.e. if it chooses not to get it directly
>> from libvirt/libnuma or /proc etc.
>>
>> How MOM interprets that NUMA information along with the other
>> statistics it gathers (alongside the user-requested SLA requirements
>> for each guest etc.) should be left to MOM to decide and to direct
>> KSM/ballooning related actions. I don't believe we need to intervene
>> in the MOM internals.
>
> Once we decide to have NUMA-aware MOM policies there will need to be 
> some infrastructure enhancements to enable it.  I think Martin and I 
> will take the lead on it since we have been thinking about these kinds 
> of issues for some time now.

Ok.

>
>>> I guess we need to start by examining the currently planned use
>>> cases.  Please feel free to correct me if I am missing something or
>>> over-simplifying something:
>>>
>>>   1) NUMA-aware placement - Try to schedule VMs to run on hosts
>>>   where the guest will not have to span multiple NUMA nodes.
>>
>> I guess you are referring to the case where the user (and/or the 
>> oVirt scheduler) has not explicitly directed libvirt on the host to 
>> schedule the VM in some specific way... In those cases the decision 
>> is left to the smarts of the host OS scheduler to take care of it 
>> (that includes the future/smarter Automatic NUMA balancing enabled 
>> scheduler).
>
> Yes.  For this one, we need a numa-aware placement algorithm on 
> engine, and the autonuma feature available and configured on all virt 
> hosts.  In the first phase I don't anticipate any changes to MOM 
> internals.  I would prefer to observe the performance characteristics 
> of this and tweak MOM in the future to address actual performance 
> problems we see.
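
As a rough illustration of such a placement filter (the data layout is
made up; this is not the engine's actual scheduler code):

def fits_single_node(vm_mem_mb, node_free_mb):
    # True if some single NUMA node on the host can hold the whole VM
    return any(free >= vm_mem_mb for free in node_free_mb)

def pick_host(vm_mem_mb, hosts):
    # hosts: {host_name: [free MB on node0, free MB on node1, ...]}
    single = [h for h in hosts if fits_single_node(vm_mem_mb, hosts[h])]
    if single:
        # prefer the host whose best node has the most headroom
        return max(single, key=lambda h: max(hosts[h]))
    return max(hosts, key=lambda h: sum(hosts[h]))   # fall back to total free

hosts = {'hostA': [12000, 3000], 'hostB': [7000, 7000]}
print(pick_host(8000, hosts))   # -> hostA: the 8 GB guest fits on one node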

Ok.

>
>>>   2) Virtual NUMA topology - Emulate a NUMA topology inside the VM.
>>
>> Yes. Irrespective of any NUMA specified for the backing resources of
>> a guest... when the guest size increases it is a "required" practice
>> to have a virtual NUMA topology enabled. (This helps the OS running
>> inside the guest to scale/perform much better by making NUMA-aware
>> decisions etc. It also helps the applications running in that OS to
>> scale/perform better.)
>
> Agreed.  One point I might make then... Should the VM creation process 
> on engine automatically configure virtual NUMA (even if the user 
> doesn't select it) once a guest reaches a certain memory size?


Good point, and yes, we have thought about it a little bit... (BTW, it's
not just the memory size but the number of vCPUs too.)
Perhaps mimic the host topology etc., but there could be some issues, so
we wanted to defer this to a future oVirt version. (BTW, we are aware of
at least one other competing hypervisor management tool that does this
automatically.)
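
A toy sketch of such a heuristic follows; the per-node thresholds are
made-up placeholders, and a real implementation would more likely mirror
the host topology.

def suggested_vnuma_nodes(vm_mem_mb, vm_vcpus, host_nodes,
                          mem_per_node_mb=65536, vcpus_per_node=8):
    # how many virtual nodes the memory size and vCPU count each suggest
    by_mem = -(-vm_mem_mb // mem_per_node_mb)     # ceiling division
    by_cpu = -(-vm_vcpus // vcpus_per_node)
    nodes = max(1, by_mem, by_cpu)
    # never exceed the host's node count or the guest's vCPU count
    return min(nodes, host_nodes, vm_vcpus)

print(suggested_vnuma_nodes(vm_mem_mb=262144, vm_vcpus=32, host_nodes=4))  # -> 4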

>
>>> These two use cases are intertwined because VMs with NUMA can be 
>>> scheduled with more flexibility (albeit with more sophistication) 
>>> since the scheduler can fit the VM onto hosts where the memory can 
>>> be split across multiple Host NUMA nodes.
>>>
>>>   3) Manual NUMA pinning - Allow advanced admins to schedule a VM
>>>   to run on a specific host with a manual pinning strategy.
>>
>> Yes
>>
>>>
>>> Most of these use cases involve the engine scheduler and engine UI.
>>
>> Correct.
>>
>>> There is not much for MOM to do to support their direct
>>> implementation.  We should focus on managing interactions with other
>>> SLA features that MOM does implement:
>>>   - How should KSM be adjusted when NUMA is in effect?  In a NUMA
>>>     host, are there numa-aware KSM tunables that we should use?
>>>   - When ballooning VMs, should we take into account how much memory
>>>     we need to reclaim from VMs on a node by node basis?
>>
>> If MOM had the NUMA topology information of the host, I believe it
>> should be able to determine where the guest-related processes are
>> currently running on the host (irrespective of how those guests ended
>> up there). MOM can then use all the relevant information (NUMA
>> topology, statistics, SLAs, etc.) to decide and direct KSM and
>> ballooning in a NUMA-friendly way...
>
> Yes, exactly.  For example, only run ksm on nodes where there is 
> memory pressure and only balloon guests whose memory resides on nodes 
> with a memory shortage.
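
For illustration, per-node memory pressure could be read from sysfs
roughly like this (the threshold is an arbitrary example); a policy
could then enable KSM, or pick balloon victims, only for the nodes
returned here.

import glob
import re

def node_free_kb():
    # map NUMA node id -> MemFree (kB), from /sys/devices/system/node/node*/meminfo
    free = {}
    for path in glob.glob('/sys/devices/system/node/node*/meminfo'):
        node = int(re.search(r'node(\d+)', path).group(1))
        with open(path) as f:
            for line in f:
                if 'MemFree:' in line:
                    free[node] = int(line.split()[-2])   # value is reported in kB
    return free

def pressured_nodes(threshold_kb=2 * 1024 * 1024):
    # nodes whose free memory dropped below the (made-up) threshold
    return [n for n, kb in node_free_kb().items() if kb < threshold_kb]

print(pressured_nodes())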

That's correct..

>
>>>
>>> Lastly, let's see if MOM needs to manage the existing NUMA utilities 
>>> in place on the system.  I don't know much about AutoNUMA.  Does it 
>>> have tunables that should be adjusted or is it completely 
>>> autonomous?
>>
>> For the most part it's automated (that's the whole point of being
>> Automatic... although the technology will mature in phases :))... but
>> if someone really, really needs it to be disabled, they can do so.
>>
>> There are certainly some NUMA related tunables in the kernel today 
>> (as shown below)....but at this point I am not very sure about the 
>> specific scenarios where one would really need to change these 
>> default settings.  (As we do more studies of various use cases on 
>> different platforms and workload sizes etc there may be a need...but 
>> at this point I don't see MOM necessarily getting involved in these 
>> settings. Does MOM change other kernel tunables today ? ).
>>
>>
>> # sysctl -a | grep numa
>> kernel.numa_balancing = 1
>> kernel.numa_balancing_scan_delay_ms = 1000
>> kernel.numa_balancing_scan_period_max_ms = 60000
>> kernel.numa_balancing_scan_period_min_ms = 1000
>> kernel.numa_balancing_scan_size_mb = 256
>> kernel.numa_balancing_settle_count = 4
>> vm.numa_zonelist_order = default
>
> These remind me of the KSM tunables.  Maybe some day we will be clever 
> enough to tune them but you're right, it should not be our first 
> priority.  One idea I have for MOM is that it could check up on 
> autonuma by checking /proc/<pid>/numa_maps for each qemu process on 
> the host and seeing if autonuma is keeping the process reasonably 
> balanced.  If not, we could actually raise an alarm so that 
> vdsm/engine would try and migrate a VM away from this host if 
> possible.  Once that is done, autonuma might be able to make better 
> progress.  This is really just a research level idea at the moment.
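
A rough sketch of that research idea, reading /proc/<pid>/numa_maps and
flagging a process whose pages are concentrated on one node (the cut-off
ratio is arbitrary):

import re
from collections import defaultdict

def numa_page_spread(pid):
    # pages per NUMA node for one process, parsed from the Nx=count fields
    pages = defaultdict(int)
    with open('/proc/%d/numa_maps' % pid) as f:
        for line in f:
            for node, count in re.findall(r'\bN(\d+)=(\d+)', line):
                pages[int(node)] += int(count)
    return dict(pages)

def looks_unbalanced(pid, ratio=0.8):
    # True if a single node holds more than `ratio` of the process's pages
    pages = numa_page_spread(pid)
    total = sum(pages.values())
    return total > 0 and max(pages.values()) > ratio * total

# MOM could run looks_unbalanced() over each qemu pid and raise an alarm
# so that vdsm/engine can consider migrating a VM away from the host.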

Ok. I agree that this can be deferred to a later phase (based on further
investigation)
>
>>> Does libvirt have any NUMA tuning APIs that MOM may want to call to 
>>> enhance performance in certain situations?
>>
>> I am no expert on libvirt's philosophy/goals etc. and have always 
>> viewed libvirt as providing APIs for provisioning/controlling the 
>> individual guests either on the local or in some cases remote 
>> hosts....but not changing the host wide parameters/tunables itself. I 
>> shall let libvirt experts comment if that is not the case...
>>
>> If we do identify valid use cases where NUMA related tunables need to 
>> be changed then MOM can use mechanisms similar to sysctl etc. to 
>> change them... but I am yet to envision such a scenario (beyond the 
>> rare use cases where oVirt upon user request may choose to entirely 
>> disable automatic NUMA balancing feature on a given host)
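
For the rare disable-it-entirely case, a sketch along these lines
(writing the same procfs file that sysctl fronts; needs root) would be
enough:

def set_numa_balancing(enabled):
    # equivalent to: sysctl -w kernel.numa_balancing={0,1}
    with open('/proc/sys/kernel/numa_balancing', 'w') as f:
        f.write('1' if enabled else '0')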
>>
>> Hope that makes some sense...  Thanks Vinod
>
> Fair enough.  You're right that it doesn't want to handle policy, but 
> in some cases it provides APIs that allow a management system to tune 
> things.  For example: CPU pinning, IO/Net throttling, CPU shares, 
> balloon.
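
For example, with the libvirt Python bindings (the guest name and values
below are placeholders, and a running guest is assumed; parameter names
and accepted values can vary with the libvirt version and driver):

import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('myguest')            # placeholder guest name

dom.pinVcpu(0, (True, False, False, False))   # pin vCPU 0 to host pCPU 0
dom.setMemory(2 * 1024 * 1024)                # balloon target, in KiB
dom.setSchedulerParameters({'cpu_shares': 512})   # relative CPU weight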

Yes... however the above examples still fall into the category of
managing guests and not the host itself :)  But I get your point...

Thanks
Vinod


>
>>> One of the main questions I ask when trying to decide if MOM should 
>>> manage a particular setting is: "Is this something that is set once 
>>> and stays the same or is it something that must change dynamically 
>>> in accordance with current system conditions?"  In the former case, 
>>> it is probably best managed by engine or vdsm directly.  In the 
>>> latter case, it fits the MOM model.
>>>
>>> Hope this was helpful!  Please feel free to continue engaging this 
>>> list with any additional questions that you might have.
>>>
>>>> On engine side, there is only one button with this feature: Sync 
>>>> MoM Policy, right?
>>>>
>>>> On vdsm side, I saw the momIF is working for this, right?
>>>>
>>>> Best Regards, Jason Liao
>>>>
>>> -- Adam Litke
>>>
>>> [Jason] + Martin's part:
>>>
>>> Hi,
>>>
>>>> In my understanding, MOM collects data from both the host and the
>>>> guests and sets the right policy for KSM and memory ballooning to
>>>> get better performance.
>>> Correct. MoM controls the guest memory allocations using KSM and
>>> ballooning and allows overcommitment to work this way. It does not
>>> really set the policy though; it contains the policy and uses it to
>>> dynamically update the memory space available to VMs.
>>>
>>>> I am not sure how it relates to NUMA; can anyone explain it to me?
>>> In theory MoM might be able to play with ballooning on a per-node
>>> basis.
>>>
>>> Without NUMA information it would free memory somewhere on the host, 
>>> but that memory might be too slow to access because it won't be 
>>> localized on nearby nodes.
>>>
>>> With NUMA information MoM will know which VMs can be ballooned so
>>> that the newly released memory segments are a bit closer to each
>>> other.
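
A tiny sketch of that idea, assuming we already know (e.g. from
numa_maps as above) which node holds most of each VM's memory:

def balloon_candidates(vm_nodes, pressured):
    # vm_nodes: {vm_name: node holding most of its pages}
    # pressured: set of node ids that are short on memory
    return [vm for vm, node in vm_nodes.items() if node in pressured]

print(balloon_candidates({'vm1': 0, 'vm2': 1, 'vm3': 1}, pressured={1}))
# -> ['vm2', 'vm3']; ballooning these frees memory close to where it is needed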
>>>
>>>> On engine side, there is only one button with this feature: Sync 
>>>> MoM Policy, right?
>>> There is also a Balloon device checkbox in the Edit VM dialog and an
>>> Enable ballooning option in the Edit Cluster dialog.
>>>
>>>> On vdsm side, I saw the momIF is working for this, right?
>>> Yes, momIF is responsible for the MoM specific communication and for 
>>> creating the policy file with parameters.
>>>
>>> MoM also uses standard VDSM APIs to get other information and you 
>>> can see that in MoM's source code in hypervisor_interfaces/vdsm 
>>> (that interface is then used by collectors).
>>>
>>> Regards
>>>
>>> -- Martin Sivak msi...@redhat.com
>>
>

_______________________________________________
vdsm-devel mailing list
vdsm-devel@lists.fedorahosted.org
https://lists.fedorahosted.org/mailman/listinfo/vdsm-devel
