On 03/08/2013 03:30 AM, Mark Wu wrote:
On 03/08/2013 06:11 AM, Dan Kenigsberg wrote:
On Thu, Mar 07, 2013 at 12:25:54PM +0100, Vinzenz Feenstra wrote:
Please find the prettier version on the wiki:

  Proposal VDSM - Engine Data Statistics Retrieval

    VDSM <=> Engine data retrieval optimization


Currently the RHEVM engine is polling a lot of data from VDSM
every 15 seconds. This should be optimized, and the data
requested should be more specific.
It feels like a good idea, but do you have numbers? How much traffic
would be saved? Remember the added computation incurred on each host -
there's always a price to pay.
Well, the data of a single, really basic VM is about 4 KiB in the output of vdsClient, and the XML-RPC formatted body is almost 16 KiB. This data is queried every 15 seconds (previously 10), with little value in having ALL of it sent every time; the engine does not even use all of the data all the time. This optimization must be seen on a bigger scale: if you have a datacenter with, let's say, 1000 VMs, then the data to be transmitted and parsed by the engine every 15 seconds is about 16 MiB. The optimization wouldn't pay off much in a 2 server, 20 VM datacenter, but on a larger scale it has quite a big impact.
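
As a rough check of the figures above, the per-poll and per-second volume can be computed directly. This is only a back-of-the-envelope sketch: the function name is made up for illustration, and it assumes every VM produces the same 16 KiB XML-RPC body.

```python
KIB = 1024
MIB = 1024 * KIB


def polling_volume(vm_count, bytes_per_vm=16 * KIB, interval_s=15):
    """Return (bytes per poll, bytes per second) for a uniform set of VMs."""
    per_poll = vm_count * bytes_per_vm
    return per_poll, per_poll / float(interval_s)


per_poll, per_second = polling_volume(1000)
print("%.2f MiB per poll, %.2f MiB/s" % (per_poll / float(MIB),
                                         per_second / float(MIB)))
```

With 1000 VMs this comes out slightly under 16 MiB per poll, which matches the estimate in the mail; real VMs with more disks and interfaces would produce larger bodies.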

For each VM the data currently contains much more information than
is actually needed, which inflates the size of the XML content
considerably. We could optimize this by splitting the getVmStats
reply into sections, based on the request of the engine. For this reason
Omer Frenkel and I have split up the data into parts based on their

This data can and usually does change during the lifetime of the VM.

        Rarely Changed:

This data does not change very frequently and it should be enough to
update it only once in a while. Most commonly this data changes
after changes made in the UI or after a migration of the VM to
another host.

    *Status*  = Running
Status does not change much, but when it does, it is important to report
that quickly.
This is done by the list command which is executed every 2 seconds (maybe 3?)
For this kind of data, it is suitable to use an event report, which should be available in the jsonrpc API.

    *acpiEnable*  = true
    *vmType*  = kvm
    *guestName*  = W864GUESTAGENTT
    *displayType*  = qxl
    *guestOs*  = Win 8
    *kvmEnable*  = true #/*this should be constant and never changed*/
    *pauseCode*  = NOERR
    *monitorResponse*  = 0
    *session*  = Locked # unused
    *netIfaces*  = [{'name': 'Realtek RTL8139C+ Fast Ethernet NIC', 'inet6': ['fe80::490c:92bb:bbcc:9f87'], 'inet': [''], 'hw': '00:1a:4a:22:3c:db'}]
    *appsList*  = ['RHEV-Tools 3.2.4', 'RHEV-Agent64 3.2.3', 'RHEV-Serial64 3.2.3', 'RHEV-Network64 3.2.2', 'RHEV-Network64 3.2.3', 'RHEV-Block64 3.2.3', 'RHEV-Balloon64 3.2.3', 'RHEV-Balloon64 3.2.2', 'RHEV-Agent64 3.2.2', 'RHEV-USB 3.2.3', 'RHEV-Block64 3.2.2', 'RHEV-Serial64 3.2.2']
    *pid*  = 11314
    *guestIPs*  = # duplicated info
    *displayIp*  = 0
    *displayPort*  = 5902
    *displaySecurePort*  = 5903
    *username*  = user@W864GUESTAGENTT
    *clientIp*  =
    *lastLogin*  = 1361976900.67

        Often Changed:

This data changes quite often, however it is not necessary to
update it every 15 seconds. As this is cumulative data that
reflects the current status, it does not need to be snapshotted
every 15 seconds to retrieve statistics. The data can be retrieved
in much more generous time slices (e.g. every 5 minutes).

    *network*  = {'vnet1': {'macAddr': '00:1a:4a:22:3c:db', 'rxDropped': '0', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'unknown', 'speed': '100', 'name': 'vnet1'}}
    *disksUsage*  = [{'path': 'c:\\', 'total': '64055406592', 'fs': 'NTFS', 'used': '19223846912'}, {'path': 'd:\\', 'total': '3490912256', 'fs': 'UDF', 'used': '3490912256'}]
    *timeOffset*  = 14422
    *elapsedTime*  = 68591
    *hash*  = 2335461227228498964
    *statsAge*  = 0.09 # unused

        Often Changed but unused

This data does not seem to be used in the engine at all. It is *not*
even used in the data warehouse.

    *memoryStats*  = {'swap_out': '0', 'majflt': '0', 'mem_free': '1466884', 'swap_in': '0', 'pageflt': '0', 'mem_total': '2096736', 'mem_unused': '1466884'}
    *balloonInfo*  = {'balloon_max': 2097152, 'balloon_cur': 2097152}
    *disks*  = {'vda': {'readLatency': '0', 'apparentsize': '64424509440', 'writeLatency': '1754496', 'imageID': '28abb923-7b89-4638-84f8-1700f0b76482', 'flushLatency': '156549', 'readRate': '0.00', 'truesize': '18855059456', 'writeRate': '952.05'}, 'hdc': {'readLatency': '0', 'apparentsize': '0', 'writeLatency': '0', 'flushLatency': '0', 'readRate': '0.00', 'truesize': '0', 'writeRate': '0.00'}}
I am pretty sure that {read,write,flush}Latency is collected and
reported by Engine. `git grep writeLatency` reinforces my vague memory.
Ok, well, we did just a quick query about the usage, and we searched for the keys rather than for the individual entries. Good to know that we need to be a bit more specific about the individual entries to classify them more appropriately.

        Very frequent updates needed by webadmin portal:

This data is mostly needed for the webadmin portal and might be
required to be updated quite often. An exception here is the
statsAge field, which seems to be unused by the Engine. This data
could be requested every 15 seconds to keep things as they are now.

    *cpuSys*  = 2.32
    *cpuUser*  = 1.34
    *memUsage*  = 30

    Proposed Solution for VDSM & Engine:

We will introduce new optional parameters to getVmStats,
getAllVmStats and list to allow a finer grained specification of
data which should be included.

*Parameter:* *statsType*=*<string>* (getVmStats, getAllVmStats only)

*Allowed values:*

  * full (default to keep backwards compatibility)
  * app-list (Just send the application list)
  * rare (include everything from rarely changed to very frequent)
  * often (include everything from often changed to very frequent)
  * frequent (only send the very frequently changed items)
I think that a nice way to think of this is that Engine asks for a set
of keys it is interested in. Asking for getVmStats(keys=[displayType,
netIfaces]) would return only the requested values of the VM.
I was thinking of that as well or a way to exclude things from the list.
+1. It could split the information according to different functions, not just change frequency.
I would say go for one or the other; supporting both wouldn't make much sense.
"rare", "often" and "frequent" are simply pre-defined sets of key names.
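
The "sets of key names" view can be sketched as follows. This is only an illustration, not VDSM code: select_stats and the exact membership of each set are assumptions, with key names borrowed from the sample output earlier in this thread. Each set includes the more frequently changing ones, as in the proposed statsType values.

```python
# Illustrative grouping only; the real classification is still under discussion.
FREQUENT = frozenset(["cpuSys", "cpuUser", "memUsage"])
OFTEN = FREQUENT | frozenset(["network", "disksUsage", "timeOffset",
                              "elapsedTime", "hash"])
RARE = OFTEN | frozenset(["displayType", "guestOs", "netIfaces", "pid"])

PREDEFINED = {"frequent": FREQUENT, "often": OFTEN, "rare": RARE}


def select_stats(stats, stats_type="full", keys=None):
    """Return only the requested entries of one VM's stats dict.

    An explicit `keys` list wins over `stats_type`; "full" returns
    everything, which keeps today's behaviour as the default.
    """
    if keys is None:
        if stats_type == "full":
            return dict(stats)
        keys = PREDEFINED[stats_type]
    wanted = frozenset(keys)
    return dict((k, v) for k, v in stats.items() if k in wanted)
```

A call such as select_stats(stats, keys=["displayType", "netIfaces"]) then corresponds to the getVmStats(keys=[...]) request suggested above, while stats_type="frequent" is just shorthand for one of the pre-defined sets.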

A side effect of this pov is that we can avoid the vague name

*Parameter:* *clientId*=*<string>* The client id is specified by the
client and should be unique, but used consistently across requests.

*Parameter:* *diff*=*<boolean>* In combination with the clientId
VDSM will send only differences to the previous request from the
named clientId. (if diff=true)
The semantics of "diff" is not completely defined: how about complex
structures like that of "network"? It is most likely to be reported
every time.
Well, the idea was a per-key evaluation, maybe per device/interface in cases like network and disks.
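
A per-key evaluation with per-device handling for nested entries might look like the sketch below. The function name and the per_device_keys parameter are hypothetical, and reporting of removed keys or devices is left out, since the thread does not define it.

```python
def diff_stats(previous, current, per_device_keys=("network", "disks")):
    """Return the entries of `current` that differ from `previous`.

    Keys listed in per_device_keys are compared per device/interface,
    so an unchanged interface is left out of the result.
    """
    changed = {}
    for key, value in current.items():
        old = previous.get(key)
        if (key in per_device_keys and isinstance(value, dict)
                and isinstance(old, dict)):
            # Compare each device on its own instead of the whole structure.
            sub = dict((dev, data) for dev, data in value.items()
                       if dev not in old or old[dev] != data)
            if sub:
                changed[key] = sub
        elif key not in previous or old != value:
            changed[key] = value
    return changed
```

With this approach "network" is only reported for the interfaces whose counters or rates actually changed, which addresses the concern that the whole structure would be resent every time.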

Since this requires a caching mechanism on vdsm side, Engine must expect
that the cache may be evicted in any moment, and that a full list is
Well the engine should always expect that.
Every data collector should be responsible to invalidate/update the cache.
It could reduce the time to calculate the diff.
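
To illustrate the eviction concern: a cache keyed by clientId can fall back to a full reply whenever it has no baseline for that client. The class below is a hypothetical sketch, and the is_full flag in the return value is an assumption about how the engine could tell a full reply from a diff.

```python
class StatsCache(object):
    """Remembers the last stats sent to each clientId (hypothetical class)."""

    def __init__(self):
        self._last = {}

    def evict(self, client_id=None):
        # Drop one client's entry, or everything when no id is given.
        if client_id is None:
            self._last.clear()
        else:
            self._last.pop(client_id, None)

    def reply(self, client_id, stats, diff=False):
        """Return (is_full, payload) for the current stats."""
        previous = self._last.get(client_id)
        self._last[client_id] = dict(stats)
        if not diff or previous is None:
            # No baseline for this client: send everything.
            return True, dict(stats)
        changed = dict((k, v) for k, v in stats.items()
                       if k not in previous or previous[k] != v)
        return False, changed
```

Because the reply says whether it is full, the engine never has to guess whether the cache on the VDSM side survived since its last request.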

      Additional Change:

Besides the introduction of the new parameters for list, getVmStats
and getAllVmStats it might make sense to include a hash for the
appList into the rarely changed section of the response, which would
allow identifying changes and avoid having to send the complete
appList every so often and only if the hash known to the client is

*Note:* The appList (Application List) reported by the guest agent
could be fully implemented on request only, as long as the guest
agent installed supports this. As there seems to be a request to
have the complete list of installed applications on all guests this
data could be quite extensive and a huge list. On the other hand
this data is only rarely visible and therefore it should not be
requested all the time and only on demand.
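
The appList hash from the Additional Change section could be computed along these lines. This is a sketch under assumptions: the function names are made up, and SHA-1 over a sorted JSON dump is just one way to get a digest that does not depend on the order in which the guest reports applications.

```python
import hashlib
import json


def app_list_hash(apps):
    """Return a stable hex digest for a list of application names."""
    canonical = json.dumps(sorted(apps)).encode("utf-8")
    return hashlib.sha1(canonical).hexdigest()


def needs_app_list(known_hash, apps):
    """True when the hash known to the client no longer matches the list."""
    return known_hash != app_list_hash(apps)
```

The engine would keep the last hash it saw and request the complete appList only when the hash in the rarely changed section differs from it.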

      Improvement of the Guest Agent:

As part of the proposed solution it is necessary to improve the
guest agent as well.
Improving the agent may be a good idea, but I do not see the necessity
in it.
The guest agent is doing 'expensive' queries (e.g. "application_list") way too often. And things like network interfaces, disk usage and installed applications won't usually change every n minutes.
Those queries could be much more reactive than proactive.
It's also important to improve the horrible multithreaded
vdsm/libvirt statistics acquisition, but just as unrelated to the core
of this feature.

For the full application list, a caching system should be
implemented which will be fully reactive and should
not, for example, poll the application list all the time. The guest
can create a prepared data file containing all data in the JSON
format (as used for the communication with VDSM via VIO), and then just
has to read that file from disk and send it directly to VDSM.
However, it is quite possible that this list is too big and it might
have to be chunked into pieces (multiple messages, which would then
have to be supported by VDSM as well). The solution for this is to
make VDSM request this data, so that it retrieves the necessary data
on request only.
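
Chunking the application list into multiple messages could be sketched as below. The message fields ("part", "total", "applications") are hypothetical and not the guest agent's actual wire format, and the split is by item count rather than by byte size to keep the example short.

```python
import json


def chunk_app_list(apps, max_items=50):
    """Yield JSON messages that together carry the whole application list."""
    total = max(1, (len(apps) + max_items - 1) // max_items)
    for index in range(total):
        part = apps[index * max_items:(index + 1) * max_items]
        yield json.dumps({"part": index, "total": total,
                          "applications": part})


def reassemble(messages):
    """Rebuild the list on the receiving side; fail if a part is missing."""
    parts = {}
    total = None
    for raw in messages:
        msg = json.loads(raw)
        parts[msg["part"]] = msg["applications"]
        total = msg["total"]
    if total is None or set(parts) != set(range(total)):
        raise ValueError("incomplete application list")
    return [app for index in range(total) for app in parts[index]]
```

Numbering the parts lets the receiver detect a missing message and cope with parts arriving out of order; an empty list still produces one message so the receiver can tell "no applications" from "no answer".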
vdsm-devel mailing list


Vinzenz Feenstra | Senior Software Engineer
RedHat Engineering Virtualization R & D
Phone: +420 532 294 625
IRC: vfeenstr or evilissimo

Better technology. Faster innovation. Powered by community collaboration.
See how it works at redhat.com
