Public bug reported:

When using the libvirt CPU monitor (i.e., virt_driver) for metrics collection, I sporadically noticed cases where cpu.user.percent + cpu.kernel.percent + cpu.idle.percent didn't add up to 100, which they always should. It happened rarely, so it was quite difficult to track down, but after adding several debug logs and watching them over time, I found the problem.

If you look at this code:

https://github.com/openstack/nova/blob/master/nova/compute/monitors/cpu/virt_driver.py#L52

... you'll notice a built-in assumption: within a given "round" of metrics gathering, the collective time to call metric_driver.get_metric(metric_name) for every metric (keep in mind there are 10 metrics right now) won't exceed 1 second. If it does, the later calls are treated as the "next" round of metric collection. So for the first metric in a round we refresh the host CPU stats, and the subsequent n-1 calls simply use the cache, which yields a coherent answer (i.e., the percentages all sum to 100% as you'd expect).

However, in some cases (e.g., if the system is under stress), I've seen this code:

https://github.com/openstack/nova/blob/master/nova/compute/monitors/cpu/virt_driver.py#L60

... take more than 1 second to execute. Because the timestamp is cached before that call is made, the cache already looks stale when the next metric is requested, so the data is refreshed again within the "same" metrics round. That can yield incoherent results (e.g., percentages that sum to less than or more than 100, which makes for some interesting data points). :-)

The fix is simple: move the timestamp cache update to *after* the host stats have been collected.

P.S. This problem is occurring on Liberty (and I suspect it would happen on older releases too).

** Affects: nova
     Importance: Undecided
     Assignee: Joe Cropper (jwcroppe)
         Status: In Progress

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1490837

Title:
  Sporadic incoherent metrics when driver.get_host_cpu_stats takes longer than 1 second to execute

Status in OpenStack Compute (nova):
  In Progress
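
To make the timestamp ordering issue from the report concrete, here is a minimal, self-contained sketch. It is not the actual nova code: FakeClock, SlowDriver, Monitor, the stamp_after_fetch flag, the "sample" key and collect_round are all hypothetical names invented for illustration, and only get_host_cpu_stats and get_metric echo names from the report. It assumes the stats call takes 2 seconds and that a cache entry is reused when it is at most 1 second old.

```python
import datetime


class FakeClock:
    """Hypothetical manual clock so the example needs no real sleeps."""

    def __init__(self):
        self.now = datetime.datetime(2015, 9, 1)

    def utcnow(self):
        return self.now

    def advance(self, seconds):
        self.now += datetime.timedelta(seconds=seconds)


class SlowDriver:
    """Hypothetical stand-in for a virt driver on a stressed host."""

    def __init__(self, clock):
        self.clock = clock
        self.calls = 0

    def get_host_cpu_stats(self):
        self.calls += 1
        self.clock.advance(2)  # assumption: the call takes 2 seconds
        # "sample" records which refresh produced these numbers.
        return {"user": 10, "kernel": 5, "idle": 85, "sample": self.calls}


class Monitor:
    """Simplified cache: reuse stats that are at most 1 second old."""

    def __init__(self, driver, clock, stamp_after_fetch):
        self.driver = driver
        self.clock = clock
        self.stamp_after_fetch = stamp_after_fetch
        self._data = {}

    def _update_data(self):
        now = self.clock.utcnow()
        last = self._data.get("timestamp")
        if last is not None and (now - last).total_seconds() <= 1:
            return  # still the same round, keep the cached stats
        stats = self.driver.get_host_cpu_stats()
        if self.stamp_after_fetch:
            # Proposed ordering: take the timestamp after the slow call.
            now = self.clock.utcnow()
        self._data = dict(stats, timestamp=now)

    def get_metric(self, name):
        self._update_data()
        return self._data[name]


def collect_round(stamp_after_fetch):
    """Collect three metrics back to back; return the samples they came from."""
    clock = FakeClock()
    monitor = Monitor(SlowDriver(clock), clock, stamp_after_fetch)
    samples = set()
    for name in ("user", "kernel", "idle"):
        monitor.get_metric(name)
        samples.add(monitor._data["sample"])
    return samples
```

With the timestamp taken before the slow call, collect_round(False) reads the three metrics from three different refreshes, so they need not add up to 100. With the timestamp taken afterwards, collect_round(True) serves all three from a single refresh.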

