** Description changed:

+ [Impact]
+
+  * If running a 32 bit kernel (rare these days, but still existing for
+    some upgraders until we fully drop it) then /proc/vmstat has 32 bit
+    values.
+
+  * These values can wrap at 32 bit and open-vm-tools will not "realize"
+    that, as it assumes only 64 bit values.
+
+  * That causes "just" a spike in the stats being reported, but since
+    higher level consumers of those numbers (e.g. VM placement
+    algorithms) are at work, this can trigger a mass migration off that
+    node, which in turn can make everything worse.
+
+  * Include the upstream fix for that problem to ensure people are not
+    affected by it.
+
+ [Test Case]
+
+  * This is a lot of effort to verify explicitly, but since the change is
+    small, once the test is understood code review will in most cases be
+    enough.
+    - To trigger the error you'd need a VMware guest with a 32 bit
+      kernel. Since i386 is no longer mainstream, the easiest way to get
+      there is to install from
+      http://releases.ubuntu.com/16.04/ubuntu-16.04.5-server-i386.iso
+      and then upgrade to Bionic.
+    - The next thing you'd need to do is check the stat values.
+      To do so you can use the script attached to the bug [1].
+      Run it on the host via:
+      $ python query_vmgueststats.py --vmname <name of the vm> --host localhost --user root --password <root password>
+      These numbers should never "go crazy" due to the wraparound.
+    - Once all this is set up, you'd need to ramp up the numbers of e.g.
+      pgfaults to cause a wraparound. To do so, essentially run a lot of
+      read I/O.
+      This could be done with:
+      $ sudo mkdir /data1
+      $ sudo fio /tmp/seq-read.fio
+      While the config is:
+      $ cat /tmp/seq-read.fio
+      ; Read 4 files with aio at different depths
+      [global]
+      ioengine=libaio
+      buffered=0
+      rw=read
+      bs=128k
+      size=128m
+      directory=/data1
+      iodepth=32
+      direct=1
+      time_based
+      runtime=60s
+
+      [file1]
+
+      [file2]
+
+      [file3]
+
+      [file4]
+
+      Obviously 60 seconds is not enough, and it is recommended to tune
+      the path and disk backing to your needs to run as fast as possible.
+
+    - At the same time run on the guest:
+      $ cat /proc/vmstat | grep pgpgin
+
+    - At some point the numbers of the latter will wrap. Without the fix
+      this will make the VMware observed stats spike to huge values.
+
+ [Regression Potential]
+
+  * Worst case, the numbers we try to fix would get worse (due to the new
+    calculation being wrong). But that would only be "as bad as it is
+    now". Furthermore, the code change is rather small.
+    Also, 64 bit wraparounds are not touched (I wonder why, but let's
+    stick to the upstream code), which means that on 64 bit systems
+    (= most systems) this is a no-op, further reducing the risk of a
+    regression.
+
+ [Other Info]
+
+  * Taking the change was suggested by VMware, who owns the tools as well
+    as most solutions consuming the stats, so we'd like to follow that
+    request.
+
+ [1]: https://bugs.launchpad.net/ubuntu/+source/open-vm-tools/+bug/1793219/+attachment/5193417/+files/query_vmgueststats.py
+
+
+ ---
+
  Reported at Debian as well, see
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=909146 :

  There is an unhandled overflow issue in open-vm-tools in the code for
  guest stats reporting. This causes artifacts (spikes) in rate stats, for
  example "Guest|Page In Rate per second".

  This issue only affects 32 bit builds of open-vm-tools.

  We have a fix for 10.3.x at
  https://github.com/vmware/open-vm-tools/commit/c7a186e204cdff46b5e02bcb5208ef8979eaf261

  The fix has also been backported to 10.2.5 in a special branch:
  https://github.com/vmware/open-vm-tools/tree/stable-10.2.5-stat-overflow-fix

  Thanks,
  Oliver
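
To illustrate why a wrapped 32 bit counter turns into a huge spike, here is a minimal Python sketch. It is not the upstream fix and does not reflect how open-vm-tools is implemented; the function names (parse_vmstat, naive_delta, wrap_aware_delta) and the sample counter values are made up for illustration. It only assumes that the counter wraps at most once between two samples.

```python
U32_MODULUS = 1 << 32  # a 32 bit counter wraps back to 0 after 2**32 - 1
U64_MODULUS = 1 << 64


def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def naive_delta(prev, curr):
    """Unsigned difference computed as if the counter were 64 bits wide."""
    return (curr - prev) % U64_MODULUS


def wrap_aware_delta(prev, curr, modulus=U32_MODULUS):
    """Difference that tolerates one wrap of a counter with the given modulus."""
    return (curr - prev) % modulus


if __name__ == "__main__":
    # Hypothetical samples: pgpgin wrapped between the two reads, pgfault did not.
    before = parse_vmstat("pgpgin 4294967000\npgfault 100\n")
    after = parse_vmstat("pgpgin 704\npgfault 250\n")
    for name in before:
        print(name,
              naive_delta(before[name], after[name]),
              wrap_aware_delta(before[name], after[name]))
```

With these sample values the 64 bit style subtraction reports an enormous delta for pgpgin, while the 32 bit aware version reports the 1000 pages that were actually read in. Counters that did not wrap give the same result either way.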

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1793219

Title:
  open-vm-tools guest stats overflow
