On Sat, Mar 28, 2015 at 10:20:25AM -0400, John Taylor wrote: > Daniel Helgenberger <[email protected]> writes: > > > Hello Everyone, > > > > I did create the original BZ on this. In the mean time, lab system I > > used is dismantled and the production system is yet to deploy. > > > > As I wrote in BZ1147148 [1], I experienced two different issues. One, > > one big mem leak of about 15MiB/h and a smaller one, ~300KiB. These seem > > unrelated. > > > > The larger leak was indeed related to SSL in some way; not necessarily > > M2Crypto. However, after disabling SSL this was gone leaving the smaller > > leak. > > > > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1147148 > > > I think there are, at least for the purpose of this discussion, 3 leaks: > 1. the M2Crypto leak > 2. a slower leak > 3. a large leak that's not M2Crypto related that's part of sampling > > My efforts have been around finding the source of my larger leak, which > I think is #3. I had disabled ssl so I knew that M2Crypto > isn't/shouldn't be the problem as in bz1147148, and ssl is beside the > point as it happens with a deactived host. It's part of sampling which > always runs. > > What I've found is, after trying to get the smallest reproducer, that > it's not the netlink.iter_links that I commented on in [1] that is the > problem. But in the _get_intefaces_and_samples loop is the call to > create an InterfaceSample and that has getLinkSpeed() which, for vlans, > ends up calling ipwrapper.getLink, and that to > netlink.get_link(name) > > netlink.get_link(name) *is* the source of my big leak. This is vdsm > 4.16.10, so it is [2] and it's been changed in master for the removal of > support for libnl v1 so it might not be a problem anymore. > > def get_link(name): > """Returns the information dictionary of the name specified link.""" > with _pool.socket() as sock: > with _nl_link_cache(sock) as cache: > link = _rtnl_link_get_by_name(cache, name) > if not link: > raise IOError(errno.ENODEV, '%s is not present in the system' > % > name) > return _link_info(cache, link) > > > The libnl documentation note at [3] says that for the rtnl_link_get_by_name > function > "Attention > The reference counter of the returned link object will be incremented. > Use rtnl_link_put() to release the reference." > > So I took that hint, and made a change that does the rtnl_link_put() in > get_link(name) and it looks like it works for me. > > diff oldnetlink.py netlink.py > 67d66 > < return _link_info(cache, link) > 68a68,70 > > li = _link_info(cache, link) > > _rtnl_link_put(link) > > return li > 333a336,337 > > > > _rtnl_link_put = _none_proto(('rtnl_link_put', LIBNL_ROUTE)) > > Hope that helps. And if someone else could confirm that would be great. > > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1158108 > [2] > https://gerrit.ovirt.org/gitweb?p=vdsm.git;a=blob;f=lib/vdsm/netlink.py;h=afae5cecb5ce701d00fb8f019ec92b3331a39036;hb=5608cfdf43db9186dabac4b2a779f9557e798968 > [3] > http://www.infradead.org/~tgr/libnl/doc/api/group__link.html#ga1d583e4f0b43c89d854e5e681a529fad
Thanks, John, for a great detective work. I'm afraid that with even on the master branch we keep calling rtnl_link_get_link() and rtnl_link_get_by_name() without clearing the reference count, so a fix is due there, too. Would you consider posting a fully-fledged fix to gerrit? I still need to understand what is the use of that refcount, so that we do not release it too early. Regards, Dan. _______________________________________________ Users mailing list [email protected] http://lists.ovirt.org/mailman/listinfo/users

