I promised Seth on IRC last Friday that I'd explain the rpm transaction (python) callback: what's the deal with avoiding headers to save memory etc. So here goes... This is long, so go grab a coffee first.

I'll start with a little bit of history. Please remember this is not the "absolute truth of what really happened" but just my interpretation of things, based on bits and pieces of information from commit logs and other public archives.

Our story starts at the birth of Anaconda in 1999, in this early commit:
http://git.fedorahosted.org/git/?p=anaconda.git;a=commitdiff;h=c6ca9181446e3ad83a46fe031a4688a92f9f0f98

+ts = rpm.TransactionSet(rootPath, db)
+
+for p in comps.selected():
+    ts.add(p.h, (p.h, p.h[1000000]))

...

+def cb(what, amount, total, key, data):
+    if (what == rpm.RPMCALLBACK_INST_OPEN_FILE):
+        (h, key) = key
+        data.setPackage(h[rpm.RPMTAG_NAME])
+        d = os.open("/mnt/redhat/test/6.0/i386/RedHat/RPMS/" + key, os.O_RDONLY)
+        return d

There we go, adding (header, pkg_path) tuples as the "key" argument to what was ts.addInstall() back then (the h[1000000] thing was a custom tag added by the genhdlist thing that wrote out the headerlist used by anaconda, containing the package path). This convention has since then been copied to/carried on in nearly every single user of rpm-python (including anaconda, up2date and yum at least).

Ewt wrote this part of the original rpm bindings (anaconda commit f1da6a4807d44c670453978b53d4b6d18b406ec1), so one would assume he knew what he was doing... and in fact back then, it was a "clever trick" to actually /save/ memory in anaconda, due to how rpm worked at that time. Unfortunately he + others (I dunno exact details of who wrote what) botched up various other aspects of the python callback design pretty badly - more on that later.

Fast-forward a few years and rpm had internally started saving memory by scraping just the information it needs for the transaction calculations (dependency checks, ordering, file conflicts etc) from the header passed in the /first/ argument of ts.addInstall(), instead of keeping the entire header around. That turned the "clever trick" of saving memory into a huge waste of memory. But the anaconda habit of using (h, path) tuples for "package keys" stuck around, maybe because nobody clearly explained what's suddenly so wrong with that. There's a remark about a scaling issue related to header use by rpm-python users here: https://lists.dulug.duke.edu/pipermail/rpm-python-list/2003-October/000012.html, but if (note: if) that's the only explanation given to rpm-python users, no wonder it was never understood. I remember boggling at the "headers are deprecated" comments myself back then and completely missing the point.

So what follows is the long, long overdue explanation.

Part of the long-standing confusion has to do with such a silly thing as argument naming, again copied around from anaconda to several places. If you look carefully at the early anaconda commit snippet above, you'll see the callback arguments are named "what, amount, total, key, data" - no headers in there. This is how it should be (except I'd replace "what" with "event", and "data" with "userdata"). At some point somebody replaced the "key" with "h" as in header, because that's pretty much what they got there in the callback, so it makes sense to call it a header and not some obscure "key", right?

But the "key" argument to ts.addInstall() is the key (pun intended) to this whole thing. The first and third arguments, "header" and "how", are for rpm's consumption. But the second argument, the "key", is for /yourself/. What you pass here as the key is the very same object that you get back in the callback's "key" argument for the packages to be installed/updated, so that you can open a file descriptor to a package file and return it to rpm. A couple of trivial examples to demonstrate this (pass paths to local package(s) on the cli to install them):
http://laiskiainen.org/rpm/examples/python/minirpm-1.py
http://laiskiainen.org/rpm/examples/python/minirpm-2.py

See - no headers in the callback, and it still works. The sole purpose of the "key" argument is to let you open and close the package file, nothing more. Also, there are no "keys" for erased elements at all - rpm doesn't need the help of the callback to locate headers of installed packages.
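
For readers who can't fetch the examples, here is a minimal sketch of the same idea that runs without rpm-python installed. The constants INST_OPEN_FILE and INST_CLOSE_FILE are stand-ins for rpm.RPMCALLBACK_INST_OPEN_FILE and rpm.RPMCALLBACK_INST_CLOSE_FILE, and simulate_install() is a hypothetical driver that only mimics rpm handing each key back to the callback; neither is part of rpm itself.

```python
import os
import tempfile

# Stand-ins for rpm.RPMCALLBACK_INST_OPEN_FILE / rpm.RPMCALLBACK_INST_CLOSE_FILE,
# so the sketch runs without rpm-python installed.
INST_OPEN_FILE = "inst_open_file"
INST_CLOSE_FILE = "inst_close_file"

open_fds = {}  # key (package path) -> file descriptor handed out


def callback(what, amount, total, key, userdata):
    """Transaction callback where the key is simply the package path."""
    if what == INST_OPEN_FILE:
        open_fds[key] = os.open(key, os.O_RDONLY)
        return open_fds[key]
    if what == INST_CLOSE_FILE:
        os.close(open_fds.pop(key))
    return None


def simulate_install(keys, cb):
    """Hypothetical driver: hands each key back to the callback, as rpm would."""
    opened = []
    for key in keys:
        fd = cb(INST_OPEN_FILE, 0, 0, key, None)
        opened.append((key, os.fstat(fd).st_size))
        cb(INST_CLOSE_FILE, 0, 0, key, None)
    return opened


def demo():
    # A dummy file plays the role of a package; it is not a real rpm.
    with tempfile.NamedTemporaryFile(suffix=".rpm", delete=False) as f:
        f.write(b"not a real package")
        path = f.name
    try:
        return simulate_install([path], callback)
    finally:
        os.unlink(path)
```

With real rpm the same callback shape would be passed to ts.run(), and the path string given as the second argument of ts.addInstall() is all it needs.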

Now, for a real-world callback, you'll want to be able to show things like name/nevra, size, summary etc of the package(s) being installed and removed. And this is where we get to the rather horrible misdesign of the python callback: rpm obviously has more information available, but in the python bindings, apart from the amount/total counters, the only information you get is what you passed in as the "key" to ts.addInstall(). Since there are no keys for erased packages at all, rpm "helpfully" passes the name of the package as the key so you have at least some clue of what's going on. Which just isn't enough, especially in the multilib era.

So how do you show more information then? These are conveniently available in the header, so why not pass that along here? Well, in order to return the object back to you, rpm needs to hold a reference to it someplace. So what happens behind the scenes of ts.addInstall() is quite literally:

class TransactionSet:
    def __init__(self, ...):
        self.keys = []
        ...

    def addInstall(self, header, key, how='u'):
        self.keys.append(key)
        ...

Rpm itself never looks at the keys beyond passing around a pointer to them - the key is entirely the caller's business and rpm has no use for it (and could not use it even if it wanted to, for that matter). When you pass a header as (part of) the key, it gets pushed onto that list and is not freed until the end of the transaction. To get an idea of the effect, try these two small examples which only differ in the key used:
http://laiskiainen.org/rpm/examples/python/memuse-1.py
http://laiskiainen.org/rpm/examples/python/memuse-2.py

On Fedora 14 DVD contents, I get this:
[pmatilai@localhost pyex]$ ./memuse-1.py /mnt/Packages/*.rpm
Memory used with 2766 packages in transaction: 140820 kB
[pmatilai@localhost pyex]$ ./memuse-2.py /mnt/Packages/*.rpm
Memory used with 2766 packages in transaction: 252752 kB

That's ~110MB worth of extra baggage that neither rpm nor you have any use for, when all you want is to show a few tidbits of information like name, size etc in the callback. It's not entirely unlike lugging your entire personal library (of books, CDs, DVDs or such) around when shopping in order to avoid buying duplicates, when all you'd really need is a list of titles and authors. It doesn't make much of a difference when you have, say, half a dozen of them to carry around, but with hundreds and thousands...
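
The retention itself can be shown without rpm. The following is a rough sketch under stated assumptions: ToyTransactionSet models only the key bookkeeping described above, FakeHeader is a stand-in for a real header (a dict subclass so that it can be weakly referenced), and header_survives() is a hypothetical helper. None of this is rpm code.

```python
import gc
import weakref


class FakeHeader(dict):
    """Stand-in for an rpm header; a dict subclass so it can be weakly referenced."""


class ToyTransactionSet:
    """Hypothetical model of the key bookkeeping only, not real rpm code."""

    def __init__(self):
        self.keys = []

    def addInstall(self, header, key, how="u"):
        # The header is only read during the call; the key is stored as-is.
        self.keys.append(key)


def header_survives(make_key):
    """Return True if the header is still alive after the caller drops it."""
    ts = ToyTransactionSet()
    header = FakeHeader(name="foo", filenames=["/usr/bin/foo"] * 1000)
    probe = weakref.ref(header)
    ts.addInstall(header, make_key(header, "/tmp/foo.rpm"))
    del header      # the caller is done with the header
    gc.collect()
    return probe() is not None
```

Calling header_survives(lambda h, path: (h, path)) reports that the header is still alive, because the key list holds it; with lambda h, path: path it is gone as soon as the caller lets go of it.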

The difference is even more dramatic in reality because rpm goes out of its way (especially since 4.7.0, but to some extent in older versions too) to free up memory for the actual transaction run: all dependency information and very nearly all file data is thrown out, keeping only a couple of integer arrays per package to remember the actions calculated for each file.

Here's a version of the same memory use example, with just the data that an average callback might want for showing a bit of information to the user (i.e. the "title and author" from the analogy above):
http://laiskiainen.org/rpm/examples/python/memuse-3.py

With the same F14 package set I get:
[pmatilai@localhost pyex]$ ./memuse-3.py /mnt/Packages/*.rpm
Memory used with 2766 packages in transaction: 142624 kB

...which is not that much more than the minimum of what rpm itself needs (case 1), with the information that you want to show in the callback added. Contrast this with case 2, which is what yum is currently doing. Here's a third version of the minirpm example, with a custom object used as the callback key to provide a bit of data about the package to the callback:
http://laiskiainen.org/rpm/examples/python/minirpm-3.py

The memory use would be similar to that of memuse-3, which uses dicts. The point is just to further demonstrate that the key can be any damn thing that is convenient /to you/.
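
As a rough illustration of such a custom key, here is a sketch that is not taken from the linked example. PkgKey and describe() are hypothetical names, and the "header" is treated as a plain mapping with string keys so that the code runs without rpm-python; with real headers the lookups would use tags such as rpm.RPMTAG_NAME.

```python
class PkgKey:
    """Hypothetical callback key: just enough to open and describe a package."""

    __slots__ = ("path", "name", "size")

    def __init__(self, path, name, size):
        self.path = path
        self.name = name
        self.size = size

    @classmethod
    def from_header(cls, header, path):
        # Copy a few values out of the header; the header itself is not kept.
        # Plain string keys are an assumption made so this runs on a dict.
        return cls(path, header["name"], header["size"])


def describe(key):
    """Format a progress line for a callback to show.

    For erasures rpm passes the package name (a string) as the key.
    """
    if isinstance(key, PkgKey):
        return "Installing %s (%d bytes) from %s" % (key.name, key.size, key.path)
    return "Removing %s" % key
```

A callback would call describe(key) for display and os.open(key.path, os.O_RDONLY) when asked to open the package; the full header never has to travel along.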

For yum, the most convenient item to pass there would be a txmbr, as I suggested here: http://lists.baseurl.org/pipermail/yum-devel/2011-February/007964.html. Besides being convenient, passing txmbr or txmbr.po as the key would use even less memory than a partial copy of the header data into a dict, as they'd only be references to data that's already in memory. As it's yum that calls ts.addInstall(), it's yum that defines the callback convention for its own API users, so it can't all be changed "just like that" while API compatibility is needed. In any case, even the opt-in partial copy of the header is a HUGE step towards stopping the ancient waste of memory.

I hope this helps explain why I want to change the yum callback convention so badly :) And if something here is not clear to you, please DO ASK. I want to get this straightened out for good, finally.

        - Panu -