[Yum-devel] Dispelling rpm callback myths

Panu Matilainen Tue, 22 Feb 2011 03:43:15 -0800

I promised Seth on IRC last Friday to explain the rpm transaction (python) callback - what's the deal with avoiding headers to save memory etc, so here goes... This is long, so go grab a coffee first.

I'll start with a little bit of history first. Please remember this is not the "absolute truth of what really happened" but just my interpretation of things, based on bits and pieces of information from commit logs and other public archives.


Our story starts at the birth of Anaconda in 1999, in this early commit:
http://git.fedorahosted.org/git/?p=anaconda.git;a=commitdiff;h=c6ca9181446e3ad83a46fe031a4688a92f9f0f98

+ts = rpm.TransactionSet(rootPath, db)
+
+for p in comps.selected():
+    ts.add(p.h, (p.h, p.h[1000000]))

...

+def cb(what, amount, total, key, data):
+    if (what == rpm.RPMCALLBACK_INST_OPEN_FILE):
+       (h, key) = key
+       data.setPackage(h[rpm.RPMTAG_NAME])

+       d = os.open("/mnt/redhat/test/6.0/i386/RedHat/RPMS/" + key, os.O_RDONLY)

+       return d

There we go, adding (header, pkg_path) tuples as the "key" argument to what was ts.addInstall() back then (the h[1000000] thing was a custom tag added by the genhdlist thing that wrote out the headerlist used by anaconda, containing the package path). This convention has since then been copied to/carried on in nearly every single user of rpm-python (including anaconda, up2date and yum at least).

Ewt wrote this part of the original rpm bindings (anaconda commit f1da6a4807d44c670453978b53d4b6d18b406ec1), so one would assume he knew what he was doing... and in fact back then, it was a "clever trick" to actually /save/ memory in anaconda, due to how rpm worked at that time. Unfortunately he + others (I dunno exact details of who wrote what) botched up various other aspects of the python callback design pretty badly - more on that later.

Fast-forward a few years and rpm had internally started saving memory by scraping just the information it needs for the transaction calculations (dependency checks, ordering, file conflicts etc) from the header passed in the /first/ argument of ts.addInstall(), instead of keeping the entire header around. Which turned the "clever trick" of saving memory into a huge waste of memory. But the anaconda habit of using (h, path) tuples for "package keys" stuck around, maybe because nobody clearly explained what's suddenly so wrong with that. There's a remark about a scaling issue related to header use of rpm-python users here: https://lists.dulug.duke.edu/pipermail/rpm-python-list/2003-October/000012.html, but if (note if) that's the only explanation given to rpm-python users, no wonder it never was understood. I remember boggling at the "headers are deprecated" comments myself back then and completely missing the point.


So what follows is the long, long overdue explanation.

Part of the long-standing confusion has to do with such a silly thing as argument naming, again copied around from anaconda to several places. If you look back at the early anaconda commit snippet above carefully, you'll see the callback arguments are named "what, amount, total, key, data" - no headers in there. This is how it should be (except I'd replace "what" with "event", and "data" with "userdata"). At some point somebody replaced the "key" with "h" as in header, because that's pretty much what they got there in the callback, so it makes sense to call it a header and not some obscure "key", right?

But the "key" argument to ts.addInstall() is the key (pun intended) to this whole thing. The first and third arguments, "header" and "how", are for rpm's consumption. But the second argument, the "key", is for /yourself/. What you pass here as the key is the very same object that you get back in the callback in the "key" argument for the packages to be installed/updated, so that you can open a file descriptor to a package file and return it to rpm. A couple of trivial examples to demonstrate this (pass paths to local package(s) on the cli to install them):

http://laiskiainen.org/rpm/examples/python/minirpm-1.py
http://laiskiainen.org/rpm/examples/python/minirpm-2.py

See - no headers in the callback, and it still works. The sole purpose of the "key" argument is that you can open and close a file, and nothing more. Also there are no "keys" for erased elements at all - rpm doesn't need the help of the callback to locate headers of installed packages.
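
For those who want to poke at the idea without rpm installed, here is a toy model of the key round trip in plain Python. Note that ToyTransaction, add_install() and the two event constants are made up for this illustration and are not rpm's API; a temporary file plays the role of the package. The only point is that whatever goes in as the key is exactly what comes back in the callback, and a path is enough to open the file.

```python
import os
import tempfile

# Hypothetical event codes for this toy model; rpm defines its own constants.
OPEN_FILE = "open"
CLOSE_FILE = "close"


class ToyTransaction:
    """Toy stand-in for a transaction set; NOT rpm's implementation."""

    def __init__(self):
        self.elements = []

    def add_install(self, header, key):
        # The header would be consumed by the transaction itself;
        # the key is stored untouched and only handed back later.
        self.elements.append(key)

    def run(self, callback, userdata):
        for key in self.elements:
            fd = callback(OPEN_FILE, 0, 0, key, userdata)
            # ... the package payload would be read from fd here ...
            callback(CLOSE_FILE, 0, 0, key, userdata)


open_fds = {}


def callback(event, amount, total, key, userdata):
    # The key is just the package path, so no header is needed here.
    if event == OPEN_FILE:
        open_fds[key] = os.open(key, os.O_RDONLY)
        userdata.append(key)
        return open_fds[key]
    if event == CLOSE_FILE:
        os.close(open_fds.pop(key))
    return None


def demo():
    # A temporary file plays the role of a package file.
    with tempfile.NamedTemporaryFile(suffix=".rpm", delete=False) as tmp:
        tmp.write(b"not a real package")
        path = tmp.name
    seen = []
    ts = ToyTransaction()
    ts.add_install(header=None, key=path)
    ts.run(callback, seen)
    os.unlink(path)
    return seen, path
```

Calling demo() returns the list of keys the callback saw, which is just the path that was passed in.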

Now, for a real-world callback, you'll want to be able to show things like name/nevra, size, summary etc of the package(s) being installed and removed. And this is where we get to the rather horrible misdesign of the python callback: rpm obviously has more information available, but in the python bindings, apart from the amount/total counters the only information you get is what you passed in as the "key" to ts.addInstall(). Since there are no keys for erased packages at all, rpm "helpfully" passes the name of the package as the key so you have at least some clue of what's going on. Which just isn't enough, especially in the multilib era.

So how do you show more information then? These are conveniently available in the header, so why not pass that along here? Well, in order to return the object back to you, rpm needs to hold a reference to it someplace. So what happens behind the scenes of ts.addInstall() is quite literally:


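class TransactionSet:
    def __init__(self, *args):
        # every key passed to addInstall() ends up referenced from here
        self.keys = []
        ...

    def addInstall(self, header, key, how='u'):
        # the reference is held until the end of the transaction
        self.keys.append(key)
        ...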

Rpm itself never looks at the keys beyond passing around a pointer to them - the key is entirely the caller's business and rpm has no use for it (and could not use it even if it wanted to, for that matter). When you pass a header as (part of) the key, it gets pushed on to that list and never freed until the end of the transaction. To get an idea of the effect, try these two small examples which only differ in the key used:

http://laiskiainen.org/rpm/examples/python/memuse-1.py
http://laiskiainen.org/rpm/examples/python/memuse-2.py

On Fedora 14 DVD contents, I get this:
[pmatilai@localhost pyex]$ ./memuse-1.py /mnt/Packages/*.rpm
Memory used with 2766 packages in transaction: 140820 kB
[pmatilai@localhost pyex]$ ./memuse-2.py /mnt/Packages/*.rpm
Memory used with 2766 packages in transaction: 252752 kB

That's ~110MB worth of extra baggage that neither rpm nor you have use for, when all you want is to show a few tidbits of information like name, size etc in the callback. It's not entirely unlike lugging your entire personal library (of books, CDs, DVDs or such) around when shopping in order to avoid buying duplicates when all you'd really need is a list of titles and authors. It doesn't make much of a difference when you have, say, half a dozen of them to carry around, but with hundreds and thousands...

The difference is even more dramatic in reality because rpm goes out of its way (especially since 4.7.0 but to some extent in older versions too) to free up memory for the actual transaction run: all dependency information and very nearly all file data is thrown out, keeping only a couple of integer arrays per package to remember the actions calculated for each file.

Here's a version of the same memory use example, with just the data that an average callback might want for showing a bit of information to the user (ie the "title and author" from the analogy above):
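
The reason a header in the key can never be freed early is ordinary Python reference counting, and that part can be checked without rpm at all. A minimal sketch, assuming a made-up FakeHeader class in place of a real header and a plain list in place of the transaction's key list (both hypothetical, for illustration only):

```python
import gc
import weakref


class FakeHeader:
    """Hypothetical stand-in for a large header object."""

    def __init__(self, name):
        self.name = name
        self.blob = bytearray(64 * 1024)  # pretend file lists, changelogs, ...


def header_survives(make_key):
    """Return True if the header is still alive after only the key is kept."""
    keys = []  # plays the role of the transaction's list of keys
    hdr = FakeHeader("foo")
    probe = weakref.ref(hdr)  # lets us observe the header without owning it
    keys.append(make_key(hdr, "/tmp/foo.rpm"))
    del hdr  # the caller is done with the header
    gc.collect()
    return probe() is not None
```

With a (header, path) tuple as the key, header_survives(lambda h, p: (h, p)) reports the header as still alive; with just the path, header_survives(lambda h, p: p) reports it gone.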

http://laiskiainen.org/rpm/examples/python/memuse-3.py

With the same F14 package set I get:
[pmatilai@localhost pyex]$ ./memuse-3.py /mnt/Packages/*.rpm
Memory used with 2766 packages in transaction: 142624 kB

...which is not that much more than the minimum of what rpm itself needs (case 1), with the information you want to show in the callback added - contrast with case 2, which is what yum is currently doing. Here's a third version of the minirpm example, with a custom object used as the callback key to provide a bit of data about the package to the callback:

http://laiskiainen.org/rpm/examples/python/minirpm-3.py

The memory use would be similar to that of memuse-3 which uses dicts; the point is just to further demonstrate that the key can be any damn thing that is convenient /to you/.

For yum, the most convenient item to pass there would be a txmbr, as I suggested here: http://lists.baseurl.org/pipermail/yum-devel/2011-February/007964.html. Besides being convenient, passing txmbr or txmbr.po as the key would use even less memory than a partial copy of the header data into a dict, as they'd only be references to data that's already in memory. As it's yum who calls ts.addInstall(), it's yum who defines the callback convention for its own API users, so it all can't be changed "just like that" while API compatibility is needed. In any case, even the opt-in partial copy of the header is a HUGE step towards stopping the ancient waste of memory.

I hope this helps in understanding why I want to change the yum callback convention so badly :) And if something here is not clear to you, please DO ASK. I want to get this straightened out for good, finally.
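
As a rough idea of what such a key object could look like (this is a sketch, not the contents of minirpm-3.py): PkgKey, from_header() and describe() are hypothetical names, and a plain dict stands in for the header so the snippet runs without rpm.

```python
class PkgKey:
    """Hypothetical lightweight callback key: only what the UI needs."""

    __slots__ = ("path", "name", "size")

    def __init__(self, path, name, size):
        self.path = path
        self.name = name
        self.size = size

    @classmethod
    def from_header(cls, path, header):
        # Copy the few values we want; the header itself is not stored.
        return cls(path, header["name"], header["size"])


def describe(key):
    """What a callback could print for an install element."""
    return "Installing %s (%d bytes) from %s" % (key.name, key.size, key.path)
```

The callback would then use key.path to open the file and key.name or key.size for display, while the header can be dropped as soon as it has been handed to the transaction.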


        - Panu -