We have a large dataset of 650,000+ records that I'd like to examine easily in Python. I have figured out how to put this into a ZODB file that totals 4 GB in size. But I'm new to ZODB and very large databases, and have a few questions.

1. The data is in an IOBTree, so I can access each item once I know the key, but to get the list of keys I tried:

scores = root['scores']
ids = [id for id in scores.iterkeys()]

This seems to require the entire tree to be loaded into memory which takes more RAM than I have.

If I instead avoid the list comprehension and use an actual loop, I can explicitly call cacheMinimize every n records, and keep the memory reasonable.

So, how and when does the cache normally get minimized? Should I just avoid list comprehensions and explicitly clean the cache the way I'm doing, or are there any tricks to minimize RAM usage?

2. Obviously I should save my list of keys in the database. I'd also like to have other indexes. It appears the usual technique is to use ZCatalog <http://www.blazingthings.com/dev/zcatalog.html>. Am I correct? Is there any good documentation on how to use it with ZODB? (All the examples I can find either use the catalog from within Zope or use it in a purely standalone manner.) Are there any concerns I should be aware of when using it with large datasets?

3. Are there any guides on how to tune my ZODB usage? I had to dig around for a while to realize I should be using BTrees and the cacheMinimize method. Are there any other knobs I should know about?

So far, I've simply read the data from an XML file and converted it. I've set the cache size to 1000, and every 10000 entries, I commit the transaction, and minimize the caches. The conversion takes about 60 hours to run and uses roughly half my memory, which is acceptable, but if I can tune it to be faster at the cost of slightly more memory, I'd be happier. (The performance is roughly O(N^2), although halfway through it's closer to O(N^2.7).)

Thanks in advance.

ZODB-Dev mailing list  -  ZODB-Dev@zope.org