I started a new packing script for RelStorage (history-free,
PostgreSQL). It is based on incoming reference counting. Why do we need
this? First a bit of storytelling (fast-forward down to the "new
script" part if you're bored).
So we had a database with ~298,000,000 objects stored, grown over 4-5
months. The beast was never packed. We import/sync data on a weekly
basis: a complex Plone-based product catalog optimized for the web, in
27 languages with ~18,000 products each, each product with several
variants (stored in annotated BTrees). Lots of stuff is deleted and
recreated on import, so we had a lot of garbage in the DB.
We're using history-free RelStorage. RelStorage performs much better on
concurrent writes and speeds up our import a lot. First on MySQL, now on
PostgreSQL. Even if PostgreSQL does not solve all problems, it seems to
be the stable choice. At least once we had a frozen server. We blame
MySQL, but well, it's a CentOS box and there are several other problems
injecting side effects (old unpatched kernel, and so on... we're dev,
not ops; it's enterprise level and decisions made in the past are our
constraints, carved in stone - what more is there to say?)
Anyway: our Plone with ZODB on RelStorage performs very, very well -
with this many objects stored, including all the garbage!
BUT... we want to get rid of the garbage left behind by every import.
RelStorage ships with a packing script. Running it on the ~300
million objects resulted in RAM consumption of >20GB with an estimated
time to completion of 300 days (we did not check whether that estimate
holds). Phew.
So something had to be done. Storage usage grew by about 15-20GB per
week. As a quick fix (it took some days) we decided to try classical
Data.fs packing. We blocked content managers from editing, stopped the
imports, and cloned a snapshot of the virtual machine with the DB.
At this point we had roughly 160GB in Postgres.
On the clone we converted the RelStorage to a FileStorage. This took
40 hours. The storage shrank to a 55GB Data.fs.
Then we started the classical packing, and after some hours the DB had
shrunk to a 7.8GB Data.fs, still containing 44 million objects.
Next we converted the Data.fs back to RelStorage (~6 hours) and switched
the live system over to the packed DB.
Packing the 44-million-object RelStorage with the default script
would take several days (3-4 days) for the pre-pack phase alone. It
consumes less memory (~5GB) but is still insane.
After analyzing the current (generic - MySQL, PostgreSQL, Oracle -
history and history-free) way of packing the DB, I found no good way to
optimize the code here. Others were there before me.
So I tried to rethink the way packing is done and to follow the
copy-and-switch method that FileStorage uses: copy all in-use zoids from
object_state to a new table (traversing the object graph starting at
zoid 0), drop the old object_state, and rename the new table to
object_state. The copy process can be distributed: several workers can
do it in parallel until the database is at its maximum
transactions/second. A rough single-process sketch follows below.
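Reduced to one process and minus the worker queue, it looks roughly
like this - the schema (zoid, tid, state_size, state) is the standard
history-free one, but the pack-table name and connection string are
made up for illustration:

    import psycopg2
    from ZODB.serialize import referencesf
    from ZODB.utils import u64

    conn = psycopg2.connect("dbname=zodb")   # illustrative DSN
    cur = conn.cursor()
    cur.execute("CREATE TABLE object_state_pack"
                " (LIKE object_state INCLUDING ALL)")

    # walk the object graph from the root, copying every reachable state
    queue, seen = [0], {0}
    while queue:
        zoid = queue.pop()
        cur.execute("SELECT tid, state_size, state FROM object_state"
                    " WHERE zoid = %s", (zoid,))
        row = cur.fetchone()
        if row is None:
            continue
        cur.execute("INSERT INTO object_state_pack VALUES (%s, %s, %s, %s)",
                    (zoid,) + row)
        for oid in referencesf(bytes(row[2])):   # outgoing references
            ref = u64(oid)
            if ref not in seen:
                seen.add(ref)
                queue.append(ref)

    # the switch: everything not copied vanishes with the old table
    cur.execute("DROP TABLE object_state")
    cur.execute("ALTER TABLE object_state_pack RENAME TO object_state")
    conn.commit()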
I had this working: 4-6 workers running, a queue of zoids to process
in the DB, a master controlling the process, and so on. But even with
this approach, the total time for a pack of the whole DB - after SQL
optimizations(!) - is roughly 8 days for a 44 million object graph.
OK - Back to the drawing board.
On Thursday evening I had a good conversation (and some beer) with a
buddy from the local Linux user group and we discussed the problem. The
outcome was to try it with reference counting.
The idea is simple:
- iterate over all transactions, starting with the lowest
  transaction id (tid)
- for each transaction, load the object states connected with that tid
- for each state, fetch its outgoing references and fill a table where
  all incoming references of an object are stored as an array;
  if a state has no references, write it to the table anyway with an
  empty array (a sketch of this build step follows right after this list)
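In code, the build could look roughly like this - object_inrefs is a
table name I made up, and for a 44-million-row object_state you would
want a server-side cursor, but it shows the mechanics:

    import psycopg2
    from ZODB.serialize import referencesf
    from ZODB.utils import u64

    conn = psycopg2.connect("dbname=zodb")   # illustrative DSN
    cur = conn.cursor()
    cur.execute("""CREATE TABLE IF NOT EXISTS object_inrefs (
                       zoid   BIGINT PRIMARY KEY,
                       inrefs BIGINT[] NOT NULL DEFAULT '{}')""")

    cur.execute("SELECT DISTINCT tid FROM object_state ORDER BY tid")
    for (tid,) in cur.fetchall():
        states = conn.cursor()
        states.execute("SELECT zoid, state FROM object_state"
                       " WHERE tid = %s", (tid,))
        for zoid, state in states:
            # a row for the object itself, even if nothing references it
            cur.execute("INSERT INTO object_inrefs (zoid) VALUES (%s)"
                        " ON CONFLICT DO NOTHING", (zoid,))
            for oid in referencesf(bytes(state)):
                target = u64(oid)
                cur.execute("INSERT INTO object_inrefs (zoid) VALUES (%s)"
                            " ON CONFLICT DO NOTHING", (target,))
                # record the incoming edge zoid -> target (once)
                cur.execute("""UPDATE object_inrefs
                               SET inrefs = array_append(inrefs, %s::bigint)
                               WHERE zoid = %s
                                 AND NOT inrefs @> ARRAY[%s::bigint]""",
                            (zoid, target, zoid))
        conn.commit()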
After reaching the highest tid it's easy: all entries with no incoming
references are garbage. Delete them, and remove their zoids from the
incoming-reference arrays of the other entries.
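The sweep could then be a loop like the sketch below (same made-up
object_inrefs table as above). Two details: the root object (zoid 0)
never has incoming references, so it must be exempted; and since
removing a deleted zoid from the arrays can empty further arrays, the
loop runs until a pass finds nothing:

    import psycopg2

    conn = psycopg2.connect("dbname=zodb")   # illustrative DSN
    cur = conn.cursor()

    while True:
        # no incoming references -> garbage, except the root object
        cur.execute("SELECT zoid FROM object_inrefs"
                    " WHERE inrefs = '{}' AND zoid <> 0")
        garbage = [z for (z,) in cur.fetchall()]
        if not garbage:
            break
        cur.execute("DELETE FROM object_state"
                    " WHERE zoid = ANY(%s::bigint[])", (garbage,))
        cur.execute("DELETE FROM object_inrefs"
                    " WHERE zoid = ANY(%s::bigint[])", (garbage,))
        # strip the deleted zoids out of the remaining arrays
        cur.execute("""UPDATE object_inrefs
                       SET inrefs = (SELECT coalesce(array_agg(r), '{}')
                                     FROM unnest(inrefs) AS r
                                     WHERE NOT r = ANY(%s::bigint[]))
                       WHERE inrefs && %s::bigint[]""", (garbage, garbage))
        conn.commit()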
When new transactions happen, it's enough to process only the delta
transactions (those newer than the last tid recorded in the
reference-counting table). This is pretty fast if the delta is small,
e.g. one day's worth. On the deltas another check is needed to detect
references that have gone away in the meantime, but since the delta
isn't that big, this is probably not a problem.
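This part is not written yet, but an incremental pass might look like
the sketch below: strip the changed zoids out of all arrays (which
drops the references gone in the meantime), then re-add their current
outgoing references as in the full build. last_tid stands for
bookkeeping I haven't settled on:

    import psycopg2
    from ZODB.serialize import referencesf
    from ZODB.utils import u64

    conn = psycopg2.connect("dbname=zodb")   # illustrative DSN
    cur = conn.cursor()
    last_tid = 0  # hypothetical: highest tid handled by the previous run

    cur.execute("SELECT zoid, state FROM object_state WHERE tid > %s",
                (last_tid,))
    changed = cur.fetchall()
    if changed:
        zoids = [zoid for zoid, _ in changed]
        # forget the old outgoing references of the changed objects
        cur.execute("""UPDATE object_inrefs
                       SET inrefs = (SELECT coalesce(array_agg(r), '{}')
                                     FROM unnest(inrefs) AS r
                                     WHERE NOT r = ANY(%s::bigint[]))
                       WHERE inrefs && %s::bigint[]""", (zoids, zoids))
        # ...then re-add the current ones, exactly as in the full build
        for zoid, state in changed:
            cur.execute("INSERT INTO object_inrefs (zoid) VALUES (%s)"
                        " ON CONFLICT DO NOTHING", (zoid,))
            for oid in referencesf(bytes(state)):
                target = u64(oid)
                cur.execute("INSERT INTO object_inrefs (zoid) VALUES (%s)"
                            " ON CONFLICT DO NOTHING", (target,))
                cur.execute("""UPDATE object_inrefs
                               SET inrefs = array_append(inrefs, %s::bigint)
                               WHERE zoid = %s
                                 AND NOT inrefs @> ARRAY[%s::bigint]""",
                            (zoid, target, zoid))
        conn.commit()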
I have implemented this roughly now. Building the incoming-reference-
counting table from scratch takes less than 7 hours. The deletion
itself is not implemented yet, but that is not that difficult :-)
The current code is at:
Did I miss something? Any opinions are much appreciated!
Expect updates in this thread :)
Klein & Partner KG, member of BlueDynamics Alliance