Currently, when a thread loads a non-ghost object into its object cache, the
object's state comes straight from being unpickled. That means that if two
threads load the exact same object, every (immutable) string contained in the
object state is allocated in duplicate (or, in general, once per active
thread).
If, instead, all unpickled strings were made canonical via a weak
dictionary, there would be only one copy in memory no matter the
thread count, e.g.:
string = weak_string_map.setdefault(string, string)
If the returned string was a different (canonical) copy, the duplicate
would immediately be ready for garbage collection.
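A minimal sketch of that canonicalization step (a plain dict stands in for the proposed weak_string_map here, since CPython's built-in str cannot be weakly referenced directly; a real implementation would need a wrapper type or a periodic sweep to get the weak behavior):

```python
_canonical_strings = {}  # stand-in for the proposed weak_string_map


def canonical(s):
    """Return the single shared copy of s, registering it on first use."""
    return _canonical_strings.setdefault(s, s)


# Two equal strings built independently are distinct objects...
a = "".join(["plone", "-", "page"])
b = "".join(["plone", "-", "page"])
assert a is not b

# ...but canonicalization maps both to one shared copy; the duplicate
# becomes unreferenced and is ready for garbage collection.
assert canonical(a) is canonical(b)
```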
This is a real memory win. I experimented with the approach in Plone,
using the pure-Python pickle implementation and interning all byte
strings (via ``intern``) directly in the unpickling routine, to the
same effect:
n = mloads('i' + self.read(4))  # unpack the 4-byte string length
string = self.read(n)           # read the raw byte string
string = intern(string)         # canonicalize it
With 20 active threads, each having rendered the Plone 4 front page,
this approach reduced memory usage by 70 MB. Note that unicode
strings aren't internable in Python 2 (but the alternative technique
of using a weak mapping should work fine).
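For reference, that limitation is specific to Python 2's ``intern`` builtin; CPython 3's ``sys.intern`` does accept (unicode) str objects, a quick check:

```python
import sys

# Build the string at runtime so it starts out as a fresh, non-interned
# object, then intern it.
a = "".join(["plone", "-", "page"])
b = sys.intern(a)

# Interning an equal literal resolves to the same single copy.
c = sys.intern("plone-page")
assert b is c
```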
In a long-running operation, dirty objects should be invalidated after
the transaction, to prevent future data redundancy.
An implementation needs a hook to use a special reconstructor
function for strings. Currently there is a technical impediment:
BTrees and Persistent objects have their own internal way of saving
strings. In my experiments, the ``persistent_id`` function was not
called for string objects (which differs from the behavior of a
regular ``cPickle.Pickler.dump``).
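To illustrate the contrast: a regular standard-library Pickler does consult ``persistent_id`` for string objects, which is enough to build a canonicalizing loader on top of the persistent-id machinery. A sketch using Python 3's pickle (the InterningPickler/InterningUnpickler names and the module-level canonical map are illustrative, not part of any ZODB API):

```python
import io
import pickle

_canonical = {}  # shared canonical-string map (illustrative)


class InterningPickler(pickle.Pickler):
    def persistent_id(self, obj):
        # Divert every str through the persistent-id channel. Returning
        # the string itself as the pid avoids re-entering this hook for
        # the elements of a wrapper object.
        if isinstance(obj, str):
            return obj
        return None  # everything else is pickled normally


class InterningUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        # Canonicalize on load: equal strings from independent loads
        # end up as one shared object.
        return _canonical.setdefault(pid, pid)


def dumps(obj):
    buf = io.BytesIO()
    InterningPickler(buf, protocol=2).dump(obj)
    return buf.getvalue()


def loads(data):
    return InterningUnpickler(io.BytesIO(data)).load()


a = loads(dumps(["shared-key", 1]))
b = loads(dumps(["shared-key", 2]))
assert a[0] is b[0]  # one canonical copy across independent loads
```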
For more information about ZODB, see the ZODB Wiki.
ZODB-Dev mailing list - ZODB-Dev@zope.org