On Thu, Jan 2, 2014 at 4:15 AM, Keith Winston <keithw...@gmail.com> wrote:

> Thanks for all this Eryksun (and Mark!), but... I don't understand why you
> brought gdbm in? Is it something underlying shelve, or a better approach,
> or something else? That last part really puts me in a pickle, and I don't
> understand why.
A Shelf is backed by a container with the following mapping methods:

    keys
    __contains__
    __getitem__
    __setitem__
    __delitem__
    __len__

Shelf will also try to call `close` and `sync` on the container if they're
available. For some reason no one has made Shelf into a context manager
(i.e. __enter__ and __exit__), so remember to close() it.

For demonstration purposes, you can use a dict with Shelf:

    >>> sh = shelve.Shelf(dict={})
    >>> sh['alist'] = [1, 2, 3]

The mapping is referenced in the (badly named) `dict` attribute:

    >>> sh.dict
    {b'alist': b'\x80\x03]q\x00(K\x01K\x02K\x03e.'}

Keys are encoded as bytes (UTF-8 by default), and each value is serialized
with pickle. This is done to support using a database from the dbm module:
shelve.open returns an instance of shelve.DbfilenameShelf, a subclass of
Shelf specialized to open a dbm database.

Here's an overview of Unix dbm databases that Google turned up:

    http://www.unixpapa.com/incnote/dbm.html

Note the size restrictions for keys and values in ndbm, which gdbm doesn't
have. Using gdbm lifts the restriction on the size of pickled objects (the
docs vaguely suggest keeping them "fairly small"). Unfortunately, gdbm
isn't always available. On my system, dbm defaults to creating a _gdbm.gdbm
database, where _gdbm is an extension module that wraps the GNU gdbm
library (e.g. libgdbm.so.3).

You can use a different database with Shelf (or a subclass), so long as it
has the required methods. For example, shelve.BsdDbShelf is available for
use with pybsddb (Debian package "python3-bsddb3"). It exposes the bsddb3
database methods `first`, `next`, `previous`, `last`, and `set_location`.

> Separately, I'm also curious about how to process big files.
> ...
> I'm also beginning to think about how to speed it up:

I defer to Steven's sage advice. Look into using a database such as
sqlite3, learn NumPy, and add the multiprocessing and concurrent.futures
modules to your todo list.
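To tie the Shelf discussion together, here's a short sketch. Since Shelf
only needs a mapping with the methods listed above, a plain dict works, and
contextlib.closing stands in for the missing __enter__/__exit__ support.
The name `backing` is just mine for the demo:

```python
import shelve
from contextlib import closing

# Shelf accepts any mapping with keys/__contains__/__getitem__/
# __setitem__/__delitem__/__len__; a plain dict works for a demo.
backing = {}

# contextlib.closing calls sh.close() on exit, making up for the
# lack of context-manager support on Shelf itself.
with closing(shelve.Shelf(backing)) as sh:
    sh['alist'] = [1, 2, 3]
    print(sh['alist'])  # values round-trip through pickle: [1, 2, 3]

# The backing dict now holds a bytes key and a pickled value.
print(list(backing))  # [b'alist']
```

The same closing() trick works with shelve.open, so the file is closed
even if an exception is raised mid-update.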
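On the sqlite3 suggestion, here's a minimal sketch of the kind of thing I
mean: tallying word counts in a database table instead of a giant
in-memory dict. The table and column names are made up for illustration,
and you'd use a real filename instead of ':memory:' for persistence:

```python
import sqlite3

# Hypothetical word-count example: keep the tallies in SQLite so the
# working set doesn't have to fit in memory.
con = sqlite3.connect(':memory:')  # use a filename for a real run
con.execute('CREATE TABLE counts (word TEXT PRIMARY KEY, n INTEGER)')

def add_word(word):
    # UPDATE first; INSERT only if the word wasn't there yet.
    cur = con.execute('UPDATE counts SET n = n + 1 WHERE word = ?', (word,))
    if cur.rowcount == 0:
        con.execute('INSERT INTO counts VALUES (?, 1)', (word,))

for w in 'the quick brown fox jumps over the lazy dog the'.split():
    add_word(w)
con.commit()

print(con.execute('SELECT n FROM counts WHERE word = ?',
                  ('the',)).fetchone())  # (3,)
```

For a big input file you'd feed add_word from a loop that reads the file
line by line, committing every few thousand rows rather than per word.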
Even if you know C/C++, I suggest using Cython to create CPython extension
modules:

    http://www.cython.org

_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor