Hi,
I have a very large dictionary that must be shared across processes and does
not fit in RAM. I need access to this object to be fast. The key is an integer
ID and the value is a list containing two elements, both of them numpy arrays
(one has ints, the other has floats). The key is sequential, starts at 0, and
there are no gaps, so the “outer” layer of this data structure could really
just be a list, with the key serving as the index. The array lengths may
differ from key to key.
For a visual:
{
    key=0:
        [
            numpy.array([1,8,15,…, 16000]),
            numpy.array([0.1,0.1,0.1,…,0.1])
        ],
    key=1:
        [
            numpy.array([5,6]),
            numpy.array([0.5,0.5])
        ],
    …
}
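In runnable form, a toy version of this structure might look like the sketch below. The array contents are shortened, made-up stand-ins for the real data, and the names `data`, `as_list`, and `lengths` are only for illustration. In this toy the two arrays under one key have equal length, which is an assumption taken from the visual above.

```python
import numpy as np

# Toy stand-in for the real dictionary; the values are invented for illustration.
data = {
    0: [np.array([1, 8, 15, 16000]), np.array([0.1, 0.1, 0.1, 0.1])],
    1: [np.array([5, 6]), np.array([0.5, 0.5])],
}

# Keys are sequential from 0 with no gaps, so a list indexed by key is equivalent.
as_list = [data[k] for k in range(len(data))]

# Each entry is a pair: an integer array and a float array.
ints, floats = as_list[1]

# Array lengths differ from key to key, which is what makes the structure ragged.
lengths = [len(pair[0]) for pair in as_list]
```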
I’ve tried:
- Manager proxy objects, but the object was so big that low-level code
  threw an exception due to format, and monkey-patching wasn’t successful.
- Redis, which was far too slow because of connection setup, data
  conversion, etc.
- NumPy record arrays with memory mapping, but the arrays in each “column”
  must all have the same fixed size.
- PyTables, which I have only looked at; it may be a solution, but it seems
  to have a very steep learning curve.
- SQLite3, which I haven’t tried; I am worried about the time it takes to
  query the DB by sequential ID and then translate the byte arrays.
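The fixed-size restriction mentioned for the record-array approach can be seen in a small sketch. It uses a structured dtype (the basis of record arrays) with `np.memmap`. This is not the exact code I ran: the names `MAX_LEN`, `row_dtype`, and `rows.dat` are hypothetical, and the data is the toy data from the visual.

```python
import os
import tempfile

import numpy as np

# Hypothetical fixed width: every row reserves room for MAX_LEN elements.
MAX_LEN = 4
row_dtype = np.dtype([
    ("ids", np.int64, (MAX_LEN,)),
    ("weights", np.float64, (MAX_LEN,)),
])

# Back the array with a file so other processes could map the same data.
path = os.path.join(tempfile.mkdtemp(), "rows.dat")
rows = np.memmap(path, dtype=row_dtype, mode="w+", shape=(2,))

# Key 0 happens to fill the whole row.
rows["ids"][0] = [1, 8, 15, 16000]
rows["weights"][0] = [0.1, 0.1, 0.1, 0.1]

# Key 1 holds only two elements, so the remaining slots are zero padding.
rows["ids"][1, :2] = [5, 6]
rows["weights"][1, :2] = [0.5, 0.5]
rows.flush()

# Every row occupies the same number of bytes, whatever its real length.
bytes_per_row = row_dtype.itemsize
```

With real data, where one key can hold thousands of elements and another only two, padding every row to the longest length wastes most of the file.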
Any ideas? I greatly appreciate any guidance you can provide.
Thanks,
Ryan
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion