Re: [Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

2016-01-14 Thread Nathaniel Smith
I'd try storing the data in hdf5 (probably via h5py, which is a more basic interface without all the bells-and-whistles that pytables adds), though any method you use is going to be limited by the need to do a seek before each read. Storing the data on SSD will probably help a lot if you can
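(For reference, a minimal sketch of the h5py layout being suggested here; the file name, the group-per-key scheme, and the dataset names are illustrative assumptions, not taken from the message:)

import numpy as np
import h5py

# Write: one HDF5 group per integer key, each holding the two arrays.
with h5py.File("arrays.h5", "w") as f:
    key = 42
    grp = f.create_group(str(key))
    grp.create_dataset("ints", data=np.arange(1000, dtype=np.int64))
    grp.create_dataset("floats", data=np.random.rand(1000))

# Read: only the requested key's datasets are pulled from disk.
with h5py.File("arrays.h5", "r") as f:
    grp = f[str(42)]
    ints = grp["ints"][:]      # materialize as plain numpy arrays
    floats = grp["floats"][:]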

[Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

2016-01-14 Thread Ryan R. Rosario
Hi, I have a very large dictionary that must be shared across processes and does not fit in RAM. I need access to this object to be fast. The key is an integer ID and the value is a list containing two elements, both of them numpy arrays (one has ints, the other has floats). The key is
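(An illustrative sketch of the structure described above, with made-up keys and sizes; the question is how to back this mapping with on-disk storage that can be shared across processes:)

import numpy as np

# integer key -> [int array, float array]; in reality far too large for RAM
container = {
    7: [np.array([1, 2, 3], dtype=np.int64),
        np.array([0.1, 0.2, 0.3], dtype=np.float64)],
    42: [np.arange(10, dtype=np.int64),
         np.random.rand(10)],
}
ids, values = container[42]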

Re: [Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

2016-01-14 Thread Francesc Alted
Well, maybe something like a simple class emulating a dictionary that stores a key-value on disk would be more than enough. Then you can use whatever persistence layer that you want (even HDF5, but not necessarily). As a demonstration I did a quick and dirty implementation for such a persistent
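(A minimal sketch of such a dictionary-emulating class, here backed by plain .npy files; the actual demonstration mentioned above may use a different persistence layer, and this version has no locking, caching, or error handling:)

import os
import numpy as np

class DiskDict:
    """Dict-like store: each integer key maps to two .npy files on disk."""
    def __init__(self, path):
        self.path = path
        os.makedirs(path, exist_ok=True)

    def __setitem__(self, key, value):
        ints, floats = value
        np.save(os.path.join(self.path, "%d_ints.npy" % key), ints)
        np.save(os.path.join(self.path, "%d_floats.npy" % key), floats)

    def __getitem__(self, key):
        ints = np.load(os.path.join(self.path, "%d_ints.npy" % key))
        floats = np.load(os.path.join(self.path, "%d_floats.npy" % key))
        return [ints, floats]

store = DiskDict("/tmp/store")
store[42] = [np.arange(5), np.random.rand(5)]
print(store[42])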

Re: [Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

2016-01-14 Thread Edison Gustavo Muenz
From what I know this would be the use case that Dask seems to solve. I think this blog post can help: https://www.continuum.io/content/xray-dask-out-core-labeled-arrays-python Notice that I haven't used any of these projects myself. On Thu, Jan 14, 2016 at 11:48 AM, Francesc Alted

Re: [Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

2016-01-14 Thread Travis Oliphant
On Thu, Jan 14, 2016 at 8:16 AM, Edison Gustavo Muenz < edisongust...@gmail.com> wrote: > From what I know this would be the use case that Dask seems to solve. > > I think this blog post can help: > https://www.continuum.io/content/xray-dask-out-core-labeled-arrays-python > > Notice that I

Re: [Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

2016-01-14 Thread Stephan Hoyer
On Thu, Jan 14, 2016 at 2:30 PM, Nathaniel Smith wrote: > The reason I didn't suggest dask is that I had the impression that > dask's model is better suited to bulk/streaming computations with > vectorized semantics ("do the same thing to lots of data" kinds of > problems,

Re: [Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

2016-01-14 Thread Nathaniel Smith
On Thu, Jan 14, 2016 at 2:13 PM, Stephan Hoyer wrote: > On Thu, Jan 14, 2016 at 8:26 AM, Travis Oliphant > wrote: >> >> I don't know enough about xray to know whether it supports this kind of >> general labeling to be able to build your entire

Re: [Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

2016-01-14 Thread Feng Yu
Hi Ryan, Did you consider packing the arrays into one (or two) giant arrays stored with mmap? That way you only need to store the start & end offsets, and there is no need to use a dictionary. It may allow you to simplify some numerical operations as well. To be more specific, start : numpy.intp
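(A rough sketch of this packing scheme; the file names and the offset bookkeeping are illustrative assumptions, and the offsets could themselves be kept as numpy.intp arrays on disk as the message suggests:)

import numpy as np

# Build phase (offline): concatenate all per-key arrays into one raw file
# and record the [start, end) offsets for each key.
keys = [7, 42]
chunks = [np.arange(3, dtype=np.int64), np.arange(5, dtype=np.int64)]
offsets = {}
pos = 0
for k, a in zip(keys, chunks):
    offsets[k] = (pos, pos + len(a))
    pos += len(a)
np.concatenate(chunks).tofile("ints.bin")

# Access phase: memory-map the giant array; slicing only touches the
# pages that actually get read.
big = np.memmap("ints.bin", dtype=np.int64, mode="r")
start, end = offsets[42]
arr_42 = big[start:end]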

Re: [Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

2016-01-14 Thread Benjamin Root
A warning about HDF5. It is not a database format, so you have to be extremely careful if the data is getting updated while it is open for reading by anybody else. If it is strictly read-only, and nobody else is updating it, then have at it! Cheers! Ben Root On Thu, Jan 14, 2016 at 9:16 AM,

Re: [Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

2016-01-14 Thread Stephan Hoyer
On Thu, Jan 14, 2016 at 8:26 AM, Travis Oliphant wrote: > I don't know enough about xray to know whether it supports this kind of > general labeling to be able to build your entire data-structure as an x-ray > object. Dask could definitely be used to process your data in