Re: [Numpy-discussion] inconsistency in np.isclose

2016-01-14 Thread Nathaniel Smith
Yeah, that does look like a bug.

On Thu, Jan 14, 2016 at 4:48 PM, Andrew Nelson  wrote:
> Hi all,
> I think there is an inconsistency with np.isclose when I compare two
> numbers:
>
> >>> np.isclose(0, np.inf)
> array([False], dtype=bool)
>
> >>> np.isclose(0, 1)
> False
>
> The first comparison returns a bool array, the second returns a bool.
> Shouldn't they both return the same result?
>
> --
> Dr. Andrew Nelson
>



-- 
Nathaniel J. Smith -- http://vorpus.org


[Numpy-discussion] inconsistency in np.isclose

2016-01-14 Thread Andrew Nelson
Hi all,
I think there is an inconsistency with np.isclose when I compare two
numbers:

>>> np.isclose(0, np.inf)
array([False], dtype=bool)

>>> np.isclose(0, 1)
False

The first comparison returns a bool array, the second returns a bool.
Shouldn't they both return the same result?
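
For reference, until this is fixed a tiny wrapper can paper over the
difference -- this is just a sketch with a made-up name, not anything
official:

import numpy as np

def isclose_consistent(a, b, **kwargs):
    # Hypothetical helper: always hand back a plain bool when both
    # inputs are scalars, whichever code path np.isclose takes.
    out = np.asarray(np.isclose(a, b, **kwargs))
    if np.isscalar(a) and np.isscalar(b):
        return bool(out.ravel()[0])
    return out

isclose_consistent(0, np.inf)   # False, a plain bool
isclose_consistent(0, 1)        # False, a plain bool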

-- 
Dr. Andrew Nelson


Re: [Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

2016-01-14 Thread Stephan Hoyer
On Thu, Jan 14, 2016 at 2:30 PM, Nathaniel Smith  wrote:

> The reason I didn't suggest dask is that I had the impression that
> dask's model is better suited to bulk/streaming computations with
> vectorized semantics ("do the same thing to lots of data" kinds of
> problems, basically), whereas it sounded like the OP's algorithm
> needed lots of one-off unpredictable random access.
>
> Obviously, even if this is true, it's still useful to point out both
> options, because the OP's problem might turn out to be a better fit for
> dask's model than they indicated -- the post is somewhat vague :-).
>
> But, I just wanted to check, is the above a good characterization of
> dask's strengths/applicability?
>

Yes, dask is definitely designed around setting up a large streaming
computation and then executing it all at once.

But it is pretty flexible in terms of what those specific computations are,
and can also work for non-vectorized computation (especially via dask
imperative). It's worth taking a look at dask's collections for a sense of
what it can do here. The recently refreshed docs provide a nice overview:
http://dask.pydata.org/
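
As a rough illustration of that imperative style (in current releases it is
spelled dask.delayed; treat this as a sketch rather than a recipe):

import dask
import numpy as np

@dask.delayed
def load(key):
    # stand-in for reading one ragged array pair from disk
    return np.arange(key + 3)

@dask.delayed
def total(arr):
    return arr.sum()

results = [total(load(k)) for k in range(4)]   # builds a graph, runs nothing
print(dask.compute(*results))                  # executes the whole graph at once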

Cheers,
Stephan


Re: [Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

2016-01-14 Thread Nathaniel Smith
On Thu, Jan 14, 2016 at 2:13 PM, Stephan Hoyer  wrote:
> On Thu, Jan 14, 2016 at 8:26 AM, Travis Oliphant 
> wrote:
>>
>> I don't know enough about xray to know whether it supports this kind of
>> general labeling to be able to build your entire data structure as an xray
>> object.   Dask could definitely be used to process your data in an
>> easy-to-describe manner (creating a dask.bag of dask.arrays would work,
>> though I'm not sure there are any methods that would buy you anything over
>> just having a standard dictionary of dask.arrays).   You can definitely use
>> dask imperative to parallelize your data-manipulation algorithms.
>
>
> Indeed, xray's data model is not flexible enough to represent this sort of
> data -- it's designed around cases where multiple arrays use shared axes.
>
> However, I would indeed recommend dask.array (coupled with some sort of
> on-disk storage) as a possible solution for this problem, if you need to be
> able to manipulate these arrays with an API that looks like NumPy. That said,
> the fact that your data consists of ragged arrays suggests that the
> dask.array API may be less useful for you.
>
> Tools like dask.imperative, coupled with HDF5 for storage, could still be
> very useful, though.

The reason I didn't suggest dask is that I had the impression that
dask's model is better suited to bulk/streaming computations with
vectorized semantics ("do the same thing to lots of data" kinds of
problems, basically), whereas it sounded like the OP's algorithm
needed lots of one-off unpredictable random access.

Obviously, even if this is true, it's still useful to point out both
options, because the OP's problem might turn out to be a better fit for
dask's model than they indicated -- the post is somewhat vague :-).

But, I just wanted to check, is the above a good characterization of
dask's strengths/applicability?

-n

-- 
Nathaniel J. Smith -- http://vorpus.org


Re: [Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

2016-01-14 Thread Stephan Hoyer
On Thu, Jan 14, 2016 at 8:26 AM, Travis Oliphant 
wrote:

> I don't know enough about xray to know whether it supports this kind of
> general labeling to be able to build your entire data structure as an xray
> object.   Dask could definitely be used to process your data in an
> easy-to-describe manner (creating a dask.bag of dask.arrays would work,
> though I'm not sure there are any methods that would buy you anything over
> just having a standard dictionary of dask.arrays).   You can definitely use
> dask imperative to parallelize your data-manipulation algorithms.
>

Indeed, xray's data model is not flexible enough to represent this sort of
data -- it's designed around cases where multiple arrays use shared axes.

However, I would indeed recommend dask.array (coupled with some sort of
on-disk storage) as a possible solution for this problem, if you need to be
able to manipulate these arrays with an API that looks like NumPy. That said,
the fact that your data consists of ragged arrays suggests that the
dask.array API may be less useful for you.

Tools like dask.imperative, coupled with HDF5 for storage, could still be
very useful, though.


Re: [Numpy-discussion] Should I use pip install numpy in linux?

2016-01-14 Thread Matthew Brett
On Thu, Jan 14, 2016 at 9:14 AM, Chris Barker - NOAA Federal
 wrote:
>>> Also, you have the problem that there is one PyPi -- so where do you put
>>> your nifty wheels that depend on other binary wheels? you may need to fork
>>> every package you want to build :-(
>>
>> Is this a real problem or a theoretical one? Do you know of some
>> situation where this wheel to wheel dependency will occur that won't
>> just be solved in some other way?
>
> It's real -- at least during the whole bootstrapping period. Say I
> build a nifty hdf5 binary wheel -- I could probably just grab the name
> "libhdf5" on PyPI. So far so good. But the goal here would be to have
> netcdf and pytables and GDAL and who knows what else then link against
> that wheel. But those projects are all supported by different people, who
> all have their own distribution strategy. So where do I put
> binary wheels of each of those projects that depend on my libhdf5
> wheel? _maybe_ I would put it out there, and it would all grow
> organically, but neither the culture nor the tooling support that
> approach now, so I'm not very confident you could gather adoption.

I don't think there's a very large amount of cultural work needed - but
some, to be sure.

We already have the following on OSX:

pip install numpy scipy matplotlib scikit-learn scikit-image pandas h5py

where all the wheels come from pypi.  So, I don't think this is really
outside our range, even if the problem is a little more difficult for
Linux.

> Even beyond the adoption period, sometimes you need to do stuff in
> more than one way -- look at the proliferation of channels on
> Anaconda.org.
>
> This is more likely to work if there is a good infrastructure for
> third parties to build and distribute the binaries -- e.g.
> Anaconda.org.

I thought that Anaconda.org allows pypi channels as well?

Matthew


Re: [Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

2016-01-14 Thread Feng Yu
Hi Ryan,

Did you consider packing the arrays into one (or two) giant arrays stored with mmap?

That way you only need to store the start & end offsets, and there is
no need to use a dictionary.
It may allow you to simplify some numerical operations as well.

To be more specific,

start : numpy.intp
end : numpy.intp

data1 : numpy.int32
data2 : numpy.float64

Then your original access to the dictionary can be rewritten as

data1[start[key]:end[key]]
data2[start[key]:end[key]]

Whether to wrap this as a dictionary-like object is just a matter of
taste -- depending on whether you like it raw or fine.

If you need to apply some global transformation to the data, then
something like data2[...] *= 10 would work.

ufunc.reduceat(data1, ...) can be very useful as well (with some
tricks on start/end).
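
A rough sketch of that layout (names here are illustrative; on disk, data1
and data2 would be np.memmap files rather than in-memory arrays):

import numpy as np

# Pack all per-key arrays back to back and keep per-key offsets.
arrays1 = [np.array([1, 8, 15, 16000]), np.array([5, 6])]
arrays2 = [np.array([0.1, 0.1, 0.1, 0.1]), np.array([0.5, 0.5])]

lengths = np.array([len(a) for a in arrays1], dtype=np.intp)
end = np.cumsum(lengths)
start = end - lengths

data1 = np.concatenate(arrays1).astype(np.int32)
data2 = np.concatenate(arrays2).astype(np.float64)
# e.g. data1 = np.memmap("data1.bin", dtype=np.int32, mode="r", shape=(end[-1],))

key = 1
ints = data1[start[key]:end[key]]
floats = data2[start[key]:end[key]]

# Per-key reductions without a Python loop (one sum per key):
sums = np.add.reduceat(data2, start)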

I was facing a similar issue a few years ago, and you may want to look
at this code (it wasn't very well written, I have to admit):

https://github.com/rainwoodman/gaepsi/blob/master/gaepsi/tools/__init__.py#L362

Best,

- Yu

On Thu, Jan 14, 2016 at 1:15 AM, Ryan R. Rosario  wrote:
> Hi,
>
> I have a very large dictionary that must be shared across processes and does 
> not fit in RAM. I need access to this object to be fast. The key is an 
> integer ID and the value is a list containing two elements, both of them 
> numpy arrays (one has ints, the other has floats). The key is sequential, 
> starts at 0, and there are no gaps, so the “outer” layer of this data 
> structure could really just be a list with the key actually being the index. 
> The lengths of each pair of arrays may differ across keys.
>
> For a visual:
>
> {
> key=0:
> [
> numpy.array([1,8,15,…, 16000]),
> numpy.array([0.1,0.1,0.1,…,0.1])
> ],
> key=1:
> [
> numpy.array([5,6]),
> numpy.array([0.5,0.5])
> ],
> …
> }
>
> I’ve tried:
> -   manager proxy objects, but the object was so big that low-level code 
> threw an exception due to format and monkey-patching wasn’t successful.
> -   Redis, which was far too slow due to setting up connections and data 
> conversion etc.
> -   Numpy rec arrays + memory mapping, but there is a restriction that 
> the numpy arrays in each “column” must be of fixed and same size.
> -   I looked at PyTables, which may be a solution, but seems to have a 
> very steep learning curve.
> -   I haven’t tried SQLite3, but I am worried about the time it takes to 
> query the DB for a sequential ID, and then translate byte arrays.
>
> Any ideas? I greatly appreciate any guidance you can provide.
>
> Thanks,
> Ryan


Re: [Numpy-discussion] Should I use pip install numpy in linux?

2016-01-14 Thread Chris Barker - NOAA Federal
>> Also, you have the problem that there is one PyPi -- so where do you put
>> your nifty wheels that depend on other binary wheels? you may need to fork
>> every package you want to build :-(
>
> Is this a real problem or a theoretical one? Do you know of some
> situation where this wheel to wheel dependency will occur that won't
> just be solved in some other way?

It's real -- at least during the whole bootstrapping period. Say I
build a nifty hdf5 binary wheel -- I could probably just grab the name
"libhdf5" on PyPI. So far so good. But the goal here would be to have
netcdf and pytables and GDAL and who knows what else then link against
that wheel. But those projects are all supported by different people, who
all have their own distribution strategy. So where do I put
binary wheels of each of those projects that depend on my libhdf5
wheel? _maybe_ I would put it out there, and it would all grow
organically, but neither the culture nor the tooling support that
approach now, so I'm not very confident you could gather adoption.

Even beyond the adoption period, sometimes you need to do stuff in
more than one way -- look at the proliferation of channels on
Anaconda.org.

This is more likely to work if there is a good infrastructure for
third parties to build and distribute the binaries -- e.g.
Anaconda.org.

Or the Linux distro model -- for the most part, the people developing
a given library are not the ones packaging it.

-CHB


Re: [Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

2016-01-14 Thread Travis Oliphant
On Thu, Jan 14, 2016 at 8:16 AM, Edison Gustavo Muenz <
edisongust...@gmail.com> wrote:

> From what I know this would be the use case that Dask seems to solve.
>
> I think this blog post can help:
> https://www.continuum.io/content/xray-dask-out-core-labeled-arrays-python
>
> Notice that I haven't used any of these projects myself.
>

I don't know enough about xray to know whether it supports this kind of
general labeling to be able to build your entire data structure as an xray
object.   Dask could definitely be used to process your data in an
easy-to-describe manner (creating a dask.bag of dask.arrays would work,
though I'm not sure there are any methods that would buy you anything over
just having a standard dictionary of dask.arrays).   You can definitely use
dask imperative to parallelize your data-manipulation algorithms.

But, dask doesn't take a strong opinion as to how you store your data ---
it can use anything Python can read.  I believe your question was "how do
I store this?"

If you think of the file-system as a simple key-value store, then you could
easily construct this kind of scenario on disk with directory names for
your keys and two files in each directory for your arrays.  Then, you
could mmap the individual arrays directly for processing.  Those
individual arrays could be stored as bcolz, npy files, or anything else.
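
An illustrative sketch of that layout using plain .npy files (the paths and
helper names here are made up):

import os
import numpy as np

ROOT = "store"

def save(key, ints, floats):
    d = os.path.join(ROOT, str(key))
    os.makedirs(d, exist_ok=True)
    np.save(os.path.join(d, "ints.npy"), ints)
    np.save(os.path.join(d, "floats.npy"), floats)

def load(key):
    d = os.path.join(ROOT, str(key))
    # mmap_mode="r" maps the files instead of reading them into RAM,
    # so multiple reader processes can share the same pages.
    return (np.load(os.path.join(d, "ints.npy"), mmap_mode="r"),
            np.load(os.path.join(d, "floats.npy"), mmap_mode="r"))

save(0, np.array([1, 8, 15, 16000]), np.array([0.1, 0.1, 0.1, 0.1]))
ints, floats = load(0)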

Will your multiple processes need to write to these files or will they be
read-only?

-Travis




>
> On Thu, Jan 14, 2016 at 11:48 AM, Francesc Alted  wrote:
>
>> Well, maybe something like a simple class emulating a dictionary that
>> stores a key-value on disk would be more than enough.  Then you can use
>> whatever persistence layer that you want (even HDF5, but not necessarily).
>>
>> As a demonstration I did a quick and dirty implementation for such a
>> persistent key-store thing (
>> https://gist.github.com/FrancescAlted/8e87c8762a49cf5fc897).  On it, the
>> KeyStore class (less than 40 lines long) is responsible for storing the
>> value (2 arrays) into a key (a directory).  As I am quite a big fan of
>> compression, I implemented a couple of serialization flavors: one using the
>> .npz format (so no other dependencies than NumPy are needed) and the other
>> using the ctable object from the bcolz package (bcolz.blosc.org).  Here
>> are some performance numbers:
>>
>> python key-store.py -f numpy -d __test -l 0
>> ## Checking method: numpy (via .npz files) 
>> Building database.  Wait please...
>> Time (creation) --> 1.906
>> Retrieving 100 keys in arbitrary order...
>> Time (   query) --> 0.191
>> Number of elements out of getitem: 10518976
>> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
>>
>> 75M __test
>>
>> So, with the NPZ format we can deal with the 75 MB quite easily.  But NPZ
>> can compress data as well, so let's see how it goes:
>>
>> $ python key-store.py -f numpy -d __test -l 9
>> ## Checking method: numpy (via .npz files) 
>> Building database.  Wait please...
>> Time (creation) --> 6.636
>> Retrieving 100 keys in arbitrary order...
>> Time (   query) --> 0.384
>> Number of elements out of getitem: 10518976
>> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
>> 28M __test
>>
>> Ok, in this case we have got almost a 3x compression ratio, which is not
>> bad.  However, the performance has degraded a lot.  Let's use now bcolz.
>> First in non-compressed mode:
>>
>> $ python key-store.py -f bcolz -d __test -l 0
>> ## Checking method: bcolz (via ctable(clevel=0, cname='blosclz')
>> 
>> Building database.  Wait please...
>> Time (creation) --> 0.479
>> Retrieving 100 keys in arbitrary order...
>> Time (   query) --> 0.103
>> Number of elements out of getitem: 10518976
>> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
>> 82M __test
>>
>> Without compression, bcolz takes a bit more (~10%) space than NPZ.
>> However, bcolz is actually meant to be used with compression on by default:
>>
>> $ python key-store.py -f bcolz -d __test -l 9
>> ## Checking method: bcolz (via ctable(clevel=9, cname='blosclz')
>> 
>> Building database.  Wait please...
>> Time (creation) --> 0.487
>> Retrieving 100 keys in arbitrary order...
>> Time (   query) --> 0.98
>> Number of elements out of getitem: 10518976
>> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
>> 29M __test
>>
>> So, the final disk usage is quite similar to NPZ, but it can store and
>> retrieve lots faster.  Also, the data decompression speed is on par with
>> using non-compression.  This is because bcolz uses Blosc behind the scenes,
>> which is much faster than zlib (used by NPZ) --and sometimes faster than a
>> memcpy().  However, even though we are doing I/O against the disk, this
>> dataset is so small that it fits in the OS filesystem cache, so the
>> benchmark is actually checking I/O at memory speeds, not disk speeds.

Re: [Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

2016-01-14 Thread Benjamin Root
A warning about HDF5. It is not a database format, so you have to be
extremely careful if the data is getting updated while it is open for
reading by anybody else. If it is strictly read-only, and nobody else is
updating it, then have at it!
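
If one writer does need to append while readers poll, recent HDF5 and h5py
offer a single-writer/multiple-reader (SWMR) mode. A hedged sketch, assuming
HDF5 >= 1.10 and a correspondingly new h5py:

import h5py
import numpy as np

with h5py.File("log.h5", "w", libver="latest") as f:
    dset = f.create_dataset("data", shape=(0,), maxshape=(None,), dtype="f8")
    f.swmr_mode = True          # from here on, readers may open the file
    dset.resize((4,))
    dset[:] = np.arange(4.0)
    dset.flush()                # make the appended data visible to readers

# A reader in another process would open it with:
# h5py.File("log.h5", "r", libver="latest", swmr=True)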

Cheers!
Ben Root

On Thu, Jan 14, 2016 at 9:16 AM, Edison Gustavo Muenz <
edisongust...@gmail.com> wrote:

> From what I know this would be the use case that Dask seems to solve.
>
> I think this blog post can help:
> https://www.continuum.io/content/xray-dask-out-core-labeled-arrays-python
>
> Notice that I haven't used any of these projects myself.
>
> On Thu, Jan 14, 2016 at 11:48 AM, Francesc Alted  wrote:
>
>> Well, maybe something like a simple class emulating a dictionary that
>> stores a key-value on disk would be more than enough.  Then you can use
>> whatever persistence layer that you want (even HDF5, but not necessarily).
>>
>> As a demonstration I did a quick and dirty implementation for such a
>> persistent key-store thing (
>> https://gist.github.com/FrancescAlted/8e87c8762a49cf5fc897).  On it, the
>> KeyStore class (less than 40 lines long) is responsible for storing the
>> value (2 arrays) into a key (a directory).  As I am quite a big fan of
>> compression, I implemented a couple of serialization flavors: one using the
>> .npz format (so no other dependencies than NumPy are needed) and the other
>> using the ctable object from the bcolz package (bcolz.blosc.org).  Here
>> are some performance numbers:
>>
>> python key-store.py -f numpy -d __test -l 0
>> ## Checking method: numpy (via .npz files) 
>> Building database.  Wait please...
>> Time (creation) --> 1.906
>> Retrieving 100 keys in arbitrary order...
>> Time (   query) --> 0.191
>> Number of elements out of getitem: 10518976
>> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
>>
>> 75M __test
>>
>> So, with the NPZ format we can deal with the 75 MB quite easily.  But NPZ
>> can compress data as well, so let's see how it goes:
>>
>> $ python key-store.py -f numpy -d __test -l 9
>> ## Checking method: numpy (via .npz files) 
>> Building database.  Wait please...
>> Time (creation) --> 6.636
>> Retrieving 100 keys in arbitrary order...
>> Time (   query) --> 0.384
>> Number of elements out of getitem: 10518976
>> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
>> 28M __test
>>
>> Ok, in this case we have got almost a 3x compression ratio, which is not
>> bad.  However, the performance has degraded a lot.  Let's use now bcolz.
>> First in non-compressed mode:
>>
>> $ python key-store.py -f bcolz -d __test -l 0
>> ## Checking method: bcolz (via ctable(clevel=0, cname='blosclz')
>> 
>> Building database.  Wait please...
>> Time (creation) --> 0.479
>> Retrieving 100 keys in arbitrary order...
>> Time (   query) --> 0.103
>> Number of elements out of getitem: 10518976
>> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
>> 82M __test
>>
>> Without compression, bcolz takes a bit more (~10%) space than NPZ.
>> However, bcolz is actually meant to be used with compression on by default:
>>
>> $ python key-store.py -f bcolz -d __test -l 9
>> ## Checking method: bcolz (via ctable(clevel=9, cname='blosclz')
>> 
>> Building database.  Wait please...
>> Time (creation) --> 0.487
>> Retrieving 100 keys in arbitrary order...
>> Time (   query) --> 0.98
>> Number of elements out of getitem: 10518976
>> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
>> 29M __test
>>
>> So, the final disk usage is quite similar to NPZ, but it can store and
>> retrieve lots faster.  Also, the data decompression speed is on par with
>> using non-compression.  This is because bcolz uses Blosc behind the scenes,
>> which is much faster than zlib (used by NPZ) --and sometimes faster than a
>> memcpy().  However, even though we are doing I/O against the disk, this
>> dataset is so small that it fits in the OS filesystem cache, so the
>> benchmark is actually checking I/O at memory speeds, not disk speeds.
>>
>> In order to do a more real-life comparison, let's use a dataset that is
>> much larger than the amount of memory in my laptop (8 GB):
>>
>> $ PYTHONPATH=. python key-store.py -f bcolz -m 100 -k 5000 -d
>> /media/faltet/docker/__test -l 0
>> ## Checking method: bcolz (via ctable(clevel=0, cname='blosclz')
>> 
>> Building database.  Wait please...
>> Time (creation) --> 133.650
>> Retrieving 100 keys in arbitrary order...
>> Time (   query) --> 2.881
>> Number of elements out of getitem: 91907396
>> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh
>> /media/faltet/docker/__test
>>
>> 39G /media/faltet/docker/__test
>>
>> and now, with compression on:
>>
>> $ PYTHONPATH=. python key-store.py -f bcolz -m 100 -k 5000 -d
>> /media/faltet/docker/__test -l 9
>> #

Re: [Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

2016-01-14 Thread Edison Gustavo Muenz
From what I know this would be the use case that Dask seems to solve.

I think this blog post can help:
https://www.continuum.io/content/xray-dask-out-core-labeled-arrays-python

Notice that I haven't used any of these projects myself.

On Thu, Jan 14, 2016 at 11:48 AM, Francesc Alted  wrote:

> Well, maybe something like a simple class emulating a dictionary that
> stores a key-value on disk would be more than enough.  Then you can use
> whatever persistence layer that you want (even HDF5, but not necessarily).
>
> As a demonstration I did a quick and dirty implementation for such a
> persistent key-store thing (
> https://gist.github.com/FrancescAlted/8e87c8762a49cf5fc897).  On it, the
> KeyStore class (less than 40 lines long) is responsible for storing the
> value (2 arrays) into a key (a directory).  As I am quite a big fan of
> compression, I implemented a couple of serialization flavors: one using the
> .npz format (so no other dependencies than NumPy are needed) and the other
> using the ctable object from the bcolz package (bcolz.blosc.org).  Here
> are some performance numbers:
>
> python key-store.py -f numpy -d __test -l 0
> ## Checking method: numpy (via .npz files) 
> Building database.  Wait please...
> Time (creation) --> 1.906
> Retrieving 100 keys in arbitrary order...
> Time (   query) --> 0.191
> Number of elements out of getitem: 10518976
> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
>
> 75M __test
>
> So, with the NPZ format we can deal with the 75 MB quite easily.  But NPZ
> can compress data as well, so let's see how it goes:
>
> $ python key-store.py -f numpy -d __test -l 9
> ## Checking method: numpy (via .npz files) 
> Building database.  Wait please...
> Time (creation) --> 6.636
> Retrieving 100 keys in arbitrary order...
> Time (   query) --> 0.384
> Number of elements out of getitem: 10518976
> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
> 28M __test
>
> Ok, in this case we have got almost a 3x compression ratio, which is not
> bad.  However, the performance has degraded a lot.  Let's use now bcolz.
> First in non-compressed mode:
>
> $ python key-store.py -f bcolz -d __test -l 0
> ## Checking method: bcolz (via ctable(clevel=0, cname='blosclz')
> 
> Building database.  Wait please...
> Time (creation) --> 0.479
> Retrieving 100 keys in arbitrary order...
> Time (   query) --> 0.103
> Number of elements out of getitem: 10518976
> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
> 82M __test
>
> Without compression, bcolz takes a bit more (~10%) space than NPZ.
> However, bcolz is actually meant to be used with compression on by default:
>
> $ python key-store.py -f bcolz -d __test -l 9
> ## Checking method: bcolz (via ctable(clevel=9, cname='blosclz')
> 
> Building database.  Wait please...
> Time (creation) --> 0.487
> Retrieving 100 keys in arbitrary order...
> Time (   query) --> 0.98
> Number of elements out of getitem: 10518976
> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
> 29M __test
>
> So, the final disk usage is quite similar to NPZ, but it can store and
> retrieve lots faster.  Also, the data decompression speed is on par with
> using non-compression.  This is because bcolz uses Blosc behind the scenes,
> which is much faster than zlib (used by NPZ) --and sometimes faster than a
> memcpy().  However, even though we are doing I/O against the disk, this
> dataset is so small that it fits in the OS filesystem cache, so the
> benchmark is actually checking I/O at memory speeds, not disk speeds.
>
> In order to do a more real-life comparison, let's use a dataset that is
> much larger than the amount of memory in my laptop (8 GB):
>
> $ PYTHONPATH=. python key-store.py -f bcolz -m 100 -k 5000 -d
> /media/faltet/docker/__test -l 0
> ## Checking method: bcolz (via ctable(clevel=0, cname='blosclz')
> 
> Building database.  Wait please...
> Time (creation) --> 133.650
> Retrieving 100 keys in arbitrary order...
> Time (   query) --> 2.881
> Number of elements out of getitem: 91907396
> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh
> /media/faltet/docker/__test
>
> 39G /media/faltet/docker/__test
>
> and now, with compression on:
>
> $ PYTHONPATH=. python key-store.py -f bcolz -m 100 -k 5000 -d
> /media/faltet/docker/__test -l 9
> ## Checking method: bcolz (via ctable(clevel=9, cname='blosclz')
> 
> Building database.  Wait please...
> Time (creation) --> 145.633
> Retrieving 100 keys in arbitrary order...
> Time (   query) --> 1.339
> Number of elements out of getitem: 91907396
> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh
> /media/faltet/docker/__test
>
> 12G /media/faltet/docker/__test
>
> So, we are still seeing the 3x compression ratio.

Re: [Numpy-discussion] Should I use pip install numpy in linux?

2016-01-14 Thread James E.H. Turner

On 09/01/16 00:13, Nathaniel Smith wrote:

> Right. There's a small problem which is that the base linux system isn't just
> "CentOS 5", it's "CentOS 5 and here's the list of libraries that you're
> allowed to link to: ...", where that list is empirically chosen to include
> only stuff that really is installed on ~all linux machines and for which the
> ABI really has been stable in practice over multiple years and distros (so
> e.g. no OpenSSL).
>
> So the key next step is for someone to figure out and write down that list.
> Continuum and Enthought both have versions of it that we know are good...


You mean something more empirical than
http://refspecs.linuxfoundation.org/lsb.shtml ? I tend to
cross-reference with that when adding stuff to Ureka and just err
on the side of including things where feasible, then of course test
it on the main target platforms. We have also been building on
CentOS 5-6 BTW (I believe the former is about to be unsupported).

Just skimming the thread...

Cheers,

James.


Re: [Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

2016-01-14 Thread Francesc Alted
Well, maybe something like a simple class emulating a dictionary that
stores a key-value on disk would be more than enough.  Then you can use
whatever persistence layer that you want (even HDF5, but not necessarily).

As a demonstration I did a quick and dirty implementation for such a
persistent key-store thing (
https://gist.github.com/FrancescAlted/8e87c8762a49cf5fc897).  On it, the
KeyStore class (less than 40 lines long) is responsible for storing the
value (2 arrays) into a key (a directory).  As I am quite a big fan of
compression, I implemented a couple of serialization flavors: one using the
.npz format (so no other dependencies than NumPy are needed) and the other
using the ctable object from the bcolz package (bcolz.blosc.org).  Here are
some performance numbers:

python key-store.py -f numpy -d __test -l 0
## Checking method: numpy (via .npz files) 
Building database.  Wait please...
Time (creation) --> 1.906
Retrieving 100 keys in arbitrary order...
Time (   query) --> 0.191
Number of elements out of getitem: 10518976
faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test

75M __test

So, with the NPZ format we can deal with the 75 MB quite easily.  But NPZ
can compress data as well, so let's see how it goes:

$ python key-store.py -f numpy -d __test -l 9
## Checking method: numpy (via .npz files) 
Building database.  Wait please...
Time (creation) --> 6.636
Retrieving 100 keys in arbitrary order...
Time (   query) --> 0.384
Number of elements out of getitem: 10518976
faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
28M __test

Ok, in this case we have got almost a 3x compression ratio, which is not
bad.  However, the performance has degraded a lot.  Let's use now bcolz.
First in non-compressed mode:

$ python key-store.py -f bcolz -d __test -l 0
## Checking method: bcolz (via ctable(clevel=0, cname='blosclz')

Building database.  Wait please...
Time (creation) --> 0.479
Retrieving 100 keys in arbitrary order...
Time (   query) --> 0.103
Number of elements out of getitem: 10518976
faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
82M __test

Without compression, bcolz takes a bit more (~10%) space than NPZ.
However, bcolz is actually meant to be used with compression on by default:

$ python key-store.py -f bcolz -d __test -l 9
## Checking method: bcolz (via ctable(clevel=9, cname='blosclz')

Building database.  Wait please...
Time (creation) --> 0.487
Retrieving 100 keys in arbitrary order...
Time (   query) --> 0.98
Number of elements out of getitem: 10518976
faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
29M __test

So, the final disk usage is quite similar to NPZ, but it can store and
retrieve lots faster.  Also, the data decompression speed is on par with
using non-compression.  This is because bcolz uses Blosc behind the scenes,
which is much faster than zlib (used by NPZ) --and sometimes faster than a
memcpy().  However, even though we are doing I/O against the disk, this
dataset is so small that it fits in the OS filesystem cache, so the
benchmark is actually checking I/O at memory speeds, not disk speeds.

In order to do a more real-life comparison, let's use a dataset that is
much larger than the amount of memory in my laptop (8 GB):

$ PYTHONPATH=. python key-store.py -f bcolz -m 100 -k 5000 -d
/media/faltet/docker/__test -l 0
## Checking method: bcolz (via ctable(clevel=0, cname='blosclz')

Building database.  Wait please...
Time (creation) --> 133.650
Retrieving 100 keys in arbitrary order...
Time (   query) --> 2.881
Number of elements out of getitem: 91907396
faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh
/media/faltet/docker/__test

39G /media/faltet/docker/__test

and now, with compression on:

$ PYTHONPATH=. python key-store.py -f bcolz -m 100 -k 5000 -d
/media/faltet/docker/__test -l 9
## Checking method: bcolz (via ctable(clevel=9, cname='blosclz')

Building database.  Wait please...
Time (creation) --> 145.633
Retrieving 100 keys in arbitrary order...
Time (   query) --> 1.339
Number of elements out of getitem: 91907396
faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh
/media/faltet/docker/__test

12G /media/faltet/docker/__test

So, we are still seeing the 3x compression ratio.  But the interesting
thing here is that the compressed version works about 50% faster than the
uncompressed one (13 ms/query vs 29 ms/query).  In this case I was using an
SSD (hence the low query times), so the compression advantage is even more
noticeable than when using memory as above (as expected).

But anyway, this is just a demonstration that you don't need heavy tools to
achieve what you want.  And as a corollary, (fast) compressors can save you
not only storage, but processing time
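
For anyone who doesn't want to follow the gist link, here is a bare-bones
sketch of such a dict-emulating store -- far simpler than the benchmarked
version, and using only the .npz flavor:

import os
import numpy as np

class KeyStore:
    # Minimal illustration: one directory per key, one compressed .npz per key.
    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def __setitem__(self, key, value):
        ints, floats = value
        d = os.path.join(self.root, str(key))
        os.makedirs(d, exist_ok=True)
        np.savez_compressed(os.path.join(d, "value.npz"), ints=ints, floats=floats)

    def __getitem__(self, key):
        with np.load(os.path.join(self.root, str(key), "value.npz")) as npz:
            return [npz["ints"], npz["floats"]]

ks = KeyStore("__test")
ks[0] = [np.array([1, 8, 15]), np.array([0.1, 0.1, 0.1])]
ints, floats = ks[0]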

Re: [Numpy-discussion] Should I use pip install numpy in linux?

2016-01-14 Thread Oscar Benjamin
On 13 January 2016 at 22:23, Chris Barker  wrote:
> On Mon, Jan 11, 2016 at 5:29 PM, Nathaniel Smith  wrote:
>>
>> I agree that talking about such things on distutils-sig tends to elicit a
>> certain amount of puzzled incomprehension, but I don't think it matters --
>> wheels already have everything you need to support this.
>
> well, that's what I figured -- and I started down that path a while back and
> got no support whatsoever (OK, some from Matthew Brett -- thanks!). But I
> know myself well enough to know I wasn't going to get the critical mass
> required to make it useful by myself, so I've moved on to an ecosystem that
> is doing most of the work already.

I think the problem with discussing these things on distutils-sig is
that the discussions are often very theoretical. In reality PyPA are
waiting for people to adopt the infrastructure that they have created
so far by uploading sets of binary wheels. Once that process really
kicks off then as issues emerge there will be real specific problems
to solve and a more concrete discussion of what changes are needed to
wheel/pip/PyPI can emerge.

The main exceptions to this are wheels for Linux and non-setuptools
build dependencies for sdists so it's definitely good to pursue those
problems and try to complete the basic infrastructure.

> Also, you have the problem that there is one PyPi -- so where do you put
> your nifty wheels that depend on other binary wheels? you may need to fork
> every package you want to build :-(

Is this a real problem or a theoretical one? Do you know of some
situation where this wheel to wheel dependency will occur that won't
just be solved in some other way?

--
Oscar


Re: [Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

2016-01-14 Thread Nathaniel Smith
I'd try storing the data in hdf5 (probably via h5py, which is a more
basic interface without all the bells-and-whistles that pytables
adds), though any method you use is going to be limited by the need to
do a seek before each read. Storing the data on SSD will probably help
a lot if you can afford it for your data size.
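
A minimal sketch of that layout with h5py -- one group per key, one dataset
per ragged array (the file and group names are just for illustration):

import numpy as np
import h5py

pairs = {0: (np.array([1, 8, 15, 16000]), np.array([0.1, 0.1, 0.1, 0.1])),
         1: (np.array([5, 6]), np.array([0.5, 0.5]))}

with h5py.File("store.h5", "w") as f:
    for key, (ints, floats) in pairs.items():
        grp = f.create_group(str(key))
        grp.create_dataset("ints", data=ints)
        grp.create_dataset("floats", data=floats)

# Random access later only touches the datasets for that key.
with h5py.File("store.h5", "r") as f:
    ints = f["1/ints"][:]
    floats = f["1/floats"][:]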

On Thu, Jan 14, 2016 at 1:15 AM, Ryan R. Rosario  wrote:
> Hi,
>
> I have a very large dictionary that must be shared across processes and does 
> not fit in RAM. I need access to this object to be fast. The key is an 
> integer ID and the value is a list containing two elements, both of them 
> numpy arrays (one has ints, the other has floats). The key is sequential, 
> starts at 0, and there are no gaps, so the “outer” layer of this data 
> structure could really just be a list with the key actually being the index. 
> The lengths of each pair of arrays may differ across keys.
>
> For a visual:
>
> {
> key=0:
> [
> numpy.array([1,8,15,…, 16000]),
> numpy.array([0.1,0.1,0.1,…,0.1])
> ],
> key=1:
> [
> numpy.array([5,6]),
> numpy.array([0.5,0.5])
> ],
> …
> }
>
> I’ve tried:
> -   manager proxy objects, but the object was so big that low-level code 
> threw an exception due to format and monkey-patching wasn’t successful.
> -   Redis, which was far too slow due to setting up connections and data 
> conversion etc.
> -   Numpy rec arrays + memory mapping, but there is a restriction that 
> the numpy arrays in each “column” must be of fixed and same size.
> -   I looked at PyTables, which may be a solution, but seems to have a 
> very steep learning curve.
> -   I haven’t tried SQLite3, but I am worried about the time it takes to 
> query the DB for a sequential ID, and then translate byte arrays.
>
> Any ideas? I greatly appreciate any guidance you can provide.
>
> Thanks,
> Ryan



-- 
Nathaniel J. Smith -- http://vorpus.org


[Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

2016-01-14 Thread Ryan R. Rosario
Hi,

I have a very large dictionary that must be shared across processes and does 
not fit in RAM. I need access to this object to be fast. The key is an integer 
ID and the value is a list containing two elements, both of them numpy arrays 
(one has ints, the other has floats). The key is sequential, starts at 0, and 
there are no gaps, so the “outer” layer of this data structure could really 
just be a list with the key actually being the index. The lengths of each pair 
of arrays may differ across keys. 

For a visual:

{
key=0:
[
numpy.array([1,8,15,…, 16000]),
numpy.array([0.1,0.1,0.1,…,0.1])
],
key=1:
[
numpy.array([5,6]),
numpy.array([0.5,0.5])
],
…
}

I’ve tried:
-   manager proxy objects, but the object was so big that low-level code 
threw an exception due to format and monkey-patching wasn’t successful. 
-   Redis, which was far too slow due to setting up connections and data 
conversion etc.
-   Numpy rec arrays + memory mapping, but there is a restriction that the 
numpy arrays in each “column” must be of fixed and same size.
-   I looked at PyTables, which may be a solution, but seems to have a very 
steep learning curve.
-   I haven’t tried SQLite3, but I am worried about the time it takes to 
query the DB for a sequential ID, and then translate byte arrays.

Any ideas? I greatly appreciate any guidance you can provide.

Thanks,
Ryan