Re: [Numpy-discussion] About the npz format

2014-04-17 Thread David Palao
2014-04-16 20:26 GMT+02:00 onefire onefire.mys...@gmail.com:
 Hi all,

 I have been playing with the idea of using Numpy's binary format as a
 lightweight alternative to HDF5 (which I believe is the right way to go if
 one does not have a problem with the dependency).

 I am pretty happy with the npy format, but the npz format seems to be broken
 as far as performance is concerned (or I am missing something obvious!). The
 following ipython session illustrates the issue:

 In [1]: import numpy as np

 In [2]: x = np.linspace(1, 10, 5000)

 In [3]: %time np.save('x.npy', x)
 CPU times: user 40 ms, sys: 230 ms, total: 270 ms
 Wall time: 488 ms

 In [4]: %time np.savez('x.npz', data = x)
 CPU times: user 657 ms, sys: 707 ms, total: 1.36 s
 Wall time: 7.7 s


Hi,
In my case (python-2.7.3, numpy-1.6.1):

In [23]: %time save('xx.npy', x)
CPU times: user 0.00 s, sys: 0.23 s, total: 0.23 s
Wall time: 4.07 s

In [24]: %time savez('xx.npz', data = x)
CPU times: user 0.42 s, sys: 0.61 s, total: 1.02 s
Wall time: 4.26 s

In my case I don't see the unbelievable amount of overhead of the npz thing.

Best


Re: [Numpy-discussion] About the npz format

2014-04-17 Thread Nathaniel Smith
On 17 Apr 2014 01:57, onefire onefire.mys...@gmail.com wrote:

 What I cannot understand is why savez takes more than 10 times longer
than saving the data to a npy file. The only reason that I could come up
with was the computation of the crc32.

We can all make guesses but the solution is just to profile it :-). %prun
in ipython (and then if you need more granularity installing line_profiler
is useful).
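
For instance, something along these lines (a sketch; np.lib.npyio._savez
is the private helper behind savez, so the exact target may differ
between versions):

In [1]: %prun np.savez('x.npz', data=x)

In [2]: %load_ext line_profiler

In [3]: %lprun -f np.lib.npyio._savez np.savez('x.npz', data=x)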

-n


Re: [Numpy-discussion] High-quality memory profiling for numpy in python 3.5 / volunteers needed

2014-04-17 Thread Nathaniel Smith
On Wed, Apr 16, 2014 at 4:17 PM, R Hattersley rhatters...@gmail.com wrote:
 For some reason the Python issue 21223 didn't show any activity until I
 logged in to post my patch. At which point I saw that haypo had already
 submitted pretty much exactly the same patch. *sigh* That was pretty much a
 waste of time then. :-|

Oh, that sucks :-(. I knew that there was a patch posted there, but I
was travelling yesterday when you posted :-/.

-n

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org


Re: [Numpy-discussion] ImportError: /usr/local/lib/python2.7/site-packages/numpy-1.8.0-py2.7-linux-x86_64.egg/numpy/core/multiarray.so: undefined symbol: PyUnicodeUCS2_AsASCIIString

2014-04-17 Thread jaylene
No. I didn't rebuild numpy after rebuilding python. I searched online about
this error. It said that this error might be caused by building python with
UCS-4. Is there a way to check if the python was built with UCS-4 or UCS-2?
Will rebuilding python with UCS-2 work? I'm really reluctant to recompile
the entire scipy stack.
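
For reference, sys.maxunicode reports the build's unicode width, so a quick
check is:

import sys
print(sys.maxunicode)   # 1114111 -> UCS-4 ("wide") build, 65535 -> UCS-2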





Re: [Numpy-discussion] High-quality memory profiling for numpy in python 3.5 / volunteers needed

2014-04-17 Thread Aron Ahmadia
 On the one hand it would be nice to actually know whether posix_memalign
is important, before making api decisions on this basis.

FWIW: On the lightweight IBM cores that the extremely popular BlueGene
machines were based on, accessing unaligned memory raised system faults.
 The default behavior of these machines was to terminate the program if
more than 1000 such errors occurred on a given process, and an environment
variable allowed you to terminate the program if *any* unaligned memory
access occurred.  This is because unaligned memory accesses were 15x (or
more) slower than aligned memory access.

The newer /Q chips seem to be a little more forgiving of this, but I think
one can in general expect allocated memory alignment to be an important
performance technique for future high performance computing architectures.

A


On Thu, Apr 17, 2014 at 9:09 AM, Nathaniel Smith n...@pobox.com wrote:

 On Wed, Apr 16, 2014 at 4:17 PM, R Hattersley rhatters...@gmail.com
 wrote:
  For some reason the Python issue 21223 didn't show any activity until I
  logged in to post my patch. At which point I saw that haypo had already
  submitted pretty much exactly the same patch. *sigh* That was pretty
 much a
  waste of time then. :-|

 Oh, that sucks :-(. I knew that there was a patch posted there, but I
 was travelling yesterday when you posted :-/.

 -n

 --
 Nathaniel J. Smith
 Postdoctoral researcher - Informatics - University of Edinburgh
 http://vorpus.org



Re: [Numpy-discussion] High-quality memory profiling for numpy in python 3.5 / volunteers needed

2014-04-17 Thread Nathaniel Smith
On 17 Apr 2014 15:09, Aron Ahmadia a...@ahmadia.net wrote:

  On the one hand it would be nice to actually know whether
posix_memalign is important, before making api decisions on this basis.

 FWIW: On the lightweight IBM cores that the extremely popular BlueGene
machines were based on, accessing unaligned memory raised system faults.
 The default behavior of these machines was to terminate the program if
more than 1000 such errors occurred on a given process, and an environment
variable allowed you to terminate the program if *any* unaligned memory
access occurred.  This is because unaligned memory accesses were 15x (or
more) slower than aligned memory access.

 The newer /Q chips seem to be a little more forgiving of this, but I
think one can in general expect allocated memory alignment to be an
important performance technique for future high performance computing
architectures.

Right, this much is true on lots of architectures, and so malloc is careful
to always return values with sufficient alignment (e.g. 8 bytes) to make
sure that any standard operation can succeed.

The question here is whether it will be important to have *even more*
alignment than malloc gives us by default. A 16 or 32 byte wide SIMD
instruction might prefer that data have 16 or 32 byte alignment, even if
normal memory access for the types being operated on only requires 4 or 8
byte alignment.
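
For what it's worth, one can already get over-aligned arrays in pure numpy
by over-allocating and slicing; a minimal sketch (the helper name and the
32-byte target are just for illustration):

import numpy as np

def aligned_zeros(n, dtype=np.float64, align=32):
    # over-allocate as raw bytes, then slice so the data pointer
    # sits on an align-byte boundary
    itemsize = np.dtype(dtype).itemsize
    buf = np.zeros(n * itemsize + align, dtype=np.uint8)
    offset = (-buf.ctypes.data) % align
    return buf[offset:offset + n * itemsize].view(dtype)

a = aligned_zeros(1000)
assert a.ctypes.data % 32 == 0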

-n


Re: [Numpy-discussion] High-quality memory profiling for numpy in python 3.5 / volunteers needed

2014-04-17 Thread Aron Ahmadia
Hmnn, I wasn't being clear :)

The default malloc on BlueGene/Q only returns 8 byte alignment, but the
SIMD units need 32-byte alignment for loads, stores, and operations or
performance suffers.  On the /P the required alignment was 16-bytes, but
malloc only gave you 8, and trying to perform vectorized loads/stores
generated alignment exceptions on unaligned memory.

See https://wiki.alcf.anl.gov/parts/index.php/Blue_Gene/Q and
https://computing.llnl.gov/tutorials/bgp/BGP-usage.Walkup.pdf (slides 14
for overview, 15 for the effective performance difference between the
unaligned/aligned code) for some notes on this.

A




On Thu, Apr 17, 2014 at 10:18 AM, Nathaniel Smith n...@pobox.com wrote:

 On 17 Apr 2014 15:09, Aron Ahmadia a...@ahmadia.net wrote:
 
   On the one hand it would be nice to actually know whether
 posix_memalign is important, before making api decisions on this basis.
 
  FWIW: On the lightweight IBM cores that the extremely popular BlueGene
 machines were based on, accessing unaligned memory raised system faults.
  The default behavior of these machines was to terminate the program if
 more than 1000 such errors occurred on a given process, and an environment
 variable allowed you to terminate the program if *any* unaligned memory
 access occurred.  This is because unaligned memory accesses were 15x (or
 more) slower than aligned memory access.
 
  The newer /Q chips seem to be a little more forgiving of this, but I
 think one can in general expect allocated memory alignment to be an
 important performance technique for future high performance computing
 architectures.

 Right, this much is true on lots of architectures, and so malloc is
 careful to always return values with sufficient alignment (e.g. 8 bytes) to
 make sure that any standard operation can succeed.

 The question here is whether it will be important to have *even more*
 alignment than malloc gives us by default. A 16 or 32 byte wide SIMD
 instruction might prefer that data have 16 or 32 byte alignment, even if
 normal memory access for the types being operated on only requires 4 or 8
 byte alignment.

 -n





[Numpy-discussion] ANN: Bokeh 0.4.4 released

2014-04-17 Thread Bryan Van de Ven
I am happy to announce the release of Bokeh version 0.4.4!

Bokeh is a Python library for visualizing large and realtime datasets on the 
web. Its goal is to provide elegant, concise construction of novel graphics in 
the style of Protovis/D3, while delivering high-performance interactivity to 
thin clients. Bokeh includes its own Javascript library (BokehJS) that 
implements a reactive scenegraph representation of the plot, and renders 
efficiently to HTML5 Canvas. Bokeh works well with IPython Notebook, but can 
generate standalone graphics that embed into regular HTML. If you are a 
Matplotlib user, you can just use %bokeh magic to start interacting with your 
plots in the notebook immediately!

Check out the full documentation, interactive gallery, and tutorial at

http://bokeh.pydata.org

If you are using Anaconda, you can install with conda:

conda install bokeh

Alternatively, you can install with pip:

pip install bokeh

We are still working on some bigger features but want to get new fixes and 
functionality out to users as soon as we can. Some notable features of this 
release are:

* Additional Matplotlib, ggplot, and Seaborn compatibility (styling, 
more examples)
* TravisCI testing integration at 
https://travis-ci.org/ContinuumIO/bokeh
* Tool enhancements, constrained pan/zoom, more hover glyphs 
* Server remote data and downsampling examples 
* Initial work for Bokeh app concept 

We've also made lots of little bug fixes and enhancements - see the 
CHANGELOG for full details.

BokehJS is also available by CDN for use in standalone javascript applications:

http://cdn.pydata.org/bokeh-0.4.4.js
http://cdn.pydata.org/bokeh-0.4.4.css
http://cdn.pydata.org/bokeh-0.4.4.min.js
http://cdn.pydata.org/bokeh-0.4.4.min.css

Some examples of BokehJS use can be found on the Bokeh JSFiddle page:

http://jsfiddle.net/user/bokeh/fiddles/

The release of Bokeh 0.5 is planned for early May. Some notable features we 
plan to include are:

* Abstract Rendering for semantically meaningful downsampling of large 
datasets
* Better grid-based layout system, using Cassowary.js
* More MPL/Seaborn/ggplot.py compatibility and examples, using 
MPLExporter
* Additional tools, improved interactions, and better plot frame
* Touch support

Issues, enhancement requests, and pull requests can be made on the Bokeh Github 
page: https://github.com/continuumio/bokeh

Questions can be directed to the Bokeh mailing list: bo...@continuum.io

If you have interest in helping to develop Bokeh, please get involved! Special 
thanks to recent contributors: Amy Troschinetz and Gerald Dalley

Bryan Van de Ven
Continuum Analytics
http://continuum.io




Re: [Numpy-discussion] High-quality memory profiling for numpy in python 3.5 / volunteers needed

2014-04-17 Thread Francesc Alted
Uh, 15x slower for unaligned access is quite a lot.  But Intel (and AMD) 
architectures are much more tolerant in this aspect (and improving).  
For example, with a Xeon(R) CPU E5-2670 (2 years old) I get:


In [1]: import numpy as np

In [2]: shape = (1, 1)

In [3]: x_aligned = np.zeros(shape, 
dtype=[('x',np.float64),('y',np.int64)])['x']


In [4]: x_unaligned = np.zeros(shape, 
dtype=[('y1',np.int8),('x',np.float64),('y2',np.int8,(7,))])['x']


In [5]: %timeit res = x_aligned ** 2
1 loops, best of 3: 289 ms per loop

In [6]: %timeit res = x_unaligned ** 2
1 loops, best of 3: 664 ms per loop

so the added cost in this case is just a bit more than 2x.  But you can 
also alleviate this overhead if you do a copy that fits in cache prior 
to doing the computations.  numexpr does this:


https://github.com/pydata/numexpr/blob/master/numexpr/interp_body.cpp#L203

and the results are pretty good:

In [8]: import numexpr as ne

In [9]: %timeit res = ne.evaluate('x_aligned ** 2')
10 loops, best of 3: 133 ms per loop

In [10]: %timeit res = ne.evaluate('x_unaligned ** 2')
10 loops, best of 3: 134 ms per loop

i.e. there is not a significant difference between aligned and unaligned 
access to data.


I wonder if the same technique could be applied to NumPy.

Francesc


On 17/04/14 16:26, Aron Ahmadia wrote:

Hmnn, I wasn't being clear :)

The default malloc on BlueGene/Q only returns 8 byte alignment, but 
the SIMD units need 32-byte alignment for loads, stores, and 
operations or performance suffers.  On the /P the required alignment 
was 16-bytes, but malloc only gave you 8, and trying to perform 
vectorized loads/stores generated alignment exceptions on unaligned 
memory.


See https://wiki.alcf.anl.gov/parts/index.php/Blue_Gene/Q and 
https://computing.llnl.gov/tutorials/bgp/BGP-usage.Walkup.pdf (slides 
14 for overview, 15 for the effective performance difference between 
the unaligned/aligned code) for some notes on this.


A




On Thu, Apr 17, 2014 at 10:18 AM, Nathaniel Smith n...@pobox.com wrote:


On 17 Apr 2014 15:09, Aron Ahmadia a...@ahmadia.net wrote:

  On the one hand it would be nice to actually know whether
posix_memalign is important, before making api decisions on this
basis.

 FWIW: On the lightweight IBM cores that the extremely popular
BlueGene machines were based on, accessing unaligned memory raised
system faults.  The default behavior of these machines was to
terminate the program if more than 1000 such errors occurred on a
given process, and an environment variable allowed you to
terminate the program if *any* unaligned memory access occurred.
 This is because unaligned memory accesses were 15x (or more)
slower than aligned memory access.

 The newer /Q chips seem to be a little more forgiving of this,
but I think one can in general expect allocated memory alignment
to be an important performance technique for future high
performance computing architectures.

Right, this much is true on lots of architectures, and so malloc
is careful to always return values with sufficient alignment (e.g.
8 bytes) to make sure that any standard operation can succeed.

The question here is whether it will be important to have *even
more* alignment than malloc gives us by default. A 16 or 32 byte
wide SIMD instruction might prefer that data have 16 or 32 byte
alignment, even if normal memory access for the types being
operated on only requires 4 or 8 byte alignment.

-n









--
Francesc Alted



[Numpy-discussion] min depth to nonzero in 3d array

2014-04-17 Thread Alan G Isaac
Given an array A of shape m x n x n
(i.e., a stack of square matrices),
I want an n x n array that gives the
minimum depth to a nonzero element.
E.g., the 0,0 element of the result is
np.flatnonzero(A[:,0,0])[0]
Can this be vectorized?
(Assuming a nonzero element exists is ok,
but dealing nicely with its absence is even better.)

Thanks,
Alan Isaac


Re: [Numpy-discussion] High-quality memory profiling for numpy in python 3.5 / volunteers needed

2014-04-17 Thread Julian Taylor
On 17.04.2014 18:06, Francesc Alted wrote:

 
 In [4]: x_unaligned = np.zeros(shape,
 dtype=[('y1',np.int8),('x',np.float64),('y2',np.int8,(7,))])['x']

on arrays of this size you won't see alignment issues; you are dominated
by memory bandwidth. If at all, you will only see it if the data fits
into the cache.
It's also about being unaligned to SIMD vectors, not unaligned to basic
types. But it doesn't matter anymore on modern x86 CPUs. I guess for array
data cache line splits should also not be a big concern.

Aligned allocators are not the only allocators which might be useful in
numpy. Modern CPUs also support larger pages than 4K (huge pages up to
1GB in size) which reduce TLB cache misses. Memory of this type
typically needs to be allocated with special mmap flags, though newer
kernel versions can now also provide this memory transparently for
anonymous pages (normal non-file mmaps).
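
A minimal Linux-only sketch of what such an allocation could look like from
Python (the MAP_HUGETLB value and the 2 MB page size are assumptions about
the platform, and huge pages must have been reserved by the admin):

import ctypes, ctypes.util, mmap

libc = ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]

MAP_HUGETLB = 0x40000            # Linux-specific flag (assumed value)
length = 2 * 1024 * 1024         # one 2 MB huge page

addr = libc.mmap(None, length, mmap.PROT_READ | mmap.PROT_WRITE,
                 mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS | MAP_HUGETLB, -1, 0)
if addr == ctypes.c_void_p(-1).value:   # MAP_FAILED
    raise OSError(ctypes.get_errno(), 'mmap with MAP_HUGETLB failed')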

 
 In [8]: import numexpr as ne
 
 In [9]: %timeit res = ne.evaluate('x_aligned ** 2')
 10 loops, best of 3: 133 ms per loop
 
 In [10]: %timeit res = ne.evaluate('x_unaligned ** 2')
 10 loops, best of 3: 134 ms per loop
 
 i.e. there is not a significant difference between aligned and unaligned
 access to data.
 
 I wonder if the same technique could be applied to NumPy.


you can already do so with relatively simple means:
http://nbviewer.ipython.org/gist/anonymous/10942132

If you change the blocking function to take a function as input and use
in-place operations, numpy can even beat numexpr (though I used the
numexpr Ubuntu package, which might not be compiled optimally).
This type of transformation can probably be applied on the AST quite easily.
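
In the same spirit, a minimal sketch of the blocking idea for a unary ufunc
on a 1-d array (the 16384-element block size is just a guess at what fits
in cache):

import numpy as np

def blocked_apply(ufunc, x, out, bs=16384):
    # copy each chunk into a small contiguous buffer: the copy realigns
    # unaligned input and keeps the working set inside the CPU cache
    tmp = np.empty(bs, dtype=x.dtype)
    for i in range(0, x.size, bs):
        n = min(bs, x.size - i)
        np.copyto(tmp[:n], x[i:i + n])
        ufunc(tmp[:n], out=out[i:i + n])

# e.g. blocked_apply(np.square, x_unaligned, res)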

 
 Francesc
 
 
 On 17/04/14 16:26, Aron Ahmadia wrote:
 Hmnn, I wasn't being clear :)

 The default malloc on BlueGene/Q only returns 8 byte alignment, but
 the SIMD units need 32-byte alignment for loads, stores, and
 operations or performance suffers.  On the /P the required alignment
 was 16-bytes, but malloc only gave you 8, and trying to perform
 vectorized loads/stores generated alignment exceptions on unaligned
 memory.

 See https://wiki.alcf.anl.gov/parts/index.php/Blue_Gene/Q and
 https://computing.llnl.gov/tutorials/bgp/BGP-usage.Walkup.pdf (slides
 14 for overview, 15 for the effective performance difference between
 the unaligned/aligned code) for some notes on this.

 A




 On Thu, Apr 17, 2014 at 10:18 AM, Nathaniel Smith n...@pobox.com wrote:

 On 17 Apr 2014 15:09, Aron Ahmadia a...@ahmadia.net wrote:
 
   On the one hand it would be nice to actually know whether
 posix_memalign is important, before making api decisions on this
 basis.
 
  FWIW: On the lightweight IBM cores that the extremely popular
 BlueGene machines were based on, accessing unaligned memory raised
 system faults.  The default behavior of these machines was to
 terminate the program if more than 1000 such errors occurred on a
 given process, and an environment variable allowed you to
 terminate the program if *any* unaligned memory access occurred.
  This is because unaligned memory accesses were 15x (or more)
 slower than aligned memory access.
 
  The newer /Q chips seem to be a little more forgiving of this,
 but I think one can in general expect allocated memory alignment
 to be an important performance technique for future high
 performance computing architectures.

 Right, this much is true on lots of architectures, and so malloc
 is careful to always return values with sufficient alignment (e.g.
 8 bytes) to make sure that any standard operation can succeed.

 The question here is whether it will be important to have *even
 more* alignment than malloc gives us by default. A 16 or 32 byte
 wide SIMD instruction might prefer that data have 16 or 32 byte
 alignment, even if normal memory access for the types being
 operated on only requires 4 or 8 byte alignment.

 -n






 
 
 -- 
 Francesc Alted
 
 
 
 



Re: [Numpy-discussion] min depth to nonzero in 3d array

2014-04-17 Thread Stephan Hoyer
Hi Alan,

You can abuse np.argmax to calculate the first nonzero element in a
vectorized manner:

import numpy as np
A = (2 * np.random.rand(100, 50, 50)).astype(int)

Compare:

np.argmax(A != 0, axis=0)
np.array([[np.flatnonzero(A[:,i,j])[0] for j in range(50)] for i in
range(50)])

You'll also want to check for all zero arrays with np.all:

np.all(A == 0, axis=0)
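
Putting the two together also handles the all-zero case (a sketch; the -1
sentinel is an arbitrary choice):

depth = np.argmax(A != 0, axis=0)
depth = np.where(np.all(A == 0, axis=0), -1, depth)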

Cheers,
Stephan


On Thu, Apr 17, 2014 at 9:32 AM, Alan G Isaac alan.is...@gmail.com wrote:

 Given an array A of shape m x n x n
 (i.e., a stack of square matrices),
 I want an n x n array that gives the
 minimum depth to a nonzero element.
 E.g., the 0,0 element of the result is
 np.flatnonzero(A[:,0,0])[0]
 Can this be vectorized?
 (Assuming a nonzero element exists is ok,
 but dealing nicely with its absence is even better.)

 Thanks,
 Alan Isaac



Re: [Numpy-discussion] min depth to nonzero in 3d array

2014-04-17 Thread Eelco Hoogendoorn
I agree; argmax would be the best option here, though I would hardly call it
abuse. It seems perfectly readable and idiomatic to me. Though the !=
comparison requires an extra pass over the array, that's the kind of
tradeoff you make in using numpy.


On Thu, Apr 17, 2014 at 7:45 PM, Stephan Hoyer sho...@gmail.com wrote:

 Hi Alan,

 You can abuse np.argmax to calculate the first nonzero element in a
 vectorized manner:

 import numpy as np
 A = (2 * np.random.rand(100, 50, 50)).astype(int)

 Compare:

 np.argmax(A != 0, axis=0)
 np.array([[np.flatnonzero(A[:,i,j])[0] for j in range(50)] for i in
 range(50)])

 You'll also want to check for all zero arrays with np.all:

 np.all(A == 0, axis=0)

 Cheers,
 Stephan


 On Thu, Apr 17, 2014 at 9:32 AM, Alan G Isaac alan.is...@gmail.com wrote:

 Given an array A of shape m x n x n
 (i.e., a stack of square matrices),
 I want an n x n array that gives the
 minimum depth to a nonzero element.
 E.g., the 0,0 element of the result is
 np.flatnonzero(A[:,0,0])[0]
 Can this be vectorized?
 (Assuming a nonzero element exists is ok,
 but dealing nicely with its absence is even better.)

 Thanks,
 Alan Isaac







Re: [Numpy-discussion] High-quality memory profiling for numpy in python 3.5 / volunteers needed

2014-04-17 Thread Francesc Alted
On 17/04/14 19:28, Julian Taylor wrote:
 On 17.04.2014 18:06, Francesc Alted wrote:

 In [4]: x_unaligned = np.zeros(shape,
 dtype=[('y1',np.int8),('x',np.float64),('y2',np.int8,(7,))])['x']
 on arrays of this size you won't see alignment issues you are dominated
 by memory bandwidth. If at all you will only see it if the data fits
 into the cache.
 Its also about unaligned to simd vectors not unaligned to basic types.
 But it doesn't matter anymore on modern x86 cpus. I guess for array data
 cache line splits should also not be a big concern.

Yes, that was my point: on x86 CPUs this is not such a big 
problem.  But still, a factor of 2 is significant, even for CPU-intensive 
tasks.  For example, computing sin() is affected similarly (sin() is 
using SIMD, right?):

In [6]: shape = (1, 1)

In [7]: x_aligned = np.zeros(shape, 
dtype=[('x',np.float64),('y',np.int64)])['x']

In [8]: x_unaligned = np.zeros(shape, 
dtype=[('y1',np.int8),('x',np.float64),('y2',np.int8,(7,))])['x']

In [9]: %timeit res = np.sin(x_aligned)
1 loops, best of 3: 654 ms per loop

In [10]: %timeit res = np.sin(x_unaligned)
1 loops, best of 3: 1.08 s per loop

and again, numexpr can deal with that pretty well (using 8 physical 
cores here):

In [6]: %timeit res = ne.evaluate('sin(x_aligned)')
10 loops, best of 3: 149 ms per loop

In [7]: %timeit res = ne.evaluate('sin(x_unaligned)')
10 loops, best of 3: 151 ms per loop


 Aligned allocators are not the only allocator which might be useful in
 numpy. Modern CPUs also support larger pages than 4K (huge pages up to
 1GB in size) which reduces TLB cache misses. Memory of this type
 typically needs to be allocated with special mmap flags, though newer
 kernel versions can now also provide this memory to transparent
 anonymous pages (normal non-file mmaps).

That's interesting.  In which scenarios do you think that could improve 
performance?

 In [8]: import numexpr as ne

 In [9]: %timeit res = ne.evaluate('x_aligned ** 2')
 10 loops, best of 3: 133 ms per loop

 In [10]: %timeit res = ne.evaluate('x_unaligned ** 2')
 10 loops, best of 3: 134 ms per loop

 i.e. there is not a significant difference between aligned and unaligned
 access to data.

 I wonder if the same technique could be applied to NumPy.

 you already can do so with relatively simple means:
 http://nbviewer.ipython.org/gist/anonymous/10942132

 If you change the blocking function to get a function as input and use
 inplace operations numpy can even beat numexpr (though I used the
 numexpr Ubuntu package which might not be compiled optimally)
 This type of transformation can probably be applied on the AST quite easily.

That's smart.  Yeah, I don't see a reason why numexpr would be 
performing badly on Ubuntu.  But I am not getting your performance for 
blocked_thread on my AMI linux vbox:

http://nbviewer.ipython.org/gist/anonymous/11000524

oh well, threads are always tricky beasts.  By the way, apparently the 
optimal block size for my machine is something like 1 MB, not 128 KB, 
although the difference is not big:

http://nbviewer.ipython.org/gist/anonymous/11002751

(thanks to Stefan Van der Walt for the script).

-- Francesc Alted


Re: [Numpy-discussion] High-quality memory profiling for numpy in python 3.5 / volunteers needed

2014-04-17 Thread Julian Taylor
On 17.04.2014 20:30, Francesc Alted wrote:
 On 17/04/14 19:28, Julian Taylor wrote:
 On 17.04.2014 18:06, Francesc Alted wrote:

 In [4]: x_unaligned = np.zeros(shape,
 dtype=[('y1',np.int8),('x',np.float64),('y2',np.int8,(7,))])['x']
 on arrays of this size you won't see alignment issues you are dominated
 by memory bandwidth. If at all you will only see it if the data fits
 into the cache.
 Its also about unaligned to simd vectors not unaligned to basic types.
 But it doesn't matter anymore on modern x86 cpus. I guess for array data
 cache line splits should also not be a big concern.
 
 Yes, that was my point, that in x86 CPUs this is not such a big 
 problem.  But still a factor of 2 is significant, even for CPU-intensive 
 tasks.  For example, computing sin() is affected similarly (sin() is 
 using SIMD, right?):
 
 In [6]: shape = (1, 1)
 
 In [7]: x_aligned = np.zeros(shape, 
 dtype=[('x',np.float64),('y',np.int64)])['x']
 
 In [8]: x_unaligned = np.zeros(shape, 
 dtype=[('y1',np.int8),('x',np.float64),('y2',np.int8,(7,))])['x']
 
 In [9]: %timeit res = np.sin(x_aligned)
 1 loops, best of 3: 654 ms per loop
 
 In [10]: %timeit res = np.sin(x_unaligned)
 1 loops, best of 3: 1.08 s per loop
 
 and again, numexpr can deal with that pretty well (using 8 physical 
 cores here):
 
 In [6]: %timeit res = ne.evaluate('sin(x_aligned)')
 10 loops, best of 3: 149 ms per loop
 
 In [7]: %timeit res = ne.evaluate('sin(x_unaligned)')
 10 loops, best of 3: 151 ms per loop

in this case the unaligned access triggers a strided memcpy copy loop to
move the data into an aligned buffer, which is terrible for performance,
even compared to the expensive sin call.
numexpr handles this well as it allows the compiler to replace the
memcpy with inline assembly (a mov instruction).
We could fix that in numpy, though I don't consider it very important;
you usually always have base-type-aligned memory.

(sin is not a SIMD-using function unless you use a vector math library,
which numpy does not directly support yet)

 
 
 Aligned allocators are not the only allocator which might be useful in
 numpy. Modern CPUs also support larger pages than 4K (huge pages up to
 1GB in size) which reduces TLB cache misses. Memory of this type
 typically needs to be allocated with special mmap flags, though newer
 kernel versions can now also provide this memory to transparent
 anonymous pages (normal non-file mmaps).
 
 That's interesting.  In which scenarios do you think that could improve 
 performance?

it might improve all numpy operations dealing with big arrays.
Big arrays trigger many large temporaries, meaning glibc uses mmap,
meaning lots of moving of address space between the kernel and userspace.
But I haven't benchmarked it, so it could also be completely irrelevant.

Also, memory fragments really fast, so after a few hours of operation you
often can't allocate any huge pages anymore, so you need to reserve
space for them, which requires special setup of the machines.

Another possibility for special allocators are numa allocators that
ensure you get memory local to a specific compute node regardless of the
system numa policy.
But again, it's probably not very important as python has poor thread
scalability anyway; these are just examples for keeping our allocators
in numpy flexible and not binding us to what python does.


 
 In [8]: import numexpr as ne

 In [9]: %timeit res = ne.evaluate('x_aligned ** 2')
 10 loops, best of 3: 133 ms per loop

 In [10]: %timeit res = ne.evaluate('x_unaligned ** 2')
 10 loops, best of 3: 134 ms per loop

 i.e. there is not a significant difference between aligned and unaligned
 access to data.

 I wonder if the same technique could be applied to NumPy.

 you already can do so with relatively simple means:
 http://nbviewer.ipython.org/gist/anonymous/10942132

 If you change the blocking function to get a function as input and use
 inplace operations numpy can even beat numexpr (though I used the
 numexpr Ubuntu package which might not be compiled optimally)
 This type of transformation can probably be applied on the AST quite easily.
 
 That's smart.  Yeah, I don't see a reason why numexpr would be 
 performing badly on Ubuntu.  But I am not getting your performance for 
 blocked_thread on my AMI linux vbox:
 
 http://nbviewer.ipython.org/gist/anonymous/11000524


my numexpr amd64 package does not make use of SIMD for e.g. sqrt, which is
vectorized in numpy:

numexpr:
  1.30 │ 4638:   sqrtss (%r14),%xmm0
  0.01 │ ucomis %xmm0,%xmm0
  0.00 │   ↓ jp 11ec4
  4.14 │ 4646:   movss  %xmm0,(%r15,%r12,1)
   │ add%rbp,%r14
   │ add$0x4,%r12
(unrolled a couple times)

vs numpy:
 83.25 │190:   sqrtps (%rbx,%r12,4),%xmm0
  0.52 │   movaps %xmm0,0x0(%rbp,%r12,4)
 14.63 │   add$0x4,%r12
  1.60 │   cmp%rdx,%r12
   │ ↑ jb 190

(note the ps vs ss suffix, packed vs scalar)

Re: [Numpy-discussion] About the npz format

2014-04-17 Thread onefire
Hi Nathaniel,

Thanks for the suggestion. I did profile the program before, just not using
Python.

But following your suggestion, I used %prun. Here's (part of) the output
(when I use savez):

 195503 function calls in 4.466 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    2.284    1.142    2.284    1.142 {method 'close' of '_io.BufferedWriter' objects}
        1    0.918    0.918    0.918    0.918 {built-in method remove}
    48841    0.568    0.000    0.568    0.000 {method 'write' of '_io.BufferedWriter' objects}
    48829    0.379    0.000    0.379    0.000 {built-in method crc32}
    48830    0.148    0.000    0.148    0.000 {method 'read' of '_io.BufferedReader' objects}
        1    0.090    0.090    0.993    0.993 zipfile.py:1315(write)
        1    0.072    0.072    0.072    0.072 {method 'tostring' of 'numpy.ndarray' objects}
    48848    0.005    0.000    0.005    0.000 {built-in method len}
        1    0.001    0.001    0.270    0.270 format.py:362(write_array)
        3    0.000    0.000    0.000    0.000 {built-in method open}
        1    0.000    0.000    4.466    4.466 npyio.py:560(_savez)
        2    0.000    0.000    0.000    0.000 zipfile.py:1459(close)
        1    0.000    0.000    4.466    4.466 {built-in method exec}

Here's the output when I use save to save to a npy file:

 39 function calls in 0.266 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4    0.196    0.049    0.196    0.049 {method 'write' of '_io.BufferedWriter' objects}
        1    0.069    0.069    0.069    0.069 {method 'tostring' of 'numpy.ndarray' objects}
        1    0.001    0.001    0.266    0.266 format.py:362(write_array)
        1    0.000    0.000    0.000    0.000 {built-in method open}
        1    0.000    0.000    0.266    0.266 npyio.py:406(save)
        1    0.000    0.000    0.000    0.000 format.py:261(write_array_header_1_0)
        1    0.000    0.000    0.000    0.000 {method 'close' of '_io.BufferedWriter' objects}
        1    0.000    0.000    0.266    0.266 {built-in method exec}
        1    0.000    0.000    0.000    0.000 format.py:154(magic)
        1    0.000    0.000    0.000    0.000 format.py:233(header_data_from_array_1_0)
        1    0.000    0.000    0.266    0.266 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 numeric.py:462(asanyarray)
        1    0.000    0.000    0.000    0.000 py3k.py:28(asbytes)

The calls to close and the built-in method remove seem to be
responsible for the inefficiency of the Numpy implementation (compared to
the Julia package that I mentioned before). This was tested using Python
3.4 and Numpy 1.8.1.
However, if I do the tests with Python 3.3.5 and Numpy 1.8.0, savez becomes
much faster, so I think there is something wrong with the combination
Python 3.4/Numpy 1.8.1.
Also, if I use Python 2.4 and Numpy 1.2 (from my school's cluster) I get
that np.save takes about 3.5 seconds and np.savez takes about 7 seconds, so
all these timings seem to be hugely dependent on the system/version (maybe
this explains David Palao's results?).

However, they all point out that a significant amount of time is spent
computing the crc32. Notice that prun reports that it takes 0.379 seconds
to compute the crc32 of an array that takes 0.2 seconds to save to a npy
file. I believe this is too much! And it gets worse if you try to save
bigger arrays.
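
To isolate that cost, one can time the checksum directly (a sketch reusing
the x array from my first message; note that tostring() itself also makes
a copy):

In [1]: from zlib import crc32

In [2]: %timeit crc32(x.tostring())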


On Thu, Apr 17, 2014 at 5:23 AM, Nathaniel Smith n...@pobox.com wrote:

 On 17 Apr 2014 01:57, onefire onefire.mys...@gmail.com wrote:
 
  What I cannot understand is why savez takes more than 10 times longer
 than saving the data to a npy file. The only reason that I could come up
 with was the computation of the crc32.

 We can all make guesses but the solution is just to profile it :-). %prun
 in ipython (and then if you need more granularity installing line_profiler
 is useful).

 -n





Re: [Numpy-discussion] About the npz format

2014-04-17 Thread Julian Taylor
On 17.04.2014 21:30, onefire wrote:
 Hi Nathaniel,
 
 Thanks for the suggestion. I did profile the program before, just not
 using Python.

one problem of npz is that the zipfile module does not support streaming
data in (or if it does now, we aren't using it).
So numpy writes the file uncompressed to disk and then zips it, which is
horrible for performance and disk usage.

It would be nice if we could add support for different compression
modules like gzip or xz which allow streaming data directly into a file
without an intermediate.
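
For a single array, a minimal sketch of what streaming through gzip could
look like, reusing the helpers that save/load use internally
(numpy.lib.format.write_array/read_array):

import gzip
import numpy as np
from numpy.lib import format as npformat

def save_npy_gz(filename, arr):
    # stream the serialized array straight into the compressed file,
    # with no uncompressed intermediate on disk
    with gzip.open(filename, 'wb') as f:
        npformat.write_array(f, np.asanyarray(arr))

def load_npy_gz(filename):
    with gzip.open(filename, 'rb') as f:
        return npformat.read_array(f)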


Re: [Numpy-discussion] About the npz format

2014-04-17 Thread Valentin Haenel
Hi again,

* David Palao dpalao.pyt...@gmail.com [2014-04-17]:
 2014-04-16 20:26 GMT+02:00 onefire onefire.mys...@gmail.com:
  Hi all,
 
  I have been playing with the idea of using Numpy's binary format as a
  lightweight alternative to HDF5 (which I believe is the right way to go if
  one does not have a problem with the dependency).
 
  I am pretty happy with the npy format, but the npz format seems to be broken
  as far as performance is concerned (or I am missing something obvious!). The
  following ipython session illustrates the issue:
 
  In [1]: import numpy as np
 
  In [2]: x = np.linspace(1, 10, 5000)
 
  In [3]: %time np.save('x.npy', x)
  CPU times: user 40 ms, sys: 230 ms, total: 270 ms
  Wall time: 488 ms
 
  In [4]: %time np.savez('x.npz', data = x)
  CPU times: user 657 ms, sys: 707 ms, total: 1.36 s
  Wall time: 7.7 s
 
 
 Hi,
 In my case (python-2.7.3, numpy-1.6.1):
 
 In [23]: %time save('xx.npy', x)
 CPU times: user 0.00 s, sys: 0.23 s, total: 0.23 s
 Wall time: 4.07 s
 
 In [24]: %time savez('xx.npz', data = x)
 CPU times: user 0.42 s, sys: 0.61 s, total: 1.02 s
 Wall time: 4.26 s
 
 In my case I don't see the unbelievable amount of overhead of the npz thing.

When profiling IO operations, there are many factors that can influence
measurements. In my experience on Linux these may include: the filesystem
cache, the CPU governor, the system load, power-saving features (e.g.
laptop-mode tools), the type of hard drive and how it is connected, and
any cron jobs that might be running (e.g. updating the locate DB).

So, for example, when measuring the time it takes to write something to
disk on Linux, I always at least include a call to ``sync``,
which will ensure that all kernel filesystem buffers will be written to
disk. Even then, you may still have a lot of variability.

As part of bloscpack.sysutil I have wrapped this to be available from
Python (needs root though). So, to re-run the benchmarks, doing each
one twice:

In [1]: import numpy as np

In [2]: import bloscpack.sysutil as bps

In [3]: x = np.linspace(1, 10, 5000)

In [4]: %time np.save('x.npy', x)
CPU times: user 12 ms, sys: 356 ms, total: 368 ms
Wall time: 1.41 s

In [5]: %time np.save('x.npy', x)
CPU times: user 0 ns, sys: 368 ms, total: 368 ms
Wall time: 811 ms

In [6]: %time np.savez('x.npz', data = x)
CPU times: user 540 ms, sys: 864 ms, total: 1.4 s
Wall time: 4.74 s

In [7]: %time np.savez('x.npz', data = x)
CPU times: user 580 ms, sys: 808 ms, total: 1.39 s
Wall time: 9.47 s

In [8]: bps.sync()

In [9]: %time np.save('x.npy', x) ; bps.sync()
CPU times: user 0 ns, sys: 368 ms, total: 368 ms
Wall time: 2.2 s

In [10]: %time np.save('x.npy', x) ; bps.sync()
CPU times: user 0 ns, sys: 356 ms, total: 356 ms
Wall time: 2.16 s

In [11]: bps.sync()

In [12]: %time np.savez('x.npz', x) ; bps.sync()
CPU times: user 564 ms, sys: 816 ms, total: 1.38 s
Wall time: 8.21 s

In [13]: %time np.savez('x.npz', x) ; bps.sync()
CPU times: user 588 ms, sys: 772 ms, total: 1.36 s
Wall time: 6.83 s

As you can see, even when using ``sync`` the values might vary, so in
addition it might be worth using %timeit, which will at least run it
three times and select the best one in its default setting:

In [14]: %timeit np.save('x.npy', x)
1 loops, best of 3: 2.4 s per loop

In [15]: %timeit np.savez('x.npz', x)
1 loops, best of 3: 7.1 s per loop

In [16]: %timeit np.save('x.npy', x) ; bps.sync()
1 loops, best of 3: 3.11 s per loop

In [17]: %timeit np.savez('x.npz', x) ; bps.sync()
1 loops, best of 3: 7.36 s per loop

So, anyway, given these readings,  I would tend to support the claim
that there is something slowing down writing when using plain NPZ w/o
compression.

FYI: when reading, the kernel keeps files that were recently read in the
filesystem buffers, and so when measuring reads I tend to drop those
caches using ``drop_caches()`` from bloscpack.sysutil (which uses the
Linux proc fs).
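
For reference, a sketch of what dropping the caches boils down to
(Linux-only and needs root):

import os
os.system('sync')                       # flush dirty buffers first
with open('/proc/sys/vm/drop_caches', 'w') as f:
    f.write('3\n')                      # 3 = pagecache + dentries/inodes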

best,

V-


Re: [Numpy-discussion] About the npz format

2014-04-17 Thread Valentin Haenel
Hi,

* Julian Taylor jtaylor.deb...@googlemail.com [2014-04-17]:
 On 17.04.2014 21:30, onefire wrote:
  Hi Nathaniel,
  
  Thanks for the suggestion. I did profile the program before, just not
  using Python.
 
 one problem of npz is that the zipfile module does not support streaming
 data in (or if it does now we aren't using it).
 So numpy writes the file uncompressed to disk and then zips it which is
 horrible for performance and disk usage.

As a workaround it may also be possible to write the temporary NPY files to
cStringIO instances and then use ``ZipFile.writestr`` with the
``getvalue()`` of the cStringIO object. However, that approach may
require some memory. In python 2.7, for each array: one copy inside the
cStringIO instance and then another copy when calling getvalue on it, I
believe.
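
A sketch of that idea, spelled with io.BytesIO (numpy.lib.format.write_array
is the helper that np.save uses internally):

import io
import zipfile
import numpy as np
from numpy.lib import format as npformat

def savez_no_temp(filename, **arrays):
    # serialize each array into an in-memory buffer, then add it to
    # the archive with writestr -- no temporary .npy files on disk
    with zipfile.ZipFile(filename, 'w', zipfile.ZIP_STORED) as zf:
        for name, arr in arrays.items():
            buf = io.BytesIO()
            npformat.write_array(buf, np.asanyarray(arr))
            zf.writestr(name + '.npy', buf.getvalue())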

best,

V-


Re: [Numpy-discussion] About the npz format

2014-04-17 Thread Valentin Haenel
* Valentin Haenel valen...@haenel.co [2014-04-17]:
 * Valentin Haenel valen...@haenel.co [2014-04-17]:
  Hi,
  
  * Julian Taylor jtaylor.deb...@googlemail.com [2014-04-17]:
   On 17.04.2014 21:30, onefire wrote:
Hi Nathaniel,

Thanks for the suggestion. I did profile the program before, just not
using Python.
   
   one problem of npz is that the zipfile module does not support streaming
   data in (or if it does now we aren't using it).
   So numpy writes the file uncompressed to disk and then zips it which is
   horrible for performance and disk usage.
  
  As a workaround may also be possible to write the temporary NPY files to
  cStringIO instances and then use ``ZipFile.writestr`` with the
  ``getvalue()`` of the cStringIO object. However that approach may
  require some memory. In python 2.7, for each array: one copy inside the
  cStringIO instance and then another copy of when calling getvalue on the
  cString, I believe.
 
 There is a proof-of-concept implementation here:
 
 https://github.com/esc/numpy/compare/feature;npz_no_temp_file
 
 Here are the timings, again using ``sync()`` from bloscpack (but it's
 just a ``os.system('sync')``, in case you want to run your own
 benchmarks):
 
 In [1]: import numpy as np
 
 In [2]: import bloscpack.sysutil as bps
 
 In [3]: x = np.linspace(1, 10, 5000)
 
 In [4]: %timeit np.save('x.npy', x) ; bps.sync()
 1 loops, best of 3: 1.93 s per loop
 
 In [5]: %timeit np.savez('x.npz', x) ; bps.sync()
 1 loops, best of 3: 7.88 s per loop
 
 In [6]: %timeit np._savez_no_temp('x.npy', [x], {}, False) ; bps.sync()
 1 loops, best of 3: 3.22 s per loop
 
 Not too bad, but still slower than plain NPY; memory copies would be my
 guess.

 PS: Running Python 2.7.6 :: Anaconda 1.9.2 (64-bit) and Numpy master

Also, in case you were wondering, here is the profiler output:

In [2]: %prun -l 10 np._savez_no_temp('x.npy', [x], {}, False)
 943 function calls (917 primitive calls) in 1.139 seconds

   Ordered by: internal time
   List reduced from 99 to 10 due to restriction 10

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.386    0.386    0.386    0.386 {zlib.crc32}
        8    0.234    0.029    0.234    0.029 {method 'write' of 'file' objects}
       27    0.162    0.006    0.162    0.006 {method 'write' of 'cStringIO.StringO' objects}
        1    0.158    0.158    0.158    0.158 {method 'getvalue' of 'cStringIO.StringO' objects}
        1    0.091    0.091    0.091    0.091 {method 'close' of 'file' objects}
       24    0.064    0.003    0.064    0.003 {method 'tobytes' of 'numpy.ndarray' objects}
        1    0.022    0.022    1.119    1.119 npyio.py:608(_savez_no_temp)
        1    0.019    0.019    1.139    1.139 <string>:1(<module>)
        1    0.002    0.002    0.227    0.227 format.py:362(write_array)
        1    0.001    0.001    0.001    0.001 zipfile.py:433(_GenerateCRCTable)

V-


Re: [Numpy-discussion] About the npz format

2014-04-17 Thread Valentin Haenel
Hi,

* Valentin Haenel valen...@haenel.co [2014-04-17]:
 * Valentin Haenel valen...@haenel.co [2014-04-17]:
  * Valentin Haenel valen...@haenel.co [2014-04-17]:
   Hi,
  
   * Julian Taylor jtaylor.deb...@googlemail.com [2014-04-17]:
On 17.04.2014 21:30, onefire wrote:
 Hi Nathaniel,

 Thanks for the suggestion. I did profile the program before, just not
 using Python.
   
one problem of npz is that the zipfile module does not support streaming
data in (or if it does now we aren't using it).
So numpy writes the file uncompressed to disk and then zips it which is
horrible for performance and disk usage.
  
   As a workaround may also be possible to write the temporary NPY files to
   cStringIO instances and then use ``ZipFile.writestr`` with the
   ``getvalue()`` of the cStringIO object. However that approach may
   require some memory. In python 2.7, for each array: one copy inside the
   cStringIO instance and then another copy of when calling getvalue on the
   cString, I believe.
 
  There is a proof-of-concept implementation here:
 
  https://github.com/esc/numpy/compare/feature;npz_no_temp_file
 
  Here are the timings, again using ``sync()`` from bloscpack (but it's
  just a ``os.system('sync')``, in case you want to run your own
  benchmarks):
 
  In [1]: import numpy as np
 
  In [2]: import bloscpack.sysutil as bps
 
  In [3]: x = np.linspace(1, 10, 5000)
 
  In [4]: %timeit np.save('x.npy', x) ; bps.sync()
  1 loops, best of 3: 1.93 s per loop
 
  In [5]: %timeit np.savez('x.npz', x) ; bps.sync()
  1 loops, best of 3: 7.88 s per loop
 
  In [6]: %timeit np._savez_no_temp('x.npy', [x], {}, False) ; bps.sync()
  1 loops, best of 3: 3.22 s per loop
 
  Not too bad, but still slower than plain NPY, memory copies would be my
  guess.

  PS: Running Python 2.7.6 :: Anaconda 1.9.2 (64-bit) and Numpy master

 Also, in case you were wondering, here is the profiler output:

 In [2]: %prun -l 10 np._savez_no_temp('x.npy', [x], {}, False)
  943 function calls (917 primitive calls) in 1.139 seconds

Ordered by: internal time
List reduced from 99 to 10 due to restriction 10

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1    0.386    0.386    0.386    0.386 {zlib.crc32}
         8    0.234    0.029    0.234    0.029 {method 'write' of 'file' objects}
        27    0.162    0.006    0.162    0.006 {method 'write' of 'cStringIO.StringO' objects}
         1    0.158    0.158    0.158    0.158 {method 'getvalue' of 'cStringIO.StringO' objects}
         1    0.091    0.091    0.091    0.091 {method 'close' of 'file' objects}
        24    0.064    0.003    0.064    0.003 {method 'tobytes' of 'numpy.ndarray' objects}
         1    0.022    0.022    1.119    1.119 npyio.py:608(_savez_no_temp)
         1    0.019    0.019    1.139    1.139 <string>:1(<module>)
         1    0.002    0.002    0.227    0.227 format.py:362(write_array)
         1    0.001    0.001    0.001    0.001 zipfile.py:433(_GenerateCRCTable)

And, to shed some more light on this, here is the kernprof (line-by-line
profiler) output (of a slightly modified version):

zsh» cat mp.py
import numpy as np
x = np.linspace(1, 10, 5000)
np._savez_no_temp('x.npy', [x], {}, False)

zsh» ./kernprof.py -v -l mp.py
Wrote profile results to mp.py.lprof
Timer unit: 1e-06 s

File: numpy/lib/npyio.py
Function: _savez_no_temp at line 608
Total time: 1.16438 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   608                                           @profile
   609                                           def _savez_no_temp(file, args, kwds, compress):
   610                                               # Import is postponed to here since zipfile depends on gzip, an optional
   611                                               # component of the so-called standard library.
   612         1         5655   5655.0      0.5      import zipfile
   613
   614         1            6      6.0      0.0      from cStringIO import StringIO
   615
   616         1            2      2.0      0.0      if isinstance(file, basestring):
   617         1            2      2.0      0.0          if not file.endswith('.npz'):
   618         1            1      1.0      0.0              file = file + '.npz'
   619
   620         1            1      1.0      0.0      namedict = kwds
   621         2            4      2.0      0.0      for i, val in enumerate(args):
   622         1            6      6.0      0.0          key = 'arr_%d' % i
   623         1            1      1.0      0.0          if key in namedict.keys():
   624                                                       raise ValueError(
   625                                                           "Cannot use un-named variables and keyword %s" % key)
   626         1            1      1.0      0.0          namedict[key] = val
   627
   628         1

Re: [Numpy-discussion] About the npz format

2014-04-17 Thread Valentin Haenel
Hello,

* Valentin Haenel valen...@haenel.co [2014-04-17]:
 As part of bloscpack.sysutil I have wrapped this to be available from
 Python (needs root though). So, to re-run the benchmarks, doing each
 one twice:

Actually, I just realized that doing a ``sync`` doesn't require root.

my bad,

V-


Re: [Numpy-discussion] About the npz format

2014-04-17 Thread onefire
Interesting! Using sync() as you suggested makes every write slower, and
it decreases the time difference between save and savez, so maybe I was
observing the 10 times difference because the file system buffers were
being flushed immediately after a call to savez, but not right after a
call to np.save.

I think your workaround might help, but a better solution would be to not
use Python's zipfile module at all. This would make it possible to, say,
let the user choose the checksum algorithm or to turn it off.
Or maybe the compression stuff makes this route too complicated to be worth
the trouble? (After all, the zip format is not that hard to understand.)

Gilberto



On Thu, Apr 17, 2014 at 6:45 PM, Valentin Haenel valen...@haenel.co wrote:

 Hello,

 * Valentin Haenel valen...@haenel.co [2014-04-17]:
  As part of bloscpack.sysutil I have wrapped this to be available from
  Python (needs root though). So, to re-run the benchmarks, doing each
  one twice:

 Actually, I just realized that doing a ``sync`` doesn't require root.

 my bad,

 V-



Re: [Numpy-discussion] About the npz format

2014-04-17 Thread onefire
I found this github issue (https://github.com/numpy/numpy/pull/3465) where
someone mentions the idea of forking the zip library.

Gilberto


On Thu, Apr 17, 2014 at 8:09 PM, onefire onefire.mys...@gmail.com wrote:

 Interesting! Using sync() as you suggested makes every write slower,  and
 it decreases the time difference between save and savez,
 so maybe I was observing the 10 times difference because the file system
 buffers were being flushed immediately after a call to savez, but not right
 after a call to np.save.

 I think your workaround might help, but a better solution would be to not
 use Python's zipfile module at all. This would make it possible to, say,
 let the user choose the checksum algorithm or to turn that off.
 Or maybe the compression stuff makes this route too complicated to be
 worth the trouble? (after all, the zip format is not that hard to
 understand)

 Gilberto



 On Thu, Apr 17, 2014 at 6:45 PM, Valentin Haenel valen...@haenel.co wrote:

 Hello,

 * Valentin Haenel valen...@haenel.co [2014-04-17]:
  As part of bloscpack.sysutil I have wrapped this to be available from
  Python (needs root though). So, to re-run the benchmarks, doing each
  one twice:

 Actually, I just realized that doing a ``sync`` doesn't require root.

 my bad,

 V-


