### Re: [Numpy-discussion] PyData Barcelona this May

2017-03-17 12:37 GMT+01:00 Jaime Fernández del Río <jaime.f...@gmail.com>: > Last night I gave a short talk to the PyData Zürich meetup on Julian's > temporary elision PR, and Pauli's overlapping memory one. My learnings from > that experiment are: > >- there is no way to talk about both things in a 30 minute talk: I >barely scraped the surface and ended up needing 25 minutes. >- many people that use numpy in their daily work don't know what >strides are, this was a BIG surprise for me. > > Based on that experience, I was thinking that maybe a good topic for a > workshop would be NumPy's memory model: views, reshaping, strides, some > hints of buffering in the iterator... > Yeah, I think that workshop would represent a very valuable insight to many people using NumPy. > > And Julian's temporary work lends itself to a very nice talk, more on > Python internals than on NumPy, but it's a very cool subject nonetheless. > > So my thinking is that I am going to propose those two, as a workshop and > a talk. Thoughts? > +1 > > Jaime > > On Thu, Mar 9, 2017 at 8:29 PM, Sebastian Berg <sebast...@sipsolutions.net > > wrote: > >> On Thu, 2017-03-09 at 15:45 +0100, Jaime Fernández del Río wrote: >> > There will be a PyData conference in Barcelona this May: >> > >> > http://pydata.org/barcelona2017/ >> > >> > I am planning on attending, and was thinking of maybe proposing to >> > organize a numpy-themed workshop or tutorial. >> > >> > My personal inclination would be to look at some advanced topic that >> > I know well, like writing gufuncs in Cython, but wouldn't mind doing >> > a more run of the mill thing. Anyone has any thoughts or experiences >> > on what has worked well in similar situations? Any specific topic you >> > always wanted to attend a workshop on, but were afraid to ask? >> > >> > Alternatively, or on top of the workshop, I could propose to do a >> > talk: talking last year at PyData Madrid about the new indexing was a >> > lot of fun! Thing is, I have been quite disconnected from the project >> > this past year, and can't really think of any worthwhile topic. Is >> > there any message that we as a project would like to get out to the >> > larger community? >> > >> >> Francesc already pointed out the temporary optimization. From what I >> remember, my personal highlight would probably be Pauli's work on the >> memory overlap detection. Though both are rather passive improvements I >> guess (you don't really have to learn them to use them), its very cool! >> And if its about highlighting new stuff, these can probably easily fill >> a talk. >> >> > And if you are planning on attending, please give me a shout. >> > >> >> Barcelona :). Maybe I should think about it, but probably not. >> >> >> > Thanks, >> > >> > Jaime >> > >> > -- >> > (\__/) >> > ( O.o) >> > ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus >> > planes de dominación mundial. >> > ___ >> > NumPy-Discussion mailing list >> > NumPy-Discussion@scipy.org >> > https://mail.scipy.org/mailman/listinfo/numpy-discussion >> ___ >> NumPy-Discussion mailing list >> NumPy-Discussion@scipy.org >> https://mail.scipy.org/mailman/listinfo/numpy-discussion >> >> > > > -- > (\__/) > ( O.o) > ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes > de dominación mundial. > > ___ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > https://mail.scipy.org/mailman/listinfo/numpy-discussion > > -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] caching large allocations on gnu/linux

2017-03-13 18:11 GMT+01:00 Julian Taylor: > On 13.03.2017 16:21, Anne Archibald wrote: > > > > > > On Mon, Mar 13, 2017 at 12:21 PM Julian Taylor > > > > > wrote: > > > > Should it be agreed that caching is worthwhile I would propose a very > > simple implementation. We only really need to cache a small handful > of > > array data pointers for the fast allocate deallocate cycle that > appear > > in common numpy usage. > > For example a small list of maybe 4 pointers storing the 4 largest > > recent deallocations. New allocations just pick the first memory > block > > of sufficient size. > > The cache would only be active on systems that support MADV_FREE > (which > > is linux 4.5 and probably BSD too). > > > > So what do you think of this idea? > > > > > > This is an interesting thought, and potentially a nontrivial speedup > > with zero user effort. But coming up with an appropriate caching policy > > is going to be tricky. The thing is, for each array, numpy grabs a block > > "the right size", and that size can easily vary by orders of magnitude, > > even within the temporaries of a single expression as a result of > > broadcasting. So simply giving each new array the smallest cached block > > that will fit could easily result in small arrays in giant allocated > > blocks, wasting non-reclaimable memory. So really you want to recycle > > blocks of the same size, or nearly, which argues for a fairly large > > cache, with smart indexing of some kind. > > > > The nice thing about MADV_FREE is that we don't need any clever cache. > The same process that marked the pages free can reclaim them in another > allocation, at least that is what my testing indicates it allows. > So a small allocation getting a huge memory block does not waste memory > as the top unused part will get reclaimed when needed, either by numpy > itself doing another allocation or a different program on the system. > Well, what you say makes a lot of sense to me, so if you have tested that then I'd say that this is worth a PR and see how it works on different workloads. > > An issue that does arise though is that this memory is not available for > the page cache used for caching on disk data. A too large cache might > then be detrimental for IO heavy workloads that rely on the page cache. > Yeah. Also, memory mapped arrays use the page cache intensively, so we should test this use case and see how the caching affects memory map performance. > So we might want to cap it to some max size, provide an explicit on/off > switch and/or have numpy IO functions clear the cache. > Definitely dynamically allowing the disabling this feature would be desirable. That would provide an easy path for testing how it affects performance. Would that be feasible? Francesc ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] PyData Barcelona this May

Hola Jaime! 2017-03-09 15:45 GMT+01:00 Jaime Fernández del Río <jaime.f...@gmail.com>: > There will be a PyData conference in Barcelona this May: > > http://pydata.org/barcelona2017/ > > I am planning on attending, and was thinking of maybe proposing to > organize a numpy-themed workshop or tutorial. > > My personal inclination would be to look at some advanced topic that I > know well, like writing gufuncs in Cython, but wouldn't mind doing a more > run of the mill thing. Anyone has any thoughts or experiences on what has > worked well in similar situations? Any specific topic you always wanted to > attend a workshop on, but were afraid to ask? > Writing gufuncs in Cython seems a quite advanced topic for a workshop, but an interesting one indeed. Numba also supports creating gufuncs ( http://numba.pydata.org/numba-doc/dev/reference/numpysupported.html), so this perhaps may work as a first approach before going deeper into Cython. > > Alternatively, or on top of the workshop, I could propose to do a talk: > talking last year at PyData Madrid about the new indexing was a lot of fun! > Thing is, I have been quite disconnected from the project this past year, > and can't really think of any worthwhile topic. Is there any message that > we as a project would like to get out to the larger community? > Not a message in particular, but perhaps it would be nice talking about the temporaries removal in expressions that Julian implemented recently ( https://github.com/numpy/numpy/pull/7997) and that is to be released in 1.13. It is a really cool (and somewhat scary) patch ;) > > And if you are planning on attending, please give me a shout. > It would be nice to attend and see you again, but unfortunately I am quite swamped. Will see. Have fun in Barcelona! -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] Fortran order in recarray.

2017-02-22 16:30 GMT+01:00 Kiko <kikocorre...@gmail.com>: > > > 2017-02-22 16:23 GMT+01:00 Alex Rogozhnikov <alex.rogozhni...@yandex.ru>: > >> Hi Francesc, >> thanks a lot for you reply and for your impressive job on bcolz! >> >> Bcolz seems to make stress on compression, which is not of much interest >> for me, but the *ctable*, and chunked operations look very appropriate >> to me now. (Of course, I'll need to test it much before I can say this for >> sure, that's current impression). >> > You can disable compression for bcolz by default too: http://bcolz.blosc.org/en/latest/defaults.html#list-of-default-values > >> The strongest concern with bcolz so far is that it seems to be completely >> non-trivial to install on windows systems, while pip provides binaries for >> most (or all?) OS for numpy. >> I didn't build pip binary wheels myself, but is it hard / impossible to >> cook pip-installabel binaries? >> > > http://www.lfd.uci.edu/~gohlke/pythonlibs/#bcolz > Check if the link solves the issue with installing. > Yeah. Also, there are binaries for conda: http://bcolz.blosc.org/en/latest/install.html#installing-from-conda-forge > >> You can change shapes of numpy arrays, but that usually involves copies >> of the whole container. >> >> sure, but this is ok for me, as I plan to organize column editing in >> 'batches', so this should require seldom copying. >> It would be nice to see an example to understand how deep I need to go >> inside numpy. >> > Well, if copying is not a problem for you, then you can just create a new numpy container and do the copy by yourself. Francesc > >> Cheers, >> Alex. >> >> >> >> >> 22 февр. 2017 г., в 17:03, Francesc Alted <fal...@gmail.com> написал(а): >> >> Hi Alex, >> >> 2017-02-22 12:45 GMT+01:00 Alex Rogozhnikov <alex.rogozhni...@yandex.ru>: >> >>> Hi Nathaniel, >>> >>> >>> pandas >>> >>> >>> yup, the idea was to have minimal pandas.DataFrame-like storage (which I >>> was using for a long time), >>> but without irritating problems with its row indexing and some other >>> problems like interaction with matplotlib. >>> >>> A dict of arrays? >>> >>> >>> that's what I've started from and implemented, but at some point I >>> decided that I'm reinventing the wheel and numpy has something already. In >>> principle, I can ignore this 'column-oriented' storage requirement, but >>> potentially it may turn out to be quite slow-ish if dtype's size is large. >>> >>> Suggestions are welcome. >>> >> >> You may want to try bcolz: >> >> https://github.com/Blosc/bcolz >> >> bcolz is a columnar storage, basically as you require, but data is >> compressed by default even when stored in-memory (although you can disable >> compression if you want to). >> >> >> >>> >>> Another strange question: >>> in general, it is considered that once numpy.array is created, it's >>> shape not changed. >>> But if i want to keep the same recarray and change it's dtype and/or >>> shape, is there a way to do this? >>> >> >> You can change shapes of numpy arrays, but that usually involves copies >> of the whole container. With bcolz you can change length and add/del >> columns without copies. If your containers are large, it is better to >> inform bcolz on its final estimated size. See: >> >> http://bcolz.blosc.org/en/latest/opt-tips.html >> >> Francesc >> >> >>> >>> Thanks, >>> Alex. >>> >>> >>> >>> 22 февр. 2017 г., в 3:53, Nathaniel Smith <n...@pobox.com> написал(а): >>> >>> On Feb 21, 2017 3:24 PM, "Alex Rogozhnikov" <alex.rogozhni...@yandex.ru> >>> wrote: >>> >>> Ah, got it. Thanks, Chris! >>> I thought recarray can be only one-dimensional (like tables with named >>> columns). >>> >>> Maybe it's better to ask directly what I was looking for: >>> something that works like a table with named columns (but no labelling >>> for rows), and keeps data (of different dtypes) in a column-by-column way >>> (and this is numpy, not pandas). >>> >>> Is there such a magic thing? >>> >>> >>> Well, that's what pandas is for... >>> >>> A dict of arrays? >>> >>> -n >>> __

### Re: [Numpy-discussion] Fortran order in recarray.

Hi Alex, 2017-02-22 12:45 GMT+01:00 Alex Rogozhnikov <alex.rogozhni...@yandex.ru>: > Hi Nathaniel, > > > pandas > > > yup, the idea was to have minimal pandas.DataFrame-like storage (which I > was using for a long time), > but without irritating problems with its row indexing and some other > problems like interaction with matplotlib. > > A dict of arrays? > > > that's what I've started from and implemented, but at some point I decided > that I'm reinventing the wheel and numpy has something already. In > principle, I can ignore this 'column-oriented' storage requirement, but > potentially it may turn out to be quite slow-ish if dtype's size is large. > > Suggestions are welcome. > You may want to try bcolz: https://github.com/Blosc/bcolz bcolz is a columnar storage, basically as you require, but data is compressed by default even when stored in-memory (although you can disable compression if you want to). > > Another strange question: > in general, it is considered that once numpy.array is created, it's shape > not changed. > But if i want to keep the same recarray and change it's dtype and/or > shape, is there a way to do this? > You can change shapes of numpy arrays, but that usually involves copies of the whole container. With bcolz you can change length and add/del columns without copies. If your containers are large, it is better to inform bcolz on its final estimated size. See: http://bcolz.blosc.org/en/latest/opt-tips.html Francesc > > Thanks, > Alex. > > > > 22 февр. 2017 г., в 3:53, Nathaniel Smith <n...@pobox.com> написал(а): > > On Feb 21, 2017 3:24 PM, "Alex Rogozhnikov" <alex.rogozhni...@yandex.ru> > wrote: > > Ah, got it. Thanks, Chris! > I thought recarray can be only one-dimensional (like tables with named > columns). > > Maybe it's better to ask directly what I was looking for: > something that works like a table with named columns (but no labelling for > rows), and keeps data (of different dtypes) in a column-by-column way (and > this is numpy, not pandas). > > Is there such a magic thing? > > > Well, that's what pandas is for... > > A dict of arrays? > > -n > ___ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > https://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > ___ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > https://mail.scipy.org/mailman/listinfo/numpy-discussion > > -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] ANN: NumExpr3 Alpha

Yes, Julian is doing an amazing work on getting rid of temporaries inside NumPy. However, NumExpr still has the advantage of using multi-threading right out of the box, as well as integration with Intel VML. Hopefully these features will eventually arrive to NumPy, but meanwhile there is still value in pushing NumExpr. Francesc 2017-02-19 18:21 GMT+01:00 Marten van Kerkwijk <m.h.vankerkw...@gmail.com>: > Hi All, > > Just a side note that at a smaller scale some of the benefits of > numexpr are coming to numpy: Julian Taylor has been working on > identifying temporary arrays in > https://github.com/numpy/numpy/pull/7997. Julian also commented > (https://github.com/numpy/numpy/pull/7997#issuecomment-246118772) that > with PEP 523 in python 3.6, this should indeed become a lot easier. > > All the best, > > Marten > ___ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > https://mail.scipy.org/mailman/listinfo/numpy-discussion > -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] ANN: NumExpr3 Alpha

--- > > * strided complex functions > * Intel VML support (less necessary now with gcc auto-vectorization) > * bytes and unicode support > * reductions (mean, sum, prod, std) > > > What I'm looking for feedback on > > > * String arrays: How do you use them? How would unicode differ from bytes > strings? > * Interface: We now have a more object-oriented interface underneath the > familiar > evaluate() interface. How would you like to use this interface? > Francesc suggested > generator support, as currently it's more difficult to use NumExpr > within a loop than > it should be. > > > Ideas for the future > - > > * vectorize real functions (such as exp, sqrt, log) similar to the > complex_functions.hpp vectorization. > * Add a keyword (likely 'yield') to indicate that a token is intended to > be changed by a generator inside a loop with each call to NumExpr.run() > > If you have any thoughts or find any issues please don't hesitate to open > an issue at the Github repo. Although unit tests have been run over the > operation space there are undoubtedly a number of bugs to squash. > > Sincerely, > > Robert > > -- > Robert McLeod, Ph.D. > Center for Cellular Imaging and Nano Analytics (C-CINA) > Biozentrum der Universität Basel > Mattenstrasse 26, 4058 Basel > Work: +41.061.387.3225 <061%20387%2032%2025> > robert.mcl...@unibas.ch > robert.mcl...@bsse.ethz.ch <robert.mcl...@ethz.ch> > robbmcl...@gmail.com > > ___ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > https://mail.scipy.org/mailman/listinfo/numpy-discussion > > -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: numexpr 2.6.2 released!

= Announcing Numexpr 2.6.2 = What's new == This is a maintenance release that fixes several issues, with special emphasis in keeping compatibility with newer NumPy versions. Also, initial support for POWER processors is here. Thanks to Oleksandr Pavlyk, Alexander Shadchin, Breno Leitao, Fernando Seiti Furusato and Antonio Valentino for their nice contributions. In case you want to know more in detail what has changed in this version, see: https://github.com/pydata/numexpr/blob/master/RELEASE_NOTES.rst What's Numexpr == Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like "3*a+4*b") are accelerated and use less memory than doing the same calculation in Python. It wears multi-threaded capabilities, as well as support for Intel's MKL (Math Kernel Library), which allows an extremely fast evaluation of transcendental functions (sin, cos, tan, exp, log...) while squeezing the last drop of performance out of your multi-core processors. Look here for a some benchmarks of numexpr using MKL: https://github.com/pydata/numexpr/wiki/NumexprMKL Its only dependency is NumPy (MKL is optional), so it works well as an easy-to-deploy, easy-to-use, computational engine for projects that don't want to adopt other solutions requiring more heavy dependencies. Where I can find Numexpr? = The project is hosted at GitHub in: https://github.com/pydata/numexpr You can get the packages from PyPI as well (but not for RC releases): http://pypi.python.org/pypi/numexpr Share your experience = Let us know of any bugs, suggestions, gripes, kudos, etc. you may have. Enjoy data! -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] array comprehension

2016-11-04 14:36 GMT+01:00 Neal Becker <ndbeck...@gmail.com>: > Francesc Alted wrote: > > > 2016-11-04 13:06 GMT+01:00 Neal Becker <ndbeck...@gmail.com>: > > > >> I find I often write: > >> np.array ([some list comprehension]) > >> > >> mainly because list comprehensions are just so sweet. > >> > >> But I imagine this isn't particularly efficient. > >> > > > > Right. Using a generator and np.fromiter() will avoid the creation of > the > > intermediate list. Something like: > > > > np.fromiter((i for i in range(x))) # use xrange for Python 2 > > > > > Does this generalize to >1 dimensions? > A reshape() is not enough? What do you want to do exactly? > > ___ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > https://mail.scipy.org/mailman/listinfo/numpy-discussion > -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] array comprehension

2016-11-04 13:06 GMT+01:00 Neal Becker <ndbeck...@gmail.com>: > I find I often write: > np.array ([some list comprehension]) > > mainly because list comprehensions are just so sweet. > > But I imagine this isn't particularly efficient. > Right. Using a generator and np.fromiter() will avoid the creation of the intermediate list. Something like: np.fromiter((i for i in range(x))) # use xrange for Python 2 > > I wonder if numpy has a "better" way, and if not, maybe it would be a nice > addition? > > ___ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > https://mail.scipy.org/mailman/listinfo/numpy-discussion > -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: numexpr 2.6.1 released

= Announcing Numexpr 2.6.1 = What's new == This is a maintenance release that fixes a performance regression in some situations. More specifically, the BLOCK_SIZE1 constant has been set to 1024 (down from 8192). This allows for better cache utilization when there are many operands and with VML. Fixes #221. Also, support for NetBSD has been added. Thanks to Thomas Klausner. In case you want to know more in detail what has changed in this version, see: https://github.com/pydata/numexpr/blob/master/RELEASE_NOTES.rst What's Numexpr == Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like "3*a+4*b") are accelerated and use less memory than doing the same calculation in Python. It wears multi-threaded capabilities, as well as support for Intel's MKL (Math Kernel Library), which allows an extremely fast evaluation of transcendental functions (sin, cos, tan, exp, log...) while squeezing the last drop of performance out of your multi-core processors. Look here for a some benchmarks of numexpr using MKL: https://github.com/pydata/numexpr/wiki/NumexprMKL Its only dependency is NumPy (MKL is optional), so it works well as an easy-to-deploy, easy-to-use, computational engine for projects that don't want to adopt other solutions requiring more heavy dependencies. Where I can find Numexpr? = The project is hosted at GitHub in: https://github.com/pydata/numexpr You can get the packages from PyPI as well (but not for RC releases): http://pypi.python.org/pypi/numexpr Share your experience = Let us know of any bugs, suggestions, gripes, kudos, etc. you may have. Enjoy data! -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: numexpr 2.6.0 released

= Announcing Numexpr 2.6.0 = Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like "3*a+4*b") are accelerated and use less memory than doing the same calculation in Python. It wears multi-threaded capabilities, as well as support for Intel's MKL (Math Kernel Library), which allows an extremely fast evaluation of transcendental functions (sin, cos, tan, exp, log...) while squeezing the last drop of performance out of your multi-core processors. Look here for a some benchmarks of numexpr using MKL: https://github.com/pydata/numexpr/wiki/NumexprMKL Its only dependency is NumPy (MKL is optional), so it works well as an easy-to-deploy, easy-to-use, computational engine for projects that don't want to adopt other solutions requiring more heavy dependencies. What's new == This is a minor version bump because it introduces a new function. Also some minor fine tuning for recent CPUs has been done. More specifically: - Introduced a new re_evaluate() function for re-evaluating the previous executed array expression without any check. This is meant for accelerating loops that are re-evaluating the same expression repeatedly without changing anything else than the operands. If unsure, use evaluate() which is safer. - The BLOCK_SIZE1 and BLOCK_SIZE2 constants have been re-checked in order to find a value maximizing most of the benchmarks in bench/ directory. The new values (8192 and 16 respectively) give somewhat better results (~5%) overall. The CPU used for fine tuning is a relatively new Haswell processor (E3-1240 v3). In case you want to know more in detail what has changed in this version, see: https://github.com/pydata/numexpr/blob/master/RELEASE_NOTES.rst Where I can find Numexpr? = The project is hosted at GitHub in: https://github.com/pydata/numexpr You can get the packages from PyPI as well (but not for RC releases): http://pypi.python.org/pypi/numexpr Share your experience = Let us know of any bugs, suggestions, gripes, kudos, etc. you may have. Enjoy data! -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] Calling C code that assumes SIMD aligned data.

2016-05-05 22:10 GMT+02:00 Øystein Schønning-Johansen <oyste...@gmail.com>: > Thanks for your answer, Francesc. Knowing that there is no numpy solution > saves the work of searching for this. I've not tried the solution described > at SO, but it looks like a real performance killer. I'll rather try to > override malloc with glibs malloc_hooks or LD_PRELOAD tricks. Do you think > that will do it? I'll try it and report back. > I don't think you need that much weaponry. Just create an array with some spare space for alignment. Realize that you want a 64-byte aligned double precision array. With that, create your desired array + 64 additional bytes (8 doubles): In [92]: a = np.zeros(int(1e6) + 8) In [93]: a.ctypes.data % 64 Out[93]: 16 and compute the elements to shift this: In [94]: shift = (64 / a.itemsize) - (a.ctypes.data % 64) / a.itemsize In [95]: shift Out[95]: 6 now, create a view with the required elements less: In [98]: b = a[shift:-((64 / a.itemsize)-shift)] In [99]: len(b) Out[99]: 100 In [100]: b.ctypes.data % 64 Out[100]: 0 and voila, b is now aligned to 64 bytes. As the view is a copy-free operation, this is fast, and you only wasted 64 bytes. Pretty cheap indeed. Francesc > > Thanks, > -Øystein > > On Thu, May 5, 2016 at 1:55 PM, Francesc Alted <fal...@gmail.com> wrote: > >> 2016-05-05 11:38 GMT+02:00 Øystein Schønning-Johansen <oyste...@gmail.com >> >: >> >>> Hi! >>> >>> I've written a little code of numpy code that does a neural network >>> feedforward calculation: >>> >>> def feedforward(self,x): >>> for activation, w, b in zip( self.activations, self.weights, >>> self.biases ): >>> x = activation( np.dot(w, x) + b) >>> >>> This works fine when my activation functions are in Python, however I've >>> wrapped the activation functions from a C implementation that requires the >>> array to be memory aligned. (due to simd instructions in the C >>> implementation.) So I need the operation np.dot( w, x) + b to return a >>> ndarray where the data pointer is aligned. How can I do that? Is it >>> possible at all? >>> >> >> Yes. np.dot() does accept an `out` parameter where you can pass your >> aligned array. The way for testing if numpy is returning you an aligned >> array is easy: >> >> In [15]: x = np.arange(6).reshape(2,3) >> >> In [16]: x.ctypes.data % 16 >> Out[16]: 0 >> >> but: >> >> In [17]: x.ctypes.data % 32 >> Out[17]: 16 >> >> so, in this case NumPy returned a 16-byte aligned array which should be >> enough for 128 bit SIMD (SSE family). This kind of alignment is pretty >> common in modern computers. If you need 256 bit (32-byte) alignment then >> you will need to build your container manually. See here for an example: >> http://stackoverflow.com/questions/9895787/memory-alignment-for-fast-fft-in-python-using-shared-arrrays >> >> Francesc >> >> >>> >>> (BTW: the function works correctly about 20% of the time I run it, and >>> else it segfaults on the simd instruction in the the C function) >>> >>> Thanks, >>> -Øystein >>> >>> ___ >>> NumPy-Discussion mailing list >>> NumPy-Discussion@scipy.org >>> https://mail.scipy.org/mailman/listinfo/numpy-discussion >>> >>> >> >> >> -- >> Francesc Alted >> >> ___ >> NumPy-Discussion mailing list >> NumPy-Discussion@scipy.org >> https://mail.scipy.org/mailman/listinfo/numpy-discussion >> >> > > ___ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > https://mail.scipy.org/mailman/listinfo/numpy-discussion > > -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] Calling C code that assumes SIMD aligned data.

2016-05-05 11:38 GMT+02:00 Øystein Schønning-Johansen <oyste...@gmail.com>: > Hi! > > I've written a little code of numpy code that does a neural network > feedforward calculation: > > def feedforward(self,x): > for activation, w, b in zip( self.activations, self.weights, > self.biases ): > x = activation( np.dot(w, x) + b) > > This works fine when my activation functions are in Python, however I've > wrapped the activation functions from a C implementation that requires the > array to be memory aligned. (due to simd instructions in the C > implementation.) So I need the operation np.dot( w, x) + b to return a > ndarray where the data pointer is aligned. How can I do that? Is it > possible at all? > Yes. np.dot() does accept an `out` parameter where you can pass your aligned array. The way for testing if numpy is returning you an aligned array is easy: In [15]: x = np.arange(6).reshape(2,3) In [16]: x.ctypes.data % 16 Out[16]: 0 but: In [17]: x.ctypes.data % 32 Out[17]: 16 so, in this case NumPy returned a 16-byte aligned array which should be enough for 128 bit SIMD (SSE family). This kind of alignment is pretty common in modern computers. If you need 256 bit (32-byte) alignment then you will need to build your container manually. See here for an example: http://stackoverflow.com/questions/9895787/memory-alignment-for-fast-fft-in-python-using-shared-arrrays Francesc > > (BTW: the function works correctly about 20% of the time I run it, and > else it segfaults on the simd instruction in the the C function) > > Thanks, > -Øystein > > ___ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > https://mail.scipy.org/mailman/listinfo/numpy-discussion > > -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: bcolz 1.0.0 (final) released

= Announcing bcolz 1.0.0 final = What's new == Yeah, 1.0.0 is finally here. We are not introducing any exciting new feature (just some optimizations and bug fixes), but bcolz is already 6 years old and it implements most of the capabilities that it was designed for, so I decided to release a 1.0.0 meaning that the format is declared stable and that people can be assured that future bcolz releases will be able to read bcolz 1.0 data files (and probably much earlier ones too) for a long while. Such a format is fully described at: https://github.com/Blosc/bcolz/blob/master/DISK_FORMAT_v1.rst Also, a 1.0.0 release means that bcolz 1.x series will be based on C-Blosc 1.x series (https://github.com/Blosc/c-blosc). After C-Blosc 2.x (https://github.com/Blosc/c-blosc2) would be out, a new bcolz 2.x is expected taking advantage of shiny new features of C-Blosc2 (more compressors, more filters, native variable length support and the concept of super-chunks), which should be very beneficial for next bcolz generation. Important: this is a final release and there are no important known bugs there, so this is recommended to be used in production. Enjoy! For a more detailed change log, see: https://github.com/Blosc/bcolz/blob/master/RELEASE_NOTES.rst For some comparison between bcolz and other compressed data containers, see: https://github.com/FrancescAlted/DataContainersTutorials specially chapters 3 (in-memory containers) and 4 (on-disk containers). Also, if it happens that you are in Madrid during this weekend, you can drop by my tutorial and talk: http://pydata.org/madrid2016/schedule/ See you! What it is == *bcolz* provides columnar and compressed data containers that can live either on-disk or in-memory. Column storage allows for efficiently querying tables with a large number of columns. It also allows for cheap addition and removal of column. In addition, bcolz objects are compressed by default for reducing memory/disk I/O needs. The compression process is carried out internally by Blosc, an extremely fast meta-compressor that is optimized for binary data. Lastly, high-performance iterators (like ``iter()``, ``where()``) for querying the objects are provided. bcolz can use numexpr internally so as to accelerate many vector and query operations (although it can use pure NumPy for doing so too). numexpr optimizes the memory usage and use several cores for doing the computations, so it is blazing fast. Moreover, since the carray/ctable containers can be disk-based, and it is possible to use them for seamlessly performing out-of-memory computations. bcolz has minimal dependencies (NumPy), comes with an exhaustive test suite and fully supports both 32-bit and 64-bit platforms. Also, it is typically tested on both UNIX and Windows operating systems. Together, bcolz and the Blosc compressor, are finally fulfilling the promise of accelerating memory I/O, at least for some real scenarios: http://nbviewer.ipython.org/github/Blosc/movielens-bench/blob/master/querying-ep14.ipynb#Plots Other users of bcolz are Visualfabriq (http://www.visualfabriq.com/) , Quantopian (https://www.quantopian.com/) and Scikit-Allel ( https://github.com/cggh/scikit-allel) which you can read more about by pointing your browser at the links below. * Visualfabriq: * *bquery*, A query and aggregation framework for Bcolz: * https://github.com/visualfabriq/bquery * Quantopian: * Using compressed data containers for faster backtesting at scale: * https://quantopian.github.io/talks/NeedForSpeed/slides.html * Scikit-Allel * Provides an alternative backend to work with compressed arrays * https://scikit-allel.readthedocs.org/en/latest/model/bcolz.html Resources = Visit the main bcolz site repository at: http://github.com/Blosc/bcolz Manual: http://bcolz.blosc.org Home of Blosc compressor: http://blosc.org User's mail list: bc...@googlegroups.com http://groups.google.com/group/bcolz License is the new BSD: https://github.com/Blosc/bcolz/blob/master/LICENSES/BCOLZ.txt Release notes can be found in the Git repository: https://github.com/Blosc/bcolz/blob/master/RELEASE_NOTES.rst **Enjoy data!** -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: python-blosc 1.3.1

= Announcing python-blosc 1.3.1 = What is new? This is an important release in terms of stability. Now, the -O1 flag for compiling the included C-Blosc sources on Linux. This represents slower performance, but fixes the nasty issue #110. In case maximum speed is needed, please `compile python-blosc with an external C-Blosc library < https://github.com/Blosc/python-blosc#compiling-with-an-installed-blosc-library-recommended )>`_. Also, symbols like BLOSC_MAX_BUFFERSIZE have been replaced for allowing backward compatibility with python-blosc 1.2.x series. For whetting your appetite, look at some benchmarks here: https://github.com/Blosc/python-blosc#benchmarking For more info, you can have a look at the release notes in: https://github.com/Blosc/python-blosc/blob/master/RELEASE_NOTES.rst More docs and examples are available in the documentation site: http://python-blosc.blosc.org What is it? === Blosc (http://www.blosc.org) is a high performance compressor optimized for binary data. It has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() OS call. Blosc works well for compressing numerical arrays that contains data with relatively low entropy, like sparse data, time series, grids with regular-spaced values, etc. python-blosc (http://python-blosc.blosc.org/) is the Python wrapper for the Blosc compression library, with added functions (`compress_ptr()` and `pack_array()`) for efficiently compressing NumPy arrays, minimizing the number of memory copies during the process. python-blosc can be used to compress in-memory data buffers for transmission to other machines, persistence or just as a compressed cache. There is also a handy tool built on top of python-blosc called Bloscpack (https://github.com/Blosc/bloscpack). It features a commmand line interface that allows you to compress large binary datafiles on-disk. It also comes with a Python API that has built-in support for serializing and deserializing Numpy arrays both on-disk and in-memory at speeds that are competitive with regular Pickle/cPickle machinery. Sources repository == The sources and documentation are managed through github services at: http://github.com/Blosc/python-blosc **Enjoy data!** -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: numexpr 2.5.2 released

= Announcing Numexpr 2.5.2 = Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like "3*a+4*b") are accelerated and use less memory than doing the same calculation in Python. It wears multi-threaded capabilities, as well as support for Intel's MKL (Math Kernel Library), which allows an extremely fast evaluation of transcendental functions (sin, cos, tan, exp, log...) while squeezing the last drop of performance out of your multi-core processors. Look here for a some benchmarks of numexpr using MKL: https://github.com/pydata/numexpr/wiki/NumexprMKL Its only dependency is NumPy (MKL is optional), so it works well as an easy-to-deploy, easy-to-use, computational engine for projects that don't want to adopt other solutions requiring more heavy dependencies. What's new == This is a maintenance release shaking some remaining problems with VML (it is nice to see how Anaconda VML's support helps raising hidden issues). Now conj() and abs() are actually added as VML-powered functions, preventing the same problems than log10() before (PR #212); thanks to Tom Kooij. Upgrading to this release is highly recommended. In case you want to know more in detail what has changed in this version, see: https://github.com/pydata/numexpr/blob/master/RELEASE_NOTES.rst Where I can find Numexpr? = The project is hosted at GitHub in: https://github.com/pydata/numexpr You can get the packages from PyPI as well (but not for RC releases): http://pypi.python.org/pypi/numexpr Share your experience = Let us know of any bugs, suggestions, gripes, kudos, etc. you may have. Enjoy data! -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: bcolz 1.0.0 RC2 is out!

== Announcing bcolz 1.0.0 RC2 == What's new == Yeah, 1.0.0 is finally here. We are not introducing any exciting new feature (just some optimizations and bug fixes), but bcolz is already 6 years old and it implements most of the capabilities that it was designed for, so I decided to release a 1.0.0 meaning that the format is declared stable and that people can be assured that future bcolz releases will be able to read bcolz 1.0 data files (and probably much earlier ones too) for a long while. Such a format is fully described at: https://github.com/Blosc/bcolz/blob/master/DISK_FORMAT_v1.rst Also, a 1.0.0 release means that bcolz 1.x series will be based on C-Blosc 1.x series (https://github.com/Blosc/c-blosc). After C-Blosc 2.x (https://github.com/Blosc/c-blosc2) would be out, a new bcolz 2.x is expected taking advantage of shiny new features of C-Blosc2 (more compressors, more filters, native variable length support and the concept of super-chunks), which should be very beneficial for next bcolz generation. Important: this is a Release Candidate, so please test it as much as you can. If no issues would appear in a week or so, I will proceed to tag and release 1.0.0 final. Enjoy! For a more detailed change log, see: https://github.com/Blosc/bcolz/blob/master/RELEASE_NOTES.rst What it is == *bcolz* provides columnar and compressed data containers that can live either on-disk or in-memory. Column storage allows for efficiently querying tables with a large number of columns. It also allows for cheap addition and removal of column. In addition, bcolz objects are compressed by default for reducing memory/disk I/O needs. The compression process is carried out internally by Blosc, an extremely fast meta-compressor that is optimized for binary data. Lastly, high-performance iterators (like ``iter()``, ``where()``) for querying the objects are provided. bcolz can use numexpr internally so as to accelerate many vector and query operations (although it can use pure NumPy for doing so too). numexpr optimizes the memory usage and use several cores for doing the computations, so it is blazing fast. Moreover, since the carray/ctable containers can be disk-based, and it is possible to use them for seamlessly performing out-of-memory computations. bcolz has minimal dependencies (NumPy), comes with an exhaustive test suite and fully supports both 32-bit and 64-bit platforms. Also, it is typically tested on both UNIX and Windows operating systems. Together, bcolz and the Blosc compressor, are finally fulfilling the promise of accelerating memory I/O, at least for some real scenarios: http://nbviewer.ipython.org/github/Blosc/movielens-bench/blob/master/querying-ep14.ipynb#Plots Other users of bcolz are Visualfabriq (http://www.visualfabriq.com/) the Blaze project (http://blaze.pydata.org/), Quantopian (https://www.quantopian.com/) and Scikit-Allel (https://github.com/cggh/scikit-allel) which you can read more about by pointing your browser at the links below. * Visualfabriq: * *bquery*, A query and aggregation framework for Bcolz: * https://github.com/visualfabriq/bquery * Blaze: * Notebooks showing Blaze + Pandas + BColz interaction: * http://nbviewer.ipython.org/url/blaze.pydata.org/notebooks/timings-csv.ipynb * http://nbviewer.ipython.org/url/blaze.pydata.org/notebooks/timings-bcolz.ipynb * Quantopian: * Using compressed data containers for faster backtesting at scale: * https://quantopian.github.io/talks/NeedForSpeed/slides.html * Scikit-Allel * Provides an alternative backend to work with compressed arrays * https://scikit-allel.readthedocs.org/en/latest/model/bcolz.html Installing == bcolz is in the PyPI repository, so installing it is easy:: $ pip install -U bcolz Resources = Visit the main bcolz site repository at: http://github.com/Blosc/bcolz Manual: http://bcolz.blosc.org Home of Blosc compressor: http://blosc.org User's mail list: bc...@googlegroups.com http://groups.google.com/group/bcolz License is the new BSD: https://github.com/Blosc/bcolz/blob/master/LICENSES/BCOLZ.txt Release notes can be found in the Git repository: https://github.com/Blosc/bcolz/blob/master/RELEASE_NOTES.rst **Enjoy data!** -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: numexpr 2.5.1 released

= Announcing Numexpr 2.5.1 = Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like "3*a+4*b") are accelerated and use less memory than doing the same calculation in Python. It wears multi-threaded capabilities, as well as support for Intel's MKL (Math Kernel Library), which allows an extremely fast evaluation of transcendental functions (sin, cos, tan, exp, log...) while squeezing the last drop of performance out of your multi-core processors. Look here for a some benchmarks of numexpr using MKL: https://github.com/pydata/numexpr/wiki/NumexprMKL Its only dependency is NumPy (MKL is optional), so it works well as an easy-to-deploy, easy-to-use, computational engine for projects that don't want to adopt other solutions requiring more heavy dependencies. What's new == Fixed a critical bug that caused wrong evaluations of log10() and conj(). These produced wrong results when numexpr was compiled with Intel's MKL (which is a popular build since Anaconda ships it by default) and non-contiguous data. This is considered a *critical* bug and upgrading is highly recommended. Thanks to Arne de Laat and Tom Kooij for reporting and providing a test unit. In case you want to know more in detail what has changed in this version, see: https://github.com/pydata/numexpr/blob/master/RELEASE_NOTES.rst Where I can find Numexpr? = The project is hosted at GitHub in: https://github.com/pydata/numexpr You can get the packages from PyPI as well (but not for RC releases): http://pypi.python.org/pypi/numexpr Share your experience = Let us know of any bugs, suggestions, gripes, kudos, etc. you may have. Enjoy data! -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] [ANN] bcolz 1.0.0 RC1 released

== Announcing bcolz 1.0.0 RC1 == What's new == Yeah, 1.0.0 is finally here. We are not introducing any exciting new feature (just some optimizations and bug fixes), but bcolz is already 6 years old and it implements most of the capabilities that it was designed for, so I decided to release a 1.0.0 meaning that the format is declared stable and that people can be assured that future bcolz releases will be able to read bcolz 1.0 data files (and probably much earlier ones too) for a long while. Such a format is fully described at: https://github.com/Blosc/bcolz/blob/master/DISK_FORMAT_v1.rst Also, a 1.0.0 release means that bcolz 1.x series will be based on C-Blosc 1.x series (https://github.com/Blosc/c-blosc). After C-Blosc 2.x (https://github.com/Blosc/c-blosc2) would be out, a new bcolz 2.x is expected taking advantage of shiny new features of C-Blosc2 (more compressors, more filters, native variable length support and the concept of super-chunks), which should be very beneficial for next bcolz generation. Important: this is a Release Candidate, so please test it as much as you can. If no issues would appear in a week or so, I will proceed to tag and release 1.0.0 final. Enjoy! For a more detailed change log, see: https://github.com/Blosc/bcolz/blob/master/RELEASE_NOTES.rst What it is == *bcolz* provides columnar and compressed data containers that can live either on-disk or in-memory. Column storage allows for efficiently querying tables with a large number of columns. It also allows for cheap addition and removal of column. In addition, bcolz objects are compressed by default for reducing memory/disk I/O needs. The compression process is carried out internally by Blosc, an extremely fast meta-compressor that is optimized for binary data. Lastly, high-performance iterators (like ``iter()``, ``where()``) for querying the objects are provided. bcolz can use numexpr internally so as to accelerate many vector and query operations (although it can use pure NumPy for doing so too). numexpr optimizes the memory usage and use several cores for doing the computations, so it is blazing fast. Moreover, since the carray/ctable containers can be disk-based, and it is possible to use them for seamlessly performing out-of-memory computations. bcolz has minimal dependencies (NumPy), comes with an exhaustive test suite and fully supports both 32-bit and 64-bit platforms. Also, it is typically tested on both UNIX and Windows operating systems. Together, bcolz and the Blosc compressor, are finally fulfilling the promise of accelerating memory I/O, at least for some real scenarios: http://nbviewer.ipython.org/github/Blosc/movielens-bench/blob/master/querying-ep14.ipynb#Plots Other users of bcolz are Visualfabriq (http://www.visualfabriq.com/) the Blaze project (http://blaze.pydata.org/), Quantopian (https://www.quantopian.com/) and Scikit-Allel (https://github.com/cggh/scikit-allel) which you can read more about by pointing your browser at the links below. * Visualfabriq: * *bquery*, A query and aggregation framework for Bcolz: * https://github.com/visualfabriq/bquery * Blaze: * Notebooks showing Blaze + Pandas + BColz interaction: * http://nbviewer.ipython.org/url/blaze.pydata.org/notebooks/timings-csv.ipynb * http://nbviewer.ipython.org/url/blaze.pydata.org/notebooks/timings-bcolz.ipynb * Quantopian: * Using compressed data containers for faster backtesting at scale: * https://quantopian.github.io/talks/NeedForSpeed/slides.html * Scikit-Allel * Provides an alternative backend to work with compressed arrays * https://scikit-allel.readthedocs.org/en/latest/model/bcolz.html Installing == bcolz is in the PyPI repository, so installing it is easy:: $ pip install -U bcolz Resources = Visit the main bcolz site repository at: http://github.com/Blosc/bcolz Manual: http://bcolz.blosc.org Home of Blosc compressor: http://blosc.org User's mail list: bc...@googlegroups.com http://groups.google.com/group/bcolz License is the new BSD: https://github.com/Blosc/bcolz/blob/master/LICENSES/BCOLZ.txt Release notes can be found in the Git repository: https://github.com/Blosc/bcolz/blob/master/RELEASE_NOTES.rst **Enjoy data!** -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] Fwd: Numexpr-3.0 proposal

2016-02-16 10:04 GMT+01:00 Robert McLeod <robbmcl...@gmail.com>: > On Mon, Feb 15, 2016 at 10:43 AM, Gregor Thalhammer < > gregor.thalham...@gmail.com> wrote: > >> >> Dear Robert, >> >> thanks for your effort on improving numexpr. Indeed, vectorized math >> libraries (VML) can give a large boost in performance (~5x), except for a >> couple of basic operations (add, mul, div), which current compilers are >> able to vectorize automatically. With recent gcc even more functions are >> vectorized, see https://sourceware.org/glibc/wiki/libmvec But you need >> special flags depending on the platform (SSE, AVX present?), runtime >> detection of processor capabilities would be nice for distributing >> binaries. Some time ago, since I lost access to Intels MKL, I patched >> numexpr to use Accelerate/Veclib on os x, which is preinstalled on each >> mac, see https://github.com/geggo/numexpr.git veclib_support branch. >> >> As you increased the opcode size, I could imagine providing a bit to >> switch (during runtime) between internal functions and vectorized ones, >> that would be handy for tests and benchmarks. >> > > Dear Gregor, > > Your suggestion to separate the opcode signature from the library used to > execute it is very clever. Based on your suggestion, I think that the > natural evolution of the opcodes is to specify them by function signature > and library, using a two-level dict, i.e. > > numexpr.interpreter.opcodes['exp_f8f8f8'][gnu] = some_enum > numexpr.interpreter.opcodes['exp_f8f8f8'][msvc] = some_enum +1 > numexpr.interpreter.opcodes['exp_f8f8f8'][vml] = some_enum + 2 > numexpr.interpreter.opcodes['exp_f8f8f8'][yeppp] = some_enum +3 > Yes, by using a two level dictionary you can access the functions implementing opcodes much faster and hence you can add much more opcodes without too much slow-down. > > I want to procedurally generate opcodes.cpp and interpreter_body.cpp. If > I do it the way you suggested funccodes.hpp and all the many #define's > regarding function codes in the interpreter can hopefully be removed and > hence simplify the overall codebase. One could potentially take it a step > further and plan (optimize) each expression, similar to what FFTW does with > regards to matrix shape. That is, the basic way to control the library > would be with a singleton library argument, i.e.: > > result = ne.evaluate( "A*log(foo**2 / bar**2", lib=vml ) > > However, we could also permit a tuple to be passed in, where each element > of the tuple reflects the library to use for each operation in the AST tree: > > result = ne.evaluate( "A*log(foo**2 / bar**2", lib=(gnu,gnu,gnu,yeppp,gnu) > ) > > In this case the ops are (mul,mul,div,log,mul). The op-code picking is > done by the Python side, and this tuple could be potentially optimized by > numexpr rather than hand-optimized, by trying various permutations of the > linked C math libraries. The wisdom from the planning could be pickled and > saved in a wisdom file. Currently Numexpr has cacheDict in util.py but > there's no reason this can't be pickled and saved to disk. I've done a > similar thing by creating wrappers for PyFFTW already. > I like the idea of various permutations of linked C math libraries to be probed by numexpr during the initial iteration and then cached somehow. That will probably require run-time detection of available C math libraries (think that a numexpr binary will be able to run on different machines with different libraries and computing capabilities), but in exchange, it will allow for the fastest execution paths independently of the machine that runs the code. -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: numexpr 2.5

= Announcing Numexpr 2.5 = Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like "3*a+4*b") are accelerated and use less memory than doing the same calculation in Python. It wears multi-threaded capabilities, as well as support for Intel's MKL (Math Kernel Library), which allows an extremely fast evaluation of transcendental functions (sin, cos, tan, exp, log...) while squeezing the last drop of performance out of your multi-core processors. Look here for a some benchmarks of numexpr using MKL: https://github.com/pydata/numexpr/wiki/NumexprMKL Its only dependency is NumPy (MKL is optional), so it works well as an easy-to-deploy, easy-to-use, computational engine for projects that don't want to adopt other solutions requiring more heavy dependencies. What's new == In this version, a lock has been added so that numexpr can be called from multithreaded apps. Mind that this does not prevent numexpr to use multiple cores internally. Also, a new min() and max() functions have been added. Thanks to contributors! In case you want to know more in detail what has changed in this version, see: https://github.com/pydata/numexpr/blob/master/RELEASE_NOTES.rst Where I can find Numexpr? = The project is hosted at GitHub in: https://github.com/pydata/numexpr You can get the packages from PyPI as well (but not for RC releases): http://pypi.python.org/pypi/numexpr Share your experience = Let us know of any bugs, suggestions, gripes, kudos, etc. you may have. Enjoy data! -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

you not only storage, but processing time too. Francesc 2016-01-14 11:19 GMT+01:00 Nathaniel Smith <n...@pobox.com>: > I'd try storing the data in hdf5 (probably via h5py, which is a more > basic interface without all the bells-and-whistles that pytables > adds), though any method you use is going to be limited by the need to > do a seek before each read. Storing the data on SSD will probably help > a lot if you can afford it for your data size. > > On Thu, Jan 14, 2016 at 1:15 AM, Ryan R. Rosario <r...@bytemining.com> > wrote: > > Hi, > > > > I have a very large dictionary that must be shared across processes and > does not fit in RAM. I need access to this object to be fast. The key is an > integer ID and the value is a list containing two elements, both of them > numpy arrays (one has ints, the other has floats). The key is sequential, > starts at 0, and there are no gaps, so the “outer” layer of this data > structure could really just be a list with the key actually being the > index. The lengths of each pair of arrays may differ across keys. > > > > For a visual: > > > > { > > key=0: > > [ > > numpy.array([1,8,15,…, 16000]), > > numpy.array([0.1,0.1,0.1,…,0.1]) > > ], > > key=1: > > [ > > numpy.array([5,6]), > > numpy.array([0.5,0.5]) > > ], > > … > > } > > > > I’ve tried: > > - manager proxy objects, but the object was so big that low-level > code threw an exception due to format and monkey-patching wasn’t successful. > > - Redis, which was far too slow due to setting up connections and > data conversion etc. > > - Numpy rec arrays + memory mapping, but there is a restriction > that the numpy arrays in each “column” must be of fixed and same size. > > - I looked at PyTables, which may be a solution, but seems to have > a very steep learning curve. > > - I haven’t tried SQLite3, but I am worried about the time it > takes to query the DB for a sequential ID, and then translate byte arrays. > > > > Any ideas? I greatly appreciate any guidance you can provide. > > > > Thanks, > > Ryan > > ___ > > NumPy-Discussion mailing list > > NumPy-Discussion@scipy.org > > https://mail.scipy.org/mailman/listinfo/numpy-discussion > > > > -- > Nathaniel J. Smith -- http://vorpus.org > ___ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > https://mail.scipy.org/mailman/listinfo/numpy-discussion > -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] performance solving system of equations in numpy and MATLAB

2015-12-17 12:00 GMT+01:00 Daπid <davidmen...@gmail.com>: > On 16 December 2015 at 18:59, Francesc Alted <fal...@gmail.com> wrote: > >> Probably MATLAB is shipping with Intel MKL enabled, which probably is the >> fastest LAPACK implementation out there. NumPy supports linking with MKL, >> and actually Anaconda does that by default, so switching to Anaconda would >> be a good option for you. > > > A free alternative is OpenBLAS. I am getting 20 s in an i7 Haswell with 8 > cores. > Pretty good. I did not know that OpenBLAS was so close in performance to MKL. -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] performance solving system of equations in numpy and MATLAB

Sorry, I have to correct myself, as per: http://docs.continuum.io/mkl-optimizations/index it seems that Anaconda is not linking with MKL by default (I thought that was the case before?). After installing MKL (conda install mkl), I am getting: In [1]: import numpy as np Vendor: Continuum Analytics, Inc. Package: mkl Message: trial mode expires in 30 days In [2]: testA = np.random.randn(15000, 15000) In [3]: testb = np.random.randn(15000) In [4]: %time testx = np.linalg.solve(testA, testb) CPU times: user 1min, sys: 468 ms, total: 1min 1s Wall time: 15.3 s so, it looks like you will need to buy a MKL license separately (which makes sense for a commercial product). Sorry for the confusion. Francesc 2015-12-16 18:59 GMT+01:00 Francesc Alted <fal...@gmail.com>: > Hi, > > Probably MATLAB is shipping with Intel MKL enabled, which probably is the > fastest LAPACK implementation out there. NumPy supports linking with MKL, > and actually Anaconda does that by default, so switching to Anaconda would > be a good option for you. > > Here you have what I am getting with Anaconda's NumPy and a machine with 8 > cores: > > In [1]: import numpy as np > > In [2]: testA = np.random.randn(15000, 15000) > > In [3]: testb = np.random.randn(15000) > > In [4]: %time testx = np.linalg.solve(testA, testb) > CPU times: user 5min 36s, sys: 4.94 s, total: 5min 41s > Wall time: 46.1 s > > This is not 20 sec, but it is not 3 min either (but of course that depends > on your machine). > > Francesc > > 2015-12-16 18:34 GMT+01:00 Edward Richards <edwardlricha...@gmail.com>: > >> I recently did a conceptual experiment to estimate the computational time >> required to solve an exact expression in contrast to an approximate >> solution (Helmholtz vs. Helmholtz-Kirchhoff integrals). The exact solution >> requires a matrix inversion, and in my case the matrix would contain ~15000 >> rows. >> >> On my machine MATLAB seems to perform this matrix inversion with random >> matrices about 9x faster (20 sec vs 3 mins). I thought the performance >> would be roughly the same because I presume both rely on the same LAPACK >> solvers. >> >> I will not actually need to solve this problem (even at 20 sec it is >> prohibitive for broadband simulation), but if I needed to I would >> reluctantly choose MATLAB . I am simply wondering why there is this >> performance gap, and if there is a better way to solve this problem in >> numpy? >> >> Thank you, >> >> Ned >> >> #Python version >> >> import numpy as np >> >> testA = np.random.randn(15000, 15000) >> >> testb = np.random.randn(15000) >> >> %time testx = np.linalg.solve(testA, testb) >> >> %MATLAB version >> >> testA = randn(15000); >> >> testb = randn(15000, 1); >> tic(); testx = testA \ testb; toc(); >> >> ___ >> NumPy-Discussion mailing list >> NumPy-Discussion@scipy.org >> https://mail.scipy.org/mailman/listinfo/numpy-discussion >> >> > > > -- > Francesc Alted > -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] performance solving system of equations in numpy and MATLAB

Hi, Probably MATLAB is shipping with Intel MKL enabled, which probably is the fastest LAPACK implementation out there. NumPy supports linking with MKL, and actually Anaconda does that by default, so switching to Anaconda would be a good option for you. Here you have what I am getting with Anaconda's NumPy and a machine with 8 cores: In [1]: import numpy as np In [2]: testA = np.random.randn(15000, 15000) In [3]: testb = np.random.randn(15000) In [4]: %time testx = np.linalg.solve(testA, testb) CPU times: user 5min 36s, sys: 4.94 s, total: 5min 41s Wall time: 46.1 s This is not 20 sec, but it is not 3 min either (but of course that depends on your machine). Francesc 2015-12-16 18:34 GMT+01:00 Edward Richards <edwardlricha...@gmail.com>: > I recently did a conceptual experiment to estimate the computational time > required to solve an exact expression in contrast to an approximate > solution (Helmholtz vs. Helmholtz-Kirchhoff integrals). The exact solution > requires a matrix inversion, and in my case the matrix would contain ~15000 > rows. > > On my machine MATLAB seems to perform this matrix inversion with random > matrices about 9x faster (20 sec vs 3 mins). I thought the performance > would be roughly the same because I presume both rely on the same LAPACK > solvers. > > I will not actually need to solve this problem (even at 20 sec it is > prohibitive for broadband simulation), but if I needed to I would > reluctantly choose MATLAB . I am simply wondering why there is this > performance gap, and if there is a better way to solve this problem in > numpy? > > Thank you, > > Ned > > #Python version > > import numpy as np > > testA = np.random.randn(15000, 15000) > > testb = np.random.randn(15000) > > %time testx = np.linalg.solve(testA, testb) > > %MATLAB version > > testA = randn(15000); > > testb = randn(15000, 1); > tic(); testx = testA \ testb; toc(); > > ___ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > https://mail.scipy.org/mailman/listinfo/numpy-discussion > > -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: bcolz 0.12.0 released

=== Announcing bcolz 0.12.0 === What's new == This release copes with some compatibility issues with NumPy 1.10. Also, several improvements have happened in the installation procedure, allowing for a smoother process. Last but not least, the tutorials haven been migrated to the IPython notebook format (a huge thank you to Francesc Elies for this!). This will hopefully will allow users to better exercise the different features of bcolz. For a more detailed change log, see: https://github.com/Blosc/bcolz/blob/master/RELEASE_NOTES.rst What it is == *bcolz* provides columnar and compressed data containers that can live either on-disk or in-memory. Column storage allows for efficiently querying tables with a large number of columns. It also allows for cheap addition and removal of column. In addition, bcolz objects are compressed by default for reducing memory/disk I/O needs. The compression process is carried out internally by Blosc, an extremely fast meta-compressor that is optimized for binary data. Lastly, high-performance iterators (like ``iter()``, ``where()``) for querying the objects are provided. bcolz can use numexpr internally so as to accelerate many vector and query operations (although it can use pure NumPy for doing so too). numexpr optimizes the memory usage and use several cores for doing the computations, so it is blazing fast. Moreover, since the carray/ctable containers can be disk-based, and it is possible to use them for seamlessly performing out-of-memory computations. bcolz has minimal dependencies (NumPy), comes with an exhaustive test suite and fully supports both 32-bit and 64-bit platforms. Also, it is typically tested on both UNIX and Windows operating systems. Together, bcolz and the Blosc compressor, are finally fulfilling the promise of accelerating memory I/O, at least for some real scenarios: http://nbviewer.ipython.org/github/Blosc/movielens-bench/blob/master/querying-ep14.ipynb#Plots Other users of bcolz are Visualfabriq (http://www.visualfabriq.com/) the Blaze project (http://blaze.pydata.org/), Quantopian (https://www.quantopian.com/) and Scikit-Allel (https://github.com/cggh/scikit-allel) which you can read more about by pointing your browser at the links below. * Visualfabriq: * *bquery*, A query and aggregation framework for Bcolz: * https://github.com/visualfabriq/bquery * Blaze: * Notebooks showing Blaze + Pandas + BColz interaction: * http://nbviewer.ipython.org/url/blaze.pydata.org/notebooks/timings-csv.ipynb * http://nbviewer.ipython.org/url/blaze.pydata.org/notebooks/timings-bcolz.ipynb * Quantopian: * Using compressed data containers for faster backtesting at scale: * https://quantopian.github.io/talks/NeedForSpeed/slides.html * Scikit-Allel * Provides an alternative backend to work with compressed arrays * https://scikit-allel.readthedocs.org/en/latest/model/bcolz.html Installing == bcolz is in the PyPI repository, so installing it is easy:: $ pip install -U bcolz Resources = Visit the main bcolz site repository at: http://github.com/Blosc/bcolz Manual: http://bcolz.blosc.org Home of Blosc compressor: http://blosc.org User's mail list: bc...@googlegroups.com http://groups.google.com/group/bcolz License is the new BSD: https://github.com/Blosc/bcolz/blob/master/LICENSES/BCOLZ.txt Release notes can be found in the Git repository: https://github.com/Blosc/bcolz/blob/master/RELEASE_NOTES.rst **Enjoy data!** -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: numexpr 2.4.6 released

Hi, This is a quick release fixing some reported problems in the 2.4.5 version that I announced a few hours ago. Hope I have fixed the main issues now. Now, the official announcement: = Announcing Numexpr 2.4.6 = Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like "3*a+4*b") are accelerated and use less memory than doing the same calculation in Python. It wears multi-threaded capabilities, as well as support for Intel's MKL (Math Kernel Library), which allows an extremely fast evaluation of transcendental functions (sin, cos, tan, exp, log...) while squeezing the last drop of performance out of your multi-core processors. Look here for a some benchmarks of numexpr using MKL: https://github.com/pydata/numexpr/wiki/NumexprMKL Its only dependency is NumPy (MKL is optional), so it works well as an easy-to-deploy, easy-to-use, computational engine for projects that don't want to adopt other solutions requiring more heavy dependencies. What's new == This is a quick maintenance version that offers better handling of MSVC symbols (#168, Francesc Alted), as well as fising some UserWarnings in Solaris (#189, Graham Jones). In case you want to know more in detail what has changed in this version, see: https://github.com/pydata/numexpr/blob/master/RELEASE_NOTES.rst Where I can find Numexpr? = The project is hosted at GitHub in: https://github.com/pydata/numexpr You can get the packages from PyPI as well (but not for RC releases): http://pypi.python.org/pypi/numexpr Share your experience = Let us know of any bugs, suggestions, gripes, kudos, etc. you may have. Enjoy data! -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: numexpr 2.4.5 released

= Announcing Numexpr 2.4.5 = Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like "3*a+4*b") are accelerated and use less memory than doing the same calculation in Python. It wears multi-threaded capabilities, as well as support for Intel's MKL (Math Kernel Library), which allows an extremely fast evaluation of transcendental functions (sin, cos, tan, exp, log...) while squeezing the last drop of performance out of your multi-core processors. Look here for a some benchmarks of numexpr using MKL: https://github.com/pydata/numexpr/wiki/NumexprMKL Its only dependency is NumPy (MKL is optional), so it works well as an easy-to-deploy, easy-to-use, computational engine for projects that don't want to adopt other solutions requiring more heavy dependencies. What's new == This is a maintenance release where an important bug in multithreading code has been fixed (#185 Benedikt Reinartz, Francesc Alted). Also, many harmless warnings (overflow/underflow, divide by zero and others) in the test suite have been silenced (#183, Francesc Alted). In case you want to know more in detail what has changed in this version, see: https://github.com/pydata/numexpr/blob/master/RELEASE_NOTES.rst Where I can find Numexpr? = The project is hosted at GitHub in: https://github.com/pydata/numexpr You can get the packages from PyPI as well (but not for RC releases): http://pypi.python.org/pypi/numexpr Share your experience = Let us know of any bugs, suggestions, gripes, kudos, etc. you may have. Enjoy data! -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: bcolz 0.11.3 released!

=== Announcing bcolz 0.11.3 === What's new == Implemented new feature (#255): bcolz.zeros() can create new ctables too, either empty or filled with zeros. (#256 @FrancescElies @FrancescAlted). Also, in previous, non announced versions (0.11.1 and 0.11.2), new dependencies were added and other fixes are there too. For a more detailed change log, see: https://github.com/Blosc/bcolz/blob/master/RELEASE_NOTES.rst What it is == *bcolz* provides columnar and compressed data containers that can live either on-disk or in-memory. Column storage allows for efficiently querying tables with a large number of columns. It also allows for cheap addition and removal of column. In addition, bcolz objects are compressed by default for reducing memory/disk I/O needs. The compression process is carried out internally by Blosc, an extremely fast meta-compressor that is optimized for binary data. Lastly, high-performance iterators (like ``iter()``, ``where()``) for querying the objects are provided. bcolz can use numexpr internally so as to accelerate many vector and query operations (although it can use pure NumPy for doing so too). numexpr optimizes the memory usage and use several cores for doing the computations, so it is blazing fast. Moreover, since the carray/ctable containers can be disk-based, and it is possible to use them for seamlessly performing out-of-memory computations. bcolz has minimal dependencies (NumPy), comes with an exhaustive test suite and fully supports both 32-bit and 64-bit platforms. Also, it is typically tested on both UNIX and Windows operating systems. Together, bcolz and the Blosc compressor, are finally fulfilling the promise of accelerating memory I/O, at least for some real scenarios: http://nbviewer.ipython.org/github/Blosc/movielens-bench/blob/master/querying-ep14.ipynb#Plots Other users of bcolz are Visualfabriq (http://www.visualfabriq.com/) the Blaze project (http://blaze.pydata.org/), Quantopian (https://www.quantopian.com/) and Scikit-Allel (https://github.com/cggh/scikit-allel) which you can read more about by pointing your browser at the links below. * Visualfabriq: * *bquery*, A query and aggregation framework for Bcolz: * https://github.com/visualfabriq/bquery * Blaze: * Notebooks showing Blaze + Pandas + BColz interaction: * http://nbviewer.ipython.org/url/blaze.pydata.org/notebooks/timings-csv.ipynb * http://nbviewer.ipython.org/url/blaze.pydata.org/notebooks/timings-bcolz.ipynb * Quantopian: * Using compressed data containers for faster backtesting at scale: * https://quantopian.github.io/talks/NeedForSpeed/slides.html * Scikit-Allel * Provides an alternative backend to work with compressed arrays * https://scikit-allel.readthedocs.org/en/latest/model/bcolz.html Installing == bcolz is in the PyPI repository, so installing it is easy:: $ pip install -U bcolz Resources = Visit the main bcolz site repository at: http://github.com/Blosc/bcolz Manual: http://bcolz.blosc.org Home of Blosc compressor: http://blosc.org User's mail list: bc...@googlegroups.com http://groups.google.com/group/bcolz License is the new BSD: https://github.com/Blosc/bcolz/blob/master/LICENSES/BCOLZ.txt Release notes can be found in the Git repository: https://github.com/Blosc/bcolz/blob/master/RELEASE_NOTES.rst **Enjoy data!** -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] Governance model request

, famous and powerful as Travis may be, > he's still our colleague, a member of our community, and *a human being*, > so let's remember that as well... > > > 2. Conflicts of interest are a fact of life, in fact, I would argue that > every healthy and sufficiently interconnected community eventually *should* > have conflicts of interest. They are a sign that there is activity across > multiple centers of interest, and individuals with connections in multiple > areas of the community. And we *want* folks who are engaged enough > precisely to have such interests! > > For conflict of interest management, we don't need to reinvent the wheel, > this is actually something where our beloved institutions, blessed be their > bureaucratic souls, have tons of training materials that happen to be not > completely useless. Most universities and the national labs have > information on COIs that provides guidelines, and Numpy could include in > its governance model more explicit language about COIs if desired. > > So, the issue is not to view COIs as something evil or undesirable, but > rather as the very real consequence of operating in an interconnected set > of institutions. And once you take that stance, you deal with that > rationally and realistically. > > For example, you accept that companies aren't the only ones with potential > COIs: *all* entities have them. As Ryan May aptly pointed out, the notion > that academic institutions are somehow immune to hidden agendas or other > interests is naive at best... And I say that as someone who has happily > stayed in academia, resisting multiple overtures from industry over the > years, but not out of some quaint notion that academia is a pristine haven > of principled purity. Quite the opposite: in building large and complex > projects, I've seen painfully close how the university/government research > world has its own flavor of the same power, financial and political > ugliness that we attribute to the commercial side. > > > 3. Commercial actors. Following up on the last paragraph, we should > accept that *all* institutions have agendas, not just companies. We live > in a world with companies, and I think it's naive to take a knee-jerk > anti-commercial stance: our community has had a productive and successful > history of interaction with industry in the past, and hopefully that will > continue in the future. > > What is true, however, is that community projects should maintain the > "seat of power" in the community, and *not* in any single company. In > fact, this is important even to ensure that many companies feel comfortable > engaging the projects, precisely so they know that the technology is driven > in an open and neutral way even if some of their competitors participate. > > That's why a governance model that is anchored in neutral ground is so > important. We've worked hard to make Numfocus the legal entity that can > play that role (that's why it's a 501(c)3), and that's why we've framed our > governance model for Jupyter in a way that makes all the institutions > (including Berkeley and Cal Poly) simply 'partners' that contribute by > virtue of supporting employees. But the owners of the decisions are the > *individuals* who do the work and form the community, not the > companies/institutions. > > > If we accept these premises, then hopefully we can have a rational > conversation about how to build a community, where at any point in time, > any of us should be judged on the merit of our actions, not the > hypotheticals of our intentions or our affiliations (commercial, > government, academic, etc). > > > Sorry for the long wall of text, I rarely post on this list anymore. But > I was saddened to see the turn of this thread, and I hope I can contribute > some perspective (and not make things worse :) > > > Cheers, > > -- > Fernando Perez (@fperez_org; http://fperez.org) > fperez.net-at-gmail: mailing lists only (I ignore this when swamped!) > fernando.perez-at-berkeley: contact me here for any direct mail > > ___ > NumPy-Discussion mailing list > NumPy-Discussion@scipy.org > https://mail.scipy.org/mailman/listinfo/numpy-discussion > > -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: Numexpr 2.4.4 is out

= Announcing Numexpr 2.4.4 = Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like "3*a+4*b") are accelerated and use less memory than doing the same calculation in Python. It wears multi-threaded capabilities, as well as support for Intel's MKL (Math Kernel Library), which allows an extremely fast evaluation of transcendental functions (sin, cos, tan, exp, log...) while squeezing the last drop of performance out of your multi-core processors. Look here for a some benchmarks of numexpr using MKL: https://github.com/pydata/numexpr/wiki/NumexprMKL Its only dependency is NumPy (MKL is optional), so it works well as an easy-to-deploy, easy-to-use, computational engine for projects that don't want to adopt other solutions requiring more heavy dependencies. What's new == This is a maintenance release which contains several bug fixes, like better testing on Python3 platform and some harmless data race. Among the enhancements, AppVeyor support is here and OMP_NUM_THREADS is honored as a fallback in case NUMEXPR_NUM_THREADS is not set. In case you want to know more in detail what has changed in this version, see: https://github.com/pydata/numexpr/blob/master/RELEASE_NOTES.rst Where I can find Numexpr? = The project is hosted at GitHub in: https://github.com/pydata/numexpr You can get the packages from PyPI as well (but not for RC releases): http://pypi.python.org/pypi/numexpr Share your experience = Let us know of any bugs, suggestions, gripes, kudos, etc. you may have. Enjoy data! -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: python-blosc 1.2.8 released

= Announcing python-blosc 1.2.8 = What is new? This is a maintenance release. Internal C-Blosc has been upgraded to 1.7.0 (although new bitshuffle support has not been made public, as it seems not ready for production yet). Also, there is support for bytes-like objects that support the buffer interface as input to ``compress`` and ``decompress``. On Python 2.x this includes unicode, on Python 3.x it doesn't. Thanks to Valentin Haenel. Finally, a memory leak in ``decompress```has been hunted and fixed. And new tests have been added to catch possible leaks in the future. Thanks to Santi Villalba. For more info, you can have a look at the release notes in: https://github.com/Blosc/python-blosc/blob/master/RELEASE_NOTES.rst More docs and examples are available in the documentation site: http://python-blosc.blosc.org What is it? === Blosc (http://www.blosc.org) is a high performance compressor optimized for binary data. It has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() OS call. Blosc is the first compressor that is meant not only to reduce the size of large datasets on-disk or in-memory, but also to accelerate object manipulations that are memory-bound (http://www.blosc.org/docs/StarvingCPUs.pdf). See http://www.blosc.org/synthetic-benchmarks.html for some benchmarks on how much speed it can achieve in some datasets. Blosc works well for compressing numerical arrays that contains data with relatively low entropy, like sparse data, time series, grids with regular-spaced values, etc. python-blosc (http://python-blosc.blosc.org/) is the Python wrapper for the Blosc compression library. There is also a handy tool built on Blosc called Bloscpack (https://github.com/Blosc/bloscpack). It features a commmand line interface that allows you to compress large binary datafiles on-disk. It also comes with a Python API that has built-in support for serializing and deserializing Numpy arrays both on-disk and in-memory at speeds that are competitive with regular Pickle/cPickle machinery. Installing == python-blosc is in PyPI repository, so installing it is easy: $ pip install -U blosc # yes, you must omit the 'python-' prefix Download sources The sources are managed through github services at: http://github.com/Blosc/python-blosc Documentation = There is Sphinx-based documentation site at: http://python-blosc.blosc.org/ Mailing list There is an official mailing list for Blosc at: bl...@googlegroups.com http://groups.google.es/group/blosc Licenses Both Blosc and its Python wrapper are distributed using the MIT license. See: https://github.com/Blosc/python-blosc/blob/master/LICENSES for more details. **Enjoy data!** -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org https://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: bcolz 0.11.0 released

=== Announcing bcolz 0.11.0 === What's new == Although this is mostly a maintenance release that fixes some bugs, the setup.py is entirely based now in setuptools and has been greatly modernized to use a new versioning system. Just this deserves a bump in the minor version. Thanks to Gabi Davar (@mindw) for such a nice improvement. Also, many improvements to the Continuous Integration part (and hence not directly visible to users) and others have been made by Francesc Elies (@FrancescElies). Thanks for his quiet but effective work. And last but not least, I would like to announce that Valentin Haenel (@esc) just stepped down as release manager. Thanks Valentin for all the hard work that you put in making bcolz a better piece of software! What it is == *bcolz* provides columnar and compressed data containers that can live either on-disk or in-memory. Column storage allows for efficiently querying tables with a large number of columns. It also allows for cheap addition and removal of column. In addition, bcolz objects are compressed by default for reducing memory/disk I/O needs. The compression process is carried out internally by Blosc, an extremely fast meta-compressor that is optimized for binary data. Lastly, high-performance iterators (like ``iter()``, ``where()``) for querying the objects are provided. bcolz can use numexpr internally so as to accelerate many vector and query operations (although it can use pure NumPy for doing so too). numexpr optimizes the memory usage and use several cores for doing the computations, so it is blazing fast. Moreover, since the carray/ctable containers can be disk-based, and it is possible to use them for seamlessly performing out-of-memory computations. bcolz has minimal dependencies (NumPy), comes with an exhaustive test suite and fully supports both 32-bit and 64-bit platforms. Also, it is typically tested on both UNIX and Windows operating systems. Together, bcolz and the Blosc compressor, are finally fulfilling the promise of accelerating memory I/O, at least for some real scenarios: http://nbviewer.ipython.org/github/Blosc/movielens-bench/blob/master/querying-ep14.ipynb#Plots Other users of bcolz are Visualfabriq (http://www.visualfabriq.com/) the Blaze project (http://blaze.pydata.org/), Quantopian (https://www.quantopian.com/) and Scikit-Allel (https://github.com/cggh/scikit-allel) which you can read more about by pointing your browser at the links below. * Visualfabriq: * *bquery*, A query and aggregation framework for Bcolz: * https://github.com/visualfabriq/bquery * Blaze: * Notebooks showing Blaze + Pandas + BColz interaction: * http://nbviewer.ipython.org/url/blaze.pydata.org/notebooks/timings-csv.ipynb * http://nbviewer.ipython.org/url/blaze.pydata.org/notebooks/timings-bcolz.ipynb * Quantopian: * Using compressed data containers for faster backtesting at scale: * https://quantopian.github.io/talks/NeedForSpeed/slides.html * Scikit-Allel * Provides an alternative backend to work with compressed arrays * https://scikit-allel.readthedocs.org/en/latest/bcolz.html Installing == bcolz is in the PyPI repository, so installing it is easy:: $ pip install -U bcolz Resources = Visit the main bcolz site repository at: http://github.com/Blosc/bcolz Manual: http://bcolz.blosc.org Home of Blosc compressor: http://blosc.org User's mail list: bc...@googlegroups.com http://groups.google.com/group/bcolz License is the new BSD: https://github.com/Blosc/bcolz/blob/master/LICENSES/BCOLZ.txt Release notes can be found in the Git repository: https://github.com/Blosc/bcolz/blob/master/RELEASE_NOTES.rst **Enjoy data!** -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] Comments on governance proposal (was: Notes from the numpy dev meeting at scipy 2015)

and deeper in (technical) debt. Jaime -- (\__/) ( O.o) ( ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] Notes from the numpy dev meeting at scipy 2015

Hi, Thanks Nathaniel and others for sparking this discussion as I think it is very timely. 2015-08-25 12:03 GMT+02:00 Nathaniel Smith n...@pobox.com: Let's focus on evolving numpy as far as we can without major break-the-world changes (no numpy 2.0, at least in the foreseeable future). And, as a target for that evolution, let's change our focus from numpy as NumPy is the library that gives you the np.ndarray object (plus some attached infrastructure), to NumPy provides the standard framework for working with arrays and array-like objects in Python Sorry to disagree here, but in my opinion NumPy *already* provides the standard framework for working with arrays and array-like objects in Python as its huge popularity shows. If what you mean is that there are too many efforts trying to provide other, specialized data containers (things like DataFrame in pandas, DataArray/Dataset in xarray or carray/ctable in bcolz just to mention a few), then let me say that I am of the opinion that there can't be a silver bullet for tackling all the problems that the PyData community is facing. The libraries using specialized data containers (pandas, xray, bcolz...) may have more or less machinery on top of them so that conversion to NumPy not necessarily happens internally (many times we don't want conversions for efficiency), but it is the capability of producing NumPy arrays out of them (or parts of them) what makes these specialized containers to be incredible more useful to users because they can use NumPy to fill the missing gaps, or just use NumPy as an intermediate container that acts as input for other libraries. On the subject on why I don't think a universal data container is feasible for PyData, you just have to have a look at how many data structures Python is providing in the language itself (tuples, lists, dicts, sets...), and how many are added in the standard library (like those in the collections sub-package). Every data container is designed to do a couple of things (maybe three) well, but for other use cases it is the responsibility of the user to choose the more appropriate depending on her needs. In the same vein, I also think that it makes little sense to try to come with a standard solution that is going to satisfy everyone's need. IMHO, and despite all efforts, neither NumPy, NumPy 2.0, DyND, bcolz or any other is going to offer the universal data container. Instead of that, let me summarize what users/developers like me need from NumPy for continue creating more specialized data containers: 1) Keep NumPy simple. NumPy is the truly cornerstone of PyData right now, and it will be for the foreseeable future, so please keep it usable and *minimal*. Before adding any more feature the increase in complexity should carefully weighted. 2) Make NumPy more flexible. Any rewrite that allows arrays or dtypes to be subclassed and extended more easily will be a huge win. *But* if in order to allow flexibility you have to make NumPy much more complex, then point 1) should prevail. 3) Make of NumPy a sustainable project. Historically NumPy depended on heroic efforts of individuals to make it what it is now: *an industry standard*. But individual efforts, while laudable, are not enough, so please, please, please continue the effort of constituting a governance team that ensures the future of NumPy (and with it, the whole PyData community). Finally, the question on whether NumPy 2.0 or projects like DyND should be chosen instead for implementing new features is still legitimate, and while I have my own opinions (favourable to DyND), I still see (such is the price of technological debt) a distant future where we will find NumPy as we know it, allowing more innovation to happen in Python Data space. Again, thanks to all those braves that are allowing others to build on top of NumPy's shoulders. -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] UTC-based datetime64

Hi, We've found that NumPy uses the local TZ for printing datetime64 timestamps: In [22]: t = datetime.utcnow() In [23]: print t 2015-08-26 11:52:10.662745 In [24]: np.array([t], dtype=datetime64[s]) Out[24]: array(['2015-08-26T13:52:10+0200'], dtype='datetime64[s]') Googling for a way to print UTC out of the box, the best thing I could find is: In [40]: [str(i.item()) for i in np.array([t], dtype=datetime64[s])] Out[40]: ['2015-08-26 11:52:10'] Now, is there a better way to specify that I want the datetimes printed always in UTC? Thanks, -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] Question about unaligned access

2015-07-06 18:04 GMT+02:00 Jaime Fernández del Río jaime.f...@gmail.com: On Mon, Jul 6, 2015 at 10:18 AM, Francesc Alted fal...@gmail.com wrote: Hi, I have stumbled into this: In [62]: sa = np.fromiter(((i,i) for i in range(1000*1000)), dtype=[('f0', np.int64), ('f1', np.int32)]) In [63]: %timeit sa['f0'].sum() 100 loops, best of 3: 4.52 ms per loop In [64]: sa = np.fromiter(((i,i) for i in range(1000*1000)), dtype=[('f0', np.int64), ('f1', np.int64)]) In [65]: %timeit sa['f0'].sum() 1000 loops, best of 3: 896 µs per loop The first structured array is made of 12-byte records, while the second is made by 16-byte records, but the latter performs 5x faster. Also, using an structured array that is made of 8-byte records is the fastest (expected): In [66]: sa = np.fromiter(((i,) for i in range(1000*1000)), dtype=[('f0', np.int64)]) In [67]: %timeit sa['f0'].sum() 1000 loops, best of 3: 567 µs per loop Now, my laptop has a Ivy Bridge processor (i5-3380M) that should perform quite well on unaligned data: http://lemire.me/blog/archives/2012/05/31/data-alignment-for-speed-myth-or-reality/ So, if 4 years-old Intel architectures do not have a penalty for unaligned access, why I am seeing that in NumPy? That strikes like a quite strange thing to me. I believe that the way numpy is setup, it never does unaligned access, regardless of the platform, in case it gets run on one that would go up in flames if you tried to. So my guess would be that you are seeing chunked copies into a buffer, as opposed to bulk copying or no copying at all, and that would explain your timing differences. But Julian or Sebastian can probably give you a more informed answer. Yes, my guess is that you are right. I suppose that it is possible to improve the numpy codebase to accelerate this particular access pattern on Intel platforms, but provided that structured arrays are not that used (pandas is probably leading this use case by far, and as far as I know, they are not using structured arrays internally in DataFrames), then maybe it is not worth to worry about this too much. Thanks anyway, Francesc Jaime Thanks, Francesc -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- (\__/) ( O.o) ( ) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] Question about unaligned access

Oops, forgot to mention my NumPy version: In [72]: np.__version__ Out[72]: '1.9.2' Francesc 2015-07-06 17:18 GMT+02:00 Francesc Alted fal...@gmail.com: Hi, I have stumbled into this: In [62]: sa = np.fromiter(((i,i) for i in range(1000*1000)), dtype=[('f0', np.int64), ('f1', np.int32)]) In [63]: %timeit sa['f0'].sum() 100 loops, best of 3: 4.52 ms per loop In [64]: sa = np.fromiter(((i,i) for i in range(1000*1000)), dtype=[('f0', np.int64), ('f1', np.int64)]) In [65]: %timeit sa['f0'].sum() 1000 loops, best of 3: 896 µs per loop The first structured array is made of 12-byte records, while the second is made by 16-byte records, but the latter performs 5x faster. Also, using an structured array that is made of 8-byte records is the fastest (expected): In [66]: sa = np.fromiter(((i,) for i in range(1000*1000)), dtype=[('f0', np.int64)]) In [67]: %timeit sa['f0'].sum() 1000 loops, best of 3: 567 µs per loop Now, my laptop has a Ivy Bridge processor (i5-3380M) that should perform quite well on unaligned data: http://lemire.me/blog/archives/2012/05/31/data-alignment-for-speed-myth-or-reality/ So, if 4 years-old Intel architectures do not have a penalty for unaligned access, why I am seeing that in NumPy? That strikes like a quite strange thing to me. Thanks, Francesc -- Francesc Alted -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] Question about unaligned access

Hi, I have stumbled into this: In [62]: sa = np.fromiter(((i,i) for i in range(1000*1000)), dtype=[('f0', np.int64), ('f1', np.int32)]) In [63]: %timeit sa['f0'].sum() 100 loops, best of 3: 4.52 ms per loop In [64]: sa = np.fromiter(((i,i) for i in range(1000*1000)), dtype=[('f0', np.int64), ('f1', np.int64)]) In [65]: %timeit sa['f0'].sum() 1000 loops, best of 3: 896 µs per loop The first structured array is made of 12-byte records, while the second is made by 16-byte records, but the latter performs 5x faster. Also, using an structured array that is made of 8-byte records is the fastest (expected): In [66]: sa = np.fromiter(((i,) for i in range(1000*1000)), dtype=[('f0', np.int64)]) In [67]: %timeit sa['f0'].sum() 1000 loops, best of 3: 567 µs per loop Now, my laptop has a Ivy Bridge processor (i5-3380M) that should perform quite well on unaligned data: http://lemire.me/blog/archives/2012/05/31/data-alignment-for-speed-myth-or-reality/ So, if 4 years-old Intel architectures do not have a penalty for unaligned access, why I am seeing that in NumPy? That strikes like a quite strange thing to me. Thanks, Francesc -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: python-blosc 1.2.7 released

= Announcing python-blosc 1.2.7 = What is new? Updated to use c-blosc v1.6.1. Although that this supports AVX2, it is not enabled in python-blosc because we still need a way to devise how to detect AVX2 in the underlying platform. At any rate, c-blosc 1.6.1 fixed an important bug in the blosclz codec that a release was deemed important. For more info, you can have a look at the release notes in: https://github.com/Blosc/python-blosc/wiki/Release-notes More docs and examples are available in the documentation site: http://python-blosc.blosc.org What is it? === Blosc (http://www.blosc.org) is a high performance compressor optimized for binary data. It has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() OS call. Blosc is the first compressor that is meant not only to reduce the size of large datasets on-disk or in-memory, but also to accelerate object manipulations that are memory-bound (http://www.blosc.org/docs/StarvingCPUs.pdf). See http://www.blosc.org/synthetic-benchmarks.html for some benchmarks on how much speed it can achieve in some datasets. Blosc works well for compressing numerical arrays that contains data with relatively low entropy, like sparse data, time series, grids with regular-spaced values, etc. python-blosc (http://python-blosc.blosc.org/) is the Python wrapper for the Blosc compression library. There is also a handy tool built on Blosc called Bloscpack (https://github.com/Blosc/bloscpack). It features a commmand line interface that allows you to compress large binary datafiles on-disk. It also comes with a Python API that has built-in support for serializing and deserializing Numpy arrays both on-disk and in-memory at speeds that are competitive with regular Pickle/cPickle machinery. Installing == python-blosc is in PyPI repository, so installing it is easy: $ pip install -U blosc # yes, you should omit the python- prefix Download sources The sources are managed through github services at: http://github.com/Blosc/python-blosc Documentation = There is Sphinx-based documentation site at: http://python-blosc.blosc.org/ Mailing list There is an official mailing list for Blosc at: bl...@googlegroups.com http://groups.google.es/group/blosc Licenses Both Blosc and its Python wrapper are distributed using the MIT license. See: https://github.com/Blosc/python-blosc/blob/master/LICENSES for more details. **Enjoy data!** -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: PyTables 3.2.0 (final) released!

=== Announcing PyTables 3.2.0 === We are happy to announce PyTables 3.2.0. *** IMPORTANT NOTICE: If you are a user of PyTables, it needs your help to keep going. Please read the next thread as it contains important information about the future (or the lack of it) of the project: https://groups.google.com/forum/#!topic/pytables-users/yY2aUa4H7W4 Thanks! *** What's new == This is a major release of PyTables and it is the result of more than a year of accumulated patches, but most specially it fixes a couple of nasty problem with indexed queries not returning the correct results in some scenarios. There are many usablity and performance improvements too. In case you want to know more in detail what has changed in this version, please refer to: http://www.pytables.org/release_notes.html You can install it via pip or download a source package with generated PDF and HTML docs from: http://sourceforge.net/projects/pytables/files/pytables/3.2.0 For an online version of the manual, visit: http://www.pytables.org/usersguide/index.html What it is? === PyTables is a library for managing hierarchical datasets and designed to efficiently cope with extremely large amounts of data with support for full 64-bit file addressing. PyTables runs on top of the HDF5 library and NumPy package for achieving maximum throughput and convenient use. PyTables includes OPSI, a new indexing technology, allowing to perform data lookups in tables exceeding 10 gigarows (10**10 rows) in less than a tenth of a second. Resources = About PyTables: http://www.pytables.org About the HDF5 library: http://hdfgroup.org/HDF5/ About NumPy: http://numpy.scipy.org/ Acknowledgments === Thanks to many users who provided feature improvements, patches, bug reports, support and suggestions. See the ``THANKS`` file in the distribution package for a (incomplete) list of contributors. Most specially, a lot of kudos go to the HDF5 and NumPy makers. Without them, PyTables simply would not exist. Share your experience = Let us know of any bugs, suggestions, gripes, kudos, etc. you may have. **Enjoy data!** -- The PyTables Developers ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: PyTables 3.2.0 RC2 is out

=== Announcing PyTables 3.2.0rc2 === We are happy to announce PyTables 3.2.0rc2. *** IMPORTANT NOTICE: If you are a user of PyTables, it needs your help to keep going. Please read the next thread as it contains important information about the future (or lack of it) of the project: https://groups.google.com/forum/#!topic/pytables-users/yY2aUa4H7W4 Thanks! *** What's new == This is a major release of PyTables and it is the result of more than a year of accumulated patches, but most specially it fixes a couple of nasty problem with indexed queries not returning the correct results in some scenarios (mainly pandas users). There are many usability and performance improvements too. In case you want to know more in detail what has changed in this version, please refer to: http://www.pytables.org/release_notes.html You can install it via pip or download a source package with generated PDF and HTML docs from: http://sourceforge.net/projects/pytables/files/pytables/3.2.0rc2 For an online version of the manual, visit: http://www.pytables.org/usersguide/index.html What it is? === PyTables is a library for managing hierarchical datasets and designed to efficiently cope with extremely large amounts of data with support for full 64-bit file addressing. PyTables runs on top of the HDF5 library and NumPy package for achieving maximum throughput and convenient use. PyTables includes OPSI, a new indexing technology, allowing to perform data lookups in tables exceeding 10 gigarows (10**10 rows) in less than a tenth of a second. Resources = About PyTables: http://www.pytables.org About the HDF5 library: http://hdfgroup.org/HDF5/ About NumPy: http://numpy.scipy.org/ Acknowledgments === Thanks to many users who provided feature improvements, patches, bug reports, support and suggestions. See the ``THANKS`` file in the distribution package for a (incomplete) list of contributors. Most specially, a lot of kudos go to the HDF5 and NumPy makers. Without them, PyTables simply would not exist. Share your experience = Let us know of any bugs, suggestions, gripes, kudos, etc. you may have. **Enjoy data!** -- The PyTables Developers ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] ANN: numexpr 2.4.3 released

2015-04-28 4:59 GMT+02:00 Neil Girdhar mistersh...@gmail.com: I don't think I'm asking for so much. Somewhere inside numexpr it builds an AST of its own, which it converts into the optimized code. It would be more useful to me if that AST were in the same format as the one returned by Python's ast module. This way, I could glue in the bits of numexpr that I like with my code. For my purpose, this would have been the more ideal design. I don't think implementing this for numexpr would be that complex. So for example, one could add a new numexpr.eval_ast(ast_expr) function. Pull requests are welcome. At any rate, which is your use case? I am curious. -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: numexpr 2.4.3 released

Announcing Numexpr 2.4.3 = Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like 3*a+4*b) are accelerated and use less memory than doing the same calculation in Python. It wears multi-threaded capabilities, as well as support for Intel's MKL (Math Kernel Library), which allows an extremely fast evaluation of transcendental functions (sin, cos, tan, exp, log...) while squeezing the last drop of performance out of your multi-core processors. Look here for a some benchmarks of numexpr using MKL: https://github.com/pydata/numexpr/wiki/NumexprMKL Its only dependency is NumPy (MKL is optional), so it works well as an easy-to-deploy, easy-to-use, computational engine for projects that don't want to adopt other solutions requiring more heavy dependencies. What's new == This is a maintenance release to cope with an old bug affecting comparisons with empty strings. Fixes #121 and PyTables #184. In case you want to know more in detail what has changed in this version, see: https://github.com/pydata/numexpr/wiki/Release-Notes or have a look at RELEASE_NOTES.txt in the tarball. Where I can find Numexpr? = The project is hosted at GitHub in: https://github.com/pydata/numexpr You can get the packages from PyPI as well (but not for RC releases): http://pypi.python.org/pypi/numexpr Share your experience = Let us know of any bugs, suggestions, gripes, kudos, etc. you may have. Enjoy data! -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: PyTables 3.2.0 release candidate 1 is out

=== Announcing PyTables 3.2.0rc1 === We are happy to announce PyTables 3.2.0rc1. *** IMPORTANT NOTICE: If you are a user of PyTables, it needs your help to keep going. Please read the next thread as it contains important information about the future of the project: https://groups.google.com/forum/#!topic/pytables-users/yY2aUa4H7W4 Thanks! *** What's new == This is a major release of PyTables and it is the result of more than a year of accumulated patches, but most specially it fixes a nasty problem with indexed queries not returning the correct results in some scenarios. There are many usablity and performance improvements too. In case you want to know more in detail what has changed in this version, please refer to: http://pytables.github.io/release_notes.html You can download a source package with generated PDF and HTML docs, as well as binaries for Windows, from: http://sourceforge.net/projects/pytables/files/pytables/3.2.0rc1 For an online version of the manual, visit: http://pytables.github.io/usersguide/index.html What it is? === PyTables is a library for managing hierarchical datasets and designed to efficiently cope with extremely large amounts of data with support for full 64-bit file addressing. PyTables runs on top of the HDF5 library and NumPy package for achieving maximum throughput and convenient use. PyTables includes OPSI, a new indexing technology, allowing to perform data lookups in tables exceeding 10 gigarows (10**10 rows) in less than a tenth of a second. Resources = About PyTables: http://www.pytables.org About the HDF5 library: http://hdfgroup.org/HDF5/ About NumPy: http://numpy.scipy.org/ Acknowledgments === Thanks to many users who provided feature improvements, patches, bug reports, support and suggestions. See the ``THANKS`` file in the distribution package for a (incomplete) list of contributors. Most specially, a lot of kudos go to the HDF5 and NumPy makers. Without them, PyTables simply would not exist. Share your experience = Let us know of any bugs, suggestions, gripes, kudos, etc. you may have. **Enjoy data!** -- The PyTables Developers ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: numexpr 2.4.1 released

= Announcing Numexpr 2.4.1 = Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like 3*a+4*b) are accelerated and use less memory than doing the same calculation in Python. It wears multi-threaded capabilities, as well as support for Intel's MKL (Math Kernel Library), which allows an extremely fast evaluation of transcendental functions (sin, cos, tan, exp, log...) while squeezing the last drop of performance out of your multi-core processors. Look here for a some benchmarks of numexpr using MKL: https://github.com/pydata/numexpr/wiki/NumexprMKL Its only dependency is NumPy (MKL is optional), so it works well as an easy-to-deploy, easy-to-use, computational engine for projects that don't want to adopt other solutions requiring more heavy dependencies. What's new == In this version there is improved support for newer MKL library as well as other minor improvements. This version is meant for production. In case you want to know more in detail what has changed in this version, see: https://github.com/pydata/numexpr/wiki/Release-Notes or have a look at RELEASE_NOTES.txt in the tarball. Where I can find Numexpr? = The project is hosted at GitHub in: https://github.com/pydata/numexpr You can get the packages from PyPI as well (but not for RC releases): http://pypi.python.org/pypi/numexpr Share your experience = Let us know of any bugs, suggestions, gripes, kudos, etc. you may have. Enjoy data! -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] Introductory mail and GSoc Project Vector math library integration

2015-03-08 21:47 GMT+01:00 Dp Docs sdpa...@gmail.com: Hi all, I am a CS 3rd Undergrad. Student from an Indian Institute (III T). I believe I am good in Programming languages like C/C++, Python as I have already done Some Projects using these language as a part of my academics. I really like Coding (Competitive as well as development). I really want to get involved in Numpy Development Project and want to take Vector math library integration as a part of my project. I want to here any idea from your side for this project. Thanks For your time for reading this email and responding back. As Sturla and Gregor suggested, there are quite a few attempts to solve this shortcoming in NumPy. In particular Gregor integrated MKL/VML support in numexpr quite a long time ago, and when combined with my own implementation of pooled threads (behaving better than Intel's implementation in VML), then the thing literally flies: https://github.com/pydata/numexpr/wiki/NumexprMKL numba is also another interesting option and it shows much better compiling times than the integrated compiler in numexpr. You can see a quick comparison about expected performances between numexpr and numba: http://nbviewer.ipython.org/gist/anonymous/4117896 In general, numba wins for small arrays, but numexpr can achieve very good performance for larger ones. I think there are interesting things to discover in both projects, as for example, how they manage memory in order to avoid temporaries or how they deal with unaligned data efficiently. I would advise to look at existing docs and presentations explaining things in more detail too. All in all, I would really love to see such a vector math library support integrated in NumPy because frankly, I don't have bandwidth for maintaining numexpr anymore (and I am afraid that nobody else would jump in this ship ;). Good luck! Francesc My IRCnickname: dp Real Name: Durgesh Pandey. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] Vectorizing computation

Hi, I would like to vectorize the next computation: nx, ny, nz = 720, 180, 3 outheight = np.arange(nz) * 3 oro = np.arange(nx * ny).reshape((nx, ny)) def compute1(outheight, oro): result = np.zeros((nx, ny, nz)) for ix in range(nx): for iz in range(nz): result[ix, :, iz] = outheight[iz] + oro[ix, :] return result I think this should be possible by using an advanced use of broadcasting in numpy. Anyone willing to post a solution? Thanks, -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] Vectorizing computation

2015-02-13 12:51 GMT+01:00 Julian Taylor jtaylor.deb...@googlemail.com: On 02/13/2015 11:51 AM, Francesc Alted wrote: Hi, I would like to vectorize the next computation: nx, ny, nz = 720, 180, 3 outheight = np.arange(nz) * 3 oro = np.arange(nx * ny).reshape((nx, ny)) def compute1(outheight, oro): result = np.zeros((nx, ny, nz)) for ix in range(nx): for iz in range(nz): result[ix, :, iz] = outheight[iz] + oro[ix, :] return result I think this should be possible by using an advanced use of broadcasting in numpy. Anyone willing to post a solution? result = outheight + oro.reshape(nx, ny, 1) And 4x faster for my case. Oh my, I am afraid that my mind will never scratch all the amazing possibilities that broadcasting is offering :) Thank you very much for such an elegant solution! Francesc ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] Vectorizing computation

2015-02-13 13:25 GMT+01:00 Julian Taylor jtaylor.deb...@googlemail.com: On 02/13/2015 01:03 PM, Francesc Alted wrote: 2015-02-13 12:51 GMT+01:00 Julian Taylor jtaylor.deb...@googlemail.com mailto:jtaylor.deb...@googlemail.com: On 02/13/2015 11:51 AM, Francesc Alted wrote: Hi, I would like to vectorize the next computation: nx, ny, nz = 720, 180, 3 outheight = np.arange(nz) * 3 oro = np.arange(nx * ny).reshape((nx, ny)) def compute1(outheight, oro): result = np.zeros((nx, ny, nz)) for ix in range(nx): for iz in range(nz): result[ix, :, iz] = outheight[iz] + oro[ix, :] return result I think this should be possible by using an advanced use of broadcasting in numpy. Anyone willing to post a solution? result = outheight + oro.reshape(nx, ny, 1) And 4x faster for my case. Oh my, I am afraid that my mind will never scratch all the amazing possibilities that broadcasting is offering :) Thank you very much for such an elegant solution! if speed is a concern this is faster as it has a better data layout for numpy during the computation, but the result may be worse layed out for further processing result = outheight.reshape(nz, 1, 1) + oro return np.rollaxis(result, 0, 3) Holly cow, this makes for another 4x speed improvement! I don't think I need that much in my scenario, so I will stick with the first one (more readable and the expected data layout), but thanks a lot! Francesc ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: bcolz 0.7.1 released

== Announcing bcolz 0.7.1 == What's new == This is maintenance release, where bcolz got rid of the nose dependency for Python 2.6 (only unittest2 should be required). Also, some small fixes for the test suite, specially in 32-bit has been done. Thanks to Ilan Schnell for pointing out the problems and for suggesting fixes. ``bcolz`` is a renaming of the ``carray`` project. The new goals for the project are to create simple, yet flexible compressed containers, that can live either on-disk or in-memory, and with some high-performance iterators (like `iter()`, `where()`) for querying them. Together, bcolz and the Blosc compressor, are finally fullfilling the promise of accelerating memory I/O, at least for some real scenarios: http://nbviewer.ipython.org/github/Blosc/movielens-bench/blob/master/querying-ep14.ipynb#Plots For more detailed info, see the release notes in: https://github.com/Blosc/bcolz/wiki/Release-Notes What it is == bcolz provides columnar and compressed data containers. Column storage allows for efficiently querying tables with a large number of columns. It also allows for cheap addition and removal of column. In addition, bcolz objects are compressed by default for reducing memory/disk I/O needs. The compression process is carried out internally by Blosc, a high-performance compressor that is optimized for binary data. bcolz can use numexpr internally so as to accelerate many vector and query operations (although it can use pure NumPy for doing so too). numexpr optimizes the memory usage and use several cores for doing the computations, so it is blazing fast. Moreover, the carray/ctable containers can be disk-based, and it is possible to use them for seamlessly performing out-of-memory computations. bcolz has minimal dependencies (NumPy), comes with an exhaustive test suite and fully supports both 32-bit and 64-bit platforms. Also, it is typically tested on both UNIX and Windows operating systems. Installing == bcolz is in the PyPI repository, so installing it is easy: $ pip install -U bcolz Resources = Visit the main bcolz site repository at: http://github.com/Blosc/bcolz Manual: http://bcolz.blosc.org Home of Blosc compressor: http://blosc.org User's mail list: bc...@googlegroups.com http://groups.google.com/group/bcolz License is the new BSD: https://github.com/Blosc/bcolz/blob/master/LICENSES/BCOLZ.txt **Enjoy data!** -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: bcolz 0.7.0 released

== Announcing bcolz 0.7.0 == What's new == In this release, support for Python 3 has been added, Pandas and HDF5/PyTables conversion, support for different compressors via latest release of Blosc, and a new `iterblocks()` iterator. Also, intensive benchmarking has lead to an important tuning of buffer sizes parameters so that compression and evaluation goes faster than ever. Together, bcolz and the Blosc compressor, are finally fullfilling the promise of accelerating memory I/O, at least for some real scenarios: http://nbviewer.ipython.org/github/Blosc/movielens-bench/blob/master/querying-ep14.ipynb#Plots ``bcolz`` is a renaming of the ``carray`` project. The new goals for the project are to create simple, yet flexible compressed containers, that can live either on-disk or in-memory, and with some high-performance iterators (like `iter()`, `where()`) for querying them. For more detailed info, see the release notes in: https://github.com/Blosc/bcolz/wiki/Release-Notes What it is == bcolz provides columnar and compressed data containers. Column storage allows for efficiently querying tables with a large number of columns. It also allows for cheap addition and removal of column. In addition, bcolz objects are compressed by default for reducing memory/disk I/O needs. The compression process is carried out internally by Blosc, a high-performance compressor that is optimized for binary data. bcolz can use numexpr internally so as to accelerate many vector and query operations (although it can use pure NumPy for doing so too). numexpr optimizes the memory usage and use several cores for doing the computations, so it is blazing fast. Moreover, the carray/ctable containers can be disk-based, and it is possible to use them for seamlessly performing out-of-memory computations. bcolz has minimal dependencies (NumPy), comes with an exhaustive test suite and fully supports both 32-bit and 64-bit platforms. Also, it is typically tested on both UNIX and Windows operating systems. Installing == bcolz is in the PyPI repository, so installing it is easy: $ pip install -U bcolz Resources = Visit the main bcolz site repository at: http://github.com/Blosc/bcolz Manual: http://bcolz.blosc.org Home of Blosc compressor: http://blosc.org User's mail list: bc...@googlegroups.com http://groups.google.com/group/bcolz License is the new BSD: https://github.com/Blosc/bcolz/blob/master/LICENSES/BCOLZ.txt **Enjoy data!** -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: python-blosc 1.2.7 released

= Announcing python-blosc 1.2.4 = What is new? This is a maintenance release, where included c-blosc sources have been updated to 1.4.0. This adds support for non-Intel architectures, most specially those not supporting unaligned access. For more info, you can have a look at the release notes in: https://github.com/Blosc/python-blosc/wiki/Release-notes More docs and examples are available in the documentation site: http://python-blosc.blosc.org What is it? === Blosc (http://www.blosc.org) is a high performance compressor optimized for binary data. It has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() OS call. Blosc is the first compressor that is meant not only to reduce the size of large datasets on-disk or in-memory, but also to accelerate object manipulations that are memory-bound (http://www.blosc.org/docs/StarvingCPUs.pdf). See http://www.blosc.org/synthetic-benchmarks.html for some benchmarks on how much speed it can achieve in some datasets. Blosc works well for compressing numerical arrays that contains data with relatively low entropy, like sparse data, time series, grids with regular-spaced values, etc. python-blosc (http://python-blosc.blosc.org/) is the Python wrapper for the Blosc compression library. There is also a handy command line and Python library for Blosc called Bloscpack (https://github.com/Blosc/bloscpack) that allows you to compress large binary datafiles on-disk. Installing == python-blosc is in PyPI repository, so installing it is easy: $ pip install -U blosc # yes, you should omit the python- prefix Download sources The sources are managed through github services at: http://github.com/Blosc/python-blosc Documentation = There is Sphinx-based documentation site at: http://python-blosc.blosc.org/ Mailing list There is an official mailing list for Blosc at: bl...@googlegroups.com http://groups.google.es/group/blosc Licenses Both Blosc and its Python wrapper are distributed using the MIT license. See: https://github.com/Blosc/python-blosc/blob/master/LICENSES for more details. **Enjoy data!** -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] [CORRECTION] python-blosc 1.2.4 released (Was: ANN: python-blosc 1.2.7 released)

Indeed it was 1.2.4 the version just released and not 1.2.7. Sorry for the typo! Francesc On 7/7/14, 8:20 PM, Francesc Alted wrote: = Announcing python-blosc 1.2.4 = What is new? This is a maintenance release, where included c-blosc sources have been updated to 1.4.0. This adds support for non-Intel architectures, most specially those not supporting unaligned access. For more info, you can have a look at the release notes in: https://github.com/Blosc/python-blosc/wiki/Release-notes More docs and examples are available in the documentation site: http://python-blosc.blosc.org What is it? === Blosc (http://www.blosc.org) is a high performance compressor optimized for binary data. It has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() OS call. Blosc is the first compressor that is meant not only to reduce the size of large datasets on-disk or in-memory, but also to accelerate object manipulations that are memory-bound (http://www.blosc.org/docs/StarvingCPUs.pdf). See http://www.blosc.org/synthetic-benchmarks.html for some benchmarks on how much speed it can achieve in some datasets. Blosc works well for compressing numerical arrays that contains data with relatively low entropy, like sparse data, time series, grids with regular-spaced values, etc. python-blosc (http://python-blosc.blosc.org/) is the Python wrapper for the Blosc compression library. There is also a handy command line and Python library for Blosc called Bloscpack (https://github.com/Blosc/bloscpack) that allows you to compress large binary datafiles on-disk. Installing == python-blosc is in PyPI repository, so installing it is easy: $ pip install -U blosc # yes, you should omit the python- prefix Download sources The sources are managed through github services at: http://github.com/Blosc/python-blosc Documentation = There is Sphinx-based documentation site at: http://python-blosc.blosc.org/ Mailing list There is an official mailing list for Blosc at: bl...@googlegroups.com http://groups.google.es/group/blosc Licenses Both Blosc and its Python wrapper are distributed using the MIT license. See: https://github.com/Blosc/python-blosc/blob/master/LICENSES for more details. **Enjoy data!** -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] IDL vs Python parallel computing

On 5/3/14, 11:56 PM, Siegfried Gonzi wrote: Hi all I noticed IDL uses at least 400% (4 processors or cores) out of the box for simple things like reading and processing files, calculating the mean etc. I have never seen this happening with numpy except for the linalgebra stuff (e.g lapack). Well, this might be because it is the place where using several processes makes more sense. Normally, when you are reading files, the bottleneck is the I/O subsystem (at least if you don't have to convert from text to numbers), and for calculating the mean, normally the bottleneck is memory throughput. Having said this, there are several packages that work on top of NumPy that can use multiple cores when performing numpy operations, like numexpr (https://github.com/pydata/numexpr), or Theano (http://deeplearning.net/software/theano/tutorial/multi_cores.html) -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] High-quality memory profiling for numpy in python 3.5 / volunteers needed

El 18/04/14 13:39, Francesc Alted ha escrit: So, sqrt in numpy has barely the same speed than the one in MKL. Again, I wonder why :) So by peeking into the code I have seen that you implemented sqrt using SSE2 intrinsics. Cool! -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] High-quality memory profiling for numpy in python 3.5 / volunteers needed

El 17/04/14 21:19, Julian Taylor ha escrit: On 17.04.2014 20:30, Francesc Alted wrote: El 17/04/14 19:28, Julian Taylor ha escrit: On 17.04.2014 18:06, Francesc Alted wrote: In [4]: x_unaligned = np.zeros(shape, dtype=[('y1',np.int8),('x',np.float64),('y2',np.int8,(7,))])['x'] on arrays of this size you won't see alignment issues you are dominated by memory bandwidth. If at all you will only see it if the data fits into the cache. Its also about unaligned to simd vectors not unaligned to basic types. But it doesn't matter anymore on modern x86 cpus. I guess for array data cache line splits should also not be a big concern. Yes, that was my point, that in x86 CPUs this is not such a big problem. But still a factor of 2 is significant, even for CPU-intensive tasks. For example, computing sin() is affected similarly (sin() is using SIMD, right?): In [6]: shape = (1, 1) In [7]: x_aligned = np.zeros(shape, dtype=[('x',np.float64),('y',np.int64)])['x'] In [8]: x_unaligned = np.zeros(shape, dtype=[('y1',np.int8),('x',np.float64),('y2',np.int8,(7,))])['x'] In [9]: %timeit res = np.sin(x_aligned) 1 loops, best of 3: 654 ms per loop In [10]: %timeit res = np.sin(x_unaligned) 1 loops, best of 3: 1.08 s per loop and again, numexpr can deal with that pretty well (using 8 physical cores here): In [6]: %timeit res = ne.evaluate('sin(x_aligned)') 10 loops, best of 3: 149 ms per loop In [7]: %timeit res = ne.evaluate('sin(x_unaligned)') 10 loops, best of 3: 151 ms per loop in this case the unaligned triggers a strided memcpy calling loop to copy the data into a aligned buffer which is terrible for performance, even compared to the expensive sin call. numexpr handles this well as it allows the compiler to replace the memcpy with inline assembly (a mov instruction). We could fix that in numpy, though I don't consider it very important, you usually always have base type aligned memory. Well, that *could* be important for evaluating conditions in structured arrays, as it is pretty easy to get unaligned 'columns'. But apparently this does not affect very much to numpy: In [23]: na_aligned = np.fromiter(((, i, i*2) for i in xrange(N)), dtype=S16,i4,i8) In [24]: na_unaligned = np.fromiter(((, i, i*2) for i in xrange(N)), dtype=S15,i4,i8) In [25]: %time sum((r['f1'] for r in na_aligned[na_aligned['f2'] 10])) CPU times: user 10.2 s, sys: 93 ms, total: 10.3 s Wall time: 10.3 s Out[25]: 499485 In [26]: %time sum((r['f1'] for r in na_unaligned[na_unaligned['f2'] 10])) CPU times: user 10.2 s, sys: 82 ms, total: 10.3 s Wall time: 10.3 s Out[26]: 499485 probably because the bottleneck is in another place. So yeah, probably not worth to worry about that. (sin is not a SIMD using function unless you use a vector math library not supported by numpy directly yet) Ah, so MKL is making use of SIMD for computing the sin(), but not in general. But you later said that numpy's sqrt *is* making use of SIMD. I wonder why. Aligned allocators are not the only allocator which might be useful in numpy. Modern CPUs also support larger pages than 4K (huge pages up to 1GB in size) which reduces TLB cache misses. Memory of this type typically needs to be allocated with special mmap flags, though newer kernel versions can now also provide this memory to transparent anonymous pages (normal non-file mmaps). That's interesting. In which scenarios do you think that could improve performance? it might improve all numpy operations dealing with big arrays. big arrays trigger many large temporaries meaning glibc uses mmap meaning lots of moving of address space between the kernel and userspace. but I haven't benchmarked it, so it could also be completely irrelevant. I was curious about this and apparently the speedups that typically bring large page caches is around 5%: http://stackoverflow.com/questions/14275170/performance-degradation-with-large-pages not a big deal, but it is something. Also memory fragments really fast, so after a few hours of operation you often can't allocate any huge pages anymore, so you need to reserve space for them which requires special setup of machines. Another possibility for special allocators are numa allocators that ensure you get memory local to a specific compute node regardless of the system numa policy. But again its probably not very important as python has poor thread scalability anyway, these are just examples for keeping flexibility of our allocators in numpy and not binding us to what python does. Agreed. That's smart. Yeah, I don't see a reason why numexpr would be performing badly on Ubuntu. But I am not getting your performance for blocked_thread on my AMI linux vbox: http://nbviewer.ipython.org/gist/anonymous/11000524 my numexpr amd64 package does not make use of SIMD e.g. sqrt which is vectorized in numpy: numexpr: 1.30 │ 4638: sqrtss (%r14),%xmm0 0.01 │ ucomis

### Re: [Numpy-discussion] About the npz format

El 18/04/14 13:01, Valentin Haenel ha escrit: Hi again, * onefire onefire.mys...@gmail.com [2014-04-18]: I think your workaround might help, but a better solution would be to not use Python's zipfile module at all. This would make it possible to, say, let the user choose the checksum algorithm or to turn that off. Or maybe the compression stuff makes this route too complicated to be worth the trouble? (after all, the zip format is not that hard to understand) Just to give you an idea of what my aforementioned Bloscpack library can do in the case of linspace: In [1]: import numpy as np In [2]: import bloscpack as bp In [3]: import bloscpack.sysutil as bps In [4]: x = np.linspace(1, 10, 5000) In [5]: %timeit np.save(x.npy, x) ; bps.sync() 1 loops, best of 3: 2.12 s per loop In [6]: %timeit bp.pack_ndarray_file(x, 'x.blp') ; bps.sync() 1 loops, best of 3: 627 ms per loop In [7]: %timeit -n 3 -r 3 np.save(x.npy, x) ; bps.sync() 3 loops, best of 3: 1.92 s per loop In [8]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x.blp') ; bps.sync() 3 loops, best of 3: 564 ms per loop In [9]: ls -lah x.npy x.blp -rw-r--r-- 1 root root 49M Apr 18 12:53 x.blp -rw-r--r-- 1 root root 382M Apr 18 12:52 x.npy However, this is a bit of special case, since Blosc does extremely well -- both speed and size wise -- on the linspace data, your milage may vary. Exactly, and besides, Blosc can use different codes inside it. Just for completeness, here it is a small benchmark of what you can expect from them (my laptop does not have a SSD, so my figures are a bit slow compared with Valentin's): In [50]: %timeit -n 3 -r 3 np.save(x.npy, x) ; bps.sync() 3 loops, best of 3: 5.7 s per loop In [51]: cargs = bp.args.DEFAULT_BLOSC_ARGS In [52]: cargs['cname'] = 'blosclz' In [53]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-blosclz.blp', blosc_args=cargs) ; bps.sync() 3 loops, best of 3: 1.12 s per loop In [54]: cargs['cname'] = 'lz4' In [55]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-lz4.blp', blosc_args=cargs) ; bps.sync() 3 loops, best of 3: 985 ms per loop In [56]: cargs['cname'] = 'lz4hc' In [57]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-lz4hc.blp', blosc_args=cargs) ; bps.sync() 3 loops, best of 3: 1.95 s per loop In [58]: cargs['cname'] = 'snappy' In [59]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-snappy.blp', blosc_args=cargs) ; bps.sync() 3 loops, best of 3: 1.11 s per loop In [60]: cargs['cname'] = 'zlib' In [61]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-zlib.blp', blosc_args=cargs) ; bps.sync() 3 loops, best of 3: 3.12 s per loop so all the codecs can make the storage go faster than a pure np.save(), and most specially blosclz, lz4 and snappy. However, lz4hc and zlib achieve the best compression ratios: In [62]: ls -lht x*.* -rw-r--r-- 1 faltet users 7,0M 18 abr 13:49 x-zlib.blp -rw-r--r-- 1 faltet users 54M 18 abr 13:48 x-snappy.blp -rw-r--r-- 1 faltet users 7,0M 18 abr 13:48 x-lz4hc.blp -rw-r--r-- 1 faltet users 48M 18 abr 13:47 x-lz4.blp -rw-r--r-- 1 faltet users 49M 18 abr 13:47 x-blosclz.blp -rw-r--r-- 1 faltet users 382M 18 abr 13:42 x.npy But again, we are talking about a specially nice compression case. -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] High-quality memory profiling for numpy in python 3.5 / volunteers needed

Uh, 15x slower for unaligned access is quite a lot. But Intel (and AMD) arquitectures are much more tolerant in this aspect (and improving). For example, with a Xeon(R) CPU E5-2670 (2 years old) I get: In [1]: import numpy as np In [2]: shape = (1, 1) In [3]: x_aligned = np.zeros(shape, dtype=[('x',np.float64),('y',np.int64)])['x'] In [4]: x_unaligned = np.zeros(shape, dtype=[('y1',np.int8),('x',np.float64),('y2',np.int8,(7,))])['x'] In [5]: %timeit res = x_aligned ** 2 1 loops, best of 3: 289 ms per loop In [6]: %timeit res = x_unaligned ** 2 1 loops, best of 3: 664 ms per loop so the added cost in this case is just a bit more than 2x. But you can also alleviate this overhead if you do a copy that fits in cache prior to do computations. numexpr does this: https://github.com/pydata/numexpr/blob/master/numexpr/interp_body.cpp#L203 and the results are pretty good: In [8]: import numexpr as ne In [9]: %timeit res = ne.evaluate('x_aligned ** 2') 10 loops, best of 3: 133 ms per loop In [10]: %timeit res = ne.evaluate('x_unaligned ** 2') 10 loops, best of 3: 134 ms per loop i.e. there is not a significant difference between aligned and unaligned access to data. I wonder if the same technique could be applied to NumPy. Francesc El 17/04/14 16:26, Aron Ahmadia ha escrit: Hmnn, I wasn't being clear :) The default malloc on BlueGene/Q only returns 8 byte alignment, but the SIMD units need 32-byte alignment for loads, stores, and operations or performance suffers. On the /P the required alignment was 16-bytes, but malloc only gave you 8, and trying to perform vectorized loads/stores generated alignment exceptions on unaligned memory. See https://wiki.alcf.anl.gov/parts/index.php/Blue_Gene/Q and https://computing.llnl.gov/tutorials/bgp/BGP-usage.Walkup.pdf (slides 14 for overview, 15 for the effective performance difference between the unaligned/aligned code) for some notes on this. A On Thu, Apr 17, 2014 at 10:18 AM, Nathaniel Smith n...@pobox.com mailto:n...@pobox.com wrote: On 17 Apr 2014 15:09, Aron Ahmadia a...@ahmadia.net mailto:a...@ahmadia.net wrote: On the one hand it would be nice to actually know whether posix_memalign is important, before making api decisions on this basis. FWIW: On the lightweight IBM cores that the extremely popular BlueGene machines were based on, accessing unaligned memory raised system faults. The default behavior of these machines was to terminate the program if more than 1000 such errors occurred on a given process, and an environment variable allowed you to terminate the program if *any* unaligned memory access occurred. This is because unaligned memory accesses were 15x (or more) slower than aligned memory access. The newer /Q chips seem to be a little more forgiving of this, but I think one can in general expect allocated memory alignment to be an important performance technique for future high performance computing architectures. Right, this much is true on lots of architectures, and so malloc is careful to always return values with sufficient alignment (e.g. 8 bytes) to make sure that any standard operation can succeed. The question here is whether it will be important to have *even more* alignment than malloc gives us by default. A 16 or 32 byte wide SIMD instruction might prefer that data have 16 or 32 byte alignment, even if normal memory access for the types being operated on only requires 4 or 8 byte alignment. -n ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org mailto:NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] High-quality memory profiling for numpy in python 3.5 / volunteers needed

El 17/04/14 19:28, Julian Taylor ha escrit: On 17.04.2014 18:06, Francesc Alted wrote: In [4]: x_unaligned = np.zeros(shape, dtype=[('y1',np.int8),('x',np.float64),('y2',np.int8,(7,))])['x'] on arrays of this size you won't see alignment issues you are dominated by memory bandwidth. If at all you will only see it if the data fits into the cache. Its also about unaligned to simd vectors not unaligned to basic types. But it doesn't matter anymore on modern x86 cpus. I guess for array data cache line splits should also not be a big concern. Yes, that was my point, that in x86 CPUs this is not such a big problem. But still a factor of 2 is significant, even for CPU-intensive tasks. For example, computing sin() is affected similarly (sin() is using SIMD, right?): In [6]: shape = (1, 1) In [7]: x_aligned = np.zeros(shape, dtype=[('x',np.float64),('y',np.int64)])['x'] In [8]: x_unaligned = np.zeros(shape, dtype=[('y1',np.int8),('x',np.float64),('y2',np.int8,(7,))])['x'] In [9]: %timeit res = np.sin(x_aligned) 1 loops, best of 3: 654 ms per loop In [10]: %timeit res = np.sin(x_unaligned) 1 loops, best of 3: 1.08 s per loop and again, numexpr can deal with that pretty well (using 8 physical cores here): In [6]: %timeit res = ne.evaluate('sin(x_aligned)') 10 loops, best of 3: 149 ms per loop In [7]: %timeit res = ne.evaluate('sin(x_unaligned)') 10 loops, best of 3: 151 ms per loop Aligned allocators are not the only allocator which might be useful in numpy. Modern CPUs also support larger pages than 4K (huge pages up to 1GB in size) which reduces TLB cache misses. Memory of this type typically needs to be allocated with special mmap flags, though newer kernel versions can now also provide this memory to transparent anonymous pages (normal non-file mmaps). That's interesting. In which scenarios do you think that could improve performance? In [8]: import numexpr as ne In [9]: %timeit res = ne.evaluate('x_aligned ** 2') 10 loops, best of 3: 133 ms per loop In [10]: %timeit res = ne.evaluate('x_unaligned ** 2') 10 loops, best of 3: 134 ms per loop i.e. there is not a significant difference between aligned and unaligned access to data. I wonder if the same technique could be applied to NumPy. you already can do so with relatively simple means: http://nbviewer.ipython.org/gist/anonymous/10942132 If you change the blocking function to get a function as input and use inplace operations numpy can even beat numexpr (though I used the numexpr Ubuntu package which might not be compiled optimally) This type of transformation can probably be applied on the AST quite easily. That's smart. Yeah, I don't see a reason why numexpr would be performing badly on Ubuntu. But I am not getting your performance for blocked_thread on my AMI linux vbox: http://nbviewer.ipython.org/gist/anonymous/11000524 oh well, threads are always tricky beasts. By the way, apparently the optimal block size for my machine is something like 1 MB, not 128 KB, although the difference is not big: http://nbviewer.ipython.org/gist/anonymous/11002751 (thanks to Stefan Van der Walt for the script). -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: numexpr 2.4 is out

Announcing Numexpr 2.4 Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like 3*a+4*b) are accelerated and use less memory than doing the same calculation in Python. It wears multi-threaded capabilities, as well as support for Intel's MKL (Math Kernel Library), which allows an extremely fast evaluation of transcendental functions (sin, cos, tan, exp, log...) while squeezing the last drop of performance out of your multi-core processors. Look here for a some benchmarks of numexpr using MKL: https://github.com/pydata/numexpr/wiki/NumexprMKL Its only dependency is NumPy (MKL is optional), so it works well as an easy-to-deploy, easy-to-use, computational engine for projects that don't want to adopt other solutions requiring more heavy dependencies. What's new == A new `contains()` function has been added for detecting substrings in strings. Only plain strings (bytes) are supported for now (see ticket #142). Thanks to Marcin Krol. You can have a glimpse on how `contains()` works in this notebook: http://nbviewer.ipython.org/gist/FrancescAlted/10595974 where it can be seen that this can make substring searches more than 10x faster than with regular Python. You can find the source for the notebook here: https://github.com/FrancescAlted/ngrams Also, there is a new version of setup.py that allows better management of the NumPy dependency during pip installs. Thanks to Aleks Bunin. Windows related bugs have been addressed and (hopefully) squashed. Thanks to Christoph Gohlke. In case you want to know more in detail what has changed in this version, see: https://github.com/pydata/numexpr/wiki/Release-Notes or have a look at RELEASE_NOTES.txt in the tarball. Where I can find Numexpr? = The project is hosted at GitHub in: https://github.com/pydata/numexpr You can get the packages from PyPI as well (but not for RC releases): http://pypi.python.org/pypi/numexpr Share your experience = Let us know of any bugs, suggestions, gripes, kudos, etc. you may have. Enjoy data! -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] PEP 465 has been accepted / volunteers needed

On 4/9/14, 10:46 PM, Chris Barker wrote: On Tue, Apr 8, 2014 at 11:14 AM, Nathaniel Smith n...@pobox.com mailto:n...@pobox.com wrote: Thank you! Though I suspect that the most important part of my contribution may have just been my high tolerance for writing emails ;-). no -- it's your high tolerance for _reading_ emails... Far too many of us have a high tolerance for writing them! Ha ha, very true! -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: numexpr 2.4 RC2

=== Announcing Numexpr 2.4 RC2 === Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like 3*a+4*b) are accelerated and use less memory than doing the same calculation in Python. It wears multi-threaded capabilities, as well as support for Intel's MKL (Math Kernel Library), which allows an extremely fast evaluation of transcendental functions (sin, cos, tan, exp, log...) while squeezing the last drop of performance out of your multi-core processors. Look here for a some benchmarks of numexpr using MKL: https://github.com/pydata/numexpr/wiki/NumexprMKL Its only dependency is NumPy (MKL is optional), so it works well as an easy-to-deploy, easy-to-use, computational engine for projects that don't want to adopt other solutions requiring more heavy dependencies. What's new == A new `contains()` function has been added for detecting substrings in strings. Only plain strings (bytes) are supported for now (see ticket #142). Thanks to Marcin Krol. Also, there is a new version of setup.py that allows better management of the NumPy dependency during pip installs. Thanks to Aleks Bunin. Windows related bugs have been addressed and (hopefully) squashed. Thanks to Christoph Gohlke. In case you want to know more in detail what has changed in this version, see: https://github.com/pydata/numexpr/wiki/Release-Notes or have a look at RELEASE_NOTES.txt in the tarball. Where I can find Numexpr? = The project is hosted at GitHub in: https://github.com/pydata/numexpr You can get the packages from PyPI as well (but not for RC releases): http://pypi.python.org/pypi/numexpr Share your experience = Let us know of any bugs, suggestions, gripes, kudos, etc. you may have. Enjoy data! ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: numexpr 2.4 RC1

=== Announcing Numexpr 2.4 RC1 === Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like 3*a+4*b) are accelerated and use less memory than doing the same calculation in Python. It wears multi-threaded capabilities, as well as support for Intel's MKL (Math Kernel Library), which allows an extremely fast evaluation of transcendental functions (sin, cos, tan, exp, log...) while squeezing the last drop of performance out of your multi-core processors. Look here for a some benchmarks of numexpr using MKL: https://github.com/pydata/numexpr/wiki/NumexprMKL Its only dependency is NumPy (MKL is optional), so it works well as an easy-to-deploy, easy-to-use, computational engine for projects that don't want to adopt other solutions requiring more heavy dependencies. What's new == A new `contains()` function has been added for detecting substrings in strings. Thanks to Marcin Krol. Also, there is a new version of setup.py that allows better management of the NumPy dependency during pip installs. Thanks to Aleks Bunin. This is the first release candidate before 2.4 final would be out, so please give it a go and report back any problems you may have. In case you want to know more in detail what has changed in this version, see: https://github.com/pydata/numexpr/wiki/Release-Notes or have a look at RELEASE_NOTES.txt in the tarball. Where I can find Numexpr? = The project is hosted at GitHub in: https://github.com/pydata/numexpr You can get the packages from PyPI as well (but not for RC releases): http://pypi.python.org/pypi/numexpr Share your experience = Let us know of any bugs, suggestions, gripes, kudos, etc. you may have. Enjoy data! -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] last call for fixes for numpy 1.8.1rc1

Hi Julian, Any chance that NPY_MAXARGS could be increased to something more than the current value of 32? There is a discussion about this in: https://github.com/numpy/numpy/pull/226 but I think that, as Charles was suggesting, just increasing NPY_MAXARGS to something more reasonable (say 256) should be enough for a long while. This issue limits quite a bit the number of operands in numexpr expressions, and hence, to other projects that depends on it, like PyTables or pandas. See for example this bug report: https://github.com/PyTables/PyTables/issues/286 Thanks, Francesc On 2/27/14, 9:05 PM, Julian Taylor wrote: hi, We want to start preparing the release candidate for the bugfix release 1.8.1rc1 this weekend, I'll start preparing the changelog tomorrow. So if you want a certain issue fixed please scream now or better create a pull request/patch on the maintenance/1.8.x branch. Please only consider bugfixes, no enhancements (unless they are really really simple), new features or invasive changes. I just finished my list of issues I want backported to numpy 1.8 (gh-4390, gh-4388). Please check if its already included in these PRs. I'm probably still going to add gh-4284 after some though tomorrow. Cheers, Julian ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] last call for fixes for numpy 1.8.1rc1

Well, what numexpr is using is basically NpyIter_AdvancedNew: https://github.com/pydata/numexpr/blob/master/numexpr/interpreter.cpp#L1178 and nothing else. If NPY_MAXARGS could be increased just for that, and without ABI breaking, then fine. If not, we should have to wait until 1.9 I am afraid. On the other hand, increasing the temporary arrays in nditer from 32kb to 128kb is a bit worrying, but probably we should do some benchmarks and see how much performance would be compromised (if any). Francesc On 2/28/14, 1:09 PM, Julian Taylor wrote: hm increasing it for PyArrayMapIterObject would break the public ABI. While nobody should be using this part of the ABI its not appropriate for a bugfix release. Note that as it currently stands in numpy 1.9.dev we will break this ABI for the indexing improvements. Though for nditer and some other functions we could change it if thats enough. It would bump some temporary arrays of nditer from 32kb to 128kb, I think that would still be fine, but getting to the point where we should move them onto the heap. On 28.02.2014 12:41, Francesc Alted wrote: Hi Julian, Any chance that NPY_MAXARGS could be increased to something more than the current value of 32? There is a discussion about this in: https://github.com/numpy/numpy/pull/226 but I think that, as Charles was suggesting, just increasing NPY_MAXARGS to something more reasonable (say 256) should be enough for a long while. This issue limits quite a bit the number of operands in numexpr expressions, and hence, to other projects that depends on it, like PyTables or pandas. See for example this bug report: https://github.com/PyTables/PyTables/issues/286 Thanks, Francesc On 2/27/14, 9:05 PM, Julian Taylor wrote: hi, We want to start preparing the release candidate for the bugfix release 1.8.1rc1 this weekend, I'll start preparing the changelog tomorrow. So if you want a certain issue fixed please scream now or better create a pull request/patch on the maintenance/1.8.x branch. Please only consider bugfixes, no enhancements (unless they are really really simple), new features or invasive changes. I just finished my list of issues I want backported to numpy 1.8 (gh-4390, gh-4388). Please check if its already included in these PRs. I'm probably still going to add gh-4284 after some though tomorrow. Cheers, Julian ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] last call for fixes for numpy 1.8.1rc1

On 2/28/14, 3:00 PM, Charles R Harris wrote: On Fri, Feb 28, 2014 at 5:52 AM, Julian Taylor jtaylor.deb...@googlemail.com mailto:jtaylor.deb...@googlemail.com wrote: performance should not be impacted as long as we stay on the stack, it just increases offset of a stack pointer a bit more. E.g. nditer and einsum use temporary stack arrays of this type for its initialization: op_axes_arrays[NPY_MAXARGS][NPY_MAXDIMS]; // both 32 currently The resulting nditer structure is then in heap space and dependent on the real amount of arguments it got. So I'm more worried about running out of stack space, though the limit is usually 8mb so taking 128kb for a short while should be ok. On 28.02.2014 13:32, Francesc Alted wrote: Well, what numexpr is using is basically NpyIter_AdvancedNew: https://github.com/pydata/numexpr/blob/master/numexpr/interpreter.cpp#L1178 and nothing else. If NPY_MAXARGS could be increased just for that, and without ABI breaking, then fine. If not, we should have to wait until 1.9 I am afraid. On the other hand, increasing the temporary arrays in nditer from 32kb to 128kb is a bit worrying, but probably we should do some benchmarks and see how much performance would be compromised (if any). Francesc On 2/28/14, 1:09 PM, Julian Taylor wrote: hm increasing it for PyArrayMapIterObject would break the public ABI. While nobody should be using this part of the ABI its not appropriate for a bugfix release. Note that as it currently stands in numpy 1.9.dev we will break this ABI for the indexing improvements. Though for nditer and some other functions we could change it if thats enough. It would bump some temporary arrays of nditer from 32kb to 128kb, I think that would still be fine, but getting to the point where we should move them onto the heap. These sort of changes can have subtle side effects and need lots of testing in a release cycle. Bugfix release cycles are kept short by restricting changes to those that look simple and safe. Agreed. I have just opened a ticket for having this in mind for NumPy 1.9: https://github.com/numpy/numpy/issues/4398 -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: numexpr 2.3.1 released

== Announcing Numexpr 2.3.1 == Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like 3*a+4*b) are accelerated and use less memory than doing the same calculation in Python. It wears multi-threaded capabilities, as well as support for Intel's MKL (Math Kernel Library), which allows an extremely fast evaluation of transcendental functions (sin, cos, tan, exp, log...) while squeezing the last drop of performance out of your multi-core processors. Look here for a some benchmarks of numexpr using MKL: https://github.com/pydata/numexpr/wiki/NumexprMKL Its only dependency is NumPy (MKL is optional), so it works well as an easy-to-deploy, easy-to-use, computational engine for projects that don't want to adopt other solutions requiring more heavy dependencies. What's new == * Added support for shift-left () and shift-right () binary operators. See PR #131. Thanks to fish2000! * Removed the rpath flag for the GCC linker, because it is probably not necessary and it chokes to clang. In case you want to know more in detail what has changed in this version, see: https://github.com/pydata/numexpr/wiki/Release-Notes or have a look at RELEASE_NOTES.txt in the tarball. Where I can find Numexpr? = The project is hosted at GitHub in: https://github.com/pydata/numexpr You can get the packages from PyPI as well (but not for RC releases): http://pypi.python.org/pypi/numexpr Share your experience = Let us know of any bugs, suggestions, gripes, kudos, etc. you may have. Enjoy data! -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] argsort speed

On 2/17/14, 1:08 AM, josef.p...@gmail.com wrote: On Sun, Feb 16, 2014 at 6:12 PM, Daπid davidmen...@gmail.com wrote: On 16 February 2014 23:43, josef.p...@gmail.com wrote: What's the fastest argsort for a 1d array with around 28 Million elements, roughly uniformly distributed, random order? On numpy latest version: for kind in ['quicksort', 'mergesort', 'heapsort']: print kind %timeit np.sort(data, kind=kind) %timeit np.argsort(data, kind=kind) quicksort 1 loops, best of 3: 3.55 s per loop 1 loops, best of 3: 10.3 s per loop mergesort 1 loops, best of 3: 4.84 s per loop 1 loops, best of 3: 9.49 s per loop heapsort 1 loops, best of 3: 12.1 s per loop 1 loops, best of 3: 39.3 s per loop It looks quicksort is quicker sorting, but mergesort is marginally faster sorting args. The diference is slim, but upon repetition, it remains significant. Why is that? Probably part of the reason is what Eelco said, and part is that in sort comparison are done accessing the array elements directly, but in argsort you have to index the array, introducing some overhead. Thanks, both. I also gain a second with mergesort. matlab would be nicer in my case, it returns both. I still need to use the argsort to index into the array to also get the sorted array. Many years ago I needed something similar, so I made some functions for sorting and argsorting in one single shot. Maybe you want to reuse them. Here it is an example of the C implementation: https://github.com/PyTables/PyTables/blob/develop/src/idx-opt.c#L619 and here the Cython wrapper for all of them: https://github.com/PyTables/PyTables/blob/develop/tables/indexesextension.pyx#L129 Francesc Josef I seem unable to find the code for ndarray.sort, so I can't check. I have tried to grep it tring all possible combinations of def ndarray, self.sort, etc. Where is it? /David. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: numexpr 2.3 (final) released

== Announcing Numexpr 2.3 == Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like 3*a+4*b) are accelerated and use less memory than doing the same calculation in Python. It wears multi-threaded capabilities, as well as support for Intel's MKL (Math Kernel Library), which allows an extremely fast evaluation of transcendental functions (sin, cos, tan, exp, log...) while squeezing the last drop of performance out of your multi-core processors. Look here for a some benchmarks of numexpr using MKL: https://github.com/pydata/numexpr/wiki/NumexprMKL Its only dependency is NumPy (MKL is optional), so it works well as an easy-to-deploy, easy-to-use, computational engine for projects that don't want to adopt other solutions requiring more heavy dependencies. Numexpr is already being used in a series of packages (PyTables, pandas, BLZ...) for helping doing computations faster. What's new == The repository has been migrated to https://github.com/pydata/numexpr. All new tickets and PR should be directed there. Also, a `conj()` function for computing the conjugate of complex arrays has been added. Thanks to David Menéndez. See PR #125. Finallly, we fixed a DeprecationWarning derived of using ``oa_ndim == 0`` and ``op_axes == NULL`` with `NpyIter_AdvancedNew()` and NumPy 1.8. Thanks to Mark Wiebe for advise on how to fix this properly. Many thanks to Christoph Gohlke and Ilan Schnell for his help during the testing of this release in all kinds of possible combinations of platforms and MKL. In case you want to know more in detail what has changed in this version, see: https://github.com/pydata/numexpr/wiki/Release-Notes or have a look at RELEASE_NOTES.txt in the tarball. Where I can find Numexpr? = The project is hosted at GitHub in: https://github.com/pydata/numexpr You can get the packages from PyPI as well (but not for RC releases): http://pypi.python.org/pypi/numexpr Share your experience = Let us know of any bugs, suggestions, gripes, kudos, etc. you may have. Enjoy data! -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: python-blosc 1.2.0 released

services at: http://github.com/ContinuumIO/python-blosc Documentation = There is Sphinx-based documentation site at: http://blosc.pydata.org/ Mailing list There is an official mailing list for Blosc at: bl...@googlegroups.com http://groups.google.es/group/blosc Licenses Both Blosc and its Python wrapper are distributed using the MIT license. See: https://github.com/ContinuumIO/python-blosc/blob/master/LICENSES for more details. -- Francesc Alted Continuum Analytics, Inc. -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: BLZ 0.6.1 has been released

Announcing BLZ 0.6 series = What it is -- BLZ is a chunked container for numerical data. Chunking allows for efficient enlarging/shrinking of data container. In addition, it can also be compressed for reducing memory/disk needs. The compression process is carried out internally by Blosc, a high-performance compressor that is optimized for binary data. The main objects in BLZ are `barray` and `btable`. `barray` is meant for storing multidimensional homogeneous datasets efficiently. `barray` objects provide the foundations for building `btable` objects, where each column is made of a single `barray`. Facilities are provided for iterating, filtering and querying `btables` in an efficient way. You can find more info about `barray` and `btable` in the tutorial: http://blz.pydata.org/blz-manual/tutorial.html BLZ can use numexpr internally so as to accelerate many vector and query operations (although it can use pure NumPy for doing so too) either from memory or from disk. In the future, it is planned to use Numba as the computational kernel and to provide better Blaze (http://blaze.pydata.org) integration. What's new -- BLZ has been branched off from the Blaze project (http://blaze.pydata.org). BLZ was meant as a persistent format and library for I/O in Blaze. BLZ in Blaze is based on previous carray 0.5 and this is why this new version is labeled 0.6. BLZ supports completely transparent storage on-disk in addition to memory. That means that *everything* that can be done with the in-memory container can be done using the disk as well. The advantages of a disk-based container is that the addressable space is much larger than just your available memory. Also, as BLZ is based on a chunked and compressed data layout based on the super-fast Blosc compression library, the data access speed is very good. The format chosen for the persistence layer is based on the 'bloscpack' library and described in the Persistent format for BLZ chapter of the user manual ('docs/source/persistence-format.rst'). More about Bloscpack here: https://github.com/esc/bloscpack You may want to know more about BLZ in this blog entry: http://continuum.io/blog/blz-format In this version, support for Blosc 1.3 has been added, that meaning that a new `cname` parameter has been added to the `bparams` class, so that you can select you preferred compressor from 'blosclz', 'lz4', 'lz4hc', 'snappy' and 'zlib'. Also, many bugs have been fixed, providing a much smoother experience. CAVEAT: The BLZ/bloscpack format is still evolving, so don't trust on forward compatibility of the format, at least until 1.0, where the internal format will be declared frozen. Resources - Visit the main BLZ site repository at: http://github.com/ContinuumIO/blz Read the online docs at: http://blz.pydata.org/blz-manual/index.html Home of Blosc compressor: http://www.blosc.org User's mail list: blaze-...@continuum.io Enjoy! Francesc Alted Continuum Analytics, Inc. ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] Catching out-of-memory error before it happens

Yeah, numexpr is pretty cool for avoiding temporaries in an easy way: https://github.com/pydata/numexpr Francesc El 24/01/14 16:30, Nathaniel Smith ha escrit: There is no reliable way to predict how much memory an arbitrary numpy operation will need, no. However, in most cases the main memory cost will be simply the need to store the input and output arrays; for large arrays, all other allocations should be negligible. The most effective way to avoid running out of memory, therefore, is to avoid creating temporary arrays, by using only in-place operations. E.g., if a and b each require N bytes of ram, then memory requirements (roughly). c = a + b: 3N c = a + 2*b: 4N a += b: 2N np.add(a, b, out=a): 2N b *= 2; a += b: 2N Note that simply loading a and b requires 2N memory, so the latter code samples are near-optimal. Of course some calculations do require the use of temporary storage space... -n On 24 Jan 2014 15:19, Dinesh Vadhia dineshbvad...@hotmail.com mailto:dineshbvad...@hotmail.com wrote: I want to write a general exception handler to warn if too much data is being loaded for the ram size in a machine for a successful numpy array operation to take place. For example, the program multiplies two floating point arrays A and B which are populated with loadtext. While the data is being loaded, want to continuously check that the data volume doesn't pass a threshold that will cause on out-of-memory error during the A*B operation. The known variables are the amount of memory available, data type (floats in this case) and the numpy array operation to be performed. It seems this requires knowledge of the internal memory requirements of each numpy operation. For sake of simplicity, can ignore other memory needs of program. Is this possible? ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org mailto:NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] -ffast-math

On 12/2/13, 12:14 AM, Dan Goodman wrote: Dan Goodman dg.gmane at thesamovar.net writes: ... I got around 5x slower. Using numexpr 'dumbly' (i.e. just putting the expression in directly) was slower than the function above, but doing a hybrid between the two approaches worked well: def timefunc_numexpr_smart(): _sin_term = sin(2.0*freq*pi*t) _exp_term = exp(-dt/tau) _a_term = (_sin_term-_sin_term*_exp_term) _const_term = -b*_exp_term + b v[:] = numexpr.evaluate('a*_a_term+v*_exp_term+_const_term') #numexpr.evaluate('a*_a_term+v*_exp_term+_const_term', out=v) This was about 3.5x slower than weave. If I used the commented out final line then it was only 1.5x slower than weave, but it also gives wrong results. I reported this as a bug in numexpr a long time ago but I guess it hasn't been fixed yet (or maybe I didn't upgrade my version recently). I just upgraded numexpr to 2.2 where they did fix this bug, and now the 'smart' numexpr version runs exactly as fast as weave (so I guess there were some performance enhancements in numexpr as well). Err no, there have not been performance improvements in numexpr since 2.0 (that I am aware of). Maybe you are running in a multi-core machine now and you are seeing better speedup because of this? Also, your expressions are made of transcendental functions, so linking numexpr with MKL could accelerate computations a good deal too. -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] [ANN] numexpr 2.2 released

== Announcing Numexpr 2.2 == Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like 3*a+4*b) are accelerated and use less memory than doing the same calculation in Python. It wears multi-threaded capabilities, as well as support for Intel's VML library (included in Intel MKL), which allows an extremely fast evaluation of transcendental functions (sin, cos, tan, exp, log...) while squeezing the last drop of performance out of your multi-core processors. Its only dependency is NumPy (MKL is optional), so it works well as an easy-to-deploy, easy-to-use, computational kernel for projects that don't want to adopt other solutions that require more heavy dependencies. What's new == This release is mainly meant to fix a problem with the license the numexpr/win32/pthread.{c,h} files emulating pthreads on Windows. After persmission from the original authors is granted, these files adopt the MIT license and can be redistributed without problems. See issue #109 for details (https://code.google.com/p/numexpr/issues/detail?id=110). Another important improvement is the new algorithm to decide the initial number of threads to be used. This was necessary because by default, numexpr was using a number of threads equal to the detected number of cores, and this can be just too much for moder systems where this number can be too high (and counterporductive for performance in many cases). Now, the 'NUMEXPR_NUM_THREADS' environment variable is honored, and in case this is not present, a maximum number of *8* threads are setup initially. The new algorithm is fully described in the Users Guide, in the note of 'General routines' section: https://code.google.com/p/numexpr/wiki/UsersGuide#General_routines. Closes #110. In case you want to know more in detail what has changed in this version, see: http://code.google.com/p/numexpr/wiki/ReleaseNotes or have a look at RELEASE_NOTES.txt in the tarball. Where I can find Numexpr? = The project is hosted at Google code in: http://code.google.com/p/numexpr/ You can get the packages from PyPI as well: http://pypi.python.org/pypi/numexpr Share your experience = Let us know of any bugs, suggestions, gripes, kudos, etc. you may have. Enjoy data! -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] RAM problem during code execution - Numpya arrays

(std_dev_size_medio_nuevo)/numero_experimentos) tiempos=np.append(tiempos, time.clock()-empieza) componente_y=np.append(componente_y, sum(comp_y)/numero_experimentos) componente_x=np.append(componente_x, sum(comp_x)/numero_experimentos) anisotropia_macroscopica_porcentual=100*(1-(componente_y/componente_x)) I tryed with gc and gc.collect() and 'del'command for deleting arrays after his use and nothing work! What am I doing wrong? Why the memory becomes full while running (starts with 10% of RAM used and in 1-2hour is totally full used)? Please help me, I'm totally stuck! Thanks a lot! ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: python-blosc 1.1 (final) released

=== Announcing python-blosc 1.1 === What is it? === python-blosc (http://blosc.pydata.org/) is a Python wrapper for the Blosc compression library. Blosc (http://blosc.org) is a high performance compressor optimized for binary data. It has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() OS call. Whether this is achieved or not depends of the data compressibility, the number of cores in the system, and other factors. See a series of benchmarks conducted for many different systems: http://blosc.org/trac/wiki/SyntheticBenchmarks. Blosc works well for compressing numerical arrays that contains data with relatively low entropy, like sparse data, time series, grids with regular-spaced values, etc. There is also a handy command line for Blosc called Bloscpack (https://github.com/esc/bloscpack) that allows you to compress large binary datafiles on-disk. Although the format for Bloscpack has not stabilized yet, it allows you to effectively use Blosc from your favorite shell. What is new? - Added new `compress_ptr` and `decompress_ptr` functions that allows to compress and decompress from/to a data pointer, avoiding an itermediate copy for maximum speed. Be careful, as these are low level calls, and user must make sure that the pointer data area is safe. - Since Blosc (the C library) already supports to be installed as an standalone library (via cmake), it is also possible to link python-blosc against a system Blosc library. - The Python calls to Blosc are now thread-safe (another consequence of recent Blosc library supporting this at C level). - Many checks on types and ranges of values have been added. Most of the calls will now complain when passed the wrong values. - Docstrings are much improved. Also, Sphinx-based docs are available now. Many thanks to Valentin Hänel for his impressive work for this release. For more info, you can see the release notes in: https://github.com/FrancescAlted/python-blosc/wiki/Release-notes More docs and examples are available in the documentation site: http://blosc.pydata.org Installing == python-blosc is in PyPI repository, so installing it is easy: $ pip install -U blosc # yes, you should omit the python- prefix Download sources The sources are managed through github services at: http://github.com/FrancescAlted/python-blosc Documentation = There is Sphinx-based documentation site at: http://blosc.pydata.org/ Mailing list There is an official mailing list for Blosc at: bl...@googlegroups.com http://groups.google.es/group/blosc Licenses Both Blosc and its Python wrapper are distributed using the MIT license. See: https://github.com/FrancescAlted/python-blosc/blob/master/LICENSES for more details. Enjoy! -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: python-blosc 1.1 RC1 available for testing

Announcing python-blosc 1.1 RC1 What is it? === python-blosc (http://blosc.pydata.org) is a Python wrapper for the Blosc compression library. Blosc (http://blosc.org) is a high performance compressor optimized for binary data. It has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() OS call. Whether this is achieved or not depends of the data compressibility, the number of cores in the system, and other factors. See a series of benchmarks conducted for many different systems: http://blosc.org/trac/wiki/SyntheticBenchmarks. Blosc works well for compressing numerical arrays that contains data with relatively low entropy, like sparse data, time series, grids with regular-spaced values, etc. There is also a handy command line for Blosc called Bloscpack (https://github.com/esc/bloscpack) that allows you to compress large binary datafiles on-disk. Although the format for Bloscpack has not stabilized yet, it allows you to effectively use Blosc from your favorite shell. What is new? - Added new `compress_ptr` and `decompress_ptr` functions that allows to compress and decompress from/to a data pointer. These are low level calls and user must make sure that the pointer data area is safe. - Since Blosc (the C library) already supports to be installed as an standalone library (via cmake), it is also possible to link python-blosc against a system Blosc library. - The Python calls to Blosc are now thread-safe (another consequence of recent Blosc library supporting this at C level). - Many checks on types and ranges of values have been added. Most of the calls will now complain when passed the wrong values. - Docstrings are much improved. Also, Sphinx-based docs are available now. Many thanks to Valentin Hänel for his impressive work for this release. For more info, you can see the release notes in: https://github.com/FrancescAlted/python-blosc/wiki/Release-notes More docs and examples are available in the documentation site: http://blosc.pydata.org Installing == python-blosc is in PyPI repository, so installing it is easy: $ pip install -U blosc # yes, you should omit the blosc- prefix Download sources The sources are managed through github services at: http://github.com/FrancescAlted/python-blosc Documentation = There is Sphinx-based documentation site at: http://blosc.pydata.org/ Mailing list There is an official mailing list for Blosc at: bl...@googlegroups.com http://groups.google.es/group/blosc Licenses Both Blosc and its Python wrapper are distributed using the MIT license. See: https://github.com/FrancescAlted/python-blosc/blob/master/LICENSES for more details. -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] Profiling (was GSoC : Performance parity between numpy arrays and Python scalars)

On 5/2/13 3:58 PM, Nathaniel Smith wrote: callgrind has the *fabulous* kcachegrind front-end, but it only measures memory access performance on a simulated machine, which is very useful sometimes (if you're trying to optimize cache locality), but there's no guarantee that the bottlenecks on its simulated machine are the same as the bottlenecks on your real machine. Agreed, there is no guarantee, but my experience is that kcachegrind normally gives you a pretty decent view of cache faults and hence it can do pretty good predictions on how this affects your computations. I have used this feature extensively for optimizing parts of the Blosc compressor, and I cannot be more happier (to the point that, if it were not for Valgrind, I could not figure out many interesting memory access optimizations). -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### [Numpy-discussion] ANN: numexpr 2.1 RC1

Announcing Numexpr 2.1RC1 Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like 3*a+4*b) are accelerated and use less memory than doing the same calculation in Python. It wears multi-threaded capabilities, as well as support for Intel's VML library, which allows for squeezing the last drop of performance out of your multi-core processors. What's new == This version adds compatibility for Python 3. A bunch of thanks to Antonio Valentino for his excellent work on this.I apologize for taking so long in releasing his contributions. In case you want to know more in detail what has changed in this version, see: http://code.google.com/p/numexpr/wiki/ReleaseNotes or have a look at RELEASE_NOTES.txt in the tarball. Where I can find Numexpr? = The project is hosted at Google code in: http://code.google.com/p/numexpr/ This is a release candidate 1, so it will not be available on the PyPi repository. I'll post it there when the final version will be released. Share your experience = Let us know of any bugs, suggestions, gripes, kudos, etc. you may have. Enjoy! -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] timezones and datetime64

On 4/4/13 1:52 AM, Chris Barker - NOAA Federal wrote: Thanks all for taking an interest. I need to think a bot more about the options before commenting more, but: while we're at it: It seems very odd to me that datetime64 supports different units (right down to attosecond) but not different epochs. How can it possible be useful to use nanoseconds, etc, but only right around 1970? For that matter, why all the units at all? I can see the need for nanosecond resolution, but not without changing the epoch -- so if the epoch is fixed, why bother with different units? snip When Ivan and me were discussing that, I remember us deciding that such a small units would be useful mainly for the timedelta datatype, which is a relative, not absolute time. We did not want to make short for very precise time measurements, and this is why we decided to go with attoseconds. -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] timezones and datetime64

On 4/4/13 1:54 PM, Nathaniel Smith wrote: On Thu, Apr 4, 2013 at 12:52 AM, Chris Barker - NOAA Federal chris.bar...@noaa.gov wrote: Thanks all for taking an interest. I need to think a bot more about the options before commenting more, but: while we're at it: It seems very odd to me that datetime64 supports different units (right down to attosecond) but not different epochs. How can it possible be useful to use nanoseconds, etc, but only right around 1970? For that matter, why all the units at all? I can see the need for nanosecond resolution, but not without changing the epoch -- so if the epoch is fixed, why bother with different units? Using days (for instance) rather than seconds doesn't save memory, as we're always using 64 bits. It can't be common to need more than 2.9e12 years (OK, that's not quite as old as the universe, so some cosmologists may need it...) Another reason why it might be interesting to support different epochs is that many timeseries (e.g., the ones I work with) aren't linked to absolute time, but are instead milliseconds since we turned on the recording equipment. You can reasonably represent these as timedeltas of course, but it'd be even more elegant to be able to be able to represent them as absolute times against an opaque epoch. In particular, when you have multiple recording tracks, only those which were recorded against the same epoch are actually commensurable -- trying to do recording1_times[10] - recording2_times[10] is meaningless and should be an error. I remember to be discussing this with some level of depth 5 years ago in this list, as we asked people about the convenience of including an user-defined 'epoch'. We were calling it 'origin'. But apparently it was decided that this was not needed because timestamps+timedelta would be enough. The NEP still reflects this discussion: https://github.com/numpy/numpy/blob/master/doc/neps/datetime-proposal.rst#why-the-origin-metadata-disappeared This is just an historical note, not that we can't change that again. -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] timezones and datetime64

On 4/4/13 8:56 PM, Chris Barker - NOAA Federal wrote: On Thu, Apr 4, 2013 at 10:54 AM, Francesc Alted franc...@continuum.io wrote: That makes a difference. This can be specially important for creating user-defined time origins: In []: np.array(int(1.5e9), dtype='datetime64[s]') + np.array(1, dtype='timedelta64[ns]') Out[]: numpy.datetime64('2017-07-14T04:40:00.1+0200') but that's worthless if you try it higher-resolution: In [40]: np.array(int(1.5e9), dtype='datetime64[s]') Out[40]: array(datetime.datetime(2017, 7, 14, 2, 40), dtype='datetime64[s]') # Start at 2017 # add a picosecond: In [41]: np.array(int(1.5e9), dtype='datetime64[s]') + np.array(1, dtype='timedelta64[ps]') Out[41]: numpy.datetime64('1970-03-08T22:55:30.029526319105-0800') # get 1970??? This is clearly a bug. Could you file a ticket please? Also, using attoseconds is giving a weird behavior: In []: np.array(int(1.5e9), dtype='datetime64[s]') + np.array(1, dtype='timedelta64[as]') --- OverflowError Traceback (most recent call last) ipython-input-42-acd66c465bef in module() 1 np.array(int(1.5e9), dtype='datetime64[s]') + np.array(1, dtype='timedelta64[as]') OverflowError: Integer overflow getting a common metadata divisor for NumPy datetime metadata [s] and [as] I would expect the attosecond to be happily ignored and nothing would be added. And even with nanoseconds, given the leap-second issues, etc, you really wouldn't want to do this anyway -- rather, keep your epoch close by. Now that I think about it -- being able to set your epoch could lessen the impact of leap-seconds for second-resolution as well. Probably this is the way to go, yes. -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] fast numpy.fromfile skipping data chunks

On 3/13/13 2:45 PM, Andrea Cimatoribus wrote: Hi everybody, I hope this has not been discussed before, I couldn't find a solution elsewhere. I need to read some binary data, and I am using numpy.fromfile to do this. Since the files are huge, and would make me run out of memory, I need to read data skipping some records (I am reading data recorded at high frequency, so basically I want to read subsampling). [clip] You can do a fid.seek(offset) prior to np.fromfile() and the it will read from offset. See the docstrings for `file.seek()` on how to use it. -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] fast numpy.fromfile skipping data chunks

On 3/13/13 3:53 PM, Francesc Alted wrote: On 3/13/13 2:45 PM, Andrea Cimatoribus wrote: Hi everybody, I hope this has not been discussed before, I couldn't find a solution elsewhere. I need to read some binary data, and I am using numpy.fromfile to do this. Since the files are huge, and would make me run out of memory, I need to read data skipping some records (I am reading data recorded at high frequency, so basically I want to read subsampling). [clip] You can do a fid.seek(offset) prior to np.fromfile() and the it will read from offset. See the docstrings for `file.seek()` on how to use it. Ups, you were already using file.seek(). Disregard, please. -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] aligned / unaligned structured dtype behavior (was: GSOC 2013)

On 3/6/13 7:42 PM, Kurt Smith wrote: And regarding performance, doing simple timings shows a 30%-ish slowdown for unaligned operations: In [36]: %timeit packed_arr['b']**2 100 loops, best of 3: 2.48 ms per loop In [37]: %timeit aligned_arr['b']**2 1000 loops, best of 3: 1.9 ms per loop Hmm, that clearly depends on the architecture. On my machine: In [1]: import numpy as np In [2]: aligned_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=True) In [3]: packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False) In [4]: aligned_arr = np.ones((10**6,), dtype=aligned_dt) In [5]: packed_arr = np.ones((10**6,), dtype=packed_dt) In [6]: baligned = aligned_arr['b'] In [7]: bpacked = packed_arr['b'] In [8]: %timeit baligned**2 1000 loops, best of 3: 1.96 ms per loop In [9]: %timeit bpacked**2 100 loops, best of 3: 7.84 ms per loop That is, the unaligned column is 4x slower (!). numexpr allows somewhat better results: In [11]: %timeit numexpr.evaluate('baligned**2') 1000 loops, best of 3: 1.13 ms per loop In [12]: %timeit numexpr.evaluate('bpacked**2') 1000 loops, best of 3: 865 us per loop Yes, in this case, the unaligned array goes faster (as much as 30%). I think the reason is that numexpr optimizes the unaligned access by doing a copy of the different chunks in internal buffers that fits in L1 cache. Apparently this is very beneficial in this case (not sure why, though). Whereas summing shows just a 10%-ish slowdown: In [38]: %timeit packed_arr['b'].sum() 1000 loops, best of 3: 1.29 ms per loop In [39]: %timeit aligned_arr['b'].sum() 1000 loops, best of 3: 1.14 ms per loop On my machine: In [14]: %timeit baligned.sum() 1000 loops, best of 3: 1.03 ms per loop In [15]: %timeit bpacked.sum() 100 loops, best of 3: 3.79 ms per loop Again, the 4x slowdown is here. Using numexpr: In [16]: %timeit numexpr.evaluate('sum(baligned)') 100 loops, best of 3: 2.16 ms per loop In [17]: %timeit numexpr.evaluate('sum(bpacked)') 100 loops, best of 3: 2.08 ms per loop Again, the unaligned case is (sligthly better). In this case numexpr is a bit slower that NumPy because sum() is not parallelized internally. Hmm, provided that, I'm wondering if some internal copies to L1 in NumPy could help improving unaligned performance. Worth a try? -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] aligned / unaligned structured dtype behavior

On 3/7/13 6:47 PM, Francesc Alted wrote: On 3/6/13 7:42 PM, Kurt Smith wrote: And regarding performance, doing simple timings shows a 30%-ish slowdown for unaligned operations: In [36]: %timeit packed_arr['b']**2 100 loops, best of 3: 2.48 ms per loop In [37]: %timeit aligned_arr['b']**2 1000 loops, best of 3: 1.9 ms per loop Hmm, that clearly depends on the architecture. On my machine: In [1]: import numpy as np In [2]: aligned_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=True) In [3]: packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False) In [4]: aligned_arr = np.ones((10**6,), dtype=aligned_dt) In [5]: packed_arr = np.ones((10**6,), dtype=packed_dt) In [6]: baligned = aligned_arr['b'] In [7]: bpacked = packed_arr['b'] In [8]: %timeit baligned**2 1000 loops, best of 3: 1.96 ms per loop In [9]: %timeit bpacked**2 100 loops, best of 3: 7.84 ms per loop That is, the unaligned column is 4x slower (!). numexpr allows somewhat better results: In [11]: %timeit numexpr.evaluate('baligned**2') 1000 loops, best of 3: 1.13 ms per loop In [12]: %timeit numexpr.evaluate('bpacked**2') 1000 loops, best of 3: 865 us per loop Just for completeness, here it is what Theano gets: In [18]: import theano In [20]: a = theano.tensor.vector() In [22]: f = theano.function([a], a**2) In [23]: %timeit f(baligned) 100 loops, best of 3: 7.74 ms per loop In [24]: %timeit f(bpacked) 100 loops, best of 3: 12.6 ms per loop So yeah, Theano is also slower for the unaligned case (but less than 2x in this case). Yes, in this case, the unaligned array goes faster (as much as 30%). I think the reason is that numexpr optimizes the unaligned access by doing a copy of the different chunks in internal buffers that fits in L1 cache. Apparently this is very beneficial in this case (not sure why, though). Whereas summing shows just a 10%-ish slowdown: In [38]: %timeit packed_arr['b'].sum() 1000 loops, best of 3: 1.29 ms per loop In [39]: %timeit aligned_arr['b'].sum() 1000 loops, best of 3: 1.14 ms per loop On my machine: In [14]: %timeit baligned.sum() 1000 loops, best of 3: 1.03 ms per loop In [15]: %timeit bpacked.sum() 100 loops, best of 3: 3.79 ms per loop Again, the 4x slowdown is here. Using numexpr: In [16]: %timeit numexpr.evaluate('sum(baligned)') 100 loops, best of 3: 2.16 ms per loop In [17]: %timeit numexpr.evaluate('sum(bpacked)') 100 loops, best of 3: 2.08 ms per loop And with Theano: In [26]: f2 = theano.function([a], a.sum()) In [27]: %timeit f2(baligned) 100 loops, best of 3: 2.52 ms per loop In [28]: %timeit f2(bpacked) 100 loops, best of 3: 7.43 ms per loop Again, the unaligned case is significantly slower (as much as 3x here!). -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] GSOC 2013

On 3/5/13 7:14 PM, Kurt Smith wrote: On Tue, Mar 5, 2013 at 1:45 AM, Eric Firing efir...@hawaii.edu wrote: On 2013/03/04 9:01 PM, Nicolas Rougier wrote: This made me think of a serious performance limitation of structured dtypes: a structured dtype is always packed, which may lead to terrible byte alignment for common types. For instance, `dtype([('a', 'u1'), ('b', 'u8')]).itemsize == 9`, meaning that the 8-byte integer is not aligned as an equivalent C-struct's would be, leading to all sorts of horrors at the cache and register level. Doesn't the align kwarg of np.dtype do what you want? In [2]: dt = np.dtype(dict(names=['a', 'b'], formats=['u1', 'u8']), align=True) In [3]: dt.itemsize Out[3]: 16 Thanks! That's what I get for not checking before posting. Consider this my vote to make `aligned=True` the default. I would not run too much. The example above takes 9 bytes to host the structure, while a `aligned=True` will take 16 bytes. I'd rather let the default as it is, and in case performance is critical, you can always copy the unaligned field to a new (homogeneous) array. -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] pip install numpy throwing a lot of output.

On 2/12/13 1:37 PM, Daπid wrote: I have just upgraded numpy with pip on Linux 64 bits with Python 2.7, and I got *a lot* of output, so much it doesn't fit in the terminal. Most of it are gcc commands, but there are many different errors thrown by the compiler. Is this expected? Yes, I think that's expected. Just to make sure, can you send some excerpts of the errors that you are getting? I am not too worried as the test suite passes, but pip is supposed to give only meaningful output (or at least, this is what the creators intended). Well, pip needs to compile the libraries prior to install them, so compile messages are meaningful. Another question would be to reduce the amount of compile messages by default in NumPy, but I don't think this is realistic (and even not desirable). -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] pip install numpy throwing a lot of output.

On 2/12/13 3:18 PM, Daπid wrote: On 12 February 2013 14:58, Francesc Alted franc...@continuum.io wrote: Yes, I think that's expected. Just to make sure, can you send some excerpts of the errors that you are getting? Actually the errors are at the beginning of the process, so they are out of the reach of my terminal right now. Seems like pip doesn't keep a log in case of success. Well, I think these errors are part of the auto-discovering process of the functions supported by the libraries in the hosting OS (kind of `autoconf`for Python), so they can be considered 'normal'. The ones I can see are mostly warnings of unused variables and functions, maybe this is the expected behaviour for a library? This errors come from a complete reinstall instead of the original upgrade (the cat closed the terminal, worst excuse ever!): [clip] These ones are not errors, but warnings. While it should be desirable to avoid any warning during the compilation process, not many libraries fulfill this (but patches for removing them are accepted). -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] Byte aligned arrays

On 12/20/12 7:35 PM, Henry Gomersall wrote: On Thu, 2012-12-20 at 15:23 +0100, Francesc Alted wrote: On 12/20/12 9:53 AM, Henry Gomersall wrote: On Wed, 2012-12-19 at 19:03 +0100, Francesc Alted wrote: The only scenario that I see that this would create unaligned arrays is for machines having AVX. But provided that the Intel architecture is making great strides in fetching unaligned data, I'd be surprised that the difference in performance would be even noticeable. Can you tell us which difference in performance are you seeing for an AVX-aligned array and other that is not AVX-aligned? Just curious. Further to this point, from an Intel article... http://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors Aligning data to vector length is always recommended. When using Intel SSE and Intel SSE2 instructions, loaded data should be aligned to 16 bytes. Similarly, to achieve best results use Intel AVX instructions on 32-byte vectors that are 32-byte aligned. The use of Intel AVX instructions on unaligned 32-byte vectors means that every second load will be across a cache-line split, since the cache line is 64 bytes. This doubles the cache line split rate compared to Intel SSE code that uses 16-byte vectors. A high cache-line split rate in memory-intensive code is extremely likely to cause performance degradation. For that reason, it is highly recommended to align the data to 32 bytes for use with Intel AVX. Though it would be nice to put together a little example of this! Indeed, an example is what I was looking for. So provided that I have access to an AVX capable machine (having 6 physical cores), and that MKL 10.3 has support for AVX, I have made some comparisons using the Anaconda Python distribution (it ships with most packages linked against MKL 10.3). snip All in all, it is not clear that AVX alignment would have an advantage, even for memory-bounded problems. But of course, if Intel people are saying that AVX alignment is important is because they have use cases for asserting this. It is just that I'm having a difficult time to find these cases. Thanks for those examples, they were very interesting. I managed to temporarily get my hands on a machine with AVX and I have shown some speed-up with aligned arrays. FFT (using my wrappers) gives about a 15% speedup. Also this convolution code: https://github.com/hgomersall/SSE-convolution/blob/master/convolve.c Shows a small but repeatable speed-up (a few %) when using some aligned loads (as many as I can work out to use!). Okay, so a 15% is significant, yes. I'm still wondering why I did not get any speedup at all using MKL, but probably the reason is that it manages the unaligned corners of the datasets first, and then uses an aligned access for the rest of the data (but just guessing here). -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] Byte aligned arrays

On 12/21/12 11:58 AM, Henry Gomersall wrote: On Fri, 2012-12-21 at 11:34 +0100, Francesc Alted wrote: Also this convolution code: https://github.com/hgomersall/SSE-convolution/blob/master/convolve.c Shows a small but repeatable speed-up (a few %) when using some aligned loads (as many as I can work out to use!). Okay, so a 15% is significant, yes. I'm still wondering why I did not get any speedup at all using MKL, but probably the reason is that it manages the unaligned corners of the datasets first, and then uses an aligned access for the rest of the data (but just guessing here). With SSE in that convolution code example above (in which all alignments need be considered for each output element), I note a significant speedup by creating 4 copies of the float input array using memcopy, each shifted by 1 float (so the 5th element is aligned again). Despite all the extra copies its still quicker than using an unaligned load. However, when one tries the same trick with 8 copies for AVX it's actually slower than the SSE case. The fastest AVX (and any) implementation I have so far is with 16-aligned arrays (made with 4 copies as with SSE), with alternate aligned and unaligned loads (which is always at worst 16-byte aligned). Fascinating stuff! Yes, to say the least. And it supports the fact that, when fine tuning memory access performance, there is no replacement for experimentation (in some weird ways many times :) -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] Byte aligned arrays

On 12/21/12 1:35 PM, Dag Sverre Seljebotn wrote: On 12/20/2012 03:23 PM, Francesc Alted wrote: On 12/20/12 9:53 AM, Henry Gomersall wrote: On Wed, 2012-12-19 at 19:03 +0100, Francesc Alted wrote: The only scenario that I see that this would create unaligned arrays is for machines having AVX. But provided that the Intel architecture is making great strides in fetching unaligned data, I'd be surprised that the difference in performance would be even noticeable. Can you tell us which difference in performance are you seeing for an AVX-aligned array and other that is not AVX-aligned? Just curious. Further to this point, from an Intel article... http://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors Aligning data to vector length is always recommended. When using Intel SSE and Intel SSE2 instructions, loaded data should be aligned to 16 bytes. Similarly, to achieve best results use Intel AVX instructions on 32-byte vectors that are 32-byte aligned. The use of Intel AVX instructions on unaligned 32-byte vectors means that every second load will be across a cache-line split, since the cache line is 64 bytes. This doubles the cache line split rate compared to Intel SSE code that uses 16-byte vectors. A high cache-line split rate in memory-intensive code is extremely likely to cause performance degradation. For that reason, it is highly recommended to align the data to 32 bytes for use with Intel AVX. Though it would be nice to put together a little example of this! Indeed, an example is what I was looking for. So provided that I have access to an AVX capable machine (having 6 physical cores), and that MKL 10.3 has support for AVX, I have made some comparisons using the Anaconda Python distribution (it ships with most packages linked against MKL 10.3). Here it is a first example using a DGEMM operation. First using a NumPy that is not turbo-loaded with MKL: In [34]: a = np.linspace(0,1,1e7) In [35]: b = a.reshape(1000, 1) In [36]: c = a.reshape(1, 1000) In [37]: time d = np.dot(b,c) CPU times: user 7.56 s, sys: 0.03 s, total: 7.59 s Wall time: 7.63 s In [38]: time d = np.dot(c,b) CPU times: user 78.52 s, sys: 0.18 s, total: 78.70 s Wall time: 78.89 s This is getting around 2.6 GFlop/s. Now, with a MKL 10.3 NumPy and AVX-unaligned data: In [7]: p = ctypes.create_string_buffer(int(8e7)); hex(ctypes.addressof(p)) Out[7]: '0x7fcdef3b4010' # 16 bytes alignment In [8]: a = np.ndarray(1e7, f8, p) In [9]: a[:] = np.linspace(0,1,1e7) In [10]: b = a.reshape(1000, 1) In [11]: c = a.reshape(1, 1000) In [37]: %timeit d = np.dot(b,c) 10 loops, best of 3: 164 ms per loop In [38]: %timeit d = np.dot(c,b) 1 loops, best of 3: 1.65 s per loop That is around 120 GFlop/s (i.e. almost 50x faster than without MKL/AVX). Now, using MKL 10.3 and AVX-aligned data: In [21]: p2 = ctypes.create_string_buffer(int(8e7+16)); hex(ctypes.addressof(p)) Out[21]: '0x7f8cb9598010' In [22]: a2 = np.ndarray(1e7+2, f8, p2)[2:] # skip the first 16 bytes (now is 32-bytes aligned) In [23]: a2[:] = np.linspace(0,1,1e7) In [24]: b2 = a2.reshape(1000, 1) In [25]: c2 = a2.reshape(1, 1000) In [35]: %timeit d2 = np.dot(b2,c2) 10 loops, best of 3: 163 ms per loop In [36]: %timeit d2 = np.dot(c2,b2) 1 loops, best of 3: 1.67 s per loop So, again, around 120 GFlop/s, and the difference wrt to unaligned AVX data is negligible. One may argue that DGEMM is CPU-bounded and that memory access plays little role here, and that is certainly true. So, let's go with a more memory-bounded problem, like computing a transcendental function with numexpr. First with a with NumPy and numexpr with no MKL support: In [8]: a = np.linspace(0,1,1e8) In [9]: %time b = np.sin(a) CPU times: user 1.20 s, sys: 0.22 s, total: 1.42 s Wall time: 1.42 s In [10]: import numexpr as ne In [12]: %time b = ne.evaluate(sin(a)) CPU times: user 1.42 s, sys: 0.27 s, total: 1.69 s Wall time: 0.37 s This time is around 4x faster than regular 'sin' in libc, and about the same speed than a memcpy(): In [13]: %time c = a.copy() CPU times: user 0.19 s, sys: 0.20 s, total: 0.39 s Wall time: 0.39 s Now, with a MKL-aware numexpr and non-AVX alignment: In [8]: p = ctypes.create_string_buffer(int(8e8)); hex(ctypes.addressof(p)) Out[8]: '0x7fce435da010' # 16 bytes alignment In [9]: a = np.ndarray(1e8, f8, p) In [10]: a[:] = np.linspace(0,1,1e8) In [11]: %time b = ne.evaluate(sin(a)) CPU times: user 0.44 s, sys: 0.27 s, total: 0.71 s Wall time: 0.15 s That is, more than 2x faster than a memcpy() in this system, meaning that the problem is truly memory-bounded. So now, with an AVX aligned buffer: In [14]: a2 = a[2:] # skip the first 16 bytes In [15]: %time b = ne.evaluate(sin(a2)) CPU times: user 0.40 s, sys: 0.28 s, total: 0.69 s Wall time: 0.16 s Again, times are very close. Just

### Re: [Numpy-discussion] Byte aligned arrays

On 12/20/12 9:53 AM, Henry Gomersall wrote: On Wed, 2012-12-19 at 19:03 +0100, Francesc Alted wrote: The only scenario that I see that this would create unaligned arrays is for machines having AVX. But provided that the Intel architecture is making great strides in fetching unaligned data, I'd be surprised that the difference in performance would be even noticeable. Can you tell us which difference in performance are you seeing for an AVX-aligned array and other that is not AVX-aligned? Just curious. Further to this point, from an Intel article... http://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors Aligning data to vector length is always recommended. When using Intel SSE and Intel SSE2 instructions, loaded data should be aligned to 16 bytes. Similarly, to achieve best results use Intel AVX instructions on 32-byte vectors that are 32-byte aligned. The use of Intel AVX instructions on unaligned 32-byte vectors means that every second load will be across a cache-line split, since the cache line is 64 bytes. This doubles the cache line split rate compared to Intel SSE code that uses 16-byte vectors. A high cache-line split rate in memory-intensive code is extremely likely to cause performance degradation. For that reason, it is highly recommended to align the data to 32 bytes for use with Intel AVX. Though it would be nice to put together a little example of this! Indeed, an example is what I was looking for. So provided that I have access to an AVX capable machine (having 6 physical cores), and that MKL 10.3 has support for AVX, I have made some comparisons using the Anaconda Python distribution (it ships with most packages linked against MKL 10.3). Here it is a first example using a DGEMM operation. First using a NumPy that is not turbo-loaded with MKL: In [34]: a = np.linspace(0,1,1e7) In [35]: b = a.reshape(1000, 1) In [36]: c = a.reshape(1, 1000) In [37]: time d = np.dot(b,c) CPU times: user 7.56 s, sys: 0.03 s, total: 7.59 s Wall time: 7.63 s In [38]: time d = np.dot(c,b) CPU times: user 78.52 s, sys: 0.18 s, total: 78.70 s Wall time: 78.89 s This is getting around 2.6 GFlop/s. Now, with a MKL 10.3 NumPy and AVX-unaligned data: In [7]: p = ctypes.create_string_buffer(int(8e7)); hex(ctypes.addressof(p)) Out[7]: '0x7fcdef3b4010' # 16 bytes alignment In [8]: a = np.ndarray(1e7, f8, p) In [9]: a[:] = np.linspace(0,1,1e7) In [10]: b = a.reshape(1000, 1) In [11]: c = a.reshape(1, 1000) In [37]: %timeit d = np.dot(b,c) 10 loops, best of 3: 164 ms per loop In [38]: %timeit d = np.dot(c,b) 1 loops, best of 3: 1.65 s per loop That is around 120 GFlop/s (i.e. almost 50x faster than without MKL/AVX). Now, using MKL 10.3 and AVX-aligned data: In [21]: p2 = ctypes.create_string_buffer(int(8e7+16)); hex(ctypes.addressof(p)) Out[21]: '0x7f8cb9598010' In [22]: a2 = np.ndarray(1e7+2, f8, p2)[2:] # skip the first 16 bytes (now is 32-bytes aligned) In [23]: a2[:] = np.linspace(0,1,1e7) In [24]: b2 = a2.reshape(1000, 1) In [25]: c2 = a2.reshape(1, 1000) In [35]: %timeit d2 = np.dot(b2,c2) 10 loops, best of 3: 163 ms per loop In [36]: %timeit d2 = np.dot(c2,b2) 1 loops, best of 3: 1.67 s per loop So, again, around 120 GFlop/s, and the difference wrt to unaligned AVX data is negligible. One may argue that DGEMM is CPU-bounded and that memory access plays little role here, and that is certainly true. So, let's go with a more memory-bounded problem, like computing a transcendental function with numexpr. First with a with NumPy and numexpr with no MKL support: In [8]: a = np.linspace(0,1,1e8) In [9]: %time b = np.sin(a) CPU times: user 1.20 s, sys: 0.22 s, total: 1.42 s Wall time: 1.42 s In [10]: import numexpr as ne In [12]: %time b = ne.evaluate(sin(a)) CPU times: user 1.42 s, sys: 0.27 s, total: 1.69 s Wall time: 0.37 s This time is around 4x faster than regular 'sin' in libc, and about the same speed than a memcpy(): In [13]: %time c = a.copy() CPU times: user 0.19 s, sys: 0.20 s, total: 0.39 s Wall time: 0.39 s Now, with a MKL-aware numexpr and non-AVX alignment: In [8]: p = ctypes.create_string_buffer(int(8e8)); hex(ctypes.addressof(p)) Out[8]: '0x7fce435da010' # 16 bytes alignment In [9]: a = np.ndarray(1e8, f8, p) In [10]: a[:] = np.linspace(0,1,1e8) In [11]: %time b = ne.evaluate(sin(a)) CPU times: user 0.44 s, sys: 0.27 s, total: 0.71 s Wall time: 0.15 s That is, more than 2x faster than a memcpy() in this system, meaning that the problem is truly memory-bounded. So now, with an AVX aligned buffer: In [14]: a2 = a[2:] # skip the first 16 bytes In [15]: %time b = ne.evaluate(sin(a2)) CPU times: user 0.40 s, sys: 0.28 s, total: 0.69 s Wall time: 0.16 s Again, times are very close. Just to make sure, let's use the timeit magic: In [16]: %timeit b = ne.evaluate(sin(a)) 10 loops, best of 3: 159 ms per loop # unaligned In [17]: %timeit b

### Re: [Numpy-discussion] Byte aligned arrays

On 12/19/12 5:47 PM, Henry Gomersall wrote: On Wed, 2012-12-19 at 15:57 +, Nathaniel Smith wrote: Not sure which interface is more useful to users. On the one hand, using funny dtypes makes regular non-SIMD access more cumbersome, and it forces your array size to be a multiple of the SIMD word size, which might be inconvenient if your code is smart enough to handle arbitrary-sized arrays with partial SIMD acceleration (i.e., using SIMD for most of the array, and then a slow path to handle any partial word at the end). OTOH, if your code *is* that smart, you should probably just make it smart enough to handle a partial word at the beginning as well and then you won't need any special alignment in the first place, and representing each SIMD word as a single numpy scalar is an intuitively appealing model of how SIMD works. OTOOH, just adding a single argument np.array() is a much simpler to explain than some elaborate scheme involving the creation of special custom dtypes. If it helps, my use-case is in wrapping the FFTW library. This _is_ smart enough to deal with unaligned arrays, but it just results in a performance penalty. In the case of an FFT, there are clearly going to be issues with the powers of two indices in the array not lying on a suitable n-byte boundary (which would be the case with a misaligned array), but I imagine it's not unique. The other point is that it's easy to create a suitable power of two array that should always bypass any special case unaligned code (e.g. with floats, any multiple of 4 array length will fill every 16-byte word). Finally, I think there is significant value in auto-aligning the array based on an appropriate inspection of the cpu capabilities (or alternatively, a function that reports back the appropriate SIMD alignment). Again, this makes it easier to wrap libraries that may function with any alignment, but benefit from optimum alignment. Hmm, NumPy seems to return data blocks that are aligned to 16 bytes on systems (Linux and Mac OSX): In []: np.empty(1).data Out[]: read-write buffer for 0x102b97b60, size 8, offset 0 at 0x102e7c130 In []: np.empty(1).data Out[]: read-write buffer for 0x102ba64e0, size 8, offset 0 at 0x102e7c430 In []: np.empty(1).data Out[]: read-write buffer for 0x102b86700, size 8, offset 0 at 0x102e7c5b0 In []: np.empty(1).data Out[]: read-write buffer for 0x102b981d0, size 8, offset 0 at 0x102e7c5f0 [Check that the last digit in the addresses above is always 0] The only scenario that I see that this would create unaligned arrays is for machines having AVX. But provided that the Intel architecture is making great strides in fetching unaligned data, I'd be surprised that the difference in performance would be even noticeable. Can you tell us which difference in performance are you seeing for an AVX-aligned array and other that is not AVX-aligned? Just curious. -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] the difference between + and np.add?

On 11/23/12 8:00 PM, Chris Barker - NOAA Federal wrote: On Thu, Nov 22, 2012 at 6:20 AM, Francesc Alted franc...@continuum.io wrote: As Nathaniel said, there is not a difference in terms of *what* is computed. However, the methods that you suggested actually differ on *how* they are computed, and that has dramatic effects on the time used. For example: In []: arr1, arr2, arr3, arr4, arr5 = [np.arange(1e7) for x in range(5)] In []: %time arr1 + arr2 + arr3 + arr4 + arr5 CPU times: user 0.05 s, sys: 0.10 s, total: 0.14 s Wall time: 0.15 s There are also ways to minimize the size of temporaries, and numexpr is one of the simplests: but you can also use np.add (and friends) to reduce the number of temporaries. It can make a difference: In [11]: def add_5_arrays(arr1, arr2, arr3, arr4, arr5): : result = arr1 + arr2 : np.add(result, arr3, out=result) : np.add(result, arr4, out=result) : np.add(result, arr5, out=result) In [13]: timeit arr1 + arr2 + arr3 + arr4 + arr5 1 loops, best of 3: 528 ms per loop In [17]: timeit add_5_arrays(arr1, arr2, arr3, arr4, arr5) 1 loops, best of 3: 293 ms per loop (don't have numexpr on this machine for a comparison) Yes, you are right. However, numexpr still can beat this: In [8]: timeit arr1 + arr2 + arr3 + arr4 + arr5 10 loops, best of 3: 138 ms per loop In [9]: timeit add_5_arrays(arr1, arr2, arr3, arr4, arr5) 10 loops, best of 3: 74.3 ms per loop In [10]: timeit ne.evaluate(arr1 + arr2 + arr3 + arr4 + arr5) 10 loops, best of 3: 20.8 ms per loop The reason is that numexpr is multithreaded (using 6 cores above), and for memory-bounded problems like this one, fetching data in different threads is more efficient than using a single thread: In [12]: timeit arr1.copy() 10 loops, best of 3: 41 ms per loop In [13]: ne.set_num_threads(1) Out[13]: 6 In [14]: timeit ne.evaluate(arr1) 10 loops, best of 3: 30.7 ms per loop In [15]: ne.set_num_threads(6) Out[15]: 1 In [16]: timeit ne.evaluate(arr1) 100 loops, best of 3: 13.4 ms per loop I.e., the joy of multi-threading is that it not only buys you CPU speed, but can also bring your data from memory faster. So yeah, modern applications *do* need multi-threading for getting good performance. -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] Conditional update of recarray field

On 11/28/12 1:47 PM, Bartosz wrote: Hi, I try to update values in a single field of numpy record array based on a condition defined in another array. I found that that the result depends on the order in which I apply the boolean indices/field names. For example: cond = np.zeros(5, dtype=np.bool) cond[2:] = True X = np.rec.fromarrays([np.arange(5)], names='a') X[cond]['a'] = -1 print X returns: [(0,) (1,) (2,) (3,) (4,)] (the values were not updated) X['a'][cond] = -1 print X returns: [(0,) (1,) (-1,) (-1,) (-1,)] (it worked this time). I find this behaviour very confusing. Is it expected? Yes, it is. In the first idiom, X[cond] is a fancy indexing operation and the result is not a view, so what you are doing is basically modifying the temporary object that results from the indexing. In the second idiom, X['a'] is returning a *view* of the original object, so this is why it works. Would it be possible to emit a warning message in the case of faulty assignments? The only solution that I can see for this is that the fancy indexing would return a view, and not a different object, but NumPy containers are not prepared for this. -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] Conditional update of recarray field

Hey Bartosz, On 11/28/12 3:26 PM, Bartosz wrote: Thanks for answer, Francesc. I understand now that fancy indexing returns a copy of a recarray. Is it also true for standard ndarrays? If so, I do not understand why X['a'][cond]=-1 should work. Yes, that's a good question. No, in this case the boolean array `cond` is passed to the __setitem__() of the original view, so this is why this works. The first idiom is concatenating the fancy indexing with another indexing operation, and NumPy needs to create a temporary for executing this, so the second indexing operation acts over a copy, not a view. And yes, fancy indexing returning a copy is standard for all ndarrays. Hope it is clearer now (although admittedly it is a bit strange at first sight), -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

### Re: [Numpy-discussion] the difference between + and np.add?

On 11/22/12 1:41 PM, Chao YUE wrote: Dear all, if I have two ndarray arr1 and arr2 (with the same shape), is there some difference when I do: arr = arr1 + arr2 and arr = np.add(arr1, arr2), and then if I have more than 2 arrays: arr1, arr2, arr3, arr4, arr5, then I cannot use np.add anymore as it only recieves 2 arguments. then what's the best practice to add these arrays? should I do arr = arr1 + arr2 + arr3 + arr4 + arr5 or I do arr = np.sum(np.array([arr1, arr2, arr3, arr4, arr5]), axis=0)? because I just noticed recently that there are functions like np.add, np.divide, np.substract... before I am using all like directly arr1/arr2, rather than np.divide(arr1,arr2). As Nathaniel said, there is not a difference in terms of *what* is computed. However, the methods that you suggested actually differ on *how* they are computed, and that has dramatic effects on the time used. For example: In []: arr1, arr2, arr3, arr4, arr5 = [np.arange(1e7) for x in range(5)] In []: %time arr1 + arr2 + arr3 + arr4 + arr5 CPU times: user 0.05 s, sys: 0.10 s, total: 0.14 s Wall time: 0.15 s Out[]: array([ 0.e+00, 5.e+00, 1.e+01, ..., 4.9850e+07, 4.9900e+07, 4.9950e+07]) In []: %time np.sum(np.array([arr1, arr2, arr3, arr4, arr5]), axis=0) CPU times: user 2.98 s, sys: 0.15 s, total: 3.13 s Wall time: 3.14 s Out[]: array([ 0.e+00, 5.e+00, 1.e+01, ..., 4.9850e+07, 4.9900e+07, 4.9950e+07]) The difference is how memory is used. In the first case, the additional memory was just a temporary with the size of the operands, while for the second case a big temporary has to be created, so the difference in is speed is pretty large. There are also ways to minimize the size of temporaries, and numexpr is one of the simplests: In []: import numexpr as ne In []: %time ne.evaluate('arr1 + arr2 + arr3 + arr4 + arr5') CPU times: user 0.04 s, sys: 0.04 s, total: 0.08 s Wall time: 0.04 s Out[]: array([ 0.e+00, 5.e+00, 1.e+01, ..., 4.9850e+07, 4.9900e+07, 4.9950e+07]) Again, the computations are the same, but how you manage memory is critical. -- Francesc Alted ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion