Hi,
Note that the plug-in idea is just my own idea; it is not something
agreed on by anyone else. So maybe it won't be done for numpy 1.1, or at
all. It depends on the main maintainers of numpy.
I'm +3 for the plugin idea - it would have huge benefits for
installation and automatic
David Cournapeau wrote:
Gnata Xavier wrote:
OK, I will try to see what I can do, but it is clear that we do need the
plug-in system first (see the earlier threads about the numpy release).
During the development of 1.1, I will try to find some time to understand
where I should put some pragmas into
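(For illustration, a minimal sketch of the kind of pragma being discussed,
on a simplified contiguous binary loop; the function name and the
unit-stride assumption are illustrative, not numpy's actual ufunc code.
Compile with gcc -fopenmp.)

#include <stddef.h>

/* Simplified binary-ufunc inner loop: split the element loop across
 * the available cores with a single OpenMP pragma. */
static void
add_double_contig(const double *a, const double *b, double *out, size_t n)
{
    ptrdiff_t i;
#pragma omp parallel for
    for (i = 0; i < (ptrdiff_t)n; i++)
        out[i] = a[i] + b[i];
}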
A couple of thoughts on parallelism:
1. Can someone come up with a small set of cases and time them on
numpy, IDL, Matlab, and C, using various parallel schemes, for each of
a representative set of architectures? We're comparing a benchmark to
itself on different architectures, rather than
It is a real problem in some communities, like astronomers and image
processing people, but the lack of documentation is the first one, that
is true.
Even in those communities, I think that a lot could be done at a higher
level, as what IPython1 does (tasks parallelism).
Matthieu
--
French
On Sat, Mar 22, 2008 at 4:25 PM, Charles R Harris
[EMAIL PROTECTED] wrote:
On Sat, Mar 22, 2008 at 2:59 PM, Robert Kern [EMAIL PROTECTED] wrote:
On Sat, Mar 22, 2008 at 2:04 PM, Charles R Harris
[EMAIL PROTECTED] wrote:
Maybe it's time to revisit the template subsystem I pulled out of
Matthieu Brucher wrote:
It is a real problem in some communities, like astronomers and image
processing people, but the lack of documentation is the first one, that
is true.
Even in those communities, I think that a lot could be done at a
higher level, as what IPython1
On Mon, Mar 24, 2008 at 10:35 AM, Robert Kern [EMAIL PROTECTED] wrote:
On Sat, Mar 22, 2008 at 4:25 PM, Charles R Harris
[EMAIL PROTECTED] wrote:
On Sat, Mar 22, 2008 at 2:59 PM, Robert Kern [EMAIL PROTECTED]
wrote:
On Sat, Mar 22, 2008 at 2:04 PM, Charles R Harris
[EMAIL
Matthew Brett wrote:
I'm +3 for the plugin idea - it would have huge benefits for
installation and automatic optimization. What needs to be done? Who
could do it?
The main issues are portability and reliability, I think. All OSes
supported by numpy have more or less a dynamic library loading
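(To make the portability point concrete, a sketch of what runtime selection
of an optimized core might look like on the POSIX side; the library and
symbol names here are hypothetical, and Windows would need
LoadLibrary/GetProcAddress instead. Link with -ldl.)

#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical plug-in entry point that would register optimized loops. */
typedef int (*init_loops_t)(void);

int main(void)
{
    /* Prefer an SSE2-optimized core, fall back to a generic build. */
    void *handle = dlopen("numpy_core_sse2.so", RTLD_NOW);
    if (handle == NULL)
        handle = dlopen("numpy_core_generic.so", RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "no core plug-in found: %s\n", dlerror());
        return 1;
    }
    init_loops_t init = (init_loops_t)dlsym(handle, "init_loops");
    return init ? init() : 1;
}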
On Mon, Mar 24, 2008 at 12:12 PM, Gnata Xavier [EMAIL PROTECTED] wrote:
Well, it is not that easy. We have several pieces of numpy code that go like this:
1) open a large data file to get a numpy array
2) perform computations on this array (I'm only talking about the numpy
part here. scipy is
Robert Kern wrote:
On Mon, Mar 24, 2008 at 12:12 PM, Gnata Xavier [EMAIL PROTECTED] wrote:
Well, it is not that easy. We have several pieces of numpy code that go like this:
1) open a large data file to get a numpy array
2) perform computations on this array (I'm only talking about the numpy
On Sat, Mar 22, 2008 at 10:59 PM, David Cournapeau
[EMAIL PROTECTED] wrote:
Charles R Harris wrote:
It looks like memory access is the bottleneck, otherwise running 4
floats through in parallel should go a lot faster. I need to modify
the program a bit and see how it works for doubles.
Charles R Harris wrote:
Yep, but I expect the compilers to take care of alignment, say by
inserting a few single ops when needed.
The other solution would be to have aligned allocators (it won't solve
all cases, of course). Because the compilers will never be able to take
care of the cases
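(For reference, an aligned allocator along these lines is a one-liner on
POSIX; Windows would use _aligned_malloc instead. A sketch:)

#include <stdlib.h>

/* Allocate n doubles on a 16-byte boundary, as required by the aligned
 * SSE2 load/store instructions. Free with the ordinary free(). */
static double *alloc_doubles_aligned(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 16, n * sizeof(double)) != 0)
        return NULL;
    return (double *)p;
}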
James Philbin wrote:
OK, I've written a simple benchmark which implements an elementwise
multiply (A=B*C) in three different ways (standard C, intrinsics, hand
coded assembly). On the face of things the results seem to indicate
that the vectorization works best on medium sized inputs. If
Wow, a much more varied set of results than I was expecting. Could
someone who has gcc 4.3 installed compile it with:
gcc -msse -O2 -ftree-vectorize -ftree-vectorizer-verbose=5 -S
vec_bench.c -o vec_bench.s
And attach vec_bench.s and the verbose output from gcc.
James
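(For readers without the attachment, the two simpler variants look roughly
like this; an illustrative sketch, not the actual vec_bench.c. Compile with
gcc -msse -O2.)

#include <xmmintrin.h>

/* Plain C version: this is the loop gcc's -ftree-vectorize can target. */
void mul_simple(const float *b, const float *c, float *a, int n)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] = b[i] * c[i];
}

/* Intrinsics version: 4 floats per iteration with unaligned loads, plus
 * a scalar tail loop for the last n % 4 elements. */
void mul_intrin(const float *b, const float *c, float *a, int n)
{
    int i = 0;
    for (; i <= n - 4; i += 4) {
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(c + i);
        _mm_storeu_ps(a + i, _mm_mul_ps(vb, vc));
    }
    for (; i < n; i++)
        a[i] = b[i] * c[i];
}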
Gnata Xavier wrote:
Hi,
I have a very limited knowledge of openmp but please consider this
testcase :
Honestly, if it were that simple, it would have been done a long time
ago. The problem is that your test-case is not even remotely close
to how things have to be done in
On Sunday 23 March 2008, Charles R Harris wrote:
gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
cpu: Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Problem size   Simple              Intrin              Inline
         100   0.0002ms (100.0%)   0.0001ms ( 68.7%)
On Sunday 23 March 2008, David Cournapeau wrote:
Gnata Xavier wrote:
Hi,
I have a very limited knowledge of openmp but please consider this
testcase :
Honestly, if it were that simple, it would have been done a long
time ago. The problem is that your test-case is not even
Francesc Altet wrote:
Why not? IMHO, complex operations requiring a great deal of operations
per word, like trigonometric, exponential, etc..., are the best
candidates to take advantage of several cores or even SSE instructions
(not sure whether SSE supports this sort of operations,
I find the example of sse rather enlightening: in theory, you should
expect a 100-300% speed increase using sse, but even with pure C code,
in a controlled setting, on one platform (linux + gcc), with varying
recent CPUs, the results are fundamentally different. So what would
happen in numpy,
David Cournapeau wrote:
Francesc Altet wrote:
Why not? IMHO, complex operations requiring a great deal of operations
per word, like trigonometric, exponential, etc..., are the best
candidates to take advantage of several cores or even SSE instructions
(not sure whether SSE supports
Gnata Xavier wrote:
Well, of course my goal was not to say that my simple testcase can be
copied/pasted into numpy :)
Of course it is one of the best cases for using openmp.
Of course pragmas can be more complex than that (you can specify which
variables can/cannot be shared, for instance).
The size
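(On the clauses mentioned above, a small illustrative example of telling
OpenMP how variables are shared between threads; again only a sketch, not
numpy code.)

#include <stddef.h>

/* Sum of squares: the arrays are shared, the loop index is private to
 * each thread, and the accumulator is combined via a reduction clause
 * rather than being updated concurrently. */
double sum_of_squares(const double *x, ptrdiff_t n)
{
    double total = 0.0;
    ptrdiff_t i;
#pragma omp parallel for shared(x, n) private(i) reduction(+:total)
    for (i = 0; i < n; i++)
        total += x[i] * x[i];
    return total;
}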
If the performance is so bad, OK, forget about it... but it would be
sad, because the next generation of CPUs will not be more powerful; they
will only have more than one or two cores on the same chip.
I don't think this is the worst that will happen. The worst is what has been
seen for
Hi David et al,
Very interesting. I thought that the 64-bit gcc's automatically
aligned memory on 16-bit (or 32-bit) boundaries. But apparently
not. Because running your code certainly made the intrinsic code
quite a bit faster. However, another thing that I noticed was
that the simple code
Scott Ransom wrote:
Hi David et al,
Very interesting. I thought that the 64-bit gcc's automatically
aligned memory on 16-bit (or 32-bit) boundaries.
Note that I am talking about bytes, not bits. Default alignment depends
on many parameters, like the OS and the C runtime. For example, on Mac OS X,
On Sun, Mar 23, 2008 at 6:41 AM, Francesc Altet [EMAIL PROTECTED] wrote:
On Sunday 23 March 2008, Charles R Harris wrote:
gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
cpu: Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
Problem size Simple
OK, I'm really impressed with the improvements in vectorization for
gcc 4.3. It really seems like it's able to work with real loops, which
wasn't the case with 4.1. I think Chuck's right that we should simply
special-case contiguous data and allow the auto-vectorizer to do the
rest. Something like
On 23/03/2008, David Cournapeau [EMAIL PROTECTED] wrote:
Gnata Xavier wrote:
Hi,
I have a very limited knowledge of openmp but please consider this
testcase :
Honestly, if it were that simple, it would have been done a long time
ago. The problem is that your
(And I suspect that OpenMP is
smart enough to use single threads without locking when multiple
threads won't help. Certainly all the information is available to
OpenMP to make such decisions.)
Unfortunately, I don't think there is such a thing. For instance, the number
of threads used by MKL
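(For reference, OpenMP does expose some of this control, though it is not
automatic: the if clause disables threading below a size cutoff, and the
thread count can be capped programmatically or via OMP_NUM_THREADS. A
sketch; the 10000-element threshold is an arbitrary illustrative value.)

#include <omp.h>

void scale_inplace(double *x, long n, double k)
{
    long i;
    /* Only fork threads when the array is big enough to amortize the
     * parallel-region overhead. */
#pragma omp parallel for if (n > 10000)
    for (i = 0; i < n; i++)
        x[i] *= k;
}

void limit_to_two_threads(void)
{
    omp_set_num_threads(2);
}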
Anne Archibald wrote:
Actually, there are a few places where a parallel for would serve to
accelerate all ufuncs. There are build issues, yes, though they are
mild;
Maybe, maybe not. Anyway, I said that I would step in to resolve those
issues if someone else does the coding.
we would also
Gnata Xavier wrote:
OK, I will try to see what I can do, but it is clear that we do need the
plug-in system first (see the earlier threads about the numpy release).
During the development of 1.1, I will try to find some time to understand
where I should put some pragmas into the ufuncs using a very
Matthieu Brucher wrote:
Hi,
It seems complicated to add OpenMP to the code; I don't think many
people have the knowledge to do this, not to mention the fact that
there are a lot of Python calls in the different functions.
Yes, this makes potential optimizations harder, at least for someone
Personally, I think that the time would be better spent optimizing
routines for single-threaded code and relying on BLAS and LAPACK
libraries to use multiple cores for more complex calculations. In
particular, doing some basic loop unrolling and SSE versions of the
ufuncs would be beneficial. I
James Philbin wrote:
Personally, I think that the time would be better spent optimizing
routines for single-threaded code and relying on BLAS and LAPACK
libraries to use multiple cores for more complex calculations. In
particular, doing some basic loop unrolling and SSE versions of the
gcc keeps advancing autovectorization. Is manual vectorization worth the
trouble?
Well, the way that the ufuncs are written at the moment,
-ftree-vectorize will never kick in due to the non-constant strides.
To get this to work, one has to special case out unary strides. Even
with constant
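(A hedged sketch of that special-casing: keep the generic byte-stride loop,
but branch to a contiguous, restrict-qualified version that the
auto-vectorizer can handle. The names are illustrative, not numpy's real
loop code.)

#include <stddef.h>

/* Generic loop: strides are arbitrary byte counts, so gcc cannot prove
 * the accesses are contiguous and -ftree-vectorize stays off. */
static void add_strided(char *a, ptrdiff_t sa, char *b, ptrdiff_t sb,
                        char *out, ptrdiff_t so, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        *(double *)(out + i * so) =
            *(double *)(a + i * sa) + *(double *)(b + i * sb);
}

/* Contiguous special case: unit stride and restrict pointers give the
 * auto-vectorizer a plain countable loop to work with. */
static void add_contig(const double *restrict a, const double *restrict b,
                       double *restrict out, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Dispatcher: take the fast path only when every stride is one double. */
void add_dispatch(char *a, ptrdiff_t sa, char *b, ptrdiff_t sb,
                  char *out, ptrdiff_t so, size_t n)
{
    if (sa == (ptrdiff_t)sizeof(double) && sb == (ptrdiff_t)sizeof(double) &&
        so == (ptrdiff_t)sizeof(double))
        add_contig((const double *)a, (const double *)b, (double *)out, n);
    else
        add_strided(a, sa, b, sb, out, so, n);
}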
On Sat, Mar 22, 2008 at 11:43 AM, Neal Becker [EMAIL PROTECTED] wrote:
James Philbin wrote:
Personally, I think that the time would be better spent optimizing
routines for single-threaded code and relying on BLAS and LAPACK
libraries to use multiple cores for more complex calculations. In
On Sat, Mar 22, 2008 at 12:01 PM, James Philbin [EMAIL PROTECTED] wrote:
gcc keeps advancing autovectorization. Is manual vectorization worth
the
trouble?
Well, the way that the ufuncs are written at the moment,
-ftree-vectorize will never kick in due to the non-constant strides.
To
James Philbin wrote:
Personally, I think that the time would be better spent optimizing
routines for single-threaded code and relying on BLAS and LAPACK
libraries to use multiple cores for more complex calculations. In
particular, doing some basic loop unrolling and SSE versions of the
ufuncs
Charles R Harris wrote:
On Sat, Mar 22, 2008 at 11:43 AM, Neal Becker [EMAIL PROTECTED]
mailto:[EMAIL PROTECTED] wrote:
James Philbin wrote:
Personally, I think that the time would be better spent optimizing
routines for single-threaded code and relying on BLAS and LAPACK
OK, so a few questions:
1. I'm not familiar with the format of the code generators. Should I
pull the special case out of the /** begin repeats or should I do a
conditional inside the repeats (how does one do this?).
2. I don't have access to Windows+VisualC, so I will need some help
testing for
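(For question 1: numpy's .src files are expanded by a small generator that
repeats a block once per value in each #name = ...# list and substitutes
@name@. Roughly like this; the function body is an illustrative sketch, not
the real generated loop. The question above then amounts to choosing between
emitting the special case as a separately generated function like this, or
adding a stride check inside the existing generated body.)

/**begin repeat
 * #TYPE = FLOAT, DOUBLE#
 * #type = float, double#
 */
static void
@TYPE@_square_contig(const @type@ *in, @type@ *out, long n)
{
    long i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * in[i];
}
/**end repeat**/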
On 22.03.2008 at 19:20, Travis E. Oliphant wrote:
I think the thing to do is to special-case the code so that if the
strides work for vectorization, then a different bit of code is executed
and this current code is used as the final special-case.
Something like this would be relatively
On 22/03/2008, Thomas Grill [EMAIL PROTECTED] wrote:
I've experimented with branching the ufuncs into different constant
strides and aligned/unaligned cases to be able to use SSE using
compiler intrinsics.
I expected a considerable gain as I was using float32 with stride 1
most of the
On 22/03/2008, Travis E. Oliphant [EMAIL PROTECTED] wrote:
James Philbin wrote:
Personally, I think that the time would be better spent optimizing
routines for single-threaded code and relying on BLAS and LAPACK
libraries to use multiple cores for more complex calculations. In
However, profiling revealed that hardly anything was gained because of
1) non-alignment of the vectors; this _could_ be handled by
shuffled loading of the values, though
2) the fact that my application used relatively large vectors that
wouldn't fit into the CPU cache, hence the memory
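(On point 1, the aligned and unaligned cases differ only in which load
intrinsic can be used; a sketch, not the code being profiled.)

#include <xmmintrin.h>
#include <stdint.h>

/* Load 4 consecutive floats: use the aligned form only when the address
 * sits on a 16-byte boundary, otherwise the slower unaligned load. */
static __m128 load4(const float *p)
{
    if (((uintptr_t)p & 15) == 0)
        return _mm_load_ps(p);
    return _mm_loadu_ps(p);
}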
On Sat, Mar 22, 2008 at 12:54 PM, Anne Archibald [EMAIL PROTECTED]
wrote:
On 22/03/2008, Travis E. Oliphant [EMAIL PROTECTED] wrote:
James Philbin wrote:
Personally, I think that the time would be better spent optimizing
routines for single-threaded code and relying on BLAS and LAPACK
Anne Archibald wrote:
On 22/03/2008, Travis E. Oliphant [EMAIL PROTECTED] wrote:
James Philbin wrote:
Personally, I think that the time would be better spent optimizing
routines for single-threaded code and relying on BLAS and LAPACK
libraries to use multiple cores for more complex
On Sat, Mar 22, 2008 at 2:04 PM, Charles R Harris
[EMAIL PROTECTED] wrote:
Maybe it's time to revisit the template subsystem I pulled out of Django.
I am still -lots on using the Django template system. Please, please,
please, look at Jinja or another templating package that could be
dropped in
On Sat, Mar 22, 2008 at 2:59 PM, Robert Kern [EMAIL PROTECTED] wrote:
On Sat, Mar 22, 2008 at 2:04 PM, Charles R Harris
[EMAIL PROTECTED] wrote:
Maybe it's time to revisit the template subsystem I pulled out of
Django.
I am still -lots on using the Django template system. Please, please,
On Sat, Mar 22, 2008 at 8:16 PM, Travis E. Oliphant
[EMAIL PROTECTED] wrote:
Perhaps we could drum up interest in a Need for Speed Sprint on NumPy
sometime over the next few months.
I guess we'd all like our computations to complete more quickly, as
long as they still give valid results. I
OK, I've written a simple benchmark which implements an elementwise
multiply (A=B*C) in three different ways (standard C, intrinsics, hand
coded assembly). On the face of things the results seem to indicate
that the vectorization works best on medium sized inputs. If people
could post the results
gcc --version
gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[EMAIL PROTECTED] ~]$ cat
On Sat, Mar 22, 2008 at 5:03 PM, James Philbin [EMAIL PROTECTED] wrote:
OK, I've written a simple benchmark which implements an elementwise
multiply (A=B*C) in three different ways (standard C, intrinsics, hand
coded assembly). On the face of things the results seem to indicate
that the
Hi,
here's my results:
Intel Core 2 Duo, 2.16GHz, 667MHz bus, 4MB Cache
running under OSX 10.5.2
please note that the auto-vectorizer of gcc-4.3 is doing really well
gr~~~
-
gcc version 4.0.1 (Apple Inc. build 5465)
xbook-2:temp thomas$ gcc -msse -O2 vec_bench.c -o
On Sat, Mar 22, 2008 at 5:32 PM, Charles R Harris [EMAIL PROTECTED]
wrote:
On Sat, Mar 22, 2008 at 5:03 PM, James Philbin [EMAIL PROTECTED] wrote:
OK, I've written a simple benchmark which implements an elementwise
multiply (A=B*C) in three different ways (standard C, intrinsics, hand
On Sat, Mar 22, 2008 at 6:34 PM, Charles R Harris [EMAIL PROTECTED]
wrote:
I've attached a double version. Compile with
gcc -msse2 -mfpmath=sse -O2 vec_bench_dbl.c -o vec_bench_dbl
Chuck
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <emmintrin.h>
int sizes[6] =
Here are results under 64-bit linux using gcc-4.3 (which by
default turns on the various sse flags). Note that -O3 is
significantly better than -O2 for the simple calls:
nimrod:~$ cat /proc/cpuinfo | grep "model name" | head -1
model name : Intel(R) Xeon(R) CPU E5450 @ 3.00GHz
Thomas Grill wrote:
Hi,
here's my results:
Intel Core 2 Duo, 2.16GHz, 667MHz bus, 4MB Cache
running under OSX 10.5.2
please note that the auto-vectorizer of gcc-4.3 is doing really well
gr~~~
-
gcc version 4.0.1 (Apple Inc. build 5465)
xbook-2:temp
On Sat, Mar 22, 2008 at 7:35 PM, Scott Ransom [EMAIL PROTECTED] wrote:
Here are results under 64-bit linux using gcc-4.3 (which by
default turns on the various sse flags). Note that -O3 is
significantly better than -O2 for the simple calls:
nimrod:~$ cat /proc/cpuinfo | grep "model name" |
Charles R Harris wrote:
It looks like memory access is the bottleneck, otherwise running 4
floats through in parallel should go a lot faster. I need to modify
the program a bit and see how it works for doubles.
I am not sure the benchmark is really meaningful: it does not use
aligned