[OMPI users] 1.6.2 affinity failures

2012-12-19 Thread Brock Palen
Using Open MPI 1.6.2 with Intel 13.0, though the problem is not specific to the 
compiler.

Using two 12-core, 2-socket nodes:

mpirun -np 4 -npersocket 2 uptime
--
Your job has requested a conflicting number of processes for the
application:

App: uptime
number of procs:  4

This is more processes than we can launch under the following
additional directives and conditions:

number of sockets:   0
npersocket:   2


Any idea why this wouldn't work?  

Another problem. The following does what I expect on two 2-socket nodes with 
8-core sockets (16 total cores/node):

mpirun -np 8 -npernode 4 -bind-to-core -cpus-per-rank 4 hwloc-bind --get
0x000f
0x000f
0x00f0
0x00f0
0x0f00
0x0f00
0xf000
0xf000
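(For reference: a minimal sketch, not part of the original post, of how each rank 
could query and print its own binding mask from inside the code, roughly what the 
hwloc-bind --get output above reports. It assumes hwloc headers and an MPI compiler 
wrapper are available.)

#include <hwloc.h>
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* query the CPU set this process is currently bound to */
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    hwloc_bitmap_t set = hwloc_bitmap_alloc();
    hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS);

    char *mask;
    hwloc_bitmap_asprintf(&mask, set);          /* e.g. "0x000000f0" */
    printf("rank %d bound to %s\n", rank, mask);

    free(mask);
    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
    MPI_Finalize();
    return 0;
}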

But it fails at larger scale:

mpirun -np 276 -npernode 4 -bind-to-core -cpus-per-rank 4 hwloc-bind --get

--
An invalid physical processor ID was returned when attempting to bind
an MPI process to a unique processor.

This usually means that you requested binding to more processors than
exist (e.g., trying to bind N MPI processes to M processors, where N >
M).  Double check that you have enough unique processors for all the
MPI processes that you are launching on this host.
You job will now abort.
--



Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
bro...@umich.edu
(734)936-1985






Re: [OMPI users] mpi problems/many cpus per node

2012-12-19 Thread Daniel Davidson

I figured this out.

ssh was working, but scp was not, due to an MTU mismatch between the 
systems.  Adding MTU=1500 to my 
/etc/sysconfig/network-scripts/ifcfg-eth2 fixed the problem.


Dan

On 12/17/2012 04:12 PM, Daniel Davidson wrote:

Yes, it does.

Dan

[root@compute-2-1 ~]# ssh compute-2-0
Warning: untrusted X11 forwarding setup failed: xauth key data not 
generated
Warning: No xauth data; using fake authentication data for X11 
forwarding.

Last login: Mon Dec 17 16:13:00 2012 from compute-2-1.local
[root@compute-2-0 ~]# ssh compute-2-1
Warning: untrusted X11 forwarding setup failed: xauth key data not 
generated
Warning: No xauth data; using fake authentication data for X11 
forwarding.

Last login: Mon Dec 17 16:12:32 2012 from biocluster.local
[root@compute-2-1 ~]#



On 12/17/2012 03:39 PM, Doug Reeder wrote:

Daniel,

Does passwordless ssh work? You need to make sure that it does.

Doug
On Dec 17, 2012, at 2:24 PM, Daniel Davidson wrote:

I would also add that scp seems to be creating the file in the /tmp 
directory of compute-2-0, and that /var/log/secure is showing ssh 
connections being accepted.  Is there anything in ssh that can limit 
connections that I need to look out for?  My guess is that it is 
part of the client prefs and not the server prefs, since I can 
initiate the mpi command from another machine and it works fine, 
even when it uses compute-2-0 and 1.


Dan


[root@compute-2-1 /]# date
Mon Dec 17 15:11:50 CST 2012
[root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
compute-2-0,compute-2-1 -v  -np 10 --leave-session-attached -mca 
odls_base_verbose 5 -mca plm_base_verbose 5 hostname
[compute-2-1.local:70237] mca:base:select:(  plm) Querying component 
[rsh]
[compute-2-1.local:70237] [[INVALID],INVALID] plm:rsh_lookup on 
agent ssh : rsh path NULL


[root@compute-2-0 tmp]# ls -ltr
total 24
-rw-------.  1 root    root       0 Nov 28 08:42 yum.log
-rw-------.  1 root    root    5962 Nov 29 10:50 yum_save_tx-2012-11-29-10-50SRba9s.yumtx
drwx------.  3 danield danield 4096 Dec 12 14:56 openmpi-sessions-danield@compute-2-0_0
drwx------.  3 root    root    4096 Dec 13 15:38 openmpi-sessions-root@compute-2-0_0
drwx------  18 danield danield 4096 Dec 14 09:48 openmpi-sessions-danield@compute-2-0.local_0
drwx------  44 root    root    4096 Dec 17 15:14 openmpi-sessions-root@compute-2-0.local_0


[root@compute-2-0 tmp]# tail -10 /var/log/secure
Dec 17 15:13:40 compute-2-0 sshd[24834]: Accepted publickey for root 
from 10.1.255.226 port 49483 ssh2
Dec 17 15:13:40 compute-2-0 sshd[24834]: pam_unix(sshd:session): 
session opened for user root by (uid=0)
Dec 17 15:13:42 compute-2-0 sshd[24834]: Received disconnect from 
10.1.255.226: 11: disconnected by user
Dec 17 15:13:42 compute-2-0 sshd[24834]: pam_unix(sshd:session): 
session closed for user root
Dec 17 15:13:50 compute-2-0 sshd[24851]: Accepted publickey for root 
from 10.1.255.226 port 49484 ssh2
Dec 17 15:13:50 compute-2-0 sshd[24851]: pam_unix(sshd:session): 
session opened for user root by (uid=0)
Dec 17 15:13:55 compute-2-0 sshd[24851]: Received disconnect from 
10.1.255.226: 11: disconnected by user
Dec 17 15:13:55 compute-2-0 sshd[24851]: pam_unix(sshd:session): 
session closed for user root
Dec 17 15:14:01 compute-2-0 sshd[24868]: Accepted publickey for root 
from 10.1.255.226 port 49485 ssh2
Dec 17 15:14:01 compute-2-0 sshd[24868]: pam_unix(sshd:session): 
session opened for user root by (uid=0)







On 12/17/2012 11:16 AM, Daniel Davidson wrote:
After a very long time (15 minutes or so) I finally received the 
following, in addition to what I just sent earlier:


[compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc 
working on WILDCARD
[compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc 
working on WILDCARD
[compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc 
working on WILDCARD

[compute-2-1.local:69655] [[32341,0],0] daemon 1 failed with status 1
[compute-2-1.local:69655] [[32341,0],0] plm:base:orted_cmd sending 
orted_exit commands
[compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc 
working on WILDCARD
[compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc 
working on WILDCARD


Firewalls are down:

[root@compute-2-1 /]# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source   destination

Chain FORWARD (policy ACCEPT)
target prot opt source   destination

Chain OUTPUT (policy ACCEPT)
target prot opt source   destination
[root@compute-2-0 ~]# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source   destination

Chain FORWARD (policy ACCEPT)
target prot opt source   destination

Chain OUTPUT (policy ACCEPT)
target prot opt source   destination

On 12/17/2012 11:09 AM, Ralph Castain wrote:
Hmmm...and that is ALL the output? If so, then it never succeeded 
in sending a message back, which leads one to suspect some kind of 
firewall in the way.


Looking at the 

Re: [OMPI users] openmpi-1.9a1r27674 on Cygwin-1.7.17

2012-12-19 Thread marco atzeri

On 12/19/2012 12:28 PM, marco atzeri wrote:


working on openmpi-1.7rc5.
It needs some cleaning, and afterwards I need to test.


It built and passed the test:
http://www.open-mpi.org/community/lists/devel/2012/12/11855.php

Regards
Marco



Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1

2012-12-19 Thread Number Cruncher

On 19/12/12 11:08, Paul Kapinos wrote:
Did you *really* want to dig into the code just in order to switch a 
default communication algorithm?


No, I didn't want to, but with a huge change in performance, I'm forced 
to do something! And having looked at the different algorithms, I think 
there's a problem with the new default whenever message sizes are small 
enough that connection latency will dominate. We're not all running 
Infiniband, and having to wait for each pairwise exchange to complete 
before initiating another seems wrong if the latency in waiting for 
completion dominates the transmission time.


E.g. if I have 10 small pairwise exchanges to perform, isn't it better to 
put all 10 outbound messages on the wire and wait for 10 matching 
inbound messages, in any order? The new algorithm must wait for the first 
exchange to complete, then the second, then the third. Unlike before, 
does it not have to wait to acknowledge the matching *zero-sized* 
requests? I don't see why this temporal ordering matters.


Thanks for your help,
Simon






Note there are several ways to set the parameters; --mca on the command 
line is just one of them (suitable for quick online tests).


http://www.open-mpi.org/faq/?category=tuning#setting-mca-params

We 'tune' our Open MPI by setting environment variables

Best
Paul Kapinos



On 12/19/12 11:44, Number Cruncher wrote:

Having run some more benchmarks, the new default is *really* bad for our
application (2-10x slower), so I've been looking at the source to try and
figure out why.

It seems that the biggest difference will occur when the all_to_all is actually
sparse (e.g. our application); if most N-M process exchanges are zero in size,
the 1.6 ompi_coll_tuned_alltoallv_intra_basic_linear algorithm will actually
only post irecv/isend for non-zero exchanges; any zero-size exchanges are
skipped. It then waits once for all requests to complete. In contrast, the new
ompi_coll_tuned_alltoallv_intra_pairwise will post the zero-size exchanges for
*every* N-M pair, and wait for each pairwise exchange. This is O(comm_size)
waits, many of which are zero. I'm not clear what optimizations there are for
zero-size isend/irecv, but surely there's a great deal more latency if each
pairwise exchange has to be confirmed complete before executing the next?
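A hedged illustration of the two posting patterns as described above (function
names and details are invented for the sketch; this is not the actual Open MPI
source):

/* Illustrative only -- not the Open MPI implementation. */
#include <mpi.h>
#include <stdlib.h>

/* linear style: post irecv/isend only for non-zero exchanges, then one wait */
static void alltoallv_linear_sketch(const char *sbuf, const int *scnt, const int *sdsp,
                                    char *rbuf, const int *rcnt, const int *rdsp,
                                    MPI_Comm comm)
{
    int size, n = 0;
    MPI_Comm_size(comm, &size);
    MPI_Request *req = malloc(2 * size * sizeof(MPI_Request));
    for (int p = 0; p < size; ++p) {
        if (rcnt[p] > 0)   /* zero-size exchanges are skipped entirely */
            MPI_Irecv(rbuf + rdsp[p], rcnt[p], MPI_BYTE, p, 0, comm, &req[n++]);
        if (scnt[p] > 0)
            MPI_Isend((void *)(sbuf + sdsp[p]), scnt[p], MPI_BYTE, p, 0, comm, &req[n++]);
    }
    MPI_Waitall(n, req, MPI_STATUSES_IGNORE);   /* single wait for everything */
    free(req);
}

/* pairwise style: one synchronised exchange per peer, zero-size included,
   each completed before the next is started -> O(comm_size) waits */
static void alltoallv_pairwise_sketch(const char *sbuf, const int *scnt, const int *sdsp,
                                      char *rbuf, const int *rcnt, const int *rdsp,
                                      MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    for (int step = 0; step < size; ++step) {
        int sendto   = (rank + step) % size;
        int recvfrom = (rank + size - step) % size;
        MPI_Sendrecv((void *)(sbuf + sdsp[sendto]), scnt[sendto], MPI_BYTE, sendto, 0,
                     rbuf + rdsp[recvfrom], rcnt[recvfrom], MPI_BYTE, recvfrom, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}

The point is that the second pattern serialises completion of every pair, which is
where the extra latency for sparse patterns would come from.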


Relatedly, how would I direct OpenMPI to use the older algorithm
programmatically? I don't want the user to have to use "--mca" in their
"mpiexec". Is there a C API?

Thanks,
Simon


On 16/11/12 10:15, Iliev, Hristo wrote:

Hi, Simon,

The pairwise algorithm passes messages in a synchronised ring-like fashion
with increasing stride, so it works best when independent communication
paths could be established between several ports of the network
switch/router. Some 1 Gbps Ethernet equipment is not capable of doing so,
some is - it depends (usually on the price). This said, not all algorithms
perform the same given a specific type of network interconnect. For example,
on our fat-tree InfiniBand network the pairwise algorithm performs better.

You can switch back to the basic linear algorithm by providing the following
MCA parameters:

mpiexec --mca coll_tuned_use_dynamic_rules 1 --mca
coll_tuned_alltoallv_algorithm 1 ...

Algorithm 1 is the basic linear, which used to be the default. Algorithm 2
is the pairwise one.
You can also set these values as exported environment variables:

export OMPI_MCA_coll_tuned_use_dynamic_rules=1
export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
mpiexec ...

You can also put this in $HOME/.openmpi/mcaparams.conf or (to make it have
global effect) in $OPAL_PREFIX/etc/openmpi-mca-params.conf:

coll_tuned_use_dynamic_rules=1
coll_tuned_alltoallv_algorithm=1

A gratuitous hint: dual-Opteron systems are NUMAs so it makes sense to
activate process binding with --bind-to-core if you haven't already done so.
It prevents MPI processes from being migrated to other NUMA nodes while
running.

Kind regards,
Hristo
--
Hristo Iliev, Ph.D. -- High Performance Computing
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23, D 52074 Aachen (Germany)



-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
On Behalf Of Number Cruncher
Sent: Thursday, November 15, 2012 5:37 PM
To: Open MPI Users
Subject: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 
1.6.1


I've noticed a very significant (100%) slow down for MPI_Alltoallv calls as of
version 1.6.1.
* This is most noticeable for high-frequency exchanges over 1Gb ethernet
where process-to-process message sizes are fairly small (e.g. 100kbyte) and
much of the exchange matrix is sparse.
* 1.6.1 release notes mention "Switch the MPI_ALLTOALLV default algorithm
to a pairwise exchange", but I'm not clear what this means or how to switch
back to the old "non-default algorithm".

I attach a test program which 

Re: [OMPI users] [Open MPI] #3351: JAVA scatter error

2012-12-19 Thread Siegmar Gross
Hi

I shortened this email so that you get to my comments earlier.

> > In my opinion Datatype.Vector must set the size of the
> > base datatype as extent of the vector and not the true extent, because
> > MPI-Java doesn't provide a function to resize a datatype.
> 
> No, I think Datatype.Vector is doing the Right Thing in that it acts
> just like MPI_Type_vector.  We do want these to be *bindings*, after
> all -- meaning that they should be pretty much a 1:1 mapping to the
> C bindings.  
> 
> I think the real shortcoming is that there is no Datatype.Resized
> function.  That can be fixed.

Are you sure? That would at least solve one problem.


> > We should forget
> > ObjectScatterMain.java for the moment and concentrate on
> > ObjectBroadcastMain.java, which I have sent three days ago to the list,
> > because it has the same problem.
> > 
> > 1) ColumnSendRecvMain.java
> > 
> > I create a 2D-matrix with (Java books would use "double[][] matrix"
> > which is the same in my opinion, but I like C notation)
> > 
> > double matrix[][] = new double[P][Q];
> 
> I noticed that if I used [][] in my version of the Scatter program,
> I got random results.  But if I used [] and did my own offset
> indexing, it worked.

I think if you want a 2D-matrix you should use a Java matrix and not
a special one with your own offset indexing. In my opinion that is
something a C programmer can/would do (I'm a C programmer myself with
a little Java knowledge), but the benefit of Java is that the programmer
should not know about addresses, memory layouts and similar things. Now
I sound like my colleagues who always claim that my Java programs look
more like C programs than Java programs :-(. I know nothing about the
memory layout of a Java matrix or if the layout is stable during the
lifetime of the object, but I think that the Java interface should deal
with all these things if that is possible. I suppose that Open MPI will
not succeed in the Java world if it requires "special" matrices and a
special offset indexing. Perhaps some members of this list have very
good Java knowledge or even know the exact layout of Java matrices so
that Datatype.Vector can build a Java column vector from a Java matrix
which even contains valid values.


> If double[][] is a fundamentally different type (and storage format)
> than double[], what is MPI to do?  How can it tell the difference?
> 
> > It is easy to see that process 1 doesn't get column 0. Your
> > suggestion to allocate enough memory for a matrix (without defining
> > a matrix) and doing all index computations yourself is in my opinion
> > not applicable for a "normal" Java programmer (it's even hard for
> > most C programmers :-) ). Hopefully you have an idea how to solve
> > this problem so that all processes receive correct column values.
> 
> I'm afraid I don't, other than defining your own class which
> allocates memory contiguously, but overrides [] and [][]
> (I'm *assuming* you can do that in Java...?).

Does anybody else in this list know how it can be done?


> > 2) ObjectBroadcastMain.java
> > 
> > As I said above, it is my understanding that I can send a Java object
> > when I use MPI.OBJECT and that the MPI implementation must perform all
> > necessary tasks.
> 
> Remember: there is no standard for MPI and Java.  So there is no
> "must".  :-)

I know, and I'm grateful that you nevertheless try to offer a Java
interface. Hopefully you will not misunderstand my "must". I wasn't
complaining, but trying to express that a "normal" Java user would
expect to be able to implement an MPI program without special knowledge
about data layouts.


> This is one research implementation that was created.  We can update
> it and try to make it better, but we're somewhat crafting the rules
> as we go along here.
> 
> (BTW, if we continue detailed discussions about implementation,
> this conversation should probably move to the devel list...)
> 
> > Your interface for derived datatypes provides only
> > methods for discontiguous data and no method to create an MPI.OBJECT,
> > so that I have no idea what I would have to do to create one. The
> > object must be serializable so that you get the same values in a
> > heterogeneous environment. 
> > 
> > tyr java 146 mpiexec -np 2 java ObjectBroadcastMain
> > Exception in thread "main" java.lang.ClassCastException:
> >  MyData cannot be cast to [Ljava.lang.Object;
> >at mpi.Comm.Object_Serialize(Comm.java:207)
> >at mpi.Comm.Send(Comm.java:292)
> >at mpi.Intracomm.Bcast(Intracomm.java:202)
> >at ObjectBroadcastMain.main(ObjectBroadcastMain.java:44)
> > ...
> 
> After rooting around in the code a bit, I think I understand this
> stack trace a bit better now..
> 
> The code line in question is in the Object_Serialize method, where
> it calls:
> 
>   Object buf_els [] = (Object[])buf;
> 
> So it's trying to cast an (Object) to an (Object[]).  Apparently,
> this works for intrinsic Java types (e.g., int).  But it 

Re: [OMPI users] openmpi-1.9a1r27674 on Cygwin-1.7.17

2012-12-19 Thread marco atzeri

On 12/19/2012 11:04 AM, Siegmar Gross wrote:

Hi


On 12/18/2012 6:55 PM, Jeff Squyres wrote:

...but only of v1.6.x.


okay, adding development version on Christmas wishlist
;-)


Can you build the package with thread and Java support?

   --enable-mpi-java \
   --enable-opal-multi-threads \
   --enable-mpi-thread-multiple \
   --with-threads=posix \

I could build openmpi-1.6.4 with thread support without a problem
for Cygwin 1.7.17, but so far I have failed to build openmpi-1.9.



working on openmpi-1.7rc5.
It needs some cleaning, and afterwards I need to test.

Java surely not, as there is no Cygwin Java.

--with-threads=posix  yes

not tested yet
--enable-opal-multi-threads \
--enable-mpi-thread-multiple \





Kind regards

Siegmar



Regards
Marco




Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1

2012-12-19 Thread Paul Kapinos
Did you *really* want to dig into the code just in order to switch a default 
communication algorithm?


Note there are several ways to set the parameters; --mca on the command line is just 
one of them (suitable for quick online tests).


http://www.open-mpi.org/faq/?category=tuning#setting-mca-params

We 'tune' our Open MPI by setting environment variables

Best
Paul Kapinos



On 12/19/12 11:44, Number Cruncher wrote:

Having run some more benchmarks, the new default is *really* bad for our
application (2-10x slower), so I've been looking at the source to try and figure
out why.

It seems that the biggest difference will occur when the all_to_all is actually
sparse (e.g. our application); if most N-M process exchanges are zero in size
the 1.6 ompi_coll_tuned_alltoallv_intra_basic_linear algorithm will actually
only post irecv/isend for non-zero exchanges; any zero-size exchanges are
skipped. It then waits once for all requests to complete. In contrast, the new
ompi_coll_tuned_alltoallv_intra_pairwise will post the zero-size exchanges for
*every* N-M pair, and wait for each pairwise exchange. This is O(comm_size)
waits, many of which are zero. I'm not clear what optimizations there are for
zero-size isend/irecv, but surely there's a great deal more latency if each
pairwise exchange has to be confirmed complete before executing the next?

Relatedly, how would I direct OpenMPI to use the older algorithm
programmatically? I don't want the user to have to use "--mca" in their
"mpiexec". Is there a C API?

Thanks,
Simon


On 16/11/12 10:15, Iliev, Hristo wrote:

Hi, Simon,

The pairwise algorithm passes messages in a synchronised ring-like fashion
with increasing stride, so it works best when independent communication
paths could be established between several ports of the network
switch/router. Some 1 Gbps Ethernet equipment is not capable of doing so,
some is - it depends (usually on the price). This said, not all algorithms
perform the same given a specific type of network interconnect. For example,
on our fat-tree InfiniBand network the pairwise algorithm performs better.

You can switch back to the basic linear algorithm by providing the following
MCA parameters:

mpiexec --mca coll_tuned_use_dynamic_rules 1 --mca
coll_tuned_alltoallv_algorithm 1 ...

Algorithm 1 is the basic linear, which used to be the default. Algorithm 2
is the pairwise one.
You can also set these values as exported environment variables:

export OMPI_MCA_coll_tuned_use_dynamic_rules=1
export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
mpiexec ...

You can also put this in $HOME/.openmpi/mcaparams.conf or (to make it have
global effect) in $OPAL_PREFIX/etc/openmpi-mca-params.conf:

coll_tuned_use_dynamic_rules=1
coll_tuned_alltoallv_algorithm=1

A gratuitous hint: dual-Opteron systems are NUMAs so it makes sense to
activate process binding with --bind-to-core if you haven't already done so.
It prevents MPI processes from being migrated to other NUMA nodes while
running.

Kind regards,
Hristo
--
Hristo Iliev, Ph.D. -- High Performance Computing
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23, D 52074 Aachen (Germany)



-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
On Behalf Of Number Cruncher
Sent: Thursday, November 15, 2012 5:37 PM
To: Open MPI Users
Subject: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1

I've noticed a very significant (100%) slow down for MPI_Alltoallv calls as of
version 1.6.1.
* This is most noticeable for high-frequency exchanges over 1Gb ethernet
where process-to-process message sizes are fairly small (e.g. 100kbyte) and
much of the exchange matrix is sparse.
* 1.6.1 release notes mention "Switch the MPI_ALLTOALLV default algorithm
to a pairwise exchange", but I'm not clear what this means or how to switch
back to the old "non-default algorithm".

I attach a test program which illustrates the sort of usage in our MPI
application. I have run this as 32 processes on four nodes, over 1Gb
ethernet, each node with 2x Opteron 4180 (dual hex-core); rank 0,4,8,..
on node 1, rank 1,5,9, ... on node 2, etc.

It constructs an array of integers and a nProcess x nProcess exchange typical
of part of our application. This is then exchanged several thousand times.
Output from "mpicc -O3" runs shown below.

My guess is that 1.6.1 is hitting additional latency not present in 1.6.0. I also
attach a plot showing network throughput on our actual mesh generation
application. Nodes cfsc01-04 are running 1.6.0 and finish within 35 minutes.
Nodes cfsc05-08 are running 1.6.2 (started 10 minutes later) and take over an
hour to run. There seems to be a much greater network demand in the 1.6.1
version, despite the user-code and input data being identical.

Thanks for any help you can give,
Simon




Re: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1

2012-12-19 Thread Number Cruncher
Having run some more benchmarks, the new default is *really* bad for our 
application (2-10x slower), so I've been looking at the source to try 
and figure out why.


It seems that the biggest difference will occur when the all_to_all is 
actually sparse (e.g. our application); if most N-M process exchanges 
are zero in size the 1.6 ompi_coll_tuned_alltoallv_intra_basic_linear 
algorithm will actually only post irecv/isend for non-zero exchanges; 
any zero-size exchanges are skipped. It then waits once for all requests 
to complete. In contrast, the new 
ompi_coll_tuned_alltoallv_intra_pairwise will post the zero-size 
exchanges for *every* N-M pair, and wait for each pairwise exchange. 
This is O(comm_size) waits, many of which are zero. I'm not clear what 
optimizations there are for zero-size isend/irecv, but surely there's a 
great deal more latency if each pairwise exchange has to be confirmed 
complete before executing the next?


Relatedly, how would I direct OpenMPI to use the older algorithm 
programmatically? I don't want the user to have to use "--mca" in their 
"mpiexec". Is there a C API?


Thanks,
Simon
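
(One hedged, untested possibility, based on the OMPI_MCA_* environment variables
described in the quoted message below: export the parameters from the program
itself before MPI_Init, which is when Open MPI is expected to read them. This is
only a sketch, not an official Open MPI API:)

/* Untested sketch: select the old linear alltoallv algorithm by setting
   the MCA parameters in the environment before MPI_Init reads them. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* must be done before MPI_Init */
    setenv("OMPI_MCA_coll_tuned_use_dynamic_rules", "1", 1);
    setenv("OMPI_MCA_coll_tuned_alltoallv_algorithm", "1", 1);

    MPI_Init(&argc, &argv);
    /* ... MPI_Alltoallv calls as usual ... */
    MPI_Finalize();
    return 0;
}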


On 16/11/12 10:15, Iliev, Hristo wrote:

Hi, Simon,

The pairwise algorithm passes messages in a synchronised ring-like fashion
with increasing stride, so it works best when independent communication
paths could be established between several ports of the network
switch/router. Some 1 Gbps Ethernet equipment is not capable of doing so,
some is - it depends (usually on the price). This said, not all algorithms
perform the same given a specific type of network interconnect. For example,
on our fat-tree InfiniBand network the pairwise algorithm performs better.

You can switch back to the basic linear algorithm by providing the following
MCA parameters:

mpiexec --mca coll_tuned_use_dynamic_rules 1 --mca
coll_tuned_alltoallv_algorithm 1 ...

Algorithm 1 is the basic linear, which used to be the default. Algorithm 2
is the pairwise one.
  
You can also set these values as exported environment variables:


export OMPI_MCA_coll_tuned_use_dynamic_rules=1
export OMPI_MCA_coll_tuned_alltoallv_algorithm=1
mpiexec ...

You can also put this in $HOME/.openmpi/mcaparams.conf or (to make it have
global effect) in $OPAL_PREFIX/etc/openmpi-mca-params.conf:

coll_tuned_use_dynamic_rules=1
coll_tuned_alltoallv_algorithm=1

A gratuitous hint: dual-Opteron systems are NUMAs so it makes sense to
activate process binding with --bind-to-core if you haven't already done so.
It prevents MPI processes from being migrated to other NUMA nodes while
running.

Kind regards,
Hristo
--
Hristo Iliev, Ph.D. -- High Performance Computing
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23,  D 52074  Aachen (Germany)



-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
On Behalf Of Number Cruncher
Sent: Thursday, November 15, 2012 5:37 PM
To: Open MPI Users
Subject: [OMPI users] MPI_Alltoallv performance regression 1.6.0 to 1.6.1

I've noticed a very significant (100%) slow down for MPI_Alltoallv calls as of
version 1.6.1.
* This is most noticeable for high-frequency exchanges over 1Gb ethernet
where process-to-process message sizes are fairly small (e.g. 100kbyte) and
much of the exchange matrix is sparse.
* 1.6.1 release notes mention "Switch the MPI_ALLTOALLV default algorithm
to a pairwise exchange", but I'm not clear what this means or how to switch
back to the old "non-default algorithm".

I attach a test program which illustrates the sort of usage in our MPI
application. I have run this as 32 processes on four nodes, over 1Gb
ethernet, each node with 2x Opteron 4180 (dual hex-core); rank 0,4,8,..
on node 1, rank 1,5,9, ... on node 2, etc.

It constructs an array of integers and a nProcess x nProcess exchange typical
of part of our application. This is then exchanged several thousand times.
Output from "mpicc -O3" runs shown below.

My guess is that 1.6.1 is hitting additional latency not present in 1.6.0. I also
attach a plot showing network throughput on our actual mesh generation
application. Nodes cfsc01-04 are running 1.6.0 and finish within 35 minutes.
Nodes cfsc05-08 are running 1.6.2 (started 10 minutes later) and take over an
hour to run. There seems to be a much greater network demand in the 1.6.1
version, despite the user-code and input data being identical.

Thanks for any help you can give,
Simon





Re: [OMPI users] openmpi-1.9a1r27674 on Cygwin-1.7.17

2012-12-19 Thread Siegmar Gross
Hi

> On 12/18/2012 6:55 PM, Jeff Squyres wrote:
> > ...but only of v1.6.x.
> 
> okay, adding development version on Christmas wishlist
> ;-)

Can you build the package with thread and Java support?

  --enable-mpi-java \
  --enable-opal-multi-threads \
  --enable-mpi-thread-multiple \
  --with-threads=posix \

I could build openmpi-1.6.4 with thread support without a problem
for Cygwin 1.7.17, but so far I have failed to build openmpi-1.9.


> > On Dec 18, 2012, at 10:32 AM, Ralph Castain wrote:
> >
> >> Also, be aware that the Cygwin folks have already released a
> >> fully functional port of OMPI to that environment as a package.
> >> So if you want OMPI on Cygwin, you can just download and
> >> install the Cygwin package - no need to build it yourself.


Kind regards

Siegmar



Re: [OMPI users] Infiniband errors

2012-12-19 Thread Syed Ahsan Ali
Dear John

I found this output of ibstatus on some nodes (most probably the ones
causing the problem):
[root@compute-01-08 ~]# ibstatus
Fatal error:  device '*': sys files not found
(/sys/class/infiniband/*/ports)

Does this show any hardware or software issue?

Thanks




On Wed, Nov 28, 2012 at 3:17 PM, John Hearns  wrote:

> Those diagnostics are from Openfabrics.
> What type of infiniband card do you have?
> What drivers are you using?
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>