Re: [OMPI users] Problems on large clusters

2011-06-22 Thread Gilbert Grosdidier
2040-core job. I use 255 nodes with one MPI task on each node and 8-way OpenMP. I don't need -np and -machinefile, because mpiexec picks up this information from PBS. Thorsten On Tuesday, June 21, 2011, Gilbert Grosdidier wrote: Bonjour Thorsten, Could you please be a little bit
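A minimal sketch of the kind of PBS batch script this setup implies (one MPI rank per node, eight OpenMP threads per rank); the resource-selection line, job name and executable are assumptions, not taken from the thread:

```bash
#!/bin/bash
#PBS -N hybrid-job
#PBS -l select=255:ncpus=8:mpiprocs=1   # hypothetical: 255 nodes, 1 MPI rank per node
#PBS -l walltime=02:00:00

cd "$PBS_O_WORKDIR"
export OMP_NUM_THREADS=8                # 8-way OpenMP inside each rank

# No -np / -machinefile needed: mpiexec reads the allocation from PBS itself
mpiexec ./my_hybrid_app
```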

Re: [OMPI users] Problems on large clusters

2011-06-21 Thread Gilbert Grosdidier

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Gilbert Grosdidier
with np=8? On Jan 7, 2011, at 9:49 AM, Gilbert Grosdidier wrote: Hi Jeff, Thanks for taking care of this. Here is what I got on a worker node: mpirun --mca mpi_paffinity_alone 1 /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get 0x0001 Is this what is expected, please? Or sho

Re: [OMPI users] Issue with : btl_openib.c (OMPI 1.4.3)

2011-01-07 Thread Gilbert Grosdidier
:30 PM, Gilbert Grosdidier wrote: John, Thanks, more info below. On 17/12/2010 17:32, John Hearns wrote: On 17 December 2010 15:47, Gilbert Grosdidier wrote: gg= I don't know, and firmware_revs does not seem to be available. The only thing I got on a worker node was with lspci: If yo

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Gilbert Grosdidier

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Gilbert Grosdidier
On Jan 7, 2011, at 9:49 AM, Gilbert Grosdidier wrote: Hi Jeff, Thanks for taking care of this. Here is what I got on a worker node: mpirun --mca mpi_paffinity_alone 1 /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get 0x0001 Is this what is expected, please? Or should I try

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Gilbert Grosdidier
Regards, Gilbert. On 7 Jan 11 at 15:35, Jeff Squyres wrote: On Jan 6, 2011, at 11:23 PM, Gilbert Grosdidier wrote: lstopo: Machine (35GB) NUMANode L#0 (P#0 18GB) + Socket L#0 + L3 L#0 (8192KB) L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 PU L#0 (P#0) PU L#1 (P#8) L2 L#1 (256KB) + L1 L#1 (3

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-06 Thread Gilbert Grosdidier
P#7) PU L#15 (P#15) Tests with --bind-to-core are under way ... What is your conclusion, please? Thanks, G. On 06/01/2011 23:16, Jeff Squyres wrote: On Jan 6, 2011, at 5:07 PM, Gilbert Grosdidier wrote: Yes Jeff, I'm pretty sure indeed that hyperthreading is enabled, since

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-06 Thread Gilbert Grosdidier
the text output from running hwloc's "lstopo" command on your compute nodes? I ask because if hyperthreading is enabled, OMPI might be assigning one process per *hyperthread* (vs. one process per *core*). And that could be disastrous for performance. On Dec 22, 2010, at 2:25 PM,
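A small sketch of the check being discussed here: inspect the topology with lstopo, then force one rank per physical core; the process count and executable name are placeholders, and the option spellings should be confirmed against the installed 1.4.x man pages:

```bash
# Does each core expose two PUs (hyperthreads)? In the lstopo text output this shows up
# as e.g. PU P#0 and PU P#8 listed under the same Core L#0.
lstopo | grep -A2 "Core L#0"

# Bind one MPI process per physical core instead of letting ranks land on hyperthreads
# (--report-bindings prints the resulting binding of each rank, if this build supports it)
mpirun --bind-to-core --report-bindings -np 16 ./my_app
```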

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-06 Thread Gilbert Grosdidier
Dec 22, 2010, at 2:25 PM, Gilbert Grosdidier wrote: Hi David, Yes, I set mpi_paffinity_alone to 1. Is that right and sufficient, please? Thanks for your help, Best, G. On 22/12/2010 20:18, David Singleton wrote: Is the same level of process and memory affinity or binding being used

Re: [OMPI users] Granular locks?

2011-01-05 Thread Gilbert Grosdidier
Hi Gijsbert, Thank you for this proposal, I think it could be useful for our LQCD application, at least for further evaluations. How could I get to the code, please? Thanks in advance for your help, Best, G. On 03/01/2011 22:36, Gijsbert Wiesenekker wrote: On Oct 2, 2010, at 10:5

Re: [OMPI users] Trouble with Memlock when using OpenIB on an SGI ICE Cluster

2010-12-31 Thread Gilbert Grosdidier
when all-to-all communication is not required on a big cluster. Could someone comment on this? More info on request. Thanks, Happy New Year to you all, G. On 29/11/2010 16:58, Gilbert Grosdidier wrote: Bonjour John, Thanks for your feedback, but my investigations so far di

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2010-12-22 Thread Gilbert Grosdidier
Hi David, Yes, I set mpi_paffinity_alone to 1. Is that right and sufficient, please? Thanks for your help, Best, G. On 22/12/2010 20:18, David Singleton wrote: Is the same level of process and memory affinity or binding being used? On 12/21/2010 07:45 AM, Gilbert Grosdidier

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores: very poor performance

2010-12-22 Thread Gilbert Grosdidier
"lots of short messages", or "lots of long messages", etc. It sounds like there is some repeated set of MPI exchanges, so maybe that set can be extracted and run without the complexities of the application. Anyhow, some profiling might help guide one to the problem. Gilber

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores: very poor performance

2010-12-22 Thread Gilbert Grosdidier
Don't forget that MPT has some optimizations OpenMPI may not have, such as "overriding" free(). This way, MPT can get a huge performance boost if you're allocating and freeing memory, and the same happens if you communicate often. Matthieu 2010/12/21 Gilbert Grosdidier: Hi George, Thank

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores: very poor performance

2010-12-21 Thread Gilbert Grosdidier
with the --byslot / --bynode options to see how this affects the performance of your application. For the hardcore cases we provide a rankfile feature. More info at: http://www.open-mpi.org/faq/?category=tuning#using-paffinity Enjoy, george. On Dec 20, 2010, at 15:45, Gilbert Grosdidier wrote: Yes,
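As an illustration of the rankfile feature George mentions, here is a minimal sketch that pins four ranks to explicit sockets and cores; the hostnames, slot numbers and application name are invented for the example, and the exact syntax should be checked against the FAQ linked above:

```bash
# my_rankfile (hypothetical hosts r1i0n0 / r1i0n1): one rank on core 0 of each socket
cat > my_rankfile <<'EOF'
rank 0=r1i0n0 slot=0:0
rank 1=r1i0n0 slot=1:0
rank 2=r1i0n1 slot=0:0
rank 3=r1i0n1 slot=1:0
EOF

# Launch with the explicit rank-to-core mapping
mpirun -np 4 --rankfile my_rankfile ./my_app
```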

Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2010-12-20 Thread Gilbert Grosdidier

[OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2010-12-20 Thread Gilbert Grosdidier
Bonjour, I am now at a loss with my running of OpenMPI (namely 1.4.3) on an SGI Altix cluster with 2048 or 4096 cores, running over Infiniband. After fixing several rather obvious failures with Ralph's, Jeff's and John's help, I am now facing the bottom of this story since: - there are no more obv

Re: [OMPI users] Issue with : btl_openib.c (OMPI 1.4.3)

2010-12-17 Thread Gilbert Grosdidier
John, Thanks, more info below. On 17/12/2010 17:32, John Hearns wrote: On 17 December 2010 15:47, Gilbert Grosdidier wrote: gg= I don't know, and firmware_revs does not seem to be available. The only thing I got on a worker node was with lspci: If you log into a compute node the co

Re: [OMPI users] Issue with : btl_openib.c (OMPI 1.4.3)

2010-12-17 Thread Gilbert Grosdidier
Bonjour John, First, thanks for your feedback. On 17 Dec 10 at 16:13, John Hearns wrote: On 17 December 2010 14:45, Gilbert Grosdidier wrote: Bonjour, About this issue, for which I got NO feedback ;-) Gilbert, as you have an SGI cluster, have you filed a support request with SGI

Re: [OMPI users] Issue with : btl_openib.c (OMPI 1.4.3)

2010-12-17 Thread Gilbert Grosdidier
the configure step. Thanks, Best, G. On 15 Dec 10 at 08:59, Gilbert Grosdidier wrote: Bonjour, Running with OpenMPI 1.4.3 on an SGI Altix cluster with 2048 cores, I got this error message on all cores, right at startup: btl_openib.c:211:adjust_cq] cannot resize completion queue, error: 12

Re: [OMPI users] Issue with : mca_oob_tcp_peer_recv_connect_ack on SGI Altix

2010-12-16 Thread Gilbert Grosdidier
Bonjour Jeff, On 16/12/2010 01:40, Jeff Squyres wrote: On Dec 15, 2010, at 3:24 PM, Ralph Castain wrote: I am not using the TCP BTL, only the OPENIB one. Does this change the number of sockets in use per node, please? I believe the openib btl opens sockets for connection purposes, so the cou

Re: [OMPI users] Issue with : mca_oob_tcp_peer_recv_connect_ack on SGI Altix

2010-12-15 Thread Gilbert Grosdidier
of nodes (1k nodes, i.e. 8k cores) that I could ask him (her) about the right setup? Thanks, Best, G. On 15/12/2010 21:03, Ralph Castain wrote: On Dec 15, 2010, at 12:30 PM, Gilbert Grosdidier wrote: Bonsoir Ralph, On 15/12/2010 18:45, Ralph Castain wrote: It looks like all the

Re: [OMPI users] Issue with : mca_oob_tcp_peer_recv_connect_ack on SGI Altix

2010-12-15 Thread Gilbert Grosdidier
Bonsoir Ralph, On 15/12/2010 18:45, Ralph Castain wrote: It looks like all the messages are flowing within a single job (all three processes mentioned in the error have the same identifier). The only possibility I can think of is that somehow you are reusing ports - is it possible your system d

Re: [OMPI users] Issue with : mca_oob_tcp_peer_recv_connect_ack on SGI Altix

2010-12-15 Thread Gilbert Grosdidier
18992 is indeed the master one on r36i3n15. Thanks, Best, G. On Dec 15, 2010, at 1:05 AM, Gilbert Grosdidier wrote: Bonjour, Running with OpenMPI 1.4.3 on an SGI Altix cluster with 4096 cores, I got this error message, right at startup: mca_oob_tcp_peer_recv_connect_ack: received

[OMPI users] Issue with : mca_oob_tcp_peer_recv_connect_ack on SGI Altix

2010-12-15 Thread Gilbert Grosdidier
Bonjour, Running with OpenMPI 1.4.3 on an SGI Altix cluster with 4096 cores, I got this error message, right at startup: mca_oob_tcp_peer_recv_connect_ack: received unexpected process identifier [[13816,0],209] and the whole job then spins for an undefined period, without crashing/ab

[OMPI users] Issue with : btl_openib.c (OMPI 1.4.3)

2010-12-15 Thread Gilbert Grosdidier
Bonjour, Running with OpenMPI 1.4.3 on an SGI Altix cluster with 2048 cores, I got this error message on all cores, right at startup: btl_openib.c:211:adjust_cq] cannot resize completion queue, error: 12 What could be the culprit, please? Is there a workaround? Which parameter needs to be tuned
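Error 12 is ENOMEM, which on OpenFabrics systems is most often a locked-memory (memlock) limit inherited by the MPI processes rather than a bug in the application; a sketch of the usual first checks, assuming nothing about this particular cluster's configuration:

```bash
# On a compute node, check the locked-memory limit the MPI processes actually see
ulimit -l            # should report "unlimited" (or a very large value) for openib

# List the openib BTL tunables shipped with this 1.4.3 install, to see which
# completion-queue / receive-queue parameters exist before guessing at values
ompi_info --param btl openib | less

# Typical system-wide fix (admin rights required): raise memlock for all users
# in /etc/security/limits.conf, e.g.
#   * soft memlock unlimited
#   * hard memlock unlimited
```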

Re: [OMPI users] jobs with more that 2, 500 processes will not even start

2010-12-14 Thread Gilbert Grosdidier

[OMPI users] Use of -mca pml csum

2010-12-14 Thread Gilbert Grosdidier
Bonjour, Since I'm very suspicious about the condition of the IB network on my cluster, I'm trying to use the csum PML feature of OMPI (1.4.3). But I have a question: what happens if the checksum is different on both ends? Is there a warning printed, a flag set by the MPI_(I)recv or equ
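For reference, enabling the checksumming PML is just an MCA selection at launch time; the process count and executable are placeholders:

```bash
# Select the csum PML instead of the default one for this run
mpirun --mca pml csum -np 256 ./my_app

# Confirm the component is actually present in this build
ompi_info | grep -i csum
```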

[OMPI users] Trouble with IPM & OpenMPI on SGI Altix

2010-12-08 Thread Gilbert Grosdidier
Bonjour, I have trouble when trying to compile & run IPM on an SGI Altix cluster. The issue is that this cluster provides a default SGI MPT implementation of MPI, but I want to use a private installation of OpenMPI 1.4.3 instead. 1) When I compile IPM as recommended, everything works fine, bu
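A sketch of how a private OpenMPI install is usually made to win over the system MPT at both compile and run time; the install prefix and the IPM configure invocation are assumptions, since neither appears in the thread:

```bash
# Hypothetical prefix of the private OpenMPI 1.4.3 install
OMPI_PREFIX=$HOME/local/openmpi-1.4.3

# Make its compiler wrappers and libraries shadow the SGI MPT ones
export PATH=$OMPI_PREFIX/bin:$PATH
export LD_LIBRARY_PATH=$OMPI_PREFIX/lib:$LD_LIBRARY_PATH

# Sanity check: the wrapper must report the OpenMPI install, not MPT
which mpicc && mpicc --showme

# Then build IPM against these wrappers (the exact configure flags depend on the IPM version)
cd ipm && ./configure MPICC=$OMPI_PREFIX/bin/mpicc && make
```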

Re: [OMPI users] Trouble with Memlock when using OpenIB on an SGI ICE Cluster

2010-11-29 Thread Gilbert Grosdidier
2010 16:31, Gilbert Grosdidier wrote: Bonjour, Bonjour Gilbert. I manage ICE clusters also. Please could you have a look at /etc/init.d/pbs on the compute blades? Do you have something like: if [ "${PBS_START_MOM}" -gt 0 ] ; then if check_prog "mom" ; then echo

Re: [OMPI users] mpool_sm_max_size disappeared ?

2010-11-29 Thread Gilbert Grosdidier
Bonjour, I found the parameter mpool_sm_max_size in this post: http://www.open-mpi.org/community/lists/devel/2008/11/4883.php But I was unable to spot it in the 'ompi_info -all' output for v 1.4.3. Does it still exist? If not, which parameter replaces it, please? Also, is
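A quick way to answer this kind of question directly from the installed build; the grep pattern is only illustrative:

```bash
# List every parameter the shared-memory mpool component exposes in this install
ompi_info --param mpool sm

# Or search the full dump for anything still mentioning the old name
ompi_info --all | grep -i mpool_sm
```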

Re: [OMPI users] Trouble with Memlock when using OpenIB on an SGI ICE Cluster

2010-11-25 Thread Gilbert Grosdidier
On 20 November 2010 16:31, Gilbert Grosdidier wrote: Bonjour, Bonjour Gilbert. I manage ICE clusters also. Please could you have a look at /etc/init.d/pbs on the compute blades? Do you have something like: if [ "${PBS_START_MOM}" -gt 0 ] ; then
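The reason for looking at the PBS init script is that MPI processes launched through PBS inherit the limits of pbs_mom, not those of the user's login shell; a hedged sketch of the usual workaround (the exact file layout varies between ICE releases):

```bash
# In /etc/init.d/pbs (or a file it sources), raise the locked-memory limit
# *before* pbs_mom is started, so every job it spawns inherits it:
ulimit -l unlimited

# Then verify from inside an interactive batch job that the limit really propagated:
#   qsub -I ...
ulimit -l
```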

[OMPI users] Trouble with Memlock when using OpenIB on an SGI ICE Cluster (fwd)

2010-11-20 Thread Gilbert Grosdidier
Bonjour, I am afraid I got a weird issue when running an OpenMPI job using OpenIB on an SGI ICE cluster with 4096 cores (or larger), and the FAQ does not help. The OMPI version is 1.4.1, and it is running just fine with a smaller number of cores (up to 512). The error message is the following

[OMPI users] Checksuming in openmpi 1.4.1

2010-08-31 Thread Gilbert Grosdidier
should I set to activate it? - Is there a time penalty for using it, please? Thanks in advance for any help. -- Regards, Gilbert.

Re: [OMPI users] open mpi on non standard ssh port

2009-03-17 Thread Gilbert Grosdidier
question is stupid but I could not find a solution via Google or the search function ... cheers Bernhard
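Two common ways to make the rsh/ssh launcher use a non-default port; the port number, host pattern, and the MCA parameter name are assumptions to be checked against the installed release (older releases used a pls_* rather than plm_* prefix):

```bash
# Option 1: tell ssh itself about the port, per host, in ~/.ssh/config
cat >> ~/.ssh/config <<'EOF'
Host node*
    Port 2222
EOF

# Option 2: point Open MPI's rsh launcher at an ssh command with the port baked in
mpirun --mca plm_rsh_agent "ssh -p 2222" -np 8 --hostfile hosts ./my_app
```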

Re: [OMPI users] Problem with feupdateenv

2008-12-10 Thread Gilbert Grosdidier
readv failed: Connection reset by peer (104) mpirun noticed that job rank 0 with PID 17781 on node master exited on signal 11 (Segmentation fault). 1 additional process aborted (not shown) But this is

[OMPI users] mca btl_openib_flags default value

2008-11-04 Thread Gilbert Grosdidier
ted or not? I could understand any value between 1 & 7, but what does 54 mean, please? Does it behave like 6 if the unexpected bits are removed? Thanks, Gilbert

Re: [OMPI users] Working with a CellBlade cluster

2008-10-31 Thread Gilbert Grosdidier
One way to check if the message goes via IB or SM may be to check the counters in /sys/class/infiniband. Regards, Mi
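A sketch of the counter check Mi suggests; HCA and port names differ per system, so the wildcard paths below are illustrative:

```bash
# Snapshot the InfiniBand traffic counters before and after a run
for c in /sys/class/infiniband/*/ports/*/counters/port_xmit_data \
         /sys/class/infiniband/*/ports/*/counters/port_rcv_data; do
    echo "$c: $(cat "$c")"
done

# If these counters barely move while the job communicates heavily, the traffic is
# most likely going through the shared-memory (sm) BTL rather than openib.
```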

Re: [OMPI users] problem running Open MPI on Cells

2008-10-31 Thread Gilbert Grosdidier

Re: [OMPI users] Working with a CellBlade cluster

2008-10-29 Thread Gilbert Grosdidier
no formal FAQ, for multiple reasons, but you can read how to use it in the attached scratch (there were a few name changes of the params, so check with ompi_info). Shared memory is used between processes that share the same machine, and openib is used between different
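The sm-on-node / openib-between-nodes split described here is the default BTL selection; a minimal way to state it explicitly, or to force a single transport while debugging, with a placeholder executable:

```bash
# Explicitly allow shared memory on-node, openib off-node, plus the mandatory self loopback
mpirun --mca btl sm,openib,self -np 32 ./my_app

# Debugging variant: disable sm so that even on-node traffic goes through the HCA
mpirun --mca btl openib,self -np 32 ./my_app
```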

Re: [OMPI users] Working with a CellBlade cluster

2008-10-28 Thread Gilbert Grosdidier
To: "Open MPI Users" <us...@open-mpi.org> Subject: Re: [OMPI users] Working with a CellBlade cluster Hi, If I understand you correctly the most

[OMPI users] Working with a CellBlade cluster

2008-10-19 Thread Gilbert Grosdidier
Working with a CellBlade cluster (QS22), the requirement is to have one instance of the executable running on each socket of the blade (there are 2 sockets). The application is of the 'domain decomposition' type, and each instance is required to often send/receive data with both the remote blades