2040-core job. I use 255 nodes with one MPI task on each node, and
use 8-way OpenMP.
I don't need -np and -machinefile, because mpiexec picks up this
information from PBS.
Thorsten
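A setup like the one Thorsten describes could be sketched as a PBS job script. The resource-request syntax, walltime, and application name below are illustrative assumptions (PBS versions and site configurations differ), not taken from his mail:

```shell
#!/bin/bash
# Hypothetical PBS script: 255 nodes, one MPI task per node,
# 8 OpenMP threads per task = 2040 cores in total.
#PBS -l select=255:ncpus=8:mpiprocs=1
#PBS -l walltime=01:00:00

cd "$PBS_O_WORKDIR"
export OMP_NUM_THREADS=8
# No -np or -machinefile needed: mpiexec picks the task count and
# host list up from the PBS environment, as described above.
mpiexec ./my_app   # ./my_app is a placeholder
```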
On Tuesday, June 21, 2011, Gilbert Grosdidier wrote:
Bonjour Thorsten,
Could you please be a little bit
--
*---------*
Gilbert Grosdidier gilbert.grosdid...@in2p3.fr
LAL / IN2P3 / CNRS Phone : +33 1 6446 8909
Faculté des Sciences, Bat. 200 Fax : +33 1 6446 8546
B.P. 34, F-91898 Orsay Cedex (FRANCE)
*---------*
with np=8?
On Jan 7, 2011, at 9:49 AM, Gilbert Grosdidier wrote:
Hi Jeff,
Thanks for taking care of this.
Here is what I got on a worker node:
mpirun --mca mpi_paffinity_alone 1 /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get
0x0001
Is this what is expected, please? Or should I try
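For reference, the 0x0001 value above is a CPU bitmask: bit N set means the process is bound to PU N, so a single low bit looks like what one would expect for rank 0 with processor affinity enabled. A small shell sketch (variable names are mine) to decode such a mask:

```shell
# Decode a hwloc CPU bitmask into the PU indices it covers.
mask=0x0001        # value reported by hwloc-bind --get
val=$((mask))      # hex string to number
bits=""
i=0
while [ "$val" -gt 0 ]; do
  if [ $(( val & 1 )) -eq 1 ]; then
    bits="$bits $i"        # bit i is set: bound to PU i
  fi
  val=$(( val >> 1 ))
  i=$(( i + 1 ))
done
echo "bound to PU(s):$bits"
```

For 0x0001 this prints `bound to PU(s): 0`.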
:30 PM, Gilbert Grosdidier wrote:
John,
Thanks, more info below.
Le 17/12/2010 17:32, John Hearns a écrit :
On 17 December 2010 15:47, Gilbert Grosdidier
wrote:
gg= I don't know, and firmware_revs does not seem to be available.
Only thing I got on a worker node was with lspci :
If you log into a compute node the co
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
On 7 Jan 2011, at 15:35, Jeff Squyres wrote:
On Jan 6, 2011, at 11:23 PM, Gilbert Grosdidier wrote:
lstopo
Machine (35GB)
  NUMANode L#0 (P#0 18GB) + Socket L#0 + L3 L#0 (8192KB)
    L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#8)
    L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
    [...]
      PU L#14 (P#7)
      PU L#15 (P#15)
Tests with --bind-to-core are under way ...
What is your conclusion, please?
Thanks, G.
On 06/01/2011 23:16, Jeff Squyres wrote:
On Jan 6, 2011, at 5:07 PM, Gilbert Grosdidier wrote:
Yes Jeff, I'm pretty sure indeed that hyperthreading is enabled, since
the text output from running hwloc's "lstopo" command on your
compute nodes?
I ask because if hyperthreading is enabled, OMPI might be assigning one process
per *hyperthread* (vs. one process per *core*). And that could be disastrous
for performance.
On Dec 22, 2010, at 2:25 PM, Gilbert Grosdidier wrote:
Hi David,
Yes, I set mpi_paffinity_alone to 1. Is that right and sufficient, please?
Thanks for your help, Best, G.
On 22/12/2010 20:18, David Singleton wrote:
Is the same level of process and memory affinity or binding being used?
Hi Gijsbert,
Thank you for this proposal, I think it could be useful for our LQCD
application,
at least for further evaluations. How could I get the code, please?
Thanks in advance for your help, Best, G.
On 03/01/2011 22:36, Gijsbert Wiesenekker wrote:
On Oct 2, 2010, at 10:5
when all-to-all communication is not
required on a big cluster.
Could someone comment on this ?
More info on request.
Thanks, Happy New Year to you all, G.
On 29/11/2010 16:58, Gilbert Grosdidier wrote:
Bonjour John,
Thanks for your feedback, but my investigations so far di
On 12/21/2010 07:45 AM, Gilbert Grosdidier
"lots of short messages", or "lots of long messages", etc. It sounds like
there is some repeated set of MPI exchanges, so maybe that set can be
extracted and run without the complexities of the application.
Anyhow, some profiling might help guide one to the problem.
Gilber
Don't forget that MPT has some optimizations OpenMPI may not have, such as
"overriding" free(). This way, MPT can get a huge performance boost
if you're allocating and freeing memory often, and the same happens if you
communicate often.
Matthieu
2010/12/21 Gilbert Grosdidier:
Hi George,
Thank
with the --byslot / --bynode options to see how
this affects the performance of your application.
For the hardcore cases we provide a rankfile feature. More info at:
http://www.open-mpi.org/faq/?category=tuning#using-paffinity
Enjoy,
george.
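The rankfile feature George mentions can be sketched like this; the hostnames, slot numbers, and application name are made up for illustration (check the FAQ link above and your mpirun man page for the exact syntax your version accepts):

```shell
# Write a hypothetical rankfile pinning 4 ranks across 2 two-socket nodes.
cat > myrankfile <<'EOF'
rank 0=node001 slot=0
rank 1=node001 slot=1
rank 2=node002 slot=0
rank 3=node002 slot=1
EOF
# Then (not run here): mpirun -np 4 -rf myrankfile ./app
```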
On Dec 20, 2010, at 15:45 , Gilbert Grosdidier wrote:
Yes,
--
Cordialement, Gilbert.
Bonjour,
I am now at a loss with my running of OpenMPI (namely 1.4.3)
on a SGI Altix cluster with 2048 or 4096 cores, running over Infiniband.
After fixing several rather obvious failures with Ralph's, Jeff's and John's help,
I am now facing the bottom of this story, since:
- there are no more obv
Bonjour John,
First, Thanks for your feedback.
On 17 Dec 2010, at 16:13, John Hearns wrote:
On 17 December 2010 14:45, Gilbert Grosdidier
wrote:
Bonjour,
About this issue, for which I got NO feedback ;-)
Gilbert, as you have an SGI cluster, have you filed a support
request to SGI
the configure step.
Thanks, Best, G.
On 15 Dec 2010, at 08:59, Gilbert Grosdidier wrote:
Bonjour,
Running with OpenMPI 1.4.3 on an SGI Altix cluster with 2048 cores,
I got
this error message on all cores, right at startup :
btl_openib.c:211:adjust_cq] cannot resize completion queue
Bonjour Jeff,
On 16/12/2010 01:40, Jeff Squyres wrote:
On Dec 15, 2010, at 3:24 PM, Ralph Castain wrote:
I am not using the TCP BTL, only the OPENIB one. Does this change the number of
sockets in use per node, please ?
I believe the openib btl opens sockets for connection purposes, so the cou
of nodes
(1k nodes, ie 8k cores) that I could ask him (her) about the right setup ?
Thanks, Best, G.
On 15/12/2010 21:03, Ralph Castain wrote:
On Dec 15, 2010, at 12:30 PM, Gilbert Grosdidier wrote:
Bonsoir Ralph,
On 15/12/2010 18:45, Ralph Castain wrote:
It looks like all the messages are flowing within a single job (all
three processes mentioned in the error have the same identifier). Only
possibility I can think of is that somehow you are reusing ports - is
it possible your system d
18992 is indeed the master one on r36i3n15.
Thanks, Best, G.
On Dec 15, 2010, at 1:05 AM, Gilbert Grosdidier wrote:
Bonjour,
Running with OpenMPI 1.4.3 on an SGI Altix cluster with 4096 cores,
I got
this error message, right at startup :
mca_oob_tcp_peer_recv_connect_ack: received
Bonjour,
Running with OpenMPI 1.4.3 on an SGI Altix cluster with 4096 cores, I got
this error message, right at startup :
mca_oob_tcp_peer_recv_connect_ack: received unexpected process
identifier [[13816,0],209]
and the whole job is going to spin for an undefined period, without
crashing/ab
Bonjour,
Running with OpenMPI 1.4.3 on an SGI Altix cluster with 2048 cores, I got
this error message on all cores, right at startup :
btl_openib.c:211:adjust_cq] cannot resize completion queue, error: 12
What could be the culprit please ?
Is there a workaround ?
What parameter is to be tuned
--
Cordialement, Gilbert.
Bonjour,
Since I'm very suspicious about the condition of the IB network on
my cluster,
I'm trying to use the csum pml feature of OMPI (1.4.3).
But I have a question: what happens if the Checksum is different on
both ends ?
Is there a warning printed, a flag set by the MPI_(I)recv or equ
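Selecting the checksum PML could be sketched as below; the invocation is illustrative (./app and the rank count are placeholders), and what exactly happens on a mismatch is the open question above, so check the documentation or source of the installed release:

```shell
# Hypothetical invocation: ask Open MPI 1.4.x to use the csum PML
# instead of the default ob1, enabling per-message checksums.
mpirun --mca pml csum -np 16 ./app
```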
Bonjour,
I have trouble when trying to compile & run IPM on an SGI Altix cluster.
The issue is: this cluster is providing a default SGI MPT
implementation of MPI,
but I want to use a private installation of OpenMPI 1.4.3 instead.
1) When I compile IPM as recommended, everything works fine, bu
On 20 November 2010 16:31, Gilbert Grosdidier wrote:
Bonjour,
Bonjour Gilbert.
I manage ICE clusters also.
Please could you have look at /etc/init.d/pbs on the compute blades?
Do you have something like:
if [ "${PBS_START_MOM}" -gt 0 ] ; then
    if check_prog "mom" ; then
        echo
Bonjour,
I found this parameter mpool_sm_max_size in this post:
http://www.open-mpi.org/community/lists/devel/2008/11/4883.php
But I was unable to spot it in the 'ompi_info -all' output for v1.4.3.
Does it still exist?
If not, which other one replaces it, please?
Also, is
Bonjour,
I am afraid I got a weird issue when running an OpenMPI job using OpenIB
on an SGI ICE cluster with 4096 cores (or larger), and the FAQ does not help.
The OMPI version is 1.4.1, and it is running just fine with a smaller number of
cores (up to 512).
The error message is the following
hould I set to activate it ?
- Is there a time penalty for using it, please ?
Thanks in advance for any help.
--
Regards, Gilbert.
estion is stupid but I could not find a solution via google or
> search function ...
>
> cheers
> Bernhard
>
readv failed:
> >> Connection reset by peer (104)
> >> mpirun noticed that job rank 0 with PID 17781 on node master exited on
> >> signal 11 (Segmentation fault).
> >> 1 additional process aborted (not shown)
> >>
> >> But this is
ted or not ?
I could understand any value between 1 & 7, but what does 54 mean, please?
Does it behave like 6, after removal of the unexpected bits?
Thanks, Gilbert
> > One way to check if the message goes via IB or SM may be to check the
> > counters in /sys/class/infiniband.
> >
> > Regards,
> > Mi
Thanks.
>
> Hahn
>
> --
> Hahn Kim, h...@ll.mit.edu
> MIT Lincoln Laboratory
> 244 Wood St., Lexington, MA 02420
> Tel: 781-981-0940, Fax: 781-981-5255
> no formal FAQ due to multiple reasons, but you can read how to
> use it in the attached scratch (there were a few name changes of the
> params, so check with ompi_info)
>
> shared memory is used between processes that share the same machine, and openib
> is used between different
> > Re: [OMPI users] Working with a CellBlade cluster
> >
> >
> >Hi,
> >
> >
> >If I understand you correctly the most
Working with a CellBlade cluster (QS22), the requirement is to have one
instance of the executable running on each socket of the blade (there are 2
sockets). The application is of the 'domain decomposition' type, and each
instance is required to often send/receive data with both the remote blades
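The "one instance per socket" requirement could be sketched with the socket-aware launch options that appeared in 1.3/1.4-era Open MPI; treat the exact flags as assumptions to verify against mpirun --help on the installed version (./app is a placeholder):

```shell
# Hypothetical: one rank per socket on each QS22 blade (2 sockets),
# with each rank bound to its own socket.
mpirun -npersocket 1 --bind-to-socket ./app
```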