[OMPI users] MPI_Comm_Spawn intercommunication

2011-01-07 Thread Pierre Chanial
Hello,

When I run this code:

program testcase

  use mpi
  implicit none

  integer :: rank, lsize, rsize, code
  integer :: intercomm

  call MPI_INIT(code)

  call MPI_COMM_GET_PARENT(intercomm, code)
  if (intercomm == MPI_COMM_NULL) then
     call MPI_COMM_SPAWN ("./testcase", MPI_ARGV_NULL, 1, MPI_INFO_NULL, &
          0, MPI_COMM_WORLD, intercomm, MPI_ERRCODES_IGNORE, code)
     call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
     call MPI_COMM_SIZE(MPI_COMM_WORLD, lsize, code)
     call MPI_COMM_SIZE(intercomm, rsize, code)
     if (rank == 0) then
        print *, 'from parent: local size is ', lsize
        print *, 'from parent: remote size is ', rsize
     end if
  else
     call MPI_COMM_SIZE(MPI_COMM_WORLD, lsize, code)
     call MPI_COMM_SIZE(intercomm, rsize, code)
     print *, 'from child: local size is ', lsize
     print *, 'from child: remote size is ', rsize
  end if

  call MPI_FINALIZE (code)

end program testcase

I get the following results with openmpi 1.4.1 and two processes:
 from parent: local size is            2
 from parent: remote size is           2
 from child: local size is            1
 from child: remote size is           1


I would have expected:
 from parent: local size is            2
 from parent: remote size is           1
 from child: local size is            1
 from child: remote size is           2


Could anyone tell me what's going on? It's not a Fortran issue: I can also
replicate it using mpi4py.
It is probably related to the universe size, which I haven't found a way to
pass to mpirun.
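
For comparison, here is a minimal mpi4py sketch of the same test (the post mentions mpi4py). It is an illustration only. It relies on one point of the MPI standard: on an intercommunicator, MPI_Comm_size reports the size of the local group, while the size of the remote group comes from MPI_Comm_remote_size. The helper name `group_sizes` and the script layout are assumptions made for this example.

```python
import sys


def group_sizes(intercomm):
    """Return (local, remote) group sizes of an intercommunicator.

    Get_size() wraps MPI_Comm_size (the local group on an intercommunicator);
    Get_remote_size() wraps MPI_Comm_remote_size (the remote group).
    """
    return intercomm.Get_size(), intercomm.Get_remote_size()


def main():
    # Imported here so the helper above can be exercised without MPI.
    from mpi4py import MPI

    parent = MPI.Comm.Get_parent()
    if parent == MPI.COMM_NULL:
        # Parent side: all ranks of MPI_COMM_WORLD collectively spawn one child
        # that re-runs this same script.
        intercomm = MPI.COMM_WORLD.Spawn(sys.executable, args=[sys.argv[0]],
                                         maxprocs=1, root=0)
        if MPI.COMM_WORLD.Get_rank() == 0:
            print("from parent: local, remote =", group_sizes(intercomm))
    else:
        # Child side: the parent intercommunicator links back to the spawners.
        print("from child: local, remote =", group_sizes(parent))
```

To try it, one would add a call to `main()` at the bottom of the script and launch it with something like `mpirun -np 2 python script.py` (the script name is hypothetical).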

Cheers,
Pierre


Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Eugene Loh




Gilbert Grosdidier wrote:
Any other suggestion ?
Can any more information be extracted from profiling?  Here is where I
think things left off:

Eugene Loh wrote:

  
  
Gilbert Grosdidier wrote:
#                  [time]     [calls]       <%mpi>   <%wall>
# MPI_Waitall      741683     7.91081e+07   77.96    21.58
# MPI_Allreduce    114057     2.53665e+07   11.99     3.32
# MPI_Isend        27420.6    6.53513e+08    2.88     0.80
# MPI_Irecv        464.616    6.53513e+08    0.05     0.01
###

It seems to my non-expert eye that MPI_Waitall is dominant among MPI calls,
but not for the overall application,
Looks like on average each MPI_Waitall call is completing 8+ MPI_Isend
calls and 8+ MPI_Irecv calls.  I think IPM gives some point-to-point
messaging information.  Maybe you can tell what the distribution is of
message sizes, etc.  Or, maybe you already know the characteristic
pattern.  Does a stand-alone message-passing test (without the
computational portion) capture the performance problem you're looking
for?
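
The "8+" figure follows directly from the call counts in the IPM summary. A small sketch of the arithmetic, using only the numbers quoted above (the helper name `per_waitall` is just for this example):

```python
# Call counts copied from the IPM summary quoted in this message.
calls = {
    "MPI_Waitall": 7.91081e+07,
    "MPI_Isend": 6.53513e+08,
    "MPI_Irecv": 6.53513e+08,
}


def per_waitall(name):
    """Average number of `name` calls per MPI_Waitall call."""
    return calls[name] / calls["MPI_Waitall"]


isend_ratio = per_waitall("MPI_Isend")
irecv_ratio = per_waitall("MPI_Irecv")
print(f"Isend per Waitall: {isend_ratio:.2f}")  # prints 8.26
print(f"Irecv per Waitall: {irecv_ratio:.2f}")  # prints 8.26
```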

Do you know message lengths and patterns?  Can you confirm whether
non-MPI time is the same between good and bad runs?




Re: [OMPI users] mpirun --nice 10 prog ??

2011-01-07 Thread Eugene Loh




David Mathog wrote:

  Ralph Castain wrote:
  
  
Afraid not - though you could alias your program name to be "nice --10 prog"

  
  Is there an OMPI wish list?  If so, can we please add to it "a method
to tell mpirun  what nice values to use when it starts programs on
nodes"?  Minimally, something like this:

  --nice  12   #nice value used on all nodes
  --mnice 5    #nice value for master (first) node
  --wnice 10   #nice value for worker (worker) nodes

For my purposes that would be enough, as the only distinction is
master/worker.  For more complex environments more flexibility might be
desired, for instance, in a large cluster, where a subset of nodes
integrate data from worker subsets, effectively acting as "local masters".

Obviously for platforms without nice mpirun would try to use whatever
priority scheme was available, and failing that, just run the program as
it does now.

Or are we the only site where quick high priority jobs must run on the
same nodes where long term low priority jobs are also running?
  

I'm guessing people might have all sorts of ideas about how they would
want to solve "a problem like this one".

One is to forbid MPI jobs from competing for the same resources.  The
assumption that an MPI process has dedicated use of its resources is
somewhat ingrained into OMPI.

Checkpoint/restart:  if a higher-priority job comes along, kick the
lower-priority job off.

Yield.  This issue comes up often on these lists.  That is, don't just
set process priorities high or low, but make them more aggressive when
they're doing useful work and more passive when they're waiting idly.




Re: [OMPI users] mpirun --nice 10 prog ??

2011-01-07 Thread David Mathog
Ralph Castain wrote:

> Afraid not - though you could alias your program name to be "nice --10 prog"
> 

Is there an OMPI wish list?  If so, can we please add to it "a method
to tell mpirun  what nice values to use when it starts programs on
nodes"?  Minimally, something like this:

  --nice  12   #nice value used on all nodes
  --mnice 5    #nice value for master (first) node
  --wnice 10   #nice value for worker (worker) nodes

For my purposes that would be enough, as the only distinction is
master/worker.  For more complex environments more flexibility might be
desired, for instance, in a large cluster, where a subset of nodes
integrate data from worker subsets, effectively acting as "local masters".

Obviously for platforms without nice mpirun would try to use whatever
priority scheme was available, and failing that, just run the program as
it does now.

Or are we the only site where quick high priority jobs must run on the
same nodes where long term low priority jobs are also running?
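
Until something like that exists in mpirun, a wrapper along the following lines might approximate it. This is only a sketch under stated assumptions: it reads the `OMPI_COMM_WORLD_RANK` environment variable that Open MPI sets for launched processes, it treats rank 0 (rather than "the first node") as the master, and the names `nice_for_rank`, `launch` and `renice_wrap.py` are hypothetical.

```python
import os


def nice_for_rank(env, master_nice=5, worker_nice=10):
    """Pick a nice increment from the Open MPI rank found in `env`.

    Rank 0 is treated as the "master"; every other rank is a worker.
    Outside of an Open MPI launch (variable missing) no change is made.
    """
    rank = env.get("OMPI_COMM_WORLD_RANK")
    if rank is None:
        return 0
    return master_nice if int(rank) == 0 else worker_nice


def launch(argv, env=os.environ):
    """Lower this process's priority, then replace it with the real program."""
    os.nice(nice_for_rank(env))
    os.execvp(argv[0], argv)

# Hypothetical usage from a wrapper script `renice_wrap.py` that ends with
# `launch(sys.argv[1:])`:
#   mpirun -np 16 python renice_wrap.py ./prog arg1 arg2
```

The nice value is applied inside each launched process, so no change to mpirun itself is needed; the cost is one extra interpreter start per rank.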

Thanks,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Gilbert Grosdidier
Unfortunately, I was unable to spot any striking difference in performance
when using --bind-to-core.


 Sorry. Any other suggestion ?

 Regards,Gilbert.



On 7 Jan 2011, at 16:32, Jeff Squyres wrote:

Well, bummer -- there goes my theory.  According to the hwloc info  
you posted earlier, this shows that OMPI is binding to the 1st  
hyperthread on each core; *not* to both hyperthreads on a single  
core.  :-\


It would still be slightly interesting to see if there's any  
difference when you run with --bind-to-core instead of  
paffinity_alone.




On Jan 7, 2011, at 9:56 AM, Gilbert Grosdidier wrote:


Yes, here it is :

mpirun -np 8 --mca mpi_paffinity_alone 1 /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get

0x0001
0x0002
0x0004
0x0008
0x0010
0x0020
0x0040
0x0080

Gilbert.

On 7 Jan 2011, at 15:50, Jeff Squyres wrote:


Can you run with np=8?

On Jan 7, 2011, at 9:49 AM, Gilbert Grosdidier wrote:


Hi Jeff,

Thanks for taking care of this.

Here is what I got on a worker node:

mpirun --mca mpi_paffinity_alone 1 /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get

0x0001

Is this what is expected, please ? Or should I try yet another  
command ?


Thanks,   Regards,   Gilbert.



On 7 Jan 2011, at 15:35, Jeff Squyres wrote:


On Jan 6, 2011, at 11:23 PM, Gilbert Grosdidier wrote:


lstopo

Machine (35GB)
NUMANode L#0 (P#0 18GB) + Socket L#0 + L3 L#0 (8192KB)
L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
  PU L#0 (P#0)
  PU L#1 (P#8)
L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
  PU L#2 (P#1)
  PU L#3 (P#9)
L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
  PU L#4 (P#2)
  PU L#5 (P#10)
L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
  PU L#6 (P#3)
  PU L#7 (P#11)

[snip]


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



--
*-*
  Gilbert Grosdidier gilbert.grosdid...@in2p3.fr
  LAL / IN2P3 / CNRS Phone : +33 1 6446 8909
  Faculté des Sciences, Bat. 200 Fax   : +33 1 6446 8546
  B.P. 34, F-91898 Orsay Cedex (FRANCE)
*-*







Re: [OMPI users] change between openmpi 1.4.1 and 1.5.1 about MPI2 publish name

2011-01-07 Thread Jeff Squyres
On Jan 7, 2011, at 11:16 AM, Jeff Squyres wrote:

> Ok, I can replicate the hang in publish now.  I'll file a bug report.

Filed here: 

https://svn.open-mpi.org/trac/ompi/ticket/2681

Thanks for your persistence!

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] change between openmpi 1.4.1 and 1.5.1 about MPI2 publish name

2011-01-07 Thread Jeff Squyres
On Jan 7, 2011, at 10:41 AM, Bernard Secher - SFME/LGLS wrote:

> srv = 0 is set in my main program
> I call Bcast because all the processes must call MPI_Comm_accept (collective) 
> or must call MPI_Comm_connect (collective)

Ah -- I see.  I thought this was a test program where some processes were 
supposed to call connect and others were supposed to call accept.

> Anyway, I also get a deadlock with your lookup program:
> 
> That's what I do:
> 
> ompi-server -r URIfile
> 
> mpirun -np 1 -ompi-server file:URIfile lookup & (this is the program which 
> publishes the name)
> mpirun -np 1 -ompi-server file:URIfile lookup (this is the program which 
> looks up the name)
> 
> From these two programs I create a global communicator to exchange 
> communications between the two others

Ah -- this is a key point that I missed in your initial mail: that you're using 
the ompi server and multiple different mpirun's.  :-)

Ok, I can replicate the hang in publish now.  I'll file a bug report.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] srun and openmpi

2011-01-07 Thread Michael Di Domenico
I'm still testing the slurm integration, which seems to work fine so
far.  However, i just upgraded another cluster to openmpi-1.5 and
slurm 2.1.15 but this machine has no infiniband

If I salloc the nodes and mpirun the command, it seems to run and complete fine.
However, if I srun the command I get

[btl_tcp_endpoint:486] mca_btl_tcp_endpoint_recv_connect_ack received
unexpected process identifier

The job does not seem to run, but it exhibits two behaviors:
running a single process per node, the job runs and does not present
the error (srun -N40 --ntasks-per-node=1);
running multiple processes per node, the job spits out the error but
does not run (srun -n40 --ntasks-per-node=8).

I copied the configs from the other machine, so (i think) everything
should be configured correctly (but i can't rule it out)

I saw (and reported) a similar error to above with the 1.4-dev branch
(see mailing list) and slurm, I can't say whether they're related or
not though


On Mon, Jan 3, 2011 at 3:00 PM, Jeff Squyres  wrote:
> Yo Ralph --
>
> I see this was committed https://svn.open-mpi.org/trac/ompi/changeset/24197.  
> Do you want to add a blurb in README about it, and/or have this executable 
> compiled as part of the PSM MTL and then installed into $bindir (maybe named 
> ompi-psm-keygen)?
>
> Right now, it's only compiled as part of "make check" and not installed, 
> right?
>
>
>
> On Dec 30, 2010, at 5:07 PM, Ralph Castain wrote:
>
>> Run the program only once - it can be in the prolog of the job if you like. 
>> The output value needs to be in the env of every rank.
>>
>> You can reuse the value as many times as you like - it doesn't have to be 
>> unique for each job. There is nothing magic about the value itself.
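
A sketch of what "generate once, export to every rank" could look like. This is not the psm_keygen program itself, whose source is not shown in the thread. It assumes that any value shaped like the sample psm_keygen output quoted further down (two 16-hex-digit words joined by a dash) is acceptable, in line with the remark that there is nothing magic about the value. The function names are hypothetical.

```python
import secrets


def make_transport_key():
    """Return a key shaped like the sample: two 16-hex-digit words joined by '-'."""
    return f"{secrets.token_hex(8)}-{secrets.token_hex(8)}"


def export_line(key):
    """Format the key as an environment assignment for every rank to inherit."""
    return f"OMPI_MCA_orte_precondition_transports={key}"


# Generate the value once (for example in a job prolog) and print it so the
# launcher can export it into the environment of all ranks.
key = make_transport_key()
print(export_line(key))
```

The important property is that every rank of the job sees the same value, which is why it is generated once and exported rather than computed per process.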
>>
>> On Dec 30, 2010, at 2:11 PM, Michael Di Domenico wrote:
>>
>>> How early does this need to run? Can I run it as part of a task
>>> prolog, or does it need to be the shell env for each rank?  And does
>>> it need to run on one node or all the nodes in the job?
>>>
>>> On Thu, Dec 30, 2010 at 8:54 PM, Ralph Castain  wrote:
>>>> Well, I couldn't do it as a patch - proved too complicated as the psm
>>>> system looks for the value early in the boot procedure.
>>>>
>>>> What I can do is give you the attached key generator program. It outputs
>>>> the envar required to run your program. So if you run the attached program
>>>> and then export the output into your environment, you should be okay.
>>>> Looks like this:
>>>>
>>>> $ ./psm_keygen
>>>> OMPI_MCA_orte_precondition_transports=0099b3eaa2c1547e-afb287789133a954
>>>> $
>>>>
>>>> You compile the program with the usual mpicc.
>>>>
>>>> Let me know if this solves the problem (or not).
>>>> Ralph
>>>>
>>>>
>>>>
>>>>
>>>> On Dec 30, 2010, at 11:18 AM, Michael Di Domenico wrote:
>>>>
>>>>> Sure, i'll give it a go
>>>>>
>>>>> On Thu, Dec 30, 2010 at 5:53 PM, Ralph Castain  wrote:
>>>>>> Ah, yes - that is going to be a problem. The PSM key gets generated by
>>>>>> mpirun as it is shared info - i.e., every proc has to get the same value.
>>>>>>
>>>>>> I can create a patch that will do this for the srun direct-launch
>>>>>> scenario, if you want to try it. Would be later today, though.
>>>>>>
>>>>>>
>>>>>> On Dec 30, 2010, at 10:31 AM, Michael Di Domenico wrote:
>>>>>>
>>>>>>> Well maybe not horray, yet.  I might have jumped the gun a bit, it's
>>>>>>> looking like srun works in general, but perhaps not with PSM
>>>>>>>
>>>>>>> With PSM i get this error, (at least now i know what i changed)
>>>>>>>
>>>>>>> Error obtaining unique transport key from ORTE
>>>>>>> (orte_precondition_transports not present in the environment)
>>>>>>> PML add procs failed
>>>>>>> --> Returned "Error" (-1) instead of "Success" (0)
>>>>>>>
>>>>>>> Turn off PSM and srun works fine
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Dec 30, 2010 at 5:13 PM, Ralph Castain
>>>>>>> wrote:
>>>>>>>> Hooray!
>>>>>>>>
>>>>>>>> On Dec 30, 2010, at 9:57 AM, Michael Di Domenico wrote:
>>>>>>>>
>>>>>>>>> I think i take it all back.  I just tried it again and it seems to
>>>>>>>>> work now.  I'm not sure what I changed (between my first and this
>>>>>>>>> msg), but it does appear to work now.
>>>>>>>>>
>>>>>>>>> On Thu, Dec 30, 2010 at 4:31 PM, Michael Di Domenico
>>>>>>>>>  wrote:
>>>>>>>>>> Yes that's true, error messages help.  I was hoping there was some
>>>>>>>>>> documentation to see what i've done wrong.  I can't easily cut and
>>>>>>>>>> paste errors from my cluster.
>>>>>>>>>>
>>>>>>>>>> Here's a snippet (hand typed) of the error message, but it does look
>>>>>>>>>> like a rank communications error
>>>>>>>>>>
>>>>>>>>>> ORTE_ERROR_LOG: A message is attempting to be sent to a process whose
>>>>>>>>>> contact information is unknown in file rml_oob_send.c at line 145.
>>>>>>>>>> *** MPI_INIT failure message (snipped) ***
>>>>>>>>>> orte_grpcomm_modex failed
>>>>>>>>>> --> Returned "A messages is attempting to be 

Re: [OMPI users] Issue with : btl_openib.c (OMPI 1.4.3)

2011-01-07 Thread Gilbert Grosdidier

Hello Pavel,

 Here is the output of the ofed_info command :

==
OFED-1.4.1
libibverbs:
git://git.openfabrics.org/ofed_1_4/libibverbs.git ofed_1_4
commit b00dc7d2f79e0660ac40160607c9c4937a895433
libmthca:
git://git.kernel.org/pub/scm/libs/infiniband/libmthca.git master
commit be5eef3895eb7864db6395b885a19f770fde7234
libmlx4:
git://git.openfabrics.org/ofed_1_4/libmlx4.git ofed_1_4
commit d5e5026e2bd3bbd7648199a48c4245daf313aa48
libehca:
git://git.openfabrics.org/ofed_1_4/libehca.git ofed_1_4
commit 0249815e9b6f134f33546da6fa2e84e1185eea6d
libipathverbs:
git://git.openfabrics.org/~ralphc/libipathverbs ofed_1_4
commit 337df3c1cbe43c3e9cb58e7f6e91f44603dd23fb
libcxgb3:
git://git.openfabrics.org/~swise/libcxgb3.git ofed_1_4
commit f685c8fe7e77e64614d825e563dd9f02a0b1ae16
libnes:
git://git.openfabrics.org/~glenn/libnes.git master
commit 379cccb4484f39b99c974eb6910d3a0407c0bbd1
libibcm:
git://git.openfabrics.org/~shefty/libibcm.git master
commit 7fb57e005b3eae2feb83b3fd369aeba700a5bcf8
librdmacm:
git://git.openfabrics.org/~shefty/librdmacm.git master
commit 62c2bddeaf5275425e1a7e3add59c3913ccdb4e9
libsdp:
git://git.openfabrics.org/ofed_1_4/libsdp.git ofed_1_4
commit b1eaecb7806d60922b2fe7a2592cea4ae56cc2ab
sdpnetstat:
git://git.openfabrics.org/~amirv/sdpnetstat.git ofed_1_4
commit 798e44f6d5ff8b15b2a86bc36768bd2ad473a6d7
srptools:
git://git.openfabrics.org/~ishai/srptools.git master
commit ce1f64c8dd63c93d56c1cc5fbcdaaadd4f74a1e3
perftest:
git://git.openfabrics.org/~orenmeron/perftest.git master
commit 1cd38e844dc50d670b48200bcda91937df5f5a92
qlvnictools:
git://git.openfabrics.org/~ramachandrak/qlvnictools.git ofed_1_4
commit 4ce9789273896d0e67430c330eb3703405b59951
tvflash:
git://git.openfabrics.org/ofed_1_4/tvflash.git ofed_1_4
commit e1b50b3b8af52b0bc55b2825bb4d6ce699d5c43b
mstflint:
git://git.openfabrics.org/~orenk/mstflint.git master
commit 3352f8997591c6955430b3e68adba33e80a974e3
qperf:
git://git.openfabrics.org/~johann/qperf.git/.git master
commit 18e1c1e8af96cd8bcacced3c4c2a4fd90f880792
ibutils:
git://git.openfabrics.org/~kliteyn/ibutils.git ofed_1_4
commit 9d4bfc3ba19875dfa4583dfaef6f0f579bb013bb
ibsim:
git://git.openfabrics.org/ofed_1_4/ibsim.git ofed_1_4
commit a76132ae36dde8302552d896e35bd29608ac9524

ofa_kernel-1.4.1:
Git:
git://git.openfabrics.org/ofed_1_4/linux-2.6.git ofed_kernel
commit 868661b127c355c64066a796460a7380a722dd84


 Does this mean the resize_cq function should be available, please ?

 Thanks,   Regards, Gilbert.


On 7 Jan 2011, at 16:14, Shamis, Pavel wrote:

The FW version looks OK, but it may be a driver issue as well. I
guess that an OFED 1.4.x or 1.5.x driver should be OK.

To check the driver version, you can run the ofed_info command.

Regards,

Pavel (Pasha) Shamis
---
Application Performance Tools Group
Computer Science and Math Division
Oak Ridge National Laboratory
Email: sham...@ornl.gov





On Dec 17, 2010, at 12:30 PM, Gilbert Grosdidier wrote:


John,

Thanks, more info below.


On 17/12/2010 17:32, John Hearns wrote:

On 17 December 2010 15:47, Gilbert Grosdidier
  wrote:

gg= I don't know, and firmware_revs does not seem to be available.
Only thing I got on a worker node was with lspci :

If you log into a compute node the command is /usr/sbin/ibstat

gg= Here it is :


/usr/sbin/ibstat

CA 'mlx4_0'
   CA type: MT26418
   Number of ports: 2
   Firmware version: 2.7.0
   Hardware version: a0
   Node GUID: 0x003048f036c4
   System image GUID: 0x003048f036c7
   Port 1:
   State: Active
   Physical state: LinkUp
   Rate: 20
   Base lid: 6611
   LMC: 0
   SM lid: 1
   Capability mask: 0x02510868
   Port GUID: 0x003048f036c5
   Port 2:
   State: Active
   Physical state: LinkUp
   Rate: 20
   Base lid: 6612
   LMC: 0
   SM lid: 1
   Capability mask: 0x02510868
   Port GUID: 0x003048f036c6

Does this mean resize_cq should be available, please ?

Thanks,Best,  G.



The firmware_revs command is on the cluster admin node, and is
provided by the sgi-admin-node RPM package.


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users





--
*-*
  Gilbert Grosdidier gilbert.grosdid...@in2p3.fr
  LAL / IN2P3 / CNRS Phone : +33 1 6446 8909
  Faculté des Sciences, Bat. 200 Fax   : +33 1 6446 8546
  B.P. 34, F-91898 Orsay Cedex (FRANCE)
*-*







Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Gilbert Grosdidier

I'll very soon give a try to using Hyperthreading with our app,
and keep you posted about the improvements, if any.

 Our current cluster is made out of 4-core dual-socket Nehalem nodes.

 Cheers,Gilbert.


On 7 Jan 2011, at 16:17, Tim Prince wrote:


On 1/7/2011 6:49 AM, Jeff Squyres wrote:


My understanding is that hyperthreading can only be activated/deactivated
at boot time -- once the core resources are allocated to hyperthreads,
they can't be changed while running.


Whether disabling the hyperthreads or simply telling Linux not to  
schedule on them makes a difference performance-wise remains to be  
seen.  I've never had the time to do a little benchmarking to  
quantify the difference.  If someone could rustle up a few cycles  
(get it?) to test out what the real-world performance difference is  
between disabling hyperthreading in the BIOS vs. telling Linux to  
ignore the hyperthreads, that would be awesome.  I'd love to see  
such results.


My personal guess is that the difference is in the noise.  But  
that's a guess.


Applications which depend on availability of full size instruction  
lookaside buffer would be candidates for better performance with  
hyperthreads completely disabled.  Many HPC applications don't  
stress ITLB, but some do.
Most of the important resources are allocated dynamically between  
threads, but the ITLB is an exception.
We reported results of an investigation on Intel Nehalem 4-core  
hyperthreading where geometric mean performance of standard  
benchmarks for certain commercial applications was 2% better with  
hyperthreading disabled at boot time, compared with best 1 rank per  
core scheduling with hyperthreading enabled.  Needless to say, the  
report wasn't popular with marketing.  I haven't seen an equivalent  
investigation for the 6-core CPUs, where various strange performance  
effects have been noted, so, as Jeff said, the hyperthreading effect  
could be "in the noise."



--
Tim Prince



--
*-*
  Gilbert Grosdidier gilbert.grosdid...@in2p3.fr
  LAL / IN2P3 / CNRS Phone : +33 1 6446 8909
  Faculté des Sciences, Bat. 200 Fax   : +33 1 6446 8546
  B.P. 34, F-91898 Orsay Cedex (FRANCE)
*-*







Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Jeff Squyres
Well, bummer -- there goes my theory.  According to the hwloc info you posted 
earlier, this shows that OMPI is binding to the 1st hyperthread on each core; 
*not* to both hyperthreads on a single core.  :-\

It would still be slightly interesting to see if there's any difference when 
you run with --bind-to-core instead of paffinity_alone.
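
The reading of the masks can be checked mechanically. The sketch below decodes the eight `hwloc-bind --get` masks quoted below and compares them with the socket-0 part of the lstopo output. It assumes the masks are indexed by physical PU number (the `P#` values); the second socket is snipped in the thread, so PUs 4 to 7 are reported as outside the excerpt instead of being guessed.

```python
# Masks printed by `hwloc-bind --get` for the 8 ranks, copied from the thread.
masks = [0x0001, 0x0002, 0x0004, 0x0008, 0x0010, 0x0020, 0x0040, 0x0080]

# Physical PU numbers per core for socket 0, copied from the lstopo excerpt
# (first entry = first hyperthread, second entry = its sibling).
core_pus = {0: (0, 8), 1: (1, 9), 2: (2, 10), 3: (3, 11)}


def pus_in_mask(mask):
    """Return the PU indices whose bits are set in `mask`."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


bindings = [pus_in_mask(m) for m in masks]
first_threads = {pus[0] for pus in core_pus.values()}

for rank, pus in enumerate(bindings):
    if pus[0] in first_threads:
        note = "first hyperthread of a socket-0 core"
    else:
        note = "PU outside the lstopo excerpt"
    print(f"rank {rank}: PU {pus} ({note})")
```

Each mask has exactly one bit set, so no rank is bound to both hyperthreads of a core, and ranks 0 to 3 sit on P#0 to P#3 rather than on a sibling pair such as P#0 and P#8.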



On Jan 7, 2011, at 9:56 AM, Gilbert Grosdidier wrote:

> Yes, here it is :
> 
> > mpirun -np 8 --mca mpi_paffinity_alone 1 
> > /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get
> 0x0001
> 0x0002
> 0x0004
> 0x0008
> 0x0010
> 0x0020
> 0x0040
> 0x0080
> 
>  Gilbert.
> 
> On 7 Jan 2011, at 15:50, Jeff Squyres wrote:
> 
>> Can you run with np=8?
>> 
>> On Jan 7, 2011, at 9:49 AM, Gilbert Grosdidier wrote:
>> 
>>> Hi Jeff,
>>> 
>>> Thanks for taking care of this.
>>> 
>>> Here is what I got on a worker node:
>>> 
>>>> mpirun --mca mpi_paffinity_alone 1 
>>>> /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get
>>> 0x0001
>>> 
>>> Is this what is expected, please ? Or should I try yet another command ?
>>> 
>>> Thanks,   Regards,   Gilbert.
>>> 
>>> 
>>> 
>>> On 7 Jan 2011, at 15:35, Jeff Squyres wrote:
>>> 
>>>> On Jan 6, 2011, at 11:23 PM, Gilbert Grosdidier wrote:
>>>> 
>>>>>> lstopo
>>>>> Machine (35GB)
>>>>> NUMANode L#0 (P#0 18GB) + Socket L#0 + L3 L#0 (8192KB)
>>>>>   L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
>>>>>     PU L#0 (P#0)
>>>>>     PU L#1 (P#8)
>>>>>   L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
>>>>>     PU L#2 (P#1)
>>>>>     PU L#3 (P#9)
>>>>>   L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
>>>>>     PU L#4 (P#2)
>>>>>     PU L#5 (P#10)
>>>>>   L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
>>>>>     PU L#6 (P#3)
>>>>>     PU L#7 (P#11)
>>>> [snip]

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Issue with : btl_openib.c (OMPI 1.4.3)

2011-01-07 Thread Jeff Squyres
+1

AFAIR (and I stopped being an IB vendor a long time ago, so I might be wrong), 
the _resize_cq function being there or not is not an issue of the underlying 
HCA; it's a function of what version of OFED you're running.


On Jan 7, 2011, at 10:14 AM, Shamis, Pavel wrote:

> The FW version looks OK, but it may be a driver issue as well. I guess that 
> an OFED 1.4.x or 1.5.x driver should be OK.
> To check the driver version, you can run the ofed_info command.
> 
> Regards,
> 
> Pavel (Pasha) Shamis
> ---
> Application Performance Tools Group
> Computer Science and Math Division
> Oak Ridge National Laboratory
> Email: sham...@ornl.gov
> 
> 
> 
> 
> 
> On Dec 17, 2010, at 12:30 PM, Gilbert Grosdidier wrote:
> 
>> John,
>> 
>> Thanks, more info below.
>> 
>> 
>> On 17/12/2010 17:32, John Hearns wrote:
>>> On 17 December 2010 15:47, Gilbert Grosdidier
>>>   wrote:
>>>> gg= I don't know, and firmware_revs does not seem to be available.
>>>> Only thing I got on a worker node was with lspci :
>>> If you log into a compute node the command is /usr/sbin/ibstat
>> gg= Here it is :
>> 
>>> /usr/sbin/ibstat
>> CA 'mlx4_0'
>>CA type: MT26418
>>Number of ports: 2
>>Firmware version: 2.7.0
>>Hardware version: a0
>>Node GUID: 0x003048f036c4
>>System image GUID: 0x003048f036c7
>>Port 1:
>>State: Active
>>Physical state: LinkUp
>>Rate: 20
>>Base lid: 6611
>>LMC: 0
>>SM lid: 1
>>Capability mask: 0x02510868
>>Port GUID: 0x003048f036c5
>>Port 2:
>>State: Active
>>Physical state: LinkUp
>>Rate: 20
>>Base lid: 6612
>>LMC: 0
>>SM lid: 1
>>Capability mask: 0x02510868
>>Port GUID: 0x003048f036c6
>> 
>> Does this mean resize_cq should be available, please ?
>> 
>> Thanks,Best,  G.
>> 
>>> 
>>> The firmware_revs command is on the cluster admin node, and is
>>> provided by the sgi-admin-node RPM package.
>> 
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Tim Prince

On 1/7/2011 6:49 AM, Jeff Squyres wrote:


My understanding is that hyperthreading can only be activated/deactivated at 
boot time -- once the core resources are allocated to hyperthreads, they can't 
be changed while running.

Whether disabling the hyperthreads or simply telling Linux not to schedule on 
them makes a difference performance-wise remains to be seen.  I've never had 
the time to do a little benchmarking to quantify the difference.  If someone 
could rustle up a few cycles (get it?) to test out what the real-world 
performance difference is between disabling hyperthreading in the BIOS vs. 
telling Linux to ignore the hyperthreads, that would be awesome.  I'd love to 
see such results.

My personal guess is that the difference is in the noise.  But that's a guess.

Applications which depend on the availability of a full-size instruction 
translation lookaside buffer (ITLB) would be candidates for better 
performance with hyperthreads completely disabled.  Many HPC applications 
don't stress the ITLB, but some do.
Most of the important resources are allocated dynamically between 
threads, but the ITLB is an exception.
We reported results of an investigation on Intel Nehalem 4-core 
hyperthreading where geometric mean performance of standard benchmarks 
for certain commercial applications was 2% better with hyperthreading 
disabled at boot time, compared with best 1 rank per core scheduling 
with hyperthreading enabled.  Needless to say, the report wasn't popular 
with marketing.  I haven't seen an equivalent investigation for the 
6-core CPUs, where various strange performance effects have been noted, 
so, as Jeff said, the hyperthreading effect could be "in the noise."
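
For readers unfamiliar with the metric, a geometric mean over per-benchmark ratios can be computed as below. The ratios are HYPOTHETICAL values made up for illustration; they are not the figures from the investigation mentioned above, which are not given in this thread.

```python
from statistics import geometric_mean

# HYPOTHETICAL per-benchmark ratios: run time with hyperthreading enabled
# divided by run time with hyperthreading disabled at boot. Values above 1
# mean the benchmark ran faster with hyperthreading disabled.
speedups = [1.05, 0.99, 1.03, 1.01]

gm = geometric_mean(speedups)
print(f"geometric mean ratio: {gm:.3f}")
```

The geometric mean is the usual choice for averaging ratios because a benchmark that is 5% faster and one that is 5% slower roughly cancel out, which an arithmetic mean of ratios does not guarantee.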



--
Tim Prince



Re: [OMPI users] change between openmpi 1.4.1 and 1.5.1 about MPI2 publish name

2011-01-07 Thread Jeff Squyres
You're calling bcast with root=0, so whatever value rank 0 has for srv, 
everyone will have after the bcast.  Plus, I didn't see in your code where *srv 
was ever set to 0.

In my runs, rank 0 is usually the one that publishes first.  Everyone then gets 
the lookup properly, and then the bcast sends srv=1 to everyone.  They all then 
try to call MPI_Comm_accept.
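
The effect of the broadcast can be modelled without MPI. The sketch below is a plain-Python stand-in (the names `bcast` and `roles` are made up for this example): it starts with srv = 0 on every rank, lets only rank 0 set it to 1 when its publish succeeds, and then overwrites every rank's value with rank 0's, as MPI_Bcast with root=0 does.

```python
def bcast(values, root=0):
    """Model MPI_Bcast: every rank ends up with the root's value."""
    return [values[root]] * len(values)


def roles(publish_ok, nprocs):
    """Return 'accept' or 'connect' per rank for one mpirun job.

    publish_ok: whether rank 0's MPI_Publish_name succeeded in this job.
    """
    srv = [0] * nprocs          # srv = 0 is set by the caller on every rank
    if publish_ok:
        srv[0] = 1              # only rank 0 sets *srv = 1
    srv = bcast(srv, root=0)    # MPI_Bcast(srv, 1, MPI_INT, 0, MPI_COMM_WORLD)
    return ["accept" if s else "connect" for s in srv]


# One job of 8 ranks where rank 0 publishes: every rank lands in accept.
print(roles(publish_ok=True, nprocs=8))
# A separate job whose publish fails: every rank lands in connect.
print(roles(publish_ok=False, nprocs=1))
```

Within a single mpirun job all ranks therefore take the same branch; a mix of accept and connect can only come from two separately launched jobs.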

Your code was incomplete, so I had to extend it; see attached.

Here's a sample output with 8 procs:

[7:12] svbu-mpi:~/mpi % mpicc lookup.c -o lookup -g && mpirun lookup
[0] Publish name
[0] service ocean available at 
3853516800.0;tcp://172.29.218.140:36685;tcp://10.10.10.140:36685;tcp://10.10.20.140:36685;tcp://10.10.30.140:36685;tcp://172.16.68.1:36685;tcp://172.16.29.1:36685+3853516801.0;tcp://172.29.218.150:34210;tcp://10.10.30.150:34210:300
Bcast
Bcast complete: srv=1
Server calling MPI_Comm_accept
[2] Lookup name
[6] Lookup name
[4] Lookup name
[3] Lookup name
MPI_Lookup_name succeeded
Bcast
Bcast complete: srv=1
Server calling MPI_Comm_accept
[1] Lookup name
[7] Lookup name
MPI_Lookup_name succeeded
Bcast
Bcast complete: srv=1
Server calling MPI_Comm_accept
MPI_Lookup_name succeeded
Bcast
Bcast complete: srv=1
Server calling MPI_Comm_accept
[5] Lookup name
MPI_Lookup_name succeeded
Bcast
Bcast complete: srv=1
Server calling MPI_Comm_accept
MPI_Lookup_name succeeded
Bcast
Bcast complete: srv=1
Server calling MPI_Comm_accept
MPI_Lookup_name succeeded
Bcast
MPI_Lookup_name succeeded
Bcast
Bcast complete: srv=1
Server calling MPI_Comm_accept
Bcast complete: srv=1
Server calling MPI_Comm_accept
[hang -- because everyone's in accept, not connect]



On Jan 7, 2011, at 4:17 AM, Bernard Secher - SFME/LGLS wrote:

> Jeff,
> 
> Only the processes of the program where process 0 succeeded in publishing 
> the name have srv=1 and then call MPI_Comm_accept.
> The processes of the program where process 0 failed to publish the name have 
> srv=0 and then call MPI_Comm_connect.
> 
> It worked like this with openmpi 1.4.1.
> 
> Is it different with openmpi 1.5.1 ?
> 
> Best
> Bernard
> 
> 
> Jeff Squyres wrote:
>> On Jan 5, 2011, at 10:36 AM, Bernard Secher - SFME/LGLS wrote:
>> 
>>   
>> 
>>> MPI_Comm remoteConnect(int myrank, int *srv, char *port_name, char* service)
>>> {
>>>   int clt=0;
>>>   MPI_Request request; /* request for non-blocking communication */
>>>   MPI_Comm gcom;
>>>   MPI_Status status;
>>>   char   port_name_clt[MPI_MAX_PORT_NAME];
>>> 
>>>   if( service == NULL ) service = defaultService;
>>> 
>>>   /* only the process of rank 0 can publish the name */
>>>   MPI_Barrier(MPI_COMM_WORLD);
>>> 
>>>   /* A lookup for an unpublished service generates an error */
>>>   MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
>>>   if( myrank == 0 ){
>>>     /* Try to be a server. If the service is already published, try to be
>>>        a client */
>>>     MPI_Open_port(MPI_INFO_NULL, port_name);
>>>     printf("[%d] Publish name\n",myrank);
>>>     if ( MPI_Publish_name(service, MPI_INFO_NULL, port_name) == MPI_SUCCESS )  {
>>>       *srv = 1;
>>>       printf("[%d] service %s available at %s\n",myrank,service,port_name);
>>>     }
>>>     else if ( MPI_Lookup_name(service, MPI_INFO_NULL, port_name_clt) == MPI_SUCCESS ){
>>>       MPI_Close_port( port_name );
>>>       clt = 1;
>>>     }
>>>     else
>>>       /* Throw exception */
>>>       printf("[%d] Error\n",myrank);
>>>   }
>>>   else{
>>>     /* Wait for rank 0 to publish the name */
>>>     sleep(1);
>>>     printf("[%d] Lookup name\n",myrank);
>>>     if ( MPI_Lookup_name(service, MPI_INFO_NULL, port_name_clt) == MPI_SUCCESS ){
>>>       clt = 1;
>>>     }
>>>     else
>>>       /* Throw exception */
>>>       ;
>>>   }
>>>   MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_ARE_FATAL);
>>> 
>>>   MPI_Bcast(srv,1,MPI_INT,0,MPI_COMM_WORLD);
>>> 
>> 
>> You're broadcasting srv here -- won't everyone now have *srv==1, such that 
>> they all call MPI_COMM_ACCEPT, below?
>> 
>>   
>> 
>>>   if ( *srv )
>>>     /* I am the Master */
>>>     MPI_Comm_accept( port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &gcom );
>>>   else{
>>>     /* Connect to service SERVER, get the inter-communicator server */
>>>     MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
>>>     if ( MPI_Comm_connect(port_name_clt, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &gcom ) == MPI_SUCCESS )
>>>       printf("[%d] I get the connection with %s at %s !\n",myrank, service, port_name_clt);
>>>     MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_ARE_FATAL);
>>>   }
>>> 
>>>   if(myrank != 0) *srv = 0;
>>> 
>>>   return gcom;
>>> 
>>> }


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/


lookup.c
Description: Binary data


Re: [OMPI users] Issue with : btl_openib.c (OMPI 1.4.3)

2011-01-07 Thread Shamis, Pavel
The FW version looks OK, but it may be a driver issue as well. I guess that an 
OFED 1.4.x or 1.5.x driver should be OK.
To check the driver version, you can run the ofed_info command.

Regards,

Pavel (Pasha) Shamis
---
Application Performance Tools Group
Computer Science and Math Division
Oak Ridge National Laboratory
Email: sham...@ornl.gov





On Dec 17, 2010, at 12:30 PM, Gilbert Grosdidier wrote:

> John,
> 
>  Thanks, more info below.
> 
> 
> On 17/12/2010 17:32, John Hearns wrote:
>> On 17 December 2010 15:47, Gilbert Grosdidier
>>   wrote:
>>> gg= I don't know, and firmware_revs does not seem to be available.
>>> Only thing I got on a worker node was with lspci :
>> If you log into a compute node the command is /usr/sbin/ibstat
> gg= Here it is :
> 
>> /usr/sbin/ibstat
> CA 'mlx4_0'
>     CA type: MT26418
>     Number of ports: 2
>     Firmware version: 2.7.0
>     Hardware version: a0
>     Node GUID: 0x003048f036c4
>     System image GUID: 0x003048f036c7
>     Port 1:
>         State: Active
>         Physical state: LinkUp
>         Rate: 20
>         Base lid: 6611
>         LMC: 0
>         SM lid: 1
>         Capability mask: 0x02510868
>         Port GUID: 0x003048f036c5
>     Port 2:
>         State: Active
>         Physical state: LinkUp
>         Rate: 20
>         Base lid: 6612
>         LMC: 0
>         SM lid: 1
>         Capability mask: 0x02510868
>         Port GUID: 0x003048f036c6
> 
>  Does this mean resize_cq should be available, please?
> 
>  Thanks, Best, G.
> 
>> 
>> The firmware_revs command is on the cluster admin node, and is
>> provided by the sgi-admin-node RPM package.
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Gilbert Grosdidier

Yes, here it is :

> mpirun -np 8 --mca mpi_paffinity_alone 1 /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get

0x0001
0x0002
0x0004
0x0008
0x0010
0x0020
0x0040
0x0080

 Gilbert.

On 7 Jan 2011, at 15:50, Jeff Squyres wrote:


Can you run with np=8?


--
*-*
  Gilbert Grosdidier gilbert.grosdid...@in2p3.fr
  LAL / IN2P3 / CNRS Phone : +33 1 6446 8909
  Faculté des Sciences, Bat. 200 Fax   : +33 1 6446 8546
  B.P. 34, F-91898 Orsay Cedex (FRANCE)
*-*







Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Jeff Squyres
Can you run with np=8?

On Jan 7, 2011, at 9:49 AM, Gilbert Grosdidier wrote:

> Hi Jeff,
> 
>  Thanks for taking care of this.
> 
> Here is what I got on a worker node:
> 
> > mpirun --mca mpi_paffinity_alone 1 
> > /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get
> 0x0001
> 
>  Is this what is expected, please ? Or should I try yet another command ?
> 
>  Thanks,   Regards,   Gilbert.
> 
> 
> 
> On 7 Jan 2011, at 15:35, Jeff Squyres wrote:
> 
>> On Jan 6, 2011, at 11:23 PM, Gilbert Grosdidier wrote:
>> 
>>> > lstopo
>>> Machine (35GB)
>>>   NUMANode L#0 (P#0 18GB) + Socket L#0 + L3 L#0 (8192KB)
>>>     L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
>>>       PU L#0 (P#0)
>>>       PU L#1 (P#8)
>>>     L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
>>>       PU L#2 (P#1)
>>>       PU L#3 (P#9)
>>>     L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
>>>       PU L#4 (P#2)
>>>       PU L#5 (P#10)
>>>     L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
>>>       PU L#6 (P#3)
>>>       PU L#7 (P#11)
>> [snip]
>> 
>> Well, this might disprove my theory.  :-\  The OS indexing is not contiguous 
>> on the hyperthreads, so I might be wrong about what happened here.  Try this:
>> 
>> mpirun --mca mpi_paffinity_alone 1 hwloc-bind --get
>> 
>> You can even run that on just one node; let's see what you get.  This will 
>> tell us what each process is *actually* bound to.  hwloc-bind --get will 
>> report a bitmask of the P#'s from above.  So if we see 001, 010, 011, 
>> ...etc, then my theory of OMPI binding 1 proc per hyperthread (vs. 1 proc 
>> per core) is incorrect.
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
> 
> --
> *-*
>   Gilbert Grosdidier gilbert.grosdid...@in2p3.fr
>   LAL / IN2P3 / CNRS Phone : +33 1 6446 8909
>   Faculté des Sciences, Bat. 200 Fax   : +33 1 6446 8546
>   B.P. 34, F-91898 Orsay Cedex (FRANCE)
> *-*
> 
> 
> 
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Jeff Squyres
On Jan 7, 2011, at 5:27 AM, John Hearns wrote:

> Actually, the topic of hyperthreading is interesting, and we should
> discuss it please.
> Hyperthreading is supposedly implemented better and 'properly' on
> Nehalem - I would be interested to see some genuine
> performance measurements with hyperthreading on/off on your machine Gilbert.

FWIW, from what I've seen, and from the recommendations I've heard from Intel, 
using hyperthreading is still a hit-or-miss proposition with HPC apps.  It's 
true that Nehalem (and later) hyperthreading is much better than it was before. 
 But hyperthreading is still designed to support apps that stall frequently (so 
the other hyperthread(s) can take over and do useful work while one is 
stalled).  Good HPC apps don't stall much, so hyperthreading still isn't a huge 
win.

Nehalem (and later) hyperthreading has been discussed on this list at least 
once or twice before; google through the archives to see if you can dig up the 
conversations.  I have dim recollections of people sending at least some 
performance numbers...?  (I could be wrong here, though)

> Also you don't need to reboot and change BIOS settings - there was a
> rather nifty technique on this list I think,
> where you disable every second CPU in Linux - which has the same
> effect as switching off hyperthreading.

Yes, you can disable all but one hyperthread on a processor in Linux by:

# echo 0 > /sys/devices/system/cpu/cpuX/online

where X is an integer from the set listed in hwloc's lstopo output from the P# 
numbers (i.e., the OS index values, as opposed to the logical index values).  
Repeat for the 2nd P# value on each core in your machine.  You can run lstopo 
again to verify that they went offline.  You can "echo 1" to the same file to 
bring it back online.

Note that you can't offline X=0.

Note that this technique technically doesn't disable each hyperthread; it just 
causes Linux to avoid scheduling on it.  Disabling hyperthreading in the BIOS 
is slightly different; you are actually physically disabling all but one thread 
per core.

The difference is in how resources in a core are split between hyperthreads.  
When you disable hyperthreading in the BIOS, all the resources in the core are 
given to the first hyperthread and the 2nd is deactivated (i.e., the OS doesn't 
even see it at all).  When hyperthreading is enabled in the BIOS, the core 
resources are split between all hyperthreads.  

Specifically: causing the OS to simply not schedule on all but the first 
hyperthread doesn't give those resources back to the first hyperthread; it just 
effectively ignores all but the first hyperthread.

My understanding is that hyperthreading can only be activated/deactivated at 
boot time -- once the core resources are allocated to hyperthreads, they can't 
be changed while running.

Whether disabling the hyperthreads or simply telling Linux not to schedule on 
them makes a difference performance-wise remains to be seen.  I've never had 
the time to do a little benchmarking to quantify the difference.  If someone 
could rustle up a few cycles (get it?) to test out what the real-world 
performance difference is between disabling hyperthreading in the BIOS vs. 
telling Linux to ignore the hyperthreads, that would be awesome.  I'd love to 
see such results.  

My personal guess is that the difference is in the noise.  But that's a guess.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Gilbert Grosdidier

Hi Jeff,

 Thanks for taking care of this.

Here is what I got on a worker node:

> mpirun --mca mpi_paffinity_alone 1 /opt/software/SGI/hwloc/1.1rc6r3028/bin/hwloc-bind --get

0x0001

 Is this what is expected, please? Or should I try yet another command?


 Thanks,   Regards,   Gilbert.




--
*-*
  Gilbert Grosdidier gilbert.grosdid...@in2p3.fr
  LAL / IN2P3 / CNRS Phone : +33 1 6446 8909
  Faculté des Sciences, Bat. 200 Fax   : +33 1 6446 8546
  B.P. 34, F-91898 Orsay Cedex (FRANCE)
*-*







Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread Jeff Squyres
On Jan 6, 2011, at 11:23 PM, Gilbert Grosdidier wrote:

> > lstopo
> Machine (35GB)
>   NUMANode L#0 (P#0 18GB) + Socket L#0 + L3 L#0 (8192KB)
>     L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
>       PU L#0 (P#0)
>       PU L#1 (P#8)
>     L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1
>       PU L#2 (P#1)
>       PU L#3 (P#9)
>     L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2
>       PU L#4 (P#2)
>       PU L#5 (P#10)
>     L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3
>       PU L#6 (P#3)
>       PU L#7 (P#11)
[snip]

Well, this might disprove my theory.  :-\  The OS indexing is not contiguous on 
the hyperthreads, so I might be wrong about what happened here.  Try this:

mpirun --mca mpi_paffinity_alone 1 hwloc-bind --get

You can even run that on just one node; let's see what you get.  This will tell 
us what each process is *actually* bound to.  hwloc-bind --get will report a 
bitmask of the P#'s from above.  So if we see 001, 010, 011, ...etc, then my 
theory of OMPI binding 1 proc per hyperthread (vs. 1 proc per core) is 
incorrect.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores : very poor performance

2011-01-07 Thread John Hearns
On 6 January 2011 21:10, Gilbert Grosdidier  wrote:
> Hi Jeff,
>
>  Where is the lstopo command located on SuSE Linux, please?
> And/or hwloc-bind, which seems related to it ?

I was able to get hwloc to install quite easily on SuSE -
download/configure/make
Configure it to install to /usr/local/bin


Actually, the topic of hyperthreading is interesting, and we should
discuss it please.
Hyperthreading is supposedly implemented better and 'properly' on
Nehalem - I would be interested to see some genuine
performance measurements with hyperthreading on/off on your machine Gilbert.

Also you don't need to reboot and change BIOS settings - there was a
rather nifty technique on this list I think,
where you disable every second CPU in Linux - which has the same
effect as switching off hyperthreading.
Maybe you could try it?



Re: [OMPI users] change between openmpi 1.4.1 and 1.5.1 about MPI2 publish name

2011-01-07 Thread Bernard Secher - SFME/LGLS

The accept and connect tests are OK with version openmpi 1.4.1.

I think there is a bug in version 1.5.1

Best
Bernard




Re: [OMPI users] change between openmpi 1.4.1 and 1.5.1 about MPI2 publish name

2011-01-07 Thread Bernard Secher - SFME/LGLS
I get the same deadlock with the openmpi tests (pubsub, accept and connect) 
with version 1.5.1.






Re: [OMPI users] change between openmpi 1.4.1 and 1.5.1 about MPI2 publish name

2011-01-07 Thread Bernard Secher - SFME/LGLS

Jeff,

The deadlock is not in MPI_Comm_accept and MPI_Comm_connect, but earlier, 
in MPI_Publish_name and MPI_Lookup_name.

So the broadcast of srv is not involved in the deadlock.

Best
Bernard





  







Re: [OMPI users] change between openmpi 1.4.1 and 1.5.1 about MPI2 publish name

2011-01-07 Thread Bernard Secher - SFME/LGLS

Jeff,

Only the processes of the program in which process 0 succeeded in publishing 
the name have srv=1 and then call MPI_Comm_accept.
The processes of the program in which process 0 failed to publish the name 
have srv=0 and then call MPI_Comm_connect.


That's how it worked with openmpi 1.4.1.

Is it different with openmpi 1.5.1?

Best
Bernard


Jeff Squyres wrote:

On Jan 5, 2011, at 10:36 AM, Bernard Secher - SFME/LGLS wrote:

MPI_Comm remoteConnect(int myrank, int *srv, char *port_name, char* service)
{
  int clt=0;
  MPI_Request request; /* request for non-blocking communication */
  MPI_Comm gcom;
  MPI_Status status;
  char   port_name_clt[MPI_MAX_PORT_NAME];

  if( service == NULL ) service = defaultService;

  /* only the process of rank 0 can publish the name */
  MPI_Barrier(MPI_COMM_WORLD);

  /* A lookup for an unpublished service generates an error */
  MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
  if( myrank == 0 ){
    /* Try to be a server. If the service is already published, try to be a
       client */
    MPI_Open_port(MPI_INFO_NULL, port_name);
    printf("[%d] Publish name\n",myrank);

    if ( MPI_Publish_name(service, MPI_INFO_NULL, port_name) == MPI_SUCCESS )  {
      *srv = 1;
      printf("[%d] service %s available at %s\n",myrank,service,port_name);
    }
    else if ( MPI_Lookup_name(service, MPI_INFO_NULL, port_name_clt) ==
              MPI_SUCCESS ){
      MPI_Close_port( port_name );
      clt = 1;
    }
    else
      /* Throw exception */
      printf("[%d] Error\n",myrank);
  }
  else{
    /* Wait for rank 0 to publish the name */
    sleep(1);
    printf("[%d] Lookup name\n",myrank);
    if ( MPI_Lookup_name(service, MPI_INFO_NULL, port_name_clt) == MPI_SUCCESS ){
      clt = 1;
    }
    else
      /* Throw exception */
      ;
  }
  MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_ARE_FATAL);

  MPI_Bcast(srv,1,MPI_INT,0,MPI_COMM_WORLD);

You're broadcasting srv here -- won't everyone now have *srv==1, such that they 
all call MPI_COMM_ACCEPT, below?

  if ( *srv )
    /* I am the Master */
    MPI_Comm_accept( port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &gcom );
  else{
    /* Connect to service SERVER, get the inter-communicator server */
    MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    if ( MPI_Comm_connect(port_name_clt, MPI_INFO_NULL, 0, MPI_COMM_WORLD,
                          &gcom )  == MPI_SUCCESS )
      printf("[%d] I get the connection with %s at %s !\n",myrank, service,
             port_name_clt);
    MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_ARE_FATAL);
  }

  if(myrank != 0) *srv = 0;

  return gcom;

}








  



--

  _\\|//_
 (' 0 0 ')
ooO  (_) Ooo__
Bernard Sécher  DEN/DM2S/SFME/LGLSmailto : bsec...@cea.fr
CEA Saclay, Bât 454, Pièce 114Phone  : 33 (0)1 69 08 73 78
91191 Gif-sur-Yvette Cedex, FranceFax: 33 (0)1 69 08 10 87
Oooo---
  oooO (   )
  (   ) ) /
   \ ( (_/
\_)


This e-mail and any files transmitted with it are confidential and
intended solely for the use of the individual to whom they are addressed.
If you have received this e-mail in error please inform the sender
immediately, without keeping any copy thereof.