Re: [OMPI users] Problems with mpirun

2010-09-03 Thread Peter Kjellstrom
On Friday 03 September 2010, Alexander Kalinin wrote:
> Hello!
>
> I have a problem to run mpi program. My command line is:
> $ mpirun -np 1 ./ksurf
>
> But I got an error:
> [0,0,0] mca_oob_tcp_init: invalid address '' returned for selected oob
> interfaces.
> [0,0,0] ORTE_ERROR_LOG: Error in file oob_tcp.c at line 880
>
> My working environment is: Fedora 7, openmpi-1.1
>
> Is it possible to treat this problem?

Both Fedora 7 and OpenMPI-1.1 are ancient. I'd suggest you upgrade to current 
versions before you invest time debugging this.

/Peter


signature.asc
Description: This is a digitally signed message part.


Re: [OMPI users] Low Open MPI performance on InfiniBand and shared memory?

2010-07-09 Thread Peter Kjellstrom
On Friday 09 July 2010, Andreas Schäfer wrote:
> Thanks, those were good suggestions.
>
> On 11:53 Fri 09 Jul, Peter Kjellstrom wrote:
> > On an E5520 (nehalem) node I get ~5 GB/s ping-pong for >64K sizes.
>
> I just tried a Core i7 system which maxes at 6550 MB/s for the
> ping-pong test.

It makes quite a difference whether the ranks end up on the same socket or on 
different sockets (on an i7 you only have one).

> > On QDR IB on similar nodes I get ~3 GB/s ping-pong for >256K.
>
> I'll try to find a Intel system to repeat the tests. Maybe it's AMD's
> different memory subsystem/cache architecture which is slowing Open
> MPI? Or are my systems just badly configured?

8x PCI-Express gen2 (5 GT/s) should show figures like mine. If it's PCI-Express 
gen1 or gen2 at 2.5 GT/s, or only 4x wide, or if the IB link only came up with 
two lanes, then ~1500 MB/s is expected.
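
A quick way to check what the links actually negotiated (a sketch, assuming the 
usual OFED and pciutils tools are installed):

$ ibv_devinfo | grep -iE 'active_width|active_speed'   # IB lane count and signalling rate
$ lspci -vv | grep -i 'LnkSta:'                        # negotiated PCIe width/speed (run as root)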

/Peter


signature.asc
Description: This is a digitally signed message part.


Re: [OMPI users] Low Open MPI performance on InfiniBand and shared memory?

2010-07-09 Thread Peter Kjellstrom
On Friday 09 July 2010, Andreas Schäfer wrote:
> Hi,
>
> I'm evaluating Open MPI 1.4.2 on one of our BladeCenters and I'm
> getting via InfiniBand about 1550 MB/s and via shared memory about
> 1770 for the PingPong benchmark in Intel's MPI benchmark. (That
> benchmark is just an example, I'm seeing similar numbers for my own
> codes.)

Two factors make a big difference: the size of the operations and the type of 
node (CPU model).

On an E5520 (nehalem) node I get ~5 GB/s ping-pong for >64K sizes.

On QDR IB on similar nodes I get ~3 GB/s ping-pong for >256K.

Numbers are for 1.4.1, YMMV. I couldn't find an AMD node similar to yours, 
sorry.

/Peter


> Each node has two AMD hex-cores and two 40 Gbps InfiniBand ports, so I
> wonder if I shouldn't be getting a significantly higher throughput on
> InfiniBand. Considering the CPUs' memory bandwidth, I believe that
> shared memory throughput should be much higher as well.
>
> Are those numbers what is to be expected? If not: any ideas how to
> debug this or tune Open MPI?
>
> Thanks in advance
> -Andreas
>
> ps: if it's any help, this is what iblinkinfo is telling me
> (tests were run on faui36[bc])


signature.asc
Description: This is a digitally signed message part.


Re: [OMPI users] (no subject)

2010-06-11 Thread Peter Kjellstrom
On Friday 11 June 2010, asmae.elbahlo...@mpsa.com wrote:
> Hello
> i have a problem with paraFoam, when i type paraFoam in the terminal, it
> launches nothing but in the terminal i have:

This is the OpenMPI mailing list, not OpenFOAM's. I suggest you contact the 
team behind OpenFOAM.

I also suggest that you post plain text to mailing lists in the future, not 
HTML (and while you're at it, do use a descriptive subject line).

/Peter
  
> tta201@linux-qv31:/media/OpenFoam/FOAMpro/FOAMpro-1.5-2.2/FOAM-1.5-2.2/tuto
>rials/icoFoam/cavity> paraFoam Xlib:  extension "GLX" missing on display
> ":0.0". Xlib:  extension "GLX" missing on display ":0.0". Xlib:  extension
> "GLX" missing on display ":0.0". Xlib:  extension "GLX" missing on display
> ":0.0". Xlib:  extension "GLX" missing on display ":0.0". Xlib:  extension
> "GLX" missing on display ":0.0". Xlib:  extension "GLX" missing on display
> ":0.0". Xlib:  extension "GLX" missing on display ":0.0". ERROR: In
> /home/kitware/Dashboard/MyTests/ParaView-3-8/ParaView-3.8/ParaView/VTK/Rend
>ering/vtkXOpenGLRenderWindow.cxx, line 404 vtkXOpenGLRenderWindow
> (0x117b3d0): Could not find a decent visual 
> Xlib:  extension "GLX" missing on display ":0.0".
> Xlib:  extension "GLX" missing on display ":0.0".
> Xlib:  extension "GLX" missing on display ":0.0".
> Xlib:  extension "GLX" missing on display ":0.0".
> Xlib:  extension "GLX" missing on display ":0.0".
> Xlib:  extension "GLX" missing on display ":0.0".
> Xlib:  extension "GLX" missing on display ":0.0".
> Xlib:  extension "GLX" missing on display ":0.0".
> ERROR: In
> /home/kitware/Dashboard/MyTests/ParaView-3-8/ParaView-3.8/ParaView/VTK/Rend
>ering/vtkXOpenGLRenderWindow.cxx, line 404 vtkXOpenGLRenderWindow
> (0x117b3d0): Could not find a decent visual 
> Xlib:  extension "GLX" missing on display ":0.0".
> ERROR: In
> /home/kitware/Dashboard/MyTests/ParaView-3-8/ParaView-3.8/ParaView/VTK/Rend
>ering/vtkXOpenGLRenderWindow.cxx, line 611 vtkXOpenGLRenderWindow
> (0x117b3d0): GLX not found.  Aborting.
>   
> /media/OpenFoam/FOAMpro/FOAMpro-1.5-2.2/FOAM-1.5-2.2/bin/paraFoam: line 81:
> 15497 Aborted paraview --data=$caseFile 
>  
>  
> I don't understand the problem, can someone help me please?
> thanks



-- 

  Peter Kjellström   | E-mail: c...@nsc.liu.se
  National Supercomputer Centre  |
  Sweden | http://www.nsc.liu.se


signature.asc
Description: This is a digitally signed message part.


Re: [OMPI users] Trouble building openmpi 1.2.8 with intel compilers 10.0.23

2010-04-06 Thread Peter Kjellstrom
On Monday 05 April 2010, Steve Swanekamp (L3-Titan Contractor) wrote:
> When I try to run the configure utility I get the message that the c++
> compiler can not compile simple c programs.  Any ideas?

(At least some) Intel compilers need the gcc-c++ distribution package. Have 
you tested icpc with a simple C++ program?
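
For example (a minimal sketch, any trivial C++ source will do):

$ cat > hello.cpp <<'EOF'
#include <iostream>
int main() { std::cout << "icpc works" << std::endl; return 0; }
EOF
$ icpc hello.cpp -o hello && ./hello

If the gcc-c++ package (and the libstdc++ headers it provides) is missing, this 
tends to fail in much the same way configure does.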

/Peter


signature.asc
Description: This is a digitally signed message part.


Re: [OMPI users] mpi error?

2010-03-11 Thread Peter Kjellstrom
On Thursday 11 March 2010, Matthew MacManes wrote:
> Can anybody tell me if this is an error associated with openmpi, versus an
> issue with the program I am running (MRBAYES,
> https://sourceforge.net/projects/mrbayes/)
>
> We are trying to run a large simulated dataset using 1,000,000 bases
> divided up into 1000 genes, 5 taxa.. An error is occurring, but we are not
> sure why. We are using the MPI version of MRBAYES v3.2-cvs on a linux
> 16core 24GB RAM machine. It does not appear as if the program runs out of
> memory (max memory usage is 13gb).  Maybe this is an OpenMPI problem and
> not related to MrBayes...
>
> See snippet of error message below. Can anybody give me any hints about the
> source of the problem?
>
> I am using OPENMPI version 1.4.1.
>
> ...
> Defining charset called gene997
> Defining charset called gene998
> Defining charset called gene999
> Defining charset called gene1000
> Defining partition called Genes
> [macmanes:02546] *** Process received signal ***
> [macmanes:02546] Signal: Segmentation fault (11)
> [macmanes:02546] Signal code: Address not mapped (1)
> [macmanes:02546] Failing at address: (nil)
> [macmanes:02546] [ 0] /lib/libpthread.so.0 [0x7ffd0f322190]
> [macmanes:02546] *** End of error message ***
> --
> mpirun noticed that process rank 13 with PID 2546 on node macmanes exited
> on signal 11 (Segmentation fault).

One of the ranks got a "Segmentation fault". This would typically indicate a 
problem with the application, not with MPI. Maybe you ran out of stack space? 
(ulimit -s).
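
For example, in the shell you launch mpirun from:

$ ulimit -s              # show the current stack limit (in kB)
$ ulimit -s unlimited    # raise it (if the hard limit allows), then relaunch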

Have you tried a different/lower number of ranks?

/Peter


signature.asc
Description: This is a digitally signed message part.


Re: [OMPI users] Problems compiling OpenMPI 1.4 with PGI 9.0-3

2010-01-07 Thread Peter Kjellstrom
On Wednesday 06 January 2010, Tim Miller wrote:
> Hi All,
>
> I am trying to compile OpenMPI 1.4 with PGI 9.0-3 and am getting the
> following error in configure:
>
> checking for functional offsetof macro... no
> configure: WARNING: Your compiler does not support offsetof macro
> configure: error: Configure: Cannot continue
>
> I have searched around and found that this error occurs because of a
> problem in the configure scripts when PGI 10 is used, but I'm using 9.0-3
> which should not have the configure script issue. Here is the output of
> pgcc -V:
>
> pgcc 9.0-3 64-bit target on x86-64 Linux -tp k8-64e
> Copyright 1989-2000, The Portland Group, Inc.  All Rights Reserved.
> Copyright 2000-2009, STMicroelectronics, Inc.  All Rights Reserved.
>
> I'm not sure what's wrong here as other people have reported being able to
> build OpenMPI with PGI 9. Does anyone have any ideas?

Maybe a late enough PGI-9 behaves like PGI-10. You could try 1.4.1-rc1, which 
should work with PGI-10, and see if it fixes your problem too.

/Peter

> Thanks,
> Tim Miller


signature.asc
Description: This is a digitally signed message part.


Re: [OMPI users] (no subject)

2009-10-30 Thread Peter Kjellstrom
On Friday 30 October 2009, Konstantinos Angelopoulos wrote:
> good part of the day,
>
> I am trying to run a parallel program (that used to run in a cluster) in my
> double core pc. Could openmpi simulate the distribution of the parallel
> jobs  to my 2 processors

If your program is an MPI program then, yes, OpenMPI on your PC would allow 
you to use both cores (assuming your job can fit on the PC of course).

> meaning will qsub work even if it is not a real 
> cluster?

qsub has nothing to do with MPI; it belongs to the workload management or 
batch queue system. You could install one of those on your PC as well (see for 
example Torque, SGE or Slurm).

/Peter

> thank you for reading my message and for any answer.
>
> Konstantinos Angelopoulos


signature.asc
Description: This is a digitally signed message part.


Re: [OMPI users] Openmpi setup with intel compiler.

2009-09-30 Thread Peter Kjellstrom
On Wednesday 30 September 2009, vighn...@aero.iitb.ac.in wrote:
...
> during
> configuring with Intel 9.0 compiler the installation gives following
> error.
>
> [root@test_node openmpi-1.3.3]# make all install
...
> make[3]: Entering directory `/tmp/openmpi-1.3.3/orte'
> test -z "/share/apps/mpi/openmpi/intel/lib" || /bin/mkdir -p
> "/share/apps/mpi/openmpi/intel/lib"
>  /bin/sh ../libtool   --mode=install /usr/bin/install -c  'libopen-rte.la'
> '/share/apps/mpi/openmpi/intel/lib/libopen-rte.la'
> libtool: install: error: cannot install `libopen-rte.la' to a directory
> not ending in /share/apps/mpi/openmpi/pgi/lib

The line above indicates that you've somehow attempted this from a dirty tree 
and/or environment (dirty from the previous PGI installation...).

Try a clean environment and a clean build tree. Source the iccvars.sh/ifortvars.sh 
files from your Intel install dir, set CC, CXX, FC and F77, and do:
 "./configure --prefix=... && make && make install"

/Peter


signature.asc
Description: This is a digitally signed message part.


Re: [OMPI users] profile the performance of a MPI code: how much traffic is being generated?

2009-09-30 Thread Peter Kjellstrom
On Tuesday 29 September 2009, Rahul Nabar wrote:
> On Tue, Sep 29, 2009 at 10:40 AM, Eugene Loh  wrote:
> > to know.  It sounds like you want to be able to watch some % utilization
> > of a hardware interface as the program is running.  I *think* these tools
> > (the ones on the FAQ, including MPE, Vampir, and Sun Studio) are not of
> > that class.
>
> You are correct. A real time tool would be best that sniffs at the MPI
> traffic. Post mortem profilers would be the next best option I assume.
> I was trying to compile MPE but gave up. Too many errors. Trying to
> decide if I should prod on or look at another tool.

Not MPI-aware, but you could watch network traffic in real time with a tool 
such as collectl.
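
For example (a sketch; option letters from memory, see the collectl man page):

$ collectl -sn -i 1    # per-second summary of ethernet traffic
$ collectl -sx -i 1    # interconnect (InfiniBand) counters, where supported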

/Peter


signature.asc
Description: This is a digitally signed message part.


Re: [OMPI users] very bad parallel scaling of vasp using openmpi

2009-09-23 Thread Peter Kjellstrom
On Wednesday 23 September 2009, Rahul Nabar wrote:
> On Tue, Aug 18, 2009 at 5:28 PM, Gerry Creager  
wrote:
> > Most of that bandwidth is in marketing...  Sorry, but it's not a high
> > performance switch.
>
> Well, how does one figure out what exactly is a "high performance
> switch"?

IMHO 1G Ethernet won't be enough ("high performance" or not). Get yourself 
some cheap IB HCAs and a switch. The only chance you have with Ethernet is to 
run some sort of kernel-bypass protocol (Open-MX etc.) and tune your NICs.

/Peter

> I've found this an exceedingly hard task. Like the OP posted 
> the Dell 6248 is rated to give more than a fully subscribed backbone
> capacity. Nor I do not know any good third party test lab nor do I
> know any switch load testing benchmarks that'd take a switch through
> its paces.
>
> So, how does one go about selecting a good switch? "The most expensive
> the better" is somewhat a unsatisfying option!


signature.asc
Description: This is a digitally signed message part.


Re: [OMPI users] Help: Infiniband interface hang

2009-09-03 Thread Peter Kjellstrom
Could you guys please trim your e-mails? No one wants to scroll past 100K-200K 
of old context to see the update (not to mention that it wastes storage space 
for everyone).

/Peter


signature.asc
Description: This is a digitally signed message part.


Re: [OMPI users] scaling problem with openmpi

2009-05-20 Thread Peter Kjellstrom
On Wednesday 20 May 2009, Roman Martonak wrote:
> I tried to run with the first dynamic rules file that Pavel proposed
> and it works, the time per one MD step on 48 cores decreased from 2.8
> s to 1.8 s as expected. It was clearly the basic linear algorithm that
> was causing the problem. I will check the performance of bruck and
> pairwise on my HW. It would be nice if it could be tuned further.

I'm guessing you'll see even better performance if you change 8192 to 131072 
in that config file. That moves up the cross-over point between "bruck" 
and "pairwise".

/Peter

> Thanks
>
> Roman


signature.asc
Description: This is a digitally signed message part.


Re: [OMPI users] scaling problem with openmpi

2009-05-20 Thread Peter Kjellstrom
On Wednesday 20 May 2009, Pavel Shamis (Pasha) wrote:
> > Disabling basic_linear seems like a good idea but your config file sets
> > the cut-off at 128 Bytes for 64-ranks (the field you set to 8192 seems to
> > result in a message size of that value divided by the number of ranks).
> >
> > In my testing bruck seems to win clearly (at least for 64 ranks on my IB)
> > up to 2048. Hence, the following line may be better:
> >
> >  131072 2 0 0 # switch to pair wise for size 128K/nranks
> >
> > Disclaimer: I guess this could differ quite a bit for nranks!=64 and
> > different btls.
>
> Sounds strange for me. From the code is looks that we take the threshold as
> is without dividing by number of ranks.

Interesting. I may have had too little or too much coffee, but the figures in 
my previous e-mail (3rd run, bruckto2k_pair) were run with the above line. And 
it very much looks like it switched at 128K/64=2K, not at 128K (which would 
have been above my largest size of 3000 and as such equivalent to all_bruck).

I also ran tests with:
 8192 2 0 0 # ...
And it seemed to switch between 10 Bytes and 500 Bytes (most likely then at 
8192/64=128).

My test program calls MPI_Alltoall like this:
  time1 = MPI_Wtime();
  for (i = 0; i < repetitions; i++) {
    MPI_Alltoall(sbuf, message_size, MPI_CHAR,
                 rbuf, message_size, MPI_CHAR, MPI_COMM_WORLD);
  }
  time2 = MPI_Wtime();

/Peter


signature.asc
Description: This is a digitally signed message part.


Re: [OMPI users] scaling problem with openmpi

2009-05-20 Thread Peter Kjellstrom
On Wednesday 20 May 2009, Pavel Shamis (Pasha) wrote:
> > With the file Pavel has provided things have changed to the following.
> > (maybe someone can confirm)
> >
> > If message size < 8192
> > bruck
> > else
> > pairwise
> > end
>
> You are right here. Target of my conf file is disable basic_linear for
> medium message side.

Disabling basic_linear seems like a good idea, but your config file sets the 
cut-off at 128 Bytes for 64 ranks (the field you set to 8192 seems to result 
in a message size of that value divided by the number of ranks).

In my testing bruck seems to win clearly (at least for 64 ranks on my IB) up 
to 2048. Hence, the following line may be better:

 131072 2 0 0 # switch to pair wise for size 128K/nranks

Disclaimer: I guess this could differ quite a bit for nranks!=64 and different 
btls.

Here are some figures for this part of the package size range:

all_bruck
bw for   10  x 10 B :  13.7 Mbytes/s time was: 922.0 µs
bw for   10  x 500 B :  45.9 Mbytes/s   time was:  13.7 ms
bw for   10  x 1000 B : 122.7 Mbytes/s   time was:  10.3 ms
bw for   10  x 1500 B :  86.9 Mbytes/s   time was:  21.8 ms
bw for   10  x 2000 B : 120.1 Mbytes/s   time was:  21.0 ms
bw for   10  x 2047 B :  92.6 Mbytes/s   time was:  27.9 ms
bw for   10  x 2048 B : 107.3 Mbytes/s   time was:  24.1 ms
bw for   10  x 2400 B :  93.7 Mbytes/s   time was:  32.3 ms
bw for   10  x 2800 B :  73.0 Mbytes/s   time was:  48.3 ms
bw for   10  x 2900 B :  79.5 Mbytes/s   time was:  45.9 ms
bw for   10  x 2925 B :  89.3 Mbytes/s   time was:  41.3 ms
bw for   10  x 2950 B :  72.7 Mbytes/s   time was:  51.1 ms
bw for   10  x 2975 B :  75.2 Mbytes/s   time was:  49.8 ms
bw for   10  x 3000 B :  74.9 Mbytes/s   time was:  50.5 ms
bw for   10  x 3100 B :  95.9 Mbytes/s   time was:  40.7 ms
totaltime was: 479.5 ms
all_pair
bw for   10  x 10 B : 414.2 kbytes/s time was:  30.4 ms
bw for   10  x 500 B :  19.8 Mbytes/s   time was:  31.9 ms
bw for   10  x 1000 B :  43.3 Mbytes/s   time was:  29.1 ms
bw for   10  x 1500 B :  63.3 Mbytes/s   time was:  29.9 ms
bw for   10  x 2000 B :  81.2 Mbytes/s   time was:  31.0 ms
bw for   10  x 2047 B :  82.3 Mbytes/s   time was:  31.3 ms
bw for   10  x 2048 B :  83.0 Mbytes/s   time was:  31.1 ms
bw for   10  x 2400 B :  93.6 Mbytes/s   time was:  32.3 ms
bw for   10  x 2800 B : 105.0 Mbytes/s   time was:  33.6 ms
bw for   10  x 2900 B : 107.7 Mbytes/s   time was:  33.9 ms
bw for   10  x 2925 B : 108.1 Mbytes/s   time was:  34.1 ms
bw for   10  x 2950 B : 109.6 Mbytes/s   time was:  33.9 ms
bw for   10  x 2975 B : 111.1 Mbytes/s   time was:  33.7 ms
bw for   10  x 3000 B : 112.1 Mbytes/s   time was:  33.7 ms
bw for   10  x 3100 B : 114.5 Mbytes/s   time was:  34.1 ms
totaltime was: 484.1 ms
bruckto2k_pair
bw for   10  x 10 B :  11.9 Mbytes/s time was:   1.1 ms
bw for   10  x 500 B : 100.3 Mbytes/s   time was:   6.3 ms
bw for   10  x 1000 B : 115.9 Mbytes/s   time was:  10.9 ms
bw for   10  x 1500 B : 117.2 Mbytes/s   time was:  16.1 ms
bw for   10  x 2000 B :  95.7 Mbytes/s   time was:  26.3 ms
bw for   10  x 2047 B :  96.6 Mbytes/s   time was:  26.7 ms
bw for   10  x 2048 B :  82.2 Mbytes/s   time was:  31.4 ms
bw for   10  x 2400 B :  94.1 Mbytes/s   time was:  32.1 ms
bw for   10  x 2800 B : 105.6 Mbytes/s   time was:  33.4 ms
bw for   10  x 2900 B : 108.4 Mbytes/s   time was:  33.7 ms
bw for   10  x 2925 B : 108.3 Mbytes/s   time was:  34.0 ms
bw for   10  x 2950 B : 109.9 Mbytes/s   time was:  33.8 ms
bw for   10  x 2975 B : 111.5 Mbytes/s   time was:  33.6 ms
bw for   10  x 3000 B : 108.3 Mbytes/s   time was:  34.9 ms
bw for   10  x 3100 B : 114.7 Mbytes/s   time was:  34.0 ms
totaltime was: 388.4 ms

These figures were run on a freshly compiled OpenMPI-1.3.2. The numbers for 
bruck at small package sizes vary a bit from run to run.

/Peter

> Pasha.


signature.asc
Description: This is a digitally signed message part.


Re: [OMPI users] scaling problem with openmpi

2009-05-20 Thread Peter Kjellstrom
On Wednesday 20 May 2009, Rolf Vandevaart wrote:
...
> If I am understanding what is happening, it looks like the original
> MPI_Alltoall made use of three algorithms.  (You can look in
> coll_tuned_decision_fixed.c)
>
> If message size < 200 or communicator size > 12
>bruck
> else if message size < 3000
>basic linear
> else
>pairwise
> end

And 3000 was the observed threshold for bad behaviour so it seems very likely 
that "basic linear" was the culprit. My testing would suggest that "pairwise" 
was a good choice for ~3000 (but maybe bruck, as configured by Pavel, is good 
too).

/Peter

> With the file Pavel has provided things have changed to the following.
> (maybe someone can confirm)
>
> If message size < 8192
>bruck
> else
>pairwise
> end
>
> Rolf


signature.asc
Description: This is a digitally signed message part.


Re: [OMPI users] scaling problem with openmpi

2009-05-19 Thread Peter Kjellstrom
On Tuesday 19 May 2009, Roman Martonak wrote:
> On Tue, May 19, 2009 at 3:29 PM, Peter Kjellstrom <c...@nsc.liu.se> wrote:
> > On Tuesday 19 May 2009, Roman Martonak wrote:
> > ...
> >> openmpi-1.3.2                           time per one MD step is 3.66 s
> >>    ELAPSED TIME :    0 HOURS  1 MINUTES 25.90 SECONDS
> >>  = ALL TO ALL COMM           102033. BYTES               4221.  =
> >>  = ALL TO ALL COMM             7.802  MB/S          55.200 SEC  =
...
> With TASKGROUP=2 the summary looks as follows
...
>  = ALL TO ALL COMM   231821. BYTES   4221.  =
>  = ALL TO ALL COMM82.716  MB/S  11.830 SEC  =

Wow, according to this it takes 1/5th the time to do the same number (4221) of 
alltoalls if the size is (roughly) doubled... (ten times better performance 
with the larger transfer size)

Something is not quite right. Could you possibly try to run just the alltoalls, 
like I suggested in my previous e-mail?

/Peter


signature.asc
Description: This is a digitally signed message part.


Re: [OMPI users] Slightly off topic: Ethernet and InfiniBand speed evolution

2009-05-07 Thread Peter Kjellstrom
On Thursday 07 May 2009, nee...@crlindia.com wrote:
> Thanks Pasha for sharing IB Roadmaps with us. But i am more interested in
> to find out latency figures since they often matter more than bit rate.
>
> Could there be rough if not accurate the latency figures being targeted in
> IB World?

The (low level verbs) latency has AFAIR changed only a few times:

1) started at 5-6us with PCI-X Infinihost3
2) dropped to 3-4us with PCI-express Infinihost3
3) dropped to ~1us with PCI-express ConnectX

Disclaimer: rough figures and only for Mellanox chips.

/Peter

> Regards
>
> Neeraj Chourasia


signature.asc
Description: This is a digitally signed message part.


Re: [OMPI users] Factor of 10 loss in performance with 1.3.x

2009-04-07 Thread Peter Kjellstrom
On Tuesday 07 April 2009, Eugene Loh wrote:
> Iain Bason wrote:
> > But maybe Steve should try 1.3.2 instead?  Does that have your
> > improvements in it?
>
> 1.3.2 has the single-queue implementation and automatic sizing of the sm
> mmap file, both intended to fix problems at large np.  At np=2, you
> shouldn't expect to see much difference.
>
> >> And the slowdown doesn't seem to be observed by anyone other than
> >> Steve and his colleague?
> >
> > It would be useful to know who else has compared these two revisions.
>
> I just ran Netpipe and found that it gave a comparable sm latency as
> other pingpong tests.  So, in my mind, the question is why Steve sees
> latencies that are about 10 usec on a platform that can give 1 usec.
> There seems to be something tricky about reproducing that 10-usec
> slowdown.  I have trouble buying that it's just, "sm latency degraded
> from 1 usec to 10 usec when we went from 1.2 to 1.3".  If it were as
> simple as that, we would all have been aware of the performance
> regression.  There is some other special ingredient here (other than
> OMPI rev) that we're missing.


Maybe it's not BTL-layer related at all. It could be something completely 
different, like messed-up processor affinity.
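
One quick way to rule that out is to force affinity and re-run the ping-pong, 
e.g. (parameter name from memory, the binary name is a placeholder):

$ mpirun --mca mpi_paffinity_alone 1 -np 2 ./your_pingpong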


/Peter


signature.asc
Description: This is a digitally signed message part.


Re: [OMPI users] MPI can not open file?

2009-04-07 Thread Peter Kjellstrom
On Tuesday 07 April 2009, Bernhard Knapp wrote:
> Hi
>
> I am trying to get a parallel job of the gromacs software started. MPI
> seems to boot fine but unfortunately it seems not to be able to open a
> specified file although it is definitly in the directory where the job
> is started.

Do all the nodes (in your machinefile) see the same filesystem(s)?

Have you tried a trivial MPI program (MPI_Init, open("..."), MPI_Finalize)?
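
A minimal sketch of such a test (the compile/run commands and machinefile path 
are examples):

$ cat > opentest.c <<'EOF'
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv)
{
    int rank;
    FILE *f;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* try to open the input file the same way the real job would */
    f = fopen("1fyt_PKYVKQNTLELAT_bindingRegionsOnly.md.tpr", "r");
    printf("rank %d: fopen %s\n", rank, f ? "ok" : "FAILED");
    if (f) fclose(f);
    MPI_Finalize();
    return 0;
}
EOF
$ mpicc opentest.c -o opentest
$ mpirun -hostfile ./machines -np 2 ./opentest   # same machinefile as the gromacs run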

I have compiled and executed gromacs (4.0.2) successfully with several OpenMPI 
versions.

/Peter

> I also changed the file permissions to 777 but it does not 
> affect the result. Any suggestions?
>
> cheers
> Bernhard
...
> Program mdrun, VERSION 4.0.3
> Source code file: gmxfio.c, line: 736
>
> Can not open file:
> 1fyt_PKYVKQNTLELAT_bindingRegionsOnly.md.tpr


signature.asc
Description: This is a digitally signed message part.


Re: [OMPI users] Heterogeneous OpenFabrics hardware

2009-01-27 Thread Peter Kjellstrom
On Tuesday 27 January 2009, Jeff Squyres wrote:
> It is worth clarifying a point in this discussion that I neglected to
> mention in my initial post: although Open MPI may not work *by
> default* with heterogeneous HCAs/RNICs, it is quite possible/likely
> that if you manually configure Open MPI to use the same verbs/hardware
> settings across all your HCAs/RNICs (assuming that you use a set of
> values that is compatible with all your hardware) that MPI jobs
> spanning multiple different kinds of HCAs or RNICs will work fine.
>
> See this post on the devel list for a few more details:
>
>  http://www.open-mpi.org/community/lists/devel/2009/01/5314.php

So is it correct that each rank will check its HCA-model and then pick up 
suitable settings for that HCA?

If so, maybe OpenMPI could fall back to very conservative settings if more 
than one HCA model was detected among the ranks. Or would this require 
communication at a stage where that would be complicated and/or ugly?

/Peter


signature.asc
Description: This is a digitally signed message part.


[OMPI users] MPI_Gather bug with reproducer code attached

2008-11-16 Thread Peter Kjellstrom
Problem description:
Elements from all ranks are gathered correctly except for the
element belonging to the root/target rank of the gather operation
when using certain custom MPI-datatypes (see reproducer code).

The bug can be toggled by commenting/uncommenting line 11 in the .F90-file.

Even though all this is for MPI_Gather the same seems to go for MPI_Gatherv 
too.

I have verified the bad behaviour with several OpenMPI versions from 1.2.3 to 
1.3b2. Correct behaviour has been observed with mvapich2 and PlatformMPI. Both 
gfortran and ifort have been tried.

Attached files:
 BUILD                      Build instructions
 RUN                        Run instructions
 mpi_gather_test.F90        Reproducer source code
 4rank_bad_output.txt       Bad output
 4rank_expected_output.txt  Good output

/Peter
mpif90.openmpi -o  mpi_gather_test.local_ompils mpi_gather_test.F90
mpirun.openmpi -np 4 ./mpi_gather_test.local_ompils  | sort -nk 2
Module global
  implicit none
  include 'mpif.h'

! Handle for MPI_Type_create_struct
  Integer :: my_mpi_struct

  Type my_fortran_struct
! With the following line the bug bites, with it commented out the
! behaviour is as expected
 Integer  :: unused_data
 Integer  :: used_data
  End Type my_fortran_struct

End Module global


! ----------------------------------------------------------------------


Program mpi_gather_test
  use global

  Integer:: i
  Integer:: nranks
  Integer, Parameter :: gather_target = 1
  Integer:: rank
  Integer:: ierror

  Type (my_fortran_struct), Pointer :: source_vector (:)
  Type (my_fortran_struct), Pointer :: dest_vector(:)

  call MPI_Init ( ierror )
  call MPI_Comm_rank ( MPI_COMM_WORLD, rank, ierror )
  call MPI_Comm_size ( MPI_COMM_WORLD, nranks, ierror )

  Allocate (source_vector(1), STAT = ierror)
  Allocate (dest_vector(1:nranks), STAT = ierror)

! Each rank initializes the data to be gathered to its rank number
! for tracing purposes (So we can see what goes where)
  source_vector(:)%used_data = rank

! Each rank initializes the target buffer with tracing data. The
! expectation is that on the root rank this will be completely over-
! written while on the rest of the ranks it will be unchanged.
  do i = 1, nranks
 dest_vector(i)%used_data = 10 * i + rank * 100 + 1000
  enddo

! Call the subroutine below that creates the MPI-datatype.
  call create_datatype ( ierror )

! Run the actual gather.
  call MPI_Gather (source_vector, 1,  my_mpi_struct, &
   dest_vector,   1,  my_mpi_struct, &
   gather_target, MPI_COMM_WORLD, ierror)

! Output the content of the used_data part of the dest_vector on
! all ranks. On the root-rank of the gather it is expected that the
! initial data is overwritten with the data from the source_vectors
! gathered from all ranks.
  do i = 1, nranks
 print *, 'rank:', rank, 'element:', i, 'dest_vector%used_data: ', &
  dest_vector(i)%used_data
  enddo
  
  call MPI_Finalize (ierror)
end program mpi_gather_test


! ----------------------------------------------------------------------


subroutine create_datatype (ierror)
  use global

  integer, Intent (Out) :: ierror

  integer (kind=MPI_ADDRESS_KIND) :: start, loc_used_data, loc_ub
  integer (kind=MPI_ADDRESS_KIND) :: disp (3)
  integer :: lengths (3), types (3), ext_size

  Type (my_fortran_struct)  :: template (2)

  ierror = 0

! Get the offsets (displacements) from the template vector of
! my_fortran_struct type
  call MPI_Get_address (template(1), start, ierror)
  call MPI_Get_address (template(1)%used_data, loc_used_data, ierror)
  call MPI_Get_address (template(2), loc_ub, ierror)

  disp (1) = 0
  disp (2) = loc_used_data - start
  disp (3) = loc_ub- start

  lengths (1:3) = 1

  types (1) = MPI_LB
  types (2) = MPI_INTEGER
  types (3) = MPI_UB

! Create the MPI-type
  call MPI_Type_create_struct (3, lengths, disp, types, &
   my_mpi_struct, ierror)

  call MPI_Type_commit (my_mpi_struct, ierror)

end subroutine create_datatype
 rank:   0 element:   1 dest_vector%used_data: 1010
 rank:   0 element:   2 dest_vector%used_data: 1020
 rank:   0 element:   3 dest_vector%used_data: 1030
 rank:   0 element:   4 dest_vector%used_data: 1040
 rank:   1 element:   1 dest_vector%used_data:0
 rank:   1 element:   2 dest_vector%used_data: 1120
 rank:   1 element:   3 dest_vector%used_data:2
 rank:   1 element:   4 dest_vector%used_data:3
 rank:   2 element:   1 dest_vector%used_data: 1210
 rank:   2 element:   2 dest_vector%used_data: 1220
 rank:   2 element:   

Re: [OMPI users] SLURM vs. Torque? [OT]

2007-10-22 Thread Peter Kjellstrom
On Monday 22 October 2007, Bill Johnstone wrote:
> Hello All.
>
> We are starting to need resource/scheduling management for our small
> cluster, and I was wondering if any of you could provide comments on
> what you think about Torque vs. SLURM?  On the basis of the appearance
> of active development as well as the documentation, SLURM seems to be
> superior, but can anyone shed light on how they compare in use?

I won't attempt a full analysis but here are two small (random) crumbs of 
information.

1) Slurm keeps the name of things separate from the contact address 
(ControlMachine=hostname, ControlAddr=IP/whatever). This alone wins my heart 
any day of the week.

2) The scheduler can be a weak point for Slurm. If you can live with the 
built-in trivial one then great. If you need more and happen to find something 
that is free and works (or write one yourself) then let me know ;-)

/Peter


signature.asc
Description: This is a digitally signed message part.


Re: [OMPI users] Performance tuning: focus on latency

2007-07-25 Thread Peter Kjellstrom
On Wednesday 25 July 2007, Jeff Squyres wrote:
> On Jul 25, 2007, at 7:45 AM, Biagio Cosenza wrote:
> > Jeff, I did what you suggested
> >
> > However no noticeable changes seem to happen. Same peaks and same
> > latency times.
>
> Ok.  This suggests that Nagle may not be the issue here.

My guess would be that there are some nasty dead animals buried in the 
network. The OP mentioned 200 ms; that's enough time to cross continents, not 
a time you'd expect in the same sentence as "cluster" and "latency".

...on the other hand we don't really know what the benchmark case was.

/Peter


signature.asc
Description: This is a digitally signed message part.


Re: [OMPI users] IB bandwidth vs. kernels

2007-01-18 Thread Peter Kjellstrom
On Thursday 18 January 2007 13:08, Scott Atchley wrote:
...
> The best uni-directional performance I have heard of for PCIe 8x IB
> DDR is ~1,400 MB/s (11.2 Gb/s)

This is on par with what I have seen.

> with Lustre, which is about 55% of the 
> theoretical 20 Gb/s advertised speed.

I think this should be calculated against 16 Gbps, not 20 Gbps.

> The ~900 MB/s (7.2 Gb/s) 
> mentioned above is, of course, ~72% of advertised speed. If any IB
> folks have any better numbers, please correct me.

Using MPI (over a non-idle multi-level switch) I get 940 * 10^6 Bytes/s, which 
is 94% of peak for that IB 4x SDR.

> The data throughput limit for 8x PCIe is ~12 Gb/s. The theoretical
> limit is 16 Gb/s, but each PCIe packet has a whopping 20 byte
> overhead. If the adapter uses 64 byte packets, then you see 1/3 of
> the throughput go to overhead.

AFAIK the data field of a PCI-Express packet is 0-4096 bytes and the header a 
bit more than 20 bytes (including things such as start/stop framing bytes and 
LCRC/ECRC). This gives a maximum speed over 4x PCIe of 993.3 10^6 Bytes/s 
(8 Gbps after coding, minus header waste, for a full 4096 byte payload).

In short, the SDR IB equipment I have seen has easily reached 90%+ while 
PCI-express on the platforms I've tried has been limited to ~75%. Current IB 
DDR HCAs are probably limited by (at least) PCI-express 8x.

/Peter


pgphFCXjmUXlv.pgp
Description: PGP signature


Re: [OMPI users] IB bandwidth vs. kernels

2007-01-18 Thread Peter Kjellstrom
On Thursday 18 January 2007 09:52, Robin Humble wrote:
...
> is ~10Gbit the best I can expect from 4x DDR IB with MPI?
> some docs @HP suggest up to 16Gbit (data rate) should be possible, and
> I've heard that 13 or 14 has been achieved before. but those might be
> verbs numbers, or maybe horsepower >> 4 cores of 2.66GHz core2 is
> required?

The 16 Gbit/s number is the theoretical peak: IB is 8/10 coded, so out of the 
20 Gbit/s signalling rate, 16 is what you get. On SDR the achievable number is 
(of course) 8 Gbit/s (which is ~1000 MB/s), and I've seen well above 900 MB/s 
with MPI (this on 8x PCIe, a 2x margin).

The same setup on 4x PCIe stops at a bit over 700 MB/s (for a certain PCIe 
chipset), so it makes some sense that IB 4x DDR on 8x PCIe would be limited 
to about 1500 MB/s (on that platform). All this ignoring possible MPI 
bottlenecks above 900 MB/s and assuming the IB fabric can reach 95%+ of peak 
on DDR as it does on SDR...

/Peter


pgpbr0t20Dzp_.pgp
Description: PGP signature


Re: [OMPI users] OpenMPI on HPUX?

2007-01-16 Thread Peter Kjellstrom
On Tuesday 16 January 2007 15:37, Brian W. Barrett wrote:
> Open MPI will not run on PA-RISC processors.

HPUX runs on IA-64 too.

/Peter


pgpdAr7FqFgzB.pgp
Description: PGP signature