Re: [OMPI users] Memory manager

2008-05-20 Thread Gleb Natapov
On Tue, May 20, 2008 at 12:17:02PM +1000, Terry Frankcombe wrote:
> To tell you all what noone wanted to tell me, yes, it does seem to be
> the memory manager.  Compiling everything with
> --with-memory-manager=none returns the vmem use to the more reasonable
> ~100MB per process (down from >8GB).
> 
> I take it this may affect my peak bandwidth over infiniband.  What's the
> general feeling about how bad this is?
You will not be able to use the "-mca mpi_leave_pinned 1" parameter, and your
micro-benchmark performance will suffer. A real application will see the
difference only if it reuses communication buffers frequently.

> 
> 
> On Tue, 2008-05-13 at 13:12 +1000, Terry Frankcombe wrote:
> > Hi folks
> > 
> > I'm trying to run an MPI app on an infiniband cluster with OpenMPI
> > 1.2.6.
> > 
> > When run on a single node, this app is grabbing large chunks of memory
> > (total per process ~8.5GB, including strace showing a single 4GB grab)
> > but not using it.  The resident memory use is ~40MB per process.  When
> > this app is compiled in serial mode (with conditionals to remove the MPI
> > calls) the memory use is more like what you'd expect, 40MB res and
> > ~100MB vmem.
> > 
> > Now I didn't write it so I'm not sure what extra stuff the MPI version
> > does, and we haven't tracked down the large memory grabs.
> > 
> > Could it be that this vmem is being grabbed by the OpenMPI memory
> > manager rather than directly by the app?
> > 
> > Ciao
> > Terry
> > 
> > 
> 

--
Gleb.


Re: [OMPI users] build OpenMPI with OpenIB

2008-03-07 Thread Gleb Natapov
On Fri, Mar 07, 2008 at 10:36:42AM +, Yuan Wan wrote:
> 
> Hi all,
> 
> I want to build OpenMPI-1.2.5 on my Infiniband cluster which has OFED-2.1 
> installed.
> 
> I configured OpenMPI as:
> 
> ./configure --prefix=/exports/home/local/Cluster-Apps/openmpi/gcc/64/1.2.5 \
> --enable-shared --enable-static --enable-debug \
> --with-openib=/usr/local/Cluster-Apps/infinipath/2.1/ofed
> 
> 
> And 'ompi_info | grep openib' only shows:
> 
>   MCA btl: openib (MCA v1.0, API v1.0.1, Component v1.2.5)
> 
> I cannot see:
> 
>   MCA mpool: openib (MCA v1.0, API v1.0, Component v1.0)
This is OK. There is no such component in Open MPI any more.

> 
> No idea why and if this will cause failure.
> 
> 
> When I tried to run a MPI code with the option "--mca btl openib,self", It 
> failed to run with the following messages:
> 
> 
> mpirun --mca btl openib,self -np 4 ./hello
What is the output of ibv_devinfo on your hosts?

--
Gleb.


Re: [OMPI users] OpenMPI 1.2.5 race condition / core dump with MPI_Reduce and MPI_Gather

2008-02-29 Thread Gleb Natapov
On Thu, Feb 28, 2008 at 04:53:11PM -0500, George Bosilca wrote:
> In this particular case, I don't think the solution is that obvious. If 
> you look at the stack in the original email, you will notice how we get 
> into this. The problem here, is that the FREE_LIST_WAIT is used to get a 
> fragment to store an unexpected message. If this macro return NULL (in 
> other words the PML is unable to store the unexpected message), what do 
> you expect to happen ? Drop the message ? Ask the BTL to hold it for a 
> while ? How about ordering ?
>
In all the cases where we use FREE_LIST_WAIT from a callback today, a solution
will not be simple, otherwise it would already have been implemented. In this
particular case, if we wait until memory allocation fails it is too late to do
anything useful, so printing a helpful message and aborting is good enough.
To avoid the situation where all memory is occupied by unexpected messages, we
either have to implement some kind of flow control in OB1 (and become more
spec compliant in the process) or declare all programs that exhibit that kind
of behaviour "unrealistic", as we do now.

> It is unfortunate to say it, only few days after we had the discussion  
> about the flow control, but the only correct solution here is to add PML 
> level flow control ...
>
>   george.
>
> On Feb 28, 2008, at 2:55 PM, Christian Bell wrote:
>
>> On Thu, 28 Feb 2008, Gleb Natapov wrote:
>>
>>> The trick is to call progress only from functions that are called
>>> directly by a user process. Never call progress from a callback
>>> function. The main offenders of this rule are calls to
>>> OMPI_FREE_LIST_WAIT(). They should be changed to OMPI_FREE_LIST_GET()
>>> and deal with the NULL return value.
>>
>> Right -- and it should be easy to find more offenders by having an
>> assert statement soak in the builds for a while (or by default in
>> debug mode).
>>
>> Was if it was ever part of the (or a) design to allow re-entrant
>> calls to progress from the same calling thread ?  It can be done but
>> callers have to have a holistic view of how other components require
>> and make the progress happen -- this isn't compatible with the Open
>> MPI model of independent dynamically loadable components.
>>
>> -- 
>> christian.b...@qlogic.com
>> (QLogic Host Solutions Group, formerly Pathscale)
>




--
Gleb.


Re: [OMPI users] OpenMPI 1.2.5 race condition / core dump with MPI_Reduce and MPI_Gather

2008-02-28 Thread Gleb Natapov
On Wed, Feb 27, 2008 at 10:01:06AM -0600, Brian W. Barrett wrote:
> The only solution to this problem is to suck it up and audit all the code 
> to eliminate calls to opal_progress() in situations where infinite  
> recursion can result.  It's going to be long and painful, but there's no  
> quick fix (IMHO).
>
The trick is to call progress only from functions that are called
directly by a user process. Never call progress from a callback function.
The main offenders of this rule are calls to OMPI_FREE_LIST_WAIT(). They
should be changed to OMPI_FREE_LIST_GET() and deal with the NULL return value.
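Roughly, the change looks like this (only a sketch: the free-list macro
signatures are approximately those of the 1.2-era tree, and the list name
here is invented for illustration):

    /* Before: may re-enter opal_progress() from inside a BTL/PML callback */
    OMPI_FREE_LIST_WAIT(&some_free_list, item, rc);

    /* After: never progress from a callback; cope with an empty list instead */
    OMPI_FREE_LIST_GET(&some_free_list, item, rc);
    if (NULL == item) {
        /* no fragment available right now: queue the work for a retry from
           a top-level MPI call, or return an error to the caller */
        return OMPI_ERR_OUT_OF_RESOURCE;
    }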

--
Gleb.


Re: [OMPI users] openmpi credits for eager messages

2008-02-05 Thread Gleb Natapov
On Tue, Feb 05, 2008 at 08:07:59AM -0500, Richard Treumann wrote:
> There is no misunderstanding of the MPI standard or the definition of
> blocking in the bug3 example.  Both bug 3 and the example I provided are
> valid MPI.
> 
> As you say, blocking means the send buffer can be reused when the MPI_Send
> returns.  This is exactly what bug3 is count on.
> 
> MPI is a reliable protocol which means that once MPI_Send returns, the
> application can assume the message will be delivered once a matching recv
> is posted.  There are only two ways I can think of for MPI to keep that
> guarantee.
> 1) Before return from MPI_Send, copy the envelop and data to some buffer
> that will be preserved until the MPI_Recv gets posted
> 2) delay the return from MPI_Send until the MPI_Recv is posted and then
> move data from the intact send buffer to the posted receive buffer. Return
> from MPI_Send
> 
> The requirement in the standard is that if libmpi takes option 1, the
> return from MPI_Send cannot occur unless there is certainty the buffer
> space exists. Libmpi cannot throw the message over the wall and fail the
> job if it cannit be buffered.
As I said, Open MPI has flow control at the transport layer to prevent messages
from being dropped by the network. This mechanism should allow a program like
yours to work, but bug3 is another story because it generates a huge
amount of unexpected messages and Open MPI has no mechanism to keep
unexpected messages from blowing up memory consumption. Your point is that
according to the MPI spec this is not valid behaviour. I am not going to
argue with that, especially as you can get the desired behaviour by setting
the eager limit to zero.
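For reference, the eager limit can be lowered per BTL on the command line
(parameter names as in the 1.2 series; check "ompi_info --param btl openib"
etc. for your build), for example:

    mpirun --mca btl_openib_eager_limit 0 --mca btl_sm_eager_limit 0 \
           --mca btl_tcp_eager_limit 0 -np 64 ./bug3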

> users-boun...@open-mpi.org wrote on 02/05/2008 02:28:27 AM:
> 
> > On Mon, Feb 04, 2008 at 04:23:13PM -0500, Sacerdoti, Federico wrote:
> > > Bug3 is a test-case derived from a real, scalable application (desmond
> > > for molecular dynamics) that several experienced MPI developers have
> > > worked on. Note the MPI_Send calls of processes N>0 are *blocking*; the
> > > openmpi silently sends them in the background and overwhelms process 0
> > > due to lack of flow control.
> > MPI_Send is *blocking* in MPI sense of the word i.e when MPI_Send returns
> > send buffer can be reused. MPI spec section 3.4.
> >
> > --
> >  Gleb.

--
Gleb.


Re: [OMPI users] openmpi credits for eager messages

2008-02-05 Thread Gleb Natapov
On Mon, Feb 04, 2008 at 04:23:13PM -0500, Sacerdoti, Federico wrote:
> Bug3 is a test-case derived from a real, scalable application (desmond
> for molecular dynamics) that several experienced MPI developers have
> worked on. Note the MPI_Send calls of processes N>0 are *blocking*; the
> openmpi silently sends them in the background and overwhelms process 0
> due to lack of flow control.
MPI_Send is *blocking* in the MPI sense of the word, i.e. when MPI_Send returns
the send buffer can be reused. See MPI spec section 3.4.

--
Gleb.


Re: [OMPI users] openmpi credits for eager messages

2008-02-04 Thread Gleb Natapov
On Mon, Feb 04, 2008 at 02:54:46PM -0500, Richard Treumann wrote:
> In my example, each sender task 1 to n-1 will have one rendezvous message
> to task 0 at a time.  The MPI standard suggests descriptors be small enough
> and  there be enough descriptor space for reasonable programs . The
> standard is clear that unreasonable programs can run out of space and fail.
> The standard does not try to quantify reasonableness.
You are right about your example, but I was not talking specifically about it.
Your example should work with Open MPI over IB/TCP because, while rank 0 sleeps
without calling progress, transport-layer flow control should throttle the
senders. (SM doesn't have flow control; that is why it fails.) What I was
trying to say is that in MPI a process can't fully control its resource usage.

--
Gleb.


Re: [OMPI users] openmpi credits for eager messages

2008-02-04 Thread Gleb Natapov
On Mon, Feb 04, 2008 at 09:08:45AM -0500, Richard Treumann wrote:
> To me, the MPI standard is clear that a program like this:
> 
> task 0:
> MPI_Init
> sleep(3000);
> start receiving messages
> 
> each of tasks 1 to n-1:
> MPI_Init
> loop 5000 times
>MPI_Send(small message to 0)
> end loop
> 
> May send some small messages eagerly if there is space at task 0 but must
> block each task 1 to  n-1 before allowing task 0 to run out of eager buffer
> space.  Doing this requires a token or credit management system in which
> each task has credits for known buffer space at task 0. Each task will send
> eagerly to task 0 until the sender runs out of credits and then must switch
> to rendezvous protocol.
And rendezvous messages are not free either. So this approach will only
postpone failure a little bit.

--
Gleb.


Re: [OMPI users] mixed myrinet/non-myrinet nodes

2008-01-15 Thread Gleb Natapov
On Tue, Jan 15, 2008 at 09:49:40AM -0500, M Jones wrote:
> Hi,
> 
>We have a mixed environment in which roughly 2/3 of the nodes
> in our cluster have myrinet (mx 1.2.1), while the full cluster has
> gigE.  Running open-mpi exclusively on myrinet nodes or exclusively
> on non-myrinet nodes is fine, but mixing the two nodes types
> results in a runtime error (PML add procs failed), no matter what --mca 
> flags I try to use to push the traffic onto tcp (note that
> --mca mtl ^mx --mca btl ^mx does appear to use tcp, as long as all
> of the nodes have myrinet cards, but not in the mixed case).
What error do you get in this case? What version of Open MPI are you using?

--
Gleb.


Re: [OMPI users] Ideal MTU in Infiniband

2008-01-10 Thread Gleb Natapov
On Thu, Jan 10, 2008 at 06:23:50PM +0530, Parag Kalra wrote:
> Hello all,
> 
> Any ideas?
Yes. The idea is that Open MPI knows best. Run it with the default
value. Usually a bigger MTU is better, but some HW has bugs; Open MPI
knows this and chooses the best value for your HW.

> 
> --
> Parag Kalra
> 
> 
> On Jan 10, 2008 4:15 AM, Parag Kalra  wrote:
> 
> > Hello all,
> >
> > I am using Open MPI with Infiniband configured.
> >
> > What should be the ideal MTU size for infiniband?
> >
> > --
> > PARAG . A . KALRA
> >
> >
> >
> >


--
Gleb.


Re: [OMPI users] Gigabit ethernet (PCI Express) and openmpi v1.2.4

2007-12-17 Thread Gleb Natapov
On Sun, Dec 16, 2007 at 06:49:30PM -0500, Allan Menezes wrote:
> Hi,
>  How many PCI-Express Gigabit ethernet cards does OpenMPI version 1.2.4 
> support with a corresponding linear increase in bandwith measured with 
> netpipe NPmpi and openmpi mpirun?
> With two PCI express cards I get a B/W of 1.75Gbps for 892Mbps each ans 
> for three pci express cards ( one built into the motherboard) i get 
> 1.95Gbps. They all are around 890Mbs indiviually measured with netpipe 
> and NPtcp and NPmpi and openmpi. For two it seems there is a linear 
> increase in b/w but not for three pci express gigabit eth cards.
> I have tune the cards using netpipe and $HOME/.openmpi/mca-params.conf 
> file for latency and percentage b/w .
> Please advise.
What is in your $HOME/.openmpi/mca-params.conf? You may be hitting a
chipset limit here. What is your HW configuration? Can you try to run
NPtcp on each interface simultaneously and see what BW you get?
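As a point of reference, the part of $HOME/.openmpi/mca-params.conf that
matters for striping over several NICs with the TCP BTL usually looks
something like this (interface names are only an example):

    # restrict MPI traffic to the NICs that should be striped over
    btl_tcp_if_include = eth1,eth2,eth3
    btl = tcp,sm,self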

--
Gleb.


Re: [OMPI users] Does MPI_Bsend always use the buffer?

2007-12-11 Thread Gleb Natapov
On Tue, Dec 11, 2007 at 10:27:32AM -0500, Bradley, Peter C. (MIS/CFD) wrote:
> In OpenMPI, does MPI_Bsend always copy the message to the user-specified
> buffer, or will it avoid the copy in situations where it knows the send can
> complete?
If the message size is smaller than the eager limit, Open MPI will not use
the user-specified buffer for it.
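For completeness, the MPI_Bsend pattern in question looks like this
(plain MPI sketch):

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, n = 1 << 20;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *msg = calloc(n, 1);
        int bufsize = n + MPI_BSEND_OVERHEAD;
        char *attach = malloc(bufsize);
        MPI_Buffer_attach(attach, bufsize);    /* Bsend always has room now */

        if (rank == 0)
            MPI_Bsend(msg, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(msg, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        MPI_Buffer_detach(&attach, &bufsize);  /* waits for buffered sends to drain */
        free(attach);
        free(msg);
        MPI_Finalize();
        return 0;
    }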

--
Gleb.


Re: [OMPI users] machinefile and rank

2007-11-07 Thread Gleb Natapov
On Tue, Nov 06, 2007 at 09:22:50PM -0500, Jeff Squyres wrote:
> Unfortunately, not yet.  I believe that this kind of functionality is  
> slated for the v1.3 series -- is that right Ralph/Voltaire?
> 
Yes, the file format will be different, but arbitrary mapping will be
possible.

> 
> On Nov 5, 2007, at 11:22 AM, Karsten Bolding wrote:
> 
> > Hello
> >
> > I'm using a machinefile like:
> > n03
> > n04
> > n03
> > n03
> > n03
> > n02
> > n01
> > ..
> > ..
> > ..
> >
> > the order of the entries is determined by an external program for load
> > balancing reasons. When the job is started the ranks do not correspond
> > to entries in the machinefile. Is there a way to force that entry  
> > one in
> > the machinefile gets rank=0, sencond entry gets rank=1 etc.
> >
> >
> > Karsten
> >
> >
> > -- 
> > --
> > Karsten BoldingBolding & Burchard Hydrodynamics
> > Strandgyden 25 Phone: +45 64422058
> > DK-5466 AsperupFax:   +45 64422068
> > DenmarkEmail: kars...@bolding-burchard.com
> >
> > http://www.findvej.dk/Strandgyden25,5466,11,3
> > --
> 
> 
> -- 
> Jeff Squyres
> Cisco Systems
> 

--
Gleb.


Re: [OMPI users] IB latency on Mellanox ConnectX hardware

2007-10-18 Thread Gleb Natapov
On Wed, Oct 17, 2007 at 05:43:14PM -0400, Jeff Squyres wrote:
> Several users have noticed poor latency with Open MPI when using the  
> new Mellanox ConnectX HCA hardware.  Open MPI was getting about 1.9us  
> latency with 0 byte ping-pong benchmarks (e.g., NetPIPE or  
> osu_latency).  This has been fixed in OMPI v1.2.4.
> 
> Short version:
> --
> 
> Open MPI v1.2.4 (and newer) will get around 1.5us latency with 0 byte  
> ping-pong benchmarks on Mellanox ConnectX HCAs.  Prior versions of  
> Open MPI can also achieve this low latency by setting the  
> btl_openib_use_eager_rdma MCA parameter to 1.

Actually, setting btl_openib_use_eager_rdma to 1 will not help. The
reason is that it is 1 by default anyway, but Open MPI disables eager
RDMA because it can't find the HCA description in the ini file and cannot
distinguish between the default value and a value that the user set explicitly.
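On older installs the usual workaround is to add a ConnectX entry to
share/openmpi/mca-btl-openib-hca-params.ini so the HCA is recognized; the
entry is roughly of this shape (section name and IDs are illustrative,
copy the exact values from a 1.2.4 tree):

    [Mellanox Hermon]
    vendor_id = 0x2c9
    vendor_part_id = 25408,25418,25428
    use_eager_rdma = 1
    mtu = 2048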

> 
> Longer version:
> ---
> 
> Until OMPI v1.2.4, Open MPI did not include specific configuration  
> information for ConnectX hardware, which forced Open MPI to choose  
> the conservative/safe configuration of not using RDMA for short  
> messages (using send/receive semantics instead).  This increases  
> point-to-point latency in benchmarks.
> 
> OMPI v1.2.4 (and newer) includes the relevant configuration  
> information that enables short message RDMA by default on Mellanox  
> ConnectX hardware.  This significantly improves Open MPI's latency on  
> popular MPI benchmark applications.
> 
> The same performance can be achieved on prior versions of Open MPI by  
> setting the btl_openib_use_eager_rdma MCA parameter to 1.  The main  
> difference between v1.2.4 and prior versions is that the prior  
> versions do not set this MCA parameter value by default for ConnectX  
> hardware (because ConnectX did not exist when prior versions of Open  
> MPI were released).
> 
> This information is also now described on the FAQ:
> 
> http://www.open-mpi.org/faq/?category=openfabrics#mellanox-connectx- 
> poor-latency
> 
> -- 
> Jeff Squyres
> Cisco Systems
> 

--
Gleb.


Re: [OMPI users] Multiple threads

2007-10-01 Thread Gleb Natapov
On Mon, Oct 01, 2007 at 10:39:12AM +0200, Olivier DUBUISSON wrote:
> Hello,
> 
> I compile openmpi 1.2.3 with options ./configure  --with-threads=posix
> --enable-mpi-thread --enable-progress-threads --enable-smp-locks.
> 
> My program has 2 threads (main thread and an other). When i run it, i
> can see 4 threads. I think that two threads are the progress threads, is
> it right ?
> 
> Is it possible to disable these progress threads ?
> 
> I tried to compile openmpi with options ./configure
> --with-threads=posix --enable-mpi-thread --disable-progress-threads
 ^
It should be --enable-mpi-threads.

> --enable-smp-locks, but when i run my program, i get the message :
> Error! Cannot set MPI thread support to the desired value (asked for
> MPI_THREAD_SERIALIZED and got MPI_THREAD_SINGLE).
> 
> 
--
Gleb.


Re: [OMPI users] SKaMPI hangs on collectives and onesided

2007-09-20 Thread Gleb Natapov
On Wed, Sep 19, 2007 at 08:54:51PM -0400, Jelena Pjesivac-Grbovic wrote:
> The suggestion will probably work, but it is not a solution.
> "choosing barrier synchronization" is not recommended by SKaMPI team and 
> that it reduces accuracy of the benchmark.
I know. I just want to be sure that this is the same problem as in
ticket #1015.

> The problem is either at  pml ob1 level or in btl ib level - and it has to 
> do with many messages being sent at the same time.  You can reproduce this 
> type of problem at 4 - 5 nodes over IB (on odin) using bcast or reduce 
> using small segment sizes (1KB, less than eager size for ib). (I do not 
> think I saw it on 2 nodes).  I haven't tried it on onesided operations, but 
> if it happens there too - I am even more likely to believe in my theory :)
The problem is that a short request may be completed at the MPI level even
before the data is put on the wire. Later the application has to enter the MPI
library to progress the request, but SKaMPI's funky synchronisation doesn't
do that. If this behaviour is correct with regard to the MPI spec then Open MPI
has to be fixed. We don't see this problem more often because usually
applications call MPI_Finalize at some point, and we have a
barrier there, so all outstanding requests are progressed.

>
> Thanks,
> Jelena
>
> Gleb Natapov wrote:
>> On Wed, Sep 19, 2007 at 01:58:35PM -0600, Edmund Sumbar wrote:
>>   
>>> I'm trying to run skampi-5.0.1-r0191 under PBS
>>> over IB with the command line
>>>
>>>mpirun -np 2 ./skampi -i coll.ski -o coll_ib.sko
>>> 
>> Can you add choose_barrier_synchronization()
>> to coll.ski and try again? It looks like this one:
>> https://svn.open-mpi.org/trac/ompi/ticket/1015
>>
>>   
>>> The pt2pt and mmisc tests run to completion.
>>> The coll and onesided tests, on the other hand,
>>> start to produce output but then seem to hang.
>>> Actually, the cpus appear to be busy doing
>>> something (I don't know what), but output stops.
>>> The tests should only last the order of minutes
>>> but I end up deleting the job after about 15 min.
>>>
>>> All test run to completion with --mca btl tcp,self
>>>
>>> Any suggestions as to how to diagnose this problem?
>>> Are there any known issues with OpenMPI/IB and the
>>> SKaMPI benchmark?
>>>
>>> (BTW, skampi works with mvapich2)
>>>
>>> System details follow...
>>>
>>> -- 
>>> Ed[mund [Sumbar]]
>>> AICT Research Support, Univ of Alberta
>>>
>>>
>>> $ uname -a
>>> Linux opteron-cluster.nic.ualberta.ca 2.6.21-smp #1 SMP Tue Aug 7 
>>> 12:45:20 MDT 2007 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> $ ./configure --prefix=/usr/local/openmpi-1.2.3 --with-tm=/opt/torque 
>>> --with-openib=/usr/lib --with-libnuma=/usr/lib64
>>>
>>> $ ompi_info
>>>  Open MPI: 1.2.3
>>> Open MPI SVN revision: r15136
>>>  Open RTE: 1.2.3
>>> Open RTE SVN revision: r15136
>>>  OPAL: 1.2.3
>>> OPAL SVN revision: r15136
>>>Prefix: /usr/local/openmpi-1.2.3
>>>   Configured architecture: x86_64-unknown-linux-gnu
>>> Configured by: esumbar
>>> Configured on: Mon Sep 17 10:00:35 MDT 2007
>>>Configure host: opteron-cluster.nic.ualberta.ca
>>>  Built by: esumbar
>>>  Built on: Mon Sep 17 10:05:09 MDT 2007
>>>Built host: opteron-cluster.nic.ualberta.ca
>>>C bindings: yes
>>>  C++ bindings: yes
>>>Fortran77 bindings: yes (all)
>>>Fortran90 bindings: yes
>>>   Fortran90 bindings size: small
>>>C compiler: gcc
>>>   C compiler absolute: /usr/bin/gcc
>>>  C++ compiler: g++
>>> C++ compiler absolute: /usr/bin/g++
>>>Fortran77 compiler: gfortran
>>>Fortran77 compiler abs: /usr/bin/gfortran
>>>Fortran90 compiler: gfortran
>>>Fortran90 compiler abs: /usr/bin/gfortran
>>>   C profiling: yes
>>> C++ profiling: yes
>>>   Fortran77 profiling: yes
>>>   Fortran90 profiling: yes
>>>C++ exceptions: no
>>>Thread support: posix (mpi: no, progress: no)
>>>Internal debug support: no
>>>   MPI parameter check: runtime
>>> Memory profiling support: no

Re: [OMPI users] SKaMPI hangs on collectives and onesided

2007-09-19 Thread Gleb Natapov
On Wed, Sep 19, 2007 at 01:58:35PM -0600, Edmund Sumbar wrote:
> I'm trying to run skampi-5.0.1-r0191 under PBS
> over IB with the command line
> 
>mpirun -np 2 ./skampi -i coll.ski -o coll_ib.sko
Can you add 
choose_barrier_synchronization()
to coll.ski and try again? It looks like this one:
https://svn.open-mpi.org/trac/ompi/ticket/1015

> 
> The pt2pt and mmisc tests run to completion.
> The coll and onesided tests, on the other hand,
> start to produce output but then seem to hang.
> Actually, the cpus appear to be busy doing
> something (I don't know what), but output stops.
> The tests should only last the order of minutes
> but I end up deleting the job after about 15 min.
> 
> All test run to completion with --mca btl tcp,self
> 
> Any suggestions as to how to diagnose this problem?
> Are there any known issues with OpenMPI/IB and the
> SKaMPI benchmark?
> 
> (BTW, skampi works with mvapich2)
> 
> System details follow...
> 
> -- 
> Ed[mund [Sumbar]]
> AICT Research Support, Univ of Alberta
> 
> 
> $ uname -a
> Linux opteron-cluster.nic.ualberta.ca 2.6.21-smp #1 SMP Tue Aug 7 12:45:20 
> MDT 2007 x86_64 x86_64 x86_64 GNU/Linux
> 
> $ ./configure --prefix=/usr/local/openmpi-1.2.3 --with-tm=/opt/torque 
> --with-openib=/usr/lib --with-libnuma=/usr/lib64
> 
> $ ompi_info
>  Open MPI: 1.2.3
> Open MPI SVN revision: r15136
>  Open RTE: 1.2.3
> Open RTE SVN revision: r15136
>  OPAL: 1.2.3
> OPAL SVN revision: r15136
>Prefix: /usr/local/openmpi-1.2.3
>   Configured architecture: x86_64-unknown-linux-gnu
> Configured by: esumbar
> Configured on: Mon Sep 17 10:00:35 MDT 2007
>Configure host: opteron-cluster.nic.ualberta.ca
>  Built by: esumbar
>  Built on: Mon Sep 17 10:05:09 MDT 2007
>Built host: opteron-cluster.nic.ualberta.ca
>C bindings: yes
>  C++ bindings: yes
>Fortran77 bindings: yes (all)
>Fortran90 bindings: yes
>   Fortran90 bindings size: small
>C compiler: gcc
>   C compiler absolute: /usr/bin/gcc
>  C++ compiler: g++
> C++ compiler absolute: /usr/bin/g++
>Fortran77 compiler: gfortran
>Fortran77 compiler abs: /usr/bin/gfortran
>Fortran90 compiler: gfortran
>Fortran90 compiler abs: /usr/bin/gfortran
>   C profiling: yes
> C++ profiling: yes
>   Fortran77 profiling: yes
>   Fortran90 profiling: yes
>C++ exceptions: no
>Thread support: posix (mpi: no, progress: no)
>Internal debug support: no
>   MPI parameter check: runtime
> Memory profiling support: no
> Memory debugging support: no
>   libltdl support: yes
> Heterogeneous support: yes
>   mpirun default --prefix: no
> MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.3)
>MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.2.3)
> MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.2.3)
> MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.3)
> MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.2.3)
> MCA timer: linux (MCA v1.0, API v1.0, Component v1.2.3)
>   MCA installdirs: env (MCA v1.0, API v1.0, Component v1.2.3)
>   MCA installdirs: config (MCA v1.0, API v1.0, Component v1.2.3)
> MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
> MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
>  MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.3)
>  MCA coll: self (MCA v1.0, API v1.0, Component v1.2.3)
>  MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.3)
>  MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2.3)
>MCA io: romio (MCA v1.0, API v1.0, Component v1.2.3)
> MCA mpool: rdma (MCA v1.0, API v1.0, Component v1.2.3)
> MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.3)
>   MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.3)
>   MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.3)
>   MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2.3)
>MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2.3)
>   MCA btl: openib (MCA v1.0, API v1.0.1, Component v1.2.3)
>   MCA btl: self (MCA v1.0, API v1.0.1, Component v1.2.3)
>   MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2.3)
>   MCA btl: tcp (MCA v1.0, API v1.0.1, Component v1.0)
>  MCA topo: unity (MCA v1.0, API v1.0, Component v1.2.3)
>   MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.2.3)
>MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.2.3)
>MCA errmgr: orted (MCA v1.0, API v1.3, 

Re: [OMPI users] OpenMPI and Port Range

2007-08-31 Thread Gleb Natapov
On Fri, Aug 31, 2007 at 10:49:10AM +0200, Sven Stork wrote:
> On Friday 31 August 2007 09:07, Gleb Natapov wrote:
> > On Fri, Aug 31, 2007 at 08:04:00AM +0100, Simon Hammond wrote:
> > > On 31/08/2007, Lev Givon <l...@columbia.edu> wrote:
> > > > Received from George Bosilca on Thu, Aug 30, 2007 at 07:42:52PM EDT:
> > > > > I have a patch for this, but I never felt a real need for it, so I
> > > > > never push it in the trunk. I'm not completely convinced that we need
> > > > > it, except in some really strange situations (read grid). Why do you
> > > > > need a port range ? For avoiding firewalls ?
> > > 
> > > We are planning on using OpenMPI as the basis for running MPI jobs
> > > across a series of workstations overnight. The workstations are locked
> > > down so that only a small number of ports are available for use. If we
> > > try to use anything else its disaster.
> > > 
> > > Unfortunately this is really an organizational policy above anything
> > > else and its very difficult to get it to change.
> > > 
> > > 
> > As workaround you can write application that will bind to all ports that
> > are not allowed to be used by MPI before running MPI job.
> 
> Another option could be (if that match your policy) to limit the dynamic port 
> range that is used by your OS. By this all application (unless they ask for 
> an specific port) will get ports in this limited port range. If so the 
> following link might be interesting for you:
> 
> http://www.ncftp.com/ncftpd/doc/misc/ephemeral_ports.html
> 
I was sure it was possible to set a port range on Linux, but didn't know how.
This is a much better workaround.
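On Linux the dynamic range is controlled by net.ipv4.ip_local_port_range,
e.g. (the numbers are only an illustration):

    sysctl -w net.ipv4.ip_local_port_range="50000 51000"
    # or equivalently
    echo "50000 51000" > /proc/sys/net/ipv4/ip_local_port_range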

--
Gleb.


Re: [OMPI users] OpenMPI and Port Range

2007-08-31 Thread Gleb Natapov
On Fri, Aug 31, 2007 at 08:17:36AM +0100, Simon Hammond wrote:
> On 31/08/2007, Gleb Natapov <gl...@voltaire.com> wrote:
> > On Fri, Aug 31, 2007 at 08:04:00AM +0100, Simon Hammond wrote:
> > > On 31/08/2007, Lev Givon <l...@columbia.edu> wrote:
> > > > Received from George Bosilca on Thu, Aug 30, 2007 at 07:42:52PM EDT:
> > > > > I have a patch for this, but I never felt a real need for it, so I
> > > > > never push it in the trunk. I'm not completely convinced that we need
> > > > > it, except in some really strange situations (read grid). Why do you
> > > > > need a port range ? For avoiding firewalls ?
> > >
> > > We are planning on using OpenMPI as the basis for running MPI jobs
> > > across a series of workstations overnight. The workstations are locked
> > > down so that only a small number of ports are available for use. If we
> > > try to use anything else its disaster.
> > >
> > > Unfortunately this is really an organizational policy above anything
> > > else and its very difficult to get it to change.
> > >
> > >
> > As workaround you can write application that will bind to all ports that
> > are not allowed to be used by MPI before running MPI job.
> 
> Sounds very drastic, thanks for the advice. I'll give it a go. Do you
> think it might be easy to add this to the source code at sometime
> though?
>
It is just a workaround. The proper solution would of course be to add an option for this.

--
Gleb.


Re: [OMPI users] OpenMPI and Port Range

2007-08-31 Thread Gleb Natapov
On Fri, Aug 31, 2007 at 08:04:00AM +0100, Simon Hammond wrote:
> On 31/08/2007, Lev Givon  wrote:
> > Received from George Bosilca on Thu, Aug 30, 2007 at 07:42:52PM EDT:
> > > I have a patch for this, but I never felt a real need for it, so I
> > > never push it in the trunk. I'm not completely convinced that we need
> > > it, except in some really strange situations (read grid). Why do you
> > > need a port range ? For avoiding firewalls ?
> 
> We are planning on using OpenMPI as the basis for running MPI jobs
> across a series of workstations overnight. The workstations are locked
> down so that only a small number of ports are available for use. If we
> try to use anything else its disaster.
> 
> Unfortunately this is really an organizational policy above anything
> else and its very difficult to get it to change.
> 
> 
As a workaround you can write an application that binds to all the ports that
MPI is not allowed to use, before running the MPI job.

--
Gleb.


Re: [OMPI users] Basic problems with OpenMPI

2007-08-29 Thread Gleb Natapov
On Wed, Aug 29, 2007 at 03:22:54PM +0530, Amit Kumar Saha wrote:
> Hi Glib,
> 
> i am sending a sample trace of my program:
> 
> amit@ubuntu-desktop-1:~/mpi-exec$ mpirun --np 3 --hostfile
> mpi-host-file HellMPI
> 
> amit@debian-desktop-1's password: [ubuntu-desktop-1:28575] [0,0,0]
> ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
> [ubuntu-desktop-1:28575] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> pls_rsh_module.c at line 1164
> [ubuntu-desktop-1:28575] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> errmgr_hnp.c at line 90
> [ubuntu-desktop-1:28575] ERROR: A daemon on node ubuntu-desktop-2
> failed to start as expected.
> [ubuntu-desktop-1:28575] ERROR: There may be more information available from
> [ubuntu-desktop-1:28575] ERROR: the remote shell (see above).
> [ubuntu-desktop-1:28575] ERROR: The daemon exited unexpectedly with status 
> 255.
> [ubuntu-desktop-1:28575] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> base/pls_base_orted_cmds.c at line 188
> [ubuntu-desktop-1:28575] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> pls_rsh_module.c at line 1196
> --
> mpirun was unable to cleanly terminate the daemons for this job.
> Returned value Timeout instead of ORTE_SUCCESS.
> 
> --
> 
> this is what I get when i run the program.
> 
> However when i use "--np 2 " it works perfectly which of course means
> that it is not a problem with "debian-desktop-1" as the above output
> may show.
> 
The above output shows that you have a problem on host ubuntu-desktop-2.
Have you set up passwordless login from ubuntu-desktop-1 to
ubuntu-desktop-2?
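If not, something along these lines (assuming OpenSSH) usually does it:

    ssh-keygen -t rsa                 # accept the defaults, empty passphrase
    ssh-copy-id amit@ubuntu-desktop-2
    ssh ubuntu-desktop-2 hostname     # should no longer ask for a password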


--
Gleb.


Re: [OMPI users] Basic problems with OpenMPI

2007-08-29 Thread Gleb Natapov
On Wed, Aug 29, 2007 at 02:49:35PM +0530, Amit Kumar Saha wrote:
> Hi gleb,
> 
> 
> > Have you installed Open MPI at the same place on all nodes? What command
> > line are you using to run app on more then one host?
> 
> this is a sample run
> 
> amit@ubuntu-desktop-1:~/mpi-exec$ mpirun --np 2 --hostfile
> mpi-host-file HellMPI
> amit@ubuntu-desktop-2's password:
> HellMPI: error while loading shared libraries: liborte.so.0: cannot
> open shared object file: No such file or directory
> 
HellMPI was compiled with the Open MPI 1.1 mpicc. Version 1.2 has libopen-rte.so,
not liborte.so.

--
Gleb.


Re: [OMPI users] Basic problems with OpenMPI

2007-08-29 Thread Gleb Natapov
On Wed, Aug 29, 2007 at 02:32:58PM +0530, Amit Kumar Saha wrote:
> Hi all,
> 
> I have installed OpenMPI 1.2.3 on all my hosts (3).
> 
> Now when I try to start a simple demo program ("hello world") using
> ./a.out I get the error. When I run my program using "mpirun" on more
> than one host it gives me similar error:
> 
> error while loading shared libraries: libopen-rte.so.0: cannot open
> shared object file: No such file or directory
> 
> However when I do a mpirun a.out , it gives me no error.
> 
> Please suggest
> 
Have you installed Open MPI in the same place on all nodes? What command
line are you using to run the app on more than one host?

--
Gleb.


Re: [OMPI users] Basic problems with OpenMPI

2007-08-29 Thread Gleb Natapov
On Wed, Aug 29, 2007 at 01:03:30PM +0530, Amit Kumar Saha wrote:
> Also, is open MPI 1.1 compatible with MPI 1.2.3, I mean to ask is
> whether a MPI executable generated using 1.1 is executable by 1.2.3?
No. They are not compatible.

--
Gleb.


Re: [OMPI users] Basic problems with OpenMPI

2007-08-29 Thread Gleb Natapov
On Wed, Aug 29, 2007 at 12:26:54PM +0530, Amit Kumar Saha wrote:
> Hello all,
> 
> I have installed Open MPI 1.2.3 from source on Debian 4.0. I did the
> "make all install" using root privileges.
> 
> Now when I try to execute a simple program , I get the following:
> 
> debian-desktop-1:/home/amit/junk/mpi-codes# mpirun --np 1 --hostfile
> hostfile ./a.out
> ./a.out: error while loading shared libraries: libmpi.so.0: cannot
> open shared object file: No such file or directory
> 
> I get the error whether I do it as "normal user" or "root user"
> 
> Please suggest.
> 
Where have you installed it? If in /usr/local/ then try to run
mpirun --prefix /usr/local/ --np 1 --hostfile hostfile ./a.out

If this helps then you may want to re-run the configure script with the flag
--enable-orterun-prefix-by-default and recompile.
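Another common fix is to make sure the runtime linker can find the Open MPI
libraries on every node, e.g. in ~/.bashrc (paths assume the default
/usr/local prefix):

    export PATH=/usr/local/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH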

--
Gleb.


Re: [OMPI users] Basic problems with OpenMPI

2007-08-29 Thread Gleb Natapov
On Wed, Aug 29, 2007 at 11:42:29AM +0530, Amit Kumar Saha wrote:
> hello all,
> 
> I am just trying to get started with OpenMPI (version 1.1) on Linux.
Version 1.1 is old and no longer supported.

> 
> When I try to run a simple MPI - "Hello World" program, here is what i get:
> 
> amit@ubuntu-desktop-1:~/junk/mpi-codes$ mpirun -np 1 --hostfile
> mpi-host-file ./a.out
> libibverbs: Fatal: couldn't read uverbs ABI version.
> --
> [0,1,0]: OpenIB on host ubuntu-desktop-1 was unable to find any HCAs.
> Another transport will be used instead, although this may result in
> lower performance.
> --
> Processor 0 of 1: Hello World!
> 
> Please explain the statements above.
Open MPI has the InfiniBand module compiled in, but no IB device was found
on your host. Try adding "--mca btl ^openib" to your command
line.

> 
> Also, when I am trying to launch the above process on 2 processors,
> instead of one, it gives me:
> 
> Failed to find or execute the following executable:
> 
> Host:   ubuntu-desktop-2
> Executable: ./a.out
> 
> Cannot continue.
> 
> Does that mean I have to place a copy of the executable on the other
> node as well? Where should I place the executable?
> 
Yes. At the same location on each host.

--
Gleb.


Re: [OMPI users] OpenMPI fails to initalize the openib btl when run from SGE

2007-08-22 Thread Gleb Natapov
On Wed, Aug 22, 2007 at 03:31:20PM +0300, Noam Meltzer wrote:
> Hi,
> 
> I am running openmpi-1.2.3 compiled for 64bit on RHEL4u4.
> I also have a Voltaire InfiniBand interconnect.
> When I manually run jobs using the following command:
> 
> /opt/local/openmpi-1.2.3-gcc4/bin/orterun -np 8 -hostfile ~/myHostList 
> -mca btl self,openib /tcc/eandm/performance/igor/main.exe.openmpi123
> 
> The job is executed just fine..
> 
> Though, when run through SGE I have the weirdest problem, and get the 
> following error (on all hosts in my list):
> --
> The OpenIB BTL failed to initialize while trying to create an internal
> queue.  This typically indicates a failed OpenFabrics installation or
> faulty hardware.  The failure occured here:
> 
> Host:node4.grid.technion.ac.il
> OMPI source: btl_openib.c:828
> Function:ibv_create_cq()
> Error:   Invalid argument (errno=22)
> Device:  mthca0
> 
> You may need to consult with your system administrator to get this
> problem fixed.
> --
> 
> To send a job to the grid I use the following command:
> qrsh -cwd -q noam.q -pe orte 8 ./myScript
> 
> while "myScript" looks like:
> 
> #!/bin/bash
> /opt/local/openmpi-1.2.3-gcc4/bin/orterun -np $NSLOTS -mca btl 
> self,openib /tcc/eandm/performance/igor/main.exe.openmpi123
> 
> If I change "openib" to "tcp" (in myScript) everything works just fine.
> 
> Any ideas?
> 
Perhaps SGE doesn't set the locked memory limit properly.
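A quick way to test that is to raise the limit inside the job script itself
(this only works if the hard limit set by SGE allows it):

    #!/bin/bash
    ulimit -l unlimited    # locked-memory limit needed by the openib BTL
    /opt/local/openmpi-1.2.3-gcc4/bin/orterun -np $NSLOTS -mca btl \
        self,openib /tcc/eandm/performance/igor/main.exe.openmpi123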

--
Gleb.


Re: [OMPI users] opal_init_Segmentation Fault

2007-07-17 Thread Gleb Natapov
On Tue, Jul 17, 2007 at 07:17:58AM -0400, Jeff Squyres wrote:
> Unfortunately, this looks like a problem with your gcc installation  
> -- a compiler should never seg fault when it's trying to compile C  
> source code.
> 
> FWIW: the file in question that it's trying to compile is actually  
> from GNU Libtool (which is included in Open MPI).
> 
> You should probably investigate your C compiler to ensure that it's  
> working properly.
A gcc SEGV is usually the first sign of faulty memory. Run memtest ASAP.

> 
> 
> On Jul 17, 2007, at 7:06 AM, Igor Miskovski wrote:
> 
> > Hello,
> >
> > When i try to install OpenMPI on Linux Suse 10.2 on AMDX2 Dual Core  
> > processor i get the following message:
> >
> > make[3]: Entering directory `/home/igor/openmpi-1.2.3/opal/libltdl'
> > if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H - 
> > I. -I. -I.  - 
> > D
> > LT_CONFIG_H='< config.h>' -DLTDL -I. -I. -Ilibltdl -I./libltdl -I./ 
> > libltdl   - 
> > O3- 
> > DNDEBUG  -MT dlopen.lo -MD -MP -MF ".deps/dlopen.Tpo" -c -o  
> > dlopen.lo `test - 
> > f 
> > 'loaders/dlopen.c' || echo './'`loaders/dlopen.c; \
> > then mv -f ".deps/dlopen.Tpo" ".deps/dlopen.Plo"; else rm - 
> > f ".deps/ 
> > dlop
> > en.Tpo"; exit 1; fi
> > libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I. -I. "-DLT_CONFIG_H=<  
> > config.h>" - 
> > D   LTDL - 
> > I. -I. -Ilibltdl -I./libltdl -I./libltdl -O3 -DNDEBUG -MT dlopen.lo  
> > -MD -M   P - 
> > MF .deps/dlopen.Tpo -c loaders/dlopen.c  -fPIC -DPIC -o .libs/dlopen.o
> > loaders/dlopen.c: In function 'dlopen_LTX_get_vtable':
> > loaders/dlopen.c:84: internal compiler error: Segmentation fault
> > Please submit a full bug report,
> > with preprocessed source if appropriate.
> > See http://bugs.opensuse.org> for instructions.
> > make[3]: *** [dlopen.lo] Error 1
> > make[3]: Leaving directory `/home/igor/openmpi-1.2.3/opal/libltdl'
> > make[2]: *** [all] Error 2
> > make[2]: Leaving directory `/home/igor/openmpi-1.2.3/opal/libltdl'
> > make[1]: *** [all-recursive] Error 1
> > make[1]: Leaving directory `/home/igor/openmpi-1.2.3/opal'
> > make: *** [all-recursive] Error 1
> >
> > Can somebody help me?
> >
> > Thanks,
> > Igor Miskovski
> 
> 
> -- 
> Jeff Squyres
> Cisco Systems
> 

--
Gleb.


Re: [OMPI users] Processes stuck in MPI_BARRIER

2007-06-20 Thread Gleb Natapov
On Tue, Jun 19, 2007 at 11:24:24AM -0700, George Bosilca wrote:
> 1. I don't believe the OS to release the binding when we close the  
> socket. As an example on Linux the kernel sockets are release at a  
> later moment. That means the socket might be still in use for the  
> next run.
>
This is not Linux specific; it is required by the TCP RFC. The socket that
initiated the close will remain in the TIME_WAIT state for a while (twice the
maximum segment lifetime).
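The usual way listening sockets cope with that is SO_REUSEADDR, e.g.
(a generic sketch, not Open MPI code):

    #include <sys/socket.h>

    /* allow bind() to reuse a local address still sitting in TIME_WAIT
       from a previous run */
    int make_listen_socket(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
        return fd;
    }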

--
Gleb.


Re: [OMPI users] OpenMPI/OpenIB/IMB hangs[Scanned]

2007-01-19 Thread Gleb Natapov
On Fri, Jan 19, 2007 at 05:51:49PM +, Arif Ali wrote:
> >>I tried the nightly snapshot of OpenMPI-1.2b4r13137, which failed  
> >>miserably.
> >>
> >
> >Can you describe what happened there?  Is it failing in a different way?
> >  
> Here's the output
> 
> #---
> # Intel (R) MPI Benchmark Suite V2.3, MPI-1 part
> #---
> # Date : Fri Jan 19 17:33:52 2007
> # Machine : ppc64# System : Linux
> # Release : 2.6.16.21-0.8-ppc64
> # Version : #1 SMP Mon Jul 3 18:25:39 UTC 2006
> 
> #
> # Minimum message length in bytes: 0
> # Maximum message length in bytes: 4194304
> #
> # MPI_Datatype : MPI_BYTE
> # MPI_Datatype for reductions : MPI_FLOAT
> # MPI_Op : MPI_SUM
> #
> #
> 
> # List of Benchmarks to run:
> 
> # PingPong
> # PingPing
> # Sendrecv
> # Exchange
> # Allreduce
> # Reduce
> # Reduce_scatter
> # Allgather
> # Allgatherv
> # Alltoall
> # Bcast
> # Barrier
> 
> #---
> # Benchmarking PingPong
> # #processes = 2
> # ( 58 additional processes waiting in MPI_Barrier)
> #---
> #bytes #repetitions t[usec] Mbytes/sec
> 0 1000 1.76 0.00
> 1 1000 1.88 0.51
> 2 1000 1.89 1.01
> 4 1000 1.91 2.00
> 8 1000 1.88 4.05
> 16 1000 2.02 7.55
> 32 1000 2.05 14.88
> [0,1,4][btl_openib_component.c:1153:btl_openib_component_progress] from 
> node03 to: node02 error polling HP CQ with status REMOTE ACCESS ERROR 
> status number 10 for wr_id 268969528 opcode 128
> [0,1,28][btl_openib_component.c:1153:btl_openib_component_progress] from 
> node09 to: node02 error polling HP CQ with status REMOTE ACCESS ERROR 
> status number 10 for wr_id 268906808 opcode 128
> [0,1,58][btl_openib_component.c:1153:btl_openib_component_progress] from 
> node16 to: node02 error polling HP CQ with status REMOTE ACCESS ERROR 
> status number 10 for wr_id 268919352 opcode 256614836
> [0,1,0][btl_openib_component.c:1153:btl_openib_component_progress] from 
> node02 to: node03 error polling HP CQ with status WORK REQUEST FLUSHED 
> ERROR status number 5 for wr_id 276070200 opcode 0
> [0,1,59][btl_openib_component.c:1153:btl_openib_component_progress] from 
> node16 to: node02 error polling HP CQ with status REMOTE ACCESS ERROR 
> status number 10 for wr_id 268919352 opcode 256614836
> mpirun noticed that job rank 0 with PID 0 on node node02 exited on 
> signal 15 (Terminated).
> 55 additional processes aborted (not shown)
Does this happen with btl_openib_flags=1? Does this also happen without
this setting? This doesn't happen with OpenMPI-1.2b3, right?


--
Gleb.


Re: [OMPI users] IB bandwidth vs. kernels

2007-01-18 Thread Gleb Natapov
On Thu, Jan 18, 2007 at 07:17:13AM -0500, Robin Humble wrote:
> On Thu, Jan 18, 2007 at 11:08:04AM +0200, Gleb Natapov wrote:
> >On Thu, Jan 18, 2007 at 03:52:19AM -0500, Robin Humble wrote:
> >> On Wed, Jan 17, 2007 at 08:55:31AM -0700, Brian W. Barrett wrote:
> >> >On Jan 17, 2007, at 2:39 AM, Gleb Natapov wrote:
> >> >> On Wed, Jan 17, 2007 at 04:12:10AM -0500, Robin Humble wrote:
> >> >>> basically I'm seeing wildly different bandwidths over InfiniBand 4x DDR
> >> >>> when I use different kernels.
> >> >> Try to load ib_mthca with tune_pci=1 option on those kernels that are
> >> >> slow.
> >> >when an application has high buffer reuse (like NetPIPE), which can  
> >> >be enabled by adding "-mca mpi_leave_pinned 1" to the mpirun command  
> >> >line.
> >> thanks! :-)
> >> tune_pci=1 makes a huge difference at the top end, and
> >Well this is broken BIOS then. Look here for more explanation:
> >https://staging.openfabrics.org/svn/openib/gen2/branches/1.1/ofed/docs/mthca_release_notes.txt
> >search for "tune_pci=1".
> 
> ok. thanks :-/
> 
> >> -mca mpi_leave_pinned 1 adds lots of midrange bandwidth.
> >> 
> >> latencies (~4us) and the low end performance are all unchanged.
> >> 
> >> see attached for details.
> >> most curves are for 2.6.19.2 except the last couple (tagged as old)
> >> which are for 2.6.9-42.0.3.ELsmp and for which tune_pci changes nothing.
> >> 
> >> why isn't tune_pci=1 the default I wonder?
> >> files in /sys/module/ib_mthca/ tell me it's off by default in
> >> 2.6.9-42.0.3.ELsmp, but the results imply that it's on... maybe PCIe
> >> handling is very different in that kernel.
> >This is explained in the link above.
> 
> hmmm...
> but (sorry to harp on about this) /sys/module/ib_mthca/tune_pci is 0
> for 2.6.9-42.0.3.ELsmp.
> and even if that's lying, then mthca_tune_pci() appears identically
> invoked in mthca_main.c from both 2.6.9-42.0.3.ELsmp and 2.6.19.2.
> mthca_main.c is the only place in infiniband/hw/mthca that
> pci_write_config_word() is called from, so you'd think that's got to be
> how PCIe for IB was setup.
I really don't know the details and I don't have the sources of the older module
to check, but in the latest kernel sources the tune_pci parameter is checked
inside mthca_tune_pci(). If you want to know more details you can ask on the
openib mailing list.

> 
> basically it's not clear to me how or if tune_pci is being set in
> 2.6.9-42.0.3.ELsmp, nor why it's any different to 2.6.19.2 :-/
> 
> maybe it's some other level in the kernel setting up PCIe differently?
> but that would presumably be unrelated to OFED.
The BIOS should configure MaxReadReq to the maximum value supported by the chipset.
Linux shouldn't touch this value at all.

> 
> is there a way to check pci burst settings from userland? or BIOS?
You can see the PCI settings with lspci. The newest lspci decodes this value for
you; with older ones you'll have to dump the PCI config space to a file
and decode it yourself.
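With a reasonably new pciutils, for example (15b3 is the Mellanox PCI
vendor ID):

    lspci -d 15b3: -vvv | grep -i maxreadreq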

> 
> BTW, the card appears to be Voltaire and system is SGI xe (210 and 240)
> if that helps. /sys/class/infiniband/mthca0/board_id is VLT0050010001
> not that I'm blaming anyone! :-)
The hardware and firmware are produced by Mellanox :)

--
Gleb.


Re: [OMPI users] IB bandwidth vs. kernels

2007-01-18 Thread Gleb Natapov
On Thu, Jan 18, 2007 at 03:52:19AM -0500, Robin Humble wrote:
> On Wed, Jan 17, 2007 at 08:55:31AM -0700, Brian W. Barrett wrote:
> >On Jan 17, 2007, at 2:39 AM, Gleb Natapov wrote:
> >> On Wed, Jan 17, 2007 at 04:12:10AM -0500, Robin Humble wrote:
> >>> basically I'm seeing wildly different bandwidths over InfiniBand 4x DDR
> >>> when I use different kernels.
> >> Try to load ib_mthca with tune_pci=1 option on those kernels that are
> >> slow.
> >when an application has high buffer reuse (like NetPIPE), which can  
> >be enabled by adding "-mca mpi_leave_pinned 1" to the mpirun command  
> >line.
> 
> thanks! :-)
> tune_pci=1 makes a huge difference at the top end, and
Well, this is a broken BIOS then. Look here for more explanation:
https://staging.openfabrics.org/svn/openib/gen2/branches/1.1/ofed/docs/mthca_release_notes.txt
search for "tune_pci=1".

> -mca mpi_leave_pinned 1 adds lots of midrange bandwidth.
> 
> latencies (~4us) and the low end performance are all unchanged.
> 
> see attached for details.
> most curves are for 2.6.19.2 except the last couple (tagged as old)
> which are for 2.6.9-42.0.3.ELsmp and for which tune_pci changes nothing.
> 
> why isn't tune_pci=1 the default I wonder?
> files in /sys/module/ib_mthca/ tell me it's off by default in
> 2.6.9-42.0.3.ELsmp, but the results imply that it's on... maybe PCIe
> handling is very different in that kernel.
This is explained in the link above.

> 
> is ~10Gbit the best I can expect from 4x DDR IB with MPI?
> some docs @HP suggest up to 16Gbit (data rate) should be possible, and
> I've heard that 13 or 14 has been achieved before. but those might be
> verbs numbers, or maybe horsepower >> 4 cores of 2.66GHz core2 is
> required?
> 
> >It would be interesting to know if the bandwidth differences appear  
> >when the leave pinned protocol is used.  My guess is that they will  
> 
> yeah, it definitely makes a difference in the 10kB to 10mB range.
> at around 100kB there's 2x the bandwidth when using pinned.
> 
> thanks again!
> 
> >   Brian Barrett
> >   Open MPI Team, CCS-1
> >   Los Alamos National Laboratory
> 
> how's OpenMPI on Cell? :)
> 
> cheers,
> robin

--
Gleb.


Re: [OMPI users] IB bandwidth vs. kernels

2007-01-17 Thread Gleb Natapov
Hi Robin,

On Wed, Jan 17, 2007 at 04:12:10AM -0500, Robin Humble wrote:
> 
> so this isn't really an OpenMPI questions (I don't think), but you guys
> will have hit the problem if anyone has...
> 
> basically I'm seeing wildly different bandwidths over InfiniBand 4x DDR
> when I use different kernels.
> I'm testing with netpipe-3.6.2's NPmpi, but a home-grown pingpong sees
> the same thing.
> 
> the default 2.6.9-42.0.3.ELsmp (and also sles10's kernel) gives ok
> bandwidth (50% of peak I guess is good?) at ~10 Gbit/s, but a pile of
> newer kernels (2.16.19.2, 2.6.20-rc4, 2.6.18-1.2732.4.2.el5.OFED_1_1(*))
> all max out at ~5.3 Gbit/s.
> 
> half the bandwidth! :-(
> latency is the same.
Try to load ib_mthca with the tune_pci=1 option on those kernels that are
slow.
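For example, either reload the driver by hand (after stopping whatever is
using the HCA) or make it persistent in the module configuration:

    modprobe -r ib_mthca && modprobe ib_mthca tune_pci=1

    # or, persistently, in /etc/modprobe.conf:
    options ib_mthca tune_pci=1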

> 
> the same OpenMPI (1.1.1 from OSCAR, rebuild for openib support) and
> NPmpi was used with all kernels.
> I see an intermediate bandwidth if one kernel is the 'fast' 2.6.9 and
> another is a 'slow', so they don't appear to be using completely
> different protocols.
> it doesn't make any difference if I try to make extra-sure it's using
> openib with:
>   mpirun --mca btl openib --mca btl_tcp_if_exclude lo,eth0 ...
> 
> OS is CentOS 4.4 x86_64 which AFAICT includes packages based on OFED 1.0.
> lspci says the PCIe card is:
>   InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20)
> and dmesg says that all kernels are using
>   ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006)
> but also winges that 'HCA FW version 1.0.700 is old'.
> 
> any ideas?
> very odd that all new kernels (including for RHEL5) are slow.
> 
> will OFED 1.1 make any difference? it didn't build cleanly when I
> tried, but I can try and try again...
> 
> thanks for any hints.
> 
> cheers,
> robin
> 
> (*) rhel5 + OFED 1.1 test kernel, rebuilt for centos4.4 from src.rpm at
>   
> http://people.redhat.com/dledford/Infiniband/kernel/2.6.18/1.2732.4.2.el5.OFED_1_1/x86_64/

--
Gleb.


Re: [OMPI users] mpool_gm_module error

2006-12-12 Thread Gleb Natapov
On Tue, Dec 12, 2006 at 12:58:00PM -0800, Reese Faucette wrote:
> > Well I have no luck in finding a way to up the amount the system will
> > allow GM to use.  What is a recommended solution? Is this even a
> > problem in most cases?  Like am i encountering a corner case?
> 
> upping the limit was not what i'm suggesting as a fix, just pointing out 
> that it is kind of low and even with a fully working ompi or mpich-gm.  ompi 
> should still work, even if the IOMMU limit is low.
> 
> Since you are running 1 thread per CPU (== 2 total), it is possible (likely) 
> that the 1st thread is grabbing all the available registerable memory, 
> leaving not even enough for the second thread to even start.  I recommend 
> you try the "mpool_rdma_rcache_size_limit" that Gleb mentions - the 
> equivalent setting is used in MPICH-GM in similar situations.  Set this to 
> about 180 MB and run with that.
> 
> Gleb - I assume that when registration needs exceed 
> "mpool_rdma_rcache_size_limit", that previously registered memory is 
> unregistered much as virtual memory is swapped out?
> 
If the previously registered memory is in use then registration returns an
error to the upper layer and the operation is retried later. Otherwise, unused
memory is unregistered. The code for mpool_rdma_rcache_size_limit is not on
the trunk yet. It is on the tmp branch /tmp/gleb-mpool; I don't know if /tmp is
open to everyone. If not, I can send the patch.

--
Gleb.


Re: [OMPI users] mpool_gm_module error

2006-12-11 Thread Gleb Natapov
On Mon, Dec 11, 2006 at 02:52:40PM -0500, Brock Palen wrote:
> On Dec 11, 2006, at 2:45 PM, Reese Faucette wrote:
> 
> >> Also I have no idea what the memory window question is, i will
> >> look it up on google.
> >>
> >> aon075:~ root# dmesg | grep GM
> >> GM: gm_register_memory will be able to lock 96000 pages (375 MBytes)
> >
> > This just answered it - there is 375MB available for GM to  
> > register, which
> > is the IOMMU window size available to the GM driver.  This is quite  
> > small,
> > and it would be very helpful if this could be increased.  helpful  
> > == much
> > better performance for your jobs.
> Always a plus
> 
> >
> > It's possible that OMPI is not managing registered space well, as is
> > required when the aggregate registered memory needed by a job is  
> > larger than
> > the memory available to be registered.  I do some research locally  
> > into
> > OMPIs management of registered memory when using GM.  If it comes  
> > to it,
> > would you be willing to run an OMPI with some debug statements in  
> > it for me?
> > thanks,
> Sure i can run them.
I added to OMPI the possibility to limit the amount of registered memory.
There is a new parameter, mpool_rdma_rcache_size_limit, that controls how
much memory can be pinned at once. But this work is not yet on the trunk.
If you can check out the /tmp/gleb-mpool branch and test it, that would be
great.
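Once built from that branch, the limit is passed like any other MCA parameter,
e.g. about 180 MB as Reese suggested (value in bytes; the program name is just
a placeholder):

    mpirun --mca mpool_rdma_rcache_size_limit 188743680 -np 2 ./your_app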

> 
> > -reese
> >
> >
> >
> >
> 

--
Gleb.


Re: [OMPI users] multiple LIDs

2006-12-06 Thread Gleb Natapov
On Wed, Dec 06, 2006 at 12:14:35PM +0530, Chevchenkovic Chevchenkovic wrote:
> Hi,
>   Actually I was wondering why there is a facility for having multiple
> LIDs for the same port. This led me to the entire series of questions.
>It is still not very clear to, as to what is the advantage of
> assigning multiple LIDs to the same port. Does it give some
> performance advantages?
Each LID has its own path through the fabric (ideally); this is a way to
lower congestion.

> -Chev
> 
> 
> On 12/5/06, Jeff Squyres <jsquy...@cisco.com> wrote:
> > There are two distinct layers of software being discussed here:
> >
> > - the PML (basically the back-end to MPI_SEND and friends)
> > - the BTL (byte transfer layer, the back-end bit movers for the ob1
> > and dr PMLs -- this distinction is important because there is nothing
> > in the PML design that forces the use of BTL's; indeed, there is at
> > least one current PML that does not use BTL's as the back-end bit
> > mover [the cm PML])
> >
> > The ob1 and dr PMLs know nothing about how the back-end bitmovers
> > work (BTL components) -- the BTLs are given considerable freedom to
> > operate within their specific interface contracts.
> >
> > Generally, ob1/dr queries each BTL component when Open MPI starts
> > up.  Each BTL responds with whether it wants to run or not.  If it
> > does, it gives back the one or more modules (think of a module as an
> > "instance" of a component).  Typically, these modules correspond to
> > multiple NICs / HCAs / network endpoints.  For example, if you have 2
> > ethernet cards, the tcp BTL will create and return 2 modules.  ob1 /
> > dr will treat these as two paths to send data (reachability is
> > computed as well, of course -- ob1/dr will only send data down btls
> > for which the target peer is reachable).  In general, ob1/dr will
> > round-robin across all available BTL modules when sending large
> > messages (as Gleb has described).  See http://www.open-mpi.org/papers/
> > euro-pvmmpi-2006-hpc-protocols/ for a general description of the ob1/
> > dr protocols.
> >
> > The openib BTL can return multiple modules if multiple LIDs are
> > available.  So the ob1/dr doesn't know that these are not physical
> > devices -- it just treats each module as an equivalent mechanism to
> > send data.
> >
> > This is actually somewhat lame as a scheme, and we talked internally
> > about doing something more intelligent.  But we decided to hold off
> > and let people (like you!) with real-world apps and networks give
> > this stuff a try and see what really works (and what doesn't work)
> > before trying to implement anything else.
> >
> > So -- all that explanation aside -- we'd love to hear your feedback
> > with regards to the multi-LID stuff in Open MPI.  :-)
> >
> >
> >
> > On Dec 4, 2006, at 1:27 PM, Chevchenkovic Chevchenkovic wrote:
> >
> > >  Thanks for that.
> > >
> > >  Suppose,  if there there are multiple interconnects, say ethernet and
> > > infiniband  and a million byte of data is to be sent, then in this
> > > case the data will be sent through infiniband (since its a fast path
> > > .. please correct me here if i m wrong).
> > >
> > >   If there are mulitple such sends, do you mean to say that each send
> > > will go  through  different BTLs in a RR manner if they are connected
> > > to the same port?
> > >
> > >  -chev
> > >
> > >
> > > On 12/4/06, Gleb Natapov <gl...@voltaire.com> wrote:
> > >> On Mon, Dec 04, 2006 at 10:53:26PM +0530, Chevchenkovic
> > >> Chevchenkovic wrote:
> > >>> Hi,
> > >>>  It is not clear from the code as mentioned by you from
> > >>> ompi/mca/pml/ob1/  where exactly the selection of BTL bound to a
> > >>> particular LID occurs. Could you please specify the file/function
> > >>> name
> > >>> for the same?
> > >> There is no such code there. OB1 knows nothing about LIDs. It does RR
> > >> over all available interconnects. It can do RR between ethernet, IB
> > >> and Myrinet for instance. BTL presents each LID as different
> > >> virtual HCA
> > >> to OB1 and it does round-robin between them without event knowing
> > >> this
> > >> is the same port of the same HCA.
> > >>
> > >> Can you explain what are you trying to achieve?
> > >>
> > >>>  -chev
> > >>>
> > >>>
> > 

Re: [OMPI users] multiple LIDs

2006-12-04 Thread Gleb Natapov
On Mon, Dec 04, 2006 at 11:57:07PM +0530, Chevchenkovic Chevchenkovic wrote:
>  Thanks for that.
> 
>  Suppose,  if there there are multiple interconnects, say ethernet and
> infiniband  and a million byte of data is to be sent, then in this
> case the data will be sent through infiniband (since its a fast path
> .. please correct me here if i m wrong).
With the default parameters, yes. But you can tweak Open MPI to split a
message between interconnects.
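
The knobs for that are the per-BTL bandwidth/latency MCA parameters,
which OB1 uses when it schedules a large message across the available
BTLs. A sketch (the numbers are made up; run "ompi_info --param btl all"
to see what your build exposes):

  mpirun --mca btl openib,tcp,self \
         --mca btl_openib_bandwidth 8000 --mca btl_tcp_bandwidth 1000 \
         -np 2 ./your_app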

> 
>   If there are mulitple such sends, do you mean to say that each send
> will go  through  different BTLs in a RR manner if they are connected
> to the same port?
One message can be split between multiple BTLs.

> 
>  -chev
> 
> 
> On 12/4/06, Gleb Natapov <gl...@voltaire.com> wrote:
> > On Mon, Dec 04, 2006 at 10:53:26PM +0530, Chevchenkovic Chevchenkovic wrote:
> > > Hi,
> > >  It is not clear from the code as mentioned by you from
> > > ompi/mca/pml/ob1/  where exactly the selection of BTL bound to a
> > > particular LID occurs. Could you please specify the file/function name
> > > for the same?
> > There is no such code there. OB1 knows nothing about LIDs. It does RR
> > over all available interconnects. It can do RR between ethernet, IB
> > and Myrinet for instance. BTL presents each LID as different virtual HCA
> > to OB1 and it does round-robin between them without event knowing this
> > is the same port of the same HCA.
> >
> > Can you explain what are you trying to achieve?
> >
> > >  -chev
> > >
> > >
> > > On 12/4/06, Gleb Natapov <gl...@voltaire.com> wrote:
> > > > On Mon, Dec 04, 2006 at 01:07:08AM +0530, Chevchenkovic Chevchenkovic 
> > > > wrote:
> > > > > Also could you please tell me which part of the openMPI code needs to
> > > > > be touched so that I can do some modifications in it to incorporate
> > > > > changes regarding LID selection...
> > > > >
> > > > It depend what do you want to do. The part that does RR over all
> > > > available LIDs is in OB1 PML (ompi/mca/pml/ob1/), but the code doesn't
> > > > aware of the fact that it is doing RR over different LIDs and not
> > > > different NICs (yet?).
> > > >
> > > > The code that controls what LIDs will be used is in
> > > > ompi/mca/btl/openib/btl_openib_component.c.
> > > >
> > > > > On 12/4/06, Chevchenkovic Chevchenkovic <chevchenko...@gmail.com> 
> > > > > wrote:
> > > > > > Is it possible to control the LID where the send and recvs are
> > > > > > posted.. on either ends?
> > > > > >
> > > > > > On 12/3/06, Gleb Natapov <gl...@voltaire.com> wrote:
> > > > > > > On Sun, Dec 03, 2006 at 07:03:33PM +0530, Chevchenkovic 
> > > > > > > Chevchenkovic
> > > > > > wrote:
> > > > > > > > Hi,
> > > > > > > >  I had this query. I hope some expert replies to it.
> > > > > > > > I have 2 nodes connected point-to-point using infiniband cable. 
> > > > > > > > There
> > > > > > > > are multiple LIDs for each of the end node ports.
> > > > > > > >When I give an MPI_Send, Are the sends are posted on 
> > > > > > > > different LIDs
> > > > > > > > on each of the end nodes OR they are they posted on the same 
> > > > > > > > LID?
> > > > > > > >  Awaiting your reply,
> > > > > > > It depend what version of Open MPI your are using. If you are 
> > > > > > > using
> > > > > > > trunk or v1.2 beta then all available LIDs are used in RR 
> > > > > > > fashion. The
> > > > > > early
> > > > > > > versions don't support LMC.
> > > > > > >
> > > > > > > --
> > > > > > >   Gleb.
> > > > > > > ___
> > > > > > > users mailing list
> > > > > > > us...@open-mpi.org
> > > > > > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > > > > > >
> > > > > >
> > > > > ___
> > > > > users mailing list
> > > > > us...@open-mpi.org
> > > > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > > >
> > > > --
> > > >Gleb.
> > > > ___
> > > > users mailing list
> > > > us...@open-mpi.org
> > > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > > >
> > > ___
> > > users mailing list
> > > us...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> > --
> >Gleb.
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Gleb.


Re: [OMPI users] multiple LIDs

2006-12-04 Thread Gleb Natapov
On Mon, Dec 04, 2006 at 01:07:08AM +0530, Chevchenkovic Chevchenkovic wrote:
> Also could you please tell me which part of the openMPI code needs to
> be touched so that I can do some modifications in it to incorporate
> changes regarding LID selection...
> 
It depends on what you want to do. The part that does RR over all
available LIDs is in the OB1 PML (ompi/mca/pml/ob1/), but the code isn't
aware of the fact that it is doing RR over different LIDs rather than
different NICs (yet?).

The code that controls which LIDs will be used is in
ompi/mca/btl/openib/btl_openib_component.c.

> On 12/4/06, Chevchenkovic Chevchenkovic <chevchenko...@gmail.com> wrote:
> > Is it possible to control the LID where the send and recvs are
> > posted.. on either ends?
> >
> > On 12/3/06, Gleb Natapov <gl...@voltaire.com> wrote:
> > > On Sun, Dec 03, 2006 at 07:03:33PM +0530, Chevchenkovic Chevchenkovic
> > wrote:
> > > > Hi,
> > > >  I had this query. I hope some expert replies to it.
> > > > I have 2 nodes connected point-to-point using infiniband cable. There
> > > > are multiple LIDs for each of the end node ports.
> > > >When I give an MPI_Send, Are the sends are posted on different LIDs
> > > > on each of the end nodes OR they are they posted on the same LID?
> > > >  Awaiting your reply,
> > > It depend what version of Open MPI your are using. If you are using
> > > trunk or v1.2 beta then all available LIDs are used in RR fashion. The
> > early
> > > versions don't support LMC.
> > >
> > > --
> > >   Gleb.
> > > ___
> > > users mailing list
> > > us...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > >
> >
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Gleb.


Re: [OMPI users] multiple LIDs

2006-12-04 Thread Gleb Natapov
On Mon, Dec 04, 2006 at 01:02:48AM +0530, Chevchenkovic Chevchenkovic wrote:
> Is it possible to control the LID where the send and recvs are
> posted.. on either ends?
No, but you can control how many of the available LIDs will be used.
This can be configured with the "btl_openib_max_lmc" parameter.
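
For example, if the subnet manager runs with LMC=2 (4 LIDs per port) but
you only want Open MPI to use two of them, something like the following
should work (check "ompi_info --param btl openib" for the parameter on
your version; IIRC the default of 0 means "use all available LIDs"):

  mpirun --mca btl openib,self --mca btl_openib_max_lmc 2 -np 2 ./your_app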

> 
> On 12/3/06, Gleb Natapov <gl...@voltaire.com> wrote:
> > On Sun, Dec 03, 2006 at 07:03:33PM +0530, Chevchenkovic Chevchenkovic wrote:
> > > Hi,
> > >  I had this query. I hope some expert replies to it.
> > > I have 2 nodes connected point-to-point using infiniband cable. There
> > > are multiple LIDs for each of the end node ports.
> > >When I give an MPI_Send, Are the sends are posted on different LIDs
> > > on each of the end nodes OR they are they posted on the same LID?
> > >  Awaiting your reply,
> > It depend what version of Open MPI your are using. If you are using
> > trunk or v1.2 beta then all available LIDs are used in RR fashion. The early
> > versions don't support LMC.
> >
> > --
> > Gleb.
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Gleb.


Re: [OMPI users] multiple LIDs

2006-12-03 Thread Gleb Natapov
On Sun, Dec 03, 2006 at 07:03:33PM +0530, Chevchenkovic Chevchenkovic wrote:
> Hi,
>  I had this query. I hope some expert replies to it.
> I have 2 nodes connected point-to-point using infiniband cable. There
> are multiple LIDs for each of the end node ports.
>When I give an MPI_Send, Are the sends are posted on different LIDs
> on each of the end nodes OR they are they posted on the same LID?
>  Awaiting your reply,
It depends on what version of Open MPI you are using. If you are using
the trunk or the v1.2 beta, then all available LIDs are used in RR
fashion. The earlier versions don't support LMC.

--
Gleb.


Re: [OMPI users] How to set paffinity on a multi-cpu node?

2006-12-01 Thread Gleb Natapov
On Fri, Dec 01, 2006 at 09:35:09AM -0500, Brock Palen wrote:
> On Dec 1, 2006, at 9:23 AM, Gleb Natapov wrote:
> 
> > On Fri, Dec 01, 2006 at 04:14:31PM +0200, Gleb Natapov wrote:
> >> On Fri, Dec 01, 2006 at 11:51:24AM +0100, Peter Kjellstrom wrote:
> >>> On Saturday 25 November 2006 15:31, shap...@isp.nsc.ru wrote:
> >>>> Hello,
> >>>> i cant figure out, is there a way with open-mpi to bind all
> >>>> threads on a given node to a specified subset of CPUs.
> >>>> For example, on a multi-socket multi-core machine, i want to use
> >>>> only a single core on each CPU.
> >>>> Thank You.
> >>>
> >>> This might be a bit naive but, if you spawn two procs on a dual  
> >>> core dual
> >>> socket system then the linux kernel should automagically schedule  
> >>> them this
> >>> way.
> >>>
> >>> I actually think this applies to most of the situations discussed  
> >>> in this
> >>> thread. Explicitly assigning processes to cores may actually get  
> >>> it wrong
> >>> more often than the normal linux scheduler.
> >>>
> >> If you run two single threaded ranks on the dual core dual socket  
> >> node
> >> you better be placing them on the same core. Shared memory  
> >> communication
> Isn't this only valid with NUMA systems?  (large systems or AMD  
> Opteron)  The intel multicores each must communicate along the bus to  
> the north-bridge and back again.  So all cores have the same path to  
> memory.  Correct me if im wrong.  Though working on this would be  
> good, i dont expect all systems to stick with bus, and more and more  
> will be NUMA in the future.
AFAIK the Core 2 Duo has a shared L2 cache, so shared-memory communication
should be much faster if the two ranks are on the same socket. But I don't
have such a setup to test the theory.
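
If someone wants to test this, one way is to pin the ranks explicitly.
Open MPI 1.2 has the mpi_paffinity_alone parameter for plain
one-rank-per-processor binding; forcing a specific core subset (as the
original question asked) still needs an external wrapper. A sketch, with
core numbering that is purely illustrative (how cores map to sockets
depends on the BIOS/kernel enumeration):

  # let Open MPI bind each rank to a processor:
  mpirun --mca mpi_paffinity_alone 1 -np 2 ./your_app

  # or restrict ranks to a specific core subset via a wrapper script:
  #   #!/bin/sh
  #   exec taskset -c 0,2 "$@"
  mpirun -np 2 ./bind.sh ./your_app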

> 
> On another note for systems that use pbs (and maybe other resource  
> managers)  It gives out the cpus in the hostlist  (hostname/0  
> hostname/1 etc)   Why cant OMPI read that info if its available?
> 
> Im prob totally off on these comments.
> 
> Brock
> 
> > I mean "same socket" here and not "same core" of cause.
> >
> >> will be much faster (especially if two cores shares cache).
> >>
> >>> /Peter (who may be putting a bit too much faith in the linux  
> >>> scheduler...)
> >> Linux scheduler does its best assuming the processes are  
> >> unrelated. This is
> >> not the case with MPI ranks.
> >>
> >> --
> >>Gleb.
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> > --
> > Gleb.
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Gleb.


Re: [OMPI users] How to set paffinity on a multi-cpu node?

2006-12-01 Thread Gleb Natapov
On Fri, Dec 01, 2006 at 04:14:31PM +0200, Gleb Natapov wrote:
> On Fri, Dec 01, 2006 at 11:51:24AM +0100, Peter Kjellstrom wrote:
> > On Saturday 25 November 2006 15:31, shap...@isp.nsc.ru wrote:
> > > Hello,
> > > i cant figure out, is there a way with open-mpi to bind all
> > > threads on a given node to a specified subset of CPUs.
> > > For example, on a multi-socket multi-core machine, i want to use
> > > only a single core on each CPU.
> > > Thank You.
> > 
> > This might be a bit naive but, if you spawn two procs on a dual core dual 
> > socket system then the linux kernel should automagically schedule them this 
> > way.
> > 
> > I actually think this applies to most of the situations discussed in this 
> > thread. Explicitly assigning processes to cores may actually get it wrong 
> > more often than the normal linux scheduler.
> > 
> If you run two single threaded ranks on the dual core dual socket node
> you better be placing them on the same core. Shared memory communication
I mean "same socket" here and not "same core", of course.

> will be much faster (especially if two cores shares cache).
> 
> > /Peter (who may be putting a bit too much faith in the linux scheduler...)
> Linux scheduler does its best assuming the processes are unrelated. This is
> not the case with MPI ranks.
> 
> --
>   Gleb.
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Gleb.


Re: [OMPI users] How to set paffinity on a multi-cpu node?

2006-12-01 Thread Gleb Natapov
On Fri, Dec 01, 2006 at 11:51:24AM +0100, Peter Kjellstrom wrote:
> On Saturday 25 November 2006 15:31, shap...@isp.nsc.ru wrote:
> > Hello,
> > i cant figure out, is there a way with open-mpi to bind all
> > threads on a given node to a specified subset of CPUs.
> > For example, on a multi-socket multi-core machine, i want to use
> > only a single core on each CPU.
> > Thank You.
> 
> This might be a bit naive but, if you spawn two procs on a dual core dual 
> socket system then the linux kernel should automagically schedule them this 
> way.
> 
> I actually think this applies to most of the situations discussed in this 
> thread. Explicitly assigning processes to cores may actually get it wrong 
> more often than the normal linux scheduler.
> 
If you run two single-threaded ranks on a dual-core, dual-socket node,
you had better place them on the same core. Shared memory communication
will be much faster (especially if the two cores share a cache).

> /Peter (who may be putting a bit too much faith in the linux scheduler...)
The Linux scheduler does its best assuming the processes are unrelated.
That is not the case with MPI ranks.

--
Gleb.


Re: [OMPI users] dma using infiniband protocol

2006-11-02 Thread Gleb Natapov
On Thu, Nov 02, 2006 at 10:37:24AM -0800, Brian Budge wrote:
> Hi all -
> 
> I'm wondering how DMA is handled in OpenMPI when using the infiniband
> protocol.  In particular, will I get a speed gain if my read/write buffers
> are already pinned via mlock?
> 
No, you will not. mlock() has nothing to do with the memory registration
that is needed for RDMA. If you allocate your read/write buffers with
MPI_Alloc_mem(), that will help, because this function registers the
memory for you.
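
A minimal sketch of what that looks like (error handling omitted):

  #include <mpi.h>
  #include <string.h>

  int main(int argc, char **argv)
  {
      char *buf;
      MPI_Init(&argc, &argv);
      /* Allocate the communication buffer through MPI so that the
       * library can register it for RDMA up front. */
      MPI_Alloc_mem(1 << 20, MPI_INFO_NULL, &buf);
      memset(buf, 0, 1 << 20);
      /* ... use buf as the send/recv buffer as usual ... */
      MPI_Free_mem(buf);
      MPI_Finalize();
      return 0;
  }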

--
Gleb.


Re: [OMPI users] Error Polling HP CQ Status on PPC64 LInux with IB

2006-06-19 Thread Gleb Natapov
What version of Open MPI are you using?

On Mon, Jun 19, 2006 at 07:06:54AM -0700, Owen Stampflee wrote:
> I'm currently working on getting OpenMPI + OpenIB 1.0 (might be an RC)
> working on our 8 node Xserve G5 cluster running Linux kernel version
> 2.6.16 and get the following errors:
> 
> Process 1 on node-192-168-111-249
> Process 0 on node-192-168-111-248
> [0,1,1][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 1 for wr_id 270995584 opcode -1286736
> 
> [0,1,1][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 270995868 opcode -1286736
> 
> [0,1,1][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 270996152 opcode -1286736
> 
> [0,1,1][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 270996436 opcode -1286736
> 
> [0,1,1][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 270996720 opcode -1286736
> 
> [0,1,1][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 270997004 opcode -1286736
> 
> [0,1,1][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 270997288 opcode -1286736
> 
> [0,1,1][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 270997572 opcode -1286736
> 
> [0,1,1][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 271077504 opcode -1286736
> 
> [0,1,1][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 271077788 opcode -1286736
> 
> [0,1,1][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 271078072 opcode -1286736
> 
> [0,1,0][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 9 for wr_id 270991488 opcode -6639584
> 
> [0,1,0][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 270995584 opcode -6639584
> 
> [0,1,0][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 270995868 opcode -6639584
> 
> [0,1,0][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 270996152 opcode -6639584
> 
> [0,1,0][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 270996436 opcode -6639584
> 
> [0,1,0][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 270996720 opcode -6639584
> 
> [0,1,0][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 270997004 opcode -6639584
> 
> [0,1,0][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 270997288 opcode -6639584
> 
> [0,1,0][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 270997572 opcode -6639584
> 
> [0,1,0][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 271077504 opcode -6639584
> 
> [0,1,0][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 271077788 opcode -6639584
> 
> [0,1,0][btl_openib_component.c:587:mca_btl_openib_component_progress]
> error polling HP CQ with status 5 for wr_id 271078072 opcode -6639584
> 
> mpirun: killing job...
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Gleb.