Re: [OMPI users] Is Iprobe fast when there is no message to receive

2009-10-03 Thread Ashley Pittman
On Sat, 2009-10-03 at 07:05 -0400, Jeff Squyres wrote:
> That being said, if you just want to send a quick "notify" that an  
> event has occurred, you might want to use a specific tag and/or  
> communicator for these extraordinary messages.  Then, when the event  
> occurs, send a very short message on this special tag/communicator  
> (potentially even a 0-byte message).

> You can MPI_TEST for  
> the completion of this short/0-byte receive very quickly.  You can  
> then send the actual data of the event in a different non-blocking  
> receive that is only checked if the short "alert" message is received.

In general I would say that Iprobe is a bad thing to use; as Jeff says,
post a receive in advance and then call MPI_Test on that receive rather
than using Iprobe.

From your description it sounds like a zero-byte send is all you need,
which should be fast in all cases.
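
A minimal sketch in C of what that looks like (the tag value, peer rank,
and function name are illustrative, not from the original posts): pre-post
a zero-byte receive on a dedicated tag, poll it cheaply with MPI_Test
inside the work loop, and re-post it after each hit.

#include <mpi.h>

#define ALERT_TAG 42   /* illustrative tag reserved for notifications */

void work_loop(int peer, MPI_Comm comm)
{
    MPI_Request alert_req;
    int flag;

    /* Pre-post the zero-byte "notify" receive. */
    MPI_Irecv(NULL, 0, MPI_BYTE, peer, ALERT_TAG, comm, &alert_req);

    for (;;) {
        /* ... one unit of normal work ... */

        MPI_Test(&alert_req, &flag, MPI_STATUS_IGNORE);
        if (flag) {
            /* Notification arrived: handle the event, then re-post
               the receive so the next notification can be caught. */
            MPI_Irecv(NULL, 0, MPI_BYTE, peer, ALERT_TAG, comm, &alert_req);
        }
    }
}

On the sending side, the matching notification is just
MPI_Send(NULL, 0, MPI_BYTE, receiver, ALERT_TAG, comm), i.e. the
zero-byte send described above.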

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] use additional interface for openmpi

2009-10-03 Thread Jeff Squyres

On Sep 29, 2009, at 9:58 AM,   wrote:


> Open MPI should just "figure it out" and do the Right Thing at run-
> time -- is that not happening?
You are right, it should.
But I want to keep Open MPI communications separate from any other
traffic (NFS, traffic from other jobs, and so on), and use only a
dedicated Ethernet interface for this purpose.

I have Open MPI 1.3.3 installed in the same directory on all compute
nodes and on the head node.

The OS is the same across the whole cluster: Debian 5.0.
On the compute nodes I have two interfaces: eth0 for NFS and so on,
and eth1 for Open MPI.
On the head node I have 5 interfaces, among them eth0 for NFS and eth4
for Open MPI.
The networks are as follows:
1) head node eth0 + node eth0: 192.168.0.0/24
2) head node eth4 + node eth1: 192.168.1.0/24

So how can I configure Open MPI to use only network 2) for this
purpose?




Try using "--mca btl_tcp_if_exclude eth0 --mca oob_tcp_if_exclude
eth0".  This will tell all machines not to use eth0.  The only other
network interface available is then eth4 (on the head node) or eth1 (on
the compute nodes), so it should do the Right Thing.
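
For example, combining those flags with the test command the poster uses
later in this message (hostnames n10/n11 and the cpi binary are taken
from there; adjust as needed):

mpirun --mca btl_tcp_if_exclude eth0 --mca oob_tcp_if_exclude eth0 -np 2 -host n10,n11 cpi

Excluding by name works here even though the MPI interface is called
eth4 on the head node and eth1 on the compute nodes, because the only
interface that has to be named is the one common to all machines (eth0).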


Note that Open MPI has *two* TCP subsystems: the one used for MPI  
communications and the one used for "out of band" communications.  BTL  
is the MPI communication subsystem; "oob" is the Out of Band  
communications subsystem.



The other problem is this:
I try to run some examples, but unfortunately they do not work.
Maybe the network is not configured correctly.

I can only submit jobs from a host to that same host.
When I submit from the head node to other nodes, for example, it hangs
without any messages.
On the node where the job should run I can see that an orted daemon
has started.

(I use the default config files.)

Here are some examples:

mpirun -v --mca btl self,sm,tcp --mca btl_base_verbose 30 --mca btl_tcp_if_include eth0 -np 2 -host n10,n11 cpi

No output, no calculations, only an orted daemon on the nodes.

mpirun -v --mca btl self,sm,tcp --mca btl_base_verbose 30 -np 2 -host n10,n11 cpi

The same as above.

mpirun -v --mca btl self,sm,tcp --mca btl_base_verbose 30 -np 2 -host n00,n00 cpi

n00 is the head node - this one works and produces output.



It sounds like OMPI is getting confused by the non-uniform networks.
I have heard reports of OMPI not liking networks with different
interface names, but it's not immediately obvious why the interface
names would be relevant to OMPI's selection criteria (and not enough
details were available in the reports I've heard so far).


Try the *_if_exclude methods above and see if that works for you.  If  
not, let us know.


--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] job fails with "Signal: Bus error (7)"

2009-10-03 Thread Jeff Squyres
Bus error usually means that there was an invalid address passed as a  
pointer somewhere in the code -- it's not usually a communications  
error.


Without more information, it's rather difficult to speculate on what  
happened here.  Did you get corefiles?  If so, are there useful  
backtraces available?
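
If core dumps were enabled for the job (e.g., "ulimit -c unlimited" in
the shell that launches the processes), a backtrace can usually be
pulled out with gdb; the executable path and core file name below are
placeholders for whatever your job actually produced:

gdb /path/to/your_fortran_app core.25433
(gdb) bt

(25433 is one of the PIDs from the log below; the actual core file name
depends on your system's core_pattern setting.)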



On Oct 1, 2009, at 6:01 AM, Sangamesh B wrote:


Hi,

A Fortran application compiled with ifort 10.1 and Open MPI 1.3.1 on
CentOS 5.2 fails after running for 4 days with the following error
message:


[compute-0-7:25430] *** Process received signal ***

[compute-0-7:25433] *** Process received signal ***
[compute-0-7:25433] Signal: Bus error (7)
[compute-0-7:25433] Signal code:  (2)
[compute-0-7:25433] Failing at address: 0x4217b8
[compute-0-7:25431] *** Process received signal ***

[compute-0-7:25431] Signal: Bus error (7)
[compute-0-7:25431] Signal code:  (2)
[compute-0-7:25431] Failing at address: 0x4217b8
[compute-0-7:25432] *** Process received signal ***
[compute-0-7:25432] Signal: Bus error (7)

[compute-0-7:25432] Signal code:  (2)
[compute-0-7:25432] Failing at address: 0x4217b8
[compute-0-7:25430] Signal: Bus error (7)
[compute-0-7:25430] Signal code:  (2)
[compute-0-7:25430] Failing at address: 0x4217b8

[compute-0-7:25431] *** Process received signal ***
[compute-0-7:25431] Signal: Segmentation fault (11)
[compute-0-7:25431] Signal code:  (128)
[compute-0-7:25431] Failing at address: (nil)
[compute-0-7:25430] *** Process received signal ***

[compute-0-7:25433] *** Process received signal ***
[compute-0-7:25433] Signal: Segmentation fault (11)
[compute-0-7:25433] Signal code:  (128)
[compute-0-7:25433] Failing at address: (nil)
[compute-0-7:25432] *** Process received signal ***

[compute-0-7:25432] Signal: Segmentation fault (11)
[compute-0-7:25432] Signal code:  (128)
[compute-0-7:25432] Failing at address: (nil)
[compute-0-7:25430] Signal: Segmentation fault (11)
[compute-0-7:25430] Signal code:  (128)

[compute-0-7:25430] Failing at address: (nil)
--
mpirun noticed that process rank 3 with PID 25433 on node  
compute-0-7.local exited on signal 11 (Segmentation fault).




--
This job is run with 4 Open MPI processes, on nodes interconnected
with Gigabit Ethernet.

The same job runs well on nodes with InfiniBand connectivity.

What could be the reason for this?  Is it due to a loose physical
connection, since it is giving a bus error?




--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] MPI_Comm_accept()/connect() errors

2009-10-03 Thread Jeff Squyres

On Oct 1, 2009, at 7:00 AM, Blesson Varghese wrote:

The following is the information regarding the error. I am running  
Open MPI 1.2.5 on Ubuntu 4.2.4, kernel version 2.6.24


Is there any chance that you can upgrade to the Open MPI v1.3 series?

--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] Are there ways to reduce the memory used by OpenMPI?

2009-10-03 Thread Jeff Squyres

On Oct 1, 2009, at 2:56 PM, Blosch, Edwin L wrote:

Are there any tuning parameters that I can use to reduce the amount
of memory used by Open MPI?  I would very much like to use Open MPI
instead of MVAPICH, but I’m on a cluster where memory usage is the
most important consideration.  Here are three results which capture
the problem:


With the “leave_pinned” behavior turned on, I get good performance  
(19.528, lower is better)


mpirun --prefix /usr/mpi/intel/openmpi-1.2.8 --machinefile


FWIW, there have been a lot of improvements in Open MPI since the 1.2  
series (including some memory reduction work) -- is it possible for  
you to upgrade to the latest 1.3 release?


/var/spool/torque/aux/7972.fwnaeglingio -np 28 --mca btl ^tcp --mca mpi_leave_pinned 1 --mca mpool_base_use_mem_hooks 1 -x LD_LIBRARY_PATH -x MPI_ENVIRONMENT=1 /tmp/7972.fwnaeglingio/falconv4_ibm_openmpi -cycles 100 -ri restart.0 -ro /tmp/7972.fwnaeglingio/restart.0

Compute rate (processor-microseconds/cell/cycle):   19.528
Total memory usage:38155.3477 MB (38.1553 GB)


Turning off the leave_pinned behavior, I get considerably slower  
performance (28.788), but the memory usage is unchanged (still 38 GB)


mpirun --prefix /usr/mpi/intel/openmpi-1.2.8 --machinefile /var/spool/torque/aux/7972.fwnaeglingio -np 28 -x LD_LIBRARY_PATH -x MPI_ENVIRONMENT=1 /tmp/7972.fwnaeglingio/falconv4_ibm_openmpi -cycles 100 -ri restart.0 -ro /tmp/7972.fwnaeglingio/restart.0

Compute rate (processor-microseconds/cell/cycle):   28.788
Total memory usage:38335.7656 MB (38.3358 GB)


I would guess that you are continually re-using the same communication  
buffers -- doing so will definitely be better with  
mpi_leave_pinned=1.  Note, too, that mpi_leave_pinned is on by default  
for OpenFabrics networks in the Open MPI 1.3 series.


Using MVAPICH, the performance is in the middle (23.6), but the  
memory usage is reduced by 5 to 6 GB out of 38 GB, a significant  
decrease to me.


/usr/mpi/intel/mvapich-1.1.0/bin/mpirun_rsh -ssh -np 28 -hostfile /var/spool/torque/aux/7972.fwnaeglingio LD_LIBRARY_PATH="/usr/mpi/intel/mvapich-1.1.0/lib/shared:/usr/mpi/intel/openmpi-1.2.8/lib64:/appserv/intel/fce/10.1.008/lib:/appserv/intel/cce/10.1.008/lib" MPI_ENVIRONMENT=1 /tmp/7972.fwnaeglingio/falconv4_ibm_mvapich -cycles 100 -ri restart.0 -ro /tmp/7972.fwnaeglingio/restart.0

Compute rate (processor-microseconds/cell/cycle):   23.608
Total memory usage:32753.0586 MB (32.7531 GB)


I didn’t see anything in the FAQ that discusses memory usage other
than the impact of the “leave_pinned” option, which apparently does
not affect the memory usage in my case.  But I figure there must be
a reason why Open MPI would use 6 GB more than MVAPICH on the same
case.



Try the 1.3 series; we do have a bunch of knobs in there for memory  
usage -- there were significant changes/advancements in the 1.3 series  
with regards to how OpenFabrics buffers are registered.  Get a  
baseline on that memory usage, and then let's see what you want to do  
from there.


--
Jeff Squyres
jsquy...@cisco.com




Re: [OMPI users] Is Iprobe fast when there is no message to receive

2009-10-03 Thread Jeff Squyres
Keep in mind that MPI says you do have to eventually receive the  
message -- so just checking if it's there is not enough (eventually).   
Iprobe is definitely one way.  You could also post a non-blocking  
receive (persistent or not) and MPI_TEST to see if it has completed.


However, if the message is long, MPI implementations like Open MPI  
*may* require multiple invocations of the progression engine to  
actually receive the entire message (e.g., it may get fragmented by  
the sender and use a rendezvous protocol, therefore having multiple  
states in the progression logic, each of which may only advance one or  
two states in each call to MPI_TEST).


That being said, if you just want to send a quick "notify" that an  
event has occurred, you might want to use a specific tag and/or  
communicator for these extraordinary messages.  Then, when the event  
occurs, send a very short message on this special tag/communicator  
(potentially even a 0-byte message).  Open MPI will send short  
messages eagerly and not require multiple states through a progression  
machine (heck, just about all MPIs do this).  You can MPI_TEST for  
the completion of this short/0-byte receive very quickly.  You can  
then send the actual data of the event in a different non-blocking  
receive that is only checked if the short "alert" message is received.


There are a small number of cases (e.g., resource exhaustion) where  
Open MPI will have to fall back out of the eager send mode for short  
messages, but in general, sending a short message with an alert and a  
larger message with the actual data to be processed might be a good  
choice.
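
A sketch of that alert-plus-payload pattern in C, using a persistent
request so the zero-byte alert receive can be cheaply re-armed (the
tags, payload type, and function name are illustrative assumptions,
not from the post):

#include <mpi.h>

#define ALERT_TAG 100   /* dedicated tag for the 0-byte "event happened" message */
#define DATA_TAG  101   /* tag for the actual event payload */

void event_loop(int peer, MPI_Comm comm)
{
    MPI_Request alert_req;
    double payload[16];   /* illustrative payload */
    int flag;

    /* Persistent zero-byte receive for the alert: set up once, restart cheaply. */
    MPI_Recv_init(NULL, 0, MPI_BYTE, peer, ALERT_TAG, comm, &alert_req);
    MPI_Start(&alert_req);

    for (;;) {
        /* ... normal work ... */

        MPI_Test(&alert_req, &flag, MPI_STATUS_IGNORE);
        if (flag) {
            /* The short alert arrived eagerly; now fetch the real data. */
            MPI_Recv(payload, 16, MPI_DOUBLE, peer, DATA_TAG, comm,
                     MPI_STATUS_IGNORE);
            /* ... process payload ... */
            MPI_Start(&alert_req);   /* re-arm the persistent alert receive */
        }
    }
    /* MPI_Request_free(&alert_req) when shutting down. */
}

The sender pairs the two messages: a send of the payload on DATA_TAG
plus a zero-byte send on ALERT_TAG.  The receiver only pays for the
payload receive when an event has actually occurred.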



On Oct 1, 2009, at 10:43 PM, Peter Lonjers wrote:

I am not sure if this is the right place to ask this question, but here
it goes.

Simplified, abstract version of the question:
I have 2 MPI processes and I want one to send an occasional signal to
the other process.  These signals will not happen at predictable times.
I want the other process, sitting in some kind of work loop, to be able
to make a very fast check to see if a signal has been sent to it.

What is the best way to do this?

Actual problem:
I am working on a realistic neural net simulator.  The neurons are split
into groups, with one group per processor to simulate them.
Occasionally a neuron will spike and has to send that message to
neurons on a different processor.  This is a relatively rare event.  The
receiving neurons need to be able to make a very fast check to see if
there is a message from neurons on another processor.

The way I am doing it now is to use simple send and receive commands.
The receiving cell does an Iprobe check on every loop through the
simulation, for every cell that connects to it, to see if there is a
message (spike) from that cell.  If the Iprobe says there is a message,
it does a receive on that message.

This seems convoluted, though.  I do not actually need to receive the
message, just know that a message is there.  And it seems like,
depending on how Iprobe works, there might be a faster method.

Is Iprobe fast if there is no message to receive?
Would persistent connections work better?

Anyway, any help would be greatly appreciated.





--
Jeff Squyres
jsquy...@cisco.com