[OMPI users] mca_btl_tcp_frag_send: writev failed with errno=110

2006-06-17 Thread Tony Ladd
I am getting the following error with openmpi-1.1b1:

mca_btl_tcp_frag_send: writev failed with errno=110

(On Linux, errno 110 is ETIMEDOUT, i.e. the connection timed out.)

1) This never happens with the other MPIs I have tried, such as MPICH and
LAM.
2) It only seems to happen with large numbers of CPUs (32, and occasionally
16) and with larger message sizes; in this case it was 128K.
3) It only seems to happen with dual CPUs on each node.
4) My configuration is the default, with the following in
openmpi-mca-params.conf:
pls_rsh_agent = rsh
btl = tcp,self
btl_tcp_if_include = eth1
I also set --mca btl_tcp_eager_limit 131072 when running the program, though
leaving this out does not eliminate the problem.

My program is a communication test; it sends bidirectional point-to-point
messages among N CPUs. In one test it exchanges messages between pairs of
CPUs; in another, each rank receives from the node on its left and sends to
the node on its right (a kind of ring); and in a third it uses MPI_Allreduce.
A sketch of the pairwise test is below.

Finally: the TCP driver in Open MPI does not seem nearly as good as the one
in LAM; I got higher throughput, with far fewer dropouts, using LAM.

Tony


---
Tony Ladd
Professor, Chemical Engineering
University of Florida
PO Box 116005
Gainesville, FL 32611-6005

Tel: 352-392-6509
FAX: 352-392-9513
Email: tl...@che.ufl.edu
Web: http://ladd.che.ufl.edu 




Re: [OMPI users] pls:rsh: execv failed with errno=2

2006-06-17 Thread Eric Thibodeau
Hello Jeff,

Firstly, don't worry about jumping in late, I'll send you a skid rope 
;) Secondly, thanks for your nice little articles on clustermonkey.net (a good 
refresher on MPI). And finally, down to my issues: thanks for clearing up the 
--prefix/LD_LIBRARY_PATH behavior. The ebuild I made/mangled for Open MPI under 
Gentoo was modified by some of the devs to follow the lib vs. lib64 
requirements. I might change them to be identical (only $PREFIX/lib) across 
platforms, since multi-arch MPI will be hell to get working with a changing 
LD_LIBRARY_PATH.

After some recommendations, I tried openmpi-1.1b3r10389 on the AMD64 arch and 
got my MPI app running on that single dual-Opteron node. I still have to figure 
out the --prefix/PATH/LD_LIBRARY_PATH mess to get the app to spawn across that 
dual-Opteron node and two single-Athlon nodes (cross-arch, with the varying 
LD_LIBRARY_PATH), but that's another issue for the moment (a bit of fiddling on 
my side to get orte recognized on the nodes).

As for sparc-sun-solaris2.8, I tried compiling openmpi-1.1b3r10389, but it 
bombs with both gcc and the Sun cc:

Making all in asm
source='asm.c' object='asm.lo' libtool=yes \
DEPDIR=.deps depmode=none /bin/bash ../.././config/depcomp \
/bin/bash ../../libtool --tag=CC --mode=compile 
/export/lca/appl/Forte/SUNWspro/WS6U2/bin/cc -DHAVE_CONFIG_H  -I. -I. 
-I../../opal/include -I../../orte/include -I../../ompi/include 
-I../../ompi/include   -I../..   -O -DNDEBUG  -mt -c -o asm.lo asm.c
 /export/lca/appl/Forte/SUNWspro/WS6U2/bin/cc -DHAVE_CONFIG_H -I. -I. 
-I../../opal/include -I../../orte/include -I../../ompi/include 
-I../../ompi/include -I../.. -O -DNDEBUG -mt -c asm.c  -KPIC -DPIC -o 
.libs/asm.o
"../../opal/include/opal/sys/atomic.h", line 486: #error: Atomic arithmetic on 
pointers not supported
cc: acomp failed for asm.c
*** Error code 1

I was told by one of the system admins that the Sun Enterprise machine (12 
procs) has "special" considerations when using semaphores (they're implemented 
in hardware, O_o!). I'm only mentioning this because of the error message 
(Atomic arithmetic ...).
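For what it's worth, that #error fires when the build cannot find platform
support for pointer-sized atomic operations. Purely as an illustration of what
"atomic arithmetic on pointers" refers to, here is a sketch using C11 atomics,
which postdate this thread and are not what Open MPI's atomic.h actually uses:

/* Illustration only: "atomic arithmetic on pointers" means atomically
 * adjusting a pointer-sized value, e.g. bumping a queue head. */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    char buf[16];
    atomic_uintptr_t p;
    atomic_init(&p, (uintptr_t)buf);

    /* Atomically advance the stored address by 8 bytes; this is the kind
     * of operation the header needs the platform to provide. */
    atomic_fetch_add(&p, (uintptr_t)8);

    printf("offset: %td\n", (char *)atomic_load(&p) - buf);
    return 0;
}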

So, I got half my problem resolved with the upgrade; any suggestions for 
compiling Open MPI on this _old_ but very educational SMP machine?

Eric

On Friday, June 16, 2006 at 17:32, Jeff Squyres (jsquyres) wrote:
> Sorry for jumping in late...
> 
> The /lib vs. /lib64 thing as part of --prefix was definitely broken until 
> recently.  This behavior has been fixed in the 1.1 series.  Specifically, 
> OMPI will take the prefix that you provided and append the basename of the 
> local $libdir.  So if you configured OMPI with something like:
> 
>  shell$ ./configure --libdir=/some/path/lib64 ...
> 
> And then you run:
> 
>  shell$ mpirun --prefix /some/path ...
> 
> Then OMPI will add /some/path/lib64 to the remote LD_LIBRARY_PATH.  The 
> previous behavior would always add "/lib" to the remote LD_LIBRARY_PATH, 
> regardless of what the local $libdir was (i.e., it ignored the basename of 
> your $libdir).  
> 
> If you have a situation more complicated than this (e.g., your $libdir is 
> different than your prefix by more than just the basename), then --prefix is 
> not the solution for you.  Instead, you'll need to set your $PATH and 
> $LD_LIBRARY_PATH properly on all nodes (e.g., in your shell startup files). 
> Specifically, --prefix is meant to be an easy workaround for common 
> configurations where $libdir is a subdirectory under $prefix.
> 
> Another random note: invoking mpirun with an absolute path (e.g., 
> /path/to/bin/mpirun) is exactly the same as specifying --prefix /path/to -- 
> so you don't have to do both.
> 
> 
[..SNIP..]
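To make the $libdir basename logic Jeff describes concrete, here is an
illustrative sketch, not Open MPI's actual internals; the remote_libdir
helper is made up for this example:

/* Sketch of the 1.1-series --prefix behavior: append the basename of the
 * locally configured $libdir to the user-supplied prefix to build the
 * remote LD_LIBRARY_PATH entry. */
#include <libgen.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not an Open MPI function. */
static void remote_libdir(const char *prefix, const char *local_libdir,
                          char *out, size_t outlen)
{
    char tmp[1024];
    /* basename() may modify its argument, so work on a copy. */
    strncpy(tmp, local_libdir, sizeof(tmp) - 1);
    tmp[sizeof(tmp) - 1] = '\0';
    snprintf(out, outlen, "%s/%s", prefix, basename(tmp));
}

int main(void)
{
    char path[1024];
    /* Configured --libdir=/some/path/lib64, run with --prefix /some/path. */
    remote_libdir("/some/path", "/some/path/lib64", path, sizeof(path));
    printf("remote LD_LIBRARY_PATH entry: %s\n", path);
    return 0;
}

Run on the example from the quoted mail, this prints /some/path/lib64, which
is exactly the directory Jeff says gets added to the remote LD_LIBRARY_PATH.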